How LLMs Decide What to Cite: The Retrieval Mechanics Behind AI Search
LLMs with search capabilities (ChatGPT, Perplexity, Gemini, Grok, Claude) use a retrieval-augmented generation (RAG) pipeline that follows a predictable sequence: decompose the user's query into sub-queries, run each sub-query against a search index using both keyword and semantic matching, rerank the results with cross-encoder models, extract the top-scoring passages, and synthesize an answer with inline citations. The retrieval set for each sub-query is roughly 5 to 10 documents. If your content isn't in that set, the LLM never sees it, and there is no equivalent of page 2.
This isn't a black box. The pipeline has well-documented stages, each with specific mechanics that determine whether your content gets retrieved, scored, extracted, and ultimately cited. Understanding these mechanics is the difference between optimizing for something real and guessing.
The RAG pipeline is not optional
Every AI search citation depends on the RAG pipeline firing. An analysis by The Digital Bloom found that roughly 60% of ChatGPT queries are answered from parametric knowledge alone, meaning the model responds from its training data without searching the web at all. The remaining 40%, along with nearly all queries involving current information, product comparisons, or specific claims, trigger the retrieval pipeline.
When retrieval kicks in, the LLM stops relying on what it "knows" and starts relying on what it can find. The answer it generates, and the sources it cites, come almost entirely from the documents returned by the retrieval system. Your content either survives this pipeline or it doesn't exist to the model.
This creates a fundamental asymmetry. Content creators spend months crafting articles, building authority, and publishing on their own domains. The LLM spends milliseconds deciding whether any of it matters. The pipeline doesn't care about your effort. It cares about whether your content matches the query, passes the scoring filters, and contains an extractable passage worth citing.
Stage 1: Query decomposition
The first thing the retrieval system does is break the user's query apart. A question like "what's the best way to get my B2B SaaS startup cited by AI search engines" doesn't get sent to the search index as a single string. The system decomposes it into multiple sub-queries, each targeting a different facet of the question.
For that example, the decomposition might look something like:
- "B2B SaaS AI search optimization"
- "how to get cited by AI search engines"
- "startup AEO strategy"
- "best practices AI citation B2B"
Each sub-query gets sent to the search index independently. The results are then merged, deduplicated, and ranked together. This process, sometimes called query fan-out, is why a single user question can pull content from wildly different source domains. The sub-queries cast a wider net than the original question would on its own.
The practical implication is important: your content doesn't need to match the exact query a user types. It needs to match one of the sub-queries the system generates from that query. A page titled "AEO for B2B SaaS" might never appear for the original question in traditional search, but if the decomposition generates "B2B SaaS AI search optimization" as a sub-query, that page enters the candidate pool.
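The fan-out step can be sketched in a few lines. Everything here is an illustrative stand-in: real engines generate sub-queries with an LLM and run them against a live index, and their exact merge logic isn't public.

```python
def fan_out(sub_queries, search):
    """Run each sub-query against the index independently, then merge
    the results and dedupe by URL, keeping each URL's best rank."""
    merged = {}
    for sq in sub_queries:
        for rank, url in enumerate(search(sq)):
            if url not in merged or rank < merged[url]:
                merged[url] = rank
    # Final candidate pool, ordered by the best rank each URL achieved.
    return sorted(merged, key=merged.get)

# Toy index: each sub-query returns its own ranked URL list.
results = {
    "B2B SaaS AI search optimization": ["a.com/aeo", "b.com/guide"],
    "startup AEO strategy":            ["c.com/aeo", "a.com/aeo"],
}
pool = fan_out(list(results), results.get)
# a.com/aeo surfaces for two different sub-queries and keeps its best rank.
```

Note that "a.com/aeo" enters the pool even though it only ranks first for one of the two sub-queries, which is the mechanism behind a page matching a generated sub-query rather than the user's literal question.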
This is also why AEO and traditional SEO diverge so sharply in practice. SEO optimizes for the literal query. AEO optimizes for the sub-queries the LLM will generate, which requires understanding how decomposition works for your topic space.
Stage 2: Hybrid retrieval
Once the sub-queries are generated, each one hits the search index. As of February 2026, every major AI search engine uses some form of hybrid retrieval, combining two fundamentally different search methods.
Keyword-based retrieval (BM25) is the traditional method. It scores documents based on term frequency and inverse document frequency. If your page contains the exact words in the sub-query, BM25 finds it. This is fast, well-understood, and effective for queries with specific terms ("Perplexity AI citation rate" or "ChatGPT source selection").
Semantic retrieval (embedding search) converts both the query and document passages into high-dimensional vectors, then measures how closely the meanings align. This catches content that's relevant but doesn't use the exact query terms. A page about "how answer engines pick their sources" would score well on a semantic search for "LLM citation mechanics" even though the words don't match.
Hybrid retrieval runs both methods in parallel and merges the results. This matters because neither method alone is sufficient. BM25 misses semantically relevant content that uses different terminology. Semantic search misses content where the exact terminology is the point (specific product names, pricing figures, technical specifications). The combination produces a candidate set that's both precise and comprehensive.
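One widely used way to merge the two ranked lists is reciprocal rank fusion (RRF). Whether any particular engine uses RRF specifically isn't public, so treat this as an illustration of the merging step, not a documented implementation:

```python
def rrf_merge(lexical, semantic, k=60):
    """Reciprocal rank fusion: score(doc) = sum over lists of 1/(k + rank).
    Documents found by BOTH methods accumulate score from both lists."""
    scores = {}
    for ranked_list in (lexical, semantic):
        for rank, doc in enumerate(ranked_list, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 found the exact-term match first; embeddings found a paraphrase first.
lexical  = ["exact-terms.html", "overlap.html"]
semantic = ["paraphrase.html", "overlap.html"]
merged = rrf_merge(lexical, semantic)
# overlap.html rises to the top: it appears in both lists, so its scores add.
```

The design point the example shows: a document that survives both the lexical and the semantic pass beats a document that tops only one of them.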
Perplexity has published the most transparent account of their retrieval architecture. Their search index tracks over 200 billion unique URLs and processes roughly 200 million queries per day. Their hybrid retrieval merges lexical and semantic results into a single candidate set before applying any reranking. Crucially, Perplexity retrieves and scores at both the document level and the sub-document level, surfacing what they describe as "the most atomic units possible" to the model.
ChatGPT's retrieval layer works differently in the details but follows the same hybrid pattern. It pulls from Bing's search index plus its own crawled data (via OAI-SearchBot), runs hybrid matching, and passes the results through a reranking stage before the model ever sees them. The typical retrieval set is 3 to 8 source documents per query, compared to the 10 organic results you'd see in traditional Google search.
Stage 3: Reranking
The hybrid retrieval stage returns hundreds or thousands of candidate passages. The reranking stage narrows that to the handful the model will actually consider.
Reranking uses cross-encoder models, which are different from the embedding models used in semantic retrieval. Where embedding models encode the query and document separately and compare vectors, cross-encoders process the query and a candidate passage together as a single input. This joint processing is more computationally expensive but significantly more accurate at judging relevance.
The cross-encoder evaluates each candidate passage against the original query (not just the sub-query that retrieved it) and produces a relevance score. The top-scoring passages, typically 5 to 10 per sub-query, advance to the synthesis stage. Everything else is discarded.
This stage is where a lot of content dies silently. A page might survive hybrid retrieval because it contains the right keywords and discusses the right topic. But if the reranker determines that the specific passage doesn't directly address the query with sufficient precision, it drops out. The reranker is looking for passages that could serve as a citation: self-contained, specific, and directly responsive to what the user asked.
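The shape of the reranking step looks like this, with the cross-encoder replaced by a toy term-overlap scorer. A real system would score each query-passage pair jointly with a transformer model; the scorer here is only a placeholder for that call:

```python
def rerank(query, passages, score, top_n=5):
    """Cross-encoder-style reranking: score each (query, passage) PAIR
    jointly, keep only the top-scoring passages, discard the rest."""
    scored = [(score(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for s, p in scored[:top_n]]

# Toy stand-in scorer: fraction of query terms the passage contains.
def toy_score(query, passage):
    q_terms = set(query.lower().split())
    return len(q_terms & set(passage.lower().split())) / len(q_terms)

passages = [
    "LLM citation mechanics depend on the retrieval set.",
    "Our company was founded in 2019.",
]
top = rerank("LLM citation mechanics", passages, toy_score, top_n=1)
# Only the directly responsive passage survives; the other is discarded.
```

The key contrast with the embedding stage is in the signature: the scorer receives the query and the passage together, which is what makes cross-encoders slower but more accurate than comparing two independently computed vectors.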
Research from AirOps' 2026 State of AI Search report found that 68.7% of pages cited by ChatGPT follow logical heading hierarchies, and nearly 80% include structured lists. This isn't because LLMs "prefer" lists. It's because well-structured content produces cleaner passage boundaries, which makes it easier for the reranker to identify and score individual passages. A wall of unstructured prose might contain the perfect answer, but the reranker has a harder time extracting it.
Stage 4: Passage extraction and the retrieval set
After reranking, the model has its retrieval set: the small collection of passages that will inform the generated answer. This set is typically 5 to 15 passages drawn from 3 to 10 unique source documents, depending on the engine and query complexity.
The retrieval set is the single most important concept in understanding LLM citations. Everything upstream in the pipeline exists to produce this set. Everything downstream, the synthesis, the citations, the formatted answer the user sees, is constrained by it.
If your content is in the retrieval set, you have a chance at being cited. If it isn't, you don't. There is no partial credit. There is no "we ranked 11th." The model simply does not have access to content outside the retrieval set when generating its answer. This is the mechanism behind the often-repeated claim that "there's no page 2 in AI search," and it's literally true, not a metaphor.
The passage extraction itself is granular. The model doesn't ingest whole pages. It works with specific chunks, typically 50 to 150 words, that the retrieval system has identified as relevant. Research from Wellows found that sources with clear, self-contained chunks of 50 to 150 words receive 2.3 times more citations than long-form unstructured content. The chunk boundaries matter: a passage that starts mid-thought or references "as mentioned above" doesn't extract cleanly and gets scored lower.
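The 50-to-150-word heuristic can be sketched as a simple chunk filter. The word-count band comes from the research above; the blank-line splitting and the back-reference pattern are illustrative assumptions, not any engine's actual chunker:

```python
import re

# Openers that only make sense with surrounding context.
DEICTIC_OPENERS = re.compile(
    r"^(as (mentioned|noted|discussed) above)\b", re.IGNORECASE)

def extractable_chunks(text, min_words=50, max_words=150):
    """Split on blank lines and keep only chunks that (a) fall inside the
    50-150 word band and (b) don't open with a back-reference that strips
    cleanly only when the surrounding page is present."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        words = para.split()
        if min_words <= len(words) <= max_words and not DEICTIC_OPENERS.match(para):
            chunks.append(para)
    return chunks
```

Running a draft through a filter like this is a quick way to see how much of a page survives as self-contained, citable passages versus how much is too short, too long, or dependent on context above it.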
This explains a counterintuitive finding from the same research: 28.3% of ChatGPT's most-cited pages have zero organic visibility in traditional Google search. They don't rank for anything in regular search. But they contain passages that are so precisely relevant to specific sub-queries that the retrieval system pulls them in anyway. The retrieval set doesn't care about your Google ranking. It cares about passage-level relevance to decomposed sub-queries.
Where citations come from within your content
The position of information within a page has a measurable effect on citation probability. An analysis of 17 million AI citations found that 44.2% of all LLM citations come from the first 30% of text on a page, 31.1% from the middle section, and 24.7% from the final third.
This top-heavy distribution exists because retrieval systems process content sequentially and assign higher confidence to information that appears earlier. An answer placed in the first paragraph after a heading is more likely to be extracted than the same answer buried in paragraph eight. The model treats early placement as a weak signal that the information is the primary point of the page rather than a tangential mention.
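As a rough illustration, the observed shares can be treated as a position prior. The 44.2/31.1/24.7 split is the descriptive statistic from the analysis above, used here as an illustrative weight, not a known ranking parameter:

```python
def position_prior(offset, page_length):
    """Map a passage's starting character offset to the observed citation
    share for that third of the page (44.2% / 31.1% / 24.7%)."""
    thirds = [0.442, 0.311, 0.247]
    index = min(int(3 * offset / page_length), 2)
    return thirds[index]
```

Under this prior, the same passage placed at the top of a 9,000-character page carries nearly twice the weight it would carry in the final third, which is the "answer capsule" argument in numeric form.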
For content creators, this means the answer capsule pattern isn't just good writing advice. It's an engineering optimization. Putting your most citable claim, the specific, fact-dense sentence you want AI engines to extract, at the very top of the relevant section gives it the highest probability of surviving every stage of the pipeline.
The source bias inheritance problem
Every AI search engine inherits the ranking biases of the conventional search index it retrieves from. The retrieval system doesn't build its index from scratch. It sits on top of existing search infrastructure, and the biases of that infrastructure constrain what enters the candidate pool before semantic matching even begins.
ChatGPT pulls from Bing's index. Gemini pulls from Google's index. These underlying search engines have their own ranking algorithms, their own authority models, their own biases. When ChatGPT's retrieval system searches for candidate passages, it's searching within results that have already been filtered and ranked by Bing's algorithms. The LLM's citation preferences are downstream of, and constrained by, the traditional search engine's preferences.
This is why ChatGPT behaves more like a traditional search engine than any other AI search platform. It inherits Bing's domain authority signals, which means it disproportionately surfaces high-authority domains: Wikipedia, Forbes, Business Insider, Reddit. A Semrush study of 150,000 citations from June 2025 found that Reddit alone appeared in 40.1% of LLM citations, Wikipedia in 26.3%, and Google/YouTube properties in 23%.
For any content creator outside those mega-domains, this creates a structural barrier. Your content isn't just competing against other content on the same topic. It's competing against the authority signals that the underlying search engine assigns to major platforms. Even if your passage is more relevant and more specific, a vaguer passage from a higher-authority domain can outrank it because the retrieval system's candidate pool was already tilted before semantic matching even started.
Gemini's relationship with Google Search creates a parallel dynamic but with different biases. Gemini weights recency signals more aggressively than any other engine, and a Yext study of 6.8 million citations found that it pulls 52.15% of its citations from brand-owned websites, while ChatGPT leans the opposite way, drawing 48.73% of its citations from third-party sources. Each engine inherits different biases because each sits on a different search infrastructure.
The original source advantage
The single most impactful factor for surviving the full retrieval pipeline is being the original source of a specific, citable fact or data point. A statistic, a named framework, a proprietary benchmark, a unique research finding. When the LLM needs to support a claim, it looks for something concrete to anchor the citation to. If your page is where that data originates, you become very difficult to route around.
Kevin Indig's State of AI Search Optimization 2026 report, analyzing 1.2 million ChatGPT responses, found that citation winners are nearly twice as likely (36.2% vs. 20.2%) to contain definitive language like "is defined as" or "refers to." Pages with original data, surveys, benchmarks, or proprietary datasets consistently outperform pages that aggregate or summarize other sources.
This makes mechanical sense within the pipeline. When the reranker evaluates two passages that both address the same sub-query, the passage from the original source has a natural advantage: it contains the primary claim in its original form, without the paraphrasing or context-stripping that happens when secondary sources rewrite it. The original passage is typically more specific, more self-contained, and more directly citable.
This is also why aggregators and content rewriters face a structural ceiling in AI search. They can match your keywords, your structure, even your heading patterns. But they can't be the original source of your data. In a retrieval system designed to find the most authoritative passage for each claim, the original source holds an advantage that's difficult to replicate.
The freshness signal
AI retrieval systems weight content freshness more heavily than most content creators realize. Research published in Kevin Indig's Growth Memo found that content less than 3 months old is 3 times more likely to get cited by LLMs, and pages going 3 or more months without an update are 3 times more likely to lose visibility.
This isn't surprising given how the pipeline works. The retrieval system can read temporal markers in content ("As of February 2026," "Updated for Q1 2026") and uses them as one input to its relevance scoring. Content without temporal signals gradually loses its scoring position because the system can't verify whether the information is still accurate. Perplexity's system, in particular, heavily biases retrieval toward content with recent "Last Modified" dates, meaning a competitor's article from last week can beat your article from 2023 even if your domain authority is higher.
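A hedged sketch of how a freshness signal might be computed: extract the latest "As of Month Year" marker and apply an exponential decay. The 90-day half-life mirrors the 3-month pattern in the research, but the decay formula itself is an assumption, not any engine's documented scoring:

```python
import math
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def last_temporal_marker(text):
    """Find 'As of <Month> <Year>' markers and return the latest as a date."""
    found = [date(int(y), MONTHS[m], 1)
             for m, y in re.findall(r"[Aa]s of (\w+) (\d{4})", text)
             if m in MONTHS]
    return max(found, default=None)

def freshness_weight(marker, today, half_life_days=90):
    """Illustrative decay: weight halves every ~3 months without an update."""
    if marker is None:
        return 0.5  # no temporal signal: the system can't verify recency
    age = (today - marker).days
    return math.exp(-math.log(2) * max(age, 0) / half_life_days)
```

The `None` branch is the part worth noticing: a page with no temporal marker at all doesn't get the benefit of the doubt, which is the mechanical version of "content without temporal signals gradually loses its scoring position."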
The operational implication is clear: content maintenance isn't optional. Unlike traditional SEO, where a well-ranked page can hold its position for months or even years without updates, AI citation requires ongoing freshness signals. The specific refresh cadences vary by engine, but most platforms re-crawl and re-index on roughly a 48-hour cycle.
How the engines differ in practice
The same RAG architecture produces very different citation behavior depending on which engine runs it. As of February 2026, each engine's retrieval system has distinct biases, thresholds, and source preferences that stem from their underlying infrastructure and design choices.
ChatGPT typically works from a retrieval set of 3 to 8 source documents per query, has the strongest domain authority bias, and inherits Bing's preference for high-authority third-party sources. It cites Wikipedia, Reddit, and major publications at rates that dwarf citations of independent domains. For startups and smaller publishers, ChatGPT is the hardest engine to crack.
Perplexity has the lowest authority threshold and will readily cite smaller, niche sites if the passage relevance is high enough. It processes 200 million queries daily across a 200-billion-URL index. It's also the most volatile: the same query run twice can produce different citation sets, likely because its retrieval system introduces some stochastic variation in the reranking stage.
Gemini weights recency more aggressively than any other engine, favors brand-owned content (52% of citations), and benefits from Google's search quality signals. Its citation volume per answer (roughly 20 sources) is second only to Grok's.
Grok cites more sources per answer than any other engine (roughly 24), with balanced platform coverage across YouTube, Reddit, and Medium. This high citation volume means more slots available for any given piece of content.
Claude applies the strictest quality filter. It almost exclusively cites individual company websites and blogs, with near-zero citations from aggregator platforms like Reddit, YouTube, or Medium. This makes Claude the one engine where your own domain content has the strongest advantage, provided it meets a high bar for depth and expertise.
These differences aren't quirks. They're architectural consequences of each engine's retrieval infrastructure, reranking models, and design philosophy. A single optimization strategy cannot account for them, which is why multi-engine monitoring is table stakes for anyone serious about AI search visibility.
What this means for anyone trying to get cited
The retrieval pipeline is mechanical. It follows rules, even if those rules are complex and engine-specific. Understanding the mechanics points to a specific set of actions:
Get into the retrieval set. Nothing else matters if you don't survive hybrid retrieval. This means your content needs to contain the terminology and semantic meaning that matches the sub-queries an LLM will generate for your target topics. It also means your domain needs enough authority (or your content needs enough specificity on a narrow topic) to survive the candidate filtering stage.
Make your passages extractable. Self-contained chunks of 50 to 150 words, organized under descriptive headings, with specific claims that make sense without surrounding context. The retrieval system extracts passages, not pages. If your best content can't be cleanly extracted, it can't be cited.
Lead with the answer. 44.2% of citations come from the first third of a page. Put your most citable content at the top. Not your introduction, not your context-setting, not your credentials. The answer itself.
Be the original source. Publish data, research, benchmarks, and frameworks that can't be found elsewhere. Original sources earn citations at nearly twice the rate of secondary sources.
Stay fresh. Update content regularly with temporal markers. Content under 3 months old is 3 times more likely to be cited. Set a cadence and maintain it.
Optimize per-engine, not generically. Each engine has different thresholds, biases, and source preferences. Content that works on Perplexity (low authority threshold, high relevance weight) may be invisible to ChatGPT (high authority threshold, strong third-party preference). Per-engine diagnosis is the only way to know what's failing where.
The FogTrail AEO platform ($499/month) automates this by querying all five engines simultaneously, extracting per-engine narrative intelligence, and generating content through a 6-stage intelligence cycle designed to survive each stage of the retrieval process. But the mechanics described here apply regardless of tooling. The pipeline is the pipeline. Understanding it is the first step to engineering content that survives it.
Frequently Asked Questions
What is the LLM retrieval set?
The retrieval set is the small collection of documents (typically 5 to 10 per sub-query) that an LLM's retrieval system selects as candidates for citation. The model can only cite sources that appear in its retrieval set. Content outside this set is invisible to the model during answer generation, regardless of its quality or relevance. There is no equivalent of "page 2" in AI search.
How does query decomposition affect what gets cited?
LLMs break complex user queries into multiple sub-queries, each targeting a different facet of the question. Each sub-query runs against the search index independently. This means your content doesn't need to match the exact user query. It needs to match one of the sub-queries the system generates, which broadens the surface area for citation but requires understanding what sub-queries your target topics produce.
Why do different AI engines cite different sources for the same query?
Each engine sits on different search infrastructure (ChatGPT uses Bing, Gemini uses Google), applies different reranking models, and weights signals like authority, recency, and relevance differently. ChatGPT inherits Bing's domain authority bias and favors high-authority third-party sources. Perplexity has the lowest authority threshold. Claude ignores aggregator platforms entirely. These are architectural differences, not random variation.
How quickly do AI search engines refresh their retrieval index?
Most AI search engines refresh their indexed knowledge bases approximately every 48 hours, though exact timing varies by engine and content type. Content freshness is a scoring signal in the retrieval pipeline: pages updated within the last 3 months are roughly 3 times more likely to be cited than older content. Perplexity specifically biases retrieval toward content with recent "Last Modified" dates.
Can low-authority domains get cited by LLMs?
Yes. Research shows that 28.3% of ChatGPT's most-cited pages have zero organic visibility in traditional Google search. Low-authority domains can earn citations by owning narrow topics that high-authority domains haven't covered, being the original source of specific data points or frameworks, and structuring content with extractable passages that precisely match decomposed sub-queries. Perplexity and Grok are the most accessible engines for smaller publishers.