FogTrail Team

We Ran the Same 20 Queries 3 Times Across 5 AI Engines. Here's How Much the Results Changed Each Week.

AI search engine recommendations are not stable. We sent the same 20 B2B software queries to ChatGPT, Perplexity, Gemini, Grok, and Claude once per week for three consecutive weeks, and the results shifted every time. ChatGPT's brand citation count swung from 23 to 12 to 14. Cross-engine consensus on the top recommendation oscillated from 50% to 55% back to 50%. One midmarket brand, ActiveCampaign, went from being cited with direct links on ChatGPT to completely invisible on that engine in a single week.

If you are monitoring your AI search visibility with a single snapshot, you are measuring noise. The signal only emerges across multiple observations.

Why AI Search Results Change Between Identical Queries

Large language models use temperature-based sampling when generating responses. This means the same prompt can produce different outputs on different runs, even when the underlying model and retrieval index have not changed. Every recommendation an AI engine makes is, to some degree, a dice roll. The retrieval step (which sources get pulled in) adds another layer of variability: different documents may be surfaced on different runs depending on index freshness and ranking thresholds.
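Temperature sampling is easy to demonstrate in isolation. The sketch below uses made-up logits for three hypothetical brand tokens; it is not any engine's actual decoding code, just a minimal illustration of how a fixed model state can still return a different "recommendation" on each run:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from a softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Hypothetical logits for three candidate brand tokens
brands = ["BrandA", "BrandB", "BrandC"]
logits = [2.0, 1.6, 0.5]

rng = random.Random(42)
runs = [brands[sample_with_temperature(logits, 1.0, rng)] for _ in range(10)]
# Identical "prompt", identical "model": the sampled recommendation still
# varies run to run. At temperature near zero, sampling collapses to the
# single highest-logit brand, which is why low-temperature output is stabler.
print(runs)
```

At temperature 1.0 the second-ranked brand wins a large minority of runs; dialing temperature toward zero makes the argmax brand win essentially every time, which is the stability/diversity trade-off engines are tuning.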

This is fundamentally different from traditional search engines, where the same Google query returns essentially the same results for days or weeks at a time. AI search engines look deterministic to the end user because each query produces a confident, well-structured answer. But run that query again tomorrow, and the answer may cite different brands, recommend a different product first, or drop a brand entirely.

Our three-wave study, covering 300 engine-query observations (20 queries × 5 engines × 3 waves) and tracking the same 25 B2B SaaS brands throughout, quantifies exactly how much this matters.

The Data: Three Waves, Same Queries, Different Results

Brand Citation Counts Per Engine Across 3 Waves

This table shows how many tracked brands each engine linked to with a direct URL. These are not mentions (brand name appearing in text); they are citations (brand website URL included in the response).

| Engine | Wave 1 (Mar 6) | Wave 2 (Mar 10) | Wave 3 (Mar 15) | Change W1→W2 | Change W2→W3 |
|---|---|---|---|---|---|
| ChatGPT | 23 | 12 | 14 | -48% | +17% |
| Perplexity | 7 | 5 | 4 | -29% | -20% |
| Gemini | 7 | 6 | 5 | -14% | -17% |
| Grok | 2 | 7 | 7 | +250% | 0% |
| Claude | 6 | 6 | 6 | 0% | 0% |

ChatGPT is the most volatile. Its citation count nearly halved in one week, then partially recovered the next. Claude is the most deterministic engine in the dataset: exactly 6 brand citations in all three waves. Grok jumped from 2 to 7 between Wave 1 and Wave 2, then held steady.
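The mention-versus-citation distinction used throughout this study can be operationalized as a small classifier. This is an illustrative sketch, not FogTrail's actual pipeline; the response strings and the matching rule (substring for the domain, case-insensitive regex for the name) are assumptions:

```python
import re

def classify_brand_presence(response_text, brand_name, brand_domain):
    """Classify one engine response as 'citation', 'mention', or 'absent'.

    A citation means the brand's URL appears in the response; a mention
    means only the brand name appears in the text.
    """
    has_url = brand_domain.lower() in response_text.lower()
    has_name = re.search(re.escape(brand_name), response_text, re.IGNORECASE)
    if has_url:
        return "citation"
    if has_name:
        return "mention"
    return "absent"

# Hypothetical engine responses
r1 = "We recommend ActiveCampaign (https://www.activecampaign.com) for startups."
r2 = "ActiveCampaign is another popular option for email automation."
r3 = "Mailchimp and ConvertKit lead this category."

print(classify_brand_presence(r1, "ActiveCampaign", "activecampaign.com"))  # citation
print(classify_brand_presence(r2, "ActiveCampaign", "activecampaign.com"))  # mention
print(classify_brand_presence(r3, "ActiveCampaign", "activecampaign.com"))  # absent
```

Counting "citation" results per engine per wave is what produces a table like the one above.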

Cross-Engine Consensus Over Time

Consensus here means 4 or more of 5 engines agreeing on which brand should be the #1 recommendation for a given query.

| Metric | Wave 1 | Wave 2 | Wave 3 | Pattern |
|---|---|---|---|---|
| Strong consensus (4+/5 agree) | 50% | 55% | 50% | Oscillating |
| Unanimous (5/5 agree) | 20% | 30% | 30% | Partial improvement |
| Pairwise overlap floor | 58% | 63% | 58% | Oscillating |

The strong consensus rate went up 5 points in Wave 2, then dropped back down to its Wave 1 level. The pairwise overlap floor (the lowest agreement between any two engines) followed the same pattern: 58%, 63%, 58%. The engines are not converging on shared recommendations. They are oscillating.
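The strong-consensus metric is straightforward to compute from per-query top picks. The sketch below shows one plausible implementation; the query data is invented for illustration and is not the study's raw results:

```python
from collections import Counter

def strong_consensus_rate(top_picks_by_query, threshold=4):
    """Fraction of queries where >= `threshold` engines name the same #1 brand.

    `top_picks_by_query` maps each query to a list of #1 recommendations,
    one entry per engine.
    """
    hits = 0
    for picks in top_picks_by_query.values():
        _, count = Counter(picks).most_common(1)[0]
        if count >= threshold:
            hits += 1
    return hits / len(top_picks_by_query)

# Illustrative wave: 4 queries x 5 engines (hypothetical picks)
wave = {
    "best CRM for startups":       ["HubSpot", "HubSpot", "HubSpot", "HubSpot", "Salesforce"],
    "best email platform":         ["Mailchimp", "ConvertKit", "Mailchimp", "Brevo", "Mailchimp"],
    "best web app deployment":     ["Vercel", "Vercel", "Vercel", "Vercel", "Vercel"],
    "best product analytics tool": ["Mixpanel", "Amplitude", "Mixpanel", "Amplitude", "PostHog"],
}
print(strong_consensus_rate(wave))  # 2 of 4 queries reach 4-of-5 agreement -> 0.5
```

Setting `threshold=5` gives the unanimous rate from the same data.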

Pairwise Engine Agreement: Who Agrees With Whom?

| Engine Pair | Wave 1 | Wave 2 | Wave 3 | Trend |
|---|---|---|---|---|
| ChatGPT + Gemini | 62% | 67% | 58% | Diverging |
| ChatGPT + Grok | 58% | 70% | 71% | Converging |
| Grok + Claude | 62% | 72% | 75% | Converging |
| Perplexity + Gemini | 71% | 72% | 67% | Declining |
| ChatGPT + Claude | 61% | 67% | 62% | Oscillating |

In Wave 3, ChatGPT and Gemini agree on only 58% of brand mentions, matching the lowest pairwise overlap recorded anywhere in the study. Meanwhile, Grok and Claude have quietly converged to 75% agreement. The engines are forming shifting alliances rather than moving toward a shared consensus.
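The study does not publish its overlap formula, but Jaccard similarity over each engine's surfaced brand set is one plausible reading. The brand sets below are invented for illustration:

```python
def pairwise_overlap(brands_a, brands_b):
    """Jaccard overlap between the brand sets two engines surfaced.

    One reasonable way to compute pairwise agreement; the study's exact
    formula is not specified, so treat this as an assumption.
    """
    a, b = set(brands_a), set(brands_b)
    if not (a or b):
        return 1.0  # two empty result sets trivially agree
    return len(a & b) / len(a | b)

# Hypothetical brand sets for one engine pair in one wave
chatgpt = {"HubSpot", "Salesforce", "Mailchimp", "Vercel", "Mixpanel"}
gemini  = {"HubSpot", "Salesforce", "Brevo", "Vercel", "Amplitude", "Netlify"}
print(pairwise_overlap(chatgpt, gemini))  # 3 shared / 8 total -> 0.375
```

The "pairwise overlap floor" in the consensus table is then simply the minimum of this value over all ten engine pairs.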

The ActiveCampaign Disappearance: From Cited to Invisible in One Week

The most dramatic example of nondeterminism in our dataset involves ActiveCampaign, a midmarket email marketing platform with over 185,000 customers.

| Engine | ActiveCampaign in Wave 1 | Wave 2 | Wave 3 |
|---|---|---|---|
| Gemini | Mentioned (4 queries) | Mentioned (4) | Mentioned (4) |
| Claude | Mentioned (3 queries) | Mentioned (3) | Mentioned (3) |
| Perplexity | Mentioned (3 queries) | Mentioned (3) | Mentioned (2) |
| Grok | Mentioned (4 queries) | Mentioned (4) | Mentioned (2) |
| ChatGPT | Mentioned (2 queries) | Cited (2 citations) | 0 mentions |

In Wave 2, ChatGPT actually increased its ActiveCampaign engagement, giving it cited links in 2 responses. One week later, ActiveCampaign disappeared from ChatGPT entirely. Zero mentions across all 4 email marketing queries. The other four engines continued to mention it. Nothing about ActiveCampaign's website, content, or market position changed in those 7 days. This is what nondeterministic citation behavior looks like at the brand level.

Supporting Evidence: Engine-Level Instability

When we asked ChatGPT "best email marketing platform for startups" in Wave 2, it recommended ConvertKit first. In Wave 3, it led with Mailchimp. Same query, same engine, different week, different answer.

When we asked all five engines "best platform for deploying web apps," Vercel held position #1 across every engine in Waves 1 and 2 (14 out of 14 responses, then 15 out of 16). In Wave 3, ChatGPT switched to recommending Netlify first, breaking the strongest structural pattern in the entire dataset. The other four engines continued to lead with Vercel. Whether this sticks or reverts in Wave 4 is anyone's guess, which is exactly the point.

Even the consensus metrics oscillate. CRM, one of the most stable categories in Waves 1 and 2 (3 of 4 queries with strong agreement), dropped to 1 of 4 in Wave 3 as Perplexity and Grok rotated their HubSpot and Salesforce preferences.

What This Means for AEO Strategy

The implications for anyone tracking AI search visibility are uncomfortable but important.

First, single-snapshot monitoring is unreliable. Any brand tracking tool that checks your AI citations once and reports a score is giving you one sample from a distribution, not a measurement. ChatGPT could show you cited today and invisible tomorrow without any change on your end. The monitoring frequency matters as much as the monitoring itself.

Second, week-to-week changes in citation counts are not actionable. If your ChatGPT citations drop by 40% between checks, the correct response is not to panic and overhaul your content strategy. It might be temperature noise. It might revert next week. You need multiple data points before you can distinguish a real trend from stochastic variation. Claude's perfect 6, 6, 6 citation count across three waves shows that some engines are more stable than others, but you cannot know which pattern applies to your brand without longitudinal data.

Third, multi-engine monitoring is non-negotiable. ActiveCampaign's ChatGPT disappearance would look catastrophic in a single-engine dashboard. In a multi-engine view, it is clearly a ChatGPT-specific anomaly: the brand still appears on 4 of 5 engines. Without multi-engine context, you cannot tell whether a drop is universal (a real problem) or engine-specific (a nuisance). As of March 2026, AEO platforms that monitor fewer than 5 engines are giving you an incomplete picture by design.

Fourth, the engines are not converging. Early in our study, Wave 2 data suggested the engines might be aligning over time. Wave 3 disproved that. The 50%, 55%, 50% consensus pattern and the 58%, 63%, 58% pairwise floor show oscillation, not convergence. There is no reason to expect AI search recommendations to stabilize anytime soon.

What You Can Do About It

  • Monitor across multiple waves before acting. A citation drop in one snapshot is not a signal. Three consecutive waves of decline is. Set your monitoring cadence to at least weekly and require 2 or more consistent observations before changing strategy.
  • Track all 5 major AI engines simultaneously. ChatGPT, Perplexity, Gemini, Grok, and Claude each behave differently. A brand can be cited on four engines and invisible on the fifth. Single-engine monitoring hides this.
  • Distinguish between mentions and citations. ActiveCampaign was mentioned 11 times across 4 engines in Wave 3 but cited zero times. Mentions without citations mean the engines know you exist but are not willing to vouch for you with a link.
  • Prioritize the engines that matter most to your audience. Claude is the most deterministic (stable citations, predictable behavior). ChatGPT is the most volatile but also the highest traffic. Your AEO effort should account for each engine's sourcing strategy and volatility profile.
  • Accept that some variation is permanent. Temperature sampling is a feature of LLMs, not a bug. Perfect citation stability does not exist in AI search. The goal is sustained visibility across engines and waves, not a perfect score on any single check.
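The "require 2 or more consistent observations before changing strategy" rule from the checklist above can be encoded as a small alert gate. The consecutive-drop threshold is an assumption; the wave data in the example comes from the citation table earlier in the post:

```python
def confirmed_decline(citation_counts, min_consecutive_drops=2):
    """Return True only if citations declined in `min_consecutive_drops`
    consecutive waves. A single-snapshot drop is treated as noise."""
    streak = 0
    for prev, curr in zip(citation_counts, citation_counts[1:]):
        if curr < prev:
            streak += 1
            if streak >= min_consecutive_drops:
                return True
        else:
            streak = 0  # any recovery resets the streak
    return False

# ChatGPT's waves: one big drop, then a partial recovery -> treat as noise
print(confirmed_decline([23, 12, 14]))  # False
# Perplexity's waves: two consecutive declines -> a trend worth investigating
print(confirmed_decline([7, 5, 4]))     # True
```

A gate like this is the difference between reacting to ChatGPT's -48% swing (which partially reversed a week later) and reacting to Perplexity's slow, sustained slide.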

Methodology

We ran 20 queries across 5 AI search engines: ChatGPT, Perplexity, Gemini, Grok, and Claude. Each query was sent as a real-time API call, simulating how actual users interact with these platforms. We tracked 25 B2B SaaS brands across 5 categories (CRM, Project Management, Email Marketing, Analytics, Dev Tools). The same 20 queries were repeated identically in three waves: March 6, March 10, and March 15, 2026. Citation counts, brand mentions, position rankings, and pairwise engine overlap were compared across all three waves.

Frequently Asked Questions

Are AI search engine results truly random?

Not random, but nondeterministic. LLMs use temperature-based sampling that introduces controlled variability into outputs. The same query will generally surface similar brands and topics, but the specific ordering, citation links, and which brands appear or disappear can shift between runs. The underlying retrieval index also changes as engines re-crawl the web.

How often should I check my AI search citations?

At minimum, weekly. Our data shows meaningful shifts occurring between weekly snapshots, including complete brand disappearances. A single monthly check cannot distinguish real trends from stochastic noise. As of March 2026, continuous monitoring platforms like the FogTrail AEO platform use 48-hour refresh cycles to build statistically meaningful visibility baselines.

Which AI engine is most stable for brand citations?

Claude is the most deterministic engine in our dataset, producing exactly 6 brand citations across all three waves. ChatGPT is the most volatile, with citation counts swinging from 23 to 12 to 14. Grok stabilized after an initial jump (2 to 7 to 7). Perplexity and Gemini showed slow, steady declines.

Can a brand disappear from an AI engine without doing anything wrong?

Yes. ActiveCampaign went from being cited with direct links on ChatGPT to completely absent in one week, with no changes to its website or content. The other four engines continued to mention it. This kind of single-engine disappearance is a documented feature of nondeterministic AI search.

Does this mean AEO monitoring tools are useless?

Monitoring is essential, but only if it accounts for nondeterminism. Tools that provide a single snapshot without historical comparison or multi-engine coverage cannot distinguish signal from noise. Effective AEO monitoring requires longitudinal data across multiple engines and multiple observation windows.
