FogTrail Team

We Ran the Same 20 Queries 3 Times Across 5 AI Engines. Here's How Much the Results Changed Each Week.

AI search engine recommendations are not stable. We sent the same 20 B2B software queries to ChatGPT, Perplexity, Gemini, Grok, and Claude once per week for three consecutive weeks, and the results shifted every time. ChatGPT's brand citation count swung from 23 to 12 to 14. Cross-engine consensus on the top recommendation oscillated from 50% to 55% back to 50%. One midmarket brand, ActiveCampaign, went from being cited with direct links on ChatGPT to completely invisible on that engine in a single week.

If you are monitoring your AI search visibility with a single snapshot, you are measuring noise. The signal only emerges across multiple observations.

Why AI Search Results Change Between Identical Queries

Large language models use temperature-based sampling when generating responses. This means the same prompt can produce different outputs on different runs, even when the underlying model and retrieval index have not changed. Every recommendation an AI engine makes is, to some degree, a dice roll. The retrieval step (which sources get pulled in) adds another layer of variability: different documents may be surfaced on different runs depending on index freshness and ranking thresholds.
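Temperature sampling is easy to demonstrate in isolation. The sketch below uses made-up logits for three hypothetical brand tokens; it is not any engine's actual decoding code, just a minimal illustration of how a fixed model state can still return a different "recommendation" on each run:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from a softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Hypothetical logits for three candidate brand tokens
brands = ["BrandA", "BrandB", "BrandC"]
logits = [2.0, 1.6, 0.5]

rng = random.Random(42)
runs = [brands[sample_with_temperature(logits, 1.0, rng)] for _ in range(10)]
# Identical "prompt", identical "model": the sampled recommendation still
# varies run to run. At temperature near zero, sampling collapses to the
# single highest-logit brand, which is why low-temperature output is stabler.
print(runs)
```

At temperature 1.0 the second-ranked brand wins a large minority of runs; dialing temperature toward zero makes the argmax brand win essentially every time, which is the stability/diversity trade-off engines are tuning.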

This is fundamentally different from traditional search engines, where the same Google query returns essentially the same results for days or weeks at a time. AI search engines look deterministic to the end user because each query produces a confident, well-structured answer. But run that query again tomorrow, and the answer may cite different brands, recommend a different product first, or drop a brand entirely.

Our three-wave study, covering 300 engine-query observations (20 queries × 5 engines × 3 waves) and tracking the same 25 B2B SaaS brands throughout, quantifies exactly how much this matters.

The Data: Three Waves, Same Queries, Different Results

Brand Citation Counts Per Engine Across 3 Waves

This table shows how many tracked brands each engine linked to with a direct URL. These are not mentions (brand name appearing in text); they are citations (brand website URL included in the response).

| Engine | Wave 1 (Mar 6) | Wave 2 (Mar 10) | Wave 3 (Mar 15) | Change W1→W2 | Change W2→W3 |
|---|---|---|---|---|---|
| ChatGPT | 23 | 12 | 14 | -48% | +17% |
| Perplexity | 7 | 5 | 4 | -29% | -20% |
| Gemini | 7 | 6 | 5 | -14% | -17% |
| Grok | 2 | 7 | 7 | +250% | 0% |
| Claude | 6 | 6 | 6 | 0% | 0% |

ChatGPT is the most volatile. Its citation count nearly halved in one week, then partially recovered the next. Claude is the most deterministic engine in the dataset: exactly 6 brand citations in all three waves. Grok jumped from 2 to 7 between Wave 1 and Wave 2, then held steady.
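The mention-versus-citation distinction used throughout this study can be operationalized as a small classifier. This is an illustrative sketch, not FogTrail's actual pipeline; the response strings and the matching rule (substring for the domain, case-insensitive regex for the name) are assumptions:

```python
import re

def classify_brand_presence(response_text, brand_name, brand_domain):
    """Classify one engine response as 'citation', 'mention', or 'absent'.

    A citation means the brand's URL appears in the response; a mention
    means only the brand name appears in the text.
    """
    has_url = brand_domain.lower() in response_text.lower()
    has_name = re.search(re.escape(brand_name), response_text, re.IGNORECASE)
    if has_url:
        return "citation"
    if has_name:
        return "mention"
    return "absent"

# Hypothetical engine responses
r1 = "We recommend ActiveCampaign (https://www.activecampaign.com) for startups."
r2 = "ActiveCampaign is another popular option for email automation."
r3 = "Mailchimp and ConvertKit lead this category."

print(classify_brand_presence(r1, "ActiveCampaign", "activecampaign.com"))  # citation
print(classify_brand_presence(r2, "ActiveCampaign", "activecampaign.com"))  # mention
print(classify_brand_presence(r3, "ActiveCampaign", "activecampaign.com"))  # absent
```

Counting "citation" results per engine per wave is what produces a table like the one above.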

Cross-Engine Consensus Over Time

Consensus here means 4 or more of 5 engines agreeing on which brand should be the #1 recommendation for a given query.

| Metric | Wave 1 | Wave 2 | Wave 3 | Pattern |
|---|---|---|---|---|
| Strong consensus (4+/5 agree) | 50% | 55% | 50% | Oscillating |
| Unanimous (5/5 agree) | 20% | 30% | 30% | Partial improvement |
| Pairwise overlap floor | 58% | 63% | 58% | Oscillating |

The strong consensus rate went up 5 points in Wave 2, then dropped back down to its Wave 1 level. The pairwise overlap floor (the lowest agreement between any two engines) followed the same pattern: 58%, 63%, 58%. The engines are not converging on shared recommendations. They are oscillating.
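The strong-consensus metric is straightforward to compute from per-query top picks. The sketch below shows one plausible implementation; the query data is invented for illustration and is not the study's raw results:

```python
from collections import Counter

def strong_consensus_rate(top_picks_by_query, threshold=4):
    """Fraction of queries where >= `threshold` engines name the same #1 brand.

    `top_picks_by_query` maps each query to a list of #1 recommendations,
    one entry per engine.
    """
    hits = 0
    for picks in top_picks_by_query.values():
        _, count = Counter(picks).most_common(1)[0]
        if count >= threshold:
            hits += 1
    return hits / len(top_picks_by_query)

# Illustrative wave: 4 queries x 5 engines (hypothetical picks)
wave = {
    "best CRM for startups":       ["HubSpot", "HubSpot", "HubSpot", "HubSpot", "Salesforce"],
    "best email platform":         ["Mailchimp", "ConvertKit", "Mailchimp", "Brevo", "Mailchimp"],
    "best web app deployment":     ["Vercel", "Vercel", "Vercel", "Vercel", "Vercel"],
    "best product analytics tool": ["Mixpanel", "Amplitude", "Mixpanel", "Amplitude", "PostHog"],
}
print(strong_consensus_rate(wave))  # 2 of 4 queries reach 4-of-5 agreement -> 0.5
```

Setting `threshold=5` gives the unanimous rate from the same data.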

Pairwise Engine Agreement: Who Agrees With Whom?

| Engine Pair | Wave 1 | Wave 2 | Wave 3 | Trend |
|---|---|---|---|---|
| ChatGPT + Gemini | 62% | 67% | 58% | Diverging |
| ChatGPT + Grok | 58% | 70% | 71% | Converging |
| Grok + Claude | 62% | 72% | 75% | Converging |
| Perplexity + Gemini | 71% | 72% | 67% | Declining |
| ChatGPT + Claude | 61% | 67% | 62% | Oscillating |

In Wave 3, ChatGPT and Gemini agree on only 58% of brand mentions, matching the lowest pairwise overlap recorded anywhere in the study. Meanwhile, Grok and Claude have quietly converged to 75% agreement. The engines are forming shifting alliances rather than moving toward a shared consensus.
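The study does not publish its overlap formula, but Jaccard similarity over each engine's surfaced brand set is one plausible reading. The brand sets below are invented for illustration:

```python
def pairwise_overlap(brands_a, brands_b):
    """Jaccard overlap between the brand sets two engines surfaced.

    One reasonable way to compute pairwise agreement; the study's exact
    formula is not specified, so treat this as an assumption.
    """
    a, b = set(brands_a), set(brands_b)
    if not (a or b):
        return 1.0  # two empty result sets trivially agree
    return len(a & b) / len(a | b)

# Hypothetical brand sets for one engine pair in one wave
chatgpt = {"HubSpot", "Salesforce", "Mailchimp", "Vercel", "Mixpanel"}
gemini  = {"HubSpot", "Salesforce", "Brevo", "Vercel", "Amplitude", "Netlify"}
print(pairwise_overlap(chatgpt, gemini))  # 3 shared / 8 total -> 0.375
```

The "pairwise overlap floor" in the consensus table is then simply the minimum of this value over all ten engine pairs.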

The ActiveCampaign Disappearance: From Cited to Invisible in One Week

The most dramatic example of nondeterminism in our dataset involves ActiveCampaign, a midmarket email marketing platform with over 185,000 customers.

| Engine | ActiveCampaign in Wave 1 | Wave 2 | Wave 3 |
|---|---|---|---|
| Gemini | Mentioned (4 queries) | Mentioned (4) | Mentioned (4) |
| Claude | Mentioned (3 queries) | Mentioned (3) | Mentioned (3) |
| Perplexity | Mentioned (3 queries) | Mentioned (3) | Mentioned (2) |
| Grok | Mentioned (4 queries) | Mentioned (4) | Mentioned (2) |
| ChatGPT | Mentioned (2 queries) | Cited (2 citations) | 0 mentions |

In Wave 2, ChatGPT actually increased its ActiveCampaign engagement, giving it cited links in 2 responses. One week later, ActiveCampaign disappeared from ChatGPT entirely. Zero mentions across all 4 email marketing queries. The other four engines continued to mention it. Nothing about ActiveCampaign's website, content, or market position changed in those 7 days. This is what nondeterministic citation behavior looks like at the brand level.

Supporting Evidence: Engine-Level Instability

When we asked ChatGPT "best email marketing platform for startups" in Wave 2, it recommended ConvertKit first. In Wave 3, it led with Mailchimp. Same query, same engine, different week, different answer.

When we asked all five engines "best platform for deploying web apps," Vercel held position #1 across every engine in Waves 1 and 2 (14 out of 14 responses, then 15 out of 16). In Wave 3, ChatGPT switched to recommending Netlify first, breaking the strongest structural pattern in the entire dataset. The other four engines continued to lead with Vercel. Whether this sticks or reverts in Wave 4 is anyone's guess, which is exactly the point.

Even the consensus metrics oscillate. CRM, one of the most stable categories in Waves 1 and 2 (3 of 4 queries with strong agreement), dropped to 1 of 4 in Wave 3 as Perplexity and Grok rotated their HubSpot and Salesforce preferences.

What This Means for AEO Strategy

The implications for anyone tracking AI search visibility are uncomfortable but important.

First, single-snapshot monitoring is unreliable. Any brand tracking tool that checks your AI citations once and reports a score is giving you one sample from a distribution, not a measurement. ChatGPT could show you cited today and invisible tomorrow without any change on your end. The monitoring frequency matters as much as the monitoring itself.

Second, week-to-week changes in citation counts are not actionable. If your ChatGPT citations drop by 40% between checks, the correct response is not to panic and overhaul your content strategy. It might be temperature noise. It might revert next week. You need multiple data points before you can distinguish a real trend from stochastic variation. Claude's perfect 6, 6, 6 citation count across three waves shows that some engines are more stable than others, but you cannot know which pattern applies to your brand without longitudinal data.

Third, multi-engine monitoring is non-negotiable. ActiveCampaign's ChatGPT disappearance would look catastrophic in a single-engine dashboard. In a multi-engine view, it is clearly a ChatGPT-specific anomaly: the brand still appears on 4 of 5 engines. Without multi-engine context, you cannot tell whether a drop is universal (a real problem) or engine-specific (a nuisance). As of March 2026, AEO platforms that monitor fewer than 5 engines are giving you an incomplete picture by design.

Fourth, the engines are not converging. Early in our study, Wave 2 data suggested the engines might be aligning over time. Wave 3 disproved that. The 50%, 55%, 50% consensus pattern and the 58%, 63%, 58% pairwise floor show oscillation, not convergence. There is no reason to expect AI search recommendations to stabilize anytime soon.

What You Can Do About It

  • Monitor across multiple waves before acting. A citation drop in one snapshot is not a signal. Three consecutive waves of decline is. Set your monitoring cadence to at least weekly and require 2 or more consistent observations before changing strategy.
  • Track all 5 major AI engines simultaneously. ChatGPT, Perplexity, Gemini, Grok, and Claude each behave differently. A brand can be cited on four engines and invisible on the fifth. Single-engine monitoring hides this.
  • Distinguish between mentions and citations. ActiveCampaign was mentioned 11 times across 4 engines in Wave 3 but cited zero times. Mentions without citations mean the engines know you exist but are not willing to vouch for you with a link.
  • Prioritize the engines that matter most to your audience. Claude is the most deterministic (stable citations, predictable behavior). ChatGPT is the most volatile but also the highest traffic. Your AEO effort should account for each engine's sourcing strategy and volatility profile.
  • Accept that some variation is permanent. Temperature sampling is a feature of LLMs, not a bug. Perfect citation stability does not exist in AI search. The goal is sustained visibility across engines and waves, not a perfect score on any single check.
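The "require 2 or more consistent observations before changing strategy" rule from the checklist above can be encoded as a small alert gate. The consecutive-drop threshold is an assumption; the wave data in the example comes from the citation table earlier in the post:

```python
def confirmed_decline(citation_counts, min_consecutive_drops=2):
    """Return True only if citations declined in `min_consecutive_drops`
    consecutive waves. A single-snapshot drop is treated as noise."""
    streak = 0
    for prev, curr in zip(citation_counts, citation_counts[1:]):
        if curr < prev:
            streak += 1
            if streak >= min_consecutive_drops:
                return True
        else:
            streak = 0  # any recovery resets the streak
    return False

# ChatGPT's waves: one big drop, then a partial recovery -> treat as noise
print(confirmed_decline([23, 12, 14]))  # False
# Perplexity's waves: two consecutive declines -> a trend worth investigating
print(confirmed_decline([7, 5, 4]))     # True
```

A gate like this is the difference between reacting to ChatGPT's -48% swing (which partially reversed a week later) and reacting to Perplexity's slow, sustained slide.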

Methodology

We ran 20 queries across 5 AI search engines: ChatGPT, Perplexity, Gemini, Grok, and Claude. Each query was sent as a real-time API call, simulating how actual users interact with these platforms. We tracked 25 B2B SaaS brands across 5 categories (CRM, Project Management, Email Marketing, Analytics, Dev Tools). The same 20 queries were repeated identically in three waves: March 6, March 10, and March 15, 2026. Citation counts, brand mentions, position rankings, and pairwise engine overlap were compared across all three waves.

Frequently Asked Questions

Are AI search engine results truly random?

Not random, but nondeterministic. LLMs use temperature-based sampling that introduces controlled variability into outputs. The same query will generally surface similar brands and topics, but the specific ordering, citation links, and which brands appear or disappear can shift between runs. The underlying retrieval index also changes as engines re-crawl the web.

How often should I check my AI search citations?

At minimum, weekly. Our data shows meaningful shifts occurring between weekly snapshots, including complete brand disappearances. A single monthly check cannot distinguish real trends from stochastic noise. As of March 2026, continuous monitoring platforms like the FogTrail AEO platform use 48-hour refresh cycles to build statistically meaningful visibility baselines.

Which AI engine is most stable for brand citations?

Claude is the most deterministic engine in our dataset, producing exactly 6 brand citations across all three waves. ChatGPT is the most volatile, with citation counts swinging from 23 to 12 to 14. Grok stabilized after an initial jump (2 to 7 to 7). Perplexity and Gemini showed slow, steady declines.

Can a brand disappear from an AI engine without doing anything wrong?

Yes. ActiveCampaign went from being cited with direct links on ChatGPT to completely absent in one week, with no changes to its website or content. The other four engines continued to mention it. This kind of single-engine disappearance is a documented feature of nondeterministic AI search.

Does this mean AEO monitoring tools are useless?

Monitoring is essential, but only if it accounts for nondeterminism. Tools that provide a single snapshot without historical comparison or multi-engine coverage cannot distinguish signal from noise. Effective AEO monitoring requires longitudinal data across multiple engines and multiple observation windows.
