When five AI search engines are asked the same 20 B2B software questions, they reach strong agreement on the #1 recommended brand (at least four of five engines naming the same product) only 50% of the time. The other half of the time, engines split at position one, with some queries producing as many as four different answers from five engines. Any brand monitoring only one AI engine is flying blind on half the queries its buyers are asking.
We tracked this disagreement across three weekly waves in March 2026, collecting 300 engine-query data points. The disagreement did not shrink over time. The strong-consensus rate oscillated: 50%, 55%, 50%. The engines are not converging on shared answers. They are shifting independently, forming temporary alliances that dissolve the following week.
The Scale of Disagreement
We ran 20 queries across five B2B SaaS categories (CRM, project management, email marketing, analytics, and dev tools) on ChatGPT, Perplexity, Gemini, Grok, and Claude. For each query, we recorded which brand each engine placed at position one.
As of March 2026, only 20-30% of queries produce unanimous agreement (all five engines naming the same brand first). Another 20-30% produce strong consensus (four of five agreeing). The remaining 45-50% fall short of strong consensus, with two, three, or even four different brands appearing at #1 across the five engines.
| Consensus Level | Wave 1 | Wave 2 | Wave 3 |
|---|---|---|---|
| Unanimous (5/5) | 4 queries (20%) | 6 queries (30%) | 6 queries (30%) |
| Strong (4/5) | 6 queries (30%) | 5 queries (25%) | 4 queries (20%) |
| Majority (3/5) | 3 queries (15%) | 3 queries (15%) | 5 queries (25%) |
| Split (2/5 or less) | 7 queries (35%) | 6 queries (30%) | 5 queries (25%) |
| Strong or better (4+/5) | 10 queries (50%) | 11 queries (55%) | 10 queries (50%) |
The strong-or-better consensus rate looked like it was improving in Wave 2 (50% to 55%). Then it dropped right back to 50% in Wave 3. This is the single most important finding: the engines are oscillating, not converging.
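For readers who want to reproduce the classification, here is a minimal Python sketch of the consensus tiers as defined above, assuming you have each engine's #1 pick for a given query. The engine-to-brand mapping shown is illustrative, not a row from our dataset.

```python
from collections import Counter

# Hypothetical per-query results: each engine's #1 recommendation for one query.
# Brand names here are illustrative, not taken from our dataset.
top_picks = {
    "chatgpt": "ClickUp",
    "perplexity": "Monday.com",
    "gemini": "ClickUp",
    "grok": "Asana",
    "claude": "Asana",
}

def consensus_tier(picks: dict[str, str]) -> str:
    """Classify a query by how many engines agree on the same #1 brand."""
    counts = Counter(picks.values())
    top_count = counts.most_common(1)[0][1]  # size of the largest agreeing group
    if top_count == 5:
        return "unanimous (5/5)"
    if top_count == 4:
        return "strong (4/5)"
    if top_count == 3:
        return "majority (3/5)"
    return "split (2/5 or less)"

print(consensus_tier(top_picks))  # -> "split (2/5 or less)"
```

Running the same function over all 20 queries in a wave gives the tier counts in the table above.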
Where They Agree, and Where They Don't
Some categories have clear consensus leaders. Dev tools is the most settled: Vercel holds position one on 88-100% of engine responses for deployment queries. "Best alternative to Mailchimp" produced unanimous agreement across all three waves, with every engine putting Mailchimp itself at #1: the incumbent dominates even its own alternative query.
Other categories are chaos. Project management produced zero queries with strong consensus in two consecutive waves. When we asked "best PM tool for engineering teams," Wave 1 returned Monday.com (twice), Linear, and Asana across four engines, while Claude recommended none of our tracked brands at all, suggesting Jira instead. By Wave 3, the answers had reshuffled to Asana, Asana, Linear, ClickUp, and Monday.com. No brand has held even majority (3/5) consensus on this query in any wave.
| Category | W1 Consensus (4+/5) | W2 | W3 | Trend |
|---|---|---|---|---|
| Dev Tools | 3/4 queries | 3/4 | 3/4 | Stable |
| CRM | 3/4 | 3/4 | 1/4 | Declining |
| Email Marketing | 2/4 | 2/4 | 2/4 | Stable |
| Analytics | 1/4 | 2/4 | 3/4 | Improving |
| Project Management | 1/4 | 0/4 | 0/4 | Stuck at zero |
CRM looked stable for two waves, then cracked. Analytics is converging. PM is maximally fragmented. The gap between the most consolidated category and the most fragmented is widening, not narrowing.
Engine Pairs Disagree More Than You'd Expect
We measured pairwise overlap: the Jaccard overlap of brand mentions between any two engines across all 20 queries (shared brands divided by total distinct brands). The lowest overlap in the entire three-wave dataset is 58%, recorded for ChatGPT and Gemini. At that level, the two engines disagree on more than four out of every ten brand mentions.
| Engine Pair | Wave 1 | Wave 2 | Wave 3 |
|---|---|---|---|
| Grok + Claude | 62% | 72% | 75% |
| Gemini + Grok | 67% | 79% | 74% |
| Perplexity + Gemini | 71% | 72% | 67% |
| ChatGPT + Grok | 58% | 70% | 71% |
| ChatGPT + Gemini | 62% | 67% | 58% |
The floor of pairwise overlap oscillated: 58%, 63%, 58%. The Wave 2 "convergence" was a one-wave artifact. The engines formed different alliances each week. In Wave 1, Perplexity and Gemini were the closest pair. In Wave 2, Grok and Gemini surged to 79%. In Wave 3, Grok and Claude became the tightest pair at 75%. ChatGPT, meanwhile, diverged from nearly everyone except Grok.
What It Looks Like in Practice
When we asked "best analytics tool for SaaS," Wave 1 produced three different #1 answers: Amplitude (Perplexity, Gemini, Grok), PostHog (ChatGPT), and Mixpanel (Claude). By Wave 3, Amplitude had consolidated to 3/5 engines, but PostHog and Mixpanel still held one engine each. A marketing director checking only ChatGPT would think PostHog is the consensus winner. A director checking only Perplexity would think it is Amplitude.
The PM category is even more extreme. For "what project management software should I use," Wave 3 returned Monday.com (Perplexity), ClickUp (ChatGPT, Gemini), and Asana (Grok, Claude). No brand reached even 3/5. The engines split three ways, and the split reshuffles week to week.
ChatGPT's behavior is particularly unpredictable. Its citation count swung from 23 to 12 to 14 across three waves. It dropped ActiveCampaign from all email marketing responses between Wave 2 and Wave 3 with no obvious cause. It gave Netlify its first-ever #1 position, something no other engine did. The highest-traffic AI engine is also the least stable. Meanwhile, Claude produced exactly 6 brand citations in all three waves, making it the most deterministic engine in our dataset.
What This Means
The implication is straightforward. If you track your brand's AI visibility on a single engine, you may be seeing as little as 60% of the picture: the lowest pairwise overlap in our data was 58%. The rest is a different set of recommendations on different engines.
This is not a temporary problem that will resolve as engines mature. Three waves of data show oscillation, not convergence. The strong-or-better consensus rate hit the same 50% mark in Wave 3 that it started at in Wave 1. The pairwise overlap floor returned to the same 58% it began at. If anything, the engines are becoming more distinct over time. ChatGPT is diverging from Gemini and Claude. Grok and Claude are forming their own cluster.
For brands in settled categories like dev tools (Vercel dominance) or certain email queries (Mailchimp dominance), single-engine monitoring might be adequate. But for brands in fragmented categories like project management, or contested ones like CRM and analytics, multi-engine monitoring is the only way to understand where you actually stand.
The window of opportunity matters too. PM brands have the widest opening right now because no consensus leader exists. Analytics challengers have a narrowing window as Amplitude and Google Analytics lock in positions. And as CRM showed when it cracked from 3/4 to 1/4 consensus in a single wave, even "settled" categories can destabilize overnight.
What You Can Do About It
- Monitor all five major engines, not just ChatGPT. A brand that is #1 on one engine can be absent on another for the same query.
- Check weekly, not monthly. The landscape shifts meaningfully between waves. Monthly snapshots miss the oscillation entirely.
- Prioritize by category. If your category has strong consensus, defend your position. If your category is fragmented, invest in content that engines trust to build your position before a competitor locks it in.
- Watch for engine-specific drops. ActiveCampaign went from cited to invisible on ChatGPT in one week. A single-engine disappearance can happen without warning and without any change on other engines (a minimal week-over-week check is sketched after this list).
- Treat each engine as a separate channel. They have different sourcing strategies, different biases toward brand size, and different volatility profiles. A strategy that works for Claude (deterministic, stable) will not necessarily work for ChatGPT (volatile, Wikipedia-heavy).
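To make the drop-detection point concrete, here is a minimal sketch of a week-over-week check, assuming you store each wave's responses as a nested mapping of engine to query to cited brands. The snapshot structure and the brands shown are illustrative, not our actual data.

```python
# Hypothetical weekly snapshots: engine -> query -> brands cited in the answer.
last_wave = {
    "chatgpt": {"best email marketing tool": ["Mailchimp", "ActiveCampaign", "Brevo"]},
    "claude":  {"best email marketing tool": ["Mailchimp", "ActiveCampaign"]},
}
this_wave = {
    "chatgpt": {"best email marketing tool": ["Mailchimp", "Brevo"]},
    "claude":  {"best email marketing tool": ["Mailchimp", "ActiveCampaign"]},
}

def engine_specific_drops(prev: dict, curr: dict, brand: str) -> list[tuple[str, str]]:
    """Return (engine, query) pairs where a brand was cited last wave but not this wave."""
    drops = []
    for engine, queries in prev.items():
        for query, brands in queries.items():
            still_cited = brand in curr.get(engine, {}).get(query, [])
            if brand in brands and not still_cited:
                drops.append((engine, query))
    return drops

print(engine_specific_drops(last_wave, this_wave, "ActiveCampaign"))
# -> [('chatgpt', 'best email marketing tool')]  # dropped on ChatGPT only; Claude still cites it
```

The same comparison run across all engines and queries each week surfaces single-engine disappearances like the ActiveCampaign drop before they show up in traffic.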
Methodology
We ran 20 queries across 5 AI search engines: ChatGPT, Perplexity, Gemini, Grok, and Claude. Each query was sent as a real-time API call, simulating how actual users interact with these platforms. We tracked 25 B2B SaaS brands across 5 categories (CRM, project management, email marketing, analytics, dev tools) over three weekly waves in March 2026, producing 300 engine-query data points. Consensus was defined as 4 or more of 5 engines agreeing on the same #1 brand. Pairwise overlap was calculated as the Jaccard index of the sets of brands each engine mentioned across all 20 queries (shared brands divided by total distinct brands).
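For concreteness, here is a minimal sketch of the pairwise overlap calculation as defined above, assuming each engine's citations have already been reduced to a set of brand names across all 20 queries. The two brand sets below are illustrative, not our actual data.

```python
def jaccard_overlap(mentions_a: set[str], mentions_b: set[str]) -> float:
    """Jaccard index: shared brand mentions divided by total distinct mentions."""
    union = mentions_a | mentions_b
    if not union:
        return 0.0
    return len(mentions_a & mentions_b) / len(union)

# Illustrative brand sets (not our actual data):
chatgpt_brands = {"HubSpot", "ClickUp", "Mailchimp", "PostHog", "Vercel", "Netlify", "Amplitude"}
gemini_brands  = {"HubSpot", "Asana", "Mailchimp", "Amplitude", "Vercel", "Mixpanel", "Monday.com"}

overlap = jaccard_overlap(chatgpt_brands, gemini_brands)
print(f"{overlap:.0%}")  # 4 shared brands / 10 distinct brands -> 40%
```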
FAQ
How often do all 5 AI engines agree on the #1 recommendation?
Only 20-30% of B2B software queries produce unanimous (5/5) agreement across ChatGPT, Perplexity, Gemini, Grok, and Claude. As of March 2026, full agreement is limited to well-established categories like dev tools deployment (Vercel) and specific "alternative to" queries where the incumbent dominates.
Are AI search engines converging on the same answers over time?
No. Three waves of data show oscillation, not convergence. The strong consensus rate went 50%, 55%, 50% across three weekly measurements. The pairwise overlap floor followed the same oscillation pattern: 58%, 63%, 58%.
Which AI engine disagrees the most with the others?
ChatGPT has the lowest pairwise overlap with 3 of 4 other engines as of March 2026. ChatGPT and Gemini agree on only 58% of brand mentions, the lowest overlap in the entire dataset. ChatGPT is also the most volatile engine, with citation counts swinging from 23 to 12 to 14 across three consecutive weeks.
Which B2B categories have the most engine disagreement?
Project management has the most disagreement, with zero queries reaching strong consensus in two consecutive waves. CRM looked stable but destabilized in Wave 3, dropping from 3/4 to 1/4 consensus queries. Analytics is the only category trending toward more agreement.
Why does multi-engine monitoring matter for AEO?
Because AI engines disagree on the #1 recommendation 50% of the time, tracking a single engine gives you an incomplete and potentially misleading view of your AI visibility. A brand can be the top recommendation on one engine and completely absent on another for the same buyer query.