Five frontier LLMs disagree on 67% of 1k real-world fact-check claims
Recorded: May 28, 2026, 1:01 p.m.
| Original | Summarized |
Beyond Benchmarks: Frontier LLM Disagreement on Fact-Checks

Lenz Research · Snapshot v1.0 · data as of May 21, 2026
Jordanov, Kosta · Lenz Research · kosta@lenz.io

We presented 1,000 recent real user claims to the five

Key findings

67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no majority forms at all.

1 How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree: at least one model dissents from the majority verdict, or no strict majority forms at all. The breakdown:

For each claim we looked at the five frontier verdicts and asked: did at least three pick the same answer (a strict majority)? If yes, how many of the remaining models dissented? If no clear majority emerged at all (verdicts split across three or four different buckets), the claim falls in the "Models split, no majority" row.

Most of these claims are unlikely to appear in any training corpus with a gold label attached: there's no canonical answer key to pattern-match against, no benchmark leaderboard to anchor to.

We refer below to the "majority" and to "dissent from the majority." A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.

Frontier verdict pattern · Claims · Share of corpus (95% CI)
All 5 agreed (unanimity) · 328 · 33% (30–36%)

Panel agreement: Krippendorff’s α (ordinal) = 0.639 (n=1000 claims, 5 raters).

Lower bound on model error. For each claim, exactly one of the four verdict buckets is the correct answer.
If we assume the panel's most popular bucket is the correct one (the most charitable assumption), the minimum number of models that picked a wrong verdict is:

≥1 model wrong on 67% of claims (any non-unanimous panel)

Relaxing the "most popular is correct" assumption can only raise these counts, never lower them. The actual error rates are likely higher still: even the 33% of cases where all five agree can, and likely do, include shared blind spots.

2 Substantive vs nuance disagreement

On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are 2 or more buckets apart in our 4-bucket rubric, a disagreement that goes beyond calibration.

Not every disagreement is equal. A "True" vs "Mostly True" split is a confidence-calibration shift. A "True" vs "False" split is a substantive disagreement about the answer. We measure this as the max pairwise bucket distance across the 5 verdicts on each claim, where the verdicts are ordered True (0) → Mostly True (1) → Misleading (2) → False (3).

Distance · Interpretation · Claims · Share (95% CI)
0 · Full unanimity (all 5 picked the same bucket) · 328 · 33% (30–36%)

Caveat. Bucket distance treats True / Mostly True / Misleading / False as an ordinal scale; an equal-spaced interpretation is a simplification. A 2-bucket gap can still reflect rubric ambiguity, temporal-framing differences, or differing interpretations of "Misleading." We report it as a coarse "substantive vs nuance" indicator, not a metric of error magnitude.

3 Model-vs-model agreement

Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53%); three pairs tie at the floor.
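The per-claim measures used in §1 to §3 (strict majority, max pairwise bucket distance, and pairwise label agreement) reduce to simple comparisons over each claim's verdict labels. The sketch below is one way to compute them, not the harvester's actual code: the function names, the three-model toy rows, and the `m1`/`m2`/`m3` keys are all hypothetical. Only the bucket ordering comes from the rubric above.

```python
from collections import Counter
from itertools import combinations

# Ordinal positions from the rubric: True (0) -> Mostly True (1) -> Misleading (2) -> False (3).
BUCKET = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def majority_verdict(verdicts):
    """Return the label chosen by more than half of the models, or None if no strict majority."""
    verdicts = list(verdicts)
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count * 2 > len(verdicts) else None


def max_bucket_distance(verdicts):
    """Largest pairwise gap on the ordinal scale; equals max position minus min position."""
    positions = [BUCKET[v] for v in verdicts]
    return max(positions) - min(positions)


def pairwise_agreement(rows, model_a, model_b):
    """Share of claims on which two models chose exactly the same label."""
    same = sum(1 for row in rows if row[model_a] == row[model_b])
    return same / len(rows)


# Hypothetical toy data (three made-up models, four made-up claims), not from the Lenz CSV.
rows = [
    {"m1": "True", "m2": "True", "m3": "Mostly True"},
    {"m1": "False", "m2": "True", "m3": "False"},
    {"m1": "Misleading", "m2": "Misleading", "m3": "Misleading"},
    {"m1": "True", "m2": "Mostly True", "m3": "False"},
]

majorities = [majority_verdict(row.values()) for row in rows]
distances = [max_bucket_distance(row.values()) for row in rows]
substantive_share = sum(d >= 2 for d in distances) / len(rows)
agreement = {
    (a, b): pairwise_agreement(rows, a, b)
    for a, b in combinations(["m1", "m2", "m3"], 2)
}
```

On a five-model panel the same `majority_verdict` check corresponds to "at least three of five", and a claim counts as substantive when its distance is 2 or more.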
How often each pair of frontier models picked the same verdict label, across

Pairwise agreement matrix; rows and columns: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro.

4 Per-model behavior

Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one's verdict matches the strict majority of the other four (4.2).

Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims; without ground truth, we can't separate the two. The table below shows the share of claims each model assigned to each bucket, with 95% Wilson CIs underneath each cell.

Table rows (Model): GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro.

4.2 Agreement with the rest of the panel

Across the five models, peer-majority agreement ranges from 69% to 81%. This is peer-alignment in this corpus, not correctness: no model is treated as ground truth here, and eligible n differs per row.

For each model, how often does its verdict match the strict majority (≥3/4) of the other four? A claim is eligible only when a ≥3/4 majority exists among the other four.

Table columns: Model · Agreement w/ peer majority · Eligible n · Ineligible · Tier; rows: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro.

5 Detailed results

Denominator per row: claims in that domain (the Claims column).

Table columns: Domain · Claims · Any disagreement · Substantive (≥2 buckets) · No majority; rows: Finance, General, Health, History, Legal, Politics, Science, Tech.

5.2 Per-verdict panel agreement

When the panel does land on a middle bucket, it almost never converges. Mostly True and Misleading majorities reach unanimity at most 5% of the time, vs 43–47% for True and False majorities. Consistent with this, work on a different real-world corpus (17,856 PolitiFact claims with a single-family Llama-3 ablation, Schwab et al.
2025) finds nuanced labels are where fact-check verdict models concentrate their errors, a related observation from a different methodological setup (single-family ablation, not a frontier panel).

Denominator: claims with a strict ≥3/5 frontier majority on this verdict.

Table columns: Majority verdict · Eligible n · Unanimous (5/5) · Majority only (3-4 of 5); rows: True, Mostly True, Misleading, False.

Viewed from the other direction: of the 328 claims where all 5 frontier models converged on the same verdict, the distribution across verdicts:

Table columns: Unanimous verdict · Claims · Share of unanimous; rows: True, Mostly True, Misleading, False.

6 Data

1,000 claims: the most recent real-world user submissions to a fact-checking platform that pass every eligibility filter listed under Exclusions below. None of these claims is older than February 15, 2026. Unless otherwise stated, every metric on this page uses this set as its denominator; tables that use a different denominator (e.g. claims with a strict ≥3/5 frontier majority on a verdict) state it inline.

These claims were submitted to Lenz,

The atomic_claim field in the CSV is not the user's raw submission. It's the output of Lenz's framing step, which strips emotional language and bias and distills the input into a single neutral, testable proposition anchored to the submission date. Frontier models were rated against the framed claim, not the raw text. A user who types "Canadian authorities are throwing Christians in jail for quoting the Bible!!!" is rated on the proposition "As of April 4, 2026, Canadian authorities have jailed individuals for publicly quoting the Bible because of their Christian beliefs."

Exclusions

Claims marked private by the submitting user

7 Methodology

Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)

7.2 Prompt

Output exactly one label: True, Mostly True, Misleading, or False.

All five models received the same system placeholder (.) and the same user prompt template (usr_v2).
No structured-output schema, tool-call schema, seed, top-p, or logit-bias controls were used. The harvester requested deterministic decoding where supported (temperature=0.0); GPT-5.4 and Claude Opus 4.7 were called without an explicit temperature because their provider adapters reject custom temperature settings. Output length was capped at 16 tokens for GPT-5.4, Claude Opus 4.7, and Sonar Pro; Gemini 3 Pro and Gemini 3 Pro + Search used a 1024-token cap (lower caps produced provider-side errors during harvester development). Gemini 3 Pro + Search enabled Google Search grounding; Sonar Pro was treated as retrieval-augmented through Perplexity's search-backed API. Parseable outputs had to equal exactly one of the four labels after normalization.

No LLM grader. All measurements derive from direct parsed-label equality

Sampling frame & inferential target. The corpus is the 1,000 most recent eligible claims submitted to this single fact-checking platform (per the filters in §6). It is not a probability sample from any wider population, and not a complete enumeration (older eligible claims exist but are excluded by the cap). Reported Wilson 95% CIs are nominal binomial intervals under a model where each claim is an independent draw from a hypothetical stream of similar eligible submissions to this same platform under the same screening rules. They are not coverage statements about "all real-world fact-checks."

Non-iid caveat. Lenz claims are not independently and identically distributed: users cluster submissions around news events, screening selects for certain topics, and individual users often submit multiple related claims in a single session. True sampling variability under a more honest cluster model (e.g. cluster bootstrap) would likely be larger than what Wilson reports. We surface CIs as a minimum precision floor, not a guaranteed coverage interval.

Wilson 95% confidence intervals on every reported rate.
We use the Wilson score interval [1] rather than the Wald (normal-approximation) interval because it has better small-N behavior and handles boundary cases (p=0/n, p=n/n) without producing degenerate zero-width intervals. It is the de-facto standard in modern ML evaluation literature. Wilson CIs appear inline next to every rate in §1, §2, §3, §4.2, §5, and the appendix; the printed bounds are exact, not centered on the raw point estimate.

Inter-rater reliability: Krippendorff's α (ordinal). The verdict scale (True / Mostly True / Misleading / False) is ordinal, so we score with Krippendorff's α at the ordinal level of measurement [2] rather than Fleiss' κ, which treats categories as nominal and would underestimate agreement: a True ↔ Mostly True 1-bucket disagreement is much smaller than a True ↔ False polar split, and the ordinal metric reflects that. α is reported as a single panel-level number alongside the §1 results table.

No model-vs-model significance testing. We report pairwise agreement rates with 95% Wilson CIs as descriptive statistics rather than treating the page as a model leaderboard. Pairwise significance tests are sensitive to the comparison target and eligibility set: for example, peer-majority agreement is a paired claim-level outcome, but each model has a different set of claims where the other four models form a strict majority.

References.

8 Reproducibility

Full per-claim data: download CSV.

PDF artifact: download PDF. Browser-independent rendering of this page for offline reading, citation, or arXiv-style preprint hosting. Hash-pinned in the snapshot manifest (pdf_sha256) so the PDF served at /v1.0/pdf is byte-identical across re-deploys.

This snapshot is v1.0, data as of May 21, 2026. Harvester prompt version: usr_v2. Grader: direct parsed-label equality across

Permanent record & citation: doi.org/10.5281/zenodo.20344847. The Zenodo deposit mirrors the PDF artifact under a permanent DOI for citation in academic and preprint contexts.
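For readers who want to re-derive the intervals from the per-claim CSV, the Wilson score interval named in §7 can be computed with the standard library alone. This is a minimal sketch of the textbook formula, not the page's own code: the function name `wilson_interval` and the clamping to [0, 1] are assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Two-sided Wilson score interval for a binomial proportion.

    The default z is the standard normal quantile for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    # The interval is centered on a shrunken estimate, not on the raw p.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Headline rate from §1: 672 of 1,000 claims with any disagreement.
low, high = wilson_interval(672, 1000)
print(f"{low:.1%} to {high:.1%}")
```

With these inputs the bounds round to 64% and 70%, in line with the interval printed next to the headline rate. At the boundary (0 successes) the interval still has non-zero width, which is the property the methodology cites as the reason for preferring Wilson over Wald.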
9 Limitations

The pigeonhole rate is a floor on rubric inconsistency, not an absolute "model X is factually wrong" judgement on any specific claim. Only one of {True, Mostly True, Misleading, False} can be the correct bucket, so any disagreement implies at least one inconsistent verdict, but we don't claim to know which model is wrong on which claim.

10 FAQ

Why no Lenz-vs-frontier comparison? You're a fact-checking platform.
A meaningful accuracy comparison requires human-labeled ground truth. We're working on a follow-up study (see below) that human-labels every claim in this corpus and compares both the frontier panel and Lenz's own verdicts against those labels. Until that ships, we'd rather publish nothing about Lenz's relative accuracy than publish a comparison that can't actually answer "who's right." This paper measures only what is measurable without ground truth: how the frontier panel behaves on real-world claims.

Has anyone measured frontier-LLM disagreement before?
Yang & Wang (2026) show top frontier models disagree on 16-38% of MMLU-Pro and GPQA items even at matched aggregate accuracy, and demonstrate that switching the annotation model in downstream scientific re-analyses can flip estimated treatment-effect signs. On real-world claim verification with rigorous human annotation, the canonical reference is AVeriTeC (4,568 fact-checked claims, multi-round annotation against 50 organizations, inter-annotator κ=0.619). Larger fact-check corpora exist, for example 17,856 PolitiFact claims under a single-family Llama-3 ablation.

Why not use a standard fact-checking benchmark like AVeriTeC instead of building a corpus?
Two reasons. First, AVeriTeC, PolitiFact, and similar fact-check corpora have been publicly available for years and almost certainly appear in current frontier-model training data, so measured disagreement on them confounds true inference disagreement with memorization.
Lenz's corpus is structurally fresh: real-user submissions from the past 180 days, indexed only on lenz.io, never paired with canonical verdicts in any public training set. Second, those corpora draw from a narrower distribution (political claims from US-centric fact-checkers, often pre-screened for newsworthiness) than what real users actually ask about: Lenz claims span health, science, finance, history, tech, and legal questions in the same 4-bucket rubric.

What about benchmark contamination: did the models see these claims during training?
These are recent real-user submissions, not curated benchmark items from SimpleQA, TruthfulQA, FActScore, or other public datasets. Some claims may overlap topically with material seen in training, but they aren't paired with canonical answer keys the way benchmark items are. Retrieval-enabled models can still find sources on the live web, including Lenz's own public claim pages, and this corpus isn't a controlled contamination audit.

Why these five models?

Why four buckets instead of five (with Abstain)?

Will you re-run this?
This is a frozen snapshot (v1.0, data as of May 21, 2026). The archival URL /research/llm-disagreement/v1.0 will always serve this exact version. When v2 ships (with more claims, refreshed model versions, or methodology changes), it'll appear with a clear changelog entry; v1.0 stays at its archival URL.

What's the planned follow-up?
We're working on a companion study that human-labels every claim in this corpus and uses those labels as ground truth to evaluate both the five frontier models and Lenz's own verdict. The point isn't a leaderboard.
The point is to map the structure of disagreement: where do frontier panels systematically diverge from a human consensus, where does Lenz diverge from both, how each individual model and Lenz align with the same human reference, and what categories of claims drive each kind of divergence (rubric ambiguity, temporal framing, domain specialization, calibration drift). The current paper says that the frontier disagrees on real-world claims; the follow-up will say how, on the same corpus, with humans as the reference.

11 Ethics & data use

Only public-facing claim fields are used: the atomic claim text and the claim's creation date. If a claim is later privatized or deleted by its submitter, we can drop it from this snapshot

12 Changelog

v1.0 (May 21, 2026, code a6b78be): initial frozen snapshot. Frontier-disagreement only; no Lenz-vs-frontier comparison.

Appendix: Example claims where the frontier fractures

The twenty claims in this corpus with the widest spread between the highest- and lowest-bucket frontier verdicts. These are claims where the panel doesn't just disagree: it disagrees substantively, with at least one model picking a verdict ≥2 buckets away from another. Ordered by max pairwise bucket distance (descending), no-majority cases tie-broken first, then by stable hash of the claim ID. Deterministic: the page renders the same examples on every load until the next snapshot.

Muthiah Muralidaran said that the Indian Premier League is purely a business and that flat pitches are prepared because low-scoring matches are boring for sponsors.
→ Politics · max bucket distance: 3 · no majority

The World Bank's active portfolio in Nigeria stands at over $16.4 billion as of 2025.
→ Finance · max bucket distance: 3 · no majority

Individuals who prefer music with less positive emotional content tend to have higher intelligence.
→ Science · max bucket distance: 3 · no majority

Humans systematically overestimate short time intervals.
→ Science · max bucket distance: 3 · no majority

There exist published research papers on unsupervised regime identification in multivariate oceanic current time series, particularly focusing on coastal regions and methods that infer the number of regimes from data, which are relevant for forecasting applications in areas such as Bahia de Santos, Brazil.
→ Science · max bucket distance: 3 · no majority

Equal Measures 2030’s 2024 SDG Gender Index provides a downloadable dataset that includes a field labeled “required annual change”.
→ General · max bucket distance: 3 · no majority

The FDI World Dental Federation confirms that daily oral hygiene routines, including mouthwash use, significantly reduce the incidence of gingivitis, periodontal disease, and dental caries.
→ Health · max bucket distance: 3 · no majority

A group known as "Khanna Coolies" operated as bicycle-riding food porters delivering meals in Calcutta.
→ History · max bucket distance: 3 · no majority

Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides.
→ General · max bucket distance: 3 · no majority

A study led by Yadan Li at Southwest University in Chongqing found that exposure to frightening images and sounds at night (20:00) produced greater increases in skin conductance, heart rate, and blood pressure than the same exposure during the day (08:00), regardless of room lighting conditions.
→ Science · max bucket distance: 3 · no majority

Generator performance standards parameters are the responsibility of the Network Planning and Design department, not the Asset Management department.
→ Tech · max bucket distance: 3 · no majority

Teams in the esports game Valorant that select agent compositions with balanced roles such as duelist, controller, initiator, and sentinel have a higher probability of winning compared to teams with unbalanced compositions, according to statistical analysis of professional match data as of April 2026.
→ Science · max bucket distance: 3 · no majority

Kenyan President William Ruto has stated that Kenya has a total of 20,000 kilometers of tarmacked (paved) roads.
→ Politics · max bucket distance: 3 · no majority

SIGMAS raised a $1 million seed funding round in 2026, co-led by Mucker Capital and HongShan Capital (formerly Sequoia China).
→ Finance · max bucket distance: 3 · no majority

The diagnostic literature on autism describes autistic people who are frequently devastated by accidentally breaking social rules they were trying hard to follow.
→ Health · max bucket distance: 3 · no majority

Donald Trump said that an attack on Iran was postponed at the request of Gulf allies.
→ Politics · max bucket distance: 3 · no majority

Wildlife species in Vietnam, including elephants, rhinoceroses, and tigers, face significant threats from habitat loss and are classified as endangered.
→ Science · max bucket distance: 3 · no majority

Volodymyr Zelensky was nominated for the Nobel Peace Prize for 2026.
→ Politics · max bucket distance: 3 · no majority

Amadeo historically served as a logistical transition point between the urbanized lowlands and the mountainous hinterlands of Cavite, Philippines.
→ History · max bucket distance: 3 · no majority

As of May 6, 2026, Muslims from multiple countries have gathered in Hooghly district, West Bengal, India.
→ Politics · max bucket distance: 3 · no majority

On 67% of real-world user fact-checks in this corpus, the five strongest frontier LLMs disagree. Rely on any single one and you inherit that disagreement.

Snapshot v1.0 · data as of May 21, 2026 · code a6b78be. Citation-stable archive: /research/llm-disagreement/v1.0. Full per-claim CSV: data.csv. PDF: pdf. DOI: 10.5281/zenodo.20344847. |
A study by Jordanov (Lenz Research) investigated the agreement among five frontier large language models when applied to one thousand recent, real-world user claims submitted to a fact-checking platform, revealing significant disagreement beyond standard benchmark comparisons. The methodology involved presenting each claim, anchored by a specific date, to the LLMs and requiring them to classify the claim using a four-bucket rubric: True, Mostly True, Misleading, or False. Since only one bucket can be correct per claim, any divergence among the models signifies potential label inconsistency. The analysis demonstrated that disagreement is frequent, with at least one frontier model dissenting from the majority verdict on 67% of the claims. Furthermore, a substantial portion of these disagreements involved substantive differences rather than mere calibration shifts. Specifically, 34% of the claims exhibited a disagreement gap of two or more buckets between the most divergent verdicts, which indicates a substantive disagreement regarding the answer itself, rather than just confidence calibration. Overall agreement across the five models was characterized by Krippendorff's alpha for ordinal agreement of 0.639, suggesting nontrivial but limited consistency across the panel. The agreement patterns among the models varied based on the specific pair being compared, with the strongest peer agreement observed between Gemini 3 Pro and Gemini 3 Pro plus Search, reaching 75% agreement. Conversely, certain pairs displayed lower agreement rates, indicating model-specific biases or different reasoning structures. Examining the distribution of verdicts revealed that while the panel often converged on the extreme poles of True or False, the middle buckets, Mostly True and Misleading, showed minimal unanimity across the group. 
When analyzing the distribution of verdicts among the five models, some models tended to concentrate their outputs at the True or False poles, while others exhibited a more distributed spread across the middle range. This behavior reflects differing model-level decision priors interacting with the specific claims, without a ground truth to confirm accuracy. Because substantive disagreements are common, relying on any single model in this context means inheriting the panel's disagreement. Domain stratification revealed differential rates of disagreement across subject areas. Domains such as Finance, Health, Politics, and Science showed high disagreement rates, often exceeding 67%, indicating that the complexity or context of the claims significantly influences how the LLMs align their verdicts. The analysis further distinguished between nuance disagreement, such as a difference between True and Mostly True, and substantive disagreement, such as a difference between True and False, demonstrating that the nature of the divergence is as critical as the divergence itself. In summary, this research indicates that when assessing real-world fact-checks, frontier LLMs frequently disagree, and this disagreement is often substantive, particularly in complex domains. The study underscores that agreement among models is limited, and the manner in which models handle nuanced labels versus polar answers depends on the specific corpus and the inherent ambiguity of the fact-checking rubric. The study explicitly cautions that this measurement reflects model behavior on real-world submissions rather than objective correctness, noting that a true accuracy comparison requires a separate process involving human-labeled ground truth. |