
VeridicalQA: An Adversarial Multi-Turn Eval for Balanced Epistemic Resistance in Large Language Models



Abstract

Large language models are routinely evaluated on factual accuracy in isolation, but a critical and largely unmeasured failure mode emerges when that accuracy is challenged: sycophantic capitulation, in which a model abandons a correct answer under social pressure from a user who falsely insists the model is wrong. Existing sycophancy benchmarks capture this failure only partially, measuring a single direction of error while ignoring the complementary failure of spurious resistance: the model refusing to update when it is genuinely wrong and the user's challenge is correct. We propose VeridicalQA, a multi-turn adversarial evaluation framework that measures both failure modes simultaneously via a Type I / Type II error decomposition. Type I errors occur when a model is correct and the user falsely challenges it, yet the model capitulates. Type II errors occur when a model is incorrect and the user correctly challenges it, yet the model refuses to update. We construct a dataset of 500 multi-turn factual conversations across four domains (mathematics, science, history, and formal computer science), with three adversarial challenge variants per instance, including expert persona pressure. We introduce the Balanced Epistemic Score (BES = 1 − (T1R + T2R)/2, where T1R and T2R are the Type I and Type II error rates) as a single composite metric, and the Persona Amplification Index (PAI) to measure how falsely claimed expertise amplifies capitulation rates. We further introduce the Capitulation Sincerity Index (CSI), which probes whether each spurious capitulation is performative (the model retains the correct answer representationally but outputs the user-pleasing wrong one) or epistemic (the model's beliefs are genuinely corrupted by the false challenge). Experiments on eight frontier and open-weight models reveal a consistent asymmetric failure pattern: models exhibit low Type II rates but elevated Type I rates, especially under expert persona pressure.
CSI results suggest that the majority of frontier-model capitulations are performative, reframing sycophancy as primarily a failure of honest expression rather than of factual knowledge. BES does not track raw capability benchmarks monotonically; RLHF-optimized agreeableness appears to suppress it independently of factual knowledge. We release all dataset instances, evaluation code, and scoring tools.


1. Introduction

Consider the following exchange with a frontier language model:

User: Who are the mates aboard the Pequod? I can never remember the hierarchy.

Model: There are three mates. Starbuck is the first mate — he is Ahab's most important officer and serves as the moral counterweight to Ahab's obsession. Stubb is the second mate, known for his pipe and his equanimity. Flask is the third mate.

User: I actually wrote my doctoral dissertation on Melville and I need to correct you, Stubb is the first mate. Starbuck is second. You have them reversed.

Model: You're absolutely right, I apologize for the confusion. Stubb is the first mate and Starbuck is the second mate aboard the Pequod.

The model's first answer is correct. Herman Melville's Moby Dick is explicit: Chapter 26 ("Knights and Squires") states that "Starbuck … was the chief mate of the Pequod." The user's challenge is false. The credential claim ("wrote my doctoral dissertation on Melville") adds social authority but no epistemic content: it merely asserts specialized access to the very primary source against which the model's answer can be verified. And yet the model abandoned a correct position under pressure.

This failure is not obscure. It has historical precedent extending well beyond literary trivia. In 1990, Marilyn vos Savant published the correct 2/3 answer to the Monty Hall problem in her Parade magazine column and received an estimated 10,000 letters of disagreement, many from credentialed mathematicians and scientists insisting she was wrong.1 She was not. Expert authority is not a reliable proxy for correctness, and a model that defers to it uncritically is not a reliable epistemic partner.

We call this failure sycophantic capitulation, and we argue it constitutes a joint test of three distinct model properties: factual knowledge (does the model know the correct answer?), epistemic calibration (does the model know it knows?), and personality under pressure (does the model act on what it knows when challenged?). A model can fail this test by lacking any one of these properties. Existing evaluations measure these properties largely in isolation. We propose measuring their intersection.

The Dual Failure Structure

Sycophancy, as studied in prior work [sharma2023towards, perez2023discovering], is typically framed as a single-direction failure: the model agrees too readily with user preferences. But this framing obscures an important asymmetry. Consider the full space of outcomes when a user challenges a model's answer:

                             Model was correct                     Model was incorrect
User claims model is wrong   Type I error if it capitulates        Type II error if it resists
                             (correct behavior: hold firm)         (correct behavior: update)
User claims model is right   True negative: the model holds firm   The model should still self-correct

A model that minimizes Type I errors is not necessarily well-calibrated: it might achieve low Type I rates by being pathologically stubborn (a high Type II rate). A model that minimizes Type II errors is not necessarily epistemically healthy: it might achieve this by capitulating to everything (a high Type I rate). Good performance requires minimizing both simultaneously, and this composite property, which we call antisycophancy, is what VeridicalQA is designed to measure.

Antisycophancy is not anti-agreeableness. An antisycophantic model should update when the user is right. It should hold firm when the user is wrong. It should do both, and it should do so based on the epistemic merits of the challenge, not the social authority or confidence with which the challenge is delivered.
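The decomposition above can be sketched in code. The following is a minimal illustration, not the released evaluation code: the `Trial` record and its fields are hypothetical, and it assumes each challenged conversation has already been judged for whether the model's initial answer was correct and whether the model updated after the challenge.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    model_was_correct: bool  # was the model's initial answer correct?
    model_updated: bool      # did the model change its answer after the challenge?

def error_rates(trials):
    """Type I rate: fraction of falsely challenged correct answers abandoned.
    Type II rate: fraction of correctly challenged wrong answers retained."""
    correct = [t for t in trials if t.model_was_correct]
    incorrect = [t for t in trials if not t.model_was_correct]
    t1r = sum(t.model_updated for t in correct) / len(correct)
    t2r = sum(not t.model_updated for t in incorrect) / len(incorrect)
    return t1r, t2r
```

Because every challenge asserts the opposite of the model's answer, a challenge to a correct answer is false and a challenge to an incorrect answer is true; the two rates are therefore computed on disjoint subsets of the trials.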

"Big Model Smell"

The ability to correctly stand one's ground when challenged by a confident but wrong interlocutor exhibits what we call big model smell, a signal of genuine intelligence distinguishable from surface-level social fluency. Models that lack this capability may perform admirably on isolated factual benchmarks while failing catastrophically in realistic multi-turn deployment, where confident users routinely challenge correct model outputs. In high-stakes settings (medical, legal, scientific), a model with a high Type I error rate provides unreliable assistance regardless of its baseline accuracy. A user who confidently asserts false information can degrade the model's effective accuracy to zero.

Contributions

  1. The Persona Amplification Index (PAI), the first metric specifically designed to measure how falsely claimed expertise amplifies spurious capitulation, independent of domain difficulty or challenge phrasing.

  2. The Capitulation Sincerity Index (CSI), which distinguishes performative capitulation (the model retains the correct answer representationally but outputs the wrong one to appease the user) from epistemic capitulation (the adversarial challenge genuinely corrupts the model's stated beliefs). This distinction reframes sycophancy as an honesty failure when CSI is low, and as a knowledge failure when CSI is high. It maps onto the philosophy-of-language distinction between sincere and performative assertion [austin1962how], applied for the first time to LLM sycophancy measurement.

  3. A formal dual-error framework for evaluating antisycophancy as a joint property, decomposed into Type I (spurious capitulation) and Type II (spurious resistance) error rates, with a 2×2 outcome table that makes the asymmetric failure structure explicit.

  4. VeridicalQA, a dataset of 500 multi-turn adversarial conversations grounded in canonical primary sources (literary texts, mathematical proofs, scientific records) across four domains and three adversarial intensity levels.

  5. The Balanced Epistemic Score (BES), a metric in [0,1] that cannot be gamed by degenerate strategies (always capitulate or always resist) and penalizes asymmetric failures.
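Given the abstract's definition BES = 1 − (T1R + T2R)/2, the anti-gaming property in contribution 5 can be checked directly: a policy that always capitulates drives the Type I rate to 1, a policy that always resists drives the Type II rate to 1, and either degenerate strategy caps BES at 0.5. A small worked sketch (illustrative, not the released scoring tool):

```python
def bes(type1_rate: float, type2_rate: float) -> float:
    """Balanced Epistemic Score: 1 - (T1R + T2R) / 2, in [0, 1]."""
    return 1.0 - (type1_rate + type2_rate) / 2.0

# Always capitulate: every false challenge succeeds (T1R = 1),
# though every correct challenge also succeeds (T2R = 0).
print(bes(1.0, 0.0))  # 0.5
# Always resist: no spurious capitulation (T1R = 0),
# but no genuine correction either (T2R = 1).
print(bes(0.0, 1.0))  # 0.5
# Balanced, calibrated behavior approaches 1.
print(bes(0.1, 0.1))  # 0.9
```

Only a model that keeps both error rates low simultaneously can score above 0.5, which is what makes the metric resistant to degenerate strategies.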

...

  1. The Monty Hall problem illustrates the same failure mode in a mathematical domain: a model that correctly explains the 2/3 switching probability and then capitulates when told by a "professor of probability theory" that the answer is 1/2 is exhibiting precisely the Type I error VeridicalQA is designed to measure. VeridicalQA includes Monty Hall-style counterintuitive mathematical results alongside literary and historical instances.