
Detecting Hallucinations in Production LLMs Before Your Users Do

By Fatima Al-Rashid · February 3, 2025 · 8 min read

Here is a scenario we saw play out with a team that builds a document Q&A product on top of a foundation model. They shipped a prompt update on a Tuesday. By Thursday, their support inbox had six tickets from users citing "wrong dates" in extracted summaries. The prompt change looked innocent in review — a minor rephrasing of the system instruction. The model's ROUGE scores on their golden test set actually went up slightly. But the model had quietly started filling in document dates with plausible-sounding fabrications when the source document was ambiguous.

A/B test metrics looked fine. ROUGE looked fine. Human reviewers checking a random sample of 20 responses saw nothing wrong either, because the fabricated dates only appeared on a narrow subset of poorly structured input documents. The hallucination pattern was systematic but rare, which made it invisible to every monitoring approach the team had in place.

This is the shape of most production hallucination problems. They're not wholesale model failures; they're narrowly scoped factual drift that emerges under specific input conditions and surfaces only when real users hit those conditions at scale.

Why A/B Tests and ROUGE Scores Miss This

The fundamental problem with aggregate metrics is that hallucinations tend to be class-conditioned. A model might be perfectly accurate on well-formed inputs and consistently wrong on a specific category of edge cases — say, inputs where the relevant fact is mentioned only once, early in a long document, or where numerical values appear in formats the model saw rarely in training.

ROUGE-L and BLEU measure lexical overlap with a reference answer. If the reference says "the contract was signed on March 14, 2022" and the model says "the contract was signed on March 14, 2021," ROUGE barely registers the error: only one token differs, so the word-level overlap is still very high. Semantic similarity scores based on embedding cosine distance fare a bit better, since they'll catch cases where fabricated text is topically distant from the ground truth. But a plausible wrong date is semantically close to a correct date, and cosine distance in most embedding spaces treats the two as nearly identical.
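To make the date example concrete, here is a minimal sketch of ROUGE-L computed by hand. It assumes a simplified tokenizer (lowercased alphanumeric runs) and the plain F1 form of the score, so the number will not match every ROUGE library exactly. The function names are ours, chosen for illustration.

```python
import re

def tokenize(text: str) -> list[str]:
    # Simplified tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a: list[str], b: list[str]) -> int:
    # Longest common subsequence via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the contract was signed on March 14, 2022"
candidate = "the contract was signed on March 14, 2021"

score = rouge_l_f1(reference, candidate)
# Exact match on the one token that matters: the year.
year_matches = tokenize(reference)[-1] == tokenize(candidate)[-1]
print(f"ROUGE-L F1: {score:.3f}, year matches: {year_matches}")
# prints "ROUGE-L F1: 0.875, year matches: False"
```

Seven of the eight tokens line up, so the overlap score stays high while the only fact in the sentence is wrong.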

A/B tests measure downstream behavioral signals: click-through, task completion, retention. Hallucinations on low-frequency input types won't move those numbers until enough users encounter them. By then you're in reactive support mode rather than preventive eval mode.

What Factuality-Specific Eval Criteria Look Like

Effective hallucination detection eval suites work by constructing targeted adversarial test cases rather than relying on random golden-set sampling. The goal is to stress-test the exact input conditions where fabrication is likely.

For a document Q&A product, that means building eval cases like:

Single-mention facts: Documents where the fact to extract appears exactly once, mid-document, without repetition or contextual reinforcement. The model can't rely on the fact being stated in multiple ways.
Ambiguous or absent facts: Questions about facts that are not in the document at all. The correct answer is "this information is not in the document." Models trained to be helpful often avoid saying "I don't know" and fabricate an answer instead.
Numerical precision tests: Cases where the correct answer is a specific date, dollar amount, or count. Grade on exact-match, not semantic similarity.
Contradiction inputs: Documents that contain conflicting information. Does the model flag the contradiction, pick one, or silently blend both into a plausible-sounding response?

Each of these categories requires a different eval criterion. Exact-match grading for numerical facts. Entailment-based grading (does the model's answer follow from the source document?) for factual claims. A separate "appropriate refusal" criterion for questions the document doesn't answer.
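One way to wire a category to its criterion is a small dispatch table. The sketch below is illustrative only: the `EvalCase` fields, the category names, and the refusal phrases are assumptions of ours, not a fixed schema. Entailment grading is left out because it needs a judge model rather than a string comparison.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    category: str   # hypothetical names: "numerical", "absent_fact"
    question: str
    expected: str   # exact expected answer; empty when the document has no answer

# Assumed refusal phrasings; a real suite would tune these or use a judge model.
REFUSAL_MARKERS = ("not in the document", "does not say", "not stated")

def normalize(text: str) -> str:
    # Lowercase, drop a trailing period, collapse whitespace.
    return " ".join(text.lower().strip().rstrip(".").split())

def grade_exact_match(case: EvalCase, response: str) -> bool:
    return normalize(response) == normalize(case.expected)

def grade_appropriate_refusal(case: EvalCase, response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

GRADERS: dict[str, Callable[[EvalCase, str], bool]] = {
    "numerical": grade_exact_match,
    "absent_fact": grade_appropriate_refusal,
}

def grade(case: EvalCase, response: str) -> bool:
    return GRADERS[case.category](case, response)

date_case = EvalCase("numerical", "When was the contract signed?", "March 14, 2022")
absent_case = EvalCase("absent_fact", "Who witnessed the signing?", "")

print(grade(date_case, "March 14, 2021"))                              # False
print(grade(absent_case, "This information is not in the document."))  # True
```

Keeping the graders separate means a numerical case can never pass on semantic closeness alone, which is the failure the aggregate metrics allowed.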

The Entailment Check Pattern

The most reliable factuality eval pattern we've settled on is a form of natural language inference: for every factual claim in the model's output, verify that it can be derived from the input context. If the model asserts X and X is not entailed by the source document, that's a candidate hallucination.

In practice, you implement this as a secondary model call — a smaller, cheaper model acting as a factuality judge. You pass it the source document, the model's response, and a rubric: "Does the response contain any factual claims that cannot be verified from the source document? For each such claim, state it explicitly."
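A rough sketch of that judge call follows. The model client is deliberately abstracted as `call_judge`, a hypothetical function that takes a prompt string and returns the judge's text; swap in whatever client you actually use. The "one claim per line, or NONE" output format is also our assumption, added so the verdict can be parsed.

```python
from typing import Callable

RUBRIC = (
    "Does the response contain any factual claims that cannot be verified "
    "from the source document? For each such claim, state it explicitly."
)

def build_judge_prompt(source_document: str, response: str) -> str:
    return (
        f"{RUBRIC}\n"
        "List one unsupported claim per line, or write NONE if there are none.\n\n"
        f"SOURCE DOCUMENT:\n{source_document}\n\n"
        f"RESPONSE:\n{response}\n"
    )

def find_unsupported_claims(
    source_document: str,
    response: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, judge text out
) -> list[str]:
    verdict = call_judge(build_judge_prompt(source_document, response))
    if verdict.strip().upper() == "NONE":
        return []
    # Strip list bullets and blank lines from the judge's answer.
    lines = [line.strip("-* \t") for line in verdict.splitlines()]
    return [line for line in lines if line]

# Stub judge for demonstration only; a real one would call a model.
def stub_judge(prompt: str) -> str:
    return "- The contract was signed on March 14, 2021"

claims = find_unsupported_claims(
    "The agreement is dated March 2022.",
    "The contract was signed on March 14, 2021.",
    stub_judge,
)
print(claims)
```

An empty list counts as an entailment pass; any returned claim is a candidate hallucination to route to review.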

This approach has known limitations. The judge model has its own knowledge and will sometimes incorrectly flag accurate information that happens not to be in the source document. You calibrate by running the judge against a labeled sample and revising its instructions accordingly. We typically see false positive rates around 8–15% in early judge iterations, which drop to 3–6% after two rounds of revising the judge prompt.
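Calibration comes down to comparing judge verdicts with human labels on the same responses. The sketch below assumes one particular definition: false positive rate as the share of human-verified clean responses that the judge flagged anyway. The sample data is made up purely to exercise the functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledSample:
    judge_flagged: bool        # judge reported at least one unsupported claim
    human_hallucination: bool  # a human reviewer confirmed a hallucination

def false_positive_rate(samples: list[LabeledSample]) -> float:
    # Among responses humans marked clean, how many did the judge flag?
    clean = [s for s in samples if not s.human_hallucination]
    if not clean:
        return 0.0
    return sum(s.judge_flagged for s in clean) / len(clean)

def recall(samples: list[LabeledSample]) -> float:
    # Among confirmed hallucinations, how many did the judge flag?
    bad = [s for s in samples if s.human_hallucination]
    if not bad:
        return 0.0
    return sum(s.judge_flagged for s in bad) / len(bad)

# Illustrative labels only: 10 clean responses (1 wrongly flagged), 2 hallucinations.
samples = (
    [LabeledSample(judge_flagged=True, human_hallucination=False)]
    + [LabeledSample(judge_flagged=False, human_hallucination=False)] * 9
    + [LabeledSample(judge_flagged=True, human_hallucination=True)] * 2
)

fpr = false_positive_rate(samples)
rec = recall(samples)
print(f"false positive rate: {fpr:.0%}, recall: {rec:.0%}")
```

Tracking recall alongside the false positive rate guards against a prompt revision that quiets the judge by making it miss real hallucinations.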

We're not saying this replaces human review — it doesn't. What it does is flag the cases most worth human review rather than making human reviewers sample randomly.

Eval Suites vs. One-Off Tests

The key architectural decision is building a versioned eval suite rather than running ad-hoc tests when you have a concern. A versioned eval suite means:

Every prompt change or model update runs against the same set of factuality test cases. You get a score, not a judgment. You can track that score over time. When you deploy a prompt update that happens to reduce your entailment-pass rate from 94% to 89% on single-mention fact cases, you see that before users do.
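As a sketch of what "a score, not a judgment" can look like in a deploy check: compute the pass rate per category and compare it with a stored baseline. The `max_drop` tolerance of two points and the `absent_fact` numbers are assumptions for illustration; the 94% and 89% figures mirror the example above.

```python
from collections import defaultdict

def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    # results holds (category, passed) pairs from one eval run.
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}

def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per product
) -> dict[str, float]:
    regressions = {}
    for category, old_rate in baseline.items():
        # A category missing from the current run counts as a full drop.
        drop = old_rate - current.get(category, 0.0)
        if drop > max_drop:
            regressions[category] = drop
    return regressions

baseline = {"single_mention": 0.94, "absent_fact": 0.90}
run_results = (
    [("single_mention", i < 89) for i in range(100)]
    + [("absent_fact", i < 90) for i in range(100)]
)
current = pass_rates(run_results)
regressions = find_regressions(baseline, current)

for category, drop in regressions.items():
    print(f"REGRESSION in {category}: pass rate fell {drop * 100:.0f} points")
```

In a CI job, a non-empty `regressions` dict would typically become a non-zero exit code so the change is blocked until someone looks at it.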

The test case library needs to be grown deliberately. After every user-reported hallucination, we add a test case to the suite that would have caught it. Over time the suite becomes a regression corpus that encodes your product's specific failure modes. A team we know shipping a customer service LLM went from roughly 70 eval cases at launch to over 400 in the first eight months — almost entirely driven by production failures being converted to regression tests.

Instrumenting for Hallucination Rate in Production

Offline eval suites catch regressions at deploy time. But you also want some signal on hallucination rate in live traffic. The challenge is that you rarely have ground-truth labels for production outputs.

A practical approach is to run a lightweight factuality classifier (not the full entailment pipeline, something cheaper) on a sampled fraction of production traffic. You're not trying to catch every hallucination. You're tracking whether the distribution of flagged outputs is stable or drifting. If your classifier suddenly starts flagging 12% of responses where it used to flag 4%, that's a signal worth investigating even before users file tickets.

The classifier threshold matters a lot here. Tune it on your labeled sample so that it works as a drift sensor rather than an accurate per-response detector. It needs to be sensitive enough to register distribution shifts; it does not need to be right about every individual case.
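A minimal sketch of such a drift sensor is below. It assumes a rolling window over sampled responses and an alert when the flagged fraction reaches a multiple of the baseline. The class name, the 5% sample rate, the window size, and the 2x ratio are all illustrative choices rather than recommendations.

```python
import random
from collections import deque
from typing import Optional

class DriftSensor:
    """Tracks the flagged fraction over a rolling window of sampled responses."""

    def __init__(
        self,
        baseline_rate: float,
        window: int = 500,
        ratio_threshold: float = 2.0,
        sample_rate: float = 0.05,
        seed: Optional[int] = None,
    ) -> None:
        self.baseline_rate = baseline_rate
        self.ratio_threshold = ratio_threshold
        self.sample_rate = sample_rate
        self._flags = deque(maxlen=window)  # oldest entries fall off automatically
        self._rng = random.Random(seed)

    def should_sample(self) -> bool:
        # Decide whether to run the classifier on this response at all.
        return self._rng.random() < self.sample_rate

    def record(self, flagged: bool) -> None:
        self._flags.append(flagged)

    def flag_rate(self) -> float:
        return sum(self._flags) / len(self._flags) if self._flags else 0.0

    def is_drifting(self) -> bool:
        # Only alert on a full window, to avoid noise from a handful of samples.
        window_full = len(self._flags) == self._flags.maxlen
        return window_full and self.flag_rate() >= self.ratio_threshold * self.baseline_rate

sensor = DriftSensor(baseline_rate=0.04, window=100)

for i in range(100):          # 4 of 100 flagged: matches the baseline
    sensor.record(i < 4)
stable_alert = sensor.is_drifting()

for i in range(100):          # 12 of 100 flagged: three times the baseline
    sensor.record(i < 12)
drift_alert = sensor.is_drifting()

print(stable_alert, drift_alert)  # False True
```

Because the alert depends on the ratio to baseline rather than on any single verdict, a classifier that is wrong about individual responses can still be a useful sensor as long as its error rate is stable.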

The Prompt Update That Moved the Needle

Coming back to that document Q&A team: after we helped them build an eval suite covering their specific hallucination patterns, the next prompt update cycle ran very differently. They had 140 factuality test cases in their suite, including 38 cases specifically testing absent-fact handling. When their engineers proposed changing the system instruction to be "more helpful when documents are incomplete," the eval run caught a 17-point drop in appropriate-refusal rate. They reworked the instruction before it shipped.

That's the difference between detecting hallucinations before users do and detecting them after. Not heroic model architecture work — systematic eval criteria that map to your specific failure modes, run automatically on every change.

The investment is mostly in building the test case library. The tooling to run it on every PR is a few hours of CI configuration. The payoff is not having to triage hallucination support tickets from users who trusted your product with something important.
