Using LLM-as-Judge: When It Works and When It Doesn't
LLM-as-judge has become the default approach for automated eval of open-ended LLM outputs. The appeal is obvious: if you want to evaluate whether a response is helpful, coherent, and accurate, using another LLM to make that judgment feels more natural than writing a deterministic scorer. And for many criteria, it produces results that correlate reasonably well with human judgment.
But LLM-as-judge has failure modes that deterministic evaluation doesn't, and using it without understanding those failure modes leads to eval infrastructure that gives you false confidence. We use LLM-as-judge extensively in Fyntune's built-in criteria — and we've also seen how it breaks in ways that are hard to notice from inside the eval loop.
Where LLM-as-judge works well
The natural home for LLM-as-judge is evaluating criteria that have no single correct answer and require genuine language understanding to assess. Three examples:
Coherence and fluency. Does this response read naturally? Is the argument structure clear? Does it maintain a consistent voice? These are things that humans assess holistically and that rule-based metrics approximate poorly. LLM judges handle these well because they're essentially evaluating what they were trained to produce.
Relevance to a specific query. Given this user question, does the response address what was actually asked? This is harder than it sounds — an answer can be factually correct and well-written while completely missing the point of the question. LLM judges reliably detect this kind of mismatch.
Tone and persona adherence. For features where the model is supposed to maintain a specific persona, communication style, or brand voice, LLM judges work well for rubric-based assessment. "Does this response sound like a professional support agent for a B2B SaaS product" is a judgment that deterministic scoring can't touch.
Where LLM-as-judge breaks down
Self-preference bias. When your judge comes from the same model family that generated the outputs being evaluated, the judge tends to rate that family's outputs higher. This is a calibration problem: you may be measuring stylistic similarity to the judge rather than quality. It shows up when you switch generator models. Your scores suddenly drop, and you can't tell whether the new model is actually worse or just stylistically different from the judge model.
Mitigation: use a judge from a different model family than your generator. If you're generating with GPT-4o, judge with Claude, or vice versa. This doesn't eliminate the problem, but it substantially reduces it.
Verbosity bias. LLM judges, unless explicitly instructed otherwise, tend to score longer responses higher. This is documented in the academic literature on LLM evaluation and shows up consistently in practice. A verbose, somewhat rambling response that hits all the keywords will often score higher than a concise, precise response that says less but says it better.
Mitigation: anchor your judge prompts explicitly against verbosity. Something like "A shorter response that fully addresses the criteria should be scored equally or higher than a longer response that says the same thing with more padding." Run a calibration check — compare judge scores on response pairs where you've deliberately varied length while keeping content identical.
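The calibration check can be as small as the sketch below. It assumes you already have judge scores for pairs of responses whose content is identical and whose length you varied on purpose; the `verbosity_gap` name and the sample scores are hypothetical.

```python
from statistics import mean


def verbosity_gap(pairs):
    """Summarize length bias from (short_score, long_score) pairs.

    Each pair holds the judge's scores for two responses with the same
    content, where the second response is the padded, longer version.
    Returns (mean score gain for the long version,
             fraction of pairs where the long version scored strictly higher).
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    deltas = [long_score - short_score for short_score, long_score in pairs]
    return mean(deltas), sum(d > 0 for d in deltas) / len(deltas)


# Hypothetical judge scores on a 1-5 scale.
scored_pairs = [(4, 5), (3, 4), (4, 4), (5, 5)]
gap, fraction_longer_wins = verbosity_gap(scored_pairs)
print(f"mean gain for longer version: {gap:+.2f}")
print(f"longer version scored higher in {fraction_longer_wins:.0%} of pairs")
```

With content held constant, an unbiased judge should give a gap near zero. A consistently positive gap suggests the prompt still rewards padding.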
Factuality — the critical failure mode. LLM-as-judge is unreliable for factuality evaluation unless the judge has access to a ground truth source. A judge model that doesn't know the correct answer to a factual question will often score a confident, fluent incorrect response highly. The fluency and confidence signals override the factual signal.
This is the most dangerous failure mode because factuality is often exactly what you need to evaluate. The fix: for factuality criteria, provide the judge with the correct answer or source document and ask it to evaluate groundedness (does the response stay within this source?) rather than accuracy in the abstract. Groundedness is checkable; free-floating factuality against an absent ground truth is not.
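As an illustration of the groundedness framing, here is one way a judge prompt could be assembled. The template wording, the `VERDICT:` output convention, and the function name are assumptions made for this sketch, not a prescribed format.

```python
# Illustrative template: the wording and the VERDICT convention are assumptions.
GROUNDEDNESS_TEMPLATE = """You are evaluating groundedness, not general accuracy.

SOURCE:
{source}

RESPONSE:
{response}

List each factual claim in the RESPONSE and state whether the SOURCE supports it.
Do not use outside knowledge. A claim that is absent from the SOURCE counts as
unsupported, even if you believe it is true.
Finish with one line: VERDICT: grounded or VERDICT: ungrounded"""


def build_groundedness_prompt(source: str, response: str) -> str:
    """Build a judge prompt that always carries the ground-truth source."""
    if not source.strip():
        # Without a source the judge falls back to guessing at factuality,
        # which is the failure mode this prompt is meant to avoid.
        raise ValueError("groundedness evaluation needs a non-empty source")
    return GROUNDEDNESS_TEMPLATE.format(
        source=source.strip(), response=response.strip()
    )


prompt = build_groundedness_prompt(
    source="The refund window is 30 days from the purchase date.",
    response="You can request a refund within 30 days of purchase.",
)
print(prompt)
```

Refusing to build the prompt when the source is missing keeps a factuality criterion from silently degrading into a fluency check.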
Position bias in pairwise comparisons. When you ask an LLM judge to compare two responses A and B, its preference often depends on presentation order, most commonly favoring whichever response appears first. The effect is consistent enough that reversing the order (B then A) changes the preferred response on a meaningful fraction of pairs. For pairwise comparison evals, always run both orderings and look for cases where the judge contradicts itself. Those are the least reliable comparison points.
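A sketch of the both-orderings check follows. It assumes a hypothetical `judge(first, second)` callable that returns `"first"` or `"second"`; the two stub judges stand in for real model calls so the example runs on its own.

```python
def check_position_consistency(judge, pairs):
    """Run each (a, b) pair in both orders and separate stable verdicts.

    `judge(first, second)` is assumed to return "first" or "second".
    Returns (stable, unstable): stable holds (pair_index, winner_label)
    with winner_label "A" or "B"; unstable holds the indices of pairs
    where the winner changed when the order was swapped.
    """
    stable, unstable = [], []
    for idx, (a, b) in enumerate(pairs):
        forward = "A" if judge(a, b) == "first" else "B"  # A shown first
        reverse = "B" if judge(b, a) == "first" else "A"  # B shown first
        if forward == reverse:
            stable.append((idx, forward))
        else:
            unstable.append(idx)
    return stable, unstable


# Stub judges for illustration only.
def always_first(first, second):
    return "first"  # a maximally position-biased judge


def longer_wins(first, second):
    return "first" if len(first) >= len(second) else "second"


pairs = [("short answer", "a much longer answer"), ("detailed reply here", "ok")]
print(check_position_consistency(always_first, pairs))
print(check_position_consistency(longer_wins, pairs))
```

The share of unstable pairs is a useful number to track on its own. One reasonable design choice is to count unstable pairs as ties rather than letting either ordering decide.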
Calibrating LLM-as-judge before you trust it
Before using LLM-as-judge in a regression testing context, you should calibrate it against human-labeled samples. The process is straightforward but gets skipped more often than it should:
- Have humans score a set of 50-100 outputs on each criterion you plan to use LLM-as-judge for.
- Run your LLM judge on the same set.
- Compute agreement rate and rank correlation between human scores and judge scores.
- Look at the systematic biases: does the judge consistently score higher than humans? Lower? Does it disagree on specific input types?
- Adjust your judge prompt to reduce systematic biases, then re-calibrate.
For a new criterion, we generally aim for Spearman rank correlation above 0.7 before using it in regression gates. Below that, you're measuring something, but you can't be confident you're measuring what you think you are.
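The agreement and rank-correlation step can be computed with the standard library alone. The sketch below assumes integer scores on a shared scale; the function names and the sample scores are hypothetical, and the 0.7 threshold is the target mentioned above.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / sqrt(vx * vy)


def calibration_report(human, judge, threshold=0.7):
    """Compare judge scores to human scores on the same outputs."""
    n = len(human)
    rho = spearman(human, judge)
    return {
        "spearman": rho,
        "exact_agreement": sum(h == j for h, j in zip(human, judge)) / n,
        "mean_bias": sum(j - h for h, j in zip(human, judge)) / n,  # >0: judge scores higher
        "passes_gate": rho > threshold,
    }


# Hypothetical scores for five outputs; a real set would have 50-100.
report = calibration_report(human=[1, 2, 3, 4, 5], judge=[2, 3, 4, 5, 5])
print(report)
```

Note that a judge can rank outputs almost exactly as humans do while still scoring nearly every item higher, which is why the report keeps `mean_bias` and `exact_agreement` alongside the correlation.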
Calibration is not a one-time step. Judge models get updated, generator models change, and your input distribution shifts as your product grows. A quarterly re-calibration against a fresh set of human labels catches judge drift before it silently corrupts your eval signals.
Combining LLM-as-judge with deterministic criteria
The most reliable eval suites don't rely exclusively on LLM-as-judge. They use a layered approach:
Deterministic checks first. Format compliance, schema validation, required field presence, string constraint checks — anything you can check with a rule, check with a rule. These are cheap, fast, and perfectly reliable. LLM-as-judge should never be used for things you can check deterministically.
Classifier-based checks second. For criteria like topic adherence or guardrail compliance, fine-tuned classifiers often outperform general LLM judges. They're cheaper to run, more consistent, and easier to calibrate. If you have labeled data for a specific classification problem, train a classifier before defaulting to LLM-as-judge.
LLM-as-judge for the residual. After deterministic and classifier-based checks, use LLM-as-judge for criteria that genuinely require language understanding to assess — coherence, tone, relevance, helpfulness. These are where LLM judges add real value over any other automated approach.
This isn't a knock on LLM-as-judge as a technique. It's a recognition that using a powerful tool for tasks that don't require it introduces unnecessary noise and cost. Reserve LLM judgment for genuinely judgment-requiring criteria.
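The layering can be expressed as a short-circuiting pipeline. This is a minimal sketch under stated assumptions: the output is expected to be a JSON object, and `classifier` and `judge` are hypothetical callables standing in for a trained classifier and an LLM judge call.

```python
import json


def evaluate(output, required_fields, classifier=None, judge=None):
    """Run cheap checks first and only reach the judge if they pass."""
    # Layer 1: deterministic checks (format and required fields).
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "failed_at": "deterministic", "reason": "invalid JSON"}
    if not isinstance(data, dict):
        return {"passed": False, "failed_at": "deterministic", "reason": "not a JSON object"}
    missing = [field for field in required_fields if field not in data]
    if missing:
        return {"passed": False, "failed_at": "deterministic", "reason": f"missing {missing}"}

    # Layer 2: classifier-based check (hypothetical callable returning bool).
    if classifier is not None and not classifier(data):
        return {"passed": False, "failed_at": "classifier", "reason": "classifier rejected"}

    # Layer 3: LLM-as-judge for what the cheaper layers cannot assess.
    score = judge(data) if judge is not None else None
    return {"passed": True, "failed_at": None, "judge_score": score}


# Stubs for illustration; a real judge would call a model.
def on_topic(data):
    return "refund" in data["answer"].lower()


def stub_judge(data):
    return 4


print(evaluate("not json", ["answer"], on_topic, stub_judge))
print(evaluate('{"answer": "Refunds take 5 days."}', ["answer"], on_topic, stub_judge))
```

Because each layer returns early, the judge is never called on outputs that a rule or classifier has already rejected, which keeps both cost and noise down.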
Practical judge prompt structure
Judge prompt design matters more than most teams expect. A poorly structured judge prompt produces inconsistent scores that don't correlate with quality. A few structural choices that consistently improve judge reliability:
- Provide explicit scoring rubrics for each score level, not just a label. "5 = excellent" is less useful than "5 = the response fully addresses the query, contains no factual errors, and uses appropriate detail for the user's apparent level of expertise."
- Ask for a chain-of-thought explanation before the score. Forcing the judge to explain its reasoning before committing to a number improves score calibration and makes disagreements interpretable.
- Explicitly instruct the judge on known biases. "Longer responses are not better. A concise accurate response should score equally to a verbose accurate response."
- Return a numeric score, not a categorical label. Categorical labels ("poor", "good", "excellent") collapse variation that numeric scores preserve, making delta analysis harder.
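Put together, these choices might look like the sketch below. The template wording, the `SCORE:` output convention, and every rubric description other than the level-5 text quoted above are assumptions made for illustration.

```python
import re

# Illustrative template: reasoning first, then a numeric score on the last line.
JUDGE_PROMPT = """Evaluate the RESPONSE to the QUERY using this rubric:
{rubric}

Longer responses are not better. A concise accurate response should score
equally to a verbose accurate response.

QUERY:
{query}

RESPONSE:
{response}

First explain your reasoning step by step.
Then, on the final line, write exactly: SCORE: <integer>"""


def format_rubric(levels):
    """Render {score: description} as one rubric line per score level."""
    return "\n".join(f"{score} = {desc}" for score, desc in sorted(levels.items()))


def parse_score(judge_output, lo=1, hi=5):
    """Return the last in-range 'SCORE: n' value, or None if absent or out of range."""
    matches = re.findall(r"^SCORE:\s*(\d+)\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if lo <= score <= hi else None


rubric = format_rubric({
    5: "the response fully addresses the query, contains no factual errors, "
       "and uses appropriate detail for the user's apparent level of expertise",
    3: "the response addresses the query but omits important detail",  # hypothetical
    1: "the response does not address the query",  # hypothetical
})
prompt = JUDGE_PROMPT.format(rubric=rubric, query="How do I reset my password?",
                             response="Go to Settings, then Security, then Reset.")
print(parse_score("The steps are correct and concise.\nSCORE: 4"))
```

Returning `None` for unparseable or out-of-range output, instead of defaulting to a score, lets malformed judge responses be counted and retried rather than silently averaged in.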
We're not saying LLM-as-judge is unreliable and should be avoided. It's one of the better tools we have for automated open-ended evaluation. But its failure modes are real and consistently underestimated. Understanding them (self-preference, verbosity bias, factuality blindness, position bias) and building mitigations into your judge configuration is what separates eval infrastructure you can trust from eval infrastructure that makes you feel like you're measuring quality when you aren't.