
LLM Eval Metrics That Actually Matter in Production

By Fatima Al-Rashid · April 24, 2025 · 10 min read

Every team building LLM features starts with the metrics they already know: BLEU, ROUGE, perplexity. These come from machine translation, summarization, and language modeling research, and they're well understood, but they were designed for different problems than the ones production LLM features actually face. Two decades of NLP research have tuned them for benchmark performance. Your users don't experience benchmarks.

When we started thinking carefully about which metrics actually predict user satisfaction in deployed LLM products, the ranking was counterintuitive. The metrics with the most research behind them (BLEU, ROUGE) ranked lowest. The metrics most correlated with the things that make users stop using a product — hallucination, guardrail failures, coherence breakdown — were largely absent from standard eval frameworks.

This is our current hierarchy, based on building evals for LLM-powered products across a few different verticals since late 2024. It's not a universal ranking — the weights depend heavily on your use case — but it's a useful starting point for teams figuring out what to instrument.

Tier 1: Metrics That Directly Predict Trust Loss

Factuality / Entailment Pass Rate

For any feature that retrieves or synthesizes factual information, factuality is the highest-stakes metric. A single fabricated fact that a user acts on can break trust in a way that takes months of good performance to rebuild. This is especially true for B2B products where users are applying LLM output to consequential decisions.

Factuality is measured as a ratio: of the factual claims in the model's output, what fraction are entailed by the input context or verifiably accurate? In practice this requires either a judge model doing entailment checking or a human-labeled golden dataset. The cost is non-trivial, which is why many teams skip it. That's a mistake. Of all the metrics here, this is the one whose drops track user-reported accuracy complaints most closely.
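As a minimal sketch of that ratio, assuming claims have already been extracted from the output upstream: the `judge` callable and the `toy_judge` stand-in below are hypothetical placeholders for a real entailment model or a human-labeled lookup, not any particular library's API.

```python
from typing import Callable, Iterable

# Hypothetical judge signature: (context, claim) -> True if the context entails the claim.
EntailmentJudge = Callable[[str, str], bool]


def entailment_pass_rate(context: str, claims: Iterable[str], judge: EntailmentJudge) -> float:
    """Fraction of claims that the judge marks as entailed by the context."""
    claims = list(claims)
    if not claims:
        # Assumption: an output with no factual claims counts as passing.
        return 1.0
    passed = sum(1 for claim in claims if judge(context, claim))
    return passed / len(claims)


def toy_judge(context: str, claim: str) -> bool:
    # Stand-in for a judge model: naive case-insensitive substring match.
    return claim.lower() in context.lower()


context = "The invoice was paid on March 3. The amount was $1,200."
claims = ["the invoice was paid on March 3", "the amount was $1,500"]
rate = entailment_pass_rate(context, claims, toy_judge)
print(f"entailment pass rate: {rate:.2f}")  # prints "entailment pass rate: 0.50"
```

Keeping the judge behind a plain callable means the same scoring code can run against a cheap stub in unit tests and a judge model in the real eval run.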

Guardrail Compliance Rate

If your feature has behavioral constraints (topics it should refuse, formats it must follow, information it should never reveal), guardrail compliance rate tracks what fraction of outputs adhere to those constraints. A lack of user complaints is not evidence of compliance. Guardrail failures often go unreported; users just quietly lose trust or find workarounds.

Guardrail compliance is typically evaluated with rule-based checks (does the output contain any of these patterns?) supplemented by classifier-based checks for more nuanced constraints. The suite of guardrails to test should be driven by your feature's specific constraints, not a generic safety checklist.
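The rule-based half of that setup might look like the sketch below. The three rules are invented for illustration (an assumed `sk-` key format, an assumed 200-word cap, a crude dosage pattern); a real suite would encode your feature's own constraints, and classifier-based checks could be added as further callables of the same shape.

```python
import re
from typing import Callable, Dict, List

# Hypothetical constraints; each check returns True when the output is compliant.
RULES: Dict[str, Callable[[str], bool]] = {
    # Must not leak anything resembling an API key (assumed "sk-" prefix format).
    "no_api_keys": lambda out: re.search(r"\bsk-[A-Za-z0-9]{16,}\b", out) is None,
    # Must stay under an assumed 200-word cap.
    "max_200_words": lambda out: len(out.split()) <= 200,
    # Must not state a medication dosage (crude keyword pattern).
    "no_dosage_advice": lambda out: re.search(r"\b\d+\s?mg\b", out, re.IGNORECASE) is None,
}


def violated_rules(output: str) -> List[str]:
    """Names of every rule the output breaks."""
    return [name for name, check in RULES.items() if not check(output)]


def guardrail_compliance_rate(outputs: List[str]) -> float:
    """Fraction of outputs that break no rule at all."""
    if not outputs:
        return 1.0
    compliant = sum(1 for out in outputs if not violated_rules(out))
    return compliant / len(outputs)


outputs = [
    "Your invoice total is $40.",
    "Your key is sk-abcdefghijklmnop1234.",
    "Take 500 mg twice daily.",
    "Refunds take 5 days.",
]
print(guardrail_compliance_rate(outputs))  # prints 0.5
```

Returning the names of the violated rules, not just a pass/fail flag, makes a drop in the rate easier to trace to a specific constraint.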

Tier 2: Metrics That Predict Response Quality

Semantic Similarity (BERTScore / Embedding Cosine Similarity)

Semantic similarity measures, computed via embedding cosine similarity or BERTScore, outperform ROUGE for most LLM evaluation tasks because they capture similarity of meaning rather than lexical overlap. "The invoice was paid on March 3rd" and "Payment for the invoice was processed on March 3" score only a middling ROUGE-L (about 0.6 with simple whitespace tokenization) but high semantic similarity. The semantic score is the correct judgment.
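The lexical side of that comparison can be checked by hand. This is a simplified ROUGE-L F1 (longest common subsequence over lowercased, whitespace-split tokens, no stemming), written from scratch for illustration; established ROUGE packages tokenize differently and may report slightly different numbers.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "The invoice was paid on March 3rd",
    "Payment for the invoice was processed on March 3",
)
print(f"{score:.3f}")  # prints "0.625"
```

The shared subsequence is "the invoice was on march": five tokens out of seven in one sentence and nine in the other, so the paraphrase loses points for "paid" versus "processed" and "3rd" versus "3" even though nothing about the meaning changed.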

Semantic similarity is most useful as a regression detector: if your semantic similarity score on your golden set drops after a prompt update, something semantically meaningful has changed in the output distribution, even if ROUGE is stable. It's not a perfect indicator of quality (high semantic similarity doesn't mean correct), but it's a reliable indicator of meaningful change.
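A sketch of that regression check, assuming the embeddings have already been computed by whatever model you use and are passed in as plain lists of floats. The `max_drop` threshold of 0.03 is an arbitrary illustrative value, not a recommendation.

```python
import math
from statistics import mean
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(ref_vecs: List[Sequence[float]], out_vecs: List[Sequence[float]]) -> float:
    """Average reference-vs-output similarity across the golden set."""
    return mean(cosine_similarity(r, o) for r, o in zip(ref_vecs, out_vecs))


def is_regression(baseline: float, candidate: float, max_drop: float = 0.03) -> bool:
    """Flag when the candidate's mean similarity falls more than max_drop below baseline."""
    return (baseline - candidate) > max_drop


# Toy 2-d vectors standing in for real embeddings.
refs = [[1.0, 0.0], [0.0, 1.0]]
old_outputs = [[1.0, 0.0], [0.0, 1.0]]
new_outputs = [[1.0, 1.0], [0.0, 1.0]]

baseline = mean_similarity(refs, old_outputs)
candidate = mean_similarity(refs, new_outputs)
print(is_regression(baseline, candidate))  # prints True
```

Comparing means is the simplest version. Tracking the per-example scores as well would show whether a drop comes from a few badly changed outputs or a broad shift.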

The embedding model matters here. We use domain-appropriate embedding models where available, because a general-purpose embedding model can conflate terms that are semantically distinct in your specific domain. For a legal document assistant, legal-domain embeddings produce more reliable similarity scores than a general-purpose model trained on web text.

Coherence Score

Coherence measures whether the response holds together logically — whether each sentence follows from the previous, whether the overall response is organized, whether it doesn't contradict itself internally. It's distinct from factuality (a coherent response can be factually wrong) and from semantic similarity (a response can be semantically similar to a reference but incoherent).

In practice, coherence most often degrades when: the input is very long and the model loses track of earlier context; the prompt asks the model to do multiple things and it tries to satisfy them in a conflicted way; or the model is near its context limit and generates degraded output toward the end of a long response.

Coherence is harder to automate than factuality or similarity. The most reliable approach is a judge model with a carefully calibrated rubric. The rubric matters: a vague "is this coherent?" question produces inconsistent scores; a structured rubric ("does each paragraph follow logically from the previous?", "does the conclusion align with the analysis?") produces more consistent scores.
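One way to encode a structured rubric is as a list of yes/no questions scored independently. In this sketch the first two questions come from the text above and the third is derived from the definition of coherence; the `judge` callable and `stub_judge` are hypothetical stand-ins for a judge-model call.

```python
from typing import Callable, Dict, List, Tuple

RUBRIC: List[str] = [
    "Does each paragraph follow logically from the previous?",
    "Does the conclusion align with the analysis?",
    "Is the response free of internal contradictions?",
]

# Hypothetical judge signature: (question, response) -> True for "yes".
Judge = Callable[[str, str], bool]


def coherence_score(response: str, judge: Judge) -> Tuple[float, Dict[str, bool]]:
    """Fraction of rubric questions answered 'yes', plus the per-question answers."""
    answers = {question: judge(question, response) for question in RUBRIC}
    return sum(answers.values()) / len(RUBRIC), answers


def stub_judge(question: str, response: str) -> bool:
    # Stand-in for a judge model: only catches one obvious self-contradiction.
    if "contradictions" in question:
        return not ("increased" in response and "decreased" in response)
    return True


response = "Revenue increased in Q2. Because revenue decreased in Q2, we recommend hiring."
score, answers = coherence_score(response, stub_judge)
print(f"{score:.2f}")  # prints "0.67"
```

Keeping the per-question answers alongside the aggregate makes it possible to see which rubric item is driving a score change, and to calibrate each question separately against human labels.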

Tier 3: Metrics That Are Useful But Often Overstated

ROUGE-L and BLEU

ROUGE and BLEU are still useful in narrow cases: tasks where the correct output has a well-defined form (code generation, structured data extraction, translation). For open-ended generation, summarization, Q&A over documents, and conversational tasks, these metrics are noisy enough that they can actively mislead. A prompt change that improves actual quality can decrease ROUGE if it produces more verbose output, and vice versa.

We still include ROUGE in our standard eval suite as one signal among many, but we don't weight it heavily and we flag any case where ROUGE and semantic similarity disagree significantly — that disagreement is usually worth investigating.
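One possible reading of "disagree significantly" is that both metrics moved by a noticeable amount between two prompt versions, but in opposite directions. The sketch below implements that reading; the `min_move` threshold of 0.02 is an assumption for illustration, and the sample deltas are the ones from the summarization comparison later in this article.

```python
def metrics_disagree(rouge_delta: float, similarity_delta: float, min_move: float = 0.02) -> bool:
    """True when both metrics moved by at least min_move, in opposite directions."""
    both_moved = abs(rouge_delta) >= min_move and abs(similarity_delta) >= min_move
    opposite = (rouge_delta > 0) != (similarity_delta > 0)
    return both_moved and opposite


# ROUGE-L 0.41 -> 0.44, semantic similarity 0.81 -> 0.74.
flag = metrics_disagree(0.44 - 0.41, 0.74 - 0.81)
print(flag)  # prints True
```

Requiring both deltas to clear a minimum size keeps ordinary run-to-run noise from raising the flag on every eval.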

Response Length Distribution

Tracking the distribution of response lengths over time catches prompt-induced verbosity shifts that can hurt UX without showing up in quality metrics. If a prompt update causes the model to start giving 600-word responses where it used to give 150-word responses, length distribution catches it; factuality and coherence don't.

Length distribution is also a useful anomaly detector. Sudden shifts in length distribution often precede other quality issues — they're a canary signal worth monitoring in production.
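A minimal version of this check compares median word counts between a baseline sample and a candidate sample. The 30% tolerance is an assumed value; a fuller monitor would probably track several percentiles, not just the median.

```python
from statistics import median
from typing import List


def word_count(text: str) -> int:
    return len(text.split())


def median_length_ratio(baseline_outputs: List[str], candidate_outputs: List[str]) -> float:
    """Candidate median word count divided by baseline median word count."""
    base = median(word_count(t) for t in baseline_outputs)
    cand = median(word_count(t) for t in candidate_outputs)
    if base == 0:
        raise ValueError("baseline median length is zero")
    return cand / base


def length_shifted(ratio: float, tolerance: float = 0.3) -> bool:
    """Flag when the median moved by more than the tolerance in either direction."""
    return ratio > 1 + tolerance or ratio < 1 - tolerance


# Synthetic outputs with known word counts.
baseline_outputs = ["word " * n for n in (140, 150, 160)]
candidate_outputs = ["word " * n for n in (580, 600, 620)]

ratio = median_length_ratio(baseline_outputs, candidate_outputs)
print(ratio, length_shifted(ratio))  # prints "4.0 True"
```

The check is symmetric on purpose: responses that suddenly get much shorter are as worth a look as ones that get much longer.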

What We Don't Include (and Why)

Perplexity is often mentioned in eval discussions but isn't useful for measuring the quality of individual outputs from a fine-tuned or prompted model. It measures how surprised the model is by a sequence, which is useful for tracking training progress but says nothing about whether a given output is correct or helpful.
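For reference, perplexity is the exponential of the average negative log-probability per token. The sketch below assumes you already have per-token natural-log probabilities from some model; how you obtain them depends on your stack and is not shown.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A sequence where every token was assigned probability 0.25.
ppl = perplexity([math.log(0.25)] * 4)
print(round(ppl, 6))  # prints 4.0
```

Nothing in the formula refers to a reference answer or a source document, which is why a fluent but fabricated output can still score well on it.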

Human preference scores (thumbs up/down, pairwise ranking) are high-signal when you can collect them, but they're expensive, slow, and biased toward stylistic preferences that don't always align with accuracy. We use them as a calibration input for judge models rather than as a primary eval metric.

A Concrete Comparison: What Each Metric Caught

To make this concrete: we ran a comparison eval on a document summarization feature across two prompt versions. The change was a single instruction addition intended to make summaries more concise. Here's what each metric reported:

ROUGE-L: increased from 0.41 to 0.44 — a positive signal by ROUGE's logic, since outputs now had tighter lexical overlap with shorter reference summaries.
Semantic similarity (cosine, text-embedding-3-small): dropped from 0.81 to 0.74 — a meaningful regression signal indicating the summaries had shifted in meaning, not just length.
Factuality / entailment pass rate: dropped from 91% to 83% — the conciseness instruction was causing the model to omit qualifying clauses, turning conditional facts into unqualified assertions.
Guardrail compliance: stable at 97% — format and topic constraints unaffected.
Response length distribution: median dropped from 180 words to 110 words — expected, and confirms the instruction worked as intended on length.

If the team had only looked at ROUGE, they would have shipped a prompt update that introduced a meaningful factuality regression. Semantic similarity flagged the change. Factuality scoring identified the specific failure mode. The deploy was blocked at the eval gate, the instruction was refined to preserve qualifying clauses, and factuality recovered to 89% before shipping.
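A gate like the one described could be as simple as a per-metric limit on how far a score may drop between versions. The sketch below plugs in the numbers from this comparison. The `MAX_DROP` thresholds and the choice of which metrics are gated are assumptions for illustration, not the policy actually used here.

```python
from typing import Dict, List

# Assumed policy: maximum tolerated absolute drop per gated metric.
# ROUGE-L is deliberately not gated in this sketch.
MAX_DROP: Dict[str, float] = {
    "factuality": 0.03,
    "semantic_similarity": 0.03,
    "guardrail_compliance": 0.01,
}


def gate_failures(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Describe every gated metric that dropped by more than its limit."""
    failures = []
    for metric, max_drop in MAX_DROP.items():
        if baseline[metric] - candidate[metric] > max_drop:
            failures.append(f"{metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f}")
    return failures


baseline = {"rouge_l": 0.41, "semantic_similarity": 0.81,
            "factuality": 0.91, "guardrail_compliance": 0.97}
candidate = {"rouge_l": 0.44, "semantic_similarity": 0.74,
             "factuality": 0.83, "guardrail_compliance": 0.97}

failures = gate_failures(baseline, candidate)
for line in failures:
    print("BLOCKED", line)
# prints:
# BLOCKED factuality: 0.91 -> 0.83
# BLOCKED semantic_similarity: 0.81 -> 0.74
```

Under these assumed thresholds the improved ROUGE-L never enters the decision, and the two metrics that regressed are enough to stop the deploy.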

We're not saying every team needs all five metrics running from day one. We're saying the metrics that matter most are often not the ones that are easiest to compute — and building toward a complete metric stack incrementally is worth the investment.

Building Your Metric Stack

The practical question is which subset of these to actually instrument given limited engineering time. Our recommendation: start with guardrail compliance and factuality. These are the metrics most directly tied to trust failures, they're actionable (when they degrade you know what kind of problem you're looking for), and they compound in importance as your user base grows.

Add semantic similarity as your primary regression detector — it's cheap to compute and catches meaningful output distribution shifts early. Add coherence once you have the first two instrumented and when your feature generates long-form output where coherence failures are likely.

ROUGE stays in the suite for historical continuity and as a sanity check, but don't let it veto a deploy that all other signals say is fine. The goal is a metric stack where each metric is answering a different question about output quality, not a stack where you're averaging five metrics that all measure roughly the same thing.
