Semantic Similarity vs Exact Match: Which Eval Approach Fits Your Use Case
The choice between exact-match grading and semantic similarity scoring is one of the foundational decisions in LLM evaluation design, and it's frequently made by default rather than deliberately. Teams reach for exact match because it's deterministic and requires no additional model calls. Teams reach for semantic similarity because it feels more sophisticated and handles paraphrase cases gracefully. Both instincts can lead to the wrong choice for a given task.
The correct answer depends on what your output actually needs to be: precisely this, or meaningfully like this. Those are different requirements, they show up in different feature types, and conflating them produces eval suites that miss the regressions that matter while flagging changes that don't.
When Exact Match Is the Right Tool
Exact match is appropriate whenever the correct answer has a single valid form — or when different forms, even if semantically similar, represent meaningfully different outputs for your use case.
The clearest case is structured output extraction. If your feature extracts invoice numbers from documents, "INV-2024-08831" and "Invoice 2024-08831" might be semantically very similar (both refer to the same invoice), but only one of them is correct for your downstream system that expects a specific format. Semantic similarity would score both as nearly identical. Exact match catches the format deviation.
Similarly: dates, amounts, identifiers, codes, classification labels. When the output is a fact with a canonical form, exact match is not only appropriate but necessary. Semantic similarity would mislead you — it treats "March 3, 2025" and "March 8, 2025" as nearly identical, when those are factually distinct dates that could mean completely different things for a user acting on the output.
Numerical outputs deserve particular mention. In a claim summarization feature that extracts coverage limits, "up to $500,000" and "up to $50,000" can score a cosine similarity as high as 0.97, because the two strings are almost lexically identical and share the same semantic context. Exact match correctly treats them as different. For anything where precision matters, exact match should be in your eval suite regardless of what other metrics you're running.
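A minimal sketch of a field-level exact-match check for extracted values. The field names, the sample values, and the choice to ignore only surrounding whitespace are illustrative assumptions, not a prescribed schema.

```python
from typing import Dict, List


def exact_match_fields(predicted: Dict[str, str], reference: Dict[str, str]) -> List[str]:
    """Return the names of fields whose predicted value differs from the reference.

    Only surrounding whitespace is ignored; any other difference counts as a miss.
    """
    mismatches = []
    for field, expected in reference.items():
        actual = predicted.get(field)
        if actual is None or actual.strip() != expected.strip():
            mismatches.append(field)
    return mismatches


# Hypothetical reference and model output for a single document.
reference = {"invoice_number": "INV-2024-08831", "coverage_limit": "$500,000"}
predicted = {"invoice_number": "INV-2024-08831", "coverage_limit": "$50,000"}

failed_fields = exact_match_fields(predicted, reference)
print(failed_fields)  # ['coverage_limit']
```

Reporting the failing field names, rather than a single pass/fail bit, makes it easier to see which extraction regressed.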
Where Exact Match Breaks Down
Exact match fails in two ways: false negatives (penalizing correct outputs that differ in form from the reference) and false positives (accepting near-miss outputs that happen to share surface form with the reference).
False negatives are the more common failure in LLM eval. Suppose your reference answer is "The deductible is waived for preventive care visits." A valid model output might be "Preventive care visits are covered with no deductible." These convey the same information. Exact match scores the output as a complete miss (0), even though the two sentences share several content words. Even ROUGE-L, which is more lenient, struggles here because the sentence structure and word order are substantially different. The output is correct, the eval says it's wrong, and your eval scores become disconnected from actual quality.
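To make the example concrete, here is a simplified ROUGE-L sketch: an F1 score over the longest common subsequence of word tokens. The tokenizer (lowercase, punctuation dropped) and the equal weighting of precision and recall are assumptions; published ROUGE implementations differ in details such as stemming.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The deductible is waived for preventive care visits."
output = "Preventive care visits are covered with no deductible."

print(output == reference)            # False
print(rouge_l_f1(output, reference))  # 0.375
```

The only in-order overlap is "preventive care visits" (3 of 8 tokens on each side), so a correct paraphrase lands well below any pass threshold a team would plausibly set.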
This is how exact match can mask real quality improvements. You rephrase your system instruction to produce clearer explanations of policy terms. The model starts generating more natural, user-readable summaries. Exact match scores drop because the outputs no longer match your reference answers word-for-word. You see a score decrease and roll back the change — having just rejected an improvement.
False positives are rarer but worth knowing about. Exact match on short outputs can be gamed by repetition. If your reference answer is "Yes" and your model always outputs "Yes" regardless of the question, you'll score 100% on those cases. This sounds contrived, but it's a real problem when your eval set has a strong label imbalance — the model learns to output the majority class and exact match doesn't catch it.
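One way to surface this failure is to compare exact-match accuracy against a constant predictor that always emits the most common reference label. The sketch below assumes a hypothetical eval set with an 80/20 label split.

```python
from collections import Counter
from typing import List, Tuple


def exact_match_accuracy(outputs: List[str], references: List[str]) -> float:
    """Share of outputs that equal their reference exactly."""
    hits = sum(o == r for o, r in zip(outputs, references))
    return hits / len(references)


def majority_baseline(references: List[str]) -> Tuple[str, float]:
    """Most common reference label and the accuracy of always predicting it."""
    label, count = Counter(references).most_common(1)[0]
    return label, count / len(references)


# Hypothetical imbalanced eval set: 8 "Yes" cases, 2 "No" cases.
references = ["Yes"] * 8 + ["No"] * 2
# A degenerate model that answers "Yes" to everything.
outputs = ["Yes"] * 10

accuracy = exact_match_accuracy(outputs, references)
label, baseline = majority_baseline(references)

print(accuracy)  # 0.8
print(baseline)  # 0.8
if accuracy <= baseline:
    print(f"No better than always answering {label!r}")
```

If the model's score does not clear the baseline, the headline accuracy says nothing about whether the model is reading the question.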
Semantic Similarity: What It Measures and What It Misses
Semantic similarity scoring, typically cosine similarity between embeddings, captures whether two texts occupy similar regions of a semantic vector space. General-purpose models like text-embedding-3-small (OpenAI) or text-embedding-004 (Google), and domain-specific models like LEGAL-BERT or SciBERT, map text to dense vectors, and the cosine similarity between two vectors is a proxy for semantic relatedness.
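For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is one minus that value. A minimal sketch in plain Python follows; the toy vectors stand in for real embeddings, which would come from whichever embedding model you use.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors standing in for embeddings.
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Higher similarity means closer in meaning, so a pass threshold is a lower bound on similarity (equivalently, an upper bound on distance).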
This handles paraphrase cases well. It handles synonym substitution, sentence reordering, and changes in syntactic structure that preserve meaning. For open-ended generation tasks — summaries, explanations, analysis, conversational responses — semantic similarity is almost always a better signal than exact match because there are many valid outputs and the goal is meaning preservation, not form preservation.
The failure modes of semantic similarity are meaningful and worth understanding in detail.
False precision on near-antonyms. Embedding spaces encode semantic context, not logical polarity. "The policy covers dental care" and "The policy does not cover dental care" have high cosine similarity because they share nearly identical context words. A hallucination that negates a factual claim may score 0.92 semantic similarity with the correct reference. For factuality evaluation, semantic similarity is therefore not a substitute for entailment-based checking.
Domain sensitivity. General-purpose embedding models are trained on web text. When your LLM feature operates in a specialized domain — regulatory compliance, clinical documentation, financial contracts — the embedding space may not reliably distinguish between concepts that are semantically distant in that domain but share surface-form vocabulary. We've seen cases where "material adverse change" and "material change" scored 0.96 cosine similarity in a general embedding model while having substantially different legal significance. Domain-specific embedding models help, but they're not available for every domain.
Length bias. Short texts often have noisier similarity scores than long texts because there's less signal to embed. A three-word output and a three-word reference might score anywhere from 0.4 to 0.95 cosine similarity depending on the specific words, even when one is clearly correct and the other is clearly wrong. If your feature has short expected outputs, treat semantic similarity scores on those cases with extra skepticism.
A Decision Framework
We've settled on a heuristic for choosing between exact match, semantic similarity, or both:
Ask: does form matter as much as meaning? If the answer is yes — the output must have this structure, this value, this format — use exact match as the primary criterion. Semantic similarity can still run as a secondary signal, but exact match is your gate.
If form doesn't matter but meaning does — the output should convey this information, in any reasonable phrasing — use semantic similarity with a threshold calibrated against your labeled sample. Don't use exact match as the primary gate; it will mislead you.
If both form and meaning matter — you need the right information expressed in a specific structure — run both, and flag outputs that fail either. This is the typical configuration for structured document extraction where outputs need to be both accurate and parseable.
There's a fourth category: open-ended outputs where neither exact match nor semantic similarity is sufficient. Evaluating the quality of a multi-paragraph analysis, a nuanced explanation, or a creative output requires a judge model or human review. Semantic similarity can still serve as a regression detector in these cases, since a meaningful drop in similarity often signals that something changed, but it's not a quality grade on its own.
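The first three cases of the heuristic can be sketched as a single grading function. Everything named here is hypothetical: the `Mode` enum, the `Grade` record, and especially `toy_similarity`, a word-overlap (Jaccard) stand-in used only so the example runs without an embedding model. In practice `similarity_fn` would wrap your embedding model and the threshold would come from calibration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    EXACT = "exact"        # form matters: exact match is the gate
    SEMANTIC = "semantic"  # meaning matters, form does not
    BOTH = "both"          # flag outputs that fail either check


@dataclass
class Grade:
    passed: bool
    exact: bool
    similarity: float


def grade(output: str, reference: str, mode: Mode,
          similarity_fn: Callable[[str, str], float], threshold: float) -> Grade:
    """Apply the gate for the chosen mode; always record both signals."""
    exact = output.strip() == reference.strip()
    similarity = similarity_fn(output, reference)
    similar = similarity >= threshold
    if mode is Mode.EXACT:
        passed = exact
    elif mode is Mode.SEMANTIC:
        passed = similar
    else:
        passed = exact and similar
    return Grade(passed, exact, similarity)


def toy_similarity(a: str, b: str) -> float:
    """Placeholder only: Jaccard overlap of lowercase words, NOT an embedding score."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


ref = "no deductible for preventive care visits"
out = "preventive care visits have no deductible"

print(grade(out, ref, Mode.SEMANTIC, toy_similarity, threshold=0.7).passed)  # True
print(grade(out, ref, Mode.BOTH, toy_similarity, threshold=0.7).passed)      # False
print(grade("Invoice 2024-08831", "INV-2024-08831",
            Mode.EXACT, toy_similarity, threshold=0.7).passed)               # False
```

Recording both signals on every case, whichever one gates, keeps the secondary metric available for regression tracking.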
Calibrating Against Ground Truth
Whatever scoring method you choose, the calibration step is non-negotiable. Before relying on any eval metric as a CI gate, run it against a sample of outputs you've labeled by hand and measure its error rate in both directions. How often does it score a correct output as failing? How often does it score a failing output as passing?
A semantic similarity threshold of 0.85 might work well for one task and produce 20% false positives on another. Thresholds are not transferable across tasks, so each one has to be calibrated on the task it will gate. That per-task calibration is the work that makes eval suites accurate rather than just automated.
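A minimal sketch of that calibration step, assuming you already have a hand-labeled sample of (similarity score, is-correct) pairs. The scores below are made up for illustration. The function reports the error rate in both directions at a given threshold.

```python
from typing import List, Tuple


def error_rates(labeled: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (false_fail_rate, false_pass_rate) at the given threshold.

    false_fail_rate: share of hand-labeled correct outputs scoring below the threshold.
    false_pass_rate: share of hand-labeled incorrect outputs scoring at or above it.
    """
    correct = [score for score, ok in labeled if ok]
    incorrect = [score for score, ok in labeled if not ok]
    false_fail = sum(s < threshold for s in correct) / len(correct) if correct else 0.0
    false_pass = sum(s >= threshold for s in incorrect) / len(incorrect) if incorrect else 0.0
    return false_fail, false_pass


# Hypothetical hand-labeled sample: (similarity score, labeled correct?).
labeled = [
    (0.95, True), (0.90, True), (0.82, True), (0.78, True),
    (0.92, False), (0.80, False), (0.60, False), (0.55, False),
]

for threshold in (0.75, 0.85):
    false_fail, false_pass = error_rates(labeled, threshold)
    print(threshold, false_fail, false_pass)
# 0.75 0.0 0.5
# 0.85 0.5 0.25
```

Sweeping a few thresholds like this shows the trade-off directly: lowering the threshold stops failing correct outputs but lets more incorrect ones through.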
We see teams skip this calibration step more often than any other. The result is eval suites that developers stop trusting because the failures they report are hard to explain, which defeats the entire purpose of running evals in CI.