LLM Regression Testing: A Practical Guide for ML Platform Teams

Regression testing for LLMs is not the same as regression testing for deterministic code. When you ship a new prompt version, the output space is infinite and correct outputs don't have a single ground truth. This makes the standard software engineering playbook — run tests, check pass/fail — only partially applicable. You need a different framing: delta scoring against a frozen baseline, not binary assertions.

We've seen teams burn weeks building test suites that never catch real regressions, and other teams ship prompt changes that degrade user experience by 20% while all their automated checks stay green. The gap is almost always in how they defined and froze their baseline — not in their choice of metrics.

What "baseline" actually means for LLM features

In traditional software regression testing, your baseline is the previous released version. Run the same inputs, compare outputs byte-for-byte. For LLMs, that doesn't work — even with temperature=0, two runs of the same prompt against the same model can differ in phrasing, length, and structure without either being "wrong."

A useful LLM regression baseline has three components:

1. A frozen eval dataset. A set of representative inputs that covers your feature's real distribution — not just the happy path. For a summarization feature, this means documents of varying length and topic, including edge cases (near-empty input, ambiguous structure, domain-specific jargon). For a Q&A feature, it means questions across factual, opinion, and edge-case categories. The dataset does not change between releases. If you add to it, you version the dataset and re-baseline.

2. Frozen baseline scores. Run your eval suite against your current production prompt+model and record the scores. These are your baseline. Every subsequent release is measured as delta against this, not against an absolute threshold.

3. A delta threshold policy. Define in advance what score drop triggers a block or a review. For a customer-facing summarization feature, a drop of more than 3 points on the factuality criterion might be an auto-block. A coherence drop of 1 point might just be logged. The policy lives with the feature, not in the eval framework.

We're not saying absolute thresholds are useless — they're good for catching catastrophic failures. But relative delta scoring is the only reliable way to catch gradual quality erosion across releases.
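To make the contrast concrete, here is a minimal sketch of a delta check running next to an absolute floor. The criterion names, the 0-100 score scale, the floor values, and every number are illustrative assumptions, not results from a real run.

```python
# Illustrative scores on an assumed 0-100 scale; none come from a real run.
BASELINE = {"factuality": 92.0, "coherence": 88.0}   # frozen at the last release
CANDIDATE = {"factuality": 87.5, "coherence": 88.5}  # new prompt version

ABSOLUTE_FLOOR = {"factuality": 80.0, "coherence": 80.0}  # catches catastrophic failures only
MAX_DROP = {"factuality": 3.0, "coherence": 5.0}          # hypothetical delta policy


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every criterion in the baseline."""
    missing = set(baseline) - set(candidate)
    if missing:
        raise ValueError(f"candidate run is missing criteria: {sorted(missing)}")
    return {name: candidate[name] - baseline[name] for name in baseline}


def below_floor(candidate: dict[str, float], floor: dict[str, float]) -> list[str]:
    """Criteria whose absolute score fell under the floor."""
    return sorted(name for name, minimum in floor.items() if candidate[name] < minimum)


def exceeds_drop(deltas: dict[str, float], max_drop: dict[str, float]) -> list[str]:
    """Criteria whose drop versus baseline is larger than the policy allows."""
    return sorted(name for name, limit in max_drop.items() if -deltas[name] > limit)


deltas = score_deltas(BASELINE, CANDIDATE)
print("deltas:", deltas)
print("below absolute floor:", below_floor(CANDIDATE, ABSOLUTE_FLOOR))
print("regressed vs baseline:", exceeds_drop(deltas, MAX_DROP))
```

With these made-up numbers the absolute floor reports nothing, while the delta check flags factuality: a 4.5-point drop is still well above the floor but larger than the 3-point limit.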

Choosing eval criteria by feature type

The criteria you apply to a regression run should match what the feature actually does. Teams that copy-paste generic LLM eval criteria lists onto every feature end up with noisy signals that nobody trusts.

Here's how we think about criteria selection at Fyntune:

Retrieval-augmented generation (RAG) features — The most important criteria are groundedness (does the output stay within the retrieved context?), source attribution accuracy (are cited facts from the right source?), and faithfulness (no introduced facts). Fluency and coherence matter less here — most RAG outputs are already fluent because foundation models are fluent by default.

Instruction-following features — Structured output, format compliance, and constraint adherence dominate. A prompt that asks for JSON output should be evaluated on whether it produces valid JSON, not on whether the content reads naturally. Adding a coherence criterion to this feature type gives you near-zero signal.

Open-ended generation (summaries, drafts, explanations) — This is where fluency, coherence, and tone actually matter. Factuality matters if the feature is supposed to be accurate (summarization from a source document). It matters less for creative drafting. Length drift is worth tracking here — prompt changes that don't intend to change output length often do, and users notice.

Guardrail-adjacent features — Classification features, content moderation passes, and routing decisions need precision/recall-style metrics against labeled test cases, not semantic rubrics. These have closer-to-binary correct answers.
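Two of the feature types above have criteria that are cheap to compute without a judge model. The sketch below shows one plausible way to score them: a JSON validity rate for instruction-following features, and precision/recall against labeled cases for guardrail-adjacent features. The function names and sample data are assumptions made for illustration.

```python
import json


def json_valid_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as JSON.

    Note: this accepts any JSON value (a bare number counts). A feature that
    requires an object with specific keys would need a stricter check.
    """
    if not outputs:
        raise ValueError("no outputs to score")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


def precision_recall(predicted: list[bool], expected: list[bool]) -> tuple[float, float]:
    """Precision and recall for a binary label (True means "flagged")."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sample data.
sample_outputs = ['{"a": 1}', "not json", "[1, 2]", '{"a": ']
print("JSON validity rate:", json_valid_rate(sample_outputs))

predicted = [True, True, False, False, True]
expected = [True, False, True, False, True]
print("precision, recall:", precision_recall(predicted, expected))
```

Both numbers can then be tracked as deltas against the frozen baseline like any other criterion.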

A concrete regression scenario

An ML team we worked with was running a knowledge-base Q&A feature for a B2B SaaS product. They had a decent eval dataset of ~200 question/answer pairs. Their regression test was passing with near-perfect scores — and then users started filing tickets about "confident wrong answers" after a prompt update.

What went wrong: their eval suite measured answer fluency and answer completeness, but had no factuality or groundedness criterion. Their prompt update had tweaked the system prompt in a way that made the model more verbose and less anchored to retrieved context. Fluency scores went up slightly. Factuality, unmeasured, collapsed.

After adding a groundedness criterion (using Fyntune's LLM-as-judge configuration for groundedness, with a rubric anchored to the retrieved source passages), the next regression run immediately showed a 14-point drop against the baseline. The prompt was rolled back before it reached 5% of users.

The lesson isn't "add more criteria." It's "criteria selection is feature-specific, and every RAG feature needs groundedness whether or not it also needs fluency."

Interpreting delta scores: what counts as a regression

Delta scoring without a clear interpretation policy produces analysis paralysis. We see teams that run regression tests, get a mixed results table — some criteria up, some down — and either ship anyway (ignoring the down scores) or block everything (triggering eng frustration). Neither is useful.

A reasonable interpretation framework has three tiers:

Auto-block criteria — Typically safety, factuality, and groundedness. A drop of more than some defined threshold here stops the release automatically, no human required. For most teams, this threshold is conservative: a 5-point drop is a block. The cost of a false positive (blocked good release) is much lower than a false negative (shipped regression).

Review criteria — Coherence, tone, format adherence. A drop here flags the release for human review, but doesn't block it. Someone on the team should read a sample of outputs from the regression run and make a judgment call. This is intentional — these criteria require context that automated scoring doesn't always capture.

Log-only criteria — Length, verbosity, stylistic drift. Track these over time, but don't tie them to release gates. They're useful for diagnosing trends, not individual releases.

The assignment of criteria to tiers is a product decision, not a technical one. ML platform teams shouldn't make this call unilaterally — it belongs in a conversation with the product manager or feature owner. Codifying the policy in a per-feature eval config (we do this as a YAML block in Fyntune's eval config format) makes it auditable and prevents the policy from shifting implicitly over time.
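The three tiers can be sketched as a small policy table plus a verdict function. This is a Python stand-in written for illustration, not Fyntune's YAML config format; the criterion names, tier assignments, and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_BLOCK = "auto_block"
    REVIEW = "review"
    LOG_ONLY = "log_only"


@dataclass(frozen=True)
class CriterionPolicy:
    tier: Tier
    max_drop: float  # largest tolerated drop versus baseline, in score points


# Hypothetical policy for one feature. In practice the feature owner decides this.
POLICY = {
    "groundedness": CriterionPolicy(Tier.AUTO_BLOCK, 5.0),
    "factuality": CriterionPolicy(Tier.AUTO_BLOCK, 5.0),
    "coherence": CriterionPolicy(Tier.REVIEW, 2.0),
    "length": CriterionPolicy(Tier.LOG_ONLY, 0.0),
}


def release_verdict(deltas: dict[str, float], policy: dict[str, CriterionPolicy]) -> tuple[str, list[str]]:
    """Return ("block" | "review" | "pass", criteria that triggered the verdict)."""
    breached = {tier: [] for tier in Tier}
    for name, rule in policy.items():
        if name not in deltas:
            continue  # criterion was not scored in this run
        if -deltas[name] > rule.max_drop:
            breached[rule.tier].append(name)
    if breached[Tier.AUTO_BLOCK]:
        return "block", breached[Tier.AUTO_BLOCK]
    if breached[Tier.REVIEW]:
        return "review", breached[Tier.REVIEW]
    return "pass", []  # log-only criteria never gate a release


# Made-up deltas (candidate minus baseline).
print(release_verdict({"groundedness": -1.0, "factuality": 0.5, "coherence": -3.0, "length": -12.0}, POLICY))
print(release_verdict({"groundedness": -14.0, "coherence": -3.0}, POLICY))
```

Keeping the tier and the threshold on the same record means a change to either one shows up as a reviewable diff.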

Handling non-determinism in regression runs

Even at temperature=0, there's inherent variance in LLM outputs when the same prompt is run at different times, against different API versions, or on different hardware. Treating a single regression run as ground truth will produce false positives — a criterion that drops 2 points in one run might be back at baseline in the next.

Two practical mitigations:

Run each eval input multiple times and average. For most use cases, 3 runs per input is enough to smooth single-sample noise, although it does triple the API cost for those inputs. If you're on a tight budget, prioritize multiple runs for your highest-stakes eval inputs (the ones your users hit most often) and single-run the long tail.

Set delta thresholds wider than the variance band. Before you freeze your baseline scores, run your eval suite 5 times in a row on the same prompt and note the score variance across runs. A criterion that varies by ±2 points at rest should have a block threshold of at least 5 points — otherwise you'll auto-block noise.
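Both mitigations come down to a few lines of arithmetic. The sketch below is one way to do it, under stated assumptions: the scores are invented, the noise band is taken as half the spread between the best and worst run, and the 2.5x safety factor is chosen only because it reproduces the "±2 points, 5-point threshold" example above.

```python
from statistics import mean

# Five back-to-back runs of the same prompt, per criterion. Invented numbers.
REST_RUNS = {
    "factuality": [91.0, 93.0, 92.0, 90.0, 94.0],
    "coherence": [88.0, 88.5, 87.5, 88.0, 88.0],
}

SAFETY_FACTOR = 2.5  # assumption: turns a ±2 band into a 5-point threshold


def noise_band(scores: list[float]) -> float:
    """Half the spread between the best and worst run (the "±" figure)."""
    return (max(scores) - min(scores)) / 2


def min_block_threshold(scores: list[float], safety_factor: float = SAFETY_FACTOR) -> float:
    """Smallest block threshold that stays clear of at-rest noise."""
    return noise_band(scores) * safety_factor


def averaged_score(runs_per_input: list[list[float]]) -> float:
    """Average repeated runs per input first, then average across inputs.

    Inputs may have different run counts (3 for high-stakes, 1 for the long
    tail) without the repeated inputs dominating the final score.
    """
    return mean(mean(runs) for runs in runs_per_input)


for name, scores in REST_RUNS.items():
    print(name, "band:", noise_band(scores), "min threshold:", min_block_threshold(scores))

print("averaged:", averaged_score([[90.0, 92.0, 94.0], [80.0]]))
```

A configured threshold smaller than `min_block_threshold` for its criterion is a sign the gate will fire on noise.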

When regression tests don't run fast enough

A common complaint: eval suites that take 30-40 minutes to complete don't get run on every PR. Teams start running them weekly, or only before major releases, and miss prompt-level regressions from individual commits.

The solution isn't to make your full eval suite faster (though that helps). It's to run a tiered suite:

  • PR-level checks: 20-30 high-value eval inputs, only your auto-block criteria. Should complete in under 3 minutes. Blocks merge if any auto-block criterion drops by more than its delta threshold.
  • Pre-deploy checks: Full eval dataset, all criteria. Runs after merge, before production deploy. Blocks deploy if an auto-block criterion regresses, and flags the release for human review if a review criterion regresses.
  • Nightly full eval: Run against production traffic samples (or a broader synthetic dataset). Generates the trend data you need to catch slow drift that individual release gates miss.

Building this into CI/CD requires eval infrastructure that can run in different modes from the same config. In Fyntune, you define a single eval config per feature, and then pass a --tier pr|predeploy|nightly flag to control scope. The feature owner doesn't need to maintain three separate configs.
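As a rough illustration of "one config, several modes", the sketch below derives the scope of a run from a tier name. It is not how Fyntune implements its --tier flag; the `TierScope` type, the `SCOPES` table, and `plan_run` are hypothetical names, and the nightly tier's separate data source (production samples) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierScope:
    max_inputs: Optional[int]   # None means the full dataset
    criterion_tiers: frozenset  # which policy tiers get scored


ALL_TIERS = frozenset({"auto_block", "review", "log_only"})

# Hypothetical scopes, following the three run tiers described above.
SCOPES = {
    "pr": TierScope(max_inputs=30, criterion_tiers=frozenset({"auto_block"})),
    "predeploy": TierScope(max_inputs=None, criterion_tiers=ALL_TIERS),
    "nightly": TierScope(max_inputs=None, criterion_tiers=ALL_TIERS),
}


def plan_run(tier: str, inputs: list[str], criteria: dict[str, str]) -> tuple[list[str], list[str]]:
    """Pick the inputs and criteria for one run.

    `inputs` is assumed to be ordered highest-value first.
    `criteria` maps a criterion name to its policy tier.
    """
    if tier not in SCOPES:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(SCOPES)}")
    scope = SCOPES[tier]
    selected_inputs = inputs if scope.max_inputs is None else inputs[: scope.max_inputs]
    selected_criteria = [name for name, t in criteria.items() if t in scope.criterion_tiers]
    return selected_inputs, selected_criteria


eval_inputs = [f"q{i}" for i in range(200)]
feature_criteria = {"groundedness": "auto_block", "coherence": "review", "length": "log_only"}

pr_inputs, pr_criteria = plan_run("pr", eval_inputs, feature_criteria)
print(len(pr_inputs), pr_criteria)
```

The point of the design is that the feature owner edits one criteria map, and the run tiers only narrow what gets executed.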

Baselining after a model swap

If you switch foundation models — say, from one API provider to another — your existing baseline is invalid. The new model will produce systematically different outputs even with the same prompt, and your delta scores will be meaningless.

The correct process: run your full eval suite against the new model on your frozen eval dataset, then record those scores as the new baseline. Only after re-baselining should you run regression tests against subsequent changes on the new model. The old and new baselines should both be retained — you'll want to compare them to understand what quality shifted with the model change itself.

This sounds obvious, but we've seen teams skip re-baselining and then spend days debugging why their regression scores look strange. The model swap is a baseline invalidation event, full stop.
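One way to make the invalidation hard to skip is to key stored baselines by model and dataset version, so a lookup for a new model fails loudly until someone re-baselines. The sketch below assumes an in-memory store; `BaselineStore`, `BaselineKey`, and the model and version strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BaselineKey:
    model_id: str
    dataset_version: str


class BaselineNotFound(LookupError):
    """Raised when no baseline exists for a model/dataset pair."""


@dataclass
class BaselineStore:
    _scores: dict = field(default_factory=dict)

    def record(self, key: BaselineKey, scores: dict[str, float]) -> None:
        # Old baselines are kept, never overwritten by a different key.
        self._scores[key] = dict(scores)

    def get(self, key: BaselineKey) -> dict[str, float]:
        try:
            return self._scores[key]
        except KeyError:
            raise BaselineNotFound(f"no baseline for {key}; re-baseline first") from None

    def model_shift(self, old: BaselineKey, new: BaselineKey) -> dict[str, float]:
        """Per-criterion change between two baselines (new minus old)."""
        old_scores, new_scores = self.get(old), self.get(new)
        return {name: new_scores[name] - old_scores[name] for name in old_scores if name in new_scores}


store = BaselineStore()
old_key = BaselineKey("model-a", "v3")  # placeholder identifiers
new_key = BaselineKey("model-b", "v3")
store.record(old_key, {"groundedness": 90.0})

try:
    store.get(new_key)
except BaselineNotFound as err:
    print("blocked:", err)

store.record(new_key, {"groundedness": 86.5})  # re-baseline on the new model
print("shift from model swap:", store.model_shift(old_key, new_key))
```

Retaining both entries is what lets `model_shift` report how much quality moved with the model change itself, separate from any later prompt change.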

Regression testing for LLMs is harder than regression testing for deterministic systems. But it's not intractable — it requires a different conceptual frame (delta scoring vs. pass/fail), careful feature-specific criteria selection, and an interpretation policy that was agreed on before the test ran, not after. Get those three things right and regression tests actually catch regressions.
