Running Evals Across Multiple Models: A Practical Comparison Strategy
Every team building LLM-powered features eventually runs a multi-model comparison. Usually it starts as a cost question — can a cheaper model do what the expensive one does well enough? — and quickly becomes more complicated. You run the eval, get a table with scores across three models and eight criteria, and then spend a day arguing about what the table actually means.
Multi-model eval is structurally different from single-model regression testing. You're not asking "did my prompt quality get worse?" — you're asking "which model should I use, and for which features?" That's a different kind of question that requires a different eval design.
The apples-to-apples problem
The most common mistake in multi-model comparison: running the same prompt unchanged across all models and treating the eval scores as directly comparable.
Different models have different instruction-following behaviors, different context length handling, and different optimal prompt structures. A prompt tuned for one model is not at its best on another. When you run model A's optimized prompt against model B, you're not comparing the models' capabilities — you're comparing model A at its best against model B with a suboptimal prompt. The comparison is biased before you collect a single score.
There are two ways around this. Both cost extra effort, and each is useful at a different stage:
Per-model prompt tuning. Tune a separate prompt for each model candidate, then evaluate each on its own tuned prompt. This gives you a fair comparison of each model's ceiling on your task. It requires more work upfront but produces reliable comparative conclusions. For high-value model selection decisions, this is the correct approach.
Model-agnostic prompts. Design prompts to be maximally model-agnostic — simple, explicit instructions without any model-specific hacks — then run the same prompt across all candidates. This isn't optimal for any individual model, but it's equally suboptimal for all of them, which makes the comparison fair in a different way. Useful for initial screening before investing in per-model tuning.
Be explicit about which approach you're using and what it means for how you interpret results.
Structuring criteria for cross-model comparison
Not all eval criteria are equally useful for model comparison. Some criteria measure a capability that varies meaningfully across models; others measure something that's near-ceiling on all capable models and won't differentiate them.
Before running a multi-model eval, sort your criteria into two buckets:
Discriminating criteria — those likely to produce meaningful differences across the models you're testing. For most current frontier model comparisons, these tend to be: instruction-following precision, output format compliance, groundedness in RAG tasks, handling of ambiguous or underspecified inputs, and latency (not a quality criterion but practically important). These are where you should focus your comparison attention.
Ceiling criteria — those on which all your candidate models are likely to score near-maximum. Basic coherence, fluency, and factuality on well-scoped questions are ceiling criteria for frontier models. Including them in your comparison table clutters the comparison without adding signal. They're still worth running for quality assurance — you don't want a model that scores 4.8/5 on instruction following and 2/5 on coherence — but don't let near-identical high scores on these criteria distract from the discriminating criteria where models actually differ.
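One way to make the sort less of a guess is to bucket criteria from a small pilot run. The sketch below is illustrative only: the criterion names, the scores, and the two thresholds (`ceiling_floor`, `max_spread`) are assumptions chosen for the example, not recommended values.

```python
# Hypothetical pilot results: mean score per criterion and model on a 1-5 rubric.
scores = {
    "instruction_following": {"model_a": 4.6, "model_b": 3.9, "model_c": 4.2},
    "format_compliance": {"model_a": 4.8, "model_b": 4.1, "model_c": 4.7},
    "coherence": {"model_a": 4.9, "model_b": 4.8, "model_c": 4.9},
    "fluency": {"model_a": 4.9, "model_b": 4.9, "model_c": 4.8},
}


def bucket_criteria(scores, ceiling_floor=4.5, max_spread=0.3):
    """Label each criterion as 'ceiling' or 'discriminating'.

    A criterion counts as 'ceiling' when every model scores at least
    ceiling_floor and the gap between best and worst model is at most
    max_spread. Both thresholds are assumptions to tune per rubric.
    """
    buckets = {}
    for criterion, by_model in scores.items():
        values = list(by_model.values())
        spread = max(values) - min(values)
        is_ceiling = min(values) >= ceiling_floor and spread <= max_spread
        buckets[criterion] = "ceiling" if is_ceiling else "discriminating"
    return buckets


buckets = bucket_criteria(scores)
for criterion, label in buckets.items():
    print(f"{criterion}: {label}")
```

With these made-up numbers, coherence and fluency land in the ceiling bucket, while instruction following and format compliance are flagged as discriminating.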
The cost-quality frontier isn't a single number
Teams that evaluate models purely on quality score miss the shape of the cost-quality relationship. The right question isn't "which model has the highest average score?" — it's "for which features does the quality gap between models justify the cost difference?"
Consider a feature set with three features: a complex multi-document analysis feature, a short structured output extraction feature, and a conversational routing feature. Running a multi-model eval on all three might reveal that the smaller, cheaper model performs within 5% of the larger model on the structured output and routing features, but 20% worse on the multi-document analysis. The correct decision is to use different models for different features — not to pick a single model for all three.
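The per-feature decision can be sketched as a simple rule: for each feature, pick the cheapest model whose quality is within a tolerance of the best model on that feature. The quality numbers, the relative costs, and the 5% tolerance below are hypothetical values chosen to mirror the example above.

```python
# Hypothetical per-feature quality scores (0-1) and relative cost per call.
quality = {
    "multi_doc_analysis": {"large": 0.90, "small": 0.72},
    "structured_extraction": {"large": 0.95, "small": 0.92},
    "conversational_routing": {"large": 0.93, "small": 0.90},
}
cost = {"large": 10.0, "small": 1.0}


def assign_models(quality, cost, max_relative_gap=0.05):
    """Pick, per feature, the cheapest model within max_relative_gap of the best."""
    assignment = {}
    for feature, by_model in quality.items():
        best = max(by_model.values())
        # Models whose quality shortfall relative to the best is acceptable.
        eligible = [
            model
            for model, q in by_model.items()
            if (best - q) / best <= max_relative_gap
        ]
        assignment[feature] = min(eligible, key=lambda model: cost[model])
    return assignment


assignment = assign_models(quality, cost)
for feature, model in assignment.items():
    print(f"{feature}: {model}")
```

Here the small model is about 3% behind on extraction and routing, so it wins on cost there, while its 20% gap on multi-document analysis keeps that feature on the large model.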
This is a harder operational decision but the right one. In Fyntune's eval configuration, you can tag each feature with a model assignment, which makes it explicit which model is responsible for which feature and enables per-feature regression testing after any model changes. Features shouldn't all be on the same model by default — they should be on the right model for their cost-quality requirements.
Avoiding the result table trap
After running a multi-model eval across 8 criteria and 3 models, you have 24 data points. The temptation is to add them up and pick the model with the highest total score. A raw total is misleading for several reasons:
- Criteria aren't equally important. A model that scores lower on coherence but higher on factuality might be better for a fact-intensive use case even though its total score is lower.
- The scale isn't commensurable across criteria. A 1-point difference on a 5-point coherence rubric doesn't mean the same thing as a 1-point difference on a 5-point factuality rubric.
- A single low score on a critical criterion (like guardrail compliance) might disqualify a model regardless of how it scores on other criteria.
The better approach: define criterion weights and any hard disqualifiers before you run the eval. Criterion X is a hard disqualifier if it falls below threshold Y. Criterion Z has 3x the weight of criterion W in the final score. Write this down before you look at results. Post-hoc criterion weighting is the eval equivalent of p-hacking — it's easy to rationalize the choice you wanted to make anyway once you've seen the data.
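A minimal sketch of what writing it down can look like: weights and disqualifier thresholds live in constants that are committed before any results exist, and the scoring function only reads them. All criterion names, weights, thresholds, and scores here are hypothetical.

```python
# Fixed BEFORE the eval runs. Hypothetical values for illustration.
WEIGHTS = {"factuality": 3.0, "instruction_following": 2.0, "coherence": 1.0}
DISQUALIFIERS = {"guardrail_compliance": 4.5}  # minimum acceptable score


def evaluate(model_scores, weights=WEIGHTS, disqualifiers=DISQUALIFIERS):
    """Return the weighted mean score, or None if a hard disqualifier trips."""
    for criterion, threshold in disqualifiers.items():
        if model_scores[criterion] < threshold:
            return None  # disqualified regardless of other scores
    total_weight = sum(weights.values())
    weighted_sum = sum(model_scores[c] * w for c, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical results on a 1-5 rubric.
model_a = {
    "factuality": 4.0,
    "instruction_following": 4.5,
    "coherence": 5.0,
    "guardrail_compliance": 4.8,
}
model_b = {
    "factuality": 4.9,
    "instruction_following": 4.9,
    "coherence": 4.9,
    "guardrail_compliance": 4.0,
}

score_a = evaluate(model_a)
score_b = evaluate(model_b)
print("model_a:", score_a)
print("model_b:", score_b)  # None: fails the guardrail threshold
```

In this made-up case model_b scores higher on every weighted criterion but is still rejected, because the guardrail threshold was declared a hard disqualifier up front.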
Cross-model score normalization
If you're using LLM-as-judge for any of your eval criteria, scores across models aren't automatically comparable. A judge from the same model family as one of your candidate generators is prone to self-preference bias: it tends to score that family's outputs higher, independent of their actual quality.
Two approaches to address this:
Use a single, consistent judge model from a different family than all candidates. If you're comparing GPT-4o and Claude, use a judge that's neither: a model from a third family, or a fine-tuned open-source judge model. Consistency matters more than finding the "best" judge, so use the same judge for every model in the comparison and don't switch judges mid-evaluation.
Run human preference comparisons on the subset of criteria where LLM judge bias is most likely to matter. Pairwise preference by a human rater on 20-30 representative outputs, for each candidate model pair, is relatively cheap and gives you a bias check on your automated scores.
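One simple way to use those human ratings, sketched under the assumption that both the judge and the human pick a winner ("a" or "b") for each output pair: measure how often they agree, and when they disagree, check whether the judge leans toward one model. The data below is fabricated for illustration only.

```python
def agreement_rate(pairs):
    """Fraction of comparisons where judge and human picked the same winner."""
    if not pairs:
        raise ValueError("need at least one comparison")
    return sum(1 for judge, human in pairs if judge == human) / len(pairs)


def judge_lean(pairs, model):
    """Among disagreements, fraction where the judge sided with `model`.

    A value far from 0.5 suggests the judge systematically favors
    (or disfavors) that model relative to the human rater.
    """
    disagreements = [(judge, human) for judge, human in pairs if judge != human]
    if not disagreements:
        return 0.0
    favored = sum(1 for judge, _ in disagreements if judge == model)
    return favored / len(disagreements)


# Hypothetical (judge_pick, human_pick) results for 20 output pairs.
pairs = (
    [("a", "a")] * 12
    + [("b", "b")] * 4
    + [("a", "b")] * 3
    + [("b", "a")] * 1
)

agreement = agreement_rate(pairs)
lean_a = judge_lean(pairs, "a")
print(f"agreement: {agreement:.2f}")
print(f"judge sided with model a in {lean_a:.0%} of disagreements")
```

With only 20-30 pairs these numbers are noisy, so treat them as a prompt for closer inspection and not as a measurement of the bias.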
When to stop comparing and commit
Multi-model evaluation can become a trap: teams keep expanding the comparison scope (more criteria, more models, more input types) instead of making a decision.
Set a decision timeline before you start. "We'll run evals for two weeks and make a model selection decision on day 15." Commit to the criteria and model candidates before the eval starts. When you have results on day 15, make the decision based on what you have — not on hypothetical additional data you could collect.
The model landscape will keep changing. The model you select today may not be the optimal model in 6 months. That's fine — your eval infrastructure, eval dataset, and eval criteria persist across model changes. The time you invest in rigorous multi-model comparison methodology pays off every time you revisit the model selection question, which in the current environment will be often.
We're not saying you need a perfect comparison methodology before any model decision gets made. Early-stage decisions with lighter eval are often appropriate — you have less production data, fewer edge cases defined, and the cost of switching later is manageable. The rigor we've described here is proportional to the stakes: high-traffic production features where a wrong model choice costs weeks of remediation deserve the full process. An internal tool used by three people does not.