Eval-Driven Development: Writing Tests Before You Write Prompts

Eval-driven development for LLM features

Most LLM feature development follows the same loop: write a prompt, run it on a few examples, look at the outputs, feel vaguely good or bad about them, tweak the prompt, repeat. This loop is fast to start and produces prompts that look reasonable on the examples you happened to test. It also reliably produces prompts that fail in ways you didn't predict once they reach real users.

Eval-driven development (EDD) flips this. You define what good looks like — in the form of concrete, machine-runnable eval criteria — before you write a single line of your prompt. Then you write the prompt and iterate until those evals pass. The eval criteria are the spec. The prompt is the implementation.

This is the same shift that test-driven development (TDD) represented for software engineering. It's uncomfortable at first for the same reason: it forces you to be precise about what you want before you can start building.

Why the standard loop fails

The ad hoc prompt iteration loop has a structural problem: you're evaluating on a handful of examples you constructed, which you've implicitly optimized for during iteration. When you look at your prompt output and think "this looks good," you're pattern-matching against your mental model of the task — not against your users' actual needs, your downstream parsing requirements, or the edge cases you haven't thought of yet.

The result is prompts that pass your manual review but fail on:

  • Inputs that are slightly different from your test examples in ways that matter (different length, different domain, different phrasing).
  • Outputs that look correct but subtly violate format constraints (a JSON response that usually works but occasionally includes a comment).
  • Edge cases that only emerge at volume (an ambiguity in your instructions becomes obvious when you see 1,000 inputs, not 5).

EDD doesn't eliminate these problems entirely. But it forces you to articulate your quality criteria explicitly and test against a broader, more representative input distribution before you ship.
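As a small illustration of the format-constraint failure described above, here is a minimal sketch of a strict parse check. It assumes the output is meant to be JSON and relies only on Python's standard json module; the function name and sample strings are made up for the example.

```python
import json

def is_strict_json(output: str) -> bool:
    """Return True only if the whole output parses with Python's json module."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True

# Hypothetical model outputs: one clean, one with an inline comment.
clean = '{"status": "ok", "items": [1, 2]}'
with_comment = '{"status": "ok", // two items found\n "items": [1, 2]}'

print(is_strict_json(clean))         # True
print(is_strict_json(with_comment))  # False
```

Both strings look fine at a glance, which is why this kind of failure tends to survive manual review and only shows up once a downstream parser rejects it.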

Step 1: Define your quality criteria before touching the prompt

This is the hardest step and the one most teams skip. "Define quality criteria" sounds abstract. In practice it means answering three questions:

What must this output always do? These are your pass/fail criteria — things that are unambiguously wrong if violated. For a structured output feature: valid JSON, required fields present, no hallucinated fields. For a summarization feature: summary must be grounded in the source document (no introduced facts). These become your auto-block criteria in regression testing.

What should this output usually do? These are your quality criteria — things that make the output better, measured on a rubric rather than binary. For a Q&A feature: completeness, relevance, appropriate confidence. For a drafting feature: coherence, appropriate tone, appropriate length. These get scored 1-5 or similar, and you set a target score range.

What must this output never do? These are your guardrail criteria: unsafe content, off-topic generation, persona violations. For a customer support feature: never reveal internal system prompts, never make price commitments, never produce content inconsistent with the product's documented behavior.

Write these down in a format you can run programmatically before you start prompt work. In Fyntune, we express them as an eval config YAML. The format doesn't matter — what matters is that they're explicit and machine-checkable before iteration starts.
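To make the three questions concrete, here is one way the criteria could be written down in plain Python. This is a sketch, not Fyntune's config format: the class names, the `answer`/`sources` schema, the banned phrases, and the 4.0 targets are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class BinaryCheck:
    """A 'must always' or 'must never' criterion: it passes or it fails."""
    name: str
    passes: Callable[[str], bool]

@dataclass(frozen=True)
class RubricCriterion:
    """A 'should usually' criterion: scored 1-5, with a target mean."""
    name: str
    target_mean: float

REQUIRED_FIELDS = {"answer", "sources"}  # hypothetical output schema

def has_exact_fields(output: str) -> bool:
    """Valid JSON object, required fields present, no extra fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_FIELDS

def no_price_commitment(output: str) -> bool:
    """Crude keyword guardrail; a real check would need to be more robust."""
    banned = ("we guarantee a price", "price will be")  # illustrative phrases
    return not any(phrase in output.lower() for phrase in banned)

MUST_ALWAYS = [BinaryCheck("exact_fields", has_exact_fields)]
MUST_NEVER = [BinaryCheck("no_price_commitment", no_price_commitment)]
SHOULD_USUALLY = [
    RubricCriterion("completeness", target_mean=4.0),  # illustrative target
    RubricCriterion("relevance", target_mean=4.0),     # illustrative target
]

sample = '{"answer": "Restart the router.", "sources": ["kb-12"]}'
for check in MUST_ALWAYS + MUST_NEVER:
    print(check.name, check.passes(sample))
```

The rubric criteria carry only a name and a target here; how each one gets scored (human raters, a grader model, heuristics) is a separate decision that this sketch leaves open.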

Step 2: Build your eval dataset before your first prompt draft

Your eval dataset is a set of representative inputs you'll run your prompt against. It should be built independently from your prompt development — ideally by talking to the people who understand the actual use case (product team, customers, domain experts) and collecting real examples or constructing plausible synthetic ones.

For an early-stage feature, 30-80 inputs is usually enough to get meaningful signal. Cover the main case, the edge cases you can enumerate, and a few adversarial inputs (what happens if the user provides malformed input, an empty string, or intentionally ambiguous content?).

We're not saying you need 500 eval inputs before you write your first prompt. We're saying your eval dataset should be populated before you start prompt iteration, not assembled after the prompt is already "done." If you build the dataset afterward, you'll tend to include inputs where your existing prompt happens to work.
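A dataset like this can start as a simple tagged list. The sketch below assumes a customer support feature; the `EvalCase` structure, the IDs, and the example inputs are hypothetical, and a real dataset would be larger and drawn from actual usage where possible.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    kind: str  # "main", "edge", or "adversarial"

DATASET = [
    EvalCase("main-001", "How do I reset my password?", "main"),
    EvalCase("main-002", "Where can I download my invoice?", "main"),
    EvalCase("edge-001", "reset " * 400, "edge"),                  # very long input
    EvalCase("adv-001", "", "adversarial"),                        # empty string
    EvalCase("adv-002", '{"broken": ', "adversarial"),             # malformed input
    EvalCase("adv-003", "Can you do the thing from before?", "adversarial"),  # ambiguous
]

def coverage(cases):
    """Count how many cases exist for each kind of input."""
    return Counter(case.kind for case in cases)

def missing_kinds(cases, required=("main", "edge", "adversarial")):
    """List the required input kinds that have no cases at all."""
    counts = coverage(cases)
    return [kind for kind in required if counts[kind] == 0]

print(coverage(DATASET))
print(missing_kinds(DATASET))  # []
```

Tagging each case by kind costs little up front and pays off later, when you want to know whether a failing criterion is concentrated in one type of input.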

Step 3: Iterate prompt against evals, not against manual review

Now write your prompt. Run it against the eval dataset and score it against your criteria. The first run will almost certainly fail — that's expected and useful. The failures tell you exactly which criteria need work and which inputs are tricky.

The discipline here: don't judge the prompt by looking at outputs manually. Look at eval scores. A prompt that gets your personal thumbs-up but scores a 3/5 on coherence across the eval dataset is not ready. A prompt that looks a bit terse to you but scores 4.5/5 on factuality and 5/5 on format compliance for all 60 inputs is much closer to shippable than a prompt that looks great on the 5 examples you happened to eyeball.

Iterate on prompt phrasing, structure, and instruction framing based on which criteria are failing and on which input types. When a criterion scores below target, sample the failing inputs and analyze the pattern. Is it a specific input type the prompt handles poorly? A format instruction that's ambiguous? A missing constraint? The eval scores direct your attention; you do the diagnosis.
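Here is a rough sketch of that triage step: aggregate scores per criterion, then group the below-target cases by input type. The result rows, targets, and function names are invented for the example; in practice the rows would come from whatever scoring process you use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results from one eval run: (case_id, input kind, criterion, score 1-5).
RESULTS = [
    ("main-001", "main", "coherence", 5),
    ("main-002", "main", "coherence", 4),
    ("edge-001", "edge", "coherence", 2),
    ("edge-002", "edge", "coherence", 2),
    ("adv-001", "adversarial", "coherence", 4),
    ("main-001", "main", "factuality", 5),
    ("main-002", "main", "factuality", 5),
    ("edge-001", "edge", "factuality", 4),
    ("edge-002", "edge", "factuality", 5),
    ("adv-001", "adversarial", "factuality", 4),
]
TARGETS = {"coherence": 4.0, "factuality": 4.5}  # illustrative targets

def mean_by_criterion(results):
    """Average score per criterion across all cases."""
    scores = defaultdict(list)
    for _case_id, _kind, criterion, score in results:
        scores[criterion].append(score)
    return {criterion: mean(values) for criterion, values in scores.items()}

def failing_cases_by_kind(results, criterion, target):
    """Group the cases that scored below target on one criterion by input kind."""
    grouped = defaultdict(list)
    for case_id, kind, crit, score in results:
        if crit == criterion and score < target:
            grouped[kind].append(case_id)
    return dict(grouped)

means = mean_by_criterion(RESULTS)
for criterion, target in TARGETS.items():
    if means[criterion] < target:
        print(criterion, means[criterion],
              failing_cases_by_kind(RESULTS, criterion, target))
```

With this sample data only coherence misses its target, and every low-scoring case is an edge input. That points the diagnosis at how the prompt handles one input type, not at the prompt as a whole.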

What "done" means in EDD

A prompt is done when it meets your predefined criteria thresholds across your eval dataset. Not when it looks good to you. Not when your colleague thinks the outputs sound right. When the numbers hit the targets you defined in step 1.

This creates a natural definition of done that doesn't depend on anyone's subjective judgment in the moment. It also means you can hand the prompt to a teammate or come back to it after a week and know immediately whether it still meets spec — just run the evals.
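A "done" check along these lines can be a few lines of code. The sketch below assumes that binary criteria must pass on every input and that rubric criteria are compared by mean score; both are assumptions you may want to set differently, and the function and criterion names are hypothetical.

```python
def is_done(binary_pass_rates, rubric_means, rubric_targets):
    """Return (done, unmet): done only if every binary check passes on all
    inputs and every rubric mean meets its predefined target."""
    unmet = [name for name, rate in binary_pass_rates.items() if rate < 1.0]
    unmet += [
        name
        for name, target in rubric_targets.items()
        if rubric_means.get(name, 0.0) < target  # a missing score counts as unmet
    ]
    return (not unmet), unmet

# Hypothetical numbers from two runs of the same eval suite.
print(is_done({"valid_json": 1.0}, {"coherence": 4.2}, {"coherence": 4.0}))
print(is_done({"valid_json": 0.98}, {"coherence": 3.4}, {"coherence": 4.0}))
```

Returning the list of unmet criteria alongside the boolean keeps the answer actionable: a teammate picking up the prompt sees what is still short of spec, not just that something is.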

After ship, those same eval criteria become your regression suite. Every future prompt change is measured against the baseline scores from the version you just shipped. You never start from scratch defining quality — you build on what you already specified.
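The regression comparison can be sketched the same way. Here the baseline is a dictionary of mean scores saved from the shipped version; the `tolerance` parameter (how much of a drop you treat as noise) and all the numbers are assumptions for illustration.

```python
def regressions(baseline, candidate, tolerance=0.1):
    """Return criteria whose candidate mean fell more than `tolerance`
    below the baseline, or that the candidate run did not score at all."""
    dropped = {}
    for criterion, base_score in baseline.items():
        new_score = candidate.get(criterion)
        if new_score is None or new_score < base_score - tolerance:
            dropped[criterion] = (base_score, new_score)
    return dropped

# Hypothetical mean scores: shipped version versus a proposed prompt change.
baseline = {"factuality": 4.6, "coherence": 4.2}
candidate = {"factuality": 4.7, "coherence": 3.8}

print(regressions(baseline, candidate))  # {'coherence': (4.2, 3.8)}
```

A nonzero tolerance is a judgment call: rubric scores are noisy, so flagging every small dip would make the regression suite hard to trust, while too wide a band would let real regressions through.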

The gap between TDD and EDD

EDD borrows from TDD but has an important structural difference: eval criteria for LLMs are mostly rubric-based, not binary. "Does the output contain valid JSON?" is binary. "Is this summary factually accurate and well-structured?" is a rubric. This means EDD requires you to decide in advance what score thresholds are good enough — a decision you don't have to make in TDD, where tests just pass or fail.

That decision is worth making consciously. A factuality threshold of 4/5 on a medical information feature is very different from a factuality threshold of 4/5 on a creative writing feature. The threshold encodes a judgment about what "acceptable quality" means for your specific users and use case. Making that judgment before you ship — rather than after user complaints come in — is half the value of EDD.

The other half is that the eval criteria you write in step 1 force a conversation between the people building the feature and the people who care about quality outcomes. That conversation, made explicit and documented, is worth having regardless of what evaluation framework you use.
