Adding LLM Evals to Your CI/CD Pipeline in an Afternoon
The argument for running LLM evals in CI is the same argument that convinced teams to run unit tests in CI a decade ago: if you don't run it automatically, you only run it when you remember to. And when you're shipping fast — pushing prompt updates, iterating on system instructions, swapping model versions — "when you remember to" is not a reliable quality gate.
The friction most teams report isn't philosophical. They agree evals should run in CI. The friction is practical: LLM evals feel different from unit tests. They take minutes instead of seconds. They call external APIs. They have non-deterministic outputs. Their pass/fail criteria are fuzzier. That friction keeps eval out of CI pipelines even when teams know it should be there.
Here's how we approach wiring evals into GitHub Actions and GitLab CI in a way that handles the LLM-specific friction points without abandoning the principle.
What Goes in CI vs. What Doesn't
The first decision is scope. Not all your evals should run on every PR. A full eval suite over 400 test cases, using a judge model for factuality and coherence scoring, might run for 8–12 minutes and cost a few dollars in API calls per run. Running that on every PR is expensive and slow enough to become friction that makes people bypass the gate.
We use a two-tier structure:
Fast tier (runs on every PR): Deterministic and near-deterministic checks only. Format compliance checks (does the output match the required JSON schema?), guardrail pattern matching (does the output contain any prohibited content patterns?), and length distribution checks. These complete in under 90 seconds and don't require API calls beyond the model inference itself. If you're using cached golden outputs, some of these run with zero additional API calls.
Full tier (runs on merge to main, or manually triggered): The complete eval suite including semantic similarity scoring, LLM-as-judge factuality and coherence evaluation, and regression comparison against the previous version baseline. This is the gate before production deploy, not the gate before every code review.
This structure keeps PR review fast while still ensuring nothing hits production without a full eval run.
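As a rough illustration of what the fast-tier checks can look like, here is a minimal Python sketch using only the standard library. The required keys, the prohibited patterns, and the length bound are hypothetical placeholders rather than part of any real suite, and a production check would more likely validate against a full JSON Schema and compare length distributions rather than a single bound.

```python
import json
import re

# Hypothetical requirements, for illustration only.
REQUIRED_KEYS = {"answer": str, "sources": list}
PROHIBITED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # SSN-like number
    re.compile(r"as an ai language model", re.I),  # boilerplate phrasing
]
MAX_CHARS = 2000


def check_format(raw: str) -> bool:
    """True if raw is a JSON object with the required keys and value types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in REQUIRED_KEYS.items()
    )


def check_guardrails(raw: str) -> bool:
    """True if no prohibited pattern appears in the output."""
    return not any(p.search(raw) for p in PROHIBITED_PATTERNS)


def check_length(raw: str) -> bool:
    """True if the output is non-empty and within the length bound."""
    return 0 < len(raw) <= MAX_CHARS


def compliance_rate(outputs, check) -> float:
    """Fraction of outputs that pass a given check."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if check(o)) / len(outputs)
```

Each check is a pure function of the output text, which is what keeps this tier cheap: it can run against cached golden outputs without any further API calls.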
GitHub Actions Setup
The basic GitHub Actions configuration for the fast tier looks like this:
name: LLM Eval — Fast Tier
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - '.fyntune/**'
jobs:
  eval-fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Fyntune CLI
        run: pip install fyntune-cli
      - name: Run fast eval tier
        env:
          FYNTUNE_API_KEY: ${{ secrets.FYNTUNE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          fyntune eval run \
            --suite fast-tier \
            --prompt-dir ./prompts \
            --fail-below guardrail_compliance=0.95 \
            --fail-below format_compliance=0.98 \
            --output-file eval-results.json
      - name: Post eval summary to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            const body = `## Eval Results\n\n${results.summary_markdown}`;
            // Await the API call so a failed comment surfaces as a step error.
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
The paths filter is important — it limits fast tier runs to PRs that actually touch prompt files or LLM-related code. Infrastructure PRs and documentation changes skip the eval step.
The --fail-below flags set explicit pass/fail thresholds. This is the configuration decision that matters most. Set them too tight and you'll be debugging spurious failures caused by model non-determinism. Set them too loose and the gate doesn't catch anything. We typically recommend starting at 0.95 for guardrail compliance and tightening over time as you understand your eval suite's variance.
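One way to get a feel for your suite's variance before committing to a threshold is to run it several times against an unchanged prompt and look at the spread of scores. The sketch below assumes you have collected those scores yourself; the function name, the sample scores, and the "mean minus two standard deviations" heuristic are our assumptions for illustration, not something the CLI does for you.

```python
from statistics import mean, stdev


def suggest_threshold(scores, k=2.0):
    """Suggest a pass threshold k sample standard deviations below the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return max(0.0, mean(scores) - k * stdev(scores))


# Hypothetical guardrail_compliance scores from five runs on an unchanged prompt.
baseline_runs = [0.97, 0.96, 0.98, 0.97, 0.96]
print(f"suggested threshold: {suggest_threshold(baseline_runs):.3f}")
```

A threshold derived this way is only a starting point: a handful of runs gives a noisy variance estimate, so it is worth revisiting as more run history accumulates.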
Handling Non-Determinism
LLM outputs are not deterministic at temperature > 0. This creates a problem for CI: the same prompt might produce slightly different outputs on different runs, causing eval scores to vary even when nothing changed.
Three mitigations:
Set temperature=0 for eval runs. This makes model outputs far more repeatable for a fixed prompt and model version, although most hosted APIs still don't guarantee bit-identical outputs across runs. You lose coverage of variance behavior, but for regression detection the goal is identifying changes, and changes are easier to detect at temperature=0.
Use score thresholds with statistical tolerance. Instead of failing on any score below threshold, fail only when the score is below (threshold - tolerance). For guardrail compliance at 0.95, set tolerance=0.03, so the build only fails if compliance drops below 0.92. This absorbs minor variance while catching real regressions.
Run multi-sample for anything you need to be precise about. For the full eval tier, run each test case 3 times and use the median score. The CI time cost is 3x, but the signal is much more reliable for evaluating prompt changes that affect the distribution of outputs.
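The tolerance and multi-sample mitigations can be sketched in a few lines of Python. This is an illustration of the logic only, with hypothetical function names; `run_case` stands in for whatever callable scores a single test case in your setup.

```python
from statistics import median


def passes_with_tolerance(score, threshold, tolerance=0.0):
    """Fail only when the score falls below (threshold - tolerance)."""
    return score >= threshold - tolerance


def median_score(run_case, samples=3):
    """Score one test case several times and return the median."""
    return median(run_case() for _ in range(samples))


# Hypothetical scores for one test case sampled three times.
canned = iter([0.90, 0.97, 0.96])
score = median_score(lambda: next(canned))
print(passes_with_tolerance(score, threshold=0.95, tolerance=0.03))  # True
```

Using the median rather than the mean means a single outlier sample (the 0.90 above) does not drag the case's score down on its own.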
GitLab CI Configuration
The GitLab CI equivalent is a YAML pipeline with two stages:
stages:
  - eval-fast
  - eval-full
eval-fast:
  stage: eval-fast
  image: python:3.11
  rules:
    - changes:
        - prompts/**/*
        - src/llm/**/*
  before_script:
    - pip install fyntune-cli
  script:
    - fyntune eval run
      --suite fast-tier
      --prompt-dir ./prompts
      --fail-below guardrail_compliance=0.95
      --format junit
      --output-file eval-junit.xml
  artifacts:
    reports:
      junit: eval-junit.xml
    paths:
      - eval-junit.xml
    expire_in: 7 days
eval-full:
  stage: eval-full
  image: python:3.11
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  before_script:
    - pip install fyntune-cli
  script:
    - fyntune eval run
      --suite full
      --prompt-dir ./prompts
      --compare-baseline main
      --fail-on-regression 0.05
      --format gitlab
  artifacts:
    paths:
      - eval-report.json
    expire_in: 30 days
The --compare-baseline main flag compares the current eval scores against the last stored baseline for the main branch. The --fail-on-regression 0.05 flag fails the build if any metric drops more than 5 percentage points from baseline. This is more nuanced than an absolute threshold: a feature that's been at 94% compliance doesn't fail at 93%, but one that drops from 97% to 91% does.
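To make the difference between a regression gate and an absolute threshold concrete, here is a minimal Python sketch of the comparison. It is not the CLI's implementation; the function name and the metric dictionaries are hypothetical, and the scores mirror the 94%/93% and 97%/91% examples above.

```python
def find_regressions(baseline, current, max_drop=0.05):
    """Return metrics whose score dropped by more than max_drop from baseline."""
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            continue  # metric missing from this run; worth flagging separately
        drop = base_score - current[metric]
        if drop > max_drop:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical baseline and current scores on a 0-1 scale.
baseline = {"guardrail_compliance": 0.94, "factuality": 0.97}
current = {"guardrail_compliance": 0.93, "factuality": 0.91}

# Only factuality is reported: it fell about 6 points, compliance only 1.
print(find_regressions(baseline, current))
```

A CI wrapper around this would exit non-zero whenever the returned dictionary is non-empty.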
What to Do When Evals Fail in CI
The most common mistake when evals first start failing in CI is treating them like flaky unit tests and adding retry logic or relaxing thresholds. Resist this. An eval failure is not a CI inconvenience — it's the system working as intended.
When an eval fails on a PR, the process should be: identify which eval cases failed, look at the outputs for those cases, and determine whether the failure is a real regression or an eval suite gap.
Real regression: the prompt change degraded behavior on that test category. Fix the prompt, or update the change note to document the intentional trade-off.
Eval suite gap: the test case is miscalibrated or the threshold is set incorrectly for that metric. Fix the test, not the threshold.
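The first triage step, finding which cases failed and where they cluster, is easy to script. The sketch below assumes each result record carries `id`, `category`, and `passed` fields; that shape is hypothetical, so adapt it to whatever your eval results file actually contains.

```python
from collections import defaultdict


def failures_by_category(results):
    """Group failed case ids by category so related failures are reviewed together."""
    grouped = defaultdict(list)
    for case in results:
        if not case["passed"]:
            grouped[case["category"]].append(case["id"])
    return dict(grouped)


# Hypothetical per-case results, for illustration only.
results = [
    {"id": "refund-01", "category": "refunds", "passed": False},
    {"id": "refund-02", "category": "refunds", "passed": False},
    {"id": "greeting-01", "category": "smalltalk", "passed": True},
    {"id": "pii-03", "category": "guardrails", "passed": False},
]
print(failures_by_category(results))
```

Failures concentrated in one category tend to point at a real regression from the prompt change, while a lone failing case is more often a candidate for a miscalibrated test.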
The discipline of investigating failures rather than routing around them is what makes the eval gate worth having. The pipeline pays dividends proportional to how seriously your team treats its failures.
A Note on Cost Management
Fast tier eval runs at reasonable scale (50–100 test cases, deterministic checks) cost under $0.50 per run when using a mid-tier model for inference. Full tier runs with LLM-as-judge scoring for 400 cases typically run $2–5. On a team pushing 20–30 PRs per week, where only the PRs that touch prompt or LLM code trigger eval and each resulting main merge gets a full run, expect on the order of $40–100/month in API costs for the eval pipeline. If every one of those PRs triggered both tiers, the same per-run prices would put the bill at several hundred dollars a month. For any team where LLM quality matters, that's a reasonable cost of doing business, not a line item to optimize away.
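The monthly total is sensitive to how many PRs actually trigger eval, so it can help to make the arithmetic explicit. The sketch below uses the per-run figures quoted above; `eval_fraction` (the share of PRs that touch eval-triggering paths) is a hypothetical parameter, and the model assumes every such PR runs the fast tier once and the full tier once on merge.

```python
WEEKS_PER_MONTH = 52 / 12


def monthly_eval_cost(prs_per_week, eval_fraction, fast_cost, full_cost):
    """Estimate monthly API spend when a fraction of PRs run both tiers once."""
    runs = prs_per_week * WEEKS_PER_MONTH * eval_fraction
    return runs * (fast_cost + full_cost)


# Compare a team where every PR triggers eval with one where a fifth do.
for fraction in (1.0, 0.2):
    low = monthly_eval_cost(20, fraction, fast_cost=0.50, full_cost=2.0)
    high = monthly_eval_cost(30, fraction, fast_cost=0.50, full_cost=5.0)
    print(f"eval_fraction={fraction}: ${low:.0f}-${high:.0f} per month")
```

Re-runs after failed builds and multi-sample scoring would push these numbers up, so treat the output as a floor rather than a budget.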
What Eval-Gated Deploy Is Not
We're not saying eval-gated deploy replaces human review of LLM behavior. Code review, user testing, and red-teaming serve different purposes. Automated evals are good at catching regressions against known criteria at scale: they can run 400 test cases in 10 minutes and tell you precisely which categories degraded. They're not good at surfacing entirely new failure modes that your test cases don't cover, or at evaluating the nuanced cultural and contextual appropriateness of outputs in ways that require human judgment.
The right framing is defense-in-depth: eval-gated CI is the first line of defense that catches the majority of quantifiable regressions before they ship. Human review is the second line, focused on the things automated eval can't reliably score. Using one to replace the other misses the point of both.
The practical payoff of getting eval into CI is not that you can skip human review. It's that human reviewers spend less time on regressions that eval already caught, and more time on the judgment calls that require human insight. The eval gate handles the routine regressions; human attention handles the novel ones.