Product

Your eval pipeline, automated on every deploy

Every prompt change, model swap, and guardrail update gets tested automatically. Fyntune runs your quality criteria on each deploy and returns a verdict, with a median run time under 90 seconds.

Start Free Trial
Read the Docs

Eval run #3102 — prompt_v51 → v52 REGRESSION

Factuality 0.921 → 0.884 ▼ -4.0%

Guardrails 0.997 → 0.961 ▼ -3.6%

Coherence 0.879 → 0.882 ▲ +0.3%

✗ Deploy blocked. 2 criteria below threshold.

Code push / PR opened

Fyntune eval hook fires

Eval suite runs (42 criteria)

Pass → deploy proceeds

Fail → block + alert team
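
The gating step in the sample run above can be sketched in a few lines of Python. This is an illustrative sketch only, not Fyntune's implementation: the `CriterionResult` class, the `gate` function, and the three threshold values are assumptions made up for the example (the sample run only says that two criteria fell below threshold, not what the thresholds are).

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    baseline: float   # score on the previous prompt version
    candidate: float  # score on the new prompt version
    threshold: float  # hypothetical minimum acceptable score (assumed)

    @property
    def relative_change_pct(self) -> float:
        # Relative change of the candidate score against the baseline.
        return (self.candidate - self.baseline) / self.baseline * 100

    @property
    def passed(self) -> bool:
        return self.candidate >= self.threshold


def gate(results):
    """Block the deploy if any criterion is below its threshold."""
    failed = [r.name for r in results if not r.passed]
    return {"deploy": not failed, "failed": failed}


# Scores copied from the sample run; thresholds are assumed for illustration.
results = [
    CriterionResult("Factuality", 0.921, 0.884, threshold=0.90),
    CriterionResult("Guardrails", 0.997, 0.961, threshold=0.99),
    CriterionResult("Coherence", 0.879, 0.882, threshold=0.85),
]
verdict = gate(results)

for r in results:
    status = "ok" if r.passed else "below threshold"
    print(f"{r.name}: {r.relative_change_pct:+.1f}% ({status})")
print("deploy proceeds" if verdict["deploy"] else "deploy blocked")
```

With these assumed thresholds, Factuality and Guardrails fail while Coherence passes, which matches the "2 criteria below threshold" verdict shown in the sample run.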

Semantic Similarity

Output drift detection across prompt versions

Measures whether your LLM's outputs remain semantically consistent across prompt versions and model swaps. This is not an exact-match check: Fyntune computes cosine similarity against your golden dataset, so a response that conveys the same information in different words passes, while genuine content drift is flagged. It is useful for detecting the slow factual drift that accumulates over successive prompt changes and doesn't show up in A/B test metrics until it becomes a support problem.

Configurable similarity threshold per feature type
Per-input breakdown, not just aggregate score
Trend view across last 30 prompt versions
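
To make the cosine-similarity check and the per-input breakdown concrete, here is a minimal Python sketch. It assumes each output has already been converted to an embedding vector (the toy 3-dimensional vectors below are invented), and the `drift_report` function and the 0.90 threshold are hypothetical names and values, not part of Fyntune's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def drift_report(golden, candidate, threshold):
    """Score every input separately instead of reporting one aggregate."""
    rows = []
    for input_id, reference_vec in golden.items():
        score = cosine_similarity(reference_vec, candidate[input_id])
        rows.append(
            {"input": input_id, "score": score, "drifted": score < threshold}
        )
    return rows


# Toy embeddings, invented for illustration.
golden = {"q1": [0.9, 0.1, 0.0], "q2": [0.2, 0.8, 0.1]}
candidate = {"q1": [0.88, 0.12, 0.01], "q2": [0.7, 0.1, 0.6]}

report = drift_report(golden, candidate, threshold=0.90)
for row in report:
    flag = "DRIFT" if row["drifted"] else "ok"
    print(f'{row["input"]}: {row["score"]:.3f} {flag}')
```

In this toy data, `q1` stays close to its golden reference and passes, while `q2` points in a clearly different direction and is flagged.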


Factuality Check

Catch hallucinations before they ship

Fyntune's factuality eval runs your LLM outputs against a reference set of verified facts. It flags responses that introduce incorrect claims or contradict source documents — the regression type most likely to generate support escalations.

Ground-truth document comparison
Claim-level flagging, not just sentence scoring
Works with any LLM — no vendor lock-in

Guardrail Compliance

Silent guardrail failures, caught automatically

Guardrail regressions are the hardest to catch manually — they often only surface on edge-case inputs that aren't in your test set. Fyntune evaluates guardrail compliance across a statistically representative sample of your production input distribution on every release.

Production input distribution sampling
Custom guardrail rule definitions via YAML
Deploy blocked on failure

guardrail eval — 847 samples

Sampling production inputs... 847 selected

Running guardrail compliance suite

Passed 813 / 847 96.0%

Failed 34 / 847 4.0% — ALERT

✗ Threshold 2.0% exceeded. Deploy blocked.
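
The arithmetic behind the run above is simple enough to sketch. The Python below is an illustration under stated assumptions: it uses plain uniform random sampling as a stand-in for "production input distribution sampling" (the page does not say which sampling method Fyntune uses), and `sample_inputs` and `guardrail_gate` are hypothetical helper names.

```python
import random


def sample_inputs(production_inputs, k, seed=0):
    """Uniform random sample without replacement (assumed sampling method)."""
    rng = random.Random(seed)
    return rng.sample(production_inputs, k)


def guardrail_gate(results, max_failure_rate):
    """results is a list of booleans: True means the guardrail held."""
    failed = sum(1 for ok in results if not ok)
    total = len(results)
    rate = failed / total
    return {
        "failed": failed,
        "total": total,
        "rate": rate,
        "deploy": rate <= max_failure_rate,
    }


# Draw 847 inputs from a stand-in pool of logged production inputs.
pool = [f"input-{i}" for i in range(10_000)]
sampled = sample_inputs(pool, 847)

# Pass/fail counts copied from the run above: 813 passed, 34 failed.
results = [True] * 813 + [False] * 34
verdict = guardrail_gate(results, max_failure_rate=0.02)

print(f'Failed {verdict["failed"]} / {verdict["total"]} '
      f'{verdict["rate"] * 100:.1f}%')
print("deploy proceeds" if verdict["deploy"] else "deploy blocked")
```

A 4.0% failure rate against a 2.0% limit blocks the deploy, as in the run shown above.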

LLM-as-Judge

Open-ended quality criteria, automated

Some quality dimensions (brand voice adherence, response completeness, appropriate hedging on uncertain claims) don't reduce to rule-based checks or cosine scores. Fyntune's LLM-as-judge evaluator lets you define criteria in plain language and uses a calibrated judge model to score outputs at scale. The judge model is independent of your production model and is calibrated against your own human-labeled samples to reduce the position bias and length preference common in off-the-shelf LLM-as-judge implementations.

Custom criteria in plain language
Calibration against human-labeled samples
Judge model independence from production model

criteria:
  - name: brand_voice
    prompt: "Does the response maintain a professional,
      direct tone without marketing language?"
    threshold: 0.85
    judge_model: claude-3-5-sonnet
  - name: helpfulness
    prompt: "Does the response directly address the
      user's question with actionable information?"
    threshold: 0.80
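
As a rough illustration of how thresholds like the ones above might be applied, and of what a basic calibration check against human labels could look like, here is a short Python sketch. The criteria names and thresholds mirror the config above; everything else (the `evaluate` and `agreement_rate` functions, the mean scores, and the label lists) is invented for the example and is not Fyntune's API or data.

```python
# Criteria names and thresholds mirror the config above.
CRITERIA = [
    {"name": "brand_voice", "threshold": 0.85},
    {"name": "helpfulness", "threshold": 0.80},
]


def evaluate(criteria, mean_scores):
    """Return pass/fail per criterion given mean judge scores in [0, 1]."""
    return {
        c["name"]: mean_scores[c["name"]] >= c["threshold"] for c in criteria
    }


def agreement_rate(judge_labels, human_labels):
    """Fraction of samples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical mean judge scores for one eval run.
mean_scores = {"brand_voice": 0.88, "helpfulness": 0.76}
outcome = evaluate(CRITERIA, mean_scores)

# Hypothetical pass (1) / fail (0) labels on five calibration samples.
agreement = agreement_rate([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])

print(outcome)
print(f"judge/human agreement: {agreement:.0%}")
```

Simple agreement is only one possible calibration signal; a real setup would likely also check for the position and length effects mentioned above.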

< 90s

Median eval run time across all 42 default criteria

8 in 10

Regressions caught by Fyntune in internal testing that a 50-case manual QA test set missed

40+

Eval criteria available out of the box — or define your own

Get started

Start catching regressions today

Free tier. Connect in under 15 minutes. No credit card required.

Start Free Trial
View Quickstart Docs