Product

Your eval pipeline, automated on every deploy

Every prompt change, model swap, and guardrail update gets tested automatically. Fyntune runs your quality criteria and returns a verdict with a median run time under 90 seconds.

Eval run #3102 — prompt_v51 → v52 REGRESSION
Factuality 0.921 → 0.884 ▼ -4.0%
Guardrails 0.997 → 0.961 ▼ -3.6%
Coherence 0.879 → 0.882 ▲ +0.3%
✗ Deploy blocked. 2 criteria below threshold.
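The percentage deltas in a run summary like this one can be reproduced by hand. The sketch below assumes they are relative changes against the previous prompt version, which matches the figures shown; `relative_change` and `format_deltas` are illustrative names, not part of any Fyntune API.

```python
def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0


# Scores copied from eval run #3102 (prompt_v51 -> v52).
run_3102 = {
    "Factuality": (0.921, 0.884),
    "Guardrails": (0.997, 0.961),
    "Coherence": (0.879, 0.882),
}


def format_deltas(scores):
    """Render one summary line per criterion, in the style of the run report."""
    lines = []
    for name, (before, after) in scores.items():
        delta = relative_change(before, after)
        arrow = "▲" if delta > 0 else "▼"
        lines.append(f"{name} {before:.3f} → {after:.3f} {arrow} {delta:+.1f}%")
    return lines


for line in format_deltas(run_3102):
    print(line)
```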

Code push to verdict in five steps

Fyntune sits inline with your deployment pipeline. No infrastructure changes needed.

01
Code push / PR opened
02
Fyntune eval hook fires
03
Eval suite runs (42 criteria)
04
Pass → deploy proceeds
05
Fail → block + alert team
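
Steps 03 to 05 amount to a threshold gate. As a rough sketch of how such a gate could work in a CI job (not Fyntune's actual implementation): each criterion carries a minimum score, and any miss yields a non-zero exit code that blocks the deploy. The class and function names and the threshold values below are hypothetical; only the scores come from eval run #3102.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    score: float
    threshold: float  # assumed meaning: minimum acceptable score

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def gate(results):
    """Return (ok, failures); ok is True only if every criterion passes."""
    failures = [r for r in results if not r.passed]
    return (not failures, failures)


def main(results) -> int:
    """Print a verdict and return a CI-style exit code (0 = proceed, 1 = block)."""
    ok, failures = gate(results)
    if ok:
        print("✓ All criteria passed. Deploy proceeds.")
        return 0
    print(f"✗ Deploy blocked. {len(failures)} criteria below threshold.")
    for r in failures:
        print(f"  {r.name}: {r.score:.3f} < {r.threshold:.3f}")
    return 1


# Scores from eval run #3102; thresholds are made up for illustration.
example = [
    CriterionResult("Factuality", 0.884, 0.90),
    CriterionResult("Guardrails", 0.961, 0.99),
    CriterionResult("Coherence", 0.882, 0.85),
]
exit_code = main(example)
```

In a real pipeline the exit code would be passed to the CI runner (for example via `sys.exit(exit_code)`) so that a failing eval stops the job.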

Four eval types. Every deploy.

Semantic similarity, factuality, guardrail compliance, and LLM-as-judge: these are the four failure-mode categories that account for the vast majority of production LLM quality incidents we've seen across teams using Fyntune. Fyntune is not a general-purpose testing framework. Each eval type is purpose-built for the specific ways LLMs degrade when prompts or models change.

Semantic Similarity

Output drift detection across prompt versions

Measures whether your LLM's outputs remain semantically consistent across prompt versions and model swaps. This is not exact-match testing: Fyntune computes cosine similarity between each output and the corresponding reference output in your golden dataset, so a response that conveys the same information in different words passes, while genuine content drift is flagged. It is useful for detecting the slow drift that accumulates over successive prompt changes and doesn't show up in A/B test metrics until it has become a support problem.

  • Configurable similarity threshold per feature type
  • Per-input breakdown, not just aggregate score
  • Trend view across last 30 prompt versions
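
To make the cosine-similarity check and the per-input breakdown concrete, here is a minimal sketch. It assumes outputs have already been turned into embedding vectors by some embedding model (the page does not say which one Fyntune uses); the toy three-dimensional vectors, the 0.8 threshold, and the function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def drift_report(golden, candidate, threshold):
    """Per-input breakdown: a list of (input_id, similarity, passed)."""
    report = []
    for input_id, golden_vec in golden.items():
        sim = cosine_similarity(golden_vec, candidate[input_id])
        report.append((input_id, sim, sim >= threshold))
    return report


# Toy embeddings: q1 stays close to its golden output, q2 drifts away.
golden = {"q1": [1.0, 0.0, 0.0], "q2": [0.0, 1.0, 0.0]}
candidate = {"q1": [0.9, 0.1, 0.0], "q2": [0.0, 0.1, 1.0]}

report = drift_report(golden, candidate, threshold=0.8)
for input_id, sim, passed in report:
    print(f"{input_id}: similarity={sim:.3f} {'pass' if passed else 'DRIFT'}")
```

Reporting per input rather than a single average matters because one badly drifted input can hide behind many near-identical ones in an aggregate score.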
Factuality Check

Catch hallucinations before they ship

Fyntune's factuality eval runs your LLM outputs against a reference set of verified facts. It flags responses that introduce incorrect claims or contradict source documents — the regression type most likely to generate support escalations.

  • Ground-truth document comparison
  • Claim-level flagging, not just sentence scoring
  • Works with any LLM — no vendor lock-in
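
Claim-level flagging can be pictured as follows. This is a deliberately simplified sketch, not Fyntune's method: it assumes claims have already been extracted from the response as key/value pairs by an upstream step, and it compares them against a small reference dictionary. All names and data here are hypothetical.

```python
def flag_claims(claims, reference):
    """Label each (key, value) claim as supported, contradicted, or unverified."""
    flags = []
    for key, value in claims:
        if key not in reference:
            status = "unverified"    # the reference set says nothing about it
        elif reference[key] == value:
            status = "supported"
        else:
            status = "contradicted"  # conflicts with a verified fact
        flags.append((key, value, status))
    return flags


# Hypothetical reference facts and claims extracted from one response.
reference = {"refund_window_days": 30, "support_hours": "9-5 ET"}
claims = [
    ("refund_window_days", 60),
    ("support_hours", "9-5 ET"),
    ("free_shipping", True),
]

flags = flag_claims(claims, reference)
for key, value, status in flags:
    print(f"{key}={value!r}: {status}")
```

Scoring at the claim level means a response with one wrong number is flagged on that number specifically, instead of the whole sentence receiving a vague low score.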
Guardrail Compliance

Silent guardrail failures, caught automatically

Guardrail regressions are the hardest to catch manually — they often only surface on edge-case inputs that aren't in your test set. Fyntune evaluates guardrail compliance across a statistically representative sample of your production input distribution on every release.

  • Production input distribution sampling
  • Custom guardrail rule definitions via YAML
  • Deploy blocked automatically on failure
guardrail eval — 847 samples
Sampling production inputs... 847 selected
Running guardrail compliance suite
 
Passed 813 / 847 96.0%
Failed 34 / 847 4.0% — ALERT
 
✗ Threshold 2.0% exceeded. Deploy blocked.
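
The verdict in this run follows from simple arithmetic. The sketch below assumes the 2.0% threshold is a maximum allowed failure rate, which is consistent with the output shown; `guardrail_verdict` is an illustrative name.

```python
def guardrail_verdict(passed: int, total: int, max_failure_rate: float):
    """Return (failed_count, failure_rate, blocked) for a guardrail run."""
    if total <= 0:
        raise ValueError("total must be positive")
    failed = total - passed
    failure_rate = failed / total
    return failed, failure_rate, failure_rate > max_failure_rate


# Numbers from the run above: 813 of 847 samples passed, threshold 2.0%.
failed, rate, blocked = guardrail_verdict(passed=813, total=847, max_failure_rate=0.02)
print(f"Failed {failed} / 847 ({rate:.1%})")
print("Deploy blocked." if blocked else "Deploy proceeds.")
```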
LLM-as-Judge

Open-ended quality criteria, automated

Some quality dimensions, such as brand voice adherence, response completeness, and appropriate hedging on uncertain claims, don't reduce to rule-based checks or cosine scores. Fyntune's LLM-as-judge evaluator lets you define criteria in plain language and uses a calibrated judge model to score outputs at scale. The judge model is independent of your production model and is calibrated against your own human-labeled samples to reduce the position bias and length preference that are common in off-the-shelf LLM-as-judge implementations.

  • Custom criteria in plain language
  • Calibration against human-labeled samples
  • Judge model independence from production model
criteria:
  # Each criterion pairs a plain-language question with a pass threshold.
  - name: brand_voice
    prompt: "Does the response maintain a professional,
      direct tone without marketing language?"
    threshold: 0.85
    judge_model: claude-3-5-sonnet
  # No judge_model is set for this criterion in the example.
  - name: helpfulness
    prompt: "Does the response directly address the
      user's question with actionable information?"
    threshold: 0.80
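
One simple way to check a judge against human-labeled samples is to measure how often its pass/fail verdict matches the human label. The page does not describe Fyntune's calibration procedure, so treat this as an illustrative sketch only; the scores, labels, and function name are hypothetical.

```python
def agreement_rate(judge_scores, human_labels, threshold):
    """Fraction of samples where the judge's pass/fail matches the human label."""
    if not judge_scores or len(judge_scores) != len(human_labels):
        raise ValueError("need equal-length, non-empty score and label lists")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)


# Hypothetical judge scores and human pass/fail labels for brand_voice.
judge_scores = [0.91, 0.88, 0.62, 0.86, 0.79]
human_labels = [True, True, False, False, False]

rate = agreement_rate(judge_scores, human_labels, threshold=0.85)
print(f"Judge agrees with human labels on {rate:.0%} of samples")
```

A low agreement rate would suggest rewording the criterion prompt or adjusting the threshold before trusting the judge to block deploys.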
< 90s
Median eval run time across all 42 default criteria
8 in 10
Regressions missed by a 50-case manual QA test set that Fyntune caught in internal testing
40+
Eval criteria available out of the box — or define your own

Start catching regressions today

Free tier. Connect in under 15 minutes. No credit card required.