Eval Types
Fyntune ships with 42 default eval criteria organized into five categories. Every criterion's threshold is configurable. You can also define custom criteria using the LLM-as-judge approach.
Semantic Similarity
Measures whether the LLM's output is semantically consistent with the expected output. The score is the cosine similarity between the two in embedding space, not a string match.
When to use: Detecting prompt rewrites that shift output meaning without breaking format. Most useful for summarization, Q&A, and content generation features.
# fyntune.yaml threshold override
features:
  summarization:
    thresholds:
      semantic_similarity: 0.85  # default: 0.82
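The cosine-similarity check described above can be sketched in a few lines. This is an illustration only, not Fyntune's internal implementation: the function names and the toy three-dimensional vectors are assumptions, and real vectors would come from an embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def passes_similarity(expected_vec: list[float], output_vec: list[float],
                      threshold: float = 0.82) -> bool:
    """Compare the score to a threshold; 0.82 mirrors the documented default."""
    return cosine_similarity(expected_vec, output_vec) >= threshold

# Toy vectors standing in for real embeddings (hypothetical values).
expected = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]   # points in nearly the same direction
far = [0.0, 1.0, 0.0]     # orthogonal to the expected vector

print(passes_similarity(expected, close))
print(passes_similarity(expected, far))
```

Because the score depends only on the direction of the vectors, two outputs with different wording can still score highly if their embeddings point the same way.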
Factuality Check
Evaluates whether the LLM output contains factually incorrect claims relative to a reference document set. Flags hallucinated claims at the sentence level.
When to use: Any LLM feature that generates factual claims from a source document (RAG pipelines, document summarization, customer support answers).
features:
  support_answer:
    thresholds:
      factuality: 0.95  # strict: support answers must be accurate
    ground_truth_docs:
      - "docs/product-knowledge-base.md"
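One way to picture sentence-level flagging is the sketch below. How Fyntune verifies a sentence against the ground truth documents is not described here, so `is_supported` is a hypothetical stand-in for that step, and scoring as the fraction of supported sentences is an assumption made for illustration.

```python
import re
from typing import Callable

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def factuality_score(output: str,
                     is_supported: Callable[[str], bool]) -> tuple[float, list[str]]:
    """Return (fraction of supported sentences, list of flagged sentences)."""
    sentences = split_sentences(output)
    if not sentences:
        return 1.0, []
    flagged = [s for s in sentences if not is_supported(s)]
    return 1 - len(flagged) / len(sentences), flagged

# Hypothetical reference facts; a real checker would compare against documents.
reference_facts = {
    "The Pro plan includes SSO.",
    "Support is available 24/7.",
}
answer = ("The Pro plan includes SSO. Support is available 24/7. "
          "Refunds are issued within 2 days.")

score, flagged = factuality_score(answer, lambda s: s in reference_facts)
print(round(score, 2), flagged)
```

With a threshold of 0.95, this answer would fail: one of its three sentences has no support in the reference set.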
Guardrail Compliance
Tests whether LLM outputs comply with your defined guardrail rules — content restrictions, topic boundaries, required disclaimers. Runs against a sample of production inputs, not just synthetic test cases.
When to use: Any production LLM feature with compliance requirements, user-facing content restrictions, or safety requirements.
features:
  chat_assistant:
    thresholds:
      guardrail_compliance: 0.99
    guardrail_rules:
      - "No PII in response"
      - "No competitor mentions"
      - "Legal disclaimer included when discussing pricing"
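A compliance score over a sample of outputs can be read as a pass rate, as in the sketch below. Fyntune takes rules as natural-language strings and its evaluation method is not described here, so the regex and keyword checks, the disclaimer phrase, and the function names are all simplified, hypothetical stand-ins.

```python
import re
from typing import Callable

# A rule returns True when the output complies.
Rule = Callable[[str], bool]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def no_email_address(output: str) -> bool:
    """Crude stand-in for "No PII in response": reject any email address."""
    return EMAIL.search(output) is None

def disclaimer_when_pricing(output: str) -> bool:
    """Require a (hypothetical) disclaimer phrase whenever pricing comes up."""
    text = output.lower()
    mentions_pricing = "pric" in text  # matches "price", "prices", "pricing"
    return not mentions_pricing or "prices subject to change" in text

def compliance_rate(outputs: list[str], rules: list[Rule]) -> float:
    """Fraction of outputs that satisfy every rule."""
    if not outputs:
        raise ValueError("need at least one output to score")
    compliant = sum(all(rule(o) for rule in rules) for o in outputs)
    return compliant / len(outputs)

rules = [no_email_address, disclaimer_when_pricing]
sample_outputs = [
    "Our team can help you set up SSO.",
    "Pricing starts at $10 per seat. Prices subject to change.",
    "Contact jane@example.com for a quote.",   # violates the PII rule
    "The price is $10 per seat.",              # pricing without disclaimer
]

rate = compliance_rate(sample_outputs, rules)
print(rate, rate >= 0.99)
```

An output counts as compliant only if it passes every rule, which is why a threshold close to 1.0 leaves room for very few violations in the sample.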
LLM-as-Judge (custom criteria)
Define quality criteria in natural language. A calibrated judge model scores each output against your criteria. Useful for subjective dimensions — tone, brand voice, helpfulness — that rule-based checks can't capture.
When to use: Brand voice consistency, helpfulness scoring, response completeness, any dimension where you can express "what good looks like" in a sentence.
features:
  sales_email:
    custom_criteria:
      - name: professional_tone
        prompt: "Does the email maintain a professional, confident tone without being pushy?"
        threshold: 0.80
        judge_model: claude-3-5-sonnet
      - name: call_to_action
        prompt: "Does the email include a clear, specific next step for the recipient?"
        threshold: 0.90
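The sketch below shows the general shape of a judge-based check: each output is scored against the criterion prompt, and the aggregate is compared to the threshold. The `Judge` signature, the averaging step, and `stub_judge` are assumptions for illustration. A real judge would call a model, and Fyntune's own aggregation may differ.

```python
from statistics import mean
from typing import Callable

# (criterion_prompt, output) -> score in [0, 1]; a hypothetical interface.
Judge = Callable[[str, str], float]

def evaluate_criterion(outputs: list[str], prompt: str,
                       threshold: float, judge: Judge) -> dict:
    """Score every output with the judge and compare the mean to the threshold."""
    scores = [judge(prompt, output) for output in outputs]
    avg = mean(scores)
    return {"mean_score": avg, "passed": avg >= threshold}

def stub_judge(prompt: str, output: str) -> float:
    """Placeholder judge: a keyword check standing in for a model call."""
    return 1.0 if "schedule a call" in output.lower() else 0.0

emails = [
    "Thanks for your time today. Could we schedule a call on Thursday at 10am?",
    "Thanks for your time today. Let me know what you think.",
]

result = evaluate_criterion(
    emails,
    prompt="Does the email include a clear, specific next step for the recipient?",
    threshold=0.90,
    judge=stub_judge,
)
print(result)
```

Passing the judge in as a function keeps the scoring logic testable with a stub before any model is wired in.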
Response Coherence and Tone Consistency
Coherence measures whether the output forms a logically consistent response — no contradictions, no abrupt topic shifts. Tone consistency checks whether the response matches the expected formality and voice relative to prior outputs from the same feature.
When to use: As a baseline check on all features. Low coherence scores often point to context window issues or prompt truncation rather than a regression introduced by a deliberate prompt change.
Default thresholds reference
| Criterion | Default threshold | Block on fail |
|---|---|---|
| Semantic Similarity | 0.82 | Yes |
| Factuality | 0.90 | Yes |
| Guardrail Compliance | 0.98 | Yes |
| Response Coherence | 0.75 | No (warn only) |
| Tone Consistency | 0.80 | No (warn only) |
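The table above can be read as a small gating rule: a failed blocking criterion blocks, a failed warn-only criterion warns. The sketch below encodes that reading. The `gate` function and the keys `response_coherence` and `tone_consistency` are assumed names, and this is not Fyntune's actual gate.

```python
# criterion -> (default threshold, blocks on fail), copied from the table above.
DEFAULTS = {
    "semantic_similarity": (0.82, True),
    "factuality": (0.90, True),
    "guardrail_compliance": (0.98, True),
    "response_coherence": (0.75, False),   # warn only
    "tone_consistency": (0.80, False),     # warn only
}

def gate(scores: dict[str, float]) -> tuple[str, list[str], list[str]]:
    """Return (status, blocking failures, warn-only failures) for a set of scores."""
    blocking, warnings = [], []
    for name, score in scores.items():
        threshold, blocks = DEFAULTS[name]
        if score < threshold:
            (blocking if blocks else warnings).append(name)
    if blocking:
        status = "block"
    elif warnings:
        status = "warn"
    else:
        status = "pass"
    return status, blocking, warnings

print(gate({"semantic_similarity": 0.90, "tone_consistency": 0.70}))
print(gate({"factuality": 0.85, "response_coherence": 0.50}))
```

A score exactly equal to its threshold passes in this sketch. Whether Fyntune treats the boundary as inclusive is not stated in the table.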