Eval Types

Fyntune ships with 42 default eval criteria organized into five categories. Each criterion has a configurable threshold, and you can define custom criteria using the LLM-as-judge approach.

Semantic Similarity

Measures whether the LLM's output is semantically consistent with expected outputs. Uses cosine similarity in embedding space — not string matching.

When to use: Detecting prompt rewrites that shift output meaning without breaking format. Most useful for summarization, Q&A, and content generation features.

# fyntune.yaml threshold override
features:
  summarization:
    thresholds:
      semantic_similarity: 0.85  # default: 0.82
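
To make the scoring idea concrete, here is a minimal Python sketch of a cosine-similarity check against the 0.82 default. It is an illustration, not Fyntune's implementation: the function names and the toy three-dimensional vectors are assumptions, and real embeddings would come from an embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def passes_semantic_similarity(
    expected_vec: Sequence[float],
    actual_vec: Sequence[float],
    threshold: float = 0.82,  # default taken from the thresholds table
) -> bool:
    """True when the output embedding is close enough to the expected one."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold


# Toy "embeddings" (hypothetical values chosen for illustration only).
expected = [0.9, 0.1, 0.3]
close = [0.8, 0.2, 0.3]   # points in nearly the same direction
far = [0.1, 0.9, 0.0]     # points in a very different direction

print(passes_semantic_similarity(expected, close))  # True
print(passes_semantic_similarity(expected, far))    # False
```

Because the comparison depends only on vector direction, two outputs with different wording can still score high, which is why this check catches meaning shifts that string matching would miss.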

Factuality Check

Evaluates whether the LLM output contains factually incorrect claims relative to a reference document set. Flags hallucinated claims at the sentence level.

When to use: Any LLM feature that generates factual claims from a source document (RAG pipelines, document summarization, customer support answers).

features:
  support_answer:
    thresholds:
      factuality: 0.95  # strict — support answers must be accurate
    ground_truth_docs:
      - "docs/product-knowledge-base.md"
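
Sentence-level flagging can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions: the sentence splitter is naive, the score is assumed to be the fraction of supported sentences, and `naive_supported` is a hypothetical stand-in for whatever claim-verification model a real checker would use.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def factuality_score(
    output: str,
    is_supported: Callable[[str], bool],
) -> Tuple[float, List[str]]:
    """Return (fraction of supported sentences, flagged sentences)."""
    sentences = split_sentences(output)
    if not sentences:
        return 1.0, []  # assumption: an empty output makes no false claims
    flagged = [s for s in sentences if not is_supported(s)]
    return 1 - len(flagged) / len(sentences), flagged


# Hypothetical reference text standing in for the ground-truth documents.
reference = "The Pro plan costs $40 per month. Support is available on weekdays."


def naive_supported(sentence: str) -> bool:
    """Stand-in checker: a sentence counts as supported only if it appears
    verbatim (case-insensitive) in the reference."""
    return sentence.lower() in reference.lower()


answer = "The Pro plan costs $40 per month. Support is available 24/7."
score, flagged = factuality_score(answer, naive_supported)
print(score)    # 0.5
print(flagged)  # ['Support is available 24/7.']
```

With a threshold of 0.95, this answer would fail: one of its two sentences is not backed by the reference.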

Guardrail Compliance

Tests whether LLM outputs comply with your defined guardrail rules — content restrictions, topic boundaries, required disclaimers. Runs against a sample of production inputs, not just synthetic test cases.

When to use: Any production LLM feature with compliance requirements, user-facing content restrictions, or safety requirements.

features:
  chat_assistant:
    thresholds:
      guardrail_compliance: 0.99
    guardrail_rules:
      - "No PII in response"
      - "No competitor mentions"
      - "Legal disclaimer included when discussing pricing"
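
One way to read a compliance threshold such as 0.99 is as the fraction of sampled responses that satisfy every rule. The Python sketch below illustrates that reading. It is not Fyntune's rule engine: the rules are crude keyword and regex stand-ins, and "AcmeAI" is a made-up competitor name.

```python
import re
from typing import Callable, Dict, List

# A rule returns True when the response complies.
Rule = Callable[[str], bool]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

rules: Dict[str, Rule] = {
    # Crude stand-in for "No PII in response": only detects email addresses.
    "no_pii": lambda r: EMAIL.search(r) is None,
    # Stand-in for "No competitor mentions" using a hypothetical name.
    "no_competitors": lambda r: "acmeai" not in r.lower(),
}


def compliance_rate(responses: List[str], rules: Dict[str, Rule]) -> float:
    """Fraction of responses that satisfy every rule."""
    if not responses:
        return 1.0  # assumption: nothing sampled means nothing violated
    compliant = sum(all(rule(r) for rule in rules.values()) for r in responses)
    return compliant / len(responses)


responses = [
    "You can reset your password from the settings page.",
    "Email jane.doe@example.com for help.",
    "Our plan is cheaper than AcmeAI.",
    "Your export will be ready in a few minutes.",
]
rate = compliance_rate(responses, rules)
blocked = rate < 0.99  # threshold from the chat_assistant example
print(rate, blocked)   # 0.5 True
```

At a 0.99 threshold, a single violation in a sample of fewer than 100 responses is enough to fail the check, so sample size matters when choosing the threshold.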

LLM-as-Judge (custom criteria)

Define quality criteria in natural language. A calibrated judge model scores each output against your criteria. Useful for subjective dimensions — tone, brand voice, helpfulness — that rule-based checks can't capture.

When to use: Brand voice consistency, helpfulness scoring, response completeness, any dimension where you can express "what good looks like" in a sentence.

features:
  sales_email:
    custom_criteria:
      - name: professional_tone
        prompt: "Does the email maintain a professional, confident tone without being pushy?"
        threshold: 0.80
        judge_model: claude-3-5-sonnet
      - name: call_to_action
        prompt: "Does the email include a clear, specific next step for the recipient?"
        threshold: 0.90
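
The judge loop can be pictured with the following Python sketch. Several things here are assumptions rather than documented behavior: the judge returns a score between 0 and 1, per-output scores are averaged before comparison with the threshold, and `stub_judge` is a keyword placeholder where a real judge model call would go.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Union

# (criterion prompt, model output) -> score in [0, 1]
Judge = Callable[[str, str], float]


@dataclass
class Criterion:
    name: str
    prompt: str
    threshold: float


def evaluate(
    criterion: Criterion, outputs: List[str], judge: Judge
) -> Dict[str, Union[str, float, bool]]:
    """Score every output with the judge and compare the mean to the threshold."""
    scores = [judge(criterion.prompt, o) for o in outputs]
    avg = mean(scores)
    return {"name": criterion.name, "score": avg, "passed": avg >= criterion.threshold}


def stub_judge(prompt: str, output: str) -> float:
    """Placeholder for a judge model call; a keyword check only."""
    return 1.0 if "schedule a call" in output.lower() else 0.0


cta = Criterion(
    name="call_to_action",
    prompt="Does the email include a clear, specific next step for the recipient?",
    threshold=0.90,
)
emails = [
    "Thanks for your time. Can we schedule a call on Tuesday?",
    "Let me know what you think.",
]
result = evaluate(cta, emails, stub_judge)
print(result)  # {'name': 'call_to_action', 'score': 0.5, 'passed': False}
```

Keeping the judge behind a plain callable makes it easy to swap the stub for a real model client, or to replay recorded scores in tests.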

Response Coherence and Tone Consistency

Coherence measures whether the output forms a logically consistent response — no contradictions, no abrupt topic shifts. Tone consistency checks whether the response matches the expected formality and voice relative to prior outputs from the same feature.

When to use: As a baseline check on all features. Low coherence scores often point to context window issues or prompt truncation rather than a regression caused by an intentional prompt change.

Default thresholds reference

Criterion              Default threshold   Block on fail
Semantic Similarity    0.82                Yes
Factuality             0.90                Yes
Guardrail Compliance   0.98                Yes
Response Coherence     0.75                No (warn only)
Tone Consistency       0.80                No (warn only)
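
The difference between blocking and warn-only criteria can be illustrated with a short Python sketch built from the defaults above. It assumes a score passes when it is greater than or equal to the threshold. The keys `response_coherence` and `tone_consistency` are guessed by analogy with the config keys shown earlier, and the `gate` function is hypothetical.

```python
from typing import Dict, NamedTuple


class Default(NamedTuple):
    threshold: float
    block_on_fail: bool


# Values copied from the default thresholds reference.
DEFAULTS: Dict[str, Default] = {
    "semantic_similarity": Default(0.82, True),
    "factuality": Default(0.90, True),
    "guardrail_compliance": Default(0.98, True),
    "response_coherence": Default(0.75, False),  # key name is an assumption
    "tone_consistency": Default(0.80, False),    # key name is an assumption
}


def gate(scores: Dict[str, float]) -> Dict[str, str]:
    """Map each scored criterion to 'pass', 'warn', or 'block'."""
    result = {}
    for name, score in scores.items():
        default = DEFAULTS[name]
        if score >= default.threshold:
            result[name] = "pass"
        else:
            result[name] = "block" if default.block_on_fail else "warn"
    return result


outcome = gate(
    {"semantic_similarity": 0.85, "factuality": 0.88, "response_coherence": 0.70}
)
should_block = "block" in outcome.values()
print(outcome)
# {'semantic_similarity': 'pass', 'factuality': 'block', 'response_coherence': 'warn'}
print(should_block)  # True
```

Here the low coherence score only produces a warning, while the factuality score of 0.88 falls below its 0.90 default and blocks the run.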