Changelog

What's new in Fyntune

Release notes, new features, and fixes — newest first.

Multi-model comparison and eval batch API
  • NEW Eval batch API — run the same criteria set across multiple models in one request. Returns side-by-side delta scores.
  • NEW Dashboard comparison view for multi-model evals: overlaid score timelines per model.
  • IMPROVED Eval run latency reduced by 18% via eval worker pool optimization.
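
To illustrate what "side-by-side delta scores" could look like, here is a minimal sketch that subtracts a baseline model's per-criterion scores from every other model's scores. The score layout, model names, and the `delta_scores` helper are all assumptions made for illustration; they are not the Fyntune API or its response format.

```python
from typing import Dict

# Hypothetical layout: model name -> criterion name -> score.
# This is NOT the documented response shape of the eval batch API.
Scores = Dict[str, Dict[str, float]]


def delta_scores(scores: Scores, baseline: str) -> Scores:
    """Return each model's per-criterion score minus the baseline model's score."""
    base = scores[baseline]
    return {
        model: {
            criterion: round(score - base[criterion], 4)
            for criterion, score in per_criterion.items()
            if criterion in base  # only compare criteria both models were scored on
        }
        for model, per_criterion in scores.items()
        if model != baseline
    }


# Made-up example scores for two placeholder models.
scores = {
    "model-a": {"factuality": 0.82, "coherence": 0.90},
    "model-b": {"factuality": 0.87, "coherence": 0.85},
}
deltas = delta_scores(scores, baseline="model-a")
```

A positive delta means the model scored higher than the baseline on that criterion; a negative delta means it scored lower.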

LLM-as-judge calibration tooling
  • NEW Calibration wizard: upload 20+ human-labeled samples to calibrate your LLM-as-judge criteria against human judgment.
  • NEW Calibration score displayed per criterion in the dashboard — shows agreement rate with human labels.
  • FIX Race condition in concurrent eval runs that could produce stale delta scores for high-frequency deploys.
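
The calibration score above is described as an agreement rate with human labels. One simple way to read that is the fraction of samples where the judge and the human chose the same label. The sketch below assumes exactly that definition. Fyntune may compute agreement differently (for example with a chance-corrected statistic), and the function name and label values are hypothetical.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of samples where the judge's label matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled sample")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Made-up labels for one criterion; a real calibration set would have 20+ samples.
human = ["pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(judge, human)  # 3 of 4 labels match
```

Raw agreement is easy to interpret but can look high when one label dominates the sample set, which is one reason a larger and more balanced set of human labels is useful.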

Vercel integration and overage alerts
  • NEW Official Vercel deploy hook integration. Eval runs trigger automatically on Vercel preview and production deployments.
  • NEW Eval run usage alerts: email notifications at 80% and 100% of the monthly limit, with an optional Slack alert.
  • IMPROVED TypeScript SDK: added full type coverage for eval result objects and criteria config.
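
The usage alerts above fire at 80% and 100% of the monthly limit. As a rough sketch of that threshold logic, the function below reports which thresholds a usage update crosses, so that each alert would be sent once rather than on every run. Only the 80% and 100% levels come from the entry above; the function, its parameters, and the once-per-crossing behavior are assumptions.

```python
from typing import List

THRESHOLDS_PCT = (80, 100)  # alert levels named in the changelog entry


def crossed_thresholds(used_before: int, used_after: int, limit: int) -> List[int]:
    """Return the percentage thresholds crossed when usage rises from used_before to used_after."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Integer arithmetic avoids floating-point edge cases at the exact boundary.
    return [
        pct
        for pct in THRESHOLDS_PCT
        if used_before * 100 < pct * limit <= used_after * 100
    ]


# Made-up usage numbers against a hypothetical limit of 1000 eval runs.
first = crossed_thresholds(790, 805, 1000)    # crosses the 80% level
second = crossed_thresholds(805, 900, 1000)   # crosses nothing new
third = crossed_thresholds(990, 1000, 1000)   # reaches the 100% level
```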

Custom YAML criteria and guardrail compliance eval
  • NEW Custom criteria in fyntune.yaml: define rule-based or LLM-as-judge criteria alongside the default suite.
  • NEW Guardrail compliance eval type: samples the production input distribution with a configurable sample size.
  • IMPROVED GitLab CI integration plugin updated to support GitLab 17.x pipeline syntax.

GitHub Actions integration and Slack alerts
  • NEW fyntune-ai/eval-action@v2 GitHub Actions integration. Blocks PR merges on regression and posts an inline PR comment showing delta scores.
  • NEW Slack webhook integration: regression alerts with criterion deltas and direct link to eval run.
  • FIX Python SDK: resolved @track decorator incompatibility with async LLM call functions.
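
For context on the @track fix above: a tracking decorator that wraps every function in a plain synchronous wrapper will typically mis-time async functions, because calling one only creates a coroutine and the real work happens when it is awaited. A common pattern is to detect coroutine functions and return an async wrapper for them. The sketch below shows that general pattern using only the standard library. The `track` defined here is a stand-in written for this example, not the Fyntune SDK's implementation, and the `calls` list is a placeholder for wherever a real SDK would send its records.

```python
import asyncio
import functools
import inspect
import time

calls = []  # placeholder sink for call records (hypothetical)


def track(fn):
    """Record the name and duration of each call; handles sync and async functions."""
    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                # Await inside the wrapper so the timing covers the real work.
                return await fn(*args, **kwargs)
            finally:
                calls.append((fn.__name__, time.perf_counter() - start))
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            calls.append((fn.__name__, time.perf_counter() - start))
    return sync_wrapper


@track
async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0)  # stands in for an awaited API request
    return prompt.upper()


result = asyncio.run(fake_llm_call("hello"))
```

Because the async branch returns a coroutine function, the decorated function can still be awaited and passed to `asyncio.run` or `asyncio.gather` like the original.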

Prompt version tracking and delta dashboard
  • NEW Prompt version tagging via CLI: the fyntune prompt tag command stores the diff and links eval results to specific prompt file versions.
  • NEW Dashboard delta view: side-by-side eval scores for any two prompt versions with per-criterion breakdown.
  • IMPROVED Factuality eval: ground truth comparison now supports multi-document context.

Initial release: Fyntune eval platform
  • NEW Python SDK with @track decorator for OpenAI, Anthropic, and any REST LLM API.
  • NEW Default eval suite: 42 criteria covering semantic similarity, factuality, coherence, and tone consistency.
  • NEW Web dashboard with eval run history, per-criterion scores, and quality timeline charts.
  • NEW Starter (free) and Team ($149/mo) pricing tiers.