Changelog
What's new in Fyntune
Release notes, new features, and fixes — newest first.
Multi-model comparison and eval batch API
- NEW Eval batch API — run the same criteria set across multiple models in one request. Returns side-by-side delta scores.
- NEW Dashboard comparison view for multi-model evals: overlaid score timelines per model.
- IMPROVED Eval run latency reduced by 18% via eval worker pool optimization.
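For readers unfamiliar with delta scores, the following is a minimal sketch of how side-by-side deltas across models could be computed. The data shape, the function name `delta_scores`, and the idea of measuring every model against one chosen baseline are assumptions for illustration; this is not the Fyntune API or its response format.

```python
from typing import Dict

# Hypothetical data shape: model name -> {criterion name -> score}.
Scores = Dict[str, Dict[str, float]]


def delta_scores(results: Scores, baseline: str) -> Scores:
    """Return each model's per-criterion score difference from the baseline model."""
    base = results[baseline]
    deltas: Scores = {}
    for model, scores in results.items():
        if model == baseline:
            continue
        # Only compare criteria that both models were scored on.
        deltas[model] = {
            name: round(scores[name] - base[name], 4)
            for name in scores
            if name in base
        }
    return deltas


if __name__ == "__main__":
    example = {
        "model-a": {"factuality": 0.80, "coherence": 0.90},
        "model-b": {"factuality": 0.85, "coherence": 0.70},
    }
    print(delta_scores(example, baseline="model-a"))
```

A positive delta means the model scored higher than the baseline on that criterion; a negative delta means it scored lower.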
LLM-as-judge calibration tooling
- NEW Calibration wizard: upload 20+ human-labeled samples to calibrate your LLM-as-judge criteria against human judgment.
- NEW Calibration score displayed per criterion in the dashboard — shows agreement rate with human labels.
- FIX Resolved a race condition in concurrent eval runs that could produce stale delta scores during high-frequency deploys.
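The agreement rate mentioned above can be illustrated with a short sketch. It assumes the simplest definition, the fraction of samples where the judge's label exactly matches the human label; the function name and label values are hypothetical, and the calibration score shown in the dashboard may be computed differently.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of samples where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled sample is required")
    # True counts as 1, so the sum is the number of exact matches.
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    judge = ["pass", "fail", "pass", "pass"]
    human = ["pass", "fail", "fail", "pass"]
    print(agreement_rate(judge, human))
```

Raw agreement does not correct for agreement that would occur by chance, so a criterion with heavily imbalanced labels can show a high rate even when the judge is uninformative.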
Vercel integration and overage alerts
- NEW Official Vercel deploy hook integration. Eval runs trigger automatically on Vercel preview and production deployments.
- NEW Eval run usage alerts: email notification at 80% and 100% of the monthly limit. Optional Slack alert.
- IMPROVED TypeScript SDK: added full type coverage for eval result objects and criteria config.
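As an illustration of the usage alerts in this release, here is a minimal sketch of threshold-crossing logic that fires each alert once, at the moment usage passes 80% or 100% of the limit. Only the two percentages come from the release note; the function name, its parameters, and the fire-once behaviour are assumptions, not a description of how Fyntune implements alerts.

```python
from typing import List

# Percentages taken from the release note; everything else here is hypothetical.
ALERT_THRESHOLDS_PCT = (80, 100)


def newly_crossed(previous_runs: int, current_runs: int, monthly_limit: int) -> List[int]:
    """Return the alert percentages crossed when usage moves from previous_runs to current_runs."""
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    crossed = []
    for pct in ALERT_THRESHOLDS_PCT:
        # Integer arithmetic avoids floating-point edge cases at the exact cutoff.
        cutoff = pct * monthly_limit
        if previous_runs * 100 < cutoff <= current_runs * 100:
            crossed.append(pct)
    return crossed


if __name__ == "__main__":
    print(newly_crossed(previous_runs=790, current_runs=805, monthly_limit=1000))
```

Comparing the previous and current counts, rather than checking only the current count, keeps an alert from being sent again on every run after the threshold has been passed.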
Custom YAML criteria and guardrail compliance eval
- NEW Custom criteria in fyntune.yaml: define rule-based or LLM-as-judge criteria alongside the default suite.
- NEW Guardrail compliance eval type: samples inputs from the production input distribution, with a configurable sample size.
- IMPROVED GitLab CI integration plugin updated to support GitLab 17.x pipeline syntax.
GitHub Actions integration and Slack alerts
- NEW fyntune-ai/eval-action@v2 GitHub Actions integration. Blocks PR merges on regression and posts an inline PR comment showing delta scores.
- NEW Slack webhook integration: regression alerts with criterion deltas and direct link to eval run.
- FIX Python SDK: resolved @track decorator incompatibility with async LLM call functions.
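For context on the async fix, the sketch below shows one common way a decorator can support both sync and async functions: detect coroutine functions with `inspect.iscoroutinefunction` and await them inside an async wrapper. This is a generic Python pattern, not Fyntune's actual `@track` implementation; the `CALL_LOG` list, the recorded fields, and `fake_llm_call` are hypothetical stand-ins.

```python
import asyncio
import functools
import inspect
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory log; a real SDK would send records elsewhere.
CALL_LOG: List[Dict[str, Any]] = []


def track(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record call durations for both sync and async functions."""

    def record(started: float) -> None:
        CALL_LOG.append({"name": func.__name__, "seconds": time.perf_counter() - started})

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
            started = time.perf_counter()
            try:
                # Await inside the wrapper so the timing covers the real call,
                # not just the creation of the coroutine object.
                return await func(*args, **kwargs)
            finally:
                record(started)
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args: Any, **kwargs: Any) -> Any:
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record(started)
    return sync_wrapper


@track
async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for a network request
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(fake_llm_call("hello")))
    print(CALL_LOG[-1]["name"])
```

A wrapper that is always synchronous would return an un-awaited coroutine and record its timing before the call had run, which is the kind of incompatibility this pattern avoids.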
Prompt version tracking and delta dashboard
- NEW Prompt version tagging via the CLI: the fyntune prompt tag command stores the prompt diff and links eval results to specific prompt file versions.
- NEW Dashboard delta view: side-by-side eval scores for any two prompt versions with per-criterion breakdown.
- IMPROVED Factuality eval: ground truth comparison now supports multi-document context.
Initial release — Fyntune eval platform
- NEW Python SDK with @track decorator for OpenAI, Anthropic, and any REST LLM API.
- NEW Default eval suite: 42 criteria covering semantic similarity, factuality, coherence, and tone consistency.
- NEW Web dashboard with eval run history, per-criterion scores, and quality timeline charts.
- NEW Starter (free) and Team ($149/mo) pricing tiers.