
Technical | November 4, 2025 | 9 min read

MLflow for Experiment Tracking, Cognify for Compliance: Why We Use Both

Callum Reeves

Lead ML Infrastructure Engineer, Cognify

The question comes up in nearly every enterprise conversation: "We already use MLflow. Why would we add another tool?" It's a fair question. MLflow is mature, widely deployed, well-documented, and does exactly what it says. The answer isn't that MLflow is inadequate — it's that MLflow and Cognify solve different problems for different audiences, and in a regulated enterprise, both problems need to be solved.

This post is an honest side-by-side: what MLflow does well, what it doesn't do, what Cognify adds, and when you actually need both versus when MLflow alone might be sufficient.

What MLflow does well

MLflow is designed around four capabilities: experiment tracking, model registry, model packaging, and model serving. Experiment tracking is the most widely used of the four: you log parameters, metrics, and artifacts to runs within experiments, compare runs across a project, and reproduce a run from its logged configuration.

For ML engineers, MLflow delivers real value:

Free-form logging — log any parameter or metric with minimal schema constraints
Run comparison — side-by-side comparison of runs with parameter and metric diffs
Artifact storage — attach arbitrary files (model weights, plots, datasets) to runs
Model registry — register model versions with stage transitions (Staging → Production)
Run reproduction — logged parameters and artifacts provide enough context to reproduce a run
Integrations — native integrations with PyTorch, TensorFlow, scikit-learn, and most major frameworks

MLflow's model registry in particular has become useful for teams trying to manage the lifecycle of production models: tracking which version is deployed, when it was promoted, and who made the promotion decision.

Where MLflow stops

The design choices that make MLflow flexible for engineers also limit its use as a compliance record:

Mutable records. MLflow run records can be updated after creation. Tags can be added or overwritten, and run notes can be edited. Parameters are the exception: MLflow rejects an attempt to log a different value under a key that already exists. For a compliance record, this means the current state of an MLflow run might not reflect what was true when the run completed, because someone could have updated a tag or note after the fact. MLflow doesn't maintain a change history of run modifications.

No immutability guarantee. There is no write-once, append-only mode in MLflow. The MLflow tracking server backend (whether SQLite, PostgreSQL, or a cloud-managed store) is a standard database with standard write semantics. A DBA with database access can modify records directly. The application layer enforces no cryptographic tamper detection.
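To make "cryptographic tamper detection" concrete, here is a minimal sketch of a hash-chained, append-only log using only the Python standard library. It illustrates the general technique, not Cognify's implementation; the function names and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys) so the same payload always hashes the same.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append a record whose hash covers the payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edited or reordered entry fails the check."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"event": "run_completed", "run": "example-run", "f1": 0.91})
append_entry(log, {"event": "tag_set", "run": "example-run", "reviewed": True})
assert verify_chain(log)

# A direct edit to a stored payload (the DBA scenario) breaks verification.
log[0]["payload"]["f1"] = 0.99
assert not verify_chain(log)
```

One limitation of this sketch: dropping entries from the end of the chain goes unnoticed unless the latest hash is also stored somewhere the writer cannot reach.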

No compliance-oriented export format. MLflow's export capabilities are designed for ML engineers: export a run as JSON for reimport into another MLflow instance, or export the model artifact in MLflow's model format. Neither export is structured as a compliance document. Producing a compliance-ready audit package from MLflow data requires custom scripting.
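As a rough idea of what that custom scripting involves, the sketch below arranges raw run data into fixed sections and attaches a content digest. It assumes the run has already been pulled out of MLflow into a plain dict; the dict shape, the section names, and build_audit_package are all hypothetical, and a real package would follow whatever template your reviewers require.

```python
import hashlib
import json
from datetime import datetime, timezone

REQUIRED_KEYS = ("run_id", "params", "metrics", "tags")  # assumed input shape


def build_audit_package(run: dict, generated_at: datetime) -> dict:
    """Arrange raw run data into fixed sections and attach a content digest."""
    missing = [key for key in REQUIRED_KEYS if key not in run]
    if missing:
        raise ValueError(f"run data is missing keys: {missing}")

    package = {
        "run_id": run["run_id"],
        "generated_at": generated_at.isoformat(),
        "training_configuration": dict(sorted(run["params"].items())),
        "evaluation_results": dict(sorted(run["metrics"].items())),
        "annotations": dict(sorted(run["tags"].items())),
    }
    # The digest covers every section, so a reviewer can recompute it later.
    canonical = json.dumps(package, sort_keys=True, separators=(",", ":"))
    package["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return package


example_run = {
    "run_id": "example-run",
    "params": {"lr": "3e-5", "epochs": "4"},
    "metrics": {"f1": 0.91},
    "tags": {"owner": "ml-platform"},
}
package = build_audit_package(
    example_run, datetime(2025, 1, 1, tzinfo=timezone.utc)
)
print(json.dumps(package, indent=2))
```

Even this small script has to decide on section names, ordering, and a digest scheme, and those decisions are what drift between teams when every export is written by hand.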

No approval workflow. MLflow's model registry has stage transitions (Staging → Production), but these are not structured compliance approvals. There's no role-based gate requiring a designated compliance reviewer to approve before a model advances. Stage transitions can be performed by any user with registry write access.
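The difference between a stage transition and a gated approval can be sketched in a few lines. This is an illustration of the concept only: the role name, the Approval fields, and the rule that a requester cannot approve their own promotion are assumptions made for the example, not a description of either tool's behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Approval:
    reviewer: str
    role: str
    model_version: str
    approved_at: datetime


class ApprovalRequired(Exception):
    """Raised when a promotion lacks a valid compliance approval."""


REVIEWER_ROLE = "compliance_reviewer"  # hypothetical role name


def promote_to_production(model_version: str, requested_by: str,
                          approval: Optional[Approval]) -> str:
    """Allow the transition only with an approval from the designated role."""
    if approval is None:
        raise ApprovalRequired(f"{model_version}: no approval on file")
    if approval.role != REVIEWER_ROLE:
        raise ApprovalRequired(f"{approval.reviewer} does not hold {REVIEWER_ROLE}")
    if approval.model_version != model_version:
        raise ApprovalRequired("approval was issued for a different version")
    if approval.reviewer == requested_by:
        raise ApprovalRequired("requester cannot approve their own promotion")
    return f"{model_version} promoted (approved by {approval.reviewer})"


approval = Approval("reviewer-b", REVIEWER_ROLE, "clinical-nlp-v2:7",
                    datetime(2025, 1, 1, tzinfo=timezone.utc))
print(promote_to_production("clinical-nlp-v2:7", "engineer-a", approval))
```

The approval is tied to one specific model version, so it cannot be reused for a later version that the reviewer never saw.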

No dataset-level versioning with provenance. MLflow allows logging dataset references (in newer versions), but these are informational. They record where the data came from and a short digest, but they don't enforce that the referenced dataset was snapshotted at the time of the run, and they aren't designed to serve as cryptographic proof of provenance.
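For readers unfamiliar with hash-based provenance, the sketch below computes a SHA-256 Merkle root over a list of dataset records using the standard library. It shows the general idea only; how Cognify chunks data or builds its tree is not covered in this post, and the record contents are made up.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunks: list) -> str:
    """Return the hex Merkle root over a list of byte chunks."""
    if not chunks:
        return _sha256(b"").hex()
    # Leaf and interior nodes get distinct prefixes so a leaf can never be
    # mistaken for a pair of child hashes.
    level = [_sha256(b"\x00" + chunk) for chunk in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(_sha256(b"\x01" + level[i] + level[i + 1]))
            else:
                # An unpaired node is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0].hex()


records = [b"record-001", b"record-002", b"record-003"]
snapshot_hash = merkle_root(records)

# Any change to any record changes the root stored alongside the run.
tampered = [records[0], b"record-00X", records[2]]
assert merkle_root(tampered) != snapshot_hash
assert merkle_root(list(records)) == snapshot_hash
```

Storing the root with the run record lets a reviewer confirm later that the dataset on disk is the one the model was trained or evaluated on.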

The audience split

The fundamental difference is audience design. MLflow's interface is built for ML engineers: tabular run comparison views, metric curves, artifact browsers. Every design decision optimizes for engineering utility — fast access to the information an engineer needs to debug a run or choose between model versions.

Compliance teams have different information needs. They need structured documentation rather than free-form run data, evidence of a specific review and approval event, the ability to verify that what they're reading hasn't been modified since it was produced, and an export format they can attach to a regulatory submission or internal governance record. They would also rather not work inside an engineering tool.

We're not saying MLflow's design is wrong — it's correctly optimized for its intended audience. The point is that making compliance teams use an ML engineering tool, or making ML engineers maintain their data in formats that compliance teams can consume, creates friction for both groups. A separate compliance layer that consumes data from the engineering pipeline and presents it to compliance teams in a format they can use serves both audiences better.

Concrete capability comparison

Capability	MLflow	Cognify
Run parameter logging	Yes — free-form, mutable	Yes — structured schema, immutable after run completion
Metric logging	Yes — time series and final values	Yes — linked to specific eval dataset version
Dataset versioning with hash	Limited — reference logging only	Yes — SHA-256 Merkle tree hash, provenance metadata
Immutable audit record	No	Yes — append-only, hash-chained
Compliance approval workflow	No	Yes — role-based, with e-signature and timestamp
Compliance package export (PDF/JSON)	No — requires custom scripting	Yes — structured templates per industry/regulation
Model card auto-generation	No	Yes — from pipeline metadata, with human-authored sections
Compliance team interface (no ML dashboard)	No	Yes — review interface separate from engineering views
Tamper detection	No	Yes — hash chain verification
Retention policy enforcement	No	Yes — configurable, with archival export
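The first row of the table contrasts free-form, mutable logging with a structured schema that locks at run completion. A toy version of that behavior might look like the sketch below. The RunRecord class and its field names are hypothetical, and this only enforces immutability at the application level; it is not how either product stores data.

```python
from types import MappingProxyType


class RunRecord:
    """Accepts schema-checked writes until complete() is called."""

    # Hypothetical schema: only these parameter names are accepted.
    ALLOWED_FIELDS = {"base_model", "learning_rate", "epochs", "dataset_version"}

    def __init__(self, run_id: str):
        self.run_id = run_id
        self._params = {}
        self._completed = False

    def log_param(self, key: str, value) -> None:
        if self._completed:
            raise PermissionError(f"run {self.run_id} is complete and read-only")
        if key not in self.ALLOWED_FIELDS:
            raise KeyError(f"{key!r} is not part of the schema")
        self._params[key] = value

    def complete(self) -> None:
        self._completed = True

    @property
    def params(self):
        # Read-only view: callers cannot mutate the underlying dict.
        return MappingProxyType(self._params)


record = RunRecord("example-run")
record.log_param("epochs", 4)
record.complete()

try:
    record.log_param("epochs", 5)  # rejected: the run is already complete
except PermissionError as err:
    print(err)
```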

The integration pattern in practice

Teams that run both tools use them simultaneously, not as alternatives. The training script logs to MLflow for engineering observability — the engineering team uses MLflow for run comparison, hyperparameter search analysis, and debugging. The same training script initializes Cognify and logs the compliance-relevant artifacts in parallel.

There's deliberate overlap: both tools receive training configuration and eval metrics. MLflow receives the data for engineering analysis. Cognify receives the data for compliance documentation. The overlap is by design — the compliance record should reflect the same facts that the engineering team is working with, not a separate set of records.

import mlflow
import cognify

# training_config, eval_results, and model are assumed to be defined
# earlier in the training script.

# Both tools initialize before training
mlflow.set_experiment("clinical-nlp-v2")
cognify.init(workspace_id="...", project="clinical-nlp-v2")

# Both receive training config
mlflow.log_params(training_config)
cognify_run = cognify.run(config=training_config)

# Both receive eval results
mlflow.log_metrics(eval_results)
cognify.eval(results=eval_results, benchmark_version="holdout-v3")

# MLflow for model artifact management
mlflow.pytorch.log_model(model, "model")

# Cognify for compliance package at run completion
cognify_run.complete()  # triggers compliance package generation
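One way to keep the overlap honest is to route the shared payloads through a single fan-out helper, so both tools always receive identical data. The sketch below uses in-memory stand-ins in place of the real backends; Sink, InMemorySink, and DualLogger are hypothetical names for illustration, not part of MLflow or Cognify.

```python
from typing import Protocol


class Sink(Protocol):
    def record_config(self, config: dict) -> None: ...
    def record_metrics(self, metrics: dict) -> None: ...


class InMemorySink:
    """Stand-in backend; a real sink would call MLflow or Cognify here."""

    def __init__(self, name: str):
        self.name = name
        self.config = {}
        self.metrics = {}

    def record_config(self, config: dict) -> None:
        self.config = dict(config)

    def record_metrics(self, metrics: dict) -> None:
        self.metrics = dict(metrics)


class DualLogger:
    """Send each config and metrics payload to every registered sink."""

    def __init__(self, *sinks: Sink):
        self._sinks = sinks

    def log_config(self, config: dict) -> None:
        for sink in self._sinks:
            sink.record_config(config)

    def log_metrics(self, metrics: dict) -> None:
        for sink in self._sinks:
            sink.record_metrics(metrics)


engineering = InMemorySink("engineering")
compliance = InMemorySink("compliance")
logger = DualLogger(engineering, compliance)
logger.log_config({"lr": 3e-5, "epochs": 4})
logger.log_metrics({"f1": 0.91})

# Both sides hold the same facts about the run.
assert engineering.config == compliance.config
assert engineering.metrics == compliance.metrics
```

With a single call site per payload, the engineering view and the compliance record cannot drift apart because someone updated one logging call and forgot the other.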

When do you actually need both?

The honest answer: if your ML pipeline does not produce models for regulated enterprise use, and you face no model risk management review, compliance documentation requirements, or regulatory examination, MLflow alone is sufficient for experiment tracking and model management. A compliance layer adds overhead that isn't justified when no compliance use case exists.

If you're deploying fine-tuned models in a regulated context — healthcare, financial services, insurance, regulated government applications — the compliance documentation requirements exist regardless of whether you're using a dedicated compliance tool. Without a tool like Cognify, the documentation work happens anyway: it's done manually, inconsistently, by engineers who are context-switching from their primary work, producing ad hoc artifacts that may or may not satisfy a rigorous review. The question isn't "do we need compliance documentation?" — the question is "do we want to produce it systematically or ad hoc?"

Teams that have operated the ad hoc model for more than a few model deployments consistently report the same experience: the documentation overhead grows with the number of models in production, the inconsistency between model versions creates review friction, and the archaeology required to reconstruct documentation for a specific model version during a regulatory review consumes disproportionate engineering time. The systematic approach pays off at a relatively small scale: two or three deployed models under active compliance review are typically enough to make the investment worthwhile.
