AI Governance | January 14, 2025 | 8 min read

Why ML Compliance Fails: The Gap Between Experiment Tracking and Audit-Ready Documentation

Fatima Al-Rashid

CEO & Founder, Cognify

Every ML team I talk to at a regulated enterprise uses at least one experiment tracking tool. Weights & Biases, MLflow, Comet — sometimes all three at once. These tools solve a real problem: giving engineers visibility into what happened during training so they can debug, compare runs, and reproduce results. They are genuinely excellent at this.

And yet, when those same teams go to get a fine-tuned model approved for production deployment, they hit a wall. The compliance team — whether that's an internal model risk committee at a bank, a HIPAA compliance officer at a health system, or an AI governance board at an insurance carrier — starts asking questions that W&B dashboards cannot answer. Not because the data isn't there, but because it's in the wrong form, in the wrong place, with none of the access controls or immutability guarantees that turn data into evidence.

This is the core of ML compliance failure, and it has almost nothing to do with the quality of the model. It's an organizational and tooling gap that sits between the end of a training run and the beginning of a compliance review.

Tracking vs. audit: a fundamental mismatch

Experiment tracking tools are designed for ML engineers. Their primary UX surface is the engineering team's dashboard: compare runs, filter by hyperparameter, visualize loss curves, annotate with free-text notes. The mental model is exploratory — you're asking "what happened and what should I try next?"

Audit documentation has a completely different mental model. It's not exploratory; it's evidentiary. The questions it answers are: "What decision was made, by whom, on what basis, and can I prove that the record hasn't been altered since?" A compliance team reviewing a model deployment needs a frozen, signed-off snapshot of what was true at approval time — not a live dashboard that could have changed since then.

Consider what happens when an examiner from a banking regulator asks to see the model documentation for a fine-tuned LLM used in credit underwriting. They don't want a W&B run link. They want a document that says: here is the dataset version used, with a hash proving it hasn't changed; here is the training configuration; here are the evaluation results against these specific benchmarks; here is the name and role of the person who reviewed and approved this; and here is the timestamp of that approval. That document needs to exist independently of any engineering tool, be exportable without requiring engineering access, and be immutable — meaning the approval record cannot be modified after the fact.
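As a rough illustration of what that document could contain, here is a minimal Python sketch. The function names (`sha256_of_file`, `build_approval_document`, `render_document`) and the field layout are my own assumptions for illustration, not a regulatory template or any particular product's format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_approval_document(
    dataset_path: str,
    dataset_version: str,
    training_config: Dict[str, Any],
    eval_results: Dict[str, float],
    approver_name: str,
    approver_role: str,
    approved_at: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the fields an examiner asks for into one standalone record.

    The layout is hypothetical; adapt the field names to your own review process.
    """
    return {
        "dataset": {
            "version": dataset_version,
            "sha256": sha256_of_file(dataset_path),
        },
        "training_config": dict(training_config),
        "eval_results": dict(eval_results),
        "approval": {
            "name": approver_name,
            "role": approver_role,
            # Default to the current UTC time if no approval timestamp is supplied.
            "approved_at": approved_at or datetime.now(timezone.utc).isoformat(),
        },
    }


def render_document(document: Dict[str, Any]) -> str:
    """Serialize with sorted keys so the same record always renders identically."""
    return json.dumps(document, indent=2, sort_keys=True)
```

The rendered JSON can be handed to a reviewer without a login to any engineering tool. On its own it is still just a file: nothing here makes it immutable, which is a separate property the storage layer has to provide.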

What compliance teams actually need

When I was building clinical NLP models at a healthcare technology company, the compliance review process consumed weeks — not because the reviewers were slow, but because they were asking for things that didn't exist as documents. Every review started with the same archaeology project: can someone produce a list of exactly which training records were in the dataset? Who reviewed the data source agreements? What were the evaluation metrics on the holdout set, and was that holdout set kept completely separate from the fine-tuning corpus?

These questions were answerable, but answering them required manual work from three different people across two different teams. Someone had to pull dataset statistics from an S3 manifest. Someone else had to find the relevant data use agreement (DUA) in a SharePoint folder. A third person had to export eval results from MLflow into a spreadsheet. Then a compliance analyst had to stitch these artifacts into something coherent.

What compliance teams need is not better access to engineering tools. They need a structured, compliance-oriented view of the same information: organized by model version, tied to specific approval checkpoints, and in a format that their existing review workflow can consume. The information must be:

Structured. Not free-text notes in a run comment. Machine-readable fields with defined schemas.
Complete at a point in time. A snapshot that represents what was true when the model was approved — not a live view that reflects the current state of any dataset or training artifact.
Attributable. Every piece of data linked to a specific run, a specific dataset version, a specific evaluator.
Immutable after sign-off. The audit record must be write-once after the compliance team approves. An approval that can be edited retroactively is not an approval.
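To make these four properties concrete, here is a minimal Python sketch of a snapshot record. The `AuditSnapshot` class and its field names are hypothetical, chosen only to mirror the list above. It enforces write-once behavior inside a single process. A real audit system would have to enforce the same rule at the storage layer.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class AuditSnapshot:
    """Structured, attributable, point-in-time record (hypothetical schema)."""

    model_version: str
    run_id: str            # attributable: ties the record to one training run
    dataset_version: str
    dataset_sha256: str    # pins the exact dataset contents
    evaluator: str         # who produced the eval results
    eval_metrics: Mapping[str, float]
    approved_by: Optional[str] = None
    approved_at: Optional[str] = None

    def __post_init__(self) -> None:
        # Copy the metrics into a read-only mapping so they cannot be
        # edited through a reference the caller still holds.
        object.__setattr__(
            self, "eval_metrics", MappingProxyType(dict(self.eval_metrics))
        )

    def sign_off(self, approver: str, approved_at: str) -> "AuditSnapshot":
        """Return a signed copy; refuse if this snapshot is already signed."""
        if self.approved_by is not None:
            raise PermissionError("snapshot is already signed off (write-once)")
        return replace(self, approved_by=approver, approved_at=approved_at)
```

Assigning to any field after construction raises `FrozenInstanceError`, and calling `sign_off` on an already signed snapshot raises `PermissionError`, so a correction has to be issued as a new record rather than an edit to the old one.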

The export problem

Even teams that use MLflow with discipline — tagging runs carefully, adding meaningful notes, tracking hyperparameters consistently — hit an export problem. The data lives in an MLflow tracking server, or in MLflow's artifact store, or in a mix of both depending on what was logged where. To turn this into a compliance document, someone needs to write a custom export script, decide what fields to include, format the output, and then get that output reviewed and signed. This process is ad hoc every time, which means it's inconsistent between model versions and between teams.

Inconsistency is a compliance problem in its own right. If the documentation for model version 1.2 includes dataset statistics and model version 1.3 doesn't, a regulator examining both will ask why. Structured, systematic documentation generation isn't just convenient — it's part of demonstrating that the compliance process is repeatable and controlled.
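One way to catch this kind of drift before a regulator does is to check every documentation package against the same required field list. The sketch below is a minimal Python example. The field names in `REQUIRED_FIELDS` and the function `find_documentation_gaps` are assumptions for illustration, not a standard.

```python
from typing import Any, Dict, List

# Hypothetical required fields; every model version must document all of them.
REQUIRED_FIELDS = (
    "dataset_version",
    "dataset_sha256",
    "dataset_statistics",
    "training_config",
    "eval_results",
)


def _is_missing(value: Any) -> bool:
    """Treat None and empty containers or strings as undocumented."""
    if value is None:
        return True
    return hasattr(value, "__len__") and len(value) == 0


def find_documentation_gaps(
    packages: Dict[str, Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Map each model version to the required fields it fails to document.

    Versions with complete documentation are omitted from the result.
    """
    gaps: Dict[str, List[str]] = {}
    for version, package in packages.items():
        missing = [
            name for name in REQUIRED_FIELDS if _is_missing(package.get(name))
        ]
        if missing:
            gaps[version] = missing
    return gaps
```

If version 1.2 documents `dataset_statistics` and version 1.3 does not, the result contains an entry for 1.3 listing that field, which is the question a regulator would otherwise be the first to ask.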

Mutability: the hidden killer

Here is something that doesn't get discussed enough: most experiment tracking stores are mutable. Run notes can be edited. Tags can be changed. Metrics can be added retroactively. In a collaborative ML environment, this is a feature — engineers update notes as they learn more about what a run produced. But in a compliance context, mutability destroys evidentiary value.

I'm not saying experiment tracking tools are poorly designed. They're correctly designed for their purpose. The issue is that using a mutable store as your compliance record creates a chain-of-custody problem. If run notes can be edited after an approval decision was made, how does a regulator know the notes they're reading reflect what the reviewer saw at approval time?

Immutability in a compliance audit log means append-only writes, hash-chained entries, and no admin override. It means the record of what happened during training, and the record of who approved it and when, are cryptographically tied together and cannot be altered without detection. This is architecturally incompatible with a standard experiment tracking database — and it's one of the primary reasons experiment tracking tools can't be repurposed as audit systems without significant additional infrastructure.
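The hash-chaining idea can be sketched in a few lines of Python. This is a minimal in-memory illustration under my own assumptions (the `HashChainedLog` class, the all-zero genesis hash, and JSON as the canonical encoding are all hypothetical choices). It makes tampering detectable, but it does not prevent it: the "no admin override" property has to come from the storage and access-control layer, which this sketch does not model.

```python
import copy
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with a canonical payload encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


class HashChainedLog:
    """Append-only log where each entry commits to everything before it."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, payload: Dict[str, Any]) -> str:
        """Add an entry and return its hash. There is no update or delete method."""
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        stored = copy.deepcopy(payload)  # detach from the caller's object
        entry = {
            "prev_hash": prev_hash,
            "payload": stored,
            "hash": _entry_hash(prev_hash, stored),
        }
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Because the approval entry is appended after the training record, its hash depends on the training record's hash. Editing the training record after approval makes `verify()` return `False`, which is the sense in which the two records are cryptographically tied together.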

Where the gap widens in regulated industries

The generic ML compliance problem becomes acute when you add regulatory specificity. A bank subject to SR 11-7 model risk management guidance needs to demonstrate conceptual soundness and ongoing monitoring for every model material to its operations. A health system deploying a clinical NLP model needs to document data provenance in ways that are consistent with HIPAA's minimum necessary standard. An insurance carrier needs model documentation that satisfies state insurance department requirements, which vary by jurisdiction.

None of these regulatory contexts maps cleanly onto what W&B or MLflow produce. SR 11-7 asks for documentation of model limitations and their potential impact. A model card asks for intended use, out-of-scope use cases, and performance across demographic groups. HIPAA-adjacent documentation asks about data source authorization and de-identification methods. These fields have to be populated by humans who understand the compliance context — but they also need to be structured, versioned, and tied to specific training artifacts so that the relationship between the documentation and the underlying model is clear and defensible.

The organizations that handle this well have typically built internal tooling — essentially a compliance documentation system that sits on top of their experiment tracking infrastructure and pulls the relevant artifacts into a structured form. These internal tools represent months of engineering work, and they need to be maintained as the underlying ML infrastructure evolves. Most early-stage or growing AI teams don't have the bandwidth to build and maintain this tooling while also doing the actual ML work.

Closing the gap without slowing engineers

The answer is not to make engineers do more documentation work. If the solution requires ML engineers to fill out compliance forms, you've created an adversarial relationship between two teams that need to work together, and you've added a step that will be skipped under deadline pressure.

The correct approach is to make compliance documentation a byproduct of the engineering process — generated automatically from the artifacts that already exist (dataset hashes, hyperparameter logs, eval results, checkpoint metadata) — and to give compliance teams a structured interface to review and sign off on that documentation without needing engineering access to the training environment.

This requires an explicit audit trail layer that sits between the experiment tracking tool (which engineers control) and the compliance review interface (which governance teams control). That layer needs to handle: versioned dataset snapshots with cryptographic provenance, immutable run records, structured eval documentation, and a sign-off workflow with appropriate access controls.
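The access-control part of that layer can be illustrated with a small Python sketch. The role names, the permission table, and the `ReviewQueue` class are hypothetical and only show the separation of duties: engineers can submit a model version for review, and only a compliance reviewer can sign it off.

```python
from enum import Enum
from typing import Dict


class Role(Enum):
    ENGINEER = "engineer"
    COMPLIANCE_REVIEWER = "compliance_reviewer"


# Hypothetical permission table: who may perform which action.
PERMISSIONS = {
    Role.ENGINEER: {"submit"},
    Role.COMPLIANCE_REVIEWER: {"sign_off"},
}


def authorize(role: Role, action: str) -> None:
    """Raise PermissionError unless the role is allowed to perform the action."""
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role.value} may not perform '{action}'")


class ReviewQueue:
    """Tracks review status per model version: pending, then approved."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}

    def submit(self, role: Role, model_version: str) -> None:
        authorize(role, "submit")
        if model_version in self._status:
            raise ValueError(f"{model_version} was already submitted")
        self._status[model_version] = "pending"

    def sign_off(self, role: Role, model_version: str) -> None:
        authorize(role, "sign_off")
        if self._status.get(model_version) != "pending":
            raise ValueError(f"{model_version} is not awaiting review")
        self._status[model_version] = "approved"

    def status(self, model_version: str) -> str:
        return self._status.get(model_version, "unknown")
```

In this sketch an engineer who tries to call `sign_off` gets a `PermissionError`, and an already approved version cannot be signed off a second time. In practice each of these transitions would also be written to the audit trail.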

Cognify was built specifically to fill this gap — an audit trail that instruments existing fine-tuning pipelines without requiring pipeline rewrites, and produces compliance packages in formats that governance teams can actually consume. The goal isn't to replace W&B or MLflow; both serve their purpose well. The goal is to produce the compliance-facing documentation that those tools were never designed to produce, without adding friction to the engineering workflow that generates the underlying data.

The gap between experiment tracking and audit-ready documentation is real, it's structural, and it won't close on its own. But it's also solvable, and solving it is what turns a well-trained model into a deployed, production-approved one.
