Data Provenance for Fine-Tuning: The Six Questions Every Compliance Team Will Ask
Every compliance review of a fine-tuned model in a regulated enterprise eventually comes down to six questions about the training data. They come in different forms — as items on a model risk committee checklist, as examiner questions during a bank's model validation, as documentation requirements in a health system's AI governance policy — but the underlying questions are the same.
What's striking is how consistently ML teams are unprepared for them. Not because the questions are unreasonable — they're entirely reasonable — but because the workflow for building fine-tuning pipelines doesn't naturally generate the documentation needed to answer them. The data gets assembled, the model gets trained, and the answers to these six questions exist somewhere across a combination of S3 buckets, data engineering team knowledge, and Slack message history.
This post works through each question: why compliance teams ask it, what good documentation looks like, and what it costs not to have that documentation ready.
Q1: Where did this data come from?
The first question sounds simple, but the answer needs to be specific. "Our internal customer service transcripts" is not a sufficient answer. "Customer service transcripts extracted from the CRM system (Salesforce, production instance) via daily export job, records with ticket creation dates between 2022-01-01 and 2024-06-30, filtered to English-language tickets categorized as 'billing' or 'account management', total 847,000 records" is an answer that compliance can work with.
The specificity matters because regulators are asking about data sources to understand three things: whether the data was obtained in ways that comply with applicable law and data source agreements, whether the data's provenance is consistent with its use in the model (e.g., data collected for customer service operations being used for credit risk modeling would raise questions), and whether the data is representative of the population the model will be applied to.
Documentation standard: a structured data source registry for each training dataset version, with source system, record type, date range, filtering criteria, and record count. Generated at extraction time, not reconstructed from memory weeks later.
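As an illustration, a registry entry along these lines could be written by the extraction job itself. This is a minimal sketch, not a prescribed schema: the class name, field names, and the "v1" version label are assumptions made for the example, and the values are taken from the customer service example above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date, datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class SourceRegistryEntry:
    """One registry row per source per dataset version (hypothetical schema)."""
    dataset_version: str
    source_system: str
    record_type: str
    date_from: date
    date_to: date
    filters: Tuple[str, ...]
    record_count: int
    extracted_at: str  # UTC timestamp captured when the extraction runs


def build_entry(dataset_version, source_system, record_type,
                date_from, date_to, filters, record_count):
    # The timestamp is taken here, at extraction time, rather than
    # filled in later from memory.
    return SourceRegistryEntry(
        dataset_version=dataset_version,
        source_system=source_system,
        record_type=record_type,
        date_from=date_from,
        date_to=date_to,
        filters=tuple(filters),
        record_count=record_count,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )


def to_json(entry: SourceRegistryEntry) -> str:
    # default=str serializes the date fields in ISO format.
    return json.dumps(asdict(entry), default=str, sort_keys=True, indent=2)


entry = build_entry(
    dataset_version="v1",  # assumed label for illustration
    source_system="Salesforce (production instance), daily export job",
    record_type="customer service transcript",
    date_from=date(2022, 1, 1),
    date_to=date(2024, 6, 30),
    filters=["language == English",
             "category in ('billing', 'account management')"],
    record_count=847_000,
)
```

Calling `to_json(entry)` yields a record that can be stored alongside the dataset version, so the answer to Q1 is a lookup rather than a reconstruction.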
Q2: Are you authorized to use it?
For internally generated data (customer records, transaction logs, clinical notes), the authorization question asks: does the organization's legal right to use this data for model training flow from customer agreements, applicable law, or regulatory permissions, or are there consent or use limitations that would restrict this use? For externally licensed data, the question asks: do the data source agreement terms permit use for training ML models?
This question catches more teams off guard than any other. Data used routinely in operational analytics is often used in model training without a separate authorization review, on the assumption that "we have the data, therefore we can use it." But data source agreements, consent terms, and regulatory frameworks sometimes distinguish between operational use and use for model training — a distinction that's increasingly relevant as organizations review their AI governance frameworks.
For clinical data, authorization is additionally structured around specific regulatory exceptions (HIPAA treatment, payment, and health care operations; research waivers; de-identification standards). The authorization record for a clinical NLP training dataset needs to identify the specific legal basis, not just assert that the data was "properly used."
Documentation standard: a data authorization record for each data source, identifying the specific authorization basis (data source agreement clause, regulatory exception, consent mechanism), reviewed and signed by a legal or privacy officer. This is a human-authored attestation, not something that can be auto-generated from pipeline metadata.
Q3: What changed since last time?
This question assumes that there was a "last time" — a previous model version with a previous training dataset that was previously reviewed. It asks: what is different about the training data for this model version compared to the previous one, and is that difference material to the compliance review?
Teams that don't do systematic dataset versioning often can't answer this question at all. If the training dataset was assembled fresh for each training run without version tracking, there's no baseline to compare against. The compliance team is forced to treat each model version as a new model with a new dataset — which is more work for everyone.
Teams that do compute a SHA-256 hash of each dataset can answer "something changed" but often can't answer "what changed." Knowing that the hash is different doesn't tell you whether new records were added, old records were removed, or labels were corrected. A compliance reviewer needs to understand the nature of the change to assess whether it requires re-examination of the model's compliance properties.
Documentation standard: structured version diff for each dataset version, documenting record count delta, source system changes (new sources added, sources removed), date range changes, filter criteria changes, and a human-authored description of the intent behind the changes. The diff should be linked to both the previous and current dataset version hashes so it can be verified against the actual dataset records.
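A record-level diff of this kind can be sketched as follows. The sketch assumes each record has a stable ID and is available as bytes; the function names and the output layout are hypothetical, and the `intent` field stands in for the human-authored description, which still has to be written by a person.

```python
import hashlib
from typing import Dict


def dataset_digest(records: Dict[str, bytes]) -> str:
    """SHA-256 over (id, record hash) pairs in sorted ID order."""
    h = hashlib.sha256()
    for rid in sorted(records):
        h.update(rid.encode("utf-8") + b"\x00")
        h.update(hashlib.sha256(records[rid]).digest())
    return h.hexdigest()


def version_diff(prev: Dict[str, bytes], curr: Dict[str, bytes],
                 intent: str) -> dict:
    """Describe what changed between two dataset versions."""
    prev_ids, curr_ids = set(prev), set(curr)
    added = sorted(curr_ids - prev_ids)
    removed = sorted(prev_ids - curr_ids)
    # Same ID in both versions but different content (e.g. a corrected label).
    modified = sorted(r for r in prev_ids & curr_ids if prev[r] != curr[r])
    return {
        "previous_version_hash": dataset_digest(prev),
        "current_version_hash": dataset_digest(curr),
        "record_count_delta": len(curr) - len(prev),
        "added": added,
        "removed": removed,
        "modified": modified,
        "intent": intent,  # human-authored, supplied by the dataset owner
    }


v1 = {"t-001": b"refund request|billing", "t-002": b"login issue|account"}
v2 = {"t-001": b"refund request|billing",
      "t-002": b"login issue|account management",  # label corrected
      "t-003": b"invoice question|billing"}        # new record
diff = version_diff(v1, v2, intent="Label correction plus newer tickets")
```

Because the diff carries both version hashes, a reviewer can recompute the digests from the stored datasets and confirm the diff describes the versions it claims to.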
Q4: Who reviewed these changes?
Data governance at regulated enterprises is not just about documentation; it's about decision accountability. When training data changes between model versions, someone with appropriate authority should have reviewed that change and determined it was appropriate for the intended use. The compliance team wants to know who that person was, what their role and authority were, when they reviewed the change, and what they concluded.
This is the approval gate question. Without a structured data approval process, data changes happen implicitly: a data engineer updates the extraction script, the new dataset is used in the next training run, and no one has explicitly signed off on the change. The data change may be entirely benign — a bug fix, an update to include more recent records — but without an approval record, there's no way to demonstrate that it was intentional and reviewed.
Documentation standard: a timestamped approval record for each dataset version that differs materially from the previous version, including the reviewer's name, role, the scope of their review (what they examined), and their conclusion. This record should be immutable after it's created — the approval event should be a sealed record that reflects the state of documentation at the time of review.
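One way to make an approval record "sealed" is to compute a digest over its canonical form when it is created and store that digest separately. The sketch below assumes this approach; the field names are illustrative, and the reviewer and hash values are placeholders. A frozen dataclass only prevents accidental mutation inside the process, so real immutability still depends on where the record and its seal are stored.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class ApprovalRecord:
    """Hypothetical approval record for one dataset version."""
    dataset_version_hash: str
    previous_version_hash: str
    reviewer_name: str
    reviewer_role: str
    review_scope: str
    conclusion: str
    reviewed_at: str  # ISO 8601 timestamp of the review


def seal(record: ApprovalRecord) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same record
    # always produces the same digest.
    payload = json.dumps(asdict(record), sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(record: ApprovalRecord, expected_seal: str) -> bool:
    return hmac.compare_digest(seal(record), expected_seal)


record = ApprovalRecord(
    dataset_version_hash="<current dataset hash>",    # placeholder
    previous_version_hash="<previous dataset hash>",  # placeholder
    reviewer_name="J. Doe",                            # placeholder
    reviewer_role="Model Risk Officer",                # placeholder
    review_scope="version diff, filtering log",
    conclusion="approved",
    reviewed_at="2024-07-15T14:02:00+00:00",
)
record_seal = seal(record)

# Any later edit yields a different record whose digest no longer matches.
tampered = replace(record, conclusion="approved with conditions")
```

Here `verify(record, record_seal)` holds while `verify(tampered, record_seal)` does not, which is the property the compliance team is asking for.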
Q5: How do we know the data wasn't modified?
This is the integrity question. It asks for evidence that the training dataset hasn't been altered between when it was assembled (and approved) and when it was used for training. In a well-designed pipeline, this evidence comes from cryptographic hashes computed at assembly time and verified at training time.
The question has both a technical and a process dimension. The technical dimension is: does the hash of the dataset at training time match the hash recorded at assembly time? The process dimension is: is the hash stored in a system where it itself can't be modified — i.e., is the hash record trustworthy?
A SHA-256 hash of the dataset stored in an MLflow run record is technically useful but compliance-limited, because MLflow run records can be modified (as discussed elsewhere in this blog). A SHA-256 hash stored in an immutable audit log with cryptographic chain verification is both technically and compliance-useful — the hash and its storage system together provide the integrity evidence that compliance needs.
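The idea of "cryptographic chain verification" can be illustrated with a simple hash chain, in which each log entry's hash covers the previous entry's hash. This is a sketch of the general technique, not the API of any particular audit log product; the entry layout and function names are assumptions. An in-memory list is of course editable, so the chain is only tamper-evident if the latest hash is also recorded somewhere the pipeline cannot rewrite.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: List[dict], payload: dict) -> dict:
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev_hash": prev, "payload": payload,
             "hash": entry_hash(prev, payload)}
    log.append(entry)
    return entry


def verify_chain(log: List[dict]) -> bool:
    """Recompute every hash; an edit to any entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


audit_log: List[dict] = []
append(audit_log, {"event": "dataset_assembled",
                   "dataset_sha256": hashlib.sha256(b"dataset v1").hexdigest()})
append(audit_log, {"event": "training_started", "dataset_version": "v1"})
```

Changing the recorded dataset hash in the first entry after the fact causes `verify_chain(audit_log)` to fail, which is what distinguishes this from a mutable run record.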
Documentation standard: cryptographic hashes at the record or chunk level (Merkle tree structure for large datasets), stored in a system with tamper-evident guarantees. The verification process — how to confirm that the training data matched the hash at training time — should be documented so that an external auditor can verify it independently.
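For the chunk-level structure, a Merkle root can be computed as sketched below. This is one common construction, not the only one: it prefixes leaves and interior nodes with different bytes (the convention used in Certificate Transparency) and carries an unpaired node up to the next level unchanged. The chunking scheme and function names are assumptions for the example.

```python
import hashlib
from typing import List, Sequence


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunks: Sequence[bytes]) -> str:
    """Hex Merkle root over dataset chunks, in order."""
    if not chunks:
        raise ValueError("cannot compute a Merkle root over zero chunks")
    # Distinct prefixes keep a leaf hash from being mistaken for a node hash.
    level: List[bytes] = [_sha256(b"\x00" + c) for c in chunks]
    while len(level) > 1:
        nxt = [_sha256(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # unpaired node moves up unchanged
        level = nxt
    return level[0].hex()


def verify_at_training_time(chunks: Sequence[bytes], recorded_root: str) -> bool:
    """Recompute the root from the data about to be used for training."""
    return merkle_root(chunks) == recorded_root


chunks_at_assembly = [b"chunk-0", b"chunk-1", b"chunk-2"]
recorded_root = merkle_root(chunks_at_assembly)  # stored at assembly time

chunks_at_training = [b"chunk-0", b"chunk-1", b"chunk-2 (edited)"]
```

An external auditor with access to the chunks and the recorded root can run the same function; `verify_at_training_time(chunks_at_training, recorded_root)` fails here because one chunk differs.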
Q6: What was filtered out, and why?
Data filtering decisions — what records were excluded from the training dataset and why — are often underdocumented because they seem like implementation details rather than compliance decisions. But from a compliance perspective, filtering decisions are among the most significant choices made about training data, because they directly affect model behavior on the excluded population.
A credit model trained on a dataset from which certain demographic groups were excluded (even for ostensibly neutral reasons like "no credit history available") will perform differently on those groups than on groups that were well-represented. A clinical NLP model trained only on English-language notes will perform differently on patients whose care was documented in other languages. These aren't necessarily disqualifying — they're use case limitations that need to be documented.
Documentation standard: a filtering log for each dataset version, documenting each filtering criterion applied (record type exclusions, date range cutoffs, language filters, quality thresholds), the count of records excluded by each criterion, and the rationale for each criterion. For filters that could affect demographic representation, the documentation should acknowledge this and describe how it's addressed in deployment (scope limitations, supplemental evaluation).
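A filtering log can be produced as a byproduct of applying the filters, as in this sketch. The criterion names, rationales, and record fields are invented for illustration. Filters are applied in sequence, so each count reflects records excluded by that criterion among those that survived the earlier ones; a pipeline that needs order-independent counts would have to evaluate every filter against the full dataset instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class FilterCriterion:
    name: str
    rationale: str
    keep: Callable[[dict], bool]  # returns True if the record is retained


def apply_filters(records: Iterable[dict],
                  criteria: List[FilterCriterion]) -> Tuple[List[dict], List[dict]]:
    """Apply criteria in order and log how many records each one excluded."""
    remaining = list(records)
    log = []
    for criterion in criteria:
        kept = [r for r in remaining if criterion.keep(r)]
        log.append({
            "criterion": criterion.name,
            "rationale": criterion.rationale,
            "excluded": len(remaining) - len(kept),
            "remaining": len(kept),
        })
        remaining = kept
    return remaining, log


tickets = [
    {"id": 1, "lang": "en", "category": "billing"},
    {"id": 2, "lang": "es", "category": "billing"},
    {"id": 3, "lang": "en", "category": "shipping"},
    {"id": 4, "lang": "en", "category": "account management"},
]
criteria = [
    FilterCriterion("language == en",
                    "Model is scoped to English-language tickets",
                    lambda r: r["lang"] == "en"),
    FilterCriterion("category in (billing, account management)",
                    "Other categories are out of scope for this use case",
                    lambda r: r["category"] in ("billing", "account management")),
]
kept, filter_log = apply_filters(tickets, criteria)
```

The resulting `filter_log` has one entry per criterion with its exclusion count and rationale, which is the shape of answer Q6 calls for.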
Building the answers into the pipeline, systematically
The six questions above have a common property: they're all answerable with information that exists in a well-instrumented pipeline. The data source is known when the extraction script runs. The authorization record exists (or should) in the data governance system. The diff is computable by comparing the two dataset versions that the hashes identify. The approval event happens at a specific time and can be recorded. The integrity hash is produced at assembly time. The filtering log is a byproduct of the filtering process.
The work involved isn't generating this information — it's capturing it in a structured, retrievable form at the time it's produced, and connecting it to the training run that uses it. That's the difference between answering these six questions in minutes from a Cognify compliance package and answering them in days from distributed artifacts across engineering systems and email threads.
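Connecting the six artifacts to a training run can be as simple as a manifest that refuses to be built when an answer is missing. The sketch below is a generic illustration under assumed names; it does not represent any specific product's format, and the artifact references are placeholder strings.

```python
from typing import Dict

# One key per question, Q1 through Q6 (names are assumptions).
REQUIRED_ARTIFACTS = (
    "source_registry",       # Q1: where did the data come from?
    "authorization_record",  # Q2: are you authorized to use it?
    "version_diff",          # Q3: what changed since last time?
    "approval_record",       # Q4: who reviewed these changes?
    "integrity_hash",        # Q5: how do we know it wasn't modified?
    "filtering_log",         # Q6: what was filtered out, and why?
)


def build_manifest(training_run_id: str, artifacts: Dict[str, str]) -> dict:
    """Link a training run to its provenance artifacts, or fail loudly."""
    missing = [k for k in REQUIRED_ARTIFACTS if not artifacts.get(k)]
    if missing:
        raise ValueError("missing provenance artifacts: " + ", ".join(missing))
    return {
        "training_run_id": training_run_id,
        "artifacts": {k: artifacts[k] for k in REQUIRED_ARTIFACTS},
    }


complete = {name: "ref://" + name for name in REQUIRED_ARTIFACTS}  # placeholders
manifest = build_manifest("run-001", complete)
```

Running this check before training starts turns a missing answer into a pipeline failure rather than a finding during the compliance review.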
The teams that are ahead of their compliance requirements on data provenance are not teams with special data governance capabilities — they're teams that decided early that provenance documentation would be a systematic output of their pipeline rather than a retroactive reconstruction exercise. That decision is mostly structural: it changes when and how the documentation is generated, not what information is captured.