AI Governance September 22, 2025 11 min read

Automating Model Cards for Regulated Industries: Beyond the Google Template

Sona Mehrotra

Head of Product, Cognify

The model card framework introduced by Mitchell et al. in their 2019 FAT* paper (the conference has since been renamed FAccT) was a genuine contribution to responsible AI practice. A structured, standardized document covering model purpose, training data, intended and out-of-scope uses, performance characteristics, and ethical considerations gives ML teams a forcing function to think through these dimensions systematically, and gives downstream users a reference for appropriate deployment.

The problem is that the original model card framework was designed for ML researchers sharing models publicly — not for compliance officers reviewing a model before regulated enterprise deployment. The two audiences have overlapping but distinct information needs, and the original framework's coverage of regulatory-specific dimensions is thin. Healthcare, financial services, and insurance all require model documentation that goes significantly beyond what the standard model card structure provides.

This post covers what regulated-industry model cards need that the original framework doesn't cover, how to structure those additions, and which parts can be automated from pipeline metadata versus which require human input.

The Mitchell et al. framework as a starting point

The original model card framework covers nine sections: model details, intended use, factors, metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats and recommendations. This structure is well-designed for its purpose: enabling informed decisions about whether and how to use a model.

The "training data" section in the original framework is light: it asks for a basic description of the training dataset and its characteristics. The "ethical considerations" section is open-ended, asking for discussion of factors that may impact the model's use. The "intended use" section asks for primary intended uses, primary intended users, and out-of-scope uses.

For regulated industries, each of these sections needs to be more specific, and several additional sections need to be added. The goal isn't to make model cards longer for its own sake — it's to make them answer the specific questions that a compliance reviewer or regulator will ask.

What regulated industries add

Across healthcare, financial services, and insurance, four categories of additional documentation consistently appear in regulated model review processes:

1. Data provenance attestation. A formal attestation of the legal and governance basis for using the training data, signed by a data steward or compliance officer. This is not just a description of the data. It is a statement that the data was used with appropriate authorization, that any protected health information (PHI) was handled in compliance with applicable law, and that data source agreements permit the specific use.

2. Protected class evaluation. Explicit evaluation results across demographic subgroups relevant to the regulatory context, with documented methodology and acceptable thresholds. For credit models, this means the protected classes defined under the Equal Credit Opportunity Act (ECOA). For healthcare models, this means clinically relevant subpopulations.

3. Deployment scope limitations. Specific, documented restrictions on where and how the model may be deployed. Both "Intended for internal use in credit underwriting within the commercial lending division" and "Do not use for consumer credit decisions" are deployment scope limitations: the first states where the model may be used, the second where it may not. These limitations become part of the model's governance record.

4. Regulatory cross-references. Explicit citations to the regulatory frameworks that govern the model's deployment context, and documentation of how the model documentation addresses each relevant requirement. A model used in banking credit decisioning should cross-reference SR 11-7 and the ECOA/Regulation B requirements relevant to credit models. A healthcare model should reference HIPAA audit control requirements.
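Deployment scope limitations are easier to enforce when they are recorded as structured data rather than free text. The sketch below is one minimal way to do that; the `DeploymentScope` class and the use-case labels are hypothetical names chosen for illustration, not part of any framework or product described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentScope:
    """Hypothetical structured form of a model's documented scope limitations."""
    permitted: frozenset[str]   # use cases the model card explicitly allows
    prohibited: frozenset[str]  # use cases the model card explicitly excludes

    def check(self, use_case: str) -> str:
        # Explicit prohibitions win over everything else.
        if use_case in self.prohibited:
            return "prohibited"
        if use_case in self.permitted:
            return "permitted"
        # Anything the card does not mention goes to a human reviewer.
        return "needs_review"


scope = DeploymentScope(
    permitted=frozenset({"commercial_credit_underwriting"}),
    prohibited=frozenset({"consumer_credit_decisions"}),
)
print(scope.check("consumer_credit_decisions"))
```

Returning `needs_review` for unlisted use cases is a design choice: silence in the model card is treated as a question for compliance, not as permission.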

Data provenance attestation section

The data provenance attestation section is the most distinctively compliance-specific addition. It's structured around a set of questions that a compliance officer or auditor would ask about the training data:

What data was used? Source systems, record types, date ranges, record counts — specific enough to be verifiable against source system records.
Under what legal basis? The specific authorization for using the data in model training: patient authorization, the HIPAA provision for treatment, payment, and health care operations, data source agreement terms, or a research waiver.
De-identification status. If PHI was involved: which de-identification method, who performed it, validation status.
Data quality attestation. What data quality checks were performed and what issues were identified and addressed.
Attesting party. The name, role, and date of the person attesting to the above. This creates an accountability record — someone is formally on record as verifying the data provenance claims.

The attesting party is a human who must review and sign, so the attestation itself cannot be auto-populated from pipeline metadata. But the factual content (data source records, record counts, hash values linking documentation to training artifacts) can and should be auto-populated from the training pipeline's provenance records.
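To make the split between pipeline-supplied facts and human sign-off concrete, here is a minimal sketch of how the section could be modeled. All class and field names are assumptions made for illustration; treating de-identification status as optional reflects that it only applies when PHI is involved.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DataSourceRecord:
    # Factual fields a pipeline could fill in from its provenance records.
    source_system: str
    record_type: str
    date_range: tuple[date, date]
    record_count: int
    dataset_sha256: str


@dataclass(frozen=True)
class Attestation:
    # Supplied by a person at sign-off; never filled in by the pipeline.
    name: str
    role: str
    signed_on: date


@dataclass
class ProvenanceSection:
    sources: list[DataSourceRecord]
    legal_basis: Optional[str] = None          # human-authored
    deidentification: Optional[str] = None     # only relevant if PHI was involved
    quality_checks: Optional[str] = None       # human-authored summary
    attestation: Optional[Attestation] = None  # human sign-off

    def is_complete(self) -> bool:
        # Complete means: facts present, legal basis stated, and someone signed.
        return (
            bool(self.sources)
            and self.legal_basis is not None
            and self.attestation is not None
        )


def sha256_of_bytes(payload: bytes) -> str:
    """Hash that ties a documented dataset to the bytes actually used."""
    return hashlib.sha256(payload).hexdigest()
```

A card generator built this way can populate `sources` automatically and still refuse to mark the section complete until a named person has attested.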

Protected class evaluation section

The Mitchell et al. framework includes a "factors" section that mentions disaggregated evaluation as a consideration. Regulated industries need this to be a structured section with specific content requirements rather than a general consideration.

The protected class evaluation section should include:

Which protected classes were evaluated, and why (linked to the regulatory framework governing the deployment context)
How each protected class was defined for evaluation purposes — particularly for classes like "race" where the definition and data availability may be complex
The evaluation dataset used for subgroup analysis, with a statement that it was kept separate from the fine-tuning training set
Performance metrics for each subgroup, with comparison to overall performance
The acceptable threshold for disparity, how that threshold was determined, and whether results are within it
Any identified disparities that are outside threshold, what the root cause analysis found, and what (if anything) was done to address them

We're not saying that a model with identified performance disparities cannot be approved — sometimes disparities reflect real-world differences that a model is correctly capturing, and sometimes they're within acceptable bounds given the deployment context. The point is that the documentation should show that the question was asked, the analysis was done, and a deliberate decision was made about the acceptability of the result.
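The subgroup comparison described in this section can be sketched in a few lines. This is an illustration under stated assumptions: the metric is one where higher is better, the gap is measured in absolute metric units, only underperformance relative to the overall figure is flagged, and the function name, subgroup labels, and threshold value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubgroupResult:
    subgroup: str
    metric: float           # subgroup value of the chosen metric
    gap: float              # overall minus subgroup, in absolute metric units
    within_threshold: bool  # True if the gap does not exceed the documented limit


def evaluate_subgroups(
    overall: float, by_subgroup: dict[str, float], max_gap: float
) -> list[SubgroupResult]:
    """Compare each subgroup to overall performance against a documented threshold."""
    results = []
    for name, value in sorted(by_subgroup.items()):
        gap = overall - value  # positive when the subgroup underperforms
        results.append(SubgroupResult(name, value, gap, gap <= max_gap))
    return results


results = evaluate_subgroups(
    overall=0.90,
    by_subgroup={"group_a": 0.88, "group_b": 0.80},
    max_gap=0.05,
)
for r in results:
    status = "within threshold" if r.within_threshold else "outside threshold"
    print(f"{r.subgroup}: metric={r.metric:.2f}, {status}")
```

The code only surfaces which subgroups fall outside the threshold. Deciding whether a flagged gap is acceptable, and recording why, remains a documented human decision.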

Regulatory cross-reference section

A regulatory cross-reference section maps model documentation elements to specific regulatory requirements. This section serves two purposes: it demonstrates to reviewers that the documentation was designed with the regulatory framework in mind, and it provides a quick reference for compliance teams checking documentation completeness.

For a credit decisioning model at a bank, a cross-reference table might look like:

Regulatory requirement	Source	Documentation section(s)
Conceptual soundness documentation	SR 11-7, Section V	Model architecture; Training methodology; Known limitations
Outcomes analysis	SR 11-7, Section V	Evaluation results; Benchmark evaluation methodology
Disparate impact analysis	Regulation B, 12 CFR Part 1002	Protected class evaluation; Demographic parity results
Model limitations disclosure	SR 11-7, Section III	Known limitations; Out-of-scope use cases; Deployment restrictions

This table doesn't add new information — it organizes existing documentation in a way that makes compliance review more efficient. A model risk committee member can scan it to verify that each regulatory requirement has corresponding documentation before reading the full model card.
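Because the table only maps requirements to existing sections, the completeness check it supports can be automated. The sketch below assumes the model card is available as a dictionary of section titles to text; the function name and that representation are assumptions for illustration.

```python
# Requirement -> documentation sections, taken from the example table above.
CROSS_REFERENCE = {
    "Conceptual soundness documentation": [
        "Model architecture", "Training methodology", "Known limitations",
    ],
    "Outcomes analysis": [
        "Evaluation results", "Benchmark evaluation methodology",
    ],
    "Disparate impact analysis": [
        "Protected class evaluation", "Demographic parity results",
    ],
    "Model limitations disclosure": [
        "Known limitations", "Out-of-scope use cases", "Deployment restrictions",
    ],
}


def missing_documentation(
    cross_reference: dict[str, list[str]], card_sections: dict[str, str]
) -> dict[str, list[str]]:
    """Return, per requirement, the referenced sections that are absent or empty."""
    gaps = {}
    for requirement, sections in cross_reference.items():
        missing = [s for s in sections if not card_sections.get(s, "").strip()]
        if missing:
            gaps[requirement] = missing
    return gaps
```

An empty result means every requirement points at a non-empty section. It says nothing about whether the content is adequate, which is still the reviewer's call.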

Automation: what can be auto-populated

A significant portion of a regulated-industry model card can be auto-populated from pipeline metadata if the training pipeline is properly instrumented. Cognify's model card generation uses the following auto-population approach:

Auto-populated from pipeline metadata:

Model version identifier, base model identification, fine-tuning approach
Training data record counts, dataset version hashes, dataset registration timestamps
Training hyperparameters (learning rate, batch size, training epochs, optimizer)
Hardware configuration, training compute
All quantitative evaluation results — metrics are pulled directly from logged eval records, with links to the evaluation dataset version used
Training artifact hashes (checkpoint hashes for the final model artifact)
Lineage graph — dataset version → training run → model artifact, with timestamps

Human-authored sections:

Intended use statement — requires human judgment about appropriate deployment contexts
Out-of-scope uses — specific exclusions that reflect business and regulatory decisions
Known limitations narrative — interpretation of what the quantitative results imply about failure modes
Data provenance attestation — requires a named attesting party
Ethical considerations narrative — requires human judgment about context-specific concerns
Deployment scope limitations — reflects business and compliance decisions about where the model may be used
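The division above can be sketched as a draft builder that copies facts from a training run record and leaves every human-authored section explicitly empty. This is not Cognify's implementation: the metadata keys, the function names, and the fingerprint over the auto-populated content are all assumptions made for the example.

```python
import hashlib
import json

# Sections that must be written by a person, per the list above.
HUMAN_AUTHORED = (
    "intended_use",
    "out_of_scope_uses",
    "known_limitations",
    "data_provenance_attestation",
    "ethical_considerations",
    "deployment_scope_limitations",
)


def build_draft_card(run: dict) -> dict:
    """Build a draft card from a (hypothetical) training run metadata record."""
    auto = {
        "model_version": run["model_version"],
        "base_model": run["base_model"],
        "hyperparameters": run["hyperparameters"],
        "dataset_version_hash": run["dataset_version_hash"],
        "training_record_count": run["training_record_count"],
        "evaluation_results": run["evaluation_results"],
        "artifact_sha256": run["artifact_sha256"],
        # Lineage edges: dataset version -> training run -> model artifact.
        "lineage": [
            ["dataset:" + run["dataset_version_hash"], "run:" + run["run_id"]],
            ["run:" + run["run_id"], "model:" + run["artifact_sha256"]],
        ],
    }
    # Fingerprint the auto-populated content so later manual edits are detectable.
    canonical = json.dumps(auto, sort_keys=True).encode("utf-8")
    return {
        "auto_populated": auto,
        "auto_fingerprint": hashlib.sha256(canonical).hexdigest(),
        "human_authored": {name: None for name in HUMAN_AUTHORED},
    }


def pending_sections(card: dict) -> list[str]:
    """List the human-authored sections that nobody has written yet."""
    return [name for name, text in card["human_authored"].items() if text is None]
```

Leaving human sections as `None` rather than generating placeholder text keeps an unfinished card from being mistaken for a finished one.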

The sections that still need humans

Auto-population covers the factual and quantitative content of a model card efficiently. But several sections can't be meaningfully generated from pipeline metadata alone because they require judgment rather than facts.

The intended use statement is the clearest example. A training pipeline knows what the model was trained on and how it performed on evaluation benchmarks. It doesn't know the business decision about what use cases are acceptable, what populations the model will be deployed against, or what downstream human oversight processes are in place. These decisions require a human — typically a product owner and compliance officer working together — to articulate.

The same applies to the known limitations narrative. The pipeline can surface that the model's performance on a specific demographic subgroup is 8% below its overall performance. It cannot determine whether that gap is within acceptable bounds for the specific deployment context, what the root cause is likely to be, or what operational safeguards are appropriate. The narrative interpretation of quantitative results requires human expert judgment.

The workflow we find effective is: generate the auto-populated sections first (typically in a few minutes via API call to Cognify's model card generation endpoint), then route the draft to the model owner and compliance officer for the human-authored sections. The auto-populated sections provide the factual substrate that makes human authoring faster. The compliance officer writing the regulatory cross-reference section isn't starting from a blank page; they're reviewing auto-generated content and filling in the sections that require their judgment.
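The routing step can be sketched as a lookup from unfinished sections to the roles responsible for them. The ownership assignments below are hypothetical; each organization would set its own, and nothing here reflects an actual Cognify API.

```python
from collections import defaultdict

# Hypothetical ownership map: which role(s) author each human-written section.
SECTION_OWNERS = {
    "intended_use": ["model_owner", "compliance_officer"],
    "out_of_scope_uses": ["model_owner", "compliance_officer"],
    "known_limitations": ["model_owner"],
    "data_provenance_attestation": ["compliance_officer"],
    "ethical_considerations": ["model_owner"],
    "deployment_scope_limitations": ["compliance_officer"],
}


def outstanding_by_role(human_sections: dict) -> dict[str, list[str]]:
    """Group unfinished human-authored sections by the role that must write them."""
    tasks = defaultdict(list)
    for section, owners in SECTION_OWNERS.items():
        if not human_sections.get(section):  # missing, None, or empty text
            for role in owners:
                tasks[role].append(section)
    return dict(tasks)


def ready_for_approval(human_sections: dict) -> bool:
    # The card can move to sign-off only when no role has work outstanding.
    return not outstanding_by_role(human_sections)
```

Each role gets a short list of sections to write, and the card only advances to approval once every list is empty.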
