LLM Model Risk Management at Banks: What SR 11-7 Requires in the Age of Fine-Tuned Models

Ingrid Holst

Head of Compliance Strategy, Cognify

SR 11-7, the Federal Reserve's supervisory guidance on model risk management issued in April 2011, was written when "model" meant something reasonably specific: a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories to process inputs into quantitative outputs. Credit scoring models. Interest rate risk models. Market risk VaR calculations. The guidance was comprehensive, well-reasoned, and — with respect to the technology of 2011 — complete.

A fine-tuned large language model used in credit underwriting is a different kind of object. It has billions of parameters, was trained on data that may include the bank's own historical decisioning data, can produce outputs that vary based on subtle phrasing differences in inputs, and cannot be fully explained in the way a logistic regression can. Whether SR 11-7 "applies" to such a model is no longer an open question at most large banks — it does. What remains genuinely unclear is how its specific requirements translate into practice for a model type the guidance authors didn't contemplate.

This post works through the main SR 11-7 pillars and what each one requires of a bank using a fine-tuned LLM in a material business process.

SR 11-7 and the LLM scope question

SR 11-7's definition of "model" is broad enough to capture most uses of fine-tuned LLMs in regulated financial contexts. The guidance defines a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates, and it extends that definition to approaches whose inputs are partially or wholly qualitative. An LLM that processes loan application narratives and outputs risk assessments is being used in decision-making, even if those outputs are text rather than a numeric score. In practice, it is treated as falling within scope.

The OCC's 2021 Comptroller's Handbook booklet on model risk management and interagency statements on AI have reinforced the view that the existing model risk management framework applies to AI/ML models, including those based on complex neural architectures. Banks are not starting from a regulatory blank slate. The question is implementation, not applicability.

Some banks have tried to argue that a fine-tuned LLM is merely a "tool" used by human reviewers, not a model in the SR 11-7 sense, if the human makes the final credit decision. This argument has weakened as examiner expectations have sharpened. If the model's output materially influences the human decision — and if the model is systematically applied across a population of applicants — MRM examiners are treating it as a model.

Conceptual soundness for fine-tuned LLMs

SR 11-7's model development standards require documentation of conceptual soundness — essentially, a theoretical basis for why the model's methodology is appropriate for its intended use. For a logistic regression credit model, this is tractable: you can articulate the assumptions (linearity of log-odds, independence of observations), cite the theoretical backing for those assumptions, and document where the model's assumptions may deviate from reality.

For a fine-tuned LLM, conceptual soundness documentation has to address several harder questions:

Base model selection rationale. Why was this foundation model (say, a specific instruct-tuned version of a large language model) chosen for this task? What properties of the base model's pretraining make it appropriate for processing credit-relevant text? What known limitations of the base model — factual hallucination tendencies, sensitivity to prompt format, performance degradation on out-of-distribution inputs — are relevant to the deployment context?

Fine-tuning data representativeness. Was the fine-tuning corpus representative of the population on which the model will be applied? If the bank fine-tuned on historical credit applications from a particular geographic market or time period, what are the extrapolation risks when applying the model to a broader or more current applicant pool?

Output interpretation. How do the model's text outputs map to credit risk categories? If the model produces a narrative risk assessment, how is that assessment being converted into a decision-relevant signal? This conversion — from generative text to a structured decision input — is itself a process that needs to be documented and validated.
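One way to make the text-to-signal conversion documentable is to constrain it. The sketch below assumes a hypothetical convention in which the fine-tuned model is instructed to end its narrative with a line such as `RISK_TIER: MODERATE`; the tier names, the line format, and the function name are illustrative assumptions, not something SR 11-7 or any vendor prescribes. Anything missing, contradictory, or outside the allowed set returns `None` so it can be routed to manual review rather than guessed at.

```python
import re
from typing import Optional

# Hypothetical convention: the model ends its narrative with a line
# such as "RISK_TIER: MODERATE". Tier names are illustrative.
ALLOWED_TIERS = {"LOW", "MODERATE", "HIGH"}
TIER_PATTERN = re.compile(r"^RISK_TIER:[ \t]*([A-Za-z]+)[ \t]*$", re.MULTILINE)


def extract_risk_tier(narrative: str) -> Optional[str]:
    """Return the single declared tier, or None if the output is
    missing a tier, declares conflicting tiers, or uses an unknown one."""
    declared = {match.upper() for match in TIER_PATTERN.findall(narrative)}
    if len(declared) != 1:
        return None  # no tier, or contradictory tiers: send to manual review
    tier = declared.pop()
    return tier if tier in ALLOWED_TIERS else None
```

Because the parser is deterministic and small, it can be validated separately from the model, and its rejection rate becomes a metric worth monitoring in its own right.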

Outcomes analysis for generative models

Traditional model validation relies heavily on backtesting and outcomes analysis: comparing model predictions against actual outcomes over a holdout period. A credit model predicts default probability; you wait and observe default rates; you compare. The feedback loop is clear.

For an LLM generating text outputs, the outcomes analysis question is more complex. Consider a bank using a fine-tuned LLM to generate credit narrative summaries for human reviewers. The "outcome" isn't the model's text; it's the human decision that followed. Isolating the model's contribution to that decision, and linking it to downstream loan performance, requires explicit tracking of which model version produced which summary for which application, and then correlating that record with loan performance data.

This tracking requirement is non-trivial. It means every model inference in a decision context needs to be logged: which model version was called, on which input, producing which output, feeding which downstream decision. Without this logging, backtesting is impossible — you can't link model outputs to outcomes if you don't know which model version processed which application.
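A minimal sketch of what one such log record might contain follows. The field names and the choice to store SHA-256 hashes of the input and output (with the full text held in a separate, access-controlled store) are assumptions made for illustration; a bank's own retention and privacy policies would drive the real design.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class InferenceRecord:
    application_id: str   # which application was processed
    model_version: str    # which fine-tuned version was called
    input_sha256: str     # fingerprint of the exact input text
    output_sha256: str    # fingerprint of the exact output text
    decision_id: str      # downstream decision the output fed into
    logged_at: str        # UTC timestamp in ISO 8601 form


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_record(application_id: str, model_version: str,
                 model_input: str, model_output: str,
                 decision_id: str) -> InferenceRecord:
    return InferenceRecord(
        application_id=application_id,
        model_version=model_version,
        input_sha256=sha256_text(model_input),
        output_sha256=sha256_text(model_output),
        decision_id=decision_id,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record: InferenceRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Joining these records to loan performance data on `application_id` and `decision_id` is what makes version-level backtesting possible later.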

SR 11-7 requires ongoing performance monitoring precisely because models degrade. For LLMs, degradation can happen through concept drift (the distribution of inputs shifts over time, moving away from the fine-tuning distribution), through behavioral drift in the underlying base model if it's updated, or through changes in how humans are interpreting and using the model's outputs. Monitoring needs to be designed with these LLM-specific failure modes in mind.

Ongoing monitoring requirements

The ongoing monitoring requirements in SR 11-7 are among the most operationally demanding for LLM deployments. The guidance requires periodic model performance review, sensitivity analysis, and benchmarking against alternative approaches. For statistical models with numeric outputs, these requirements map onto established quantitative processes. For LLMs, banks are building new monitoring practices.

Several approaches are emerging in practice:

Output distribution monitoring. Track the distribution of model outputs over time — in the case of text outputs, this might mean monitoring embeddings of outputs for distributional shift, or tracking higher-level metrics like assessment sentiment distribution, length, or category frequencies. Sudden shifts in output distribution can signal that the model is encountering input types it wasn't fine-tuned on.
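For the category-frequency case, one familiar option from credit model monitoring is the population stability index (PSI), computed here over output categories rather than score bins. This is a sketch under assumptions: the category labels are hypothetical, and the small floor applied to empty categories is an arbitrary choice to keep the logarithm defined. Thresholds such as 0.1 and 0.25 are commonly cited rules of thumb, not something SR 11-7 specifies.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def category_shares(labels: Sequence[str], categories: Sequence[str],
                    floor: float = 1e-6) -> Dict[str, float]:
    """Share of each category, floored so that log() stays defined."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: max(counts[c] / len(labels), floor) for c in categories}


def population_stability_index(baseline: Sequence[str],
                               current: Sequence[str],
                               categories: Sequence[str]) -> float:
    """PSI = sum over categories of (cur - base) * ln(cur / base)."""
    base = category_shares(baseline, categories)
    cur = category_shares(current, categories)
    return sum((cur[c] - base[c]) * math.log(cur[c] / base[c])
               for c in categories)
```

A baseline window taken from the validation period and a rolling current window from production would be the usual inputs.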

Benchmark evaluation cadence. Maintain a held-out evaluation set of representative inputs with human-validated reference outputs. Run the model against this benchmark set on a scheduled cadence (monthly or quarterly) to detect capability drift. The evaluation set itself needs to be version-controlled — if it changes, the continuity of the performance monitoring series is broken.
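The version-control point can be enforced mechanically. The sketch below fingerprints the evaluation set and refuses to report a score if the set no longer matches the fingerprint recorded when the monitoring series began. The record layout (`id`, `input`, `reference_tier`) and the exact-match agreement metric are illustrative assumptions; a real benchmark for narrative outputs would likely need a richer scoring method.

```python
import hashlib
import json
from typing import Callable, Dict, List


def benchmark_fingerprint(examples: List[Dict[str, str]]) -> str:
    """Order-independent SHA-256 over the evaluation set content."""
    canonical = json.dumps(sorted(examples, key=lambda e: e["id"]),
                           sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def agreement_rate(examples: List[Dict[str, str]],
                   predict: Callable[[str], str]) -> float:
    """Fraction of examples where the model matches the human reference."""
    hits = sum(1 for e in examples
               if predict(e["input"]) == e["reference_tier"])
    return hits / len(examples)


def run_benchmark(examples: List[Dict[str, str]],
                  predict: Callable[[str], str],
                  expected_fingerprint: str) -> Dict[str, object]:
    actual = benchmark_fingerprint(examples)
    if actual != expected_fingerprint:
        # A changed set would silently break the monitoring series.
        raise ValueError("evaluation set changed; results are not comparable")
    return {"fingerprint": actual,
            "agreement_rate": agreement_rate(examples, predict)}
```

If the evaluation set does need to change, recording the new fingerprint and starting a new series keeps the break explicit.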

Fair lending checks. For models touching credit decisions, compliance also needs to monitor for disparate impact. Fair lending analysis of LLM-assisted decisions requires being able to attribute differences in outcomes to model behavior rather than to other factors, which in turn requires systematic logging of model versions and outputs at the decision level.
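As a rough illustration of why version-level logging matters here, the sketch below computes an approval-rate ratio between each group and a reference group, separately for each model version. It is a screening statistic only and not a complete fair lending analysis; the record fields and group labels are hypothetical, and how group membership is determined is outside the scope of the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def approval_rates(records: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Approval rate per (model_version, group)."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["model_version"], r["group"])
        total[key] += 1
        approved[key] += 1 if r["approved"] else 0
    return {key: approved[key] / total[key] for key in total}


def adverse_impact_ratios(records: Iterable[dict],
                          reference_group: str) -> Dict[Tuple[str, str], float]:
    """Ratio of each group's approval rate to the reference group's,
    computed within each model version."""
    rates = approval_rates(records)
    ratios = {}
    for (version, group), rate in rates.items():
        if group == reference_group:
            continue
        ref_rate = rates.get((version, reference_group))
        if ref_rate:  # skip versions with no reference data or zero approvals
            ratios[(version, group)] = rate / ref_rate
    return ratios
```

Without the `model_version` field on each decision record, a shift in these ratios could not be tied to a specific retraining.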

Model inventory challenges

SR 11-7 requires banks to maintain a model inventory — a comprehensive record of all models in use, their status, validation history, and materiality tier. Fine-tuned LLMs create inventory challenges that banks are still working through.

First, versioning: a fine-tuned LLM can be retrained frequently. Each retraining produces a new model version. Are these separate inventory entries? They should be — each version has a distinct training history and potentially different risk characteristics. But managing the inventory of a model that might go through a dozen fine-tuning iterations per year is operationally demanding.

Second, base model changes: if the underlying base model is updated by its provider, and the bank is fine-tuning on top of it, the change in the base model may effectively create a new model even if the bank's fine-tuning process is unchanged. Inventory management needs to track base model version as a first-class attribute.

Third, prompt-tuned variants: some banks are using prompt engineering or soft prompt tuning rather than full fine-tuning for some use cases. Whether these constitute separate model inventory entries is a question MRM teams are currently wrestling with. The answer likely depends on whether the behavioral differences between prompt-tuned variants are material to the risk profile of the model's outputs.
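The three challenges above suggest what an inventory key might need to carry. The following is a sketch, not a standard schema: every field name is an assumption, and whether prompt variants count toward identity is left as an explicit switch because, as noted, that question is still open at most banks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InventoryEntry:
    model_id: str                        # stable identifier for the use case
    fine_tune_version: str               # changes on every retraining
    base_model: str                      # provider's model name
    base_model_version: str              # tracked as a first-class attribute
    prompt_template_sha256: Optional[str]  # None if no managed prompt
    materiality_tier: str
    validation_status: str


def needs_new_entry(existing: InventoryEntry, candidate: InventoryEntry,
                    track_prompt_variants: bool = True) -> bool:
    """True if the candidate differs in any attribute treated as
    identity-defining under the bank's (assumed) inventory policy."""
    def identity(entry: InventoryEntry) -> tuple:
        key = (entry.model_id, entry.fine_tune_version,
               entry.base_model, entry.base_model_version)
        if track_prompt_variants:
            key += (entry.prompt_template_sha256,)
        return key
    return identity(existing) != identity(candidate)
```

Under this sketch, a provider-side base model update triggers a new entry even when the fine-tuning version string is unchanged.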

Documentation standards the MRM team expects

Model risk management teams at banks have developed detailed documentation templates for statistical models over years of practice. These templates are starting to be adapted for LLMs, but the adaptation is uneven. The documentation that a bank's MRM team typically wants for a fine-tuned LLM in a material use case includes:

  • Model purpose and intended use statement — precise about scope, not aspirational
  • Training data documentation — source, vintage, record counts, filtering criteria, de-identification steps if applicable
  • Fine-tuning methodology — training framework, hyperparameters, training compute, convergence criteria
  • Evaluation methodology and results — benchmark datasets, metrics, performance against each benchmark, comparison to baseline
  • Known limitations — out-of-scope inputs, performance degradation conditions, known failure modes
  • Monitoring plan — what metrics will be tracked, at what frequency, with what thresholds for escalation
  • Change management procedures — what constitutes a material model change requiring re-validation

This documentation package needs to be produced for each model version, linked to specific training artifacts (dataset hashes, checkpoint records), and retained for the period specified in the bank's model risk policy — typically three to seven years.
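Linking documentation to training artifacts can be as simple as a manifest emitted at the end of each training run. The sketch below hashes the dataset and checkpoint files and records them next to the training configuration; the file layout, field names, and function names are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_version: str, dataset_path: Path,
                   checkpoint_path: Path,
                   training_config: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the identifiers that tie a model version to its artifacts."""
    return {
        "model_version": model_version,
        "dataset_sha256": file_sha256(dataset_path),
        "checkpoint_sha256": file_sha256(checkpoint_path),
        "training_config": training_config,
    }


def write_manifest(manifest: Dict[str, Any], out_path: Path) -> None:
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

The documentation package for a version can then cite the manifest rather than restating hashes by hand.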

The base model inheritance problem

A bank does not need to validate every property of the base model; that would be impractical. But the inheritance relationship between a commercial foundation model and a bank's fine-tuned version creates a documentation challenge that can't be ignored.

When a bank fine-tunes a commercial LLM and deploys the resulting model in a credit decision context, it inherits the risks of the base model to the extent that the fine-tuning doesn't fully suppress them. If the base model has documented hallucination tendencies on certain input types, the fine-tuned version may inherit those tendencies unless specifically addressed in fine-tuning or evaluated against them.

MRM teams are increasingly requiring model owners to document two things: what due diligence was done on the base model's risk characteristics before fine-tuning was initiated, and what evaluation was done post-fine-tuning to assess whether the base model's known limitations are present in the fine-tuned version. This is new territory, with no established standards for base model due diligence in a banking context, but the documentation expectation is real and growing.

For ML teams at banks building fine-tuning infrastructure today, the practical implication is that every model version needs a complete artifact trail — training data version, training configuration, evaluation results against a stable benchmark set — that supports the MRM team's validation and ongoing monitoring work. Building that trail as an organic output of the training pipeline, rather than as a retroactive documentation exercise, is the difference between a smooth validation cycle and one that requires weeks of archaeology before the MRM team can begin its work.