Adding an Audit Trail to Your Hugging Face Fine-Tuning Pipeline

Callum Reeves

Lead ML Infrastructure Engineer, Cognify

The Hugging Face Trainer class is the de facto standard for fine-tuning transformer models in production ML pipelines. Its callback system — the TrainerCallback protocol — is the right hook point for compliance instrumentation because callbacks fire at every significant event in the training lifecycle: run start, epoch boundaries, evaluation steps, checkpoint saves, and run end.

This post is a step-by-step walkthrough of adding Cognify's audit trail to an existing Hugging Face Trainer-based fine-tuning loop. It assumes you have a working training script using transformers.Trainer or SFTTrainer from the trl library. The instrumentation adds approximately 15 lines of code across your training script, and takes under 30 minutes from start to first compliance package export.

Prerequisites

You'll need:

  • Python 3.9+ with transformers >= 4.36
  • An existing training script using Trainer or SFTTrainer
  • A Cognify account and workspace API key (available at fyntuneq.com/login/signup)
  • Your training dataset accessible as an Arrow/Parquet file, JSONL, or Hugging Face Dataset object

We're assuming you're already logging to W&B or MLflow for experiment tracking. Cognify runs alongside those tools — it doesn't replace them.

Step 1: Installation and workspace setup

Install the SDK:

pip install cognify-sdk

Initialize your workspace in the training script. This goes before your dataset loading code:

import os

import cognify

cognify.init(
    workspace_id="your-workspace-id",
    api_key=os.environ["COGNIFY_API_KEY"],
    project="clinical-nlp-v2",
    run_tags={"team": "ml-platform", "use_case": "discharge-summary"}
)

The workspace_id and project fields structure how runs appear in the compliance dashboard. Projects map to a single model purpose — all fine-tuning runs for a given model use case should share a project identifier so that version lineage is visible across runs.

Step 2: Dataset registration

Before loading your dataset into the Trainer, register it with Cognify. This is where dataset versioning and hash capture happen:

from datasets import load_from_disk

# Load your dataset as usual
train_dataset = load_from_disk("/data/clinical-notes-v3.1/train")
eval_dataset  = load_from_disk("/data/clinical-notes-v3.1/eval")

# Register with Cognify — this computes SHA-256 hashes and
# stores the versioned snapshot record
cgnf_train_ds = cognify.dataset(
    dataset=train_dataset,
    name="clinical-notes",
    version="3.1",
    source_path="/data/clinical-notes-v3.1/train",
    provenance={
        "source_system": "ehr-extraction-pipeline",
        "extraction_date": "2025-04-15",
        "deidentification_method": "safe_harbor",
        "authorized_use": "discharge-summary-fine-tuning",
        "record_count": len(train_dataset),
    }
)

cgnf_eval_ds = cognify.dataset(
    dataset=eval_dataset,
    name="clinical-notes-eval",
    version="3.1",
    source_path="/data/clinical-notes-v3.1/eval",
    provenance={
        "source_system": "ehr-extraction-pipeline",
        "extraction_date": "2025-04-15",
        "deidentification_method": "safe_harbor",
        "authorized_use": "discharge-summary-fine-tuning",
        "record_count": len(eval_dataset),
    }
)

The provenance dict is structured metadata that appears in the compliance package export. The fields are validated against a schema — if you omit required fields for your compliance tier, Cognify raises a warning. The deidentification_method and authorized_use fields are required for healthcare compliance configurations.
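If you want to catch missing fields before the registration call, a local pre-check is easy to write. The sketch below is an assumption about how such a check might look, not Cognify's schema: the tier names and the `REQUIRED_FIELDS` table are hypothetical, and only the two healthcare fields named above come from this post.

```python
import warnings

# Hypothetical required-field sets per compliance tier. Only
# deidentification_method and authorized_use are taken from this post;
# the real schema lives in Cognify and may differ.
REQUIRED_FIELDS = {
    "standard": {"source_system", "record_count"},
    "healthcare": {
        "source_system",
        "record_count",
        "deidentification_method",
        "authorized_use",
    },
}


def missing_provenance_fields(provenance: dict, tier: str) -> list[str]:
    """Return the required fields for `tier` that are absent or empty."""
    required = REQUIRED_FIELDS[tier]
    missing = sorted(
        name for name in required
        if provenance.get(name) in (None, "")
    )
    if missing:
        # Warn rather than raise, mirroring the behavior described above.
        warnings.warn(
            f"provenance is missing fields required for tier "
            f"{tier!r}: {', '.join(missing)}"
        )
    return missing
```

Running this in CI against the provenance dicts in your training script surfaces gaps before a run starts, rather than in the compliance review.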

Cognify computes a Merkle tree hash over the dataset records and stores the root hash with the provenance metadata. If the same dataset path is registered in a later run and the hash doesn't match, Cognify flags a dataset change and prompts for diff documentation.
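To make the idea concrete, here is a minimal sketch of a Merkle root over dataset records using SHA-256. It is one common construction, written under our own assumptions (canonical JSON encoding, duplicating the last node on odd levels, leaf and interior prefixes); Cognify's actual encoding and tree layout are not specified here, so the roots it produces will not match the ones in your dashboard.

```python
import hashlib
import json


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def record_leaf(record: dict) -> bytes:
    """Hash one record using a canonical JSON encoding (sorted keys)."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    # A 0x00 prefix marks leaf hashes so they can't be confused with
    # interior nodes, which use 0x01.
    return _sha256(b"\x00" + encoded.encode("utf-8"))


def merkle_root(records: list[dict]) -> str:
    """Return the hex Merkle root over `records`, in their given order."""
    if not records:
        return _sha256(b"").hex()
    level = [record_leaf(r) for r in records]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            left = level[i]
            # With an odd number of nodes, the last one is paired with itself.
            right = level[i + 1] if i + 1 < len(level) else left
            next_level.append(_sha256(b"\x01" + left + right))
        level = next_level
    return level[0].hex()
```

Comparing `merkle_root(old_records)` with `merkle_root(new_records)` is enough to detect that something changed. The tree structure becomes useful when you also want to locate which records changed without rehashing everything.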

Step 3: Cognify Trainer callback

The CognifyTrainerCallback hooks into the Hugging Face TrainerCallback protocol. It captures hyperparameters, checkpoint metadata, and training metrics at each callback event:

from cognify.integrations.huggingface import CognifyTrainerCallback

# Initialize the callback, linking it to your registered datasets
cognify_callback = CognifyTrainerCallback(
    train_dataset_ref=cgnf_train_ds,
    eval_dataset_ref=cgnf_eval_ds,
    capture_hyperparams=True,    # logs TrainingArguments fields
    capture_checkpoints=True,    # hashes checkpoint dirs on save
    capture_system_info=True     # logs GPU type, CUDA version, etc.
)

# Pass to Trainer as usual
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[cognify_callback],  # add alongside any existing callbacks
    # ... rest of your Trainer config
)

The callback implements on_init_end to capture training configuration, on_save to hash checkpoint directories, on_evaluate to record eval metrics, on_log to record per-step metrics, and on_train_end to finalize the run record. None of these hooks adds blocking work to the training loop: all Cognify writes are asynchronous, to avoid adding latency to the training step.
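If you are curious how hooks can record events without blocking the training step, the general pattern is a queue drained by a background thread. The sketch below illustrates that pattern only; it is not the CognifyTrainerCallback source. `AsyncEventRecorder` and its `sink` argument are hypothetical names. The hook names and the `(args, state, control, **kwargs)` signature match transformers.TrainerCallback, and in a real script the class would subclass TrainerCallback so the hooks it doesn't define fall back to no-ops.

```python
import queue
import threading


class AsyncEventRecorder:
    """Queues audit events and writes them from a background thread."""

    def __init__(self, sink):
        self._sink = sink                  # callable that performs the slow write
        self._events = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        while True:
            event = self._events.get()
            if event is None:              # sentinel: stop the worker
                break
            self._sink(event)

    def _record(self, kind, **payload):
        # put() returns immediately, so the training step is not blocked.
        self._events.put({"event": kind, **payload})

    # Same hook names and signatures as transformers.TrainerCallback.
    def on_init_end(self, args, state, control, **kwargs):
        self._record("init", output_dir=args.output_dir)

    def on_save(self, args, state, control, **kwargs):
        self._record("checkpoint", step=state.global_step)

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        self._record("eval", step=state.global_step, metrics=dict(metrics or {}))

    def on_train_end(self, args, state, control, **kwargs):
        self._record("end", step=state.global_step)
        self._events.put(None)
        self._worker.join()                # flush pending writes before exit
```

The only blocking call is the join in on_train_end, which runs after the last training step and ensures no event is lost when the process exits.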

If you're using SFTTrainer from the trl library, the same callback works — SFTTrainer inherits from Trainer and supports the full callback protocol.

Step 4: Eval benchmark logging

If you run evaluation against compliance-relevant benchmarks (domain-specific safety evals, fairness benchmarks, or custom holdout sets required by your compliance policy), you can log those results separately from the Trainer's built-in evaluation:

import cognify

# After running your benchmark evaluation suite
cognify.eval(
    benchmark_name="clinical-safety-v2",
    benchmark_version="2.3",
    results={
        "accuracy": 0.921,
        "f1_macro": 0.887,
        "false_negative_rate": 0.041,
        "demographic_parity_gap": 0.018,
    },
    eval_dataset_ref=cgnf_eval_ds,
    notes="Evaluated on held-out discharge notes from Q1 2025. "
          "FNR within policy threshold of 0.05."
)
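The notes string above asserts that the false negative rate is within a policy threshold. You may prefer to check that in code before logging, so the claim in the notes is backed by an assertion. A minimal sketch follows. The `POLICY` table and `policy_violations` helper are hypothetical and not part of the Cognify SDK, and only the 0.05 FNR limit comes from the example above; the other bounds are placeholders.

```python
# Hypothetical policy limits: metric name -> (comparison, bound).
# Only the false_negative_rate bound is taken from the example above.
POLICY = {
    "false_negative_rate": ("max", 0.05),
    "demographic_parity_gap": ("max", 0.02),
    "accuracy": ("min", 0.90),
}


def policy_violations(results: dict, policy: dict) -> list[str]:
    """Return a message for each metric that is missing or out of bounds."""
    problems = []
    for metric, (kind, bound) in policy.items():
        if metric not in results:
            problems.append(f"{metric}: not reported")
            continue
        value = results[metric]
        if kind == "max" and value > bound:
            problems.append(f"{metric}: {value} exceeds maximum {bound}")
        elif kind == "min" and value < bound:
            problems.append(f"{metric}: {value} is below minimum {bound}")
    return problems
```

An empty list means every policy metric was reported and within bounds. A non-empty list is a reasonable thing to include in the notes field, or to fail the pipeline on.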

These eval records are stored as first-class objects in the run lineage graph — linked to the specific dataset version they were evaluated against, not just attached as metadata to the training run. This distinction matters for compliance: if the same model version is evaluated against an updated benchmark in a later period, the two eval records are separate nodes in the lineage graph, and the compliance reviewer can see both.

Step 5: Reviewing lineage in the dashboard

When the training run completes, the Cognify dashboard shows the full lineage graph: dataset snapshot → training configuration → checkpoints → eval results → model artifact. Each node is timestamped and linked.
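As a mental model, the lineage graph is a directed acyclic graph in which each node points at the nodes it was derived from. The sketch below is our own illustration of that structure, not Cognify's data model: the `Node` and `LineageGraph` classes, the node id format, and the timestamps are all placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str          # e.g. "dataset", "config", "checkpoint", "eval"
    timestamp: str     # ISO 8601
    parents: list = field(default_factory=list)  # ids of upstream nodes


class LineageGraph:
    def __init__(self):
        self.nodes = {}

    def add(self, node_id, kind, timestamp, parents=()):
        # Parents must already exist, which keeps the graph acyclic.
        missing = [p for p in parents if p not in self.nodes]
        if missing:
            raise KeyError(f"unknown parent nodes: {missing}")
        self.nodes[node_id] = Node(node_id, kind, timestamp, list(parents))

    def ancestors(self, node_id):
        """Return the ids of every node upstream of `node_id`."""
        seen, stack = set(), list(self.nodes[node_id].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current].parents)
        return seen


# Placeholder ids and timestamps, for illustration only.
graph = LineageGraph()
graph.add("ds:clinical-notes@3.1", "dataset", "2025-05-01T09:00:00Z")
graph.add("ds:clinical-notes-eval@3.1", "dataset", "2025-05-01T09:00:05Z")
graph.add("cfg:run_abc123", "config", "2025-05-01T09:01:00Z",
          parents=["ds:clinical-notes@3.1"])
graph.add("ckpt:step-500", "checkpoint", "2025-05-01T10:30:00Z",
          parents=["cfg:run_abc123"])
graph.add("eval:clinical-safety-v2@2.3", "eval", "2025-05-01T11:00:00Z",
          parents=["ckpt:step-500", "ds:clinical-notes-eval@3.1"])
```

Calling `graph.ancestors("eval:clinical-safety-v2@2.3")` walks back through the checkpoint and configuration to both dataset snapshots, which is the question a reviewer is asking when they click on an eval result.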

Compliance reviewers access this view with read-only credentials — they don't need access to your training environment or your experiment tracking tool. The dashboard surfaces: dataset hash and provenance metadata, hyperparameter summary, training metrics time series, eval results with benchmark version, and checkpoint inventory with SHA-256 hashes.

The reviewer can annotate specific nodes (requesting clarification on a data source, flagging a metric for discussion) and the annotations are linked to the run record. This keeps the review conversation attached to the evidence it's about, rather than living in a separate email thread or Jira ticket.

Step 6: Exporting the compliance package

When the compliance reviewer approves the run, the audit package becomes available for export. Export it from the dashboard or with the CLI:

cognify export \
  --run-id run_abc123 \
  --format pdf \
  --template healthcare \
  --output ./audit-packages/discharge-summary-v3.1-audit.pdf

The healthcare template generates a structured PDF that includes: model identification, intended use statement, training data provenance attestation (with de-identification method and authorization record), training configuration, evaluation results against each registered benchmark, known limitations, and the approval record with reviewer name, role, and timestamp.

The JSON export (--format json) produces a machine-readable compliance record that can be ingested into GRC systems (Vanta, Drata, ServiceNow) or archived alongside your model artifact.

What gets captured and what doesn't

The CognifyTrainerCallback captures everything available through the Trainer callback interface: TrainingArguments fields, optimizer state at init, learning rate schedule type, checkpoint paths and hashes, per-step metrics passed to on_log, and per-evaluation results. What it doesn't capture automatically:

  • Model weights content. Cognify stores checkpoint SHA-256 hashes, not the weights themselves. Your weights stay in your artifact store.
  • Dataset content. Cognify stores Merkle root hashes and provenance metadata. The actual training records stay in your data store.
  • Custom metrics from outside the Trainer loop. If you run evaluation outside the Trainer.evaluate() call, use cognify.eval() to log those results explicitly (as shown in Step 4).
  • Pre-processing and tokenization configuration. If your tokenizer configuration or preprocessing steps are compliance-relevant (for example, if you applied custom filtering to remove certain record types), log those as provenance metadata on the dataset registration call.
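Because the weights stay in your artifact store, it can be useful to recompute a checkpoint hash yourself and confirm the directory hasn't drifted since the run. The sketch below shows one deterministic way to hash a directory with SHA-256. The exact scheme Cognify uses for checkpoint directories isn't described in this post, so treat `hash_directory` as a hypothetical helper whose output will not necessarily match the dashboard value.

```python
import hashlib
from pathlib import Path


def hash_directory(root: str) -> str:
    """Return a SHA-256 hex digest over every file under `root`.

    Files are visited in sorted relative-path order and each path is mixed
    into the digest, so renaming or moving a file changes the result.
    """
    base = Path(root)
    digest = hashlib.sha256()
    files = sorted(
        (p for p in base.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(base).as_posix(),
    )
    for path in files:
        rel = path.relative_to(base).as_posix().encode("utf-8")
        # Length prefixes keep path bytes and content bytes unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(path.stat().st_size.to_bytes(8, "big"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large weight files aren't loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Storing this digest next to the checkpoint at save time, and recomputing it before deployment, gives you a cheap integrity check that doesn't depend on any external service.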

The callback-based approach does not capture every possible signal about a training run. For complex pipelines with multi-stage data processing, you'll want to instrument the data pipeline steps explicitly using cognify.dataset() at each stage. The callback handles the Trainer lifecycle; you own the data pipeline instrumentation.
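One lightweight way to own that instrumentation is to record what each preprocessing stage did to the record count, then attach the resulting log to the provenance metadata when you register the stage's output. The sketch below assumes your stages are plain functions over a list of records. `run_stages` and the two filters are hypothetical, and whether the provenance schema for your tier accepts an extra field for the stage log is something to verify.

```python
def run_stages(records, stages):
    """Apply each (name, function) stage and log record counts around it."""
    log = []
    for name, stage in stages:
        before = len(records)
        records = stage(records)
        log.append(
            {"stage": name, "records_in": before, "records_out": len(records)}
        )
    return records, log


def drop_empty_text(records):
    # Remove records whose text is empty or whitespace only.
    return [r for r in records if r["text"].strip()]


def drop_short_notes(records, min_chars=20):
    # Remove records too short to be a meaningful note.
    return [r for r in records if len(r["text"]) >= min_chars]


STAGES = [
    ("drop_empty_text", drop_empty_text),
    ("drop_short_notes", drop_short_notes),
]
```

The returned log is a list of plain dicts, so it serializes cleanly and shows a reviewer how many records each filter removed.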

The 30-minute estimate for initial integration is realistic for a standard Trainer-based pipeline. Pipelines with multi-stage data assembly, custom training loops, or complex distributed training configurations (covered in our PyTorch FSDP guide) take longer to instrument correctly. But the pattern is the same: register datasets before they enter the training loop, attach the callback to the Trainer, log external evals explicitly. The compliance package is a byproduct of that instrumentation, not an additional step.