Instrumenting PyTorch FSDP Training with Cognify: Full Shard Lineage and Checkpoint Tracking

Callum Reeves

Lead ML Infrastructure Engineer, Cognify

PyTorch's Fully Sharded Data Parallel (FSDP) wrapper is the standard approach for training large models that don't fit on a single GPU. It shards model parameters, gradients, and optimizer states across all participating ranks, which means that between computation steps each GPU holds only a fraction of the model's parameters. This is excellent for memory efficiency, but it creates a specific challenge for compliance audit infrastructure: the checkpoint for any given training step is physically distributed across multiple files, and the relationship between those shard files and the logical model checkpoint is implicit rather than explicit.

Standard checkpoint logging — "save checkpoint to /checkpoints/step-5000" — doesn't capture the shard structure, so a compliance reviewer looking at a checkpoint record can't verify whether they have a complete and authentic checkpoint unless they understand FSDP's sharding semantics. This post covers how to instrument FSDP training with Cognify to produce a complete, verifiable checkpoint record that includes shard-level lineage.

The FSDP checkpoint problem for compliance

When you save a checkpoint during FSDP training, the output depends on which sharding strategy and state dict type you're using. With FSDP's default StateDictType.FULL_STATE_DICT, the sharded parameters are all-gathered into a complete, unsharded state dict. Paired with FullStateDictConfig(rank0_only=True), only rank 0 materializes that state dict and writes it, producing a single checkpoint file that looks like a standard PyTorch checkpoint. With StateDictType.LOCAL_STATE_DICT, each rank saves its own shard, producing N files (one per rank) that must all be present to reconstruct the full checkpoint.

For small models or small GPU clusters, FULL_STATE_DICT is common and the checkpoint record is straightforward. For large models (7B+ parameters) across many GPUs, FULL_STATE_DICT is often impractical because rank 0 doesn't have enough memory to gather the full model. LOCAL_STATE_DICT or the newer SHARDED_STATE_DICT (which writes shards in a format that can be efficiently loaded without gathering) are used instead.
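A rough back-of-the-envelope calculation shows why gathering becomes impractical. The sketch below assumes fp32 storage (4 bytes per value) and an Adam-style optimizer with two buffers per parameter. Both are illustrative assumptions rather than details from this post, and gradients and activations are ignored.

```python
def full_state_gb(num_params, bytes_per_param=4, optimizer_buffers=2):
    """Return (weights_gb, weights_plus_optimizer_gb) for an unsharded state.

    Assumes every parameter and every optimizer buffer takes bytes_per_param
    bytes, e.g. fp32 weights with Adam's two moment buffers.
    """
    weights = num_params * bytes_per_param
    optimizer = weights * optimizer_buffers
    return weights / 1e9, (weights + optimizer) / 1e9


weights_gb, total_gb = full_state_gb(7_000_000_000)
per_rank_gb = total_gb / 8  # ideal even split across 8 ranks

print(f"weights only:        {weights_gb:.1f} GB")
print(f"weights + optimizer: {total_gb:.1f} GB")
print(f"per rank (8 ranks):  {per_rank_gb:.1f} GB")
```

Under these assumptions the weights alone come to 28 GB and the full training state to 84 GB, against roughly 10.5 GB per rank when the state is sharded eight ways.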

The compliance implication: if your checkpoint consists of 8 shard files across 8 ranks and your compliance record says "checkpoint saved at step 5000," a reviewer cannot verify that all 8 shards are present and authentic without explicitly checking all 8 files. A complete compliance record needs to capture the shard manifest — the full list of shard files, their hashes, and the sharding configuration — not just the checkpoint directory path.
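As a minimal sketch of what such a shard manifest could look like, the function below hashes every expected shard file and refuses to produce a manifest if one is missing. The field names and the `shard-rank{N}.pt` naming are illustrative assumptions, not Cognify's actual record schema.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_shard_manifest(checkpoint_dir, world_size, step,
                         sharding_strategy="FULL_SHARD",
                         state_dict_type="SHARDED_STATE_DICT"):
    """Describe every expected shard file; fail loudly if any rank's file is absent."""
    shards = []
    for rank in range(world_size):
        name = f"shard-rank{rank}.pt"  # hypothetical naming convention
        path = os.path.join(checkpoint_dir, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"missing shard for rank {rank}: {path}")
        shards.append({
            "rank": rank,
            "file": name,
            "size_bytes": os.path.getsize(path),
            "sha256": sha256_file(path),
        })
    return {
        "step": step,
        "world_size": world_size,
        "sharding_strategy": sharding_strategy,
        "state_dict_type": state_dict_type,
        "shards": shards,
    }
```

Raising on a missing shard, rather than recording a partial manifest, keeps an incomplete checkpoint from ever looking complete in the audit trail.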

FSDP sharding strategies and their checkpoint implications

FSDP has three basic sharding strategies (plus hybrid variants such as HYBRID_SHARD, which are not covered here), each with different checkpoint characteristics:

FULL_SHARD (default). Parameters, gradients, and optimizer state are all sharded. Checkpoint with SHARDED_STATE_DICT produces one file per rank. Full model reconstruction requires all shard files and knowledge of the sharding configuration (world size, rank mapping). The sharding configuration must be part of the checkpoint record, because a shard file without its sharding config is not loadable.

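SHARD_GRAD_OP. Gradients and optimizer state are sharded during computation, and parameters are sharded outside computation: they are unsharded before the forward pass and kept unsharded until the backward pass completes. This lowers per-step communication relative to FULL_SHARD at the cost of higher peak memory. A checkpoint taken between steps with SHARDED_STATE_DICT still consists of per-rank shards of both parameters and optimizer state, so the compliance record needs the same shard manifest as FULL_SHARD.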

NO_SHARD. Standard DDP behavior — full parameters and optimizer state on each rank. Equivalent to DDP for checkpoint purposes; compliance record is straightforward.

The most complex compliance documentation challenge is FULL_SHARD with SHARDED_STATE_DICT, which is the configuration used for large model training. This is also the most common configuration for compliance-relevant large fine-tuning jobs (7B+ parameter models), so it's worth addressing specifically.

Rank-0 logging pattern

For distributed training, audit logging needs to happen once per distributed group, not once per rank. If each of your 8 ranks independently calls the Cognify SDK logging functions, you'll produce 8 duplicate (and potentially conflicting) audit records for each training event.

The standard pattern is rank-0 logging: only the process with dist.get_rank() == 0 makes SDK calls. For events that require information from all ranks (like the shard manifest for a distributed checkpoint), rank 0 collects that information from all other ranks before logging.

import os

import torch.distributed as dist
import cognify

def should_log():
    """Only rank 0 logs to Cognify."""
    return not dist.is_initialized() or dist.get_rank() == 0

# Initialize only on rank 0
if should_log():
    cognify.init(
        workspace_id=os.environ["COGNIFY_WORKSPACE"],
        project="large-model-fine-tuning",
        run_tags={"model_size": "7b", "sharding": "full_shard"}
    )

For checkpoint events, rank 0 collects shard information from all ranks using distributed communication before logging the checkpoint record. This is covered in the shard-level hashing section below.

Cognify FSDP integration setup

Cognify provides an FSDP-specific integration in cognify.integrations.pytorch_fsdp. The setup wraps your FSDP model and hooks into the checkpoint save and load paths:

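from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)
from cognify.integrations.pytorch_fsdp import CognifyFSDPWrapper

# base_model, auto_wrap_policy, current_run, and checkpoint_dir are assumed
# to be defined earlier in your training script.

# Shard the base model with FSDP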
model = FSDP(
    base_model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=False),
    auto_wrap_policy=auto_wrap_policy,
)

# Add Cognify tracking layer
cognify_model = CognifyFSDPWrapper(
    model,
    run_ref=current_run,          # Reference to active Cognify run
    checkpoint_base_dir=checkpoint_dir,
    log_on_rank=0,                # Which rank handles logging
    shard_hash_algorithm="sha256" # Hash algorithm for shard files
)

# Use cognify_model in your training loop
# Checkpoint calls go through the wrapper automatically

The CognifyFSDPWrapper intercepts calls to model.state_dict() and torch.save() when triggered via the wrapper's checkpoint interface. On checkpoint, it coordinates across ranks to collect shard paths and initiate parallel hashing.

Shard-level hashing for full checkpoint provenance

When a checkpoint save occurs, the wrapper triggers the following sequence:

1. Each rank saves its shard to a local path, either with torch.save (as in the example below) or with torch.distributed.checkpoint.

2. Each rank computes the SHA-256 hash of its shard file and sends it to rank 0 via dist.gather_object().

3. Rank 0 collects all shard hashes and assembles the shard manifest: a list of shard files with their rank assignments, file sizes, and SHA-256 hashes.

4. Rank 0 computes a manifest hash — SHA-256 of the canonical JSON serialization of the shard manifest — which serves as the logical checkpoint fingerprint.

5. Rank 0 logs the checkpoint record to Cognify, including the manifest hash, the full shard manifest, the FSDP sharding configuration (world size, sharding strategy, state dict type), and the training step number.
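Steps 3 and 4 can be sketched in plain Python. The sketch assumes "canonical JSON" means sorted keys, compact separators, and UTF-8 encoding. The post does not specify Cognify's exact canonical form, so treat `canonical_json` and `assemble_manifest` as hypothetical helpers. The `gathered` argument stands in for the per-rank list that a collective such as dist.gather_object() would deliver to rank 0.

```python
import hashlib
import json


def canonical_json(obj):
    """Serialize with sorted keys and fixed separators so equal data yields equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def manifest_hash(manifest):
    """SHA-256 over the canonical serialization: the logical checkpoint fingerprint."""
    return hashlib.sha256(canonical_json(manifest).encode("utf-8")).hexdigest()


def assemble_manifest(gathered, step, world_size, sharding_strategy, state_dict_type):
    """Rank-0 side of steps 3 and 4: order entries by rank, check coverage, hash."""
    shards = sorted(gathered, key=lambda entry: entry["rank"])
    if [entry["rank"] for entry in shards] != list(range(world_size)):
        raise ValueError("expected exactly one manifest entry per rank")
    manifest = {
        "step": step,
        "world_size": world_size,
        "sharding_strategy": sharding_strategy,
        "state_dict_type": state_dict_type,
        "shards": shards,
    }
    return manifest, manifest_hash(manifest)
```

Sorting entries by rank and sorting keys during serialization means the fingerprint depends only on the manifest's content, not on the order in which ranks reported or on dictionary insertion order.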

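import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from cognify.integrations.pytorch_fsdp import CognifyFSDPWrapper

def save_checkpoint_with_lineage(
    model: CognifyFSDPWrapper,
    step: int,
    output_dir: str
) -> None:
    # Make sure the checkpoint directory exists before any rank writes to it
    os.makedirs(output_dir, exist_ok=True)

    # Each rank saves its own shard under a rank-specific file name
    shard_path = os.path.join(output_dir, f"shard-rank{dist.get_rank()}.pt")
    with FSDP.state_dict_type(model.fsdp_model, StateDictType.SHARDED_STATE_DICT):
        state_dict = model.fsdp_model.state_dict()
        torch.save(state_dict, shard_path)

    # Cognify wrapper handles hash collection and manifest logging
    # This call blocks until rank 0 has completed logging
    model.log_checkpoint(step=step, shard_path=shard_path)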

The manifest hash is what appears as the "checkpoint fingerprint" in the compliance record. A compliance reviewer can verify a specific checkpoint in four steps: retrieve the shard manifest from the compliance record; download the shard files from wherever they're stored; compute the SHA-256 of each shard file and confirm it matches the corresponding manifest entry; then recompute the SHA-256 of the manifest's canonical JSON serialization and confirm it matches the recorded manifest hash. This verification is fully self-contained and doesn't require running any Cognify software.
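A reviewer-side check might look like the following standalone sketch. It assumes the manifest carries `file`, `rank`, and `sha256` fields per shard and that the recorded fingerprint was computed over JSON with sorted keys and compact separators. Both are illustrative assumptions, and the canonicalization must match whatever was used when the record was written.

```python
import hashlib
import json
import os


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(manifest, recorded_manifest_hash, shard_dir):
    """Return a list of problems; an empty list means the checkpoint verifies."""
    problems = []
    for shard in manifest["shards"]:
        path = os.path.join(shard_dir, shard["file"])
        if not os.path.isfile(path):
            problems.append(f"rank {shard['rank']}: file missing")
        elif sha256_file(path) != shard["sha256"]:
            problems.append(f"rank {shard['rank']}: hash mismatch")
    # Recompute the fingerprint over the same canonical form used at record time.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    if hashlib.sha256(canonical.encode("utf-8")).hexdigest() != recorded_manifest_hash:
        problems.append("manifest hash mismatch")
    return problems
```

Returning every problem found, instead of stopping at the first, gives the reviewer a complete picture of which shards are missing or altered in a single pass.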

Logical checkpoint reconstruction for compliance

One challenge for compliance documentation of FSDP checkpoints is explaining what the checkpoint represents to a non-engineering reviewer. A shard manifest with 8 files and 8 hashes is technically complete but not self-explanatory.

Cognify's compliance package for an FSDP run includes a "checkpoint summary" section that translates the technical record into compliance-friendly language: "Checkpoint at training step 5,000 consists of 8 shard files totaling 28.4 GB. The complete model can be reconstructed by loading all 8 shards with the recorded FSDP configuration. The manifest hash serves as a fingerprint for the complete checkpoint — any modification to any shard file will produce a different manifest hash."
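To illustrate how the first sentence of such a summary could be derived from the technical record, here is a small sketch. It is not Cognify's actual template. The manifest field names are hypothetical, and sizes are reported in decimal gigabytes (10^9 bytes).

```python
def checkpoint_summary(manifest):
    """Render a one-sentence, reviewer-friendly description of a shard manifest."""
    shards = manifest["shards"]
    total_gb = sum(shard["size_bytes"] for shard in shards) / 1e9
    return (
        f"Checkpoint at training step {manifest['step']:,} consists of "
        f"{len(shards)} shard files totaling {total_gb:.1f} GB."
    )


example = {
    "step": 5000,
    "shards": [{"rank": r, "size_bytes": 3_550_000_000} for r in range(8)],
}
print(checkpoint_summary(example))
```

Generating the sentence from the manifest, rather than writing it by hand, keeps the plain-language summary consistent with the hashes and sizes a reviewer can check.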

This translation layer is important for compliance packages that will be reviewed by model risk committee members or regulators who understand the concept of a "checkpoint" and its significance for reproducibility, but not the specifics of FSDP shard structures.

Multi-node considerations

For training runs spanning multiple nodes (e.g., 4 nodes with 8 GPUs each for a 32-rank job), the shard manifest will have 32 entries and the hash collection becomes a multi-node distributed operation. The Cognify FSDP wrapper handles this via standard PyTorch distributed communication regardless of the physical node topology — the rank-0 collection pattern works the same whether all ranks are on a single node or spread across many.

One practical issue in multi-node setups is checkpoint storage architecture. If each rank writes its shard to a local filesystem path that's only accessible from its own node, rank 0 cannot read the other nodes' shard files to hash or check them itself. There are two workarounds: shared distributed storage (NFS, Lustre, S3-mounted filesystem) accessible from all ranks, or a two-phase approach where each rank computes its own shard hash before saving (while the shard data is in memory) and sends the hash to rank 0.

The in-memory hash approach avoids filesystem access issues but requires that hashing happen before the shard bytes are written to disk — which means the hash corresponds to the in-memory tensor data, not the serialized file bytes. The compliance record should document which approach was used, since the hash verification procedure differs between the two.
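One variant worth considering, offered here as an alternative rather than as part of the workflow described above, is to serialize the shard into an in-memory buffer, hash those serialized bytes, and then write the same bytes to disk. The pre-write hash then equals the file hash, so a single verification procedure covers both cases. The sketch uses pickle as a stand-in serializer; torch.save also accepts a file-like object. The `hash_source` field is a hypothetical way to record which approach was used.

```python
import hashlib
import io
import pickle


def save_with_prewrite_hash(obj, path):
    """Serialize once into memory, hash those bytes, then write the same bytes."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer)  # stand-in for torch.save(obj, buffer)
    payload = buffer.getvalue()
    digest = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as f:
        f.write(payload)
    # Record how the hash was produced so a reviewer knows how to reproduce it.
    return {"sha256": digest, "hash_source": "serialized_bytes"}
```

The tradeoff is host memory: the serialized shard is held in RAM alongside the tensors until the write completes, which may not be acceptable for very large shards.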

For large-scale fine-tuning jobs, the added instrumentation overhead from shard-level hashing is generally under 3% of checkpoint time for typical shard sizes, which is negligible for training jobs that checkpoint every few hundred steps. The compliance value is significant for any model that will face regulatory documentation requirements at deployment time: a verifiable, complete record of every checkpoint in a distributed training run.