Dataset versioning
Cognify's dataset versioning captures a cryptographic snapshot of every dataset at the moment it is used in training: a content hash plus metadata, not a copy of the data. This section explains how that works and how to configure it for your use case.
How versioning works
When you call cognify.dataset(), Cognify:
- Reads metadata about the dataset (record count, schema, source URI)
- Computes a SHA-256 hash of the content using a chunked Merkle tree
- Creates an immutable version record in your workspace with a timestamp
- Links that version record to the current training run
The dataset bytes themselves are never uploaded to Cognify. Only the hash, metadata, and lineage link are stored.
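As a rough illustration of what such a record might hold, the sketch below bundles the metadata, hash, timestamp, and lineage link into an immutable object. The class name, field names, and helper are hypothetical, chosen for this example only; they are not Cognify's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen stands in for the immutability of a version record
class VersionRecord:
    # All field names below are hypothetical, for illustration only.
    name: str
    record_count: int
    schema: tuple        # e.g. (("id", "int64"), ("text", "string"))
    source_uri: str
    root_hash: str       # hex digest that identifies the content
    run_id: str          # lineage link to the training run
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def make_version_record(name, record_count, schema, source_uri, root_hash, run_id):
    """Bundle metadata, hash, and lineage link; no dataset bytes are stored."""
    return VersionRecord(
        name, record_count, tuple(schema), source_uri, root_hash, run_id
    )
```

Note that the record carries only descriptive fields and a digest, which mirrors the point above that the dataset bytes never leave your environment.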
Automatic vs manual snapshot triggers
Automatic: When using framework integrations (Hugging Face Trainer, PyTorch DataLoader hooks), Cognify intercepts dataset reads and creates a version automatically before training starts.
Manual: Call cognify.dataset() explicitly for datasets loaded outside supported frameworks.
Hash algorithm (SHA-256 + Merkle tree)
Cognify uses a deterministic chunked Merkle tree construction:
- The dataset is split into 64 MB chunks
- Each chunk gets a SHA-256 leaf hash
- The tree is built bottom-up; the root hash is the version identifier
- For tabular data (CSV, Parquet, Arrow), rows are sorted by a stable key before hashing to ensure determinism across distributed reads
This means two datasets with identical content in the same order produce identical hashes, even if loaded from different storage backends. For tabular formats, the sort step also removes any dependence on the order in which rows were read.
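The construction can be sketched in a few lines of Python using the standard library's `hashlib`. This is a minimal illustration, not Cognify's implementation: the handling of odd-sized tree levels (duplicating the last node), the empty-input case, and the JSON row serialization in `tabular_bytes` are all assumptions made for the example.

```python
import hashlib
import json

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, matching the chunk size described above


def leaf_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Return the SHA-256 digest of each fixed-size chunk."""
    if not data:
        # Assumption: empty input yields a single leaf over the empty string.
        return [hashlib.sha256(b"").digest()]
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def merkle_root(leaves: list[bytes]) -> str:
    """Combine digests pairwise, bottom-up, until a single root remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last node when a level has odd length.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def tabular_bytes(rows: list[dict], key: str) -> bytes:
    """Sort rows by a stable key, then serialize them deterministically."""
    ordered = sorted(rows, key=lambda row: row[key])
    # Assumption: one JSON object per line with sorted keys.
    return "\n".join(json.dumps(row, sort_keys=True) for row in ordered).encode("utf-8")
```

Because rows are sorted before serialization, `merkle_root(leaf_hashes(tabular_bytes(rows, "id")))` gives the same result regardless of the order in which the rows arrived.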
Version diffing
When two versions of a named dataset exist in a workspace, Cognify computes a structural diff:
- Record count delta (added/removed)
- Schema changes (column additions/removals, type changes)
- Hash comparison (identical root hashes mean identical content and an empty diff)
Diffs are visible in the dashboard and included in audit packages as a data provenance report section.
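The three comparisons above can be sketched as a function over two version records. The dict layout used here (`root_hash`, `record_count`, and a `schema` mapping of column name to type) is a hypothetical stand-in for whatever Cognify stores internally. Since only counts are compared, the record delta is a net figure.

```python
def diff_versions(old: dict, new: dict) -> dict:
    """Structural diff of two version records (hypothetical dict layout)."""
    old_schema, new_schema = old["schema"], new["schema"]
    shared = old_schema.keys() & new_schema.keys()
    return {
        # Equal root hashes mean the content is identical.
        "identical": old["root_hash"] == new["root_hash"],
        # Net change in record count; positive means records were added.
        "record_delta": new["record_count"] - old["record_count"],
        "added_columns": sorted(new_schema.keys() - old_schema.keys()),
        "removed_columns": sorted(old_schema.keys() - new_schema.keys()),
        # Columns present in both versions whose type changed: (old, new).
        "type_changes": {
            col: (old_schema[col], new_schema[col])
            for col in sorted(shared)
            if old_schema[col] != new_schema[col]
        },
    }
```

Working from metadata alone keeps the diff cheap, at the cost of not identifying which individual records changed.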
Retention and archival policies
Retention can be configured per workspace or per dataset. The table below lists recommended retention periods alongside Cognify's defaults:
| Industry | Recommended retention | Cognify default |
|---|---|---|
| Healthcare AI (FDA guidance) | 7 years | 7 years on Enterprise |
| Financial services (SR 11-7) | 3–5 years | 5 years on Enterprise |
| Insurance | 3 years minimum | 3 years on Growth/Enterprise |
| Standard (Starter/Growth) | — | 2 years |
Version records are never permanently deleted during the retention period. After 90 days of inactivity they are only moved to cold storage.
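To make the interaction between the retention period and the 90-day inactivity window concrete, here is a small sketch that classifies a record by age and last use. The function names and state labels are hypothetical, and the source does not say what happens to a record once retention ends, so the final state is only labeled `past_retention`.

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=90)  # inactivity window stated above


def add_years(moment: datetime, years: int) -> datetime:
    """Shift a datetime by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:
        return moment.replace(year=moment.year + years, day=28)


def storage_state(created_at, last_used_at, retention_years, now=None):
    """Classify a record as 'active', 'archived', or 'past_retention'."""
    now = now or datetime.now(timezone.utc)
    if now >= add_years(created_at, retention_years):
        return "past_retention"
    if now - last_used_at >= ARCHIVE_AFTER:
        return "archived"  # cold storage, but still retained
    return "active"
```

For example, a record created on a plan with a 2-year default and not touched for five months would be classified as `archived`, not deleted.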