Dataset versioning

Cognify's dataset versioning records a cryptographic snapshot (a content hash plus metadata, not the data itself) of every dataset at the moment it is used in training. This section explains how that works and how to configure it for your use case.

How versioning works

When you call cognify.dataset(), Cognify:

  1. Reads metadata about the dataset (record count, schema, source URI)
  2. Computes a SHA-256 hash of the content using a chunked Merkle tree
  3. Creates an immutable version record in your workspace with a timestamp
  4. Links that version record to the current training run

The dataset bytes themselves are never uploaded to Cognify. Only the hash, metadata, and lineage link are stored.
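
To make the stored fields concrete, here is a minimal sketch of what a version record could look like. The class name VersionRecord, its field names, and the example values are hypothetical and are not Cognify's actual schema. The sketch only illustrates that the record carries a hash, metadata, a timestamp, and a lineage link, and no dataset bytes.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class VersionRecord:
    """Hypothetical shape of a version record; not Cognify's real schema."""
    root_hash: str        # Merkle root (hex), used as the version identifier
    record_count: int     # metadata: number of records
    schema: tuple         # metadata: (column, type) pairs
    source_uri: str       # metadata: where the dataset was read from
    created_at: datetime  # timestamp of the snapshot
    run_id: str           # lineage link to the training run


record = VersionRecord(
    root_hash="0" * 64,  # placeholder digest, not a real hash
    record_count=1000,
    schema=(("id", "int64"), ("text", "string")),
    source_uri="s3://example-bucket/train.parquet",  # illustrative URI
    created_at=datetime.now(timezone.utc),
    run_id="run-0001",  # illustrative run identifier
)

try:
    record.record_count = 2000  # mutation is rejected on a frozen dataclass
except FrozenInstanceError:
    print("version record is immutable")
```

A frozen dataclass is used here only as a local stand-in for immutability; how Cognify enforces immutability in the workspace is not described in this section.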

Automatic vs manual snapshot triggers

Automatic: When using framework integrations (Hugging Face Trainer, PyTorch DataLoader hooks), Cognify intercepts dataset reads and creates a version automatically before training starts.

Manual: Call cognify.dataset() explicitly for datasets loaded outside supported frameworks.

Hash algorithm (SHA-256 + Merkle tree)

Cognify uses a deterministic chunked Merkle tree construction:

  • The dataset is split into 64 MB chunks
  • Each chunk gets a SHA-256 leaf hash
  • The tree is built bottom-up; the root hash is the version identifier
  • For tabular data (CSV, Parquet, Arrow), rows are sorted by a stable key before hashing to ensure determinism across distributed reads

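This means two datasets with identical content in identical order produce identical hashes, even if loaded from different storage backends. For tabular data, the sort before hashing also removes any dependence on the order in which rows were read.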
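
The construction above can be sketched with Python's standard hashlib module. This is an illustration under stated assumptions, not Cognify's implementation: the way parent nodes are combined (SHA-256 over the two concatenated child digests), the handling of an unpaired node (promoted unchanged), and the naive comma-joined row serialization are all choices made for this example. The function names merkle_root and tabular_root are hypothetical.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described above


def merkle_root(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the hex Merkle root of `data` split into fixed-size chunks."""
    # Leaf level: one SHA-256 digest per chunk (empty input gets one empty leaf).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    # Build bottom-up. Assumption: a parent is SHA-256 of its two child digests
    # concatenated, and an unpaired last node is promoted unchanged.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                nxt.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                nxt.append(pair[0])
        level = nxt
    return level[0].hex()


def tabular_root(rows, key, chunk_size: int = CHUNK_SIZE) -> str:
    """Sort rows by a stable key, serialize them, then hash the result."""
    ordered = sorted(rows, key=key)
    # Assumption: a naive comma/newline serialization, for illustration only.
    payload = "\n".join(",".join(map(str, r)) for r in ordered).encode("utf-8")
    return merkle_root(payload, chunk_size)


rows_a = [(2, "b"), (1, "a"), (3, "c")]
rows_b = [(3, "c"), (2, "b"), (1, "a")]  # same rows, different read order

# A tiny chunk size forces a multi-level tree on this small input.
same = (tabular_root(rows_a, key=lambda r: r[0], chunk_size=4)
        == tabular_root(rows_b, key=lambda r: r[0], chunk_size=4))
print(same)  # True
```

Because the rows are sorted before serialization, the two read orders yield the same root hash. Without the sort, the serialized bytes and therefore the root would differ.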

Version diffing

When two versions of a named dataset exist in a workspace, Cognify computes a structural diff:

  • Record count delta (added/removed)
  • Schema changes (column additions/removals, type changes)
  • Hash comparison (identical content = zero diff)

Diffs are visible in the dashboard and included in audit packages as a data provenance report section.
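
As a rough sketch of how such a structural diff could be derived from two version records, consider the function below. The dictionary layout (root_hash, record_count, schema) and the name diff_versions are hypothetical. Note that with only metadata available, this sketch reports a net record count delta rather than separate added and removed counts.

```python
def diff_versions(old: dict, new: dict) -> dict:
    """Structural diff between two version records (hypothetical dict layout)."""
    result = {
        "identical": old["root_hash"] == new["root_hash"],
        "record_delta": 0,
        "added_columns": [],
        "removed_columns": [],
        "type_changes": {},
    }
    if result["identical"]:
        return result  # identical hash means identical content: zero diff

    old_schema, new_schema = old["schema"], new["schema"]
    result["record_delta"] = new["record_count"] - old["record_count"]
    result["added_columns"] = sorted(new_schema.keys() - old_schema.keys())
    result["removed_columns"] = sorted(old_schema.keys() - new_schema.keys())
    # Columns present in both versions whose declared type changed.
    result["type_changes"] = {
        col: (old_schema[col], new_schema[col])
        for col in sorted(old_schema.keys() & new_schema.keys())
        if old_schema[col] != new_schema[col]
    }
    return result


v1 = {"root_hash": "aaa", "record_count": 1000,
      "schema": {"id": "int64", "label": "string"}}
v2 = {"root_hash": "bbb", "record_count": 1200,
      "schema": {"id": "int64", "label": "int32", "split": "string"}}

d = diff_versions(v1, v2)
print(d["record_delta"])   # 200
print(d["added_columns"])  # ['split']
print(d["type_changes"])   # {'label': ('string', 'int32')}
```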

Retention and archival policies

Configure retention per workspace or per dataset:

Industry                        Recommended retention   Cognify default
Healthcare AI (FDA guidance)    7 years                 7 years on Enterprise
Financial services (SR 11-7)    3–5 years               5 years on Enterprise
Insurance                       3 years minimum         3 years on Growth/Enterprise
Standard (Starter/Growth)       -                       2 years

Version records are never permanently deleted during the retention period. After 90 days of inactivity they are archived to cold storage.
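
The interaction between the retention period and the 90-day archival threshold can be sketched as a small classification function. This is an illustration only: the function name storage_state, the state labels, and the 365-day approximation of a year are assumptions, and what happens to a record once its retention period ends is not specified in this section.

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=90)  # inactivity threshold from the text


def storage_state(created_at: datetime, last_accessed: datetime,
                  retention_years: int, now: datetime) -> str:
    """Classify a version record as 'hot', 'archived', or 'past_retention'."""
    # Assumption: a year is approximated as 365 days.
    if now - created_at > timedelta(days=365 * retention_years):
        return "past_retention"
    # Within retention, records are never deleted, only moved to cold storage.
    if now - last_accessed >= ARCHIVE_AFTER:
        return "archived"
    return "hot"


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(storage_state(created, now - timedelta(days=10), 2, now))   # hot
print(storage_state(created, now - timedelta(days=120), 2, now))  # archived
```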