TrainingGuard: AI Dataset Protection
Split AI training data across independent custodians via XorIDA threshold secret sharing. No single party can access, poison, or steal the dataset. Immutable provenance tracking for EU AI Act compliance and dataset licensing enforcement. Information-theoretic security for the model training pipeline.
Executive Summary
TrainingGuard protects AI training datasets from theft, poisoning, and unauthorized access by splitting them across independent custodians using XorIDA threshold secret sharing. Every split and reconstruction is recorded in an immutable provenance chain with SHA-256 hashes, actor identities, and timestamps.
Two functions cover the complete workflow: guardDataset() splits training data into N shares (default 3-of-3) with HMAC-SHA256 integrity protection per share and automatic chunking for large datasets. reconstructDataset() rebuilds the dataset from K-of-N shares, verifies the SHA-256 hash against the manifest, and appends a provenance record documenting who reconstructed it and when.
The security guarantee is information-theoretic: breaking it is not merely computationally hard, it is mathematically impossible. Any group holding fewer than K shares learns exactly zero information about the training data, regardless of computing power. This protects against data theft by insiders, cloud provider breaches, and even quantum computers.
Designed for EU AI Act Article 10 compliance (data governance obligations), dataset licensing enforcement (no single licensee can reconstruct alone), and secure multi-party model training where no participant should access the full dataset.
The Problem
AI training datasets are high-value targets for theft, poisoning, and regulatory violations. Traditional protection mechanisms fail against insider threats and breaches.
Dataset theft is undetectable. A single engineer with database access can copy 50,000 labeled images to a USB drive. The dataset is gone, but audit logs show nothing suspicious — just read operations.
Data poisoning is invisible until deployment. An attacker with write access changes labels on 0.1% of samples. The model trains successfully but fails catastrophically on specific inputs.
Cloud storage is centralized trust. S3 buckets, Azure Blob Storage, and Google Cloud Storage all require trusting the cloud provider's access controls, employee background checks, and compliance certifications. One compromised admin account exposes everything.
Encryption-at-rest protects against disk theft, not authorized access. Once the application decrypts the dataset for training, it exists in plaintext memory on a server where anyone with SSH access can dump it.
| Attack Vector | Encryption at Rest | Access Control | Audit Logs | TrainingGuard |
|---|---|---|---|---|
| Insider theft | No | Partial | Detect after | Prevents |
| Data poisoning | No | No | No | SHA-256 detect |
| Cloud provider breach | No | No | No | Prevents |
| Unauthorized reconstruction | No | Partial | Detect after | K-of-N required |
| Provenance loss | No | No | Append-only | Immutable chain |
| Single point of failure | Yes | Yes | Yes | Distributed |
The Old Way: Centralized Storage
The New Way: Threshold-Split Custody
Real-World Use Cases
Six scenarios where TrainingGuard protects AI training pipelines from theft, poisoning, and compliance violations.
- Multiple organizations contribute training data without sharing raw datasets. Each org holds one share. Model training happens on reconstructed batches, then shares are deleted. No org can reconstruct alone. *(3-of-5 multi-party custody)*
- PHI-labeled medical imaging datasets split across hospital, research institution, and compliance officer. HIPAA audit trail via provenance chain. SHA-256 hash proves no data poisoning. *(HIPAA provenance + integrity)*
- Transaction datasets split between bank, third-party trainer, and auditor. Model training requires 2-of-3 consent. Provenance proves which parties accessed which datasets when. *(SOX compliance + non-repudiation)*
- Defense datasets split across classified network, R&D lab, and oversight authority. 3-of-3 required for reconstruction. Air-gapped custodians prevent exfiltration. *(CMMC L5 + zero trust)*
- Commercial training datasets sold with split custody. Buyer gets share 1, seller retains share 2, escrow holds share 3. Buyer cannot use dataset without seller cooperation, enforcing licensing terms. *(License enforcement by design)*
- Open-source model training with community-verified datasets. Dataset split across 5 trusted institutions. Reconstruction requires 3-of-5 consensus. SHA-256 hash published for verification. *(Transparent provenance chain)*

How It Works
Five-step pipeline: validate, hash, chunk, split, manifest. Reconstruction reverses the flow with hash verification.
guardDataset() Pipeline
1. Validate config — Minimum 2 custodians, threshold ≤ total custodians, data not empty.
2. SHA-256 hash — Hash the entire dataset before chunking. This hash goes in the manifest and is verified on reconstruction.
3. Chunk data — Large datasets split into 2MB chunks (configurable). Each chunk processed independently.
4. Per chunk: PKCS#7 pad → XorIDA split → HMAC-SHA256 — Padding ensures chunk size divisibility, XorIDA generates N shares, HMAC signs each share.
5. Build manifest — UUID, metadata, chunk count, dataset hash, config, provenance record (action: 'split', actor, timestamp, hash).
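The five steps above can be sketched in a few lines. This is a minimal illustration using Node's `node:crypto`: `xorSplit`, `pkcs7Pad`, `guardDatasetSketch`, and the manifest fields are simplified stand-ins for the library's internals, and only an N-of-N XOR split is shown (the real XorIDA also supports K-of-N thresholds).

```typescript
import { createHash, createHmac, randomBytes, randomUUID } from 'node:crypto';

// N-of-N XOR split: n-1 random pads, plus the data XORed with every pad.
// (Illustrative stand-in for XorIDA.)
function xorSplit(chunk: Buffer, n: number): Buffer[] {
  const pads = Array.from({ length: n - 1 }, () => randomBytes(chunk.length));
  const last = Buffer.from(chunk);
  for (const p of pads) for (let i = 0; i < last.length; i++) last[i] ^= p[i];
  return [...pads, last];
}

// PKCS#7 pad to a 16-byte boundary (step 4).
function pkcs7Pad(buf: Buffer, block = 16): Buffer {
  const n = block - (buf.length % block);
  return Buffer.concat([buf, Buffer.alloc(n, n)]);
}

function guardDatasetSketch(
  data: Buffer, custodians: number, chunkSize: number, actor: string, hmacKey: Buffer,
) {
  const datasetHash = createHash('sha256').update(data).digest('hex'); // step 2
  const shares: { share: Buffer; hmac: string }[][] = [];
  for (let off = 0; off < data.length; off += chunkSize) {             // step 3
    const chunk = pkcs7Pad(data.subarray(off, off + chunkSize));       // step 4: pad
    shares.push(
      xorSplit(chunk, custodians).map((share) => ({                    // step 4: split
        share,
        hmac: createHmac('sha256', hmacKey).update(share).digest('hex'), // step 4: sign
      })),
    );
  }
  return {
    manifest: {                                                        // step 5
      id: randomUUID(),
      chunkCount: shares.length,
      datasetHash,
      provenance: [{ action: 'split', actor, timestamp: new Date().toISOString(), dataHash: datasetHash }],
    },
    shares,
  };
}
```

XORing all N shares of any chunk recovers the padded chunk; any subset of fewer than N shares is uniformly random.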
reconstructDataset() Pipeline
1. Validate share counts — Each chunk must have ≥ threshold shares. Total chunks must match manifest.
2. Per chunk: XorIDA reconstruct → PKCS#7 unpad — Threshold shares reconstruct the chunk, padding removed.
3. Concatenate chunks — Reassemble all chunks in order, trim to original size from manifest.
4. Verify SHA-256 hash — Hash reconstructed data, compare against manifest. Mismatch = RECONSTRUCT_FAILED error.
5. Append provenance — Add 'reconstruct' provenance record (actor, timestamp, hash). Return data + updated manifest.
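Reconstruction can be sketched the same way (again a simplified N-of-N XOR illustration; `reconstructSketch` is a hypothetical name, and the `RECONSTRUCT_FAILED` code mirrors the error described in step 4):

```typescript
import { createHash } from 'node:crypto';

// Inverse of an N-of-N XOR split: XOR all shares back together (step 2).
function xorReconstruct(shares: Buffer[]): Buffer {
  const out = Buffer.from(shares[0]);
  for (let i = 1; i < shares.length; i++)
    for (let j = 0; j < out.length; j++) out[j] ^= shares[i][j];
  return out;
}

// Strip PKCS#7 padding: the last byte encodes the pad length (step 2).
function pkcs7Unpad(buf: Buffer): Buffer {
  return buf.subarray(0, buf.length - buf[buf.length - 1]);
}

function reconstructSketch(
  chunkShares: Buffer[][], expectedHash: string, originalSize: number, actor: string,
) {
  const chunks = chunkShares.map((s) => pkcs7Unpad(xorReconstruct(s)));
  const data = Buffer.concat(chunks).subarray(0, originalSize);  // step 3
  const hash = createHash('sha256').update(data).digest('hex');  // step 4
  if (hash !== expectedHash) {
    return { ok: false as const, error: { code: 'RECONSTRUCT_FAILED' as const } };
  }
  // step 5: append a 'reconstruct' provenance record
  const record = { action: 'reconstruct', actor, timestamp: new Date().toISOString(), dataHash: hash };
  return { ok: true as const, value: { data, record } };
}
```

Any tampering with a share flips bits in the reconstructed chunk, so the SHA-256 comparison in step 4 rejects the result.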
Large datasets are split into chunks (default 2MB, configurable via chunkSize in TrainingGuardConfig). Each chunk is independently padded, split via XorIDA, and HMAC-signed. This enables parallel distribution of shares and incremental reconstruction. Chunk size trades memory usage against overhead: smaller chunks mean more metadata, larger chunks mean higher RAM during split/reconstruct.
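As a sanity check, the chunk counts reported in the benchmarks section follow from a simple ceiling division (assuming whole-chunk granularity; `chunkCount` is an illustrative helper, not part of the API):

```typescript
// Number of chunks for a dataset of dataSize bytes at a given chunkSize.
const chunkCount = (dataSize: number, chunkSize = 2 * 1024 * 1024): number =>
  Math.ceil(dataSize / chunkSize);

chunkCount(5 * 1024 * 1024);    // 3 chunks for a 5 MB dataset
chunkCount(12.5 * 1024 * 1024); // 7 chunks for a 12.5 MB dataset
```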
Integration Patterns
Three deployment patterns for different trust topologies.
Pattern 1: Multi-Cloud Distribution
```typescript
import { guardDataset, reconstructDataset } from '@private.me/trainingguard';

const metadata = {
  name: 'fraud-detection-v4',
  version: '4.2.0',
  recordCount: 100000,
  format: 'parquet',
  dataSize: data.length,
  source: 'transaction-db-prod',
};

const config = { custodians: 3, threshold: 2, chunkSize: 2 * 1024 * 1024 };
const result = await guardDataset(data, metadata, config, 'ml-engineer@corp.com');

if (result.ok) {
  const { manifest, shares } = result.value; // shares[chunkIndex][custodianIndex]
  await uploadToS3(shares[0][0], 'custodian-1-us-east'); // AWS
  await uploadToAzure(shares[0][1], 'custodian-2-eu');   // Azure
  await uploadToGCS(shares[0][2], 'custodian-3-asia');   // GCP
}
```
Pattern 2: Federated Learning
```typescript
// Hospital A, Research Lab B, Compliance Officer C each hold 1 share
const config = { custodians: 3, threshold: 3 }; // All 3 required
const result = await guardDataset(phiData, metadata, config, 'data-engineer@hospital.org');
// Distribute shares to 3 independent orgs

// Training time: collect shares from all 3 parties
const collectedShares = [
  await fetchFromHospital(),
  await fetchFromLab(),
  await fetchFromCompliance(),
];

const rebuilt = await reconstructDataset(manifest, collectedShares, 'trainer@lab.edu');
if (rebuilt.ok) {
  await trainModel(rebuilt.value.data); // Use reconstructed dataset
  // Provenance now shows: split (hospital) + reconstruct (lab)
}
```
Pattern 3: Dataset Licensing Enforcement
```typescript
// Seller splits dataset 2-of-3: buyer, seller, escrow
const config = { custodians: 3, threshold: 2 }; // Any 2 of 3 can reconstruct
const result = await guardDataset(dataset, metadata, config, 'seller@datasets.ai');

if (result.ok) {
  const { manifest, shares } = result.value;
  await sendToBuyer(shares[0][0], manifest);  // Share 1 to buyer
  await keepInternal(shares[0][1]);           // Share 2 seller retains
  await sendToEscrow(shares[0][2], manifest); // Share 3 to escrow
  // Buyer CANNOT reconstruct alone — needs seller cooperation OR escrow release
  // Seller can revoke by deleting their share (buyer now needs escrow)
}
```
Security
Six layers of defense protecting training data from theft, poisoning, and unauthorized access.
| Layer | Technology | Protects Against |
|---|---|---|
| 1. Information-Theoretic Split | XorIDA over GF(2) | Any K-1 shares reveal zero information. Not computationally hard — mathematically impossible, even with quantum computers. |
| 2. SHA-256 Hash Verification | SHA-256 (FIPS 180-4) | Data poisoning detection. Hash computed before split, verified after reconstruct. Mismatch rejects the dataset. |
| 3. HMAC-SHA256 per Share | HMAC-SHA256 | Share tampering detection. Each share signed at split time. Reconstruction verifies HMACs before combining. |
| 4. Provenance Chain | Immutable append-only log | Unauthorized access detection. Every split/reconstruct/access action recorded with actor, timestamp, hash. |
| 5. PKCS#7 Padding | PKCS#7 (RFC 5652) | Block size alignment. Prevents size-based inference attacks on chunk boundaries. |
| 6. crypto.getRandomValues() | Web Crypto API | Cryptographically secure randomness for XorIDA share generation (random pads). No Math.random(). |
A SHA-256 mismatch after reconstruction returns a RECONSTRUCT_FAILED error. The provenance chain records which custodians participated in reconstruction, creating an audit trail for forensic investigation.
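Layer 3's per-share check can be illustrated with Node's `crypto` module (a sketch; `signShare` and `verifyShare` are hypothetical names, and key distribution between custodians is out of scope):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sign a share at split time (layer 3).
function signShare(share: Buffer, key: Buffer): Buffer {
  return createHmac('sha256', key).update(share).digest();
}

// Verify a share's tag before it is fed into reconstruction.
// timingSafeEqual avoids leaking the comparison result via timing.
function verifyShare(share: Buffer, tag: Buffer, key: Buffer): boolean {
  const expected = signShare(share, key);
  return tag.length === expected.length && timingSafeEqual(tag, expected);
}
```

A flipped bit in either the share or the tag fails verification, so tampered shares are rejected before they can corrupt the reconstructed chunk.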
Threat Model
TrainingGuard assumes:
- Honest-but-curious custodians: Custodians follow the protocol but may attempt to infer information from their shares. XorIDA guarantees they learn nothing.
- Byzantine custodians (up to K-1): Up to K-1 custodians may collude or be compromised. As long as fewer than K custodians are compromised, attackers cannot reconstruct the dataset.
- Network adversaries: Shares transmitted over untrusted networks. HMAC integrity protects against tampering in transit.
- Storage breaches: Cloud provider breach exposes one custodian's shares. Attacker learns zero information (information-theoretic security).
TrainingGuard does NOT protect against:
- Compromise of ≥K custodians: If threshold or more custodians are compromised, the dataset can be reconstructed.
- Side-channel attacks during reconstruction: RAM dumps, timing analysis, or EM radiation during the reconstruction phase may leak information. Use hardware enclaves (SGX/SEV) for high-security scenarios.
- Social engineering: Tricking custodians into releasing shares. Multi-party authorization and provenance logging mitigate this.
Performance Benchmarks
Real-world dataset sizes from small (1MB) to large (100MB). Measured on Node.js 22, median of 100 iterations.
| Dataset Size | Split Time | Reconstruct Time | Total Roundtrip | Chunks |
|---|---|---|---|---|
| 1 MB | 18 ms | 12 ms | 30 ms | 1 |
| 5 MB | 82 ms | 54 ms | 136 ms | 3 |
| 12.5 MB | 198 ms | 128 ms | 326 ms | 7 |
| 25 MB | 385 ms | 248 ms | 633 ms | 13 |
| 50 MB | 742 ms | 486 ms | 1,228 ms | 25 |
| 100 MB | 1,465 ms | 951 ms | 2,416 ms | 50 |
Node.js 22 • 100 iterations • 2-of-3 config • 2MB chunk size • Median time
Honest Limitations
TrainingGuard is not a universal solution. Here are scenarios where it does not help or where alternatives are better.
What TrainingGuard Does NOT Do
| Limitation | Why | Alternative |
|---|---|---|
| Prevent ≥K custodian collusion | If threshold or more custodians collude, they can reconstruct the dataset. This is inherent to threshold schemes. | Increase N and K. Use legally binding contracts. Multi-jurisdiction custody. |
| Protect in-memory data during training | Once reconstructed, the dataset exists in plaintext memory. RAM dumps or debugger attach can extract it. | Hardware enclaves (SGX/SEV). Secure multi-party computation for training. |
| Prevent model extraction attacks | TrainingGuard protects the dataset, not the trained model. Attackers can query the model to extract training data. | Differential privacy during training. Model watermarking. Rate limiting inference. |
| Replace access control | TrainingGuard is a cryptographic layer. It does not authenticate custodians or enforce who can request shares. | Combine with IAM systems, mutual TLS, or hardware tokens for custodian authentication. |
| Guarantee provenance authenticity | Provenance records can be forged if the manifest is not integrity-protected by external signatures. | Sign manifests with custodian private keys (Ed25519). Store signed manifests in append-only ledger. |
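The last mitigation above, signing manifests with Ed25519, can be sketched with Node's built-in `crypto` (the manifest fields shown are illustrative):

```typescript
import { generateKeyPairSync, sign, verify } from 'node:crypto';

// A custodian signs the serialized manifest with an Ed25519 private key;
// anyone holding the public key can detect a forged or altered manifest.
const { publicKey, privateKey } = generateKeyPairSync('ed25519');

const manifestBytes = Buffer.from(
  JSON.stringify({ id: 'example-manifest', datasetHash: 'a3f5' }),
);

// For Ed25519, Node's one-shot sign/verify take null as the algorithm.
const signature = sign(null, manifestBytes, privateKey);
const valid = verify(null, manifestBytes, publicKey, signature); // true for untampered bytes
```

Storing the signature alongside the manifest in an append-only ledger gives provenance records the external integrity protection the table calls for.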
When NOT to Use TrainingGuard
- Small datasets (<1MB): Overhead of chunking and splitting exceeds the security benefit. Use simple AES-256-GCM encryption instead.
- Single-party custody: If all custodians are in the same organization under unified access control, TrainingGuard adds complexity without distributing trust. Use encryption-at-rest.
- Real-time inference pipelines: Reconstruction latency (100ms+ for large datasets) may be unacceptable for latency-sensitive inference. Cache reconstructed datasets in memory.
- Public datasets: If the dataset is already public (e.g., ImageNet, COCO), splitting provides no confidentiality benefit. TrainingGuard is for proprietary or sensitive datasets.
Regulatory Compliance
TrainingGuard supports data governance obligations under EU AI Act, GDPR, HIPAA, and financial regulations.
EU AI Act — Article 10 (Data Governance)
Article 10 requires high-risk AI systems to ensure training datasets are "relevant, representative, free of errors and complete" with documented "data governance and management practices."
| AI Act Requirement | TrainingGuard Feature |
|---|---|
| Data provenance tracking | Immutable provenance chain records split/access/reconstruct actions with actor, timestamp, hash |
| Data integrity verification | SHA-256 hash verification after reconstruction catches poisoning |
| Access history | Provenance log shows who reconstructed the dataset when |
| Multi-party oversight | Threshold reconstruction requires cooperation of ≥K independent custodians |
HIPAA — Security Rule
45 CFR § 164.312(a)(1) requires covered entities to implement technical safeguards to protect ePHI. TrainingGuard provides:
- Access control: Threshold reconstruction prevents unauthorized access by single custodian.
- Audit controls: Provenance chain creates HIPAA-compliant audit logs (who, what, when).
- Integrity controls: SHA-256 hash verification detects unauthorized modifications.
- Transmission security: HMAC-SHA256 per share protects data in transit between custodians.
SOX — Section 404 (Internal Controls)
Financial institutions using AI for fraud detection or credit scoring must document data governance controls. TrainingGuard provenance chain demonstrates:
- Who accessed training data and when (non-repudiation via provenance)
- Data integrity verification (SHA-256 hash prevents silent poisoning)
- Multi-party authorization for dataset reconstruction (separation of duties)
GDPR — Article 32 (Security of Processing)
GDPR requires "appropriate technical and organizational measures" to protect personal data. TrainingGuard provides:
- Pseudonymization: Shares reveal zero information about the dataset (information-theoretic security).
- Confidentiality: Threshold custody prevents single-point breaches from exposing data.
- Integrity: SHA-256 + HMAC prevent unauthorized modification.
- Accountability: Provenance chain demonstrates compliance with data processing obligations.
Complete API Surface
Four core functions cover the workflow: guard, reconstruct, provenance logging, and provenance verification.
- `guardDataset()`: Splits training data via XorIDA across custodians. Chunks large datasets (default 2MB), generates an HMAC per share, creates a provenance record, and returns a manifest with the SHA-256 dataset hash. The actor is recorded in provenance (e.g., 'engineer@corp.com').
- `reconstructDataset()`: Reconstructs the dataset from custodian shares. Verifies that the SHA-256 hash matches the manifest and appends a 'reconstruct' provenance record. Returns the reconstructed data and updated manifest. Each chunk needs ≥ threshold shares.
- Provenance record creation: Creates a provenance entry with the current timestamp. Action describes what happened, actor identifies who (email/DID), dataHash is the SHA-256 hex hash at that point, details is optional context.
- `verifyProvenance()`: Verifies a provenance chain: at least one record, all fields non-empty, valid actions, timestamps in non-decreasing order. Returns true if valid, false otherwise.
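The type shapes below are inferred from the prose in this section; they are assumptions for illustration, not the published typings:

```typescript
// Assumed result, config, and manifest shapes — illustrative only.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type ProvenanceAction = 'split' | 'access' | 'reconstruct' | 'verify';

interface ProvenanceRecord {
  action: ProvenanceAction;
  actor: string;     // email or DID
  timestamp: string; // ISO 8601
  dataHash: string;  // SHA-256 hex
  details?: string;  // optional context
}

interface TrainingGuardConfig {
  custodians: number; // N
  threshold: number;  // K
  chunkSize?: number; // bytes, default 2 MB
}

interface DatasetManifest {
  id: string;          // UUID
  chunkCount: number;
  datasetHash: string; // SHA-256 of the full dataset
  config: TrainingGuardConfig;
  provenance: ProvenanceRecord[];
}

// A sample record conforming to the shapes above.
const record: ProvenanceRecord = {
  action: 'split',
  actor: 'ml-engineer@corp.com',
  timestamp: new Date().toISOString(),
  dataHash: 'a3f5',
};
```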
Provenance Chain
Every split, access, reconstruction, and verification is recorded in an immutable append-only log with actor identity, timestamp, and data hash.
Provenance Actions
| Action | When | Recorded Fields |
|---|---|---|
| `split` | guardDataset() called | actor (who split), timestamp (when), dataHash (SHA-256 of original dataset) |
| `access` | Manual provenance logging | actor (who accessed), timestamp, dataHash (current hash), details (what action) |
| `reconstruct` | reconstructDataset() called | actor (who reconstructed), timestamp, dataHash (SHA-256 after reconstruct, must match manifest) |
| `verify` | Manual integrity check | actor (who verified), timestamp, dataHash, details (verification result) |
Example Provenance Chain
```json
[
  {
    "action": "split",
    "actor": "ml-engineer@hospital.org",
    "timestamp": "2026-04-10T14:32:18.442Z",
    "dataHash": "a3f5...d8e2"
  },
  {
    "action": "reconstruct",
    "actor": "trainer@research-lab.edu",
    "timestamp": "2026-04-11T09:15:42.112Z",
    "dataHash": "a3f5...d8e2",
    "details": "Training run batch-042"
  },
  {
    "action": "verify",
    "actor": "compliance@hospital.org",
    "timestamp": "2026-04-12T11:22:05.881Z",
    "dataHash": "a3f5...d8e2",
    "details": "HIPAA audit verification"
  }
]
```
The verifyProvenance() function enforces chain integrity: all records must have non-empty fields, actions must be valid, and timestamps must be in non-decreasing order. This prevents backdating or out-of-order insertion. For cryptographic non-repudiation, sign each provenance record with the actor's Ed25519 key and store signatures in the manifest.
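The rules above can be mirrored in a short validator (a sketch; `verifyProvenanceSketch` is an illustrative name, and the real verifyProvenance() may enforce more):

```typescript
interface ProvenanceRecord {
  action: string;
  actor: string;
  timestamp: string; // ISO 8601
  dataHash: string;
  details?: string;
}

const VALID_ACTIONS = new Set(['split', 'access', 'reconstruct', 'verify']);

// Non-empty chain, non-empty fields, valid actions, non-decreasing timestamps.
function verifyProvenanceSketch(chain: ProvenanceRecord[]): boolean {
  if (chain.length === 0) return false;
  let prev = -Infinity;
  for (const r of chain) {
    if (!r.actor || !r.timestamp || !r.dataHash || !VALID_ACTIONS.has(r.action)) return false;
    const t = Date.parse(r.timestamp);
    if (Number.isNaN(t) || t < prev) return false; // rejects backdating and reordering
    prev = t;
  }
  return true;
}
```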
Error Handling
All functions return Result<T, TrainingGuardError>. Six error codes cover validation, split, reconstruction, and provenance failures.
| Error Code | When | Common Causes |
|---|---|---|
| INVALID_CONFIG | guardDataset() config validation | Fewer than 2 custodians, threshold > total custodians, threshold < 1, empty data |
| SPLIT_FAILED | XorIDA split operation | Internal crypto error during share generation, memory allocation failure |
| HMAC_FAILED | Share integrity check | Share modified in transit or storage, HMAC key mismatch, corrupted share |
| RECONSTRUCT_FAILED | XorIDA reconstruct or hash verify | SHA-256 hash mismatch (data poisoning), unpadding failed, insufficient entropy |
| INSUFFICIENT_SHARES | reconstructDataset() validation | Fewer than threshold shares per chunk, wrong number of chunk groups, missing custodian |
| PROVENANCE_INVALID | verifyProvenance() check | Missing fields, out-of-order timestamps, invalid action type, empty chain |
Error Handling Pattern
```typescript
const result = await guardDataset(data, metadata, config, actor);

if (!result.ok) {
  switch (result.error.code) {
    case 'INVALID_CONFIG':
      console.error('Config error:', result.error.message);
      break;
    case 'SPLIT_FAILED':
      console.error('Crypto failure:', result.error.message);
      break;
    default:
      console.error('Unknown error:', result.error);
  }
  return;
}

// Success path
const { manifest, shares } = result.value;
```
Deployment Options
SaaS Recommended
Fully managed infrastructure. Call our REST API, we handle scaling, updates, and operations.
- Zero infrastructure setup
- Automatic updates
- 99.9% uptime SLA
- Enterprise SLA available
SDK Integration
Embed directly in your application. Runs in your codebase with full programmatic control.
```
npm install @private.me/trainingguard
```
- TypeScript/JavaScript SDK
- Full source access
- Enterprise support available
On-Premise Upon Request
Enterprise CLI for compliance, air-gap, or data residency requirements.
- Complete data sovereignty
- Air-gap capable deployment
- Custom SLA + dedicated support
- Professional services included
Enterprise On-Premise Deployment
While TrainingGuard is primarily delivered as SaaS or SDK, we build dedicated on-premise infrastructure for customers with:
- Regulatory mandates — HIPAA, SOX, FedRAMP, CMMC requiring self-hosted processing
- Air-gapped environments — SCIF, classified networks, offline operations
- Data residency requirements — EU GDPR, China data laws, government mandates
- Custom integration needs — Embed in proprietary platforms, specialized workflows
Includes: Enterprise CLI, Docker/Kubernetes orchestration, RBAC, audit logging, and dedicated support.