PRIVATE.ME · Technical White Paper

TrainingGuard: AI Dataset Protection

Split AI training data across independent custodians via XorIDA threshold secret sharing. No single party can access, poison, or steal the dataset. Immutable provenance tracking for EU AI Act compliance and dataset licensing enforcement. Information-theoretic security for the model training pipeline.

v0.1.0 · 2MB auto-chunking · SHA-256 integrity · 0 npm deps · EU AI Act ready
Section 01

Executive Summary

TrainingGuard protects AI training datasets from theft, poisoning, and unauthorized access by splitting them across independent custodians using XorIDA threshold secret sharing. Every split and reconstruction is recorded in an immutable provenance chain with SHA-256 hashes, actor identities, and timestamps.

Two functions cover the complete workflow: guardDataset() splits training data into N shares (default 3-of-3) with HMAC-SHA256 integrity protection per share and automatic chunking for large datasets. reconstructDataset() rebuilds the dataset from K-of-N shares, verifies the SHA-256 hash against the manifest, and appends a provenance record documenting who reconstructed it and when.

The security guarantee is information-theoretic: breaking it is not merely computationally hard, it is mathematically impossible. Any coalition of fewer than K custodians learns exactly zero information about the training data, regardless of computing power. This protects against data theft by insiders, cloud provider breaches, and even quantum computers.

Designed for EU AI Act Article 10 compliance (data governance obligations), dataset licensing enforcement (no single licensee can reconstruct alone), and secure multi-party model training where no participant should access the full dataset.

Section 02

The Problem

AI training datasets are high-value targets for theft, poisoning, and regulatory violations. Traditional protection mechanisms fail against insider threats and breaches.

Dataset theft is undetectable. A single engineer with database access can copy 50,000 labeled images to a USB drive. The dataset walks out the door, yet audit logs show nothing suspicious: just read operations.

Data poisoning is invisible until deployment. An attacker with write access changes labels on 0.1% of samples. The model trains successfully but fails catastrophically on specific inputs.

Cloud storage is centralized trust. S3 buckets, Azure Blob Storage, and Google Cloud Storage all require trusting the cloud provider's access controls, employee background checks, and compliance certifications. One compromised admin account exposes everything.

Encryption-at-rest protects against disk theft, not authorized access. Once the application decrypts the dataset for training, it exists in plaintext memory on a server where anyone with SSH access can dump it.

| Attack Vector | Encryption at Rest | Access Control | Audit Logs | TrainingGuard |
|---|---|---|---|---|
| Insider theft | No | Partial | Detect after | Prevents |
| Data poisoning | No | No | No | SHA-256 detect |
| Cloud provider breach | No | No | No | Prevents |
| Unauthorized reconstruction | No | Partial | Detect after | K-of-N required |
| Provenance loss | No | No | Append-only | Immutable chain |
| Single point of failure | Yes | Yes | Yes | Distributed |

The Old Way: Centralized Storage

[Diagram: the training dataset is uploaded to a single cloud store, one point of access. An insider has full dataset access, a cloud admin has backdoor access, and a breach means full theft. All eggs in one basket.]

The New Way: Threshold-Split Custody

[Diagram: a 50K-sample, 12.5MB training dataset is split 2-of-3 via XorIDA across Custodian 1 (AWS us-east-1), Custodian 2 (Azure EU), and Custodian 3 (on-prem vault). Any single share yields zero information (information-theoretic guarantee); reconstruction requires any 2 of 3 custodians, with provenance tracked.]
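The "any single share = zero info" property is easiest to see in the degenerate N-of-N case of XOR sharing: all but one share are uniform random masks, so any incomplete subset is indistinguishable from noise. A minimal sketch, with illustrative names (xorSplit/xorReconstruct are not the TrainingGuard API; XorIDA's K-of-N construction is more involved):

```typescript
import { randomBytes } from "node:crypto";

// N-of-N XOR sharing: shares 1..N-1 are uniformly random; the last share
// is data XOR all the others. Any N-1 shares are independent random noise.
function xorSplit(data: Uint8Array, n: number): Uint8Array[] {
  const shares: Uint8Array[] = [];
  const last = Uint8Array.from(data); // becomes data ⊕ r1 ⊕ … ⊕ r(n-1)
  for (let i = 0; i < n - 1; i++) {
    const r = randomBytes(data.length);
    for (let j = 0; j < data.length; j++) last[j] ^= r[j];
    shares.push(Uint8Array.from(r));
  }
  shares.push(last);
  return shares;
}

// XOR of all N shares cancels every mask, recovering the data.
function xorReconstruct(shares: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(shares[0].length);
  for (const s of shares) for (let j = 0; j < out.length; j++) out[j] ^= s[j];
  return out;
}
```

With fewer than all N shares, at least one independent random mask is missing, so the XOR of what an attacker holds is uniformly distributed no matter what the data is.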
Section 03

Real-World Use Cases

Six scenarios where TrainingGuard protects AI training pipelines from theft, poisoning, and compliance violations.

🧠
AI / ML
Federated Learning

Multiple organizations contribute training data without sharing raw datasets. Each org holds one share. Model training happens on reconstructed batches, then shares are deleted. No org can reconstruct alone.

3-of-5 multi-party custody
🏥
Healthcare
Clinical Data Training

PHI-labeled medical imaging datasets split across hospital, research institution, and compliance officer. HIPAA audit trail via provenance chain. SHA-256 hash proves no data poisoning.

HIPAA provenance + integrity
💹
Financial
Fraud Detection Models

Transaction datasets split between bank, third-party trainer, and auditor. Model training requires 2-of-3 consent. Provenance proves which parties accessed which datasets when.

SOX compliance + non-repudiation
🏛
Government
Classified Training Data

Defense datasets split across classified network, R&D lab, and oversight authority. 3-of-3 required for reconstruction. Air-gapped custodians prevent exfiltration.

CMMC L5 + zero trust
🌐
Platform
Dataset Licensing

Commercial training datasets sold with split custody. Buyer gets share 1, seller retains share 2, escrow holds share 3. Buyer cannot use dataset without seller cooperation, enforcing licensing terms.

License enforcement by design
🛡
Research
Anti-Poisoning Pipeline

Open-source model training with community-verified datasets. Dataset split across 5 trusted institutions. Reconstruction requires 3-of-5 consensus. SHA-256 hash published for verification.

Transparent provenance chain
Section 04

How It Works

Five-step pipeline: validate, hash, chunk, split, manifest. Reconstruction reverses the flow with hash verification.

guardDataset() Pipeline

1. Validate config — Minimum 2 custodians, threshold ≤ total custodians, data not empty.
2. SHA-256 hash — Hash the entire dataset before chunking. This hash goes in the manifest and is verified on reconstruction.
3. Chunk data — Large datasets split into 2MB chunks (configurable). Each chunk processed independently.
4. Per chunk: PKCS#7 pad → XorIDA split → HMAC-SHA256 — Padding ensures chunk size divisibility, XorIDA generates N shares, HMAC signs each share.
5. Build manifest — UUID, metadata, chunk count, dataset hash, config, provenance record (action: 'split', actor, timestamp, hash).
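Step 4's PKCS#7 padding appends P bytes, each with value P, so every chunk divides evenly into blocks. A minimal sketch with illustrative names (pkcs7Pad/pkcs7Unpad are not the library's exported API):

```typescript
// PKCS#7: append P bytes of value P, where P = block - (len mod block).
// When len is already a multiple of block, a full block of padding is added,
// so unpadding is always unambiguous.
function pkcs7Pad(data: Uint8Array, block: number): Uint8Array {
  const p = block - (data.length % block);
  const out = new Uint8Array(data.length + p);
  out.set(data);
  out.fill(p, data.length);
  return out;
}

// The last byte tells us how many padding bytes to strip.
function pkcs7Unpad(data: Uint8Array): Uint8Array {
  const p = data[data.length - 1];
  return data.slice(0, data.length - p);
}
```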

reconstructDataset() Pipeline

1. Validate share counts — Each chunk must have ≥ threshold shares. Total chunks must match manifest.
2. Per chunk: XorIDA reconstruct → PKCS#7 unpad — Threshold shares reconstruct the chunk, padding removed.
3. Concatenate chunks — Reassemble all chunks in order, trim to original size from manifest.
4. Verify SHA-256 hash — Hash reconstructed data, compare against manifest. Mismatch = RECONSTRUCT_FAILED error.
5. Append provenance — Add 'reconstruct' provenance record (actor, timestamp, hash). Return data + updated manifest.
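Step 4 of reconstruction can be sketched with Node's crypto module (sha256Hex and verifyAgainstManifest are illustrative helpers, not the TrainingGuard API):

```typescript
import { createHash } from "node:crypto";

// Hash the reassembled bytes exactly as they were hashed before splitting.
function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compare against the hash recorded in the manifest at split time.
// Any modification, even a single bit flip, produces a different hash.
function verifyAgainstManifest(data: Uint8Array, manifestHash: string): boolean {
  return sha256Hex(data) === manifestHash;
}
```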

[Diagram: guardDataset() flow for a 50K-sample, 12.5 MB dataset: hash, chunk into 2MB blocks, XorIDA 2-of-3 split into shares 1–3, distribute to custodians 1–3, record provenance.]
Automatic Chunking
Large datasets are automatically split into 2MB chunks (configurable via chunkSize in TrainingGuardConfig). Each chunk is independently padded, split via XorIDA, and HMAC-signed. This enables parallel distribution of shares and incremental reconstruction. Chunk size trades memory usage vs. overhead — smaller chunks = more metadata, larger chunks = higher RAM during split/reconstruct.
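The chunk counts quoted in this paper follow directly from the chunk size. A quick sanity check (illustrative snippet, not part of the API):

```typescript
// Chunk count = ceil(bytes / chunkSize), with the 2 MB default.
const DEFAULT_CHUNK = 2 * 1024 * 1024;
const chunkCount = (bytes: number, chunk = DEFAULT_CHUNK) =>
  Math.ceil(bytes / chunk);

const mb = 1024 * 1024;
chunkCount(12.5 * mb); // → 7, matching the 12.5 MB example dataset
chunkCount(100 * mb);  // → 50, matching the benchmark table
```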
Section 05

Integration Patterns

Three deployment patterns for different trust topologies.

Pattern 1: Multi-Cloud Distribution

Multi-cloud custodians
import { guardDataset, reconstructDataset } from '@private.me/trainingguard';

const metadata = {
  name: 'fraud-detection-v4',
  version: '4.2.0',
  recordCount: 100000,
  format: 'parquet',
  dataSize: data.length,
  source: 'transaction-db-prod',
};

const config = { custodians: 3, threshold: 2, chunkSize: 2 * 1024 * 1024 };
const result = await guardDataset(data, metadata, config, 'ml-engineer@corp.com');

if (result.ok) {
  const { manifest, shares } = result.value;
  // shares[chunkIndex][custodianIndex] — each custodian receives
  // its share of every chunk
  for (const chunkShares of shares) {
    await uploadToS3(chunkShares[0], 'custodian-1-us-east');  // AWS
    await uploadToAzure(chunkShares[1], 'custodian-2-eu');    // Azure
    await uploadToGCS(chunkShares[2], 'custodian-3-asia');    // GCP
  }
}

Pattern 2: Federated Learning

Multi-party training
// Hospital A, Research Lab B, Compliance Officer C each hold 1 share
const config = { custodians: 3, threshold: 3 }; // All 3 required

const result = await guardDataset(phiData, metadata, config, 'data-engineer@hospital.org');
if (!result.ok) throw new Error(result.error.message);
const { manifest } = result.value;
// Distribute shares to 3 independent orgs

// Training time: collect shares from all 3 parties
const collectedShares = [
  await fetchFromHospital(),
  await fetchFromLab(),
  await fetchFromCompliance(),
];

const rebuilt = await reconstructDataset(manifest, collectedShares, 'trainer@lab.edu');
if (rebuilt.ok) {
  await trainModel(rebuilt.value.data); // Use reconstructed dataset
  // Provenance now shows: split (hospital) + reconstruct (lab)
}

Pattern 3: Dataset Licensing Enforcement

Commercial dataset with escrow
// Seller splits dataset 2-of-3: buyer, seller, escrow
const config = { custodians: 3, threshold: 2 }; // Any 2 of 3 can reconstruct

const result = await guardDataset(dataset, metadata, config, 'seller@datasets.ai');
if (result.ok) {
  const { manifest, shares } = result.value;
  await sendToBuyer(shares[0][0], manifest);      // Share 1 to buyer
  await keepInternal(shares[0][1]);               // Share 2 seller retains
  await sendToEscrow(shares[0][2], manifest);    // Share 3 to escrow

  // Buyer CANNOT reconstruct alone — needs seller cooperation OR escrow release
  // Seller can revoke by deleting their share (buyer now needs escrow)
}
Section 06

Security

Six layers of defense protecting training data from theft, poisoning, and unauthorized access.

| Layer | Technology | Protects Against |
|---|---|---|
| 1. Information-Theoretic Split | XorIDA over GF(2) | Any K-1 shares reveal zero information. Not computationally hard to break but mathematically impossible, even with quantum computers. |
| 2. SHA-256 Hash Verification | SHA-256 (FIPS 180-4) | Data poisoning. Hash computed before split, verified after reconstruct; a mismatch rejects the dataset. |
| 3. HMAC-SHA256 per Share | HMAC-SHA256 | Share tampering. Each share is signed at split time; reconstruction verifies HMACs before combining. |
| 4. Provenance Chain | Immutable append-only log | Unnoticed unauthorized access. Every split/reconstruct/access action is recorded with actor, timestamp, and hash. |
| 5. PKCS#7 Padding | PKCS#7 (RFC 5652) | Size-based inference attacks on chunk boundaries, via block-size alignment. |
| 6. Secure Randomness | crypto.getRandomValues() (Web Crypto API) | Predictable-randomness attacks. Cryptographically secure randomness for XorIDA share generation. No Math.random(). |
Anti-Poisoning Architecture
No single custodian can modify training data undetected. Reconstruction requires threshold cooperation (e.g., 2-of-3). After reconstruction, SHA-256 hash verification catches any modification — even a single bit flip fails the entire reconstruction with a RECONSTRUCT_FAILED error. The provenance chain records which custodians participated in reconstruction, creating an audit trail for forensic investigation.
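Per-share HMAC signing and verification (layer 3) can be sketched with Node's crypto module. signShare/verifyShare are illustrative names; TrainingGuard's actual HMAC key management is internal to the library:

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Sign a share at split time.
function signShare(share: Uint8Array, key: Buffer): Buffer {
  return createHmac("sha256", key).update(share).digest();
}

// Verify before combining at reconstruct time; use a constant-time
// comparison so the check itself leaks nothing via timing.
function verifyShare(share: Uint8Array, key: Buffer, mac: Buffer): boolean {
  const expected = signShare(share, key);
  return expected.length === mac.length && timingSafeEqual(expected, mac);
}
```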

Threat Model

TrainingGuard assumes:

  • Honest-but-curious custodians: Custodians follow the protocol but may attempt to infer information from their shares. XorIDA guarantees they learn nothing.
  • Byzantine custodians (up to K-1): Up to K-1 custodians may collude or be compromised. As long as attackers hold fewer than K shares, they cannot reconstruct the dataset.
  • Network adversaries: Shares transmitted over untrusted networks. HMAC integrity protects against tampering in transit.
  • Storage breaches: Cloud provider breach exposes one custodian's shares. Attacker learns zero information (information-theoretic security).

TrainingGuard does NOT protect against:

  • Compromise of ≥K custodians: If threshold or more custodians are compromised, the dataset can be reconstructed.
  • Side-channel attacks during reconstruction: RAM dumps, timing analysis, or EM radiation during the reconstruction phase may leak information. Use hardware enclaves (SGX/SEV) for high-security scenarios.
  • Social engineering: Tricking custodians into releasing shares. Multi-party authorization and provenance logging mitigate this.
Section 07

Performance Benchmarks

Real-world dataset sizes from small (1MB) to large (100MB). Measured on Node.js 22, median of 100 iterations.

| Dataset Size | Split Time | Reconstruct Time | Total Roundtrip | Chunks |
|---|---|---|---|---|
| 1 MB | 18 ms | 12 ms | 30 ms | 1 |
| 5 MB | 82 ms | 54 ms | 136 ms | 3 |
| 12.5 MB | 198 ms | 128 ms | 326 ms | 7 |
| 25 MB | 385 ms | 248 ms | 633 ms | 13 |
| 50 MB | 742 ms | 486 ms | 1,228 ms | 25 |
| 100 MB | 1,465 ms | 951 ms | 2,416 ms | 50 |

Node.js 22 • 100 iterations • 2-of-3 config • 2MB chunk size • Median time

  • ~15 ms/MB split throughput
  • ~10 ms/MB reconstruct throughput
  • < 2.5 s roundtrip for a 100 MB dataset
Parallelization Opportunity
Each chunk is processed independently. For datasets with many chunks (e.g., 100MB = 50 chunks), parallelizing across CPU cores or worker threads can reduce wall-clock time significantly. A 50-chunk dataset with 8 parallel workers achieves ~6x speedup (measured: 2.4s serial → 410ms parallel on 8-core machine).
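One simple way to exploit chunk independence is a bounded concurrency pool over the per-chunk work. The sketch below is generic and illustrative (mapPool is not a TrainingGuard API; the per-chunk function is whatever split or reconstruct step you parallelize):

```typescript
// Run fn over items with at most `limit` in flight at once.
// Because chunks are independent, order of completion does not matter;
// results are written back by index to preserve chunk order.
async function mapPool<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++; // safe: JS is single-threaded between awaits
        results[i] = await fn(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

For CPU-bound crypto, the same pattern applies with worker threads instead of promises on one event loop, since hashing and XOR work does not yield to the event loop on its own.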
Section 08

Honest Limitations

TrainingGuard is not a universal solution. Here are scenarios where it does not help or where alternatives are better.

What TrainingGuard Does NOT Do

| Limitation | Why | Alternative |
|---|---|---|
| Prevent ≥K custodian collusion | If threshold or more custodians collude, they can reconstruct the dataset. This is inherent to threshold schemes. | Increase N and K. Use legally binding contracts. Multi-jurisdiction custody. |
| Protect in-memory data during training | Once reconstructed, the dataset exists in plaintext memory. RAM dumps or debugger attach can extract it. | Hardware enclaves (SGX/SEV). Secure multi-party computation for training. |
| Prevent model extraction attacks | TrainingGuard protects the dataset, not the trained model. Attackers can query the model to extract training data. | Differential privacy during training. Model watermarking. Rate limiting inference. |
| Replace access control | TrainingGuard is a cryptographic layer. It does not authenticate custodians or enforce who can request shares. | Combine with IAM systems, mutual TLS, or hardware tokens for custodian authentication. |
| Guarantee provenance authenticity | Provenance records can be forged if the manifest is not integrity-protected by external signatures. | Sign manifests with custodian private keys (Ed25519). Store signed manifests in an append-only ledger. |

When NOT to Use TrainingGuard

  • Small datasets (<1MB): Overhead of chunking and splitting exceeds the security benefit. Use simple AES-256-GCM encryption instead.
  • Single-party custody: If all custodians are in the same organization under unified access control, TrainingGuard adds complexity without distributing trust. Use encryption-at-rest.
  • Real-time inference pipelines: Reconstruction latency (100ms+ for large datasets) may be unacceptable for latency-sensitive inference. Cache reconstructed datasets in memory.
  • Public datasets: If the dataset is already public (e.g., ImageNet, COCO), splitting provides no confidentiality benefit. TrainingGuard is for proprietary or sensitive datasets.
Section 09

Regulatory Compliance

TrainingGuard supports data governance obligations under EU AI Act, GDPR, HIPAA, and financial regulations.

EU AI Act — Article 10 (Data Governance)

Article 10 requires high-risk AI systems to ensure training datasets are "relevant, representative, free of errors and complete" with documented "data governance and management practices."

| AI Act Requirement | TrainingGuard Feature |
|---|---|
| Data provenance tracking | Immutable provenance chain records split/access/reconstruct actions with actor, timestamp, hash |
| Data integrity verification | SHA-256 hash verification after reconstruction catches poisoning |
| Access history | Provenance log shows who reconstructed the dataset and when |
| Multi-party oversight | Threshold reconstruction requires cooperation of ≥K independent custodians |

HIPAA — Security Rule

45 CFR § 164.312(a)(1) requires covered entities to implement technical safeguards to protect ePHI. TrainingGuard provides:

  • Access control: Threshold reconstruction prevents unauthorized access by single custodian.
  • Audit controls: Provenance chain creates HIPAA-compliant audit logs (who, what, when).
  • Integrity controls: SHA-256 hash verification detects unauthorized modifications.
  • Transmission security: HMAC-SHA256 per share protects data in transit between custodians.

SOX — Section 404 (Internal Controls)

Financial institutions using AI for fraud detection or credit scoring must document data governance controls. TrainingGuard provenance chain demonstrates:

  • Who accessed training data and when (non-repudiation via provenance)
  • Data integrity verification (SHA-256 hash prevents silent poisoning)
  • Multi-party authorization for dataset reconstruction (separation of duties)

GDPR — Article 32 (Security of Processing)

GDPR requires "appropriate technical and organizational measures" to protect personal data. TrainingGuard provides:

  • Pseudonymization: Shares reveal zero information about the dataset (information-theoretic security).
  • Confidentiality: Threshold custody prevents single-point breaches from exposing data.
  • Integrity: SHA-256 + HMAC prevent unauthorized modification.
  • Accountability: Provenance chain demonstrates compliance with data processing obligations.
Section 10

Complete API Surface

Three core functions: guard, reconstruct, verify.

guardDataset(data: Uint8Array, metadata: DatasetMetadata, config: TrainingGuardConfig, actor: string): Promise<Result<TrainingGuardResult, TrainingGuardError>>

Split training data via XorIDA across custodians. Chunks large datasets (default 2MB), generates HMAC per share, creates provenance record, and returns a manifest with SHA-256 dataset hash. Actor is recorded in provenance (e.g., 'engineer@corp.com').

reconstructDataset(manifest: TrainingManifest, shares: TrainingDataShare[][], actor: string): Promise<Result<{ data: Uint8Array; manifest: TrainingManifest }, TrainingGuardError>>

Reconstruct dataset from custodian shares. Verifies SHA-256 hash matches manifest. Appends a 'reconstruct' provenance record. Returns reconstructed data and updated manifest. Shares must be ≥ threshold per chunk.

createProvenanceRecord(action: 'split' | 'access' | 'reconstruct' | 'verify', actor: string, dataHash: string, details?: string): ProvenanceRecord

Create a provenance entry with the current timestamp. Action describes what happened, actor identifies who (email/DID), dataHash is the SHA-256 hex hash at that point, details is optional context.

verifyProvenance(records: ProvenanceRecord[]): boolean

Verify a provenance chain: at least one record, all fields non-empty, valid actions, timestamps in non-decreasing order. Returns true if valid, false otherwise.
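The checks verifyProvenance() performs can be sketched as follows. checkProvenance is an illustrative reimplementation, not the exported function; the ProvenanceRecord shape follows the fields documented in this paper:

```typescript
type ProvenanceRecord = {
  action: "split" | "access" | "reconstruct" | "verify";
  actor: string;
  timestamp: string; // ISO 8601
  dataHash: string;
  details?: string;
};

// Documented rules: at least one record, all fields non-empty,
// valid action, timestamps in non-decreasing order.
function checkProvenance(records: ProvenanceRecord[]): boolean {
  if (records.length === 0) return false;
  const actions = new Set(["split", "access", "reconstruct", "verify"]);
  let prev = "";
  for (const r of records) {
    if (!r.actor || !r.timestamp || !r.dataHash) return false;
    if (!actions.has(r.action)) return false;
    if (r.timestamp < prev) return false; // ISO strings sort chronologically
    prev = r.timestamp;
  }
  return true;
}
```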

Section 11

Provenance Chain

Every split, access, reconstruction, and verification is recorded in an immutable append-only log with actor identity, timestamp, and data hash.

Provenance Actions

| Action | When Recorded | Fields |
|---|---|---|
| split | guardDataset() called | actor (who split), timestamp (when), dataHash (SHA-256 of original dataset) |
| access | Manual provenance logging | actor (who accessed), timestamp, dataHash (current hash), details (what action) |
| reconstruct | reconstructDataset() called | actor (who reconstructed), timestamp, dataHash (SHA-256 after reconstruct; must match manifest) |
| verify | Manual integrity check | actor (who verified), timestamp, dataHash, details (verification result) |

Example Provenance Chain

Provenance timeline
[
  {
    "action": "split",
    "actor": "ml-engineer@hospital.org",
    "timestamp": "2026-04-10T14:32:18.442Z",
    "dataHash": "a3f5...d8e2"
  },
  {
    "action": "reconstruct",
    "actor": "trainer@research-lab.edu",
    "timestamp": "2026-04-11T09:15:42.112Z",
    "dataHash": "a3f5...d8e2",
    "details": "Training run batch-042"
  },
  {
    "action": "verify",
    "actor": "compliance@hospital.org",
    "timestamp": "2026-04-12T11:22:05.881Z",
    "dataHash": "a3f5...d8e2",
    "details": "HIPAA audit verification"
  }
]
Integrity Enforcement
The verifyProvenance() function enforces chain integrity: all records must have non-empty fields, actions must be valid, and timestamps must be in non-decreasing order. This prevents backdating or out-of-order insertion. For cryptographic non-repudiation, sign each provenance record with the actor's Ed25519 key and store signatures in the manifest.
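The Ed25519 signing recommended above can be done with Node's built-in crypto. This is a sketch; key distribution and where signatures are stored are up to the integrator:

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Each actor holds an Ed25519 keypair; the public key is shared with auditors.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// Serialize the provenance record deterministically before signing.
const record = JSON.stringify({
  action: "reconstruct",
  actor: "trainer@research-lab.edu",
  timestamp: "2026-04-11T09:15:42.112Z",
  dataHash: "a3f5...d8e2",
});

// Ed25519 signs the message directly; pass null for the digest algorithm.
const signature = sign(null, Buffer.from(record), privateKey);
const ok = verify(null, Buffer.from(record), publicKey, signature);
```

Any later edit to the record (backdated timestamp, swapped actor) invalidates the signature, giving the provenance chain cryptographic non-repudiation.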
Section 12

Error Handling

All functions return Result<T, TrainingGuardError>. Six error codes cover validation, split, reconstruction, and provenance failures.

| Error Code | When | Common Causes |
|---|---|---|
| INVALID_CONFIG | guardDataset() config validation | Fewer than 2 custodians, threshold > total custodians, threshold < 1, empty data |
| SPLIT_FAILED | XorIDA split operation | Internal crypto error during share generation, memory allocation failure |
| HMAC_FAILED | Share integrity check | Share modified in transit or storage, HMAC key mismatch, corrupted share |
| RECONSTRUCT_FAILED | XorIDA reconstruct or hash verify | SHA-256 hash mismatch (data poisoning), unpadding failure, insufficient entropy |
| INSUFFICIENT_SHARES | reconstructDataset() validation | Fewer than threshold shares per chunk, wrong number of chunk groups, missing custodian |
| PROVENANCE_INVALID | verifyProvenance() check | Missing fields, out-of-order timestamps, invalid action type, empty chain |

Error Handling Pattern

Result pattern
const result = await guardDataset(data, metadata, config, actor);

if (!result.ok) {
  switch (result.error.code) {
    case 'INVALID_CONFIG':
      console.error('Config error:', result.error.message);
      break;
    case 'SPLIT_FAILED':
      console.error('Crypto failure:', result.error.message);
      break;
    default:
      console.error('Unknown error:', result.error);
  }
  return;
}

// Success path
const { manifest, shares } = result.value;

TrainingGuard is part of the PRIVATE.ME platform. For support, licensing inquiries, or integration assistance, contact us at contact@private.me.

Copyright © 2024-2026 Standard Clouds, Inc. All rights reserved. Proprietary and confidential.

Deployment Options

📦

SDK Integration

Embed directly in your application. Runs in your codebase with full programmatic control.

  • npm install @private.me/trainingguard
  • TypeScript/JavaScript SDK
  • Full source access
  • Enterprise support available
🏢

On-Premise Upon Request

Enterprise CLI for compliance, air-gap, or data residency requirements.

  • Complete data sovereignty
  • Air-gap capable deployment
  • Custom SLA + dedicated support
  • Professional services included

Enterprise On-Premise Deployment

While TrainingGuard is primarily delivered as SaaS or SDK, we build dedicated on-premise infrastructure for customers with:

  • Regulatory mandates — HIPAA, SOX, FedRAMP, CMMC requiring self-hosted processing
  • Air-gapped environments — SCIF, classified networks, offline operations
  • Data residency requirements — EU GDPR, China data laws, government mandates
  • Custom integration needs — Embed in proprietary platforms, specialized workflows

Includes: Enterprise CLI, Docker/Kubernetes orchestration, RBAC, audit logging, and dedicated support.

Contact sales for assessment and pricing →