TrainingGuard: AI Dataset Protection
Split AI training data across independent custodians via XorIDA threshold secret sharing. No single party can access, poison, or steal the dataset. Immutable provenance tracking for EU AI Act compliance and dataset licensing enforcement. Information-theoretic security for the model training pipeline.
Executive Summary
TrainingGuard protects AI training datasets from theft, poisoning, and unauthorized access by splitting them across independent custodians using XorIDA threshold secret sharing. Every split and reconstruction is recorded in an immutable provenance chain with SHA-256 hashes, actor identities, and timestamps.
Two functions cover the complete workflow: guardDataset() splits training data into N shares (default 3-of-3) with HMAC-SHA256 integrity protection per share and automatic chunking for large datasets. reconstructDataset() rebuilds the dataset from K-of-N shares, verifies the SHA-256 hash against the manifest, and appends a provenance record documenting who reconstructed it and when.
The security guarantee is information-theoretic: breaking it is not merely computationally hard, it is mathematically impossible. Any group holding fewer than K shares learns exactly zero information about the training data, regardless of computing power. This protects against data theft by insiders, cloud provider breaches, and even quantum computers.
Designed for EU AI Act Article 10 compliance (data governance obligations), dataset licensing enforcement (no single licensee can reconstruct alone), and secure multi-party model training where no participant should access the full dataset.
The Problem
AI training datasets are high-value targets for theft, poisoning, and regulatory violations. Traditional protection mechanisms fail against insider threats and breaches.
Dataset theft is undetectable. A single engineer with database access can copy 50,000 labeled images to a USB drive. The dataset is gone, but audit logs show nothing suspicious — just read operations.
Data poisoning is invisible until deployment. An attacker with write access changes labels on 0.1% of samples. The model trains successfully but fails catastrophically on specific inputs.
Cloud storage is centralized trust. S3 buckets, Azure Blob Storage, and Google Cloud Storage all require trusting the cloud provider's access controls, employee background checks, and compliance certifications. One compromised admin account exposes everything.
Encryption-at-rest protects against disk theft, not authorized access. Once the application decrypts the dataset for training, it exists in plaintext memory on a server where anyone with SSH access can dump it.
| Attack Vector | Encryption at Rest | Access Control | Audit Logs | TrainingGuard |
|---|---|---|---|---|
| Insider theft | No | Partial | Detect after | Prevents |
| Data poisoning | No | No | No | SHA-256 detect |
| Cloud provider breach | No | No | No | Prevents |
| Unauthorized reconstruction | No | Partial | Detect after | K-of-N required |
| Provenance loss | No | No | Append-only | Immutable chain |
| Single point of failure | Yes | Yes | Yes | Distributed |
The Old Way: Centralized Storage
The New Way: Threshold-Split Custody
Real-World Use Cases
Six scenarios where TrainingGuard protects AI training pipelines from theft, poisoning, and compliance violations.
- Multiple organizations contribute training data without sharing raw datasets. Each org holds one share. Model training happens on reconstructed batches, then shares are deleted. No org can reconstruct alone. *(3-of-5 multi-party custody)*
- PHI-labeled medical imaging datasets split across hospital, research institution, and compliance officer. HIPAA audit trail via provenance chain. SHA-256 hash proves no data poisoning. *(HIPAA provenance + integrity)*
- Transaction datasets split between bank, third-party trainer, and auditor. Model training requires 2-of-3 consent. Provenance proves which parties accessed which datasets when. *(SOX compliance + non-repudiation)*
- Defense datasets split across classified network, R&D lab, and oversight authority. 3-of-3 required for reconstruction. Air-gapped custodians prevent exfiltration. *(CMMC L5 + zero trust)*
- Commercial training datasets sold with split custody. Buyer gets share 1, seller retains share 2, escrow holds share 3. Buyer cannot use dataset without seller cooperation, enforcing licensing terms. *(License enforcement by design)*
- Open-source model training with community-verified datasets. Dataset split across 5 trusted institutions. Reconstruction requires 3-of-5 consensus. SHA-256 hash published for verification. *(Transparent provenance chain)*

How It Works
Five-step pipeline: validate, hash, chunk, split, manifest. Reconstruction reverses the flow with hash verification.
guardDataset() Pipeline
1. Validate config — Minimum 2 custodians, threshold ≤ total custodians, data not empty.
2. SHA-256 hash — Hash the entire dataset before chunking. This hash goes in the manifest and is verified on reconstruction.
3. Chunk data — Large datasets split into 2MB chunks (configurable). Each chunk processed independently.
4. Per chunk: PKCS#7 pad → XorIDA split → HMAC-SHA256 — Padding ensures chunk size divisibility, XorIDA generates N shares, HMAC signs each share.
5. Build manifest — UUID, metadata, chunk count, dataset hash, config, provenance record (action: 'split', actor, timestamp, hash).
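The five steps above can be sketched in a few lines. This is a minimal illustration using Node's `node:crypto`: `xorSplit`, `pkcs7Pad`, `guardDatasetSketch`, and the manifest fields are simplified stand-ins for the library's internals, and only an N-of-N XOR split is shown (the real XorIDA also supports K-of-N thresholds).

```typescript
import { createHash, createHmac, randomBytes, randomUUID } from 'node:crypto';

// N-of-N XOR split: n-1 random pads, plus the data XORed with every pad.
// (Illustrative stand-in for XorIDA.)
function xorSplit(chunk: Buffer, n: number): Buffer[] {
  const pads = Array.from({ length: n - 1 }, () => randomBytes(chunk.length));
  const last = Buffer.from(chunk);
  for (const p of pads) for (let i = 0; i < last.length; i++) last[i] ^= p[i];
  return [...pads, last];
}

// PKCS#7 pad to a 16-byte boundary (step 4).
function pkcs7Pad(buf: Buffer, block = 16): Buffer {
  const n = block - (buf.length % block);
  return Buffer.concat([buf, Buffer.alloc(n, n)]);
}

function guardDatasetSketch(
  data: Buffer, custodians: number, chunkSize: number, actor: string, hmacKey: Buffer,
) {
  const datasetHash = createHash('sha256').update(data).digest('hex'); // step 2
  const shares: { share: Buffer; hmac: string }[][] = [];
  for (let off = 0; off < data.length; off += chunkSize) {             // step 3
    const chunk = pkcs7Pad(data.subarray(off, off + chunkSize));       // step 4: pad
    shares.push(
      xorSplit(chunk, custodians).map((share) => ({                    // step 4: split
        share,
        hmac: createHmac('sha256', hmacKey).update(share).digest('hex'), // step 4: sign
      })),
    );
  }
  return {
    manifest: {                                                        // step 5
      id: randomUUID(),
      chunkCount: shares.length,
      datasetHash,
      provenance: [{ action: 'split', actor, timestamp: new Date().toISOString(), dataHash: datasetHash }],
    },
    shares,
  };
}
```

XORing all N shares of any chunk recovers the padded chunk; any subset of fewer than N shares is uniformly random.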
reconstructDataset() Pipeline
1. Validate share counts — Each chunk must have ≥ threshold shares. Total chunks must match manifest.
2. Per chunk: XorIDA reconstruct → PKCS#7 unpad — Threshold shares reconstruct the chunk, padding removed.
3. Concatenate chunks — Reassemble all chunks in order, trim to original size from manifest.
4. Verify SHA-256 hash — Hash reconstructed data, compare against manifest. Mismatch = RECONSTRUCT_FAILED error.
5. Append provenance — Add 'reconstruct' provenance record (actor, timestamp, hash). Return data + updated manifest.
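Reconstruction can be sketched the same way (again a simplified N-of-N XOR illustration; `reconstructSketch` is a hypothetical name, and the `RECONSTRUCT_FAILED` code mirrors the error described in step 4):

```typescript
import { createHash } from 'node:crypto';

// Inverse of an N-of-N XOR split: XOR all shares back together (step 2).
function xorReconstruct(shares: Buffer[]): Buffer {
  const out = Buffer.from(shares[0]);
  for (let i = 1; i < shares.length; i++)
    for (let j = 0; j < out.length; j++) out[j] ^= shares[i][j];
  return out;
}

// Strip PKCS#7 padding: the last byte encodes the pad length (step 2).
function pkcs7Unpad(buf: Buffer): Buffer {
  return buf.subarray(0, buf.length - buf[buf.length - 1]);
}

function reconstructSketch(
  chunkShares: Buffer[][], expectedHash: string, originalSize: number, actor: string,
) {
  const chunks = chunkShares.map((s) => pkcs7Unpad(xorReconstruct(s)));
  const data = Buffer.concat(chunks).subarray(0, originalSize);  // step 3
  const hash = createHash('sha256').update(data).digest('hex');  // step 4
  if (hash !== expectedHash) {
    return { ok: false as const, error: { code: 'RECONSTRUCT_FAILED' as const } };
  }
  // step 5: append a 'reconstruct' provenance record
  const record = { action: 'reconstruct', actor, timestamp: new Date().toISOString(), dataHash: hash };
  return { ok: true as const, value: { data, record } };
}
```

Any tampering with a share flips bits in the reconstructed chunk, so the SHA-256 comparison in step 4 rejects the result.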
Large datasets are split into chunks (default 2MB, configurable via chunkSize in TrainingGuardConfig). Each chunk is independently padded, split via XorIDA, and HMAC-signed. This enables parallel distribution of shares and incremental reconstruction. Chunk size trades memory usage against overhead: smaller chunks mean more metadata, larger chunks mean higher RAM during split/reconstruct.
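As a sanity check, the chunk counts reported in the benchmarks section follow from a simple ceiling division (assuming whole-chunk granularity; `chunkCount` is an illustrative helper, not part of the API):

```typescript
// Number of chunks for a dataset of dataSize bytes at a given chunkSize.
const chunkCount = (dataSize: number, chunkSize = 2 * 1024 * 1024): number =>
  Math.ceil(dataSize / chunkSize);

chunkCount(5 * 1024 * 1024);    // 3 chunks for a 5 MB dataset
chunkCount(12.5 * 1024 * 1024); // 7 chunks for a 12.5 MB dataset
```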
Integration Patterns
Three deployment patterns for different trust topologies.
Pattern 1: Multi-Cloud Distribution
```typescript
import { guardDataset, reconstructDataset } from '@private.me/trainingguard';

const metadata = {
  name: 'fraud-detection-v4',
  version: '4.2.0',
  recordCount: 100000,
  format: 'parquet',
  dataSize: data.length,
  source: 'transaction-db-prod',
};

const config = { custodians: 3, threshold: 2, chunkSize: 2 * 1024 * 1024 };
const result = await guardDataset(data, metadata, config, 'ml-engineer@corp.com');

if (result.ok) {
  const { manifest, shares } = result.value; // shares[chunkIndex][custodianIndex]
  await uploadToS3(shares[0][0], 'custodian-1-us-east'); // AWS
  await uploadToAzure(shares[0][1], 'custodian-2-eu');   // Azure
  await uploadToGCS(shares[0][2], 'custodian-3-asia');   // GCP
}
```
Pattern 2: Federated Learning
```typescript
// Hospital A, Research Lab B, Compliance Officer C each hold 1 share
const config = { custodians: 3, threshold: 3 }; // All 3 required
const result = await guardDataset(phiData, metadata, config, 'data-engineer@hospital.org');
// Distribute shares to 3 independent orgs

// Training time: collect shares from all 3 parties
const collectedShares = [
  await fetchFromHospital(),
  await fetchFromLab(),
  await fetchFromCompliance(),
];

const rebuilt = await reconstructDataset(manifest, collectedShares, 'trainer@lab.edu');
if (rebuilt.ok) {
  await trainModel(rebuilt.value.data); // Use reconstructed dataset
  // Provenance now shows: split (hospital) + reconstruct (lab)
}
```
Pattern 3: Dataset Licensing Enforcement
```typescript
// Seller splits dataset 2-of-3: buyer, seller, escrow
const config = { custodians: 3, threshold: 2 }; // Any 2 of 3 can reconstruct
const result = await guardDataset(dataset, metadata, config, 'seller@datasets.ai');

if (result.ok) {
  const { manifest, shares } = result.value;
  await sendToBuyer(shares[0][0], manifest);  // Share 1 to buyer
  await keepInternal(shares[0][1]);           // Share 2 seller retains
  await sendToEscrow(shares[0][2], manifest); // Share 3 to escrow
  // Buyer CANNOT reconstruct alone — needs seller cooperation OR escrow release
  // Seller can revoke by deleting their share (buyer now needs escrow)
}
```
Security
Six layers of defense protecting training data from theft, poisoning, and unauthorized access.
| Layer | Technology | Protects Against |
|---|---|---|
| 1. Information-Theoretic Split | XorIDA over GF(2) | Any K-1 shares reveal zero information. Not computationally hard — mathematically impossible, even with quantum computers. |
| 2. SHA-256 Hash Verification | SHA-256 (FIPS 180-4) | Data poisoning detection. Hash computed before split, verified after reconstruct. Mismatch rejects the dataset. |
| 3. HMAC-SHA256 per Share | HMAC-SHA256 | Share tampering detection. Each share signed at split time. Reconstruction verifies HMACs before combining. |
| 4. Provenance Chain | Immutable append-only log | Unauthorized access detection. Every split/reconstruct/access action recorded with actor, timestamp, hash. |
| 5. PKCS#7 Padding | PKCS#7 (RFC 5652) | Block size alignment. Prevents size-based inference attacks on chunk boundaries. |
| 6. crypto.getRandomValues() | Web Crypto API | Cryptographically secure randomness for XorIDA share generation (random pads). No Math.random(). |
A SHA-256 mismatch after reconstruction returns a RECONSTRUCT_FAILED error. The provenance chain records which custodians participated in reconstruction, creating an audit trail for forensic investigation.
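Layer 3's per-share check can be illustrated with Node's `crypto` module (a sketch; `signShare` and `verifyShare` are hypothetical names, and key distribution between custodians is out of scope):

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sign a share at split time (layer 3).
function signShare(share: Buffer, key: Buffer): Buffer {
  return createHmac('sha256', key).update(share).digest();
}

// Verify a share's tag before it is fed into reconstruction.
// timingSafeEqual avoids leaking the comparison result via timing.
function verifyShare(share: Buffer, tag: Buffer, key: Buffer): boolean {
  const expected = signShare(share, key);
  return tag.length === expected.length && timingSafeEqual(tag, expected);
}
```

A flipped bit in either the share or the tag fails verification, so tampered shares are rejected before they can corrupt the reconstructed chunk.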
Threat Model
TrainingGuard assumes:
- Honest-but-curious custodians: Custodians follow the protocol but may attempt to infer information from their shares. XorIDA guarantees they learn nothing.
- Byzantine custodians (up to K-1): Up to K-1 custodians may collude or be compromised. As long as fewer than K custodians are compromised, attackers cannot reconstruct the dataset.
- Network adversaries: Shares transmitted over untrusted networks. HMAC integrity protects against tampering in transit.
- Storage breaches: Cloud provider breach exposes one custodian's shares. Attacker learns zero information (information-theoretic security).
TrainingGuard does NOT protect against:
- Compromise of ≥K custodians: If threshold or more custodians are compromised, the dataset can be reconstructed.
- Side-channel attacks during reconstruction: RAM dumps, timing analysis, or EM radiation during the reconstruction phase may leak information. Use hardware enclaves (SGX/SEV) for high-security scenarios.
- Social engineering: Tricking custodians into releasing shares. Multi-party authorization and provenance logging mitigate this.
Performance Benchmarks
Real-world dataset sizes from small (1MB) to large (100MB). Measured on Node.js 22, median of 100 iterations.
| Dataset Size | Split Time | Reconstruct Time | Total Roundtrip | Chunks |
|---|---|---|---|---|
| 1 MB | 18 ms | 12 ms | 30 ms | 1 |
| 5 MB | 82 ms | 54 ms | 136 ms | 3 |
| 12.5 MB | 198 ms | 128 ms | 326 ms | 7 |
| 25 MB | 385 ms | 248 ms | 633 ms | 13 |
| 50 MB | 742 ms | 486 ms | 1,228 ms | 25 |
| 100 MB | 1,465 ms | 951 ms | 2,416 ms | 50 |
Node.js 22 • 100 iterations • 2-of-3 config • 2MB chunk size • Median time
Honest Limitations
TrainingGuard is not a universal solution. Here are scenarios where it does not help or where alternatives are better.
What TrainingGuard Does NOT Do
| Limitation | Why | Alternative |
|---|---|---|
| Prevent ≥K custodian collusion | If threshold or more custodians collude, they can reconstruct the dataset. This is inherent to threshold schemes. | Increase N and K. Use legally binding contracts. Multi-jurisdiction custody. |
| Protect in-memory data during training | Once reconstructed, the dataset exists in plaintext memory. RAM dumps or debugger attach can extract it. | Hardware enclaves (SGX/SEV). Secure multi-party computation for training. |
| Prevent model extraction attacks | TrainingGuard protects the dataset, not the trained model. Attackers can query the model to extract training data. | Differential privacy during training. Model watermarking. Rate limiting inference. |
| Replace access control | TrainingGuard is a cryptographic layer. It does not authenticate custodians or enforce who can request shares. | Combine with IAM systems, mutual TLS, or hardware tokens for custodian authentication. |
| Guarantee provenance authenticity | Provenance records can be forged if the manifest is not integrity-protected by external signatures. | Sign manifests with custodian private keys (Ed25519). Store signed manifests in append-only ledger. |
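The last mitigation above, signing manifests with Ed25519, can be sketched with Node's built-in `crypto` (the manifest fields shown are illustrative):

```typescript
import { generateKeyPairSync, sign, verify } from 'node:crypto';

// A custodian signs the serialized manifest with an Ed25519 private key;
// anyone holding the public key can detect a forged or altered manifest.
const { publicKey, privateKey } = generateKeyPairSync('ed25519');

const manifestBytes = Buffer.from(
  JSON.stringify({ id: 'example-manifest', datasetHash: 'a3f5' }),
);

// For Ed25519, Node's one-shot sign/verify take null as the algorithm.
const signature = sign(null, manifestBytes, privateKey);
const valid = verify(null, manifestBytes, publicKey, signature); // true for untampered bytes
```

Storing the signature alongside the manifest in an append-only ledger gives provenance records the external integrity protection the table calls for.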
When NOT to Use TrainingGuard
- Small datasets (<1MB): Overhead of chunking and splitting exceeds the security benefit. Use simple AES-256-GCM encryption instead.
- Single-party custody: If all custodians are in the same organization under unified access control, TrainingGuard adds complexity without distributing trust. Use encryption-at-rest.
- Real-time inference pipelines: Reconstruction latency (100ms+ for large datasets) may be unacceptable for latency-sensitive inference. Cache reconstructed datasets in memory.
- Public datasets: If the dataset is already public (e.g., ImageNet, COCO), splitting provides no confidentiality benefit. TrainingGuard is for proprietary or sensitive datasets.
Regulatory Compliance
TrainingGuard supports data governance obligations under EU AI Act, GDPR, HIPAA, and financial regulations.
EU AI Act — Article 10 (Data Governance)
Article 10 requires high-risk AI systems to ensure training datasets are "relevant, representative, free of errors and complete" with documented "data governance and management practices."
| AI Act Requirement | TrainingGuard Feature |
|---|---|
| Data provenance tracking | Immutable provenance chain records split/access/reconstruct actions with actor, timestamp, hash |
| Data integrity verification | SHA-256 hash verification after reconstruction catches poisoning |
| Access history | Provenance log shows who reconstructed the dataset when |
| Multi-party oversight | Threshold reconstruction requires cooperation of ≥K independent custodians |
HIPAA — Security Rule
45 CFR § 164.312(a)(1) requires covered entities to implement technical safeguards to protect ePHI. TrainingGuard provides:
- Access control: Threshold reconstruction prevents unauthorized access by single custodian.
- Audit controls: Provenance chain creates HIPAA-compliant audit logs (who, what, when).
- Integrity controls: SHA-256 hash verification detects unauthorized modifications.
- Transmission security: HMAC-SHA256 per share protects data in transit between custodians.
SOX — Section 404 (Internal Controls)
Financial institutions using AI for fraud detection or credit scoring must document data governance controls. TrainingGuard provenance chain demonstrates:
- Who accessed training data and when (non-repudiation via provenance)
- Data integrity verification (SHA-256 hash prevents silent poisoning)
- Multi-party authorization for dataset reconstruction (separation of duties)
GDPR — Article 32 (Security of Processing)
GDPR requires "appropriate technical and organizational measures" to protect personal data. TrainingGuard provides:
- Pseudonymization: Shares reveal zero information about the dataset (information-theoretic security).
- Confidentiality: Threshold custody prevents single-point breaches from exposing data.
- Integrity: SHA-256 + HMAC prevent unauthorized modification.
- Accountability: Provenance chain demonstrates compliance with data processing obligations.
Complete API Surface
Four core functions cover the workflow: guard, reconstruct, provenance logging, and provenance verification.
- `guardDataset()`: Splits training data via XorIDA across custodians. Chunks large datasets (default 2MB), generates an HMAC per share, creates a provenance record, and returns a manifest with the SHA-256 dataset hash. The actor is recorded in provenance (e.g., 'engineer@corp.com').
- `reconstructDataset()`: Reconstructs the dataset from custodian shares. Verifies that the SHA-256 hash matches the manifest and appends a 'reconstruct' provenance record. Returns the reconstructed data and updated manifest. Each chunk needs ≥ threshold shares.
- Provenance record creation: Creates a provenance entry with the current timestamp. Action describes what happened, actor identifies who (email/DID), dataHash is the SHA-256 hex hash at that point, details is optional context.
- `verifyProvenance()`: Verifies a provenance chain: at least one record, all fields non-empty, valid actions, timestamps in non-decreasing order. Returns true if valid, false otherwise.
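The type shapes below are inferred from the prose in this section; they are assumptions for illustration, not the published typings:

```typescript
// Assumed result, config, and manifest shapes — illustrative only.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type ProvenanceAction = 'split' | 'access' | 'reconstruct' | 'verify';

interface ProvenanceRecord {
  action: ProvenanceAction;
  actor: string;     // email or DID
  timestamp: string; // ISO 8601
  dataHash: string;  // SHA-256 hex
  details?: string;  // optional context
}

interface TrainingGuardConfig {
  custodians: number; // N
  threshold: number;  // K
  chunkSize?: number; // bytes, default 2 MB
}

interface DatasetManifest {
  id: string;          // UUID
  chunkCount: number;
  datasetHash: string; // SHA-256 of the full dataset
  config: TrainingGuardConfig;
  provenance: ProvenanceRecord[];
}

// A sample record conforming to the shapes above.
const record: ProvenanceRecord = {
  action: 'split',
  actor: 'ml-engineer@corp.com',
  timestamp: new Date().toISOString(),
  dataHash: 'a3f5',
};
```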
Provenance Chain
Every split, access, reconstruction, and verification is recorded in an immutable append-only log with actor identity, timestamp, and data hash.
Provenance Actions
| Action | When | Recorded Fields |
|---|---|---|
| `split` | guardDataset() called | actor (who split), timestamp (when), dataHash (SHA-256 of original dataset) |
| `access` | Manual provenance logging | actor (who accessed), timestamp, dataHash (current hash), details (what action) |
| `reconstruct` | reconstructDataset() called | actor (who reconstructed), timestamp, dataHash (SHA-256 after reconstruct, must match manifest) |
| `verify` | Manual integrity check | actor (who verified), timestamp, dataHash, details (verification result) |
Example Provenance Chain
```json
[
  {
    "action": "split",
    "actor": "ml-engineer@hospital.org",
    "timestamp": "2026-04-10T14:32:18.442Z",
    "dataHash": "a3f5...d8e2"
  },
  {
    "action": "reconstruct",
    "actor": "trainer@research-lab.edu",
    "timestamp": "2026-04-11T09:15:42.112Z",
    "dataHash": "a3f5...d8e2",
    "details": "Training run batch-042"
  },
  {
    "action": "verify",
    "actor": "compliance@hospital.org",
    "timestamp": "2026-04-12T11:22:05.881Z",
    "dataHash": "a3f5...d8e2",
    "details": "HIPAA audit verification"
  }
]
```
The verifyProvenance() function enforces chain integrity: all records must have non-empty fields, actions must be valid, and timestamps must be in non-decreasing order. This prevents backdating or out-of-order insertion. For cryptographic non-repudiation, sign each provenance record with the actor's Ed25519 key and store signatures in the manifest.
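The rules above can be mirrored in a short validator (a sketch; `verifyProvenanceSketch` is an illustrative name, and the real verifyProvenance() may enforce more):

```typescript
interface ProvenanceRecord {
  action: string;
  actor: string;
  timestamp: string; // ISO 8601
  dataHash: string;
  details?: string;
}

const VALID_ACTIONS = new Set(['split', 'access', 'reconstruct', 'verify']);

// Non-empty chain, non-empty fields, valid actions, non-decreasing timestamps.
function verifyProvenanceSketch(chain: ProvenanceRecord[]): boolean {
  if (chain.length === 0) return false;
  let prev = -Infinity;
  for (const r of chain) {
    if (!r.actor || !r.timestamp || !r.dataHash || !VALID_ACTIONS.has(r.action)) return false;
    const t = Date.parse(r.timestamp);
    if (Number.isNaN(t) || t < prev) return false; // rejects backdating and reordering
    prev = t;
  }
  return true;
}
```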
Error Handling
All functions return Result<T, TrainingGuardError>. Six error codes cover validation, split, reconstruction, and provenance failures.
| Error Code | When | Common Causes |
|---|---|---|
| INVALID_CONFIG | guardDataset() config validation | Fewer than 2 custodians, threshold > total custodians, threshold < 1, empty data |
| SPLIT_FAILED | XorIDA split operation | Internal crypto error during share generation, memory allocation failure |
| HMAC_FAILED | Share integrity check | Share modified in transit or storage, HMAC key mismatch, corrupted share |
| RECONSTRUCT_FAILED | XorIDA reconstruct or hash verify | SHA-256 hash mismatch (data poisoning), unpadding failed, insufficient entropy |
| INSUFFICIENT_SHARES | reconstructDataset() validation | Fewer than threshold shares per chunk, wrong number of chunk groups, missing custodian |
| PROVENANCE_INVALID | verifyProvenance() check | Missing fields, out-of-order timestamps, invalid action type, empty chain |
Error Handling Pattern
```typescript
const result = await guardDataset(data, metadata, config, actor);

if (!result.ok) {
  switch (result.error.code) {
    case 'INVALID_CONFIG':
      console.error('Config error:', result.error.message);
      break;
    case 'SPLIT_FAILED':
      console.error('Crypto failure:', result.error.message);
      break;
    default:
      console.error('Unknown error:', result.error);
  }
  return;
}

// Success path
const { manifest, shares } = result.value;
```
Deployment Options
SaaS Recommended
Fully managed infrastructure. Call our REST API, we handle scaling, updates, and operations.
- Zero infrastructure setup
- Automatic updates
- 99.9% uptime SLA
- Enterprise SLA available
SDK Integration
Embed directly in your application. Runs in your codebase with full programmatic control.
```
npm install @private.me/trainingguard
```
- TypeScript/JavaScript SDK
- Full source access
- Enterprise support available
On-Premise Upon Request
Enterprise CLI for compliance, air-gap, or data residency requirements.
- Complete data sovereignty
- Air-gap capable deployment
- Custom SLA + dedicated support
- Professional services included
Enterprise On-Premise Deployment
While TrainingGuard is primarily delivered as SaaS or SDK, we build dedicated on-premise infrastructure for customers with:
- Regulatory mandates — HIPAA, SOX, FedRAMP, CMMC requiring self-hosted processing
- Air-gapped environments — SCIF, classified networks, offline operations
- Data residency requirements — EU GDPR, China data laws, government mandates
- Custom integration needs — Embed in proprietary platforms, specialized workflows
Includes: Enterprise CLI, Docker/Kubernetes orchestration, RBAC, audit logging, and dedicated support.