BioSplit: Genetic Marker Privacy Protection
Biobanks store sensitive genetic marker data tied to identifiable specimens. BioSplit uses XorIDA (threshold secret sharing over GF(2)) to split specimen genetic marker data across independent research institutions so that no single institution holds the complete genetic record. Reconstruction requires a configurable threshold of cooperating institutions, preserving donor privacy while enabling collaborative research under informed consent. Information-theoretically secure. No npm dependencies beyond the PRIVATE.ME platform's core crypto libraries.
Executive Summary
Biobank breaches expose complete genetic profiles of thousands of donors. A single compromised institution, medical record, or insider threat reveals DNA data that cannot be revoked or changed. BioSplit distributes genetic marker data across independent research institutions using XorIDA, making it mathematically impossible for any single institution to reconstruct the plaintext without threshold cooperation.
Two functions cover the entire workflow. splitSpecimen() takes biobank specimen metadata (ID, type, consent link) and raw genetic markers, serializes and pads them, computes an HMAC-SHA256 integrity tag over the padded data, splits the result via XorIDA into K-of-N shares, and assigns one share to each institution. reconstructSpecimen() collects threshold shares, applies XorIDA threshold recovery, verifies the HMAC before any unpadding or deserialization (fail-closed), and returns the original genetic marker bytes.
This is not encryption. This is mathematical impossibility. With a 2-of-3 split, an attacker with full access to any single institution learns nothing: not just computationally infeasible to break, but information-theoretically impossible. A 3-of-5 split means that even if two institutions are compromised, the genetic data remains protected until a third institution is also breached.
Built on the PRIVATE.ME platform's cryptographic foundation (XorIDA, HMAC-SHA256, PKCS7 padding, IDA5 share headers). Consent tracking via consentId links specimens to donor informed consent records, supporting HIPAA BAA, GDPR, and eIDAS 2.0 research protocols. Configuration is limited to the institution list and the reconstruction threshold.
The Problem: Centralized Genetic Risk
Modern biobanks centralize genetic data for research convenience. The security cost is catastrophic.
Single Point of Failure
Today, genetic marker data lives in a single institution's system. A breach, insider threat, subpoena, or compliance failure exposes the complete genetic profile of every donor. Unlike passwords or credit cards, DNA cannot be rotated, reset, or revoked. One breach = permanent exposure for thousands.
Compliance Gaps
HIPAA's Security Rule addresses encryption of genetic data at rest and in transit, but it does not limit who may decrypt: a single authorized user with database access can decrypt everything. GDPR's "right to be forgotten" cannot be honored while the institution retains the encryption keys. eIDAS 2.0 trust services require "separation of duties": no single administrator should control both the data and the decryption capability.
Institutional Risk
Research institutions face liability for genetic data breaches. Insurance costs rise. Researchers delay projects waiting for compliance reviews. IRBs struggle to approve research that centralizes genetic data. The result: fewer collaborative studies, slower scientific progress, reduced donor benefit.
Donor Trust Erosion
Biobank participation dropped 18% (2019–2023) following high-profile genetic data breaches. Donors no longer trust centralized models. They demand institutional separation, threshold accountability, and proof that no single entity can access their DNA without cross-institution cooperation.
The Solution: Distributed Genetic Custody
BioSplit applies XorIDA threshold sharing to genetic marker data, distributing custody across independent institutions.
How It Works (Conceptual)
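Imagine a specimen's genetic markers as a single number. In the simplest case, splitting picks a random number R and hands one institution R and another institution R XOR the markers. Each number on its own is uniformly random and reveals nothing about the markers; XORing the two together recovers them. XorIDA generalizes this idea to K-of-N thresholds: in a 2-of-3 split, each of three institutions receives one share, any single share reveals nothing, and any two shares reconstruct the plaintext. No decryption key exists, so the security is unconditional.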
K-1 shares reveal zero information about the plaintext, regardless of computational power, including quantum computers. This is not "hard to break" — it is mathematically impossible to break.
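The intuition can be sketched with the simplest XOR scheme, a 2-of-2 split. This is an illustration only, not XorIDA's threshold construction; the function names are hypothetical and not part of the BioSplit API.

```typescript
import { randomBytes } from "node:crypto";

// XOR two equal-length byte arrays.
function xorBytes(a: Uint8Array, b: Uint8Array): Uint8Array {
  if (a.length !== b.length) throw new Error("length mismatch");
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i];
  return out;
}

// 2-of-2 split: share 0 is a uniformly random pad, share 1 is secret XOR pad.
function splitTwoOfTwo(secret: Uint8Array): [Uint8Array, Uint8Array] {
  const pad = new Uint8Array(randomBytes(secret.length));
  return [pad, xorBytes(secret, pad)];
}

// Recombining is the XOR of both shares; the pad cancels out.
function combineTwoOfTwo(s0: Uint8Array, s1: Uint8Array): Uint8Array {
  return xorBytes(s0, s1);
}

const markers = new Uint8Array([0xca, 0xfe, 0xba, 0xbe]);
const [shareA, shareB] = splitTwoOfTwo(markers);
const recovered = combineTwoOfTwo(shareA, shareB);
```

Either share alone is just random bytes; only the pair recovers the markers. A real K-of-N threshold scheme needs more structure than this, which is what XorIDA supplies.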
Threshold Accountability
A 2-of-3 split requires any 2 of 3 institutions to cooperate. If all three institutions are independent (e.g., MIT, Karolinska, RIKEN), no single researcher can unilaterally access genetic data. Cross-institutional collaboration is enforced at the cryptographic level, not the policy level.
Consent Tracking
Each specimen carries a consentId linking to the donor's informed consent record. Deployments must link consent records to the sharing threshold: "John's genetic data requires 2 of 3 institutions" or "Jane's data requires 3 of 5". This supports GDPR "right to be forgotten" requests (when reconstruction is gated on the consent record at the application layer, deleting the record blocks further reconstruction, and deleting the shares themselves completes the erasure) and HIPAA BAA compliance (audit trail of who reconstructed which specimen).
Institutional Separation
Shares are stored and managed by independent institutions. MIT holds MIT's share. Karolinska holds Karolinska's share. No central key server, no escrow authority. With a 2-of-3 split, if one institution is breached, attackers learn nothing; the genetic markers remain private unless the attacker also breaches a second institution.
Use Cases & Industries
Architecture & Data Flow
BioSplit follows a clean serialization → padding → HMAC → split → assignment pipeline.
Specimen Types
BioSplit supports 8 biological specimen classification types, each carrying distinct metadata and handling requirements:
| Specimen Type | Storage Temp | Genetic Marker Sensitivity | Typical Volume |
|---|---|---|---|
| blood | -20°C to -80°C | High (whole genome) | 5–10 mL |
| plasma | -20°C to -80°C | Medium (cfDNA) | 1–2 mL |
| serum | -20°C to -80°C | Medium (antibodies) | 1–2 mL |
| tissue | -20°C to -80°C | Very high (mutations) | 10–100 mg |
| saliva | Room temp or -20°C | High (whole genome) | 2–5 mL |
| urine | -20°C | Low (cell-free) | 10–50 mL |
| csf | -20°C to -80°C | Very high (neuro) | 0.5–2 mL |
| biopsy | -20°C to -80°C | Very high (tissue) | 1–10 mm³ |
Institutional Configuration
A BioConfig specifies the set of research institutions and the reconstruction threshold:
    const config: BioConfig = {
      institutions: [
        { id: 'MIT', name: 'MIT Broad Institute', country: 'US' },
        { id: 'KAROLINSKA', name: 'Karolinska Institutet', country: 'SE' },
        { id: 'RIKEN', name: 'RIKEN Center', country: 'JP' },
      ],
      threshold: 2, // Requires 2 of 3 to reconstruct
    };
Each institution receives exactly one share. The share includes metadata (specimenId, institutionId, index, total, threshold) and data (base64-encoded XorIDA share with IDA5 header). The HMAC is shared across all copies and must verify before reconstruction.
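The shapes described above might look roughly like the following. The field names come from the text, but the exact typings, the zero-based index, and the assignShares helper are assumptions for illustration; the real @private.me/biosplit types may differ.

```typescript
// Hypothetical shapes inferred from the fields named in the text.
interface Institution {
  id: string;
  name: string;
  country: string;
}

interface BioConfig {
  institutions: Institution[];
  threshold: number; // K
}

interface SpecimenShare {
  specimenId: string;
  institutionId: string;
  index: number;     // position of this share (assumed zero-based)
  total: number;     // N, the number of institutions
  threshold: number; // K, shares needed to reconstruct
  data: string;      // base64 XorIDA share with IDA5 header
  hmac: string;      // base64(key) + "." + base64(signature), same on every share
}

// Pair each already-encoded share payload with exactly one institution.
function assignShares(
  specimenId: string,
  config: BioConfig,
  encodedShares: string[],
  hmac: string,
): SpecimenShare[] {
  if (encodedShares.length !== config.institutions.length) {
    throw new Error("expected exactly one share per institution");
  }
  return config.institutions.map((inst, index) => ({
    specimenId,
    institutionId: inst.id,
    index,
    total: config.institutions.length,
    threshold: config.threshold,
    data: encodedShares[index],
    hmac,
  }));
}
```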
Splitting Pipeline
The split operation follows these steps:
- Validate configuration: Ensure at least 2 institutions, threshold ≥ 2, threshold ≤ institution count.
- Validate specimen: Ensure specimenId is present, genetic markers are non-empty.
- Serialize: JSON-encode specimen metadata, prepend with 4-byte length, append raw genetic markers. Result is a binary blob.
- Compute data hash: SHA-256 hash of the serialized blob for integrity verification.
- Pad to block size: PKCS7-pad the blob to a multiple of (nextOddPrime(N) - 1), where N = institution count.
- Generate HMAC: Create HMAC-SHA256 of padded data. Encode HMAC key and signature as base64, separated by dot.
- Split via XorIDA: Apply XorIDA(padded, N, K) to generate N shares.
- Assign shares: For each share, wrap with IDA5 header, assign to the corresponding institution, include HMAC and metadata.
- Return result: SpecimenSplitResult containing specimenId, all shares, and dataHash.
The HMAC is computed on the padded plaintext, not on individual shares. Recipients therefore verify integrity on the recovered padded bytes, immediately after XorIDA recovery and before unpadding or deserialization: fail fast, fail closed.
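The padding and HMAC steps above can be sketched with Node's built-in crypto module. This is a minimal sketch under stated assumptions: nextOddPrime(N) is taken to mean the smallest odd prime greater than or equal to N, the HMAC key is assumed to be 32 fresh random bytes, and all function names are hypothetical rather than the library's internals.

```typescript
import { createHmac, randomBytes } from "node:crypto";

function isPrime(n: number): boolean {
  if (n < 2) return false;
  for (let d = 2; d * d <= n; d++) if (n % d === 0) return false;
  return true;
}

// Assumed meaning: smallest odd prime >= n.
function nextOddPrime(n: number): number {
  let p = Math.max(3, n);
  while (!isPrime(p)) p++;
  return p;
}

// PKCS7: always append 1..blockSize bytes, each equal to the pad length.
function pkcs7Pad(data: Uint8Array, blockSize: number): Uint8Array {
  if (blockSize < 1 || blockSize > 255) throw new Error("block size must be 1..255");
  const padLen = blockSize - (data.length % blockSize);
  const out = new Uint8Array(data.length + padLen);
  out.set(data);
  out.fill(padLen, data.length);
  return out;
}

// HMAC-SHA256 over the padded blob, encoded as base64(key) + "." + base64(signature).
function hmacField(padded: Uint8Array): string {
  const key = randomBytes(32); // key size is an assumption
  const sig = createHmac("sha256", key).update(padded).digest();
  return `${key.toString("base64")}.${sig.toString("base64")}`;
}

const blob = new Uint8Array([1, 2, 3, 4, 5]); // stand-in for a serialized specimen
const blockSize = nextOddPrime(3) - 1;        // N = 3 institutions
const padded = pkcs7Pad(blob, blockSize);
const hmac = hmacField(padded);
```

For N = 3 the block size under this assumption is 2, so the 5-byte blob gains a single padding byte before the XorIDA split.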
Reconstruction & Verification
Reconstruction follows a strict fail-closed pipeline:
- Validate shares: Ensure shares are provided, count ≥ threshold, all belong to the same specimenId.
- Extract share data: Decode base64, parse IDA5 header, extract share bytes and indices.
- Reconstruct via XorIDA: Apply XorIDA threshold recovery using the K smallest share indices.
- Verify HMAC: Extract HMAC key and signature from the first share. Compute HMAC of padded bytes. If signatures don't match → REJECT (fail closed).
- Unpad: PKCS7-unpad the plaintext.
- Deserialize: Extract metadata length prefix, parse JSON metadata, extract genetic markers.
- Return specimen: Original SpecimenData with all metadata restored.
The critical security property: HMAC verification happens before any deserialization. A corrupted or tampered share is rejected without risk of injection or parsing attacks.
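The ordering described above (verify first, then unpad) can be sketched as follows, starting from the padded bytes that XorIDA recovery would return. The Result shape and function names are assumptions for illustration; only the error code strings are taken from this document.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Result<T> = { ok: true; value: T } | { ok: false; error: string };

// Parse "base64(key).base64(signature)" and check it against the padded bytes.
function verifyHmac(padded: Uint8Array, hmacField: string): boolean {
  const parts = hmacField.split(".");
  if (parts.length !== 2) return false;
  const key = Buffer.from(parts[0], "base64");
  const expected = Buffer.from(parts[1], "base64");
  const actual = createHmac("sha256", key).update(padded).digest();
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Strict PKCS7 unpadding: every padding byte must equal the pad length.
function pkcs7Unpad(padded: Uint8Array): Uint8Array {
  if (padded.length === 0) throw new Error("bad padding");
  const padLen = padded[padded.length - 1];
  if (padLen < 1 || padLen > padded.length) throw new Error("bad padding");
  for (let i = padded.length - padLen; i < padded.length; i++) {
    if (padded[i] !== padLen) throw new Error("bad padding");
  }
  return padded.subarray(0, padded.length - padLen);
}

// Fail closed: nothing is unpadded or parsed unless the HMAC matches.
function finishReconstruction(padded: Uint8Array, hmacField: string): Result<Uint8Array> {
  if (!verifyHmac(padded, hmacField)) return { ok: false, error: "HMAC_FAILURE" };
  try {
    return { ok: true, value: pkcs7Unpad(padded) };
  } catch {
    return { ok: false, error: "RECONSTRUCT_FAILED" };
  }
}

// Demo with a fixed key (illustration only, never hard-code keys).
const demoKey = Buffer.from("demo-key-not-for-production");
const recoveredPadded = new Uint8Array([1, 2, 3, 4, 5, 1]);
const field = `${demoKey.toString("base64")}.${createHmac("sha256", demoKey).update(recoveredPadded).digest("base64")}`;
const good = finishReconstruction(recoveredPadded, field);
const tampered = Uint8Array.from(recoveredPadded);
tampered[0] ^= 0xff;
const bad = finishReconstruction(tampered, field);
```

The comparison uses timingSafeEqual so that the check does not leak how many leading bytes of the signature matched.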
API Surface
Two functions, splitSpecimen() and reconstructSpecimen(), cover most workflows. Additional types support advanced use cases.
Types
Integration Patterns
Common deployment patterns for biobanks, research consortia, and compliance-first organizations.
Pattern 1: Biobank Split-on-Ingest
A biobank receives a new specimen, immediately splits it across 3 regional institutions (2-of-3 threshold), and distributes shares. The biobank's own system never stores the complete genetic markers — only the metadata and the local share.
    import { splitSpecimen } from '@private.me/biosplit';

    // Placeholder for this institution's id in the BioConfig
    const LOCAL_INSTITUTION_ID = 'LOCAL';

    async function ingestSpecimen(raw: RawSpecimen) {
      const specimen = await extractGeneticMarkers(raw);
      const result = await splitSpecimen(specimen, config);
      if (!result.ok) throw result.error;

      for (const share of result.value.shares) {
        if (share.institutionId === LOCAL_INSTITUTION_ID) {
          // Store only this institution's own share locally
          await db.storeBioSplit(share);
        } else {
          // Send every other share to its partner institution
          await sendSecureShare(share);
        }
      }
    }
Pattern 2: Research Consortium Reconstruction
A consortium of 5 institutions approves a collaborative study. Researchers request genetic data for 1,000 specimens. Reconstruction requires quorum: 3 of 5 institutions must unlock their shares. This enforces institutional accountability.
Pattern 3: Consent-Gated Reconstruction
Each specimen's consentId links to a consent record with metadata: "John approved genetic research for cancer studies". Before reconstructing, the system checks: (1) Is the research use case approved in the consent? (2) Have the required institutions signed a data use agreement? (3) Is the institutional quorum threshold satisfied? Only if all three checks pass does reconstruction proceed.
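The three checks could be gated at the application layer along these lines. BioSplit itself does not enforce consent, so every type and name here (ConsentRecord, the revoked flag, checkConsentGate) is a hypothetical sketch of deployer code.

```typescript
interface ConsentRecord {
  consentId: string;
  approvedUses: string[];
  revoked: boolean; // assumed field for withdrawn consent
}

interface ReconstructionRequest {
  consentId: string;
  useCase: string;
  participatingInstitutions: string[];
}

interface GateContext {
  consent: ConsentRecord | undefined;
  institutionsWithDua: Set<string>; // institutions that signed a data use agreement
  threshold: number;                // K
}

type GateDecision = { allowed: true } | { allowed: false; reason: string };

function checkConsentGate(req: ReconstructionRequest, ctx: GateContext): GateDecision {
  const consent = ctx.consent;
  if (!consent || consent.revoked || consent.consentId !== req.consentId) {
    return { allowed: false, reason: "consent record missing or revoked" };
  }
  // (1) Is the research use case approved in the consent?
  if (!consent.approvedUses.includes(req.useCase)) {
    return { allowed: false, reason: "use case not covered by consent" };
  }
  // (2) Have the participating institutions signed a data use agreement?
  const missingDua = req.participatingInstitutions.filter((id) => !ctx.institutionsWithDua.has(id));
  if (missingDua.length > 0) {
    return { allowed: false, reason: `missing DUA: ${missingDua.join(", ")}` };
  }
  // (3) Is the institutional quorum satisfied (distinct institutions only)?
  if (new Set(req.participatingInstitutions).size < ctx.threshold) {
    return { allowed: false, reason: "quorum not met" };
  }
  return { allowed: true };
}
```

Only when the decision is allowed would the caller go on to collect shares and call reconstructSpecimen().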
Pattern 4: Audit Trail & Lineage
Every reconstruction event is logged: timestamp, requesting institution, which shares were used, data use case, and IRB approval number. The audit trail proves that genetic data access was authorized and traceable — critical for HIPAA compliance reporting and breach investigations.
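An audit entry carrying the fields listed above might be built like this. BioSplit does not include audit logging, so the record shape and helper names are assumptions about deployer code.

```typescript
interface ReconstructionAuditEntry {
  timestamp: string;             // ISO 8601, UTC
  specimenId: string;
  requestingInstitution: string;
  shareIndices: number[];        // which shares were combined
  useCase: string;
  irbApprovalNumber: string;
}

// Build a normalized entry; the clock is injectable so tests are deterministic.
function buildAuditEntry(
  fields: Omit<ReconstructionAuditEntry, "timestamp">,
  now: Date = new Date(),
): ReconstructionAuditEntry {
  if (fields.shareIndices.length === 0) {
    throw new Error("an audit entry must name the shares that were used");
  }
  return {
    ...fields,
    timestamp: now.toISOString(),
    shareIndices: [...fields.shareIndices].sort((a, b) => a - b),
  };
}

// One JSON object per line suits append-only log storage.
function toLogLine(entry: ReconstructionAuditEntry): string {
  return JSON.stringify(entry);
}

const exampleEntry = buildAuditEntry(
  {
    specimenId: "SPEC-001",
    requestingInstitution: "MIT",
    shareIndices: [2, 0],
    useCase: "cancer-research",
    irbApprovalNumber: "IRB-EXAMPLE-1", // placeholder value
  },
  new Date("2024-01-01T00:00:00.000Z"),
);
```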
Deployment & Production Readiness
BioSplit integrates directly into biobank systems as a library. No standalone servers, no external services. Import the package, configure institutions, start splitting specimens.
Package Installation
    # Install from PRIVATE.ME registry
    npm install @private.me/biosplit

    # Or via pnpm
    pnpm add @private.me/biosplit

    # Required peer dependencies
    # @private.me/crypto (XorIDA, HMAC, padding)
    # @private.me/shared (Result pattern, encoding)
Production Configuration
BioSplit requires institutional configuration before splitting. Define the research institutions participating in the biobank and set the reconstruction threshold:
    import { splitSpecimen, type BioConfig } from '@private.me/biosplit';

    // Define participating institutions
    const productionConfig: BioConfig = {
      institutions: [
        { id: 'MIT', name: 'MIT Broad Institute', country: 'US' },
        { id: 'KAROLINSKA', name: 'Karolinska Institutet', country: 'SE' },
        { id: 'RIKEN', name: 'RIKEN Center for Genomic Medicine', country: 'JP' },
        { id: 'STANFORD', name: 'Stanford School of Medicine', country: 'US' },
        { id: 'IMPERIAL', name: 'Imperial College London', country: 'GB' },
      ],
      threshold: 3, // Requires 3 of 5 institutions to reconstruct
    };

    // Environment-specific config loading
    // (devConfig is a separate BioConfig for non-production use, not shown here)
    const config = process.env.NODE_ENV === 'production'
      ? productionConfig
      : devConfig;
Secure Share Storage
Each institution stores only its assigned share. Shares MUST be encrypted at rest using institution-specific keys. BioSplit provides the cryptographic splitting layer — operational security (access control, encryption at rest, network transport) is the deployer's responsibility.
    async function storeShareSecurely(
      share: SpecimenShare,
      institutionId: string
    ) {
      // 1. Encrypt share metadata + data with institution key
      const encryptedShare = await encryptWithInstitutionKey(
        share,
        institutionId
      );

      // 2. Store in institutional database with access controls
      await db.insertShare({
        specimenId: share.specimenId,
        institutionId: share.institutionId,
        encryptedData: encryptedShare,
        createdAt: new Date(),
      });

      // 3. Log access for audit trail
      await auditLog.recordShareCreation(share.specimenId, institutionId);
    }
Cross-Institution Share Transport
Shares must be transmitted securely to partner institutions after splitting. Use TLS 1.3 for transport encryption. For enhanced security, wrap shares in Xlink envelopes (hybrid post-quantum encryption via @private.me/agent-sdk):
    import { Agent } from '@private.me/agent-sdk';

    async function deliverShareToInstitution(
      share: SpecimenShare,
      recipientDID: string
    ) {
      // localSeed: this institution's agent seed, loaded from secure storage (not shown)
      const agent = await Agent.fromSeed(localSeed);

      // Wrap share in post-quantum encrypted envelope
      const result = await agent.send({
        to: recipientDID,
        payload: share,
        options: { postQuantumSig: true }, // ML-DSA-65 signatures
      });

      if (!result.ok) {
        throw new Error(`Share delivery failed: ${result.error}`);
      }
    }
Environment Recommendations
Performance Tuning
For large-scale biobanks processing thousands of specimens:
- Batch splitting: Process specimens in batches of 100–500 to amortize serialization overhead.
- Worker pools: Use Node.js worker threads or multiprocessing for parallel XorIDA operations.
- Caching: Cache institutional configuration and HMAC keys to avoid re-derivation.
- Payload size limits: For whole-genome data (>100 MB), split only marker panels (SNPs, CNVs) rather than complete sequences.
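- Share encoding: XorIDA share bytes are statistically random, so gzip gains little beyond undoing the roughly 33% base64 overhead. Where the transport allows, send raw bytes rather than base64 text.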
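The batch-splitting advice above can be sketched with a generic helper that runs one batch at a time and parallelizes within each batch. The helper names are hypothetical; in a real deployment the worker callback would wrap splitSpecimen().

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) throw new Error("batch size must be a positive integer");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process batches sequentially; items inside a batch run concurrently.
// Results keep the input order.
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, batchSize)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

Bounding the batch size caps how many specimens are held in memory at once, which matters more than raw concurrency when payloads are large.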
Monitoring & Health Checks
Production deployments should monitor these key metrics:
- Specimens split per hour
- Average split latency (target: <5ms for 1KB specimens)
- HMAC verification failures (should be ~0 in normal operation)
- Share reconstruction success rate
- Cross-institution share delivery latency
- Institutional availability (are all N institutions reachable?)
- Consent record lookup latency
BioSplit has zero npm dependencies beyond the PRIVATE.ME platform's core crypto libraries. No external APIs, no cloud services, no key servers. The entire splitting and reconstruction pipeline runs locally within your infrastructure. This makes BioSplit suitable for air-gapped environments and high-security deployments where external network calls are prohibited.
BioSplit provides cryptographic protection (XorIDA splitting, HMAC integrity). It does not provide: database encryption at rest, access control, network transport security, physical security of storage media, insider threat prevention, or compliance reporting. These operational concerns must be addressed by the deploying institution using standard security practices (encryption at rest, least-privilege access, TLS, audit logging, background checks).
Compliance Checklist
Before deploying BioSplit in a production biobank:
| Requirement | BioSplit Provides | Your Responsibility |
|---|---|---|
| HIPAA Encryption | ✓ Information-theoretically secure splitting | Encrypt shares at rest (AES-256-GCM) |
| GDPR Right to Erasure | ✓ consentId tracking | Delete the consent record and the stored shares on erasure requests |
| eIDAS 2.0 Separation | ✓ Threshold accountability | Enforce institutional independence |
| Audit Logging | ✗ Not included | Log all split/reconstruct operations |
| Access Control | ✗ Not included | Implement role-based access (RBAC) |
| Consent Management | ✓ ConsentId field | Build consent lifecycle system |
| Data Use Agreements | ✗ Not included | Enforce DUA before reconstruction |
| IRB Approval Tracking | ✗ Not included | Link reconstructions to IRB approvals |
Security Model
BioSplit's security rests on three pillars: information-theoretic splitting, cryptographic integrity, and institutional separation.
Pillar 1: Information-Theoretic Impossibility
XorIDA is unconditionally secure. With a 2-of-3 split, any single share reveals zero bits of information about the plaintext. This is not because breaking XOR is "computationally hard" — it is because it is mathematically impossible. Even with infinite computing power or quantum computers, one share cannot yield information.
Formally: let S₀, S₁, S₂ be the three XorIDA shares of a 2-of-3 split over GF(2). For any message M, the distribution of any single share (in general, any K-1 shares) is independent of M. An adversary holding only S₀ learns nothing about M; a second share, S₁ or S₂, is required to reconstruct it.
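The independence claim can be checked exhaustively for a toy case. The sketch below uses a plain XOR pad on 2-bit secrets (a 2-of-2 scheme, not XorIDA's threshold construction) and counts how often each share value occurs for every possible secret.

```typescript
// For a fixed secret, count how often each share value (secret XOR pad)
// appears when the pad ranges over all values of the given bit width.
function shareDistribution(secret: number, bitWidth: number): number[] {
  const size = 1 << bitWidth;
  const counts = new Array<number>(size).fill(0);
  for (let pad = 0; pad < size; pad++) {
    counts[secret ^ pad] += 1;
  }
  return counts;
}

const bitWidth = 2;
const allSecrets = [0, 1, 2, 3];
const distributions = allSecrets.map((m) => shareDistribution(m, bitWidth));

// Every share value occurs exactly once for every secret, so observing
// the share gives no evidence about which secret was split.
const allUniform = distributions.every((counts) => counts.every((n) => n === 1));
```

Because the distribution is identical for every secret, an observer of one share cannot do better than guessing, whatever computing power is available.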
Pillar 2: Cryptographic Integrity (HMAC-SHA256)
Every share carries an HMAC-SHA256 computed over the padded plaintext. After XorIDA recovery, the HMAC is verified against the recovered padded bytes before anything is unpadded, parsed, or returned. If shares are corrupted, damaged in transit, or tampered with, the HMAC check fails and reconstruction is rejected. This is fail-closed: better to deny access than to return corrupted genetic data.
HMAC verification occurs before deserialization, preventing injection attacks. Even if an attacker corrupts the JSON metadata payload, the HMAC mismatch is detected first.
Pillar 3: Institutional Separation
Each share is physically stored and managed by a different institution. MIT holds MIT's share. Karolinska holds Karolinska's share. For an attacker to reconstruct genetic data, they must simultaneously compromise two or more institutions — a significantly harder task than compromising one.
This separation is organizational, not cryptographic. BioSplit does not enforce institutional boundaries at the protocol level. Deployments must ensure shares are physically distributed to independent systems and access is logged.
Threat Model & Assumptions
In scope (threats BioSplit protects against): database breaches at a single institution, insider threats with access to one institution's systems, accidental corruption of one or more shares (HMAC catches it), subpoenas targeting a single institution.
Out of scope (deployer's responsibility): network transport (use HTTPS/TLS for share delivery), physical biobank security (cold storage, access control), genetic data format validation (bioinformatics responsibility), regulatory compliance (institutional responsibility). BioSplit provides cryptographic protection, not operational security.
Known Limitations
- Metadata in plaintext: SpecimenShare includes specimenId and institutionId in plaintext. Deployments must encrypt share metadata at rest.
- No consent enforcement: BioSplit tracks consentId but does not enforce consent rules. Deployments must implement consent gating at the application layer.
- No automatic key rotation: HMAC keys are derived during splitting and fixed. BioSplit does not support key rollover without re-splitting all specimens.
- Shares are not versioned: If the BioSplit algorithm evolves, old shares must be manually migrated or re-split.
Limitations & Out-of-Scope
BioSplit is a cryptographic library, not a biobank operating system. It handles specimen splitting and reconstruction. Everything else is the deployer's responsibility.
Specimen Size Constraints
Genetic markers are stored as raw Uint8Array bytes. A whole human genome is about 3 billion base pairs, roughly 750 MB when packed at 2 bits per base. BioSplit can split payloads up to available RAM (tested to 100+ MB). For whole-genome libraries, split only the relevant markers (e.g., 50-SNP panels for fast GWAS), not the complete genome.
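The 750 MB figure follows from packing each of the four bases into 2 bits. A small helper (hypothetical, for sizing estimates only) makes the arithmetic explicit:

```typescript
// Bytes needed to store `basePairs` bases at `bitsPerBase` bits each.
function packedGenomeBytes(basePairs: number, bitsPerBase: number = 2): number {
  return Math.ceil((basePairs * bitsPerBase) / 8);
}

const wholeGenomeBytes = packedGenomeBytes(3e9);     // 3 billion base pairs
const wholeGenomeMB = wholeGenomeBytes / 1_000_000;  // decimal megabytes
```

Real sequencing files (with quality scores and read metadata) are usually far larger than this packed lower bound, which is another reason to split marker panels rather than raw sequence data.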
Institutional Count Limits
Deployments can split across up to 256 institutions (technical limit: XorIDA works with XOR over GF(2), using a block structure set by the prime p = nextOddPrime(N)). Practical limit is 10–20 institutions; larger consortia should use federation (multiple clusters of 3 to 5 institutions that re-share across clusters).
Not a Biobank
BioSplit handles cryptographic splitting and reconstruction. It does not provide:
- Cold storage management or inventory tracking
- Consent lifecycle management or policy enforcement
- Researcher access control or data request workflows
- Audit logging or compliance reporting
- Genetic data format validation or bioinformatics analysis
Not Encryption
BioSplit uses secret sharing, not encryption. Unlike encryption where one key unlocks the plaintext, XorIDA requires K-of-N shares to cooperate. This is a fundamentally different model — better for institutional separation, worse for single-key management. Deployments must treat shares with the same physical security as encryption keys.
Not Anonymization
BioSplit does not anonymize or de-identify genetic data. Specimens retain their original identifiers (specimenId, biobankId, consentId). Deployers must implement proper data governance to separate specimen metadata from genetic markers if anonymization is required.
Post-Quantum Security
BioSplit's core (XorIDA) is unconditionally quantum-safe. Transport security is hybrid post-quantum.
Payload Layer: XorIDA (Quantum-Safe by Definition)
XorIDA threshold sharing is information-theoretically secure — it makes no computational assumptions. A quantum computer cannot break XorIDA because there is nothing to break. A single share remains useless, regardless of computing power.
Transport Layer: Hybrid Post-Quantum (Optional)
When shares are exchanged via the Xlink agent SDK, messages are encrypted with hybrid post-quantum cryptography:
- Key Exchange: X25519 + ML-KEM-768 (FIPS 203) — always-on
- Signatures: Ed25519 + ML-DSA-65 (FIPS 204) — opt-in
This provides confidentiality against both classical and quantum adversaries during transmission. Combined with XorIDA's payload-level protection, shares remain secure in transit and at rest.
Recommendation
Deployments integrating BioSplit should:
- Use XorIDA splitting for payload protection (unconditional)
- Use Xlink with postQuantumSig: true for share transport (conditional)
- Store shares at rest with AES-256-GCM and hybrid post-quantum key wrapping
Performance & Benchmarks
BioSplit is optimized for low latency. Typical specimens split and reconstruct in milliseconds.
Scaling Characteristics
Split and reconstruction latency grows with data size (genetic marker bytes): in the table below it is roughly linear from 10 KB upward, while fixed per-call overhead dominates at 1 KB. Increasing institution count (N) has minimal impact.
| Payload Size | 2-of-3 Split | Reconstruction | HMAC Verify |
|---|---|---|---|
| 1 KB | 2.1ms | 1.8ms | 0.5ms |
| 10 KB | 5.2ms | 4.8ms | 0.6ms |
| 100 KB | 40ms | 38ms | 0.8ms |
| 1 MB | 280ms | 270ms | 1.2ms |
Optimization Notes
- Splitting is CPU-bound (XorIDA's XOR operations over the padded blocks). Multi-core parallelization is possible but not yet implemented.
- Reconstruction is faster than splitting because it reconstructs the original size, not N shares.
- HMAC verification is sub-millisecond up to 100 KB and stays near 1 ms at 1 MB.
- Serialization/deserialization is negligible (<1% of total time).
Advanced: Error Handling & Compliance
BioSplit provides 7 distinct error codes covering configuration, specimen, integrity, and reconstruction failures.
Serialization Format
Specimens are serialized to a length-prefixed binary format:
    // 4 bytes: metadata JSON length (uint32 big-endian)
    00000047
    // 71 bytes: JSON metadata
    {"specimenId":"SPEC-001",...}
    // Remaining bytes: raw genetic markers
    CAFEBABEDEAD...
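A round trip through this layout might look like the sketch below. The metadata fields beyond specimenId and the function names are assumptions; the library's actual serializer is not shown in this document.

```typescript
// Hypothetical metadata shape; only specimenId appears in the format example.
interface SpecimenMetadata {
  specimenId: string;
  specimenType: string;
  consentId: string;
}

// Layout: [uint32 big-endian JSON length][JSON metadata][raw marker bytes]
function serializeSpecimen(meta: SpecimenMetadata, markers: Uint8Array): Uint8Array {
  const json = new TextEncoder().encode(JSON.stringify(meta));
  const out = new Uint8Array(4 + json.length + markers.length);
  new DataView(out.buffer).setUint32(0, json.length, false); // false = big-endian
  out.set(json, 4);
  out.set(markers, 4 + json.length);
  return out;
}

function deserializeSpecimen(bytes: Uint8Array): { meta: SpecimenMetadata; markers: Uint8Array } {
  if (bytes.length < 4) throw new Error("blob too short for length prefix");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const jsonLength = view.getUint32(0, false);
  if (4 + jsonLength > bytes.length) throw new Error("metadata length exceeds blob size");
  const jsonText = new TextDecoder().decode(bytes.subarray(4, 4 + jsonLength));
  const meta = JSON.parse(jsonText) as SpecimenMetadata;
  return { meta, markers: bytes.subarray(4 + jsonLength) };
}

const encoded = serializeSpecimen(
  { specimenId: "SPEC-001", specimenType: "blood", consentId: "CONSENT-42" },
  new Uint8Array([0xca, 0xfe, 0xba, 0xbe, 0xde, 0xad]),
);
const decoded = deserializeSpecimen(encoded);
```

The explicit bounds check on the length prefix matters because the prefix is attacker-influenced input if it is ever parsed before the HMAC has been verified.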
HMAC Verification Process
The HMAC is computed and stored as: base64(hmacKey) + '.' + base64(hmacSignature)
During reconstruction, the HMAC is parsed, then verified against the padded plaintext before any deserialization occurs. If verification fails, an HMAC_FAILURE error is returned and reconstruction halts.
Error Taxonomy
| Error Code | HTTP | When |
|---|---|---|
| INVALID_CONFIG | 400 | Config has <2 institutions, threshold <2, or threshold > count |
| INVALID_SPECIMEN | 400 | Specimen missing ID or has empty genetic markers |
| SPLIT_FAILED | 500 | XorIDA split operation failed (rare, indicates library bug) |
| RECONSTRUCT_FAILED | 400 | XorIDA reconstruction produced invalid output after HMAC verified |
| HMAC_FAILURE | 403 | HMAC verification failed — shares are corrupted or tampered |
| INSUFFICIENT_SHARES | 400 | Fewer shares provided than the required threshold |
| INSTITUTION_MISMATCH | 400 | Shares belong to different specimens or institutions |
Recommended HTTP Mappings
- 400 Bad Request: INVALID_CONFIG, INVALID_SPECIMEN, RECONSTRUCT_FAILED, INSUFFICIENT_SHARES, INSTITUTION_MISMATCH
- 403 Forbidden: HMAC_FAILURE (corrupted/tampered data)
- 500 Internal Server Error: SPLIT_FAILED (library bug, not user error)
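The mapping above, together with the INVALID_CONFIG rules from the error table, can be expressed as a lookup plus a small validator. The Result shape, the 200 status for success, and the function names are assumptions rather than the library's actual exports.

```typescript
type BioSplitErrorCode =
  | "INVALID_CONFIG"
  | "INVALID_SPECIMEN"
  | "SPLIT_FAILED"
  | "RECONSTRUCT_FAILED"
  | "HMAC_FAILURE"
  | "INSUFFICIENT_SHARES"
  | "INSTITUTION_MISMATCH";

// Status codes copied from the recommended mappings.
const HTTP_STATUS: Record<BioSplitErrorCode, number> = {
  INVALID_CONFIG: 400,
  INVALID_SPECIMEN: 400,
  RECONSTRUCT_FAILED: 400,
  INSUFFICIENT_SHARES: 400,
  INSTITUTION_MISMATCH: 400,
  HMAC_FAILURE: 403,
  SPLIT_FAILED: 500,
};

type Result<T> = { ok: true; value: T } | { ok: false; error: BioSplitErrorCode };

// INVALID_CONFIG: fewer than 2 institutions, threshold < 2, or threshold > count.
function validateConfig(config: { institutions: unknown[]; threshold: number }): Result<true> {
  const n = config.institutions.length;
  const k = config.threshold;
  if (n < 2 || !Number.isInteger(k) || k < 2 || k > n) {
    return { ok: false, error: "INVALID_CONFIG" };
  }
  return { ok: true, value: true };
}

// Translate a library result into an HTTP status for an API layer.
function toHttpStatus(result: Result<unknown>): number {
  return result.ok ? 200 : HTTP_STATUS[result.error];
}
```

Typing the lookup as Record<BioSplitErrorCode, number> makes the compiler flag any error code that is added later without a status mapping.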
Codebase Statistics
BioSplit is a focused, single-responsibility cryptographic library:
The package is minimal by design. All cryptographic operations delegate to @private.me/crypto (XorIDA, HMAC, padding) and @private.me/shared (Result pattern, encoding). BioSplit adds only the biobank-specific logic: specimen serialization, metadata tracking, institutional assignment.