Redact: PII Redaction Engine
Four-layer pipeline that progressively strips personally identifiable information from text before it reaches any external service. L1 regex with checksum validation. L2 schema-aware field matching. L3 named entity recognition. L4 optional LLM for ambiguous cases. Deterministic placeholders enable reinjection after the LLM responds — the user never sees redacted output. macro-F1 0.94 across 40+ entity types.
Enterprise toolkit shipped:
@private.me/redact-cli — self-hosted redaction server on port 3200, 3-role RBAC, batch processing, pattern management, Docker deployment. Part of the Enterprise CLI Suite — 21 self-hosted servers, Docker-ready, air-gapped capable.
The Problem: PII Leaks in the AI Pipeline
Every LLM API call sends the full prompt to remote servers. If the prompt contains a patient name, social security number, credit card, or email address, that PII is now stored on third-party infrastructure. This violates HIPAA, GDPR, CCPA, and internal data governance policies.
Organizations face an impossible choice: use AI and leak PII, or protect PII and forfeit AI. Existing redaction tools force this tradeoff by offering detection without reinjection — the LLM response arrives with placeholder tokens the end user cannot understand. The sanitized text cannot be restored to its original form without manual effort.
Regex-only tools miss context-dependent entities: names in non-standard formats, international phone numbers, PII embedded in JSON payloads. Cloud-based redaction services like AWS Comprehend and Google DLP solve some detection problems but create new ones — the PII must leave the network to be detected, which is the exact problem you are trying to solve.
Competitive Failure Table
| Tool | PII Types | Reinjection | Confidence Scores | Local Processing |
|---|---|---|---|---|
| Presidio (Microsoft) | 50+ | No | Partial | Yes |
| AWS Comprehend | ~30 | No | Yes | No (cloud) |
| Google DLP | 120+ | No | Partial | No (cloud) |
| spaCy NER | ~18 | No | No | Yes |
| Redact (PRIVATE.ME) | 40+ | Yes | Yes | Yes |
No existing tool combines local processing, numeric confidence scoring, broad entity coverage, and deterministic reinjection. Redact is the first pipeline designed for the LLM-era workflow: strip, send, reinject.
The Solution: Strip, Send, Reinject
Redact is a four-layer pipeline that progressively detects PII with increasing sophistication. Entities are replaced with deterministic placeholders. After the LLM responds, entities are reinjected to restore the original context — transparently to the end user.
Layer 1 — Regex: High-precision patterns for structured PII. SSN with area/group checksum validation. Credit cards with Luhn algorithm verification. Email addresses, 11 international phone formats, dates, IP addresses. Zero false positives on well-structured data. Sub-millisecond latency.
Layer 2 — Schema: Context-aware matching using JSON field names, CSV column headers, and document structure. Detects PII that regex misses — a field named "patient_ssn" containing a value without dashes, or a "dob" column with non-standard date formats.
Layer 3 — NER: Named entity recognition via compromise.js for person names, organizations, locations, and titles. Lazy-loaded on first invocation (~200ms), cached for subsequent calls. Catches context-dependent entities that neither regex nor schema patterns can detect.
Layer 4 — LLM (optional): Ollama-based local LLM for ambiguous cases. Activated only when confidence scores from L1-L3 fall below configurable thresholds. Entirely optional — L1-L3 cover 95%+ of PII patterns without any external model dependency.
Architecture: Four-Layer Pipeline
Four detection layers execute in sequence. Each layer catches progressively harder entities. Confidence scores aggregate across layers. Deterministic entity maps enable bidirectional replacement.
Deterministic placeholders: Same entity always maps to the same placeholder.
[PERSON_1] is always the first detected person name. Consistent across calls for the same input.
Bidirectional entity map: The entity map is returned alongside the redacted text. Pass both to the reinject function after the LLM responds. The user never sees placeholders.
Configurable thresholds: Per-entity-type confidence thresholds. High precision for SSNs (0.99), lower threshold for names (0.7). Tune per use case.
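The placeholder and reinjection behavior described above can be sketched in a few lines. This is an illustrative model of the documented [PERSON_1]/[SSN_1] convention, not the library's internals; `assignPlaceholders` and its plain string replacement are assumptions for the sketch.

```typescript
// Illustrative sketch of deterministic placeholder assignment and
// reinjection (not the shipped implementation).
type EntityMap = Map<string, string>; // placeholder -> original value

function assignPlaceholders(
  text: string,
  detections: { type: string; value: string }[]
): { text: string; entities: EntityMap } {
  const counters = new Map<string, number>();
  const seen = new Map<string, string>(); // same value -> same placeholder
  const entities: EntityMap = new Map();
  let out = text;
  for (const d of detections) {
    let placeholder = seen.get(d.value);
    if (!placeholder) {
      const n = (counters.get(d.type) ?? 0) + 1;
      counters.set(d.type, n);
      placeholder = `[${d.type}_${n}]`;
      seen.set(d.value, placeholder);
      entities.set(placeholder, d.value);
    }
    out = out.split(d.value).join(placeholder);
  }
  return { text: out, entities };
}

function reinject(text: string, entities: EntityMap): string {
  let out = text;
  for (const [placeholder, value] of entities) {
    out = out.split(placeholder).join(value);
  }
  return out;
}
```

Because placeholder assignment is a pure function of detection order, redacting the same input twice yields the same map, which is what makes reinjection after the LLM round-trip reliable.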
Deep Dive: Layer Analysis
Each layer is purpose-built for a specific class of PII detection. Together they cover 40+ entity types from structured identifiers to context-dependent names.
L1: Regex Engine
L1 handles structured, well-defined PII patterns with deterministic detection. Each pattern includes validation beyond simple matching.
SSN detection: Matches XXX-XX-XXXX format with area number validation (no 000, 666, or 900-999 area groups), group number validation (no 00), and serial number validation (no 0000). This eliminates false positives from random 9-digit sequences that happen to contain dashes.
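The area/group/serial rules above amount to a small validator. A minimal sketch, assuming the documented rules are the complete check; the shipped L1 pattern may differ in detail:

```typescript
// Validate an SSN candidate against the rules described above:
// no 000/666/900-999 area, no 00 group, no 0000 serial.
function isValidSSN(candidate: string): boolean {
  const m = /^(\d{3})-(\d{2})-(\d{4})$/.exec(candidate);
  if (!m) return false;
  const area = Number(m[1]);
  if (area === 0 || area === 666 || area >= 900) return false;
  if (m[2] === '00') return false;   // group number must not be 00
  if (m[3] === '0000') return false; // serial must not be 0000
  return true;
}
```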
Credit card detection: Matches 13-19 digit sequences with Luhn checksum verification. Covers Visa (4xxx), Mastercard (5xxx/2xxx), Amex (34xx/37xx), Discover (6011/65xx), and Diners Club. The Luhn check reduces false positives by 99%+ compared to digit-only matching.
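The Luhn check that filters credit-card candidates is a standard checksum; a sketch of how such a filter could look (the actual L1 implementation may differ):

```typescript
// Luhn checksum: from the right, double every second digit,
// subtract 9 from doubles over 9, and require sum % 10 === 0.
function luhnValid(candidate: string): boolean {
  const s = candidate.replace(/[\s-]/g, '');
  if (!/^\d{13,19}$/.test(s)) return false;
  let sum = 0;
  let double = false;
  for (let i = s.length - 1; i >= 0; i--) {
    let d = s.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}
```

A random 16-digit string passes this check only one time in ten, which is where the claimed false-positive reduction over digit-only matching comes from.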
Phone number detection: Eleven international formats: US (+1, parenthetical, dash, dot), UK (+44), Germany (+49), France (+33), Japan (+81), China (+86), India (+91), Brazil (+55), Australia (+61), Mexico (+52), South Korea (+82). Each format includes country-specific digit grouping validation.
Other L1 patterns: Email addresses (RFC 5322 subset), IPv4 and IPv6 addresses, dates (ISO 8601, US, EU formats), US zip codes (5-digit and ZIP+4), passport numbers (US format), IBAN (2-letter country + check digits + BBAN).
L1 confidence: 0.95-1.0 for checksum-validated matches. Sub-millisecond per document.
L2: Schema-Aware Matching
L2 uses structural context to detect PII that regex alone would miss. When the input is JSON, L2 examines field names; when the input contains labeled data, L2 uses the labels as classification hints.
JSON field matching: Fields named ssn, social_security, patient_id, credit_card, dob, date_of_birth, phone, mobile, email, address, zip, or variations. Values in these fields are classified as the corresponding entity type regardless of format.
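Field-name classification like this can be sketched as a recursive walk over parsed JSON. The hint table and return shape here are illustrative assumptions, not the library's schema:

```typescript
// Tag values whose JSON field names look PII-bearing,
// regardless of the value's format (sketch only).
const FIELD_HINTS: Record<string, string> = {
  ssn: 'SSN', social_security: 'SSN',
  credit_card: 'CREDIT_CARD',
  dob: 'DOB', date_of_birth: 'DOB',
  phone: 'PHONE', mobile: 'PHONE',
  email: 'EMAIL',
};

function classifyJsonFields(
  obj: unknown, path = ''
): { path: string; type: string; value: string }[] {
  const hits: { path: string; type: string; value: string }[] = [];
  if (obj && typeof obj === 'object') {
    for (const [key, val] of Object.entries(obj as Record<string, unknown>)) {
      const type = FIELD_HINTS[key.toLowerCase()];
      const p = path ? `${path}.${key}` : key;
      if (type && (typeof val === 'string' || typeof val === 'number')) {
        hits.push({ path: p, type, value: String(val) });
      } else {
        hits.push(...classifyJsonFields(val, p));
      }
    }
  }
  return hits;
}
```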
CSV/TSV header matching: Column headers map to entity types. A column named "Patient Name" causes all values in that column to be classified as person names. Handles common header aliases and abbreviations.
Label proximity: Inline labels followed by values — "Name: John Smith", "SSN: 123456789" (no dashes), "DOB: Jan 15 1990". The label provides the classification that regex alone cannot.
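Label-proximity matching can be modeled as a table of label patterns, each capturing the value that follows. The label table and regexes below are simplified assumptions for illustration:

```typescript
// Inline labels classify the value that follows them,
// even when the value alone would not match an L1 pattern.
const LABELS: [RegExp, string][] = [
  [/\b(?:ssn|social security)\s*[:#]\s*([\d -]{9,11})/gi, 'SSN'],
  [/\bname\s*:\s*([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)/gi, 'PERSON'],
  [/\bdob\s*:\s*([A-Za-z0-9 ,\/-]{6,20})/gi, 'DOB'],
];

function labelProximity(text: string): { type: string; value: string }[] {
  const hits: { type: string; value: string }[] = [];
  for (const [re, type] of LABELS) {
    for (const m of text.matchAll(re)) {
      hits.push({ type, value: m[1].trim() });
    }
  }
  return hits;
}
```

Note how "SSN: 123456789" is caught here even though the dashless value would fail the L1 SSN format check.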
L2 confidence: 0.80-0.95 depending on label specificity. Negligible additional latency.
L3: Named Entity Recognition
L3 uses compromise.js — a lightweight NLP library — for named entity recognition. It detects person names, organizations, locations, and titles that have no structural markers.
Lazy loading: The NER model loads on first invocation (~200ms) and stays cached for subsequent calls. This avoids startup overhead when L1-L2 are sufficient.
Entity types: Person names (first, last, full), organizations, geographic locations (cities, countries, regions), honorifics and titles. Handles compound names, hyphenated surnames, and common name formats.
Disambiguation: "Washington" could be a person, city, or state. L3 uses sentence context to disambiguate. When context is insufficient, the entity is tagged with both possible types and the higher-confidence interpretation is used.
L3 confidence: 0.65-0.90. Runs only on text spans not already matched by L1-L2.
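Running L3 only on unmatched text reduces to interval subtraction: take the spans L1-L2 already claimed and return the gaps between them. A sketch of that step (interval arithmetic only, illustrative shapes):

```typescript
// Given the document length and the [start, end) spans already
// matched by earlier layers, return the residual spans for NER.
function residualSpans(
  length: number,
  matched: [number, number][]
): [number, number][] {
  const sorted = [...matched].sort((a, b) => a[0] - b[0]);
  const out: [number, number][] = [];
  let cursor = 0;
  for (const [start, end] of sorted) {
    if (start > cursor) out.push([cursor, start]);
    cursor = Math.max(cursor, end);
  }
  if (cursor < length) out.push([cursor, length]);
  return out;
}
```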
L4: LLM Layer (Optional)
L4 sends remaining unclassified text spans to a local Ollama instance for LLM-based PII detection. This catches entities that all previous layers missed — nicknames, code names, indirect identifiers ("the patient from room 302").
Activation: Only for spans where L1-L3 produced confidence below the configured threshold. In practice, L4 activates for less than 5% of typical documents.
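The activation rule is a per-type confidence gate. A sketch under assumed shapes (the span interface and defaults below are illustrative, not the library's API):

```typescript
// A span escalates to the local LLM only when its best L1-L3
// confidence falls below the threshold for its entity type.
interface ScoredSpan { text: string; type: string; confidence: number }

function spansForL4(
  spans: ScoredSpan[],
  thresholds: Record<string, number>,
  defaultThreshold = 0.7
): ScoredSpan[] {
  return spans.filter(
    s => s.confidence < (thresholds[s.type] ?? defaultThreshold)
  );
}
```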
Local-only: Ollama runs as a local sidecar. No data leaves the machine. The LLM itself never sees the original PII that L1-L3 already detected — it only sees unmatched residual text.
Entirely optional: L1-L3 achieve macro-F1 0.94 without L4. The LLM layer is available for organizations that need maximum recall on ambiguous content, but the pipeline works fully without it.
L4 confidence: 0.50-0.85. Latency depends on model size (~100ms-1s per span).
Benchmarks
Measured against real-world PII patterns. L1-L3 processing completes in under 5ms per document for typical payloads. Reinjection is instantaneous.
Performance Comparison
| Metric | Presidio | AWS Comprehend | spaCy NER | Redact |
|---|---|---|---|---|
| F1 Score | ~0.85 | ~0.90 | ~0.82 | 0.94 |
| Latency (per doc) | ~50ms | ~200ms (network) | ~30ms | <5ms (L1-L3) |
| Reinjection | Manual | Not supported | Not supported | Built-in |
| Offline capable | Yes | No | Yes | Yes |
| Checksum validation | Partial | No | No | SSN + Luhn |
| Schema awareness | No | No | No | L2 JSON/CSV |
| Confidence scoring | Partial | Yes | No | Per-entity numeric |
ACI Surface
Four core functions. Import, call, and compose with any ACI in the platform.
redact(): Deterministic replacement ensures correct bidirectional mapping.
reinject(): Call this after receiving the LLM response to produce user-facing output with original names, numbers, and identifiers.

```typescript
import { redact, reinject } from '@private.me/redact';

// Redact PII from a prompt
const result = await redact(
  'Patient John Smith SSN 123-45-6789 needs refill',
  { layers: ['regex', 'schema', 'ner'], threshold: 0.7 }
);
// result.text:     "Patient [PERSON_1] SSN [SSN_1] needs refill"
// result.entities: Map { "[PERSON_1]" => "John Smith", "[SSN_1]" => "123-45-6789" }
// result.scores:   Map { "[PERSON_1]" => 0.88, "[SSN_1]" => 0.99 }

// Send clean text to any LLM
const llmResponse = await llm.complete(result.text);

// Reinject originals in the response
const final = reinject(llmResponse, result.entities);
// User sees real names and numbers, never placeholders
```
Use Cases
Redact protects PII across every industry that uses AI on sensitive data.
L1-L4 Pipeline: Strip PII from prompts before sending to any LLM API — OpenAI, Anthropic, Google, or self-hosted. Reinject entities in the response. Use any cloud AI safely without leaking personal data.
HIPAA Safe Harbor: Remove patient names, SSNs, medical record numbers, and dates of birth before AI analysis. L1-L3 detects all 18 HIPAA Safe Harbor identifiers. Compliant de-identification without expert review.
GDPR Art 5: Minimize personal data sent to third-party services per GDPR Article 5(1)(c). Strip PII before data transfer, analytics processing, or cross-border transmission. Automated enforcement of data minimization principles.
PCI DSS Req 3: Redact credit card numbers with Luhn validation, bank account numbers, routing numbers, and financial identifiers before processing. PCI DSS Requirement 3 compatible cardholder data masking.
Model Protection: Redact proprietary model weights and training data signatures from inference outputs to prevent model extraction attacks. L4 semantic analysis detects output patterns that leak model architecture. IP protection for vendors shipping model APIs without exposing the underlying model.
Regulatory Compliance
Redact maps directly to specific regulatory requirements. Each regulation mandates PII protection — Redact provides the technical enforcement layer.
| Regulation | Requirement | Redact Capability |
|---|---|---|
| HIPAA Safe Harbor | De-identify 18 PHI types (names, dates, SSN, MRN, geographic data, phone, fax, email, etc.) | L1-L3 detects all 18 identifier categories. Automated de-identification without expert determination. |
| GDPR Art 5(1)(c) | Data minimization — collect/process only what is necessary | Strip PII before third-party transfer. Configurable entity types per processing purpose. |
| CCPA §1798.100 | Consumer data protection — right to know, right to delete | Automated PII detection for data inventory. Entity maps enable targeted deletion. |
| PCI DSS Req 3 | Protect stored cardholder data — mask PAN, do not store CVV | L1 Luhn-validated credit card detection. Automatic masking before storage or transmission. |
| FERPA | Protect student education records and PII | L2 schema matching detects student IDs, grades, and enrollment data in structured records. |
| SOX §302 | Protect financial reporting data integrity | Redact financial identifiers before external analysis. Audit trail via entity maps. |
Cross-ACI Composition
Redact integrates with other PRIVATE.ME ACIs to create end-to-end data protection pipelines. Each composition multiplies the security guarantees.
Security Properties
Seven core security guarantees. Each backed by a specific mechanism and verified by the test suite.
| Property | Mechanism | Guarantee |
|---|---|---|
| PII detection coverage | 4-layer pipeline (L1-L4) | macro-F1 0.94 across 40+ entity types |
| SSN/CC detection | L1 regex + checksum/Luhn validation | Zero false negatives on valid format inputs |
| Name detection | L3 NER (compromise.js) | Context-aware disambiguation |
| International support | 11 phone format patterns | Global coverage (US, UK, DE, FR, JP, CN, IN, BR, AU, MX, KR) |
| Deterministic replacement | Entity map with ordered placeholders | Same input always produces same placeholders |
| Reinjection integrity | Bidirectional entity map | Every placeholder correctly maps back to its original value |
| Local processing | L1-L3 run entirely in-process | PII never leaves the machine for detection |
Redact vs. Traditional Approaches
| Dimension | Regex-Only Tools | Cloud Redaction (AWS/GCP) | Redact |
|---|---|---|---|
| Detection depth | 1 layer (pattern match) | ML models (cloud) | 4 layers (regex + schema + NER + LLM) |
| Reinjection | Not supported | Not supported | Built-in deterministic reinjection |
| Data residency | Local | Sent to cloud provider | Local (L1-L3) + optional local LLM (L4) |
| Confidence scoring | Binary match/no-match | Provider-dependent | Per-entity numeric confidence 0.0-1.0 |
| Schema awareness | No context | No context | JSON fields, CSV headers, label proximity |
| Offline capable | Yes | No | Yes (L1-L3 fully offline) |
Verifiable Data Protection
Every Redact operation produces a verifiable audit trail via xProve. HMAC-chained integrity proofs let auditors confirm that PII was detected, redacted, and reinjected correctly — without accessing the original data.
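HMAC chaining of this kind can be sketched with Node's standard crypto module. This is a minimal illustration of the general technique, assuming nothing about xProve's actual record format or key handling:

```typescript
// Minimal HMAC-chained audit log entry: each entry's MAC covers the
// previous entry's MAC, so tampering with any entry invalidates every
// MAC after it. Sketch only; not xProve's wire format.
import { createHmac } from 'node:crypto';

function appendAuditEntry(
  key: Buffer,
  prevMac: string,
  event: { op: string; docHash: string; ts: number }
): string {
  return createHmac('sha256', key)
    .update(prevMac)
    .update(JSON.stringify(event))
    .digest('hex');
}
```

An auditor holding the key can recompute the chain over the event metadata alone, confirming log integrity without ever seeing the original PII.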
Read the xProve white paper →
Ship Proofs, Not Source
Redact generates cryptographic proofs of correct execution without exposing proprietary algorithms. Verify integrity using zero-knowledge proofs — no source code required.
- Tier 1 HMAC (~0.7KB)
- Tier 2 Commit-Reveal (~0.5KB)
- Tier 3 IT-MAC (~0.3KB)
- Tier 4 KKW ZK (~0.4KB)
Honest Limitations
Honest engineering requires honest documentation. Four known limitations with their mitigations.
| Limitation | Impact | Mitigation |
|---|---|---|
| L4 requires Ollama | No LLM-based detection layer without a local Ollama sidecar. Organizations that cannot run local models lose the L4 fallback for ambiguous entities. | L1-L3 cover 95%+ of PII patterns. macro-F1 0.94 is achieved without L4. The LLM layer provides marginal recall improvement for edge cases only. |
| Non-English NER | L3 NER (compromise.js) is optimized for English. Lower recall for non-Latin names, non-English organization names, and non-Western address formats. | L1 regex works for all scripts (SSN, CC, email, phone patterns are language-independent). L2 schema matching is language-agnostic. Non-English NER is a targeted improvement area. |
| No image PII | Cannot detect PII embedded in images, scanned documents, or screenshots. Only text input is processed. | Pair with OCR preprocessing (Tesseract, AWS Textract output) to convert images to text first, then pipe through Redact. |
| Entity count scaling | Documents with more than 1,000 detected entities may experience increased processing time as the entity map grows. | Chunk documents larger than ~50KB into segments. Process each segment independently. Entity maps can be merged post-processing. |
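The chunk-and-merge mitigation in the table above can be sketched as follows. The helper names are hypothetical, and the merge renumbers placeholders per type to avoid cross-chunk collisions; the real pipeline may handle collisions differently, and the redacted chunk texts would need the same renumbering applied:

```typescript
// Split a large document on paragraph boundaries, keeping chunks
// under a size budget (sketch of the documented ~50KB mitigation).
function chunkDocument(text: string, maxChars = 50_000): string[] {
  const paragraphs = text.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Merge per-chunk entity maps, renumbering so [PERSON_1] from one
// chunk does not collide with [PERSON_1] from another.
function mergeEntityMaps(maps: Map<string, string>[]): Map<string, string> {
  const counters = new Map<string, number>();
  const merged = new Map<string, string>();
  for (const m of maps) {
    for (const [placeholder, value] of m) {
      const type = placeholder.replace(/^\[|_\d+\]$/g, '');
      const n = (counters.get(type) ?? 0) + 1;
      counters.set(type, n);
      merged.set(`[${type}_${n}]`, value);
    }
  }
  return merged;
}
```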
Enterprise CLI
@private.me/redact-cli is a self-hosted PII redaction server — Docker-ready, air-gapped capable, with three-role RBAC, batch processing, custom pattern management, and append-only audit logging.
Key Endpoints
| Method | Path | Permission | Description |
|---|---|---|---|
| GET | /health | — | Health check |
| POST | /redact | redact:execute | Redact PII from text |
| POST | /reinject | redact:execute | Reinject entities into response |
| POST | /detect | redact:detect | Detect entities without redacting |
| POST | /batch | batch:execute | Batch redaction job |
| GET | /batch/:id | batch:read | Get batch job status |
| POST | /patterns | pattern:create | Add custom regex pattern |
| GET | /patterns | pattern:list | List custom patterns |
| DELETE | /patterns/:id | pattern:delete | Remove custom pattern |
| GET | /config | config:read | Get server configuration |
| PUT | /config | config:write | Update configuration |
| GET | /audit | audit:read | Audit log entries |
RBAC Roles
Admin — full access: redaction, batch processing, pattern management, configuration, API keys, audit logs.
Operator — redaction execution, batch processing, pattern listing, configuration viewing.
Auditor — read-only: audit logs, configuration, pattern listing. Cannot execute redactions.
```shell
docker build -t redact-cli -f packages/redact-cli/Dockerfile .
docker run -d --name redact -p 3200:3200 \
  -v redact-data:/data \
  -e REDACT_ADMIN_KEY=your-secret-key \
  redact-cli
```
Enhanced Identity with Xid
Redact can optionally integrate with Xid to enable unlinkable inference identity — verifiable within each redaction context, but uncorrelatable across documents, tenants, or time.
Three Identity Modes
How Ephemeral Inference Identity Works
```typescript
// Initialize Redact with Xid integration
import { RedactClient } from '@private.me/redact-cli';
import { XidClient } from '@private.me/xid';

const xid = new XidClient({ mode: 'ephemeral' });
const redact = new RedactClient({ identityProvider: xid });

// Each inference derives an unlinkable DID automatically
const result = await redact.process({
  document: medicalRecord,
  policy: 'PHI'
});
// → [Redact] Deriving ephemeral operator DID from master seed...
// → [Redact] DID: did:key:z6MkJ... (unique to this inference + tenant + epoch)
// → [Redact] Ran inference with ephemeral identity (~50µs)
// → [Redact] Key purged (<1ms exposure)

// Inference is audited with an unlinkable DID. Verification works
// within a document, but cross-document correlation fails.
```
See the Xid white paper for details on ephemeral identity primitives and K-of-N convergence.
Market Positioning
| Industry | Use Case | Compliance Driver |
|---|---|---|
| Healthcare | PHI redaction with HIPAA-compliant unlinkable inference logs | HIPAA, 42 CFR Part 2, HITECH |
| Legal | Attorney-client privilege redaction with unlinkable operators | ABA Model Rules, work product doctrine |
| Government | Classified document redaction with IAL3 operator identity | FISMA, CJIS, FedRAMP, DoD IL5/6 |
| Finance | PII redaction with SOX-compliant audit trails | SOX, GLBA, GDPR, SEC 17a-4 |
Key Benefits
- Cross-document unlinkability — Can't track operators across redaction jobs
- Per-inference derivation — Same operator has different DIDs per document
- Epoch rotation — DIDs automatically rotate daily/weekly/monthly
- Split-protected derivation — Master seed is XorIDA-split, never reconstructed
- Privacy-preserving ML — Inference identity without cross-context correlation
- Multi-tenant isolation — Customer unlinkability built into identity layer
Get Started
Install Redact, strip PII from your first prompt, and reinject entities in the LLM response — all in a few lines of code.
npm install @private.me/redact
```typescript
import { redact, reinject, detectEntities } from '@private.me/redact';

// 1. Redact PII from a prompt
const result = await redact(
  'Patient John Smith (SSN 123-45-6789) called from 415-555-0142.'
);
// result.text:
//   "Patient [PERSON_1] (SSN [SSN_1]) called from [PHONE_1]."

// 2. Send clean text to any LLM API
const llmResponse = await llm.complete(result.text);
// LLM never sees John Smith, SSN, or phone number

// 3. Reinject originals into the LLM response
const userOutput = reinject(llmResponse, result.entities);
// User sees real names and numbers, never placeholders

// 4. Detect without redacting (for audit/analysis)
const entities = await detectEntities(
  'Email: jane@example.com, Card: 4111-1111-1111-1111'
);
// [{ type: 'EMAIL', value: 'jane@example.com', confidence: 0.99, span: [7, 23] },
//  { type: 'CREDIT_CARD', value: '4111-1111-1111-1111', confidence: 1.0, span: [31, 50] }]
```
```typescript
const result = await redact(input, {
  layers: ['regex', 'schema', 'ner'],  // skip L4 LLM
  threshold: 0.7,                      // minimum confidence
  entityTypes: [                       // only these types
    'SSN', 'CREDIT_CARD', 'PERSON', 'EMAIL', 'PHONE'
  ],
  thresholds: {                        // per-type overrides
    SSN: 0.99,
    PERSON: 0.7,
    PHONE: 0.85,
  }
});
```
Ready to deploy Redact?
Talk to Sol, our AI platform engineer, or book a live demo with our team.