gantz-ai/pii.engineer

PII Engineer

Fast, multilingual PII detection. 50+ languages, single model, no GPU required.

Live Demo · Benchmarks · API Docs · Models · Blog


Why PII Engineer?

| | PII Engineer | Presidio | spaCy | AWS Comprehend |
|---|---|---|---|---|
| F1 (multilingual) | 0.86 | 0.44 | 0.64 | 0.52 |
| F1 (English) | 0.88 | 0.80 | 0.83 | 0.82 |
| Languages | 50+ | ~10 locales | 1 per model | 12 |
| Latency (p50) | 180ms | 80ms (w/ NER) | 120ms | 200ms |
| GPU required | No | No | Optional | N/A |
| Self-hosted | Yes | Yes | Yes | No |
| Cost (1M req/mo) | $42 | $42 | $42 | ~$1,000 |

Full benchmarks →

Features

  • Multilingual: a single model handles 50+ languages, including CJK, Southeast Asian, South Asian, and European languages
  • High accuracy: 0.90 F1 overall; outperforms regex-based tools on non-English text
  • Fast: ~180ms p50 on CPU (INT8 quantized ONNX inference)
  • Zero-shot labels: detect custom entity types without retraining
  • Self-hosted: runs on a $42/mo VPS with no external API calls, so your data never leaves your server
  • Single binary: Rust binary with embedded static assets, no Python runtime or dependency hell
  • Auto-redaction: returns both detected entities and redacted text in one call
  • 9 PII types: person names, phone numbers, government IDs, addresses, DOB, emails, passports, license plates, bank accounts

Quick Start

From Source

cargo build --release --package pii-engineer-server
cargo run --release --package pii-engineer-server
# Models auto-download from HuggingFace on first run
# API ready at http://localhost:8000

Docker

docker build -t pii-engineer .
docker run -p 8000:8000 -v "$(pwd)/models:/app/models" pii-engineer

Test It

curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "John Doe, NRIC S9012345B, born 12 March 1985"}'

Response:

{
  "entities": [
    { "type": "person_name", "value": "John Doe", "score": 0.99 },
    { "type": "government_id", "value": "S9012345B", "score": 0.99 },
    { "type": "date_of_birth", "value": "12 March 1985", "score": 0.97 }
  ],
  "redacted": "[PERSON_NAME], NRIC [GOVERNMENT_ID], born [DATE_OF_BIRTH]"
}
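
Each entity carries a `score`, so a caller can decide how much confidence to require before acting on a match. A minimal sketch in Python, assuming the response shape shown above; the helper name `entities_above` and the 0.98 cutoff are illustrative and not part of the API:

```python
def entities_above(response: dict, min_score: float) -> list[dict]:
    """Keep only entities whose score is at least min_score."""
    return [
        e for e in response.get("entities", [])
        if e.get("score", 0.0) >= min_score
    ]


# The response from the curl call above, pasted in as a Python dict.
response = {
    "entities": [
        {"type": "person_name", "value": "John Doe", "score": 0.99},
        {"type": "government_id", "value": "S9012345B", "score": 0.99},
        {"type": "date_of_birth", "value": "12 March 1985", "score": 0.97},
    ],
    "redacted": "[PERSON_NAME], NRIC [GOVERNMENT_ID], born [DATE_OF_BIRTH]",
}

confident = entities_above(response, 0.98)
print([e["type"] for e in confident])
# ['person_name', 'government_id']
```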

Integration Examples

Python

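import requests

# Send the text to a locally running PII Engineer server.
response = requests.post(
    "http://localhost:8000/api/detect",
    json={"text": "Ahmad bin Abdullah, +60 12-345 6789, IC 901201-14-5678"},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
data = response.json()
print(data["redacted"])
# [PERSON_NAME], [PHONE_NUMBER], IC [GOVERNMENT_ID]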

JavaScript / Node.js

const res = await fetch("http://localhost:8000/api/detect", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Nguyễn Văn A, CCCD 079201012345, sinh ngày 15/03/1990"
  }),
});
const { entities, redacted } = await res.json();
console.log(redacted);
// [PERSON_NAME], CCCD [GOVERNMENT_ID], sinh ngày [DATE_OF_BIRTH]

cURL (batch labels)

curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Call me at 9123 4567 or email john@acme.com",
    "labels": ["phone_number", "email_address"]
  }'
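
The same label restriction can be expressed from Python without third-party packages. This is a sketch using only the standard library; the function names `build_detect_request` and `detect` and the 30-second timeout are assumptions made for illustration, while the URL and the `text` / `labels` fields come from the examples above:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/detect"  # default address from Quick Start


def build_detect_request(text, labels=None, url=API_URL):
    """Build the POST request; omitting labels leaves the server default in place."""
    payload = {"text": text}
    if labels is not None:
        payload["labels"] = list(labels)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text, labels=None, url=API_URL, timeout=30.0):
    """Send the request and return the parsed JSON response."""
    request = build_detect_request(text, labels, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `detect("Call me at 9123 4567 or email john@acme.com", labels=["phone_number", "email_address"])` sends the same body as the cURL call. Keeping request construction separate from sending makes the payload easy to inspect in tests without a live server.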

PII Types

| Type | Examples |
|---|---|
| person_name | Sarah Lim, Ahmad bin Abdullah, 陈伟明 |
| phone_number | +65 9123 4567, 0812-3456-7890 |
| government_id | S9012345B (NRIC), 3201010512890001 (NIK), Aadhaar |
| street_address | 42 Orchard Road #08-12, Jl. Sudirman No. 1 |
| date_of_birth | 12 March 1985, 1990-05-15 |
| email_address | john@example.com |
| passport_number | E12345678 |
| license_plate | SBA1234A, B 1234 CD |
| bank_account_number | 1234-5678-9012 |

Supported Languages

Primary (highest accuracy): English, Malay, Tamil, Chinese, Indonesian, Vietnamese

Secondary: Thai, Hindi, Bengali, Korean, Japanese, German, French, Spanish, Portuguese, Russian, Arabic, Turkish, Polish, Dutch, Italian, Swedish, and 35+ more

The model handles multilingual text natively: mixed-language documents (e.g., English + Chinese + Malay in one paragraph) work without language selection.

Use Cases

  • PDPA / GDPR compliance: scan documents, databases, and logs for personal data before audits
  • LLM guardrails: redact PII before sending user input to GPT/Claude/Gemini
  • Data pipelines: clean PII from ETL outputs, data warehouse columns, Kafka streams
  • Chat moderation: detect PII in real time in Slack, support tickets, or chat apps
  • Code review: catch hardcoded PII in test fixtures, config files, and documentation
  • Document redaction: auto-redact contracts, resumes, medical records before sharing
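
As an illustration of the LLM guardrail case, the sketch below redacts user input before it is forwarded anywhere. It assumes only that the detect call returns a dict with a `redacted` string, as in the responses shown in this README; `guarded_prompt` is a hypothetical helper, and the `detect` argument is whatever function you use to call `/api/detect`:

```python
from typing import Callable


def guarded_prompt(user_input: str, detect: Callable[[str], dict]) -> str:
    """Return the redacted form of user_input, failing closed.

    If the response carries no redacted text, raise instead of silently
    forwarding the raw input to a third-party model.
    """
    result = detect(user_input)
    redacted = result.get("redacted")
    if not isinstance(redacted, str):
        raise ValueError("no 'redacted' text in response; refusing to forward raw input")
    return redacted


# Stand-in for a real /api/detect call, used here so the sketch runs offline.
def fake_detect(text: str) -> dict:
    return {"entities": [], "redacted": text.replace("John Doe", "[PERSON_NAME]")}


print(guarded_prompt("Summarise the complaint from John Doe", fake_detect))
# Summarise the complaint from [PERSON_NAME]
```

Failing closed is a deliberate choice in this sketch: an outage of the detector then blocks the request rather than leaking unredacted text.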

API Reference

POST /api/detect

| Field | Type | Default | Description |
|---|---|---|---|
| text | string | required | Input text (max 50,000 chars) |
| labels | string[] | all 9 types | PII types to detect |
| boost | string[] | [] | Labels to boost with description matching |
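
Inputs longer than the 50,000-character limit have to be split on the client side. A minimal sketch, assuming the limit counts Unicode characters the way Python's `len` does (the server may count differently, so leave headroom); `chunk_text` is a hypothetical helper, not part of the API:

```python
def chunk_text(text: str, max_chars: int = 50_000) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pieces of at most max_chars characters.

    Cuts fall just after the last space or newline in each window when one
    exists, so words are not split; the offset lets you map entity positions
    in a chunk back to the full document.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


print(chunk_text("aaaa bbbb cccc", max_chars=6))
# [(0, 'aaaa '), (5, 'bbbb '), (10, 'cccc')]
```

Cutting at whitespace can still separate a multi-word entity such as a full name across two chunks, so splitting at paragraph boundaries is preferable when the document has them.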

Response:

{
  "entities": [
    { "type": "person_name", "value": "John Doe", "start": 0, "end": 8, "score": 0.99, "needs_review": false }
  ],
  "redacted": "[PERSON_NAME] lives at [STREET_ADDRESS]",
  "original": "John Doe lives at 123 Main St"
}
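
The `start` and `end` fields make it possible to apply your own redaction style to `original` instead of using the bracketed placeholders. The sketch below assumes the offsets are character indices with an exclusive `end`, which matches the example (`"John Doe"` at 0 to 8) but is not stated explicitly; for non-ASCII text they could be byte offsets, so the code skips any span whose slice does not reproduce the reported `value`. `mask_entities` is a hypothetical helper:

```python
def mask_entities(original: str, entities: list[dict], mask_char: str = "*") -> str:
    """Overwrite each detected span with mask_char, preserving text length."""
    chars = list(original)
    for ent in entities:
        start, end = ent["start"], ent["end"]
        # Guard against offset conventions that differ from the assumption above.
        if original[start:end] != ent["value"]:
            continue
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


original = "John Doe lives at 123 Main St"
entities = [
    {"type": "person_name", "value": "John Doe", "start": 0, "end": 8,
     "score": 0.99, "needs_review": False},
]
print(mask_entities(original, entities))
# ******** lives at 123 Main St
```

Because the mask has the same length as the span, offsets of the remaining entities stay valid and the spans can be processed in any order.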

GET /api/health

{ "status": "ok", "version": "1.0.0", "gliner_loaded": true, "chinese_loaded": true }
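
A deployment script or readiness probe can interpret this payload before routing traffic. A small sketch, assuming the field names shown above; `is_ready` and its `require_chinese` flag are illustrative names:

```python
def is_ready(health: dict, require_chinese: bool = False) -> bool:
    """Return True when the server reports ok and the needed models are loaded."""
    if health.get("status") != "ok" or not health.get("gliner_loaded", False):
        return False
    if require_chinese and not health.get("chinese_loaded", False):
        return False
    return True


print(is_ready({"status": "ok", "version": "1.0.0",
                "gliner_loaded": True, "chinese_loaded": True}))
# True
```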

Architecture

Request → Language detection → GLiNER2 NER + (Chinese NER if CJK)
            ↓
  Post-processing pipeline (8 stages)
  reclassify → validate → filter → normalize → email/IP detect → threshold → dedup → merge
            ↓
  Response (entities + redacted text)
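
To give a feel for what the threshold and dedup stages do, here is an illustrative Python sketch. It is not the Rust implementation in `pipeline.rs`: the 0.5 cutoff, the "highest score wins on overlap" rule, and the function name are all assumptions chosen to show the general idea.

```python
def threshold_and_dedup(entities: list[dict], min_score: float = 0.5) -> list[dict]:
    """Drop low-score entities, then keep the best-scoring of any overlapping spans."""
    candidates = [e for e in entities if e["score"] >= min_score]
    kept = []
    for ent in sorted(candidates, key=lambda e: e["score"], reverse=True):
        # Two half-open spans overlap when each starts before the other ends.
        overlaps = any(
            ent["start"] < k["end"] and k["start"] < ent["end"] for k in kept
        )
        if not overlaps:
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


raw = [
    {"type": "person_name", "start": 0, "end": 8, "score": 0.99},
    {"type": "person_name", "start": 5, "end": 8, "score": 0.60},      # overlaps the first span
    {"type": "street_address", "start": 18, "end": 29, "score": 0.90},
    {"type": "license_plate", "start": 18, "end": 21, "score": 0.30},  # below the cutoff
]
print([(e["type"], e["start"], e["end"]) for e in threshold_and_dedup(raw)])
# [('person_name', 0, 8), ('street_address', 18, 29)]
```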

Model: Fine-tuned GLiNER2 (mDeBERTa-v3-base, 280M params) split into 5 ONNX models. INT8 quantized encoder for CPU inference.

Stack: Rust + Axum + ONNX Runtime + HuggingFace Tokenizers + mimalloc

How it works:

  1. Text and entity labels are encoded together by the transformer encoder
  2. Span representation layer scores all possible token spans (up to 8 tokens wide)
  3. Classifier determines which spans match which PII labels
  4. 8-stage post-processing pipeline validates, deduplicates, and merges results
  5. Regex-based detection supplements NER for emails and IP addresses
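
Step 2 is easier to picture with the candidate spans written out. The sketch below enumerates every contiguous token span up to a maximum width; it is an illustration in Python rather than the project's Rust code, and it uses whole words as tokens where the real model works on subword tokens from the tokenizer.

```python
def candidate_spans(num_tokens: int, max_width: int = 8) -> list[tuple[int, int]]:
    """List all (start, end) token spans, end exclusive, at most max_width wide."""
    spans = []
    for start in range(num_tokens):
        last_end = min(start + max_width, num_tokens)
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


tokens = ["John", "Doe", ",", "NRIC", "S9012345B"]
for start, end in candidate_spans(len(tokens), max_width=2):
    print(tokens[start:end])

# With the width capped at 8, a 10-token input yields 52 candidate spans.
print(len(candidate_spans(10)))
# 52
```

Capping the width keeps the number of spans roughly linear in the input length (at most 8 per start position) instead of quadratic, which is what makes scoring every span practical on CPU.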

Configuration

| Variable | Default | Description |
|---|---|---|
| PORT | 8000 | Server port |
| GLINER_MODELS | models/PII-Engineer-Multi-NER-v2.1 | GLiNER model path |
| CHINESE_NER_MODEL | models/PII-Engineer-Chinese-NER-v1.0 | Chinese NER model path |
| ORT_DYLIB_PATH | auto-detect | Path to libonnxruntime.so / .dylib |
| ORT_INTRA_THREADS | 4 | ONNX Runtime intra-op threads |
| ORT_INTER_THREADS | 1 | ONNX Runtime inter-op threads |
| PII_ENGINEER_RATE_LIMIT_RPM | 120 | Max requests per minute per IP |
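
Batch clients that call the server from a single IP can pace themselves to stay under `PII_ENGINEER_RATE_LIMIT_RPM`. The sketch below spaces calls evenly (one every 0.5 s at the default of 120 per minute). How the server counts its window is not documented here, so even spacing is a conservative assumption, and `Throttle` is a hypothetical client-side helper:

```python
import time


class Throttle:
    """Block so that successive calls to wait() are at least 60/rpm seconds apart."""

    def __init__(self, rpm: int = 120, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call made yet

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Call `throttle.wait()` immediately before each request. The `clock` and `sleep` parameters exist so the pacing logic can be tested without real delays.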

Performance

| Setup | Latency (p50) | Throughput |
|---|---|---|
| MacBook M-series (FP32) | ~150ms | ~6 req/s |
| 4-vCPU AMD (INT8) | ~250ms | ~4 req/s |
| 8-vCPU AMD (INT8) | ~180ms | ~5 req/s |

Memory usage: ~800MB (model weights loaded in RAM).

Tips:

  • Set ORT_INTRA_THREADS equal to your vCPU count
  • INT8 encoder gives ~40% speedup with <0.5% accuracy loss
  • First request after idle is slower; the server runs periodic warmup to mitigate this
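
The first tip can be automated in a launcher script. A minimal sketch, assuming the environment variable names from the Configuration section; `ort_thread_env` is a hypothetical helper, and `os.cpu_count()` reports the CPUs the OS sees, which can exceed a container's CPU quota:

```python
import os


def ort_thread_env(vcpus=None) -> dict:
    """Build thread settings: intra-op threads = vCPU count, inter-op threads = 1."""
    n = vcpus if vcpus is not None else (os.cpu_count() or 1)
    return {"ORT_INTRA_THREADS": str(n), "ORT_INTER_THREADS": "1"}


print(ort_thread_env(8))
# {'ORT_INTRA_THREADS': '8', 'ORT_INTER_THREADS': '1'}
```

Merge the result into the server's environment when starting it, for example `subprocess.run(cmd, env={**os.environ, **ort_thread_env()})`, where `cmd` is however you launch the server binary.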

Development

cargo build --workspace
cargo test --workspace
cargo clippy --workspace
cargo run --release -p pii-engineer-server

Project Structure

crates/
├── pii-engineer-core/     # NER engine, pipeline, model loading
│   └── src/
│       ├── gliner/        # GLiNER2 ONNX inference (v1, v2-compat, v2-full)
│       ├── pipeline.rs    # 8-stage post-processing
│       ├── labels.rs      # PII label definitions and canonicalization
│       └── lang.rs        # Language detection (CJK)
├── pii-engineer-server/   # HTTP server (Axum)
│   └── src/
│       ├── routes.rs      # API endpoints
│       ├── state.rs       # App state, model loading
│       └── middleware.rs  # Rate limiting, error handling
static/                    # Embedded frontend (rust-embed)
models/                    # ONNX models (auto-downloaded)

Contributing

See CONTRIBUTING.md for guidelines. We especially welcome:

  • Validation rules for country-specific ID formats
  • Test cases for underrepresented languages
  • Performance optimizations

License

Apache-2.0

See NOTICE for upstream attributions.

About

🏆 Fast, multilingual PII detection for privacy compliance (PDPA, PDPD, PDP Law, PIPL). 9 entity types, 13+ languages, CPU-only ONNX Runtime inference.
