The tool is designed with a modular, layered architecture to separate responsibilities and allow for extensibility.
```mermaid
graph TD
    A[User] -- "uv run anon.py <file> [args]" --> B(anon.py CLI);

    subgraph "1. Orchestration & File Processing"
        B -- "Instantiates" --> Orch(AnonymizationOrchestrator);
        B -- "Gets Processor" --> F(ProcessorRegistry);
        F -- "e.g., .pdf" --> P_PDF(PdfFileProcessor);
        P_PDF -- "Extracts" --> RawText(Raw Text Content);
    end

    subgraph "2. Anonymization Core"
        RawText -- "orchestrator.anonymize()" --> Orch;
        Orch -- "Selects Strategy (--strategy)" --> STR_CHOICE{Strategy};
        STR_CHOICE -- "'presidio', 'filtered', 'hybrid', 'standalone'" --> PRESIDIO_STR(Traditional Strategy);
        STR_CHOICE -- "'slm'" --> SLM_STR(SLM Strategy);
    end

    subgraph "3. Traditional Engine (Presidio/Regex)"
        PRESIDIO_STR -- "Uses" --> Presidio(Presidio Engine);
        Presidio -- "Loads Models" --> Models(NLP Models);
        Presidio -- "Uses Recognizers" --> Regex(Custom Recognizers);
        Presidio -- "Generates Slug" --> Anonymizer(CustomSlugAnonymizer);
        Anonymizer -- "HMAC + DB" --> DB[(entities.db)];
        Anonymizer -- "Replaces PII" --> AnonymizedText[Anonymized Text];
    end

    subgraph "3b. SLM Engine (Ollama)"
        SLM_STR -- "Queries" --> Ollama(OllamaClient);
        Ollama -- "Local LLM inference" --> AnonymizedText;
    end

    subgraph "4. Output Generation"
        AnonymizedText --> P_PDF;
        P_PDF -- "Writes File" --> OUT(output/anon_file...);
        B -- "Writes Report" --> LOG(logs/report.txt);
    end
```
- Presidio: Core engine for PII identification and anonymization.
- spaCy & Hugging Face Transformers: NLP and Named Entity Recognition (NER).
- Pandas: Structured data processing (CSV, XLSX).
- PyMuPDF & python-docx: PDF and DOCX parsing.
- Pytesseract: OCR for text extraction from images.
- ijson: Streaming large JSON files.
- orjson: JSON serialization/deserialization.
- openpyxl: Excel file processing.
- lxml: XML parsing and processing.
For each detected entity:
- Normalize entity text (remove extra spaces).
- Generate an HMAC-SHA256 hash using `ANON_SECRET_KEY`.
- Store the full hash (64 characters) as a unique identifier in the database.
- Replace the entity in the text with a slug of configurable length (e.g., `[PERSON_a1b2c3d4]`).
The same entity always produces the same slug, maintaining referential consistency across the anonymized output.
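The steps above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the key handling, function name, and slug length are assumptions (in the real tool the key comes from `ANON_SECRET_KEY` and the length is configurable).

```python
import hashlib
import hmac

# Illustrative secret; the tool reads this from the ANON_SECRET_KEY env var.
SECRET_KEY = b"example-secret"

def make_slug(entity_type: str, text: str, slug_len: int = 8) -> tuple[str, str]:
    """Normalize -> HMAC-SHA256 -> slug (hypothetical helper)."""
    normalized = " ".join(text.split())  # collapse extra whitespace
    full_hash = hmac.new(SECRET_KEY, normalized.encode("utf-8"),
                         hashlib.sha256).hexdigest()  # 64 hex chars, stored in the DB
    slug = f"[{entity_type}_{full_hash[:slug_len]}]"  # short form used in the output
    return full_hash, slug
```

Because the hash is keyed and deterministic, the same normalized text always maps to the same slug, which is what gives the referential consistency described below.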
SQLite database at `db/entities.db`:

| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `entity_type` | TEXT | Entity type (e.g., PERSON, LOCATION) |
| `original_name` | TEXT | Original entity text |
| `slug_name` | TEXT | Short hash displayed in anonymized output |
| `full_hash` | TEXT | Full HMAC-SHA256 hash (UNIQUE) |
| `first_seen` | TEXT | Timestamp of first detection |
| `last_seen` | TEXT | Timestamp of last detection |
The composition root: parses arguments, instantiates and wires all core components (`CacheManager`, `HashGenerator`, `EntityDetector`, `DatabaseContext`), injects dependencies into `AnonymizationOrchestrator`, dispatches files to processors, and generates performance reports.
Central coordinator. Responsibilities:
- Initializes the Presidio `AnalyzerEngine` and `AnonymizerEngine`.
- Selects and injects dependencies into the chosen strategy.
- Manages the batch fallback mechanism.
- Collects entity statistics for reporting.
Template Method Pattern with a base `FileProcessor` and specialized subclasses:

| Processor | Handles |
|---|---|
| `TextFileProcessor` | `.txt`, `.log` — line-by-line |
| `ImageFileProcessor` | Images — OCR extraction |
| `DocxFileProcessor` | `.docx` — paragraphs + embedded images |
| `PdfFileProcessor` | `.pdf` — text blocks + images |
| `CsvFileProcessor` | `.csv` — column-wise with translation maps |
| `XlsxFileProcessor` | `.xlsx` — in-memory workbook processing |
| `XmlFileProcessor` | `.xml` — structure-preserving with XPath tracking |
| `JsonFileProcessor` | `.json`, `.jsonl` — hybrid streaming/in-memory |
JSON Processing Modes:
- JSONL: line-by-line streaming
- Small JSON (<100 MB): in-memory
- Large JSON arrays: `ijson` streaming
- Fallback to in-memory if streaming fails
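The JSONL mode can be sketched with the standard library alone. `anonymize_jsonl` and `process` are illustrative names, not the tool's actual API; the real processor also handles the large-array `ijson` path.

```python
import io
import json

def anonymize_jsonl(stream, process):
    """Stream a .jsonl file line by line; process() stands in for the real
    anonymization call (hypothetical helper)."""
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)       # parse one record at a time
        yield json.dumps(process(record))  # constant memory: no full-file load

src = io.StringIO('{"name": "Alice"}\n{"name": "Bob"}\n')
out = list(anonymize_jsonl(src, lambda r: {**r, "name": "[PERSON_xxxx]"}))
```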
Repository Pattern (`repository.py`): `EntityRepository` handles connection management (thread-local storage), schema initialization, batch insertion with `INSERT OR IGNORE`, and entity lookup by slug.
Thread-Safe Queue (`database.py`): All writes go through a `queue.Queue` consumed by a dedicated background writer thread, preventing DB write locks from blocking processing. Graceful shutdown ensures the queue is fully drained before exit.
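The writer-thread pattern can be sketched as follows. All names here are illustrative (the real implementation lives in `database.py`); a sentinel value gives the graceful, drain-before-exit shutdown described above.

```python
import queue
import threading

_SENTINEL = object()  # posted on shutdown; everything queued before it is written

class AsyncWriter:
    """Hypothetical sketch: single background thread owns all DB writes."""

    def __init__(self, write_fn):
        self._q = queue.Queue()
        self._write = write_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._q.get()
            if item is _SENTINEL:
                break          # queue fully drained up to the sentinel
            self._write(item)  # only this thread touches the database

    def put(self, item):
        self._q.put(item)      # callers never block on DB locks

    def close(self):
        self._q.put(_SENTINEL)
        self._thread.join()    # graceful shutdown

rows = []
w = AsyncWriter(rows.append)
w.put(("PERSON", "a1b2c3d4"))
w.close()
```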
- Should Anonymize Check: Config-based exclusion → forced anonymization → text filters (stoplist, min length, numeric) → explicit/implicit mode.
- Entity Detection: spaCy NER + Transformer (XLM-RoBERTa) + custom regex recognizers → merge and deduplicate.
- Hash Generation: Normalize → HMAC-SHA256 with secret key → create slug.
- Database Storage: Queue entity for async write.
- Text Replacement: Replace entity with `[TYPE_hash]`.
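The text-filter portion of the Should Anonymize Check can be sketched as below. This is an assumption-laden illustration: the config-based exclusion, forced anonymization, and explicit/implicit mode steps are omitted, and the defaults and names are not the tool's actual API.

```python
def should_anonymize(text: str, *, stoplist=frozenset(), min_len=3,
                     skip_numeric=True) -> bool:
    """Hypothetical filter chain: stoplist -> min length -> numeric check."""
    if text.lower() in stoplist:         # stoplist filter
        return False
    if len(text) < min_len:              # minimum-length filter
        return False
    if skip_numeric and text.isdigit():  # skip purely numeric tokens
        return False
    return True
```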
JSON/XML: Parse tree → collect strings by path → create translation map → reconstruct tree.
CSV/XLSX: Process unique values per column → create translation map → apply vectorized transformations → preserve headers.
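The translation-map step for CSV/XLSX can be sketched without Pandas. The point of the pattern is that the (expensive) anonymization runs once per unique value, then the map is applied to the whole column; in the real tool this application is vectorized. Names are illustrative.

```python
def anonymize_column(values, anonymize):
    """Hypothetical per-column pass: build the translation map from unique
    values, then apply it to every cell."""
    translation = {v: anonymize(v) for v in set(values)}  # one call per unique value
    return [translation[v] for v in values]              # cheap lookup per cell

col = ["Alice", "Bob", "Alice", "Alice"]
calls = []

def fake_anon(v):
    calls.append(v)                   # count how often detection actually runs
    return f"[PERSON_{len(v):04d}]"   # stand-in for the real slug

out = anonymize_column(col, fake_anon)
```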
- PDF: Page-by-page with explicit cleanup (`page.clean_contents()`, `del page`).
- JSON: `ijson` streaming for large arrays; line-by-line for JSONL.
- CSV/XLSX: Chunked Pandas reads; XLSX iterates cells without loading the full workbook.
- GC Control: `--disable-gc` disables automatic GC for large single files; explicit `gc.collect()` calls are placed strategically.
LRU cache (`collections.OrderedDict`):
- Configurable size via `--max-cache-size` (default: 10,000 items).
- Enabled by default; disable with `--no-use-cache`.
- Caches `(original_text → anonymized_slug)` pairs to avoid redundant detection and hashing.
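A minimal sketch of such an LRU cache over `collections.OrderedDict` (the class and method names are illustrative, not those of `cache_manager.py`):

```python
from collections import OrderedDict

class SlugCache:
    """Hypothetical LRU cache: original text -> anonymized slug."""

    def __init__(self, max_size: int = 10_000):  # mirrors the --max-cache-size default
        self._data = OrderedDict()
        self._max = max_size

    def get(self, text):
        if text not in self._data:
            return None
        self._data.move_to_end(text)  # mark as most recently used
        return self._data[text]

    def put(self, text, slug):
        self._data[text] = slug
        self._data.move_to_end(text)
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used
```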
After batch processing, the orchestrator verifies input count == output count. On mismatch:
- `_safe_fallback_processing` re-processes items one-by-one.
- Errors are logged; problematic items return their original text to preserve structure.
- Prevents misaligned output in structured files (CSV, JSON, XML) and accidental PII exposure.
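A sketch of the check-and-fallback logic. Everything here is illustrative: the real `_safe_fallback_processing` lives in the orchestrator, and the error handling there also logs each failure.

```python
def anonymize_batch_safe(items, batch_fn, single_fn):
    """Hypothetical wrapper: verify input count == output count, then fall
    back to per-item processing on mismatch."""
    results = batch_fn(items)
    if len(results) == len(items):
        return results  # counts align: batch output is trusted

    # Mismatch: re-process one-by-one; on error return the original text so
    # structured output (CSV, JSON, XML) stays aligned with the input.
    safe = []
    for item in items:
        try:
            safe.append(single_fn(item))
        except Exception:
            safe.append(item)  # the real tool also logs the error here
    return safe
```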
```
.
├── anon.py                            # CLI entry point
├── pyproject.toml                     # Project metadata and dependencies
├── uv.lock                            # Dependency lock file
├── run.sh                             # Docker orchestration script
│
├── examples/
│   ├── anonymization_config.json      # Default anonymization config
│   ├── anonymization_config_cve.json  # CVE-specific config example
│   ├── word_list.example.json         # Word list format example
│   └── exemplo.docx / exemplo.xlsx    # Sample documents
│
├── docker/
│   ├── Dockerfile                     # Multi-stage build (CPU + GPU)
│   ├── docker-compose.yml             # Service profiles
│   └── docker-entrypoint.sh           # Container entrypoint
│
├── src/anon/                          # Core library
│   ├── config.py                      # Entity mappings, language lists
│   ├── engine.py                      # AnonymizationOrchestrator
│   ├── strategies.py                  # FullPresidio, Filtered, Hybrid strategies
│   ├── standalone_strategy.py         # StandaloneStrategy
│   ├── entity_detector.py             # NER entity detection
│   ├── processors.py                  # File processors
│   ├── repository.py                  # EntityRepository (SQLite)
│   ├── database.py                    # Thread-safe DB writer queue
│   ├── hash_generator.py              # HMAC-SHA256 hash generation
│   ├── cache_manager.py               # LRU cache
│   ├── security.py                    # Key validation
│   ├── model_manager.py               # Model loading and management
│   ├── tqdm_handler.py                # Progress bar handler
│   ├── core/
│   │   ├── config_loader.py           # Configuration loading
│   │   └── protocols.py               # Protocol interfaces
│   ├── slm/                           # Small Language Model integration
│   │   ├── client.py                  # OllamaClient (SLMClient protocol)
│   │   ├── prompts.py                 # PromptManager
│   │   ├── ollama_manager.py          # Ollama process management
│   │   ├── anonymizers/
│   │   │   └── slm_anonymizer.py      # End-to-end SLM anonymization
│   │   ├── detectors/
│   │   │   └── slm_detector.py        # SLM as entity detector
│   │   └── mappers/
│   │       └── entity_mapper.py       # SLM entity mapping
│   └── evaluation/                    # Evaluation support
│       ├── ground_truth.py            # Ground truth loading
│       ├── hash_tracker.py            # Hash tracking for evaluation
│       └── metrics_calculator.py      # TP/FP/FN metrics
│
├── scripts/                           # Utility scripts
│   ├── deanonymize.py                 # Controlled de-anonymization
│   ├── evaluate.py                    # Evaluation metrics
│   ├── create_ground_truth.py         # Ground truth generation
│   ├── sample.py                      # Data sampling
│   ├── generate_cve_dataset.py        # CVE dataset generation
│   ├── analyze_entity_map.py          # Entity map analysis
│   ├── cluster_entities.py            # Entity clustering (HDBSCAN)
│   ├── get_metrics.py                 # Performance statistics
│   ├── export_and_clear_db.py         # DB export/clear
│   └── utils.py                       # Shared utilities
│
├── tests/                             # Test suite
├── benchmark/                         # Benchmarking suite
│   └── README.md                      # Benchmark documentation
└── docs/                              # Documentation
    └── developers/
        ├── ARCHITECTURE.md
        ├── ANONYMIZATION_STRATEGIES.md
        ├── EXTENSIBILITY.md
        ├── SLM_INTEGRATION_GUIDE.md
        └── UTILITY_SCRIPTS_GUIDE.md
```
- Extensibility Guide — all extension points with worked examples (strategies, processors, cache, storage, SLM client, model providers, etc.)
- Anonymization Strategies — detailed description of each built-in strategy
- SLM Integration Guide — deep dive into the SLM module architecture
- Contributing — development setup, conventions, and pull-request process
- Changelog — release history