Architecture Reference

System Architecture

The tool is designed with a modular, layered architecture to separate responsibilities and allow for extensibility.

graph TD
    A[User] -- "uv run anon.py <file> [args]" --> B(anon.py CLI);

    subgraph "1. Orchestration & File Processing"
        B -- "Instantiates" --> Orch(AnonymizationOrchestrator);
        B -- "Gets Processor" --> F(ProcessorRegistry);
        F -- "e.g., .pdf" --> P_PDF(PdfFileProcessor);
        P_PDF -- "Extracts" --> RawText(Raw Text Content);
    end

    subgraph "2. Anonymization Core"
        RawText -- "orchestrator.anonymize()" --> Orch;
        Orch -- "Selects Strategy (--strategy)" --> STR_CHOICE{Strategy};
        STR_CHOICE -- "'presidio', 'filtered', 'hybrid', 'standalone'" --> PRESIDIO_STR(Traditional Strategy);
        STR_CHOICE -- "'slm'" --> SLM_STR(SLM Strategy);
    end

    subgraph "3. Traditional Engine (Presidio/Regex)"
        PRESIDIO_STR -- "Uses" --> Presidio(Presidio Engine);
        Presidio -- "Loads Models" --> Models(NLP Models);
        Presidio -- "Uses Recognizers" --> Regex(Custom Recognizers);
        Presidio -- "Generates Slug" --> Anonymizer(CustomSlugAnonymizer);
        Anonymizer -- "HMAC + DB" --> DB[(entities.db)];
        Anonymizer -- "Replaces PII" --> AnonymizedText[Anonymized Text];
    end

    subgraph "3b. SLM Engine (Ollama)"
        SLM_STR -- "Queries" --> Ollama(OllamaClient);
        Ollama -- "Local LLM inference" --> AnonymizedText;
    end

    subgraph "4. Output Generation"
        AnonymizedText --> P_PDF;
        P_PDF -- "Writes File" --> OUT(output/anon_file...);
        B -- "Writes Report" --> LOG(logs/report.txt);
    end

Technology Stack

Presidio: Core engine for PII identification and anonymization.
spaCy & Hugging Face Transformers: NLP and Named Entity Recognition (NER).
Pandas: Structured data processing (CSV, XLSX).
PyMuPDF & python-docx: PDF and DOCX parsing.
Pytesseract: OCR for text extraction from images.
ijson: Streaming large JSON files.
orjson: JSON serialization/deserialization.
openpyxl: Excel file processing.
lxml: XML parsing and processing.

Anonymization Mechanism

For each detected entity:

Normalize entity text (remove extra spaces).
Generate an HMAC-SHA256 hash using ANON_SECRET_KEY.
Store the full hash (64 characters) as a unique identifier in the database.
Replace the entity in text with a slug of configurable length (e.g., [PERSON_a1b2c3d4]).

The same entity always produces the same slug, maintaining referential consistency across the anonymized output.

Database Schema

SQLite database at db/entities.db:

Column	Type	Description
`id`	INTEGER	Primary key
`entity_type`	TEXT	Entity type (e.g., `PERSON`, `LOCATION`)
`original_name`	TEXT	Original entity text
`slug_name`	TEXT	Short hash displayed in anonymized output
`full_hash`	TEXT	Full HMAC-SHA256 hash (UNIQUE)
`first_seen`	TEXT	Timestamp of first detection
`last_seen`	TEXT	Timestamp of last detection

Core Components

1. CLI Layer (`anon.py`)

The composition root — parses arguments, instantiates and wires all core components (CacheManager, HashGenerator, EntityDetector, DatabaseContext), injects dependencies into AnonymizationOrchestrator, dispatches files to processors, and generates performance reports.

2. Anonymization Orchestrator (`engine.py`)

Central coordinator. Responsibilities:

Initializes Presidio AnalyzerEngine and AnonymizerEngine.
Selects and injects dependencies into the chosen strategy.
Manages the batch fallback mechanism.
Collects entity statistics for reporting.

3. File Processors (`processors.py`)

Template Method Pattern with a base FileProcessor and specialized subclasses:

Processor	Handles
`TextFileProcessor`	`.txt`, `.log` — line-by-line
`ImageFileProcessor`	Images — OCR extraction
`DocxFileProcessor`	`.docx` — paragraphs + embedded images
`PdfFileProcessor`	`.pdf` — text blocks + images
`CsvFileProcessor`	`.csv` — column-wise with translation maps
`XlsxFileProcessor`	`.xlsx` — in-memory workbook processing
`XmlFileProcessor`	`.xml` — structure-preserving with XPath tracking
`JsonFileProcessor`	`.json`, `.jsonl` — hybrid streaming/in-memory

JSON Processing Modes:

JSONL: line-by-line streaming
Small JSON (<100 MB): in-memory
Large JSON arrays: ijson streaming
Fallback to in-memory if streaming fails

4. Database Layer

Repository Pattern (repository.py): EntityRepository handles connection management (thread-local storage), schema initialization, batch insertion with INSERT OR IGNORE, and entity lookup by slug.

Thread-Safe Queue (database.py): All writes go through a queue.Queue consumed by a dedicated background writer thread, preventing DB write locks from blocking processing. Graceful shutdown ensures the queue is fully drained before exit.

Processing Pipeline

Anonymization Pipeline

Should Anonymize Check: Config-based exclusion → forced anonymization → text filters (stoplist, min length, numeric) → explicit/implicit mode.
Entity Detection: spaCy NER + Transformer (XLM-RoBERTa) + custom regex recognizers → merge and deduplicate.
Hash Generation: Normalize → HMAC-SHA256 with secret key → create slug.
Database Storage: Queue entity for async write.
Text Replacement: Replace entity with [TYPE_hash].

Structure Preservation

JSON/XML: Parse tree → collect strings by path → create translation map → reconstruct tree.

CSV/XLSX: Process unique values per column → create translation map → apply vectorized transformations → preserve headers.

Memory Management

PDF: Page-by-page with explicit cleanup (page.clean_contents(), del page).
JSON: ijson streaming for large arrays; line-by-line for JSONL.
CSV/XLSX: Chunked Pandas reads; XLSX iterates cells without loading full workbook.
GC Control: --disable-gc disables automatic GC for large single files; explicit gc.collect() calls are placed strategically.

Caching Strategy

LRU cache (collections.OrderedDict):

Configurable size via --max-cache-size (default: 10,000 items).
Enabled by default; disable with --no-use-cache.
Caches (original_text → anonymized_slug) pairs to avoid redundant detection and hashing.

Fallback Architecture

After batch processing, the orchestrator verifies input count == output count. On mismatch:

_safe_fallback_processing re-processes items one-by-one.
Errors are logged; problematic items return original text to preserve structure.
Prevents misaligned output in structured files (CSV, JSON, XML) and accidental PII exposure.

Repository Structure

.
├── anon.py                          # CLI entry point
├── pyproject.toml                   # Project metadata and dependencies
├── uv.lock                          # Dependency lock file
├── run.sh                           # Docker orchestration script
│
├── examples/
│   ├── anonymization_config.json    # Default anonymization config
│   ├── anonymization_config_cve.json # CVE-specific config example
│   ├── word_list.example.json       # Word list format example
│   └── exemplo.docx / exemplo.xlsx  # Sample documents
│
├── docker/
│   ├── Dockerfile                   # Multi-stage build (CPU + GPU)
│   ├── docker-compose.yml           # Service profiles
│   └── docker-entrypoint.sh        # Container entrypoint
│
├── src/anon/                        # Core library
│   ├── config.py                    # Entity mappings, language lists
│   ├── engine.py                    # AnonymizationOrchestrator
│   ├── strategies.py                # FullPresidio, Filtered, Hybrid strategies
│   ├── standalone_strategy.py       # StandaloneStrategy
│   ├── entity_detector.py           # NER entity detection
│   ├── processors.py                # File processors
│   ├── repository.py                # EntityRepository (SQLite)
│   ├── database.py                  # Thread-safe DB writer queue
│   ├── hash_generator.py            # HMAC-SHA256 hash generation
│   ├── cache_manager.py             # LRU cache
│   ├── security.py                  # Key validation
│   ├── model_manager.py             # Model loading and management
│   ├── tqdm_handler.py              # Progress bar handler
│   ├── core/
│   │   ├── config_loader.py         # Configuration loading
│   │   └── protocols.py             # Protocol interfaces
│   ├── slm/                         # Small Language Model integration
│   │   ├── client.py                # OllamaClient (SLMClient protocol)
│   │   ├── prompts.py               # PromptManager
│   │   ├── ollama_manager.py        # Ollama process management
│   │   ├── anonymizers/
│   │   │   └── slm_anonymizer.py    # End-to-end SLM anonymization
│   │   ├── detectors/
│   │   │   └── slm_detector.py      # SLM as entity detector
│   │   └── mappers/
│   │       └── entity_mapper.py     # SLM entity mapping
│   └── evaluation/                  # Evaluation support
│       ├── ground_truth.py          # Ground truth loading
│       ├── hash_tracker.py          # Hash tracking for evaluation
│       └── metrics_calculator.py    # TP/FP/FN metrics
│
├── scripts/                         # Utility scripts
│   ├── deanonymize.py               # Controlled de-anonymization
│   ├── evaluate.py                  # Evaluation metrics
│   ├── create_ground_truth.py       # Ground truth generation
│   ├── sample.py                    # Data sampling
│   ├── generate_cve_dataset.py      # CVE dataset generation
│   ├── analyze_entity_map.py        # Entity map analysis
│   ├── cluster_entities.py          # Entity clustering (HDBSCAN)
│   ├── get_metrics.py               # Performance statistics
│   ├── export_and_clear_db.py       # DB export/clear
│   └── utils.py                     # Shared utilities
│
├── tests/                           # Test suite
├── benchmark/                       # Benchmarking suite
│   └── README.md                    # Benchmark documentation
└── docs/                            # Documentation
    └── developers/
        ├── ARCHITECTURE.md
        ├── ANONYMIZATION_STRATEGIES.md
        ├── EXTENSIBILITY.md
        ├── SLM_INTEGRATION_GUIDE.md
        └── UTILITY_SCRIPTS_GUIDE.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Reference

System Architecture

Technology Stack

Anonymization Mechanism

Database Schema

Core Components

1. CLI Layer (`anon.py`)

2. Anonymization Orchestrator (`engine.py`)

3. File Processors (`processors.py`)

4. Database Layer

Processing Pipeline

Anonymization Pipeline

Structure Preservation

Memory Management

Caching Strategy

Fallback Architecture

Repository Structure

See Also

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

Architecture Reference

System Architecture

Technology Stack

Anonymization Mechanism

Database Schema

Core Components

1. CLI Layer (anon.py)

2. Anonymization Orchestrator (engine.py)

3. File Processors (processors.py)

4. Database Layer

Processing Pipeline

Anonymization Pipeline

Structure Preservation

Memory Management

Caching Strategy

Fallback Architecture

Repository Structure

See Also

1. CLI Layer (`anon.py`)

2. Anonymization Orchestrator (`engine.py`)

3. File Processors (`processors.py`)