Feat/regex fallback #69

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged (16 commits) on May 3, 2025
82 changes: 82 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,82 @@

```yaml
name: Performance Benchmarks

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
  # Schedule benchmarks to run weekly
  schedule:
    - cron: "0 0 * * 0" # Run at midnight on Sundays

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # Fetch all history for proper comparison

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .
          pip install -r requirements-dev.txt
          pip install pytest-benchmark

      - name: Restore benchmark data
        uses: actions/cache@v3
        with:
          path: .benchmarks
          key: benchmark-${{ runner.os }}-${{ hashFiles('**/requirements*.txt') }}
          restore-keys: |
            benchmark-${{ runner.os }}-

      - name: Run benchmarks and save baseline
        run: |
          # Run benchmarks and save results
          pytest tests/benchmark_text_service.py -v --benchmark-autosave

      - name: Check for performance regression
        run: |
          # Compare against the previous benchmark if available.
          # Fail if performance degrades by more than 10%.
          BENCHMARK_DIR=".benchmarks/Linux-CPython-3.10-64bit"
          if [ -d ".benchmarks" ]; then
            BASELINE=$(ls -t "$BENCHMARK_DIR" | head -n 2 | tail -n 1)
            CURRENT=$(ls -t "$BENCHMARK_DIR" | head -n 1)
            if [ -n "$BASELINE" ] && [ "$BASELINE" != "$CURRENT" ]; then
              # Set full paths to the benchmark files
              BASELINE_FILE="$BENCHMARK_DIR/$BASELINE"
              CURRENT_FILE="$BENCHMARK_DIR/$CURRENT"

              echo "Comparing current run ($CURRENT) against baseline ($BASELINE)"
              # First just show the comparison
              pytest tests/benchmark_text_service.py --benchmark-compare

              # Then check for significant regressions
              echo "Checking for performance regressions (>10% slower)..."
              # Use our Python script for benchmark comparison
              python scripts/compare_benchmarks.py "$BASELINE_FILE" "$CURRENT_FILE"
            else
              echo "No previous benchmark found for comparison or only one benchmark exists"
            fi
          else
            echo "No benchmarks directory found"
          fi

      - name: Upload benchmark results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: .benchmarks/

      - name: Alert on regression
        if: failure()
        run: |
          echo "::warning::Performance regression detected! Check benchmark results."
```
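
The regression step above delegates the actual comparison to `scripts/compare_benchmarks.py`, which is not shown in this diff. A minimal sketch of what such a script might look like, assuming pytest-benchmark's saved JSON format (a top-level `benchmarks` list whose entries carry a `name` and a `stats` dict with a `mean` runtime), could be:

```python
import json
import sys


def compare(baseline: dict, current: dict, threshold: float = 0.10) -> list:
    """Return names of benchmarks whose mean runtime regressed by more
    than `threshold` (e.g. 0.10 = 10%) relative to the baseline run."""
    baseline_means = {b["name"]: b["stats"]["mean"] for b in baseline["benchmarks"]}
    regressions = []
    for bench in current["benchmarks"]:
        name = bench["name"]
        if name not in baseline_means:
            continue  # new benchmark, nothing to compare against
        old, new = baseline_means[name], bench["stats"]["mean"]
        if new > old * (1 + threshold):
            regressions.append(name)
    return regressions


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        baseline = json.load(f)
    with open(sys.argv[2]) as f:
        current = json.load(f)
    failed = compare(baseline, current)
    for name in failed:
        print(f"REGRESSION: {name}")
    # Non-zero exit makes the CI step (and thus the job) fail
    sys.exit(1 if failed else 0)
```

Exiting non-zero is what triggers the `if: failure()` alert step in the workflow.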
47 changes: 47 additions & 0 deletions .github/workflows/wheel_size.yml
@@ -0,0 +1,47 @@

```yaml
name: Wheel Size Check

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  check-wheel-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install build wheel

      - name: Build wheel
        run: python -m build --wheel

      - name: Check wheel size
        run: |
          WHEEL_PATH=$(find dist -name "*.whl")
          WHEEL_SIZE=$(du -m "$WHEEL_PATH" | cut -f1)
          echo "Wheel size: $WHEEL_SIZE MB"
          if [ "$WHEEL_SIZE" -ge 8 ]; then
            echo "::error::Wheel size exceeds 8 MB limit: $WHEEL_SIZE MB"
            exit 1
          else
            echo "::notice::Wheel size is within limit: $WHEEL_SIZE MB"
          fi

      - name: Upload wheel artifact
        uses: actions/upload-artifact@v3
        with:
          name: wheel
          path: dist/*.whl
```
4 changes: 3 additions & 1 deletion .gitignore
```diff
@@ -36,4 +36,6 @@ error_log.txt
 docs/*
 !docs/*.rst
 !docs/conf.py
-scratch.py
+scratch.py
+.coverage*
+.benchmarks
```
16 changes: 15 additions & 1 deletion CHANGELOG.MD
```diff
@@ -1,8 +1,22 @@
 # ChangeLog
 
+## [2025-05-02]
+
+### `datafog-python` [4.1.0]
+
+- Added engine selection functionality to TextService class, allowing users to choose between 'regex', 'spacy', or 'auto' annotation engines
+- Enhanced TextService with intelligent fallback mechanism in 'auto' mode that tries regex first and falls back to spaCy if no entities are found
+- Added comprehensive integration tests for the new engine selection feature
+- Implemented performance benchmarks showing regex engine is ~123x faster than spaCy
+- Added CI pipeline for continuous performance monitoring with regression detection
+- Added wheel-size gate (< 8 MB) to CI pipeline
+- Added 'When do I need spaCy?' guidance to documentation
+- Created scripts for running benchmarks locally and comparing results
+- Improved documentation with performance metrics and engine selection guidance
+
 ## [2024-03-25]
 
-### `datafog-python` [2.3.2]
+### `datafog-python` [4.0.0]
 
 - Added datafog-python/examples/uploading-file-types.ipynb to show JSON uploading example (#16)
 - Added datafog-python/tests/regex_issue.py to show issue with regex recognizer creation
```
90 changes: 90 additions & 0 deletions README.md
@@ -190,6 +190,29 @@
client = DataFog(operations="scan")
ocr_client = DataFog(operations="extract")
```

## Engine Selection

DataFog now supports multiple annotation engines through the `TextService` class. You can choose between different engines for PII detection:

```python
from datafog.services.text_service import TextService

# Use regex engine only (fastest, pattern-based detection)
regex_service = TextService(engine="regex")

# Use spaCy engine only (more comprehensive NLP-based detection)
spacy_service = TextService(engine="spacy")

# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
auto_service = TextService() # engine="auto" is the default
```

Each engine has different strengths:

- **regex**: Fast pattern matching, good for structured data like emails, phone numbers, credit cards, etc.
- **spacy**: NLP-based entity recognition, better for detecting names, organizations, locations, etc.
- **auto**: Best of both worlds - uses regex for speed, falls back to spaCy for comprehensive detection
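
To illustrate the pattern-based approach, here is a self-contained sketch of the kind of matching a regex engine performs. These are simplified stand-in patterns, not DataFog's actual ones, which live in its regex annotator module:

```python
import re

# Simplified stand-ins for the kind of patterns a regex PII engine uses;
# DataFog's real patterns are more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text):
    """Return {entity_type: [matches]} for every pattern that hits."""
    return {
        label: rx.findall(text)
        for label, rx in PATTERNS.items()
        if rx.findall(text)
    }


sample = "Reach me at jane.doe@example.com or 555-867-5309."
print(find_pii(sample))
# → {'EMAIL': ['jane.doe@example.com'], 'PHONE': ['555-867-5309']}
```

Because this is a single pass of compiled regular expressions over the text, it avoids model loading and inference entirely, which is where the speed advantage over spaCy comes from.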

## Text PII Annotation

Here's an example of how to annotate PII in a text document:
@@ -300,6 +323,73 @@
Output:

You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the `hash_type` parameter

## Performance

DataFog provides multiple annotation engines with different performance characteristics:

### Engine Selection

The `TextService` class supports three engine modes:

```python
# Use regex engine only (fastest, pattern-based detection)
regex_service = TextService(engine="regex")

# Use spaCy engine only (more comprehensive NLP-based detection)
spacy_service = TextService(engine="spacy")

# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
auto_service = TextService() # engine="auto" is the default
```

### Performance Comparison

Benchmark tests show that the regex engine is significantly faster than spaCy for PII detection:

| Engine | Processing Time (10KB text) | Entities Detected |
| ------ | --------------------------- | ---------------------------------------------------- |
| Regex | ~0.004 seconds | EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DOB, ZIP |
| SpaCy | ~0.48 seconds | PERSON, ORG, GPE, CARDINAL, FAC |
| Auto | ~0.004 seconds | Same as regex when patterns are found |

**Key findings:**

- The regex engine is approximately **123x faster** than spaCy for processing the same text
- The auto engine provides the best balance between speed and comprehensiveness:
  - Uses fast regex patterns first
  - Falls back to spaCy only when no regex patterns are matched
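
The fallback behavior described above can be sketched in a few lines (a simplified model of auto mode, not the actual `TextService` implementation):

```python
def annotate_auto(text, regex_annotate, spacy_annotate):
    """Sketch of 'auto' mode: run the cheap regex pass first and only
    pay spaCy's inference cost when regex finds nothing."""
    entities = regex_annotate(text)
    if entities:  # regex found something: skip spaCy entirely
        return entities
    return spacy_annotate(text)


# Hypothetical annotators standing in for the real engines:
fast = lambda t: [("EMAIL", "a@b.com")] if "@" in t else []
slow = lambda t: [("PERSON", "Ada Lovelace")] if "Ada" in t else []

print(annotate_auto("mail a@b.com", fast, slow))  # takes the regex path
print(annotate_auto("Ada was here", fast, slow))  # falls back to spaCy
```

The key design consequence: on text dominated by structured PII, auto mode runs at regex speed, since the spaCy branch is never reached.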

### When to Use Each Engine

- **Regex Engine**: Use when processing large volumes of text or when performance is critical
- **SpaCy Engine**: Use when you need to detect a wider range of named entities beyond structured PII
- **Auto Engine**: Recommended for most use cases as it combines the speed of regex with the capability to fall back to spaCy when needed

### When do I need spaCy?

While the regex engine is significantly faster (123x faster in our benchmarks), there are specific scenarios where you might want to use spaCy:

1. **Complex entity recognition**: When you need to identify entities not covered by regex patterns, such as organization names, locations, or product names that don't follow predictable formats.

2. **Context-aware detection**: When the meaning of text depends on surrounding context that regex cannot easily capture, such as distinguishing between a person's name and a company with the same name based on context.

3. **Multi-language support**: When processing text in languages other than English where regex patterns might be insufficient or need significant customization.

4. **Research and exploration**: When experimenting with NLP capabilities and need the full power of a dedicated NLP library with features like part-of-speech tagging, dependency parsing, etc.

5. **Unknown entity types**: When you don't know in advance what types of entities might be present in your text and need a more general-purpose entity recognition approach.

For high-performance production systems processing large volumes of text with known entity types (emails, phone numbers, credit cards, etc.), the regex engine is strongly recommended due to its significant speed advantage.

### Running Benchmarks Locally

You can run the performance benchmarks locally using pytest-benchmark:

```bash
pip install pytest-benchmark
pytest tests/benchmark_text_service.py -v
```
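
If you only want a rough, dependency-free sanity check rather than pytest-benchmark's full statistics, a quick timing harness (shown here with a stand-in regex workload, since absolute numbers depend on your machine and text) could look like:

```python
import re
import time


def time_it(fn, arg, repeats=100):
    """Average wall-clock seconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


# Stand-in workload: a single compiled pattern over a repeated sample text
email_rx = re.compile(r"[\w.+-]+@[\w-]+\.\w+")
text = "contact a@b.com " * 1000

regex_seconds = time_it(email_rx.findall, text)
print(f"regex pass: {regex_seconds:.6f} s per call")
```

For publishable numbers, prefer pytest-benchmark: it handles warmup, calibration, and outlier statistics that a naive loop does not.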

## Examples

For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:
@@ -0,0 +1,7 @@

```python
from datafog.processing.text_processing.regex_annotator.regex_annotator import (
    AnnotationResult,
    RegexAnnotator,
    Span,
)

__all__ = ["RegexAnnotator", "Span", "AnnotationResult"]
```