Feat/regex fallback #69

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged (16 commits) on May 3, 2025
82 changes: 82 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,82 @@

```yaml
name: Performance Benchmarks

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
  # Schedule benchmarks to run weekly
  schedule:
    - cron: "0 0 * * 0" # Run at midnight on Sundays

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # Fetch all history for proper comparison

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .
          pip install -r requirements-dev.txt
          pip install pytest-benchmark

      - name: Restore benchmark data
        uses: actions/cache@v3
        with:
          path: .benchmarks
          key: benchmark-${{ runner.os }}-${{ hashFiles('**/requirements*.txt') }}
          restore-keys: |
            benchmark-${{ runner.os }}-

      - name: Run benchmarks and save baseline
        run: |
          # Run benchmarks and save results
          pytest tests/benchmark_text_service.py -v --benchmark-autosave

      - name: Check for performance regression
        run: |
          # Compare against the previous benchmark if available.
          # Fail if performance degrades by more than 10%.
          BENCHMARK_DIR=".benchmarks/Linux-CPython-3.10-64bit"
          if [ -d ".benchmarks" ]; then
            BASELINE=$(ls -t "$BENCHMARK_DIR" | head -n 2 | tail -n 1)
            CURRENT=$(ls -t "$BENCHMARK_DIR" | head -n 1)
            if [ -n "$BASELINE" ] && [ "$BASELINE" != "$CURRENT" ]; then
              # Set full paths to the benchmark files
              BASELINE_FILE="$BENCHMARK_DIR/$BASELINE"
              CURRENT_FILE="$BENCHMARK_DIR/$CURRENT"

              echo "Comparing current run ($CURRENT) against baseline ($BASELINE)"
              # First just show the comparison
              pytest tests/benchmark_text_service.py --benchmark-compare

              # Then check for significant regressions
              echo "Checking for performance regressions (>10% slower)..."
              # Use our Python script for benchmark comparison
              python scripts/compare_benchmarks.py "$BASELINE_FILE" "$CURRENT_FILE"
            else
              echo "No previous benchmark found for comparison or only one benchmark exists"
            fi
          else
            echo "No benchmarks directory found"
          fi

      - name: Upload benchmark results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: .benchmarks/

      - name: Alert on regression
        if: failure()
        run: |
          echo "::warning::Performance regression detected! Check benchmark results."
```
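
The regression step above delegates the actual comparison to `scripts/compare_benchmarks.py`, which is not shown in this diff. A minimal sketch of what such a script might look like, assuming pytest-benchmark's saved JSON format (a top-level `benchmarks` list whose entries carry a `name` and a `stats` dict with a `mean` runtime), could be:

```python
import json
import sys


def compare(baseline: dict, current: dict, threshold: float = 0.10) -> list:
    """Return names of benchmarks whose mean runtime regressed by more
    than `threshold` (e.g. 0.10 = 10%) relative to the baseline run."""
    baseline_means = {b["name"]: b["stats"]["mean"] for b in baseline["benchmarks"]}
    regressions = []
    for bench in current["benchmarks"]:
        name = bench["name"]
        if name not in baseline_means:
            continue  # new benchmark, nothing to compare against
        old, new = baseline_means[name], bench["stats"]["mean"]
        if new > old * (1 + threshold):
            regressions.append(name)
    return regressions


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        baseline = json.load(f)
    with open(sys.argv[2]) as f:
        current = json.load(f)
    failed = compare(baseline, current)
    for name in failed:
        print(f"REGRESSION: {name}")
    # Non-zero exit makes the CI step (and thus the job) fail
    sys.exit(1 if failed else 0)
```

Exiting non-zero is what triggers the `if: failure()` alert step in the workflow.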
47 changes: 47 additions & 0 deletions .github/workflows/wheel_size.yml
@@ -0,0 +1,47 @@

```yaml
name: Wheel Size Check

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  check-wheel-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install build wheel

      - name: Build wheel
        run: python -m build --wheel

      - name: Check wheel size
        run: |
          WHEEL_PATH=$(find dist -name "*.whl")
          WHEEL_SIZE=$(du -m "$WHEEL_PATH" | cut -f1)
          echo "Wheel size: $WHEEL_SIZE MB"
          if [ "$WHEEL_SIZE" -ge 8 ]; then
            echo "::error::Wheel size exceeds 8 MB limit: $WHEEL_SIZE MB"
            exit 1
          else
            echo "::notice::Wheel size is within limit: $WHEEL_SIZE MB"
          fi

      - name: Upload wheel artifact
        uses: actions/upload-artifact@v3
        with:
          name: wheel
          path: dist/*.whl
```
4 changes: 3 additions & 1 deletion .gitignore
```diff
@@ -36,4 +36,6 @@ error_log.txt
 docs/*
 !docs/*.rst
 !docs/conf.py
-scratch.py
+scratch.py
+.coverage*
+.benchmarks
```
16 changes: 15 additions & 1 deletion CHANGELOG.MD
```diff
@@ -1,8 +1,22 @@
 # ChangeLog
 
+## [2025-05-02]
+
+### `datafog-python` [4.1.0]
+
+- Added engine selection functionality to TextService class, allowing users to choose between 'regex', 'spacy', or 'auto' annotation engines
+- Enhanced TextService with intelligent fallback mechanism in 'auto' mode that tries regex first and falls back to spaCy if no entities are found
+- Added comprehensive integration tests for the new engine selection feature
+- Implemented performance benchmarks showing regex engine is ~123x faster than spaCy
+- Added CI pipeline for continuous performance monitoring with regression detection
+- Added wheel-size gate (< 8 MB) to CI pipeline
+- Added 'When do I need spaCy?' guidance to documentation
+- Created scripts for running benchmarks locally and comparing results
+- Improved documentation with performance metrics and engine selection guidance
+
 ## [2024-03-25]
 
-### `datafog-python` [2.3.2]
+### `datafog-python` [4.0.0]
 
 - Added datafog-python/examples/uploading-file-types.ipynb to show JSON uploading example (#16)
 - Added datafog-python/tests/regex_issue.py to show issue with regex recognizer creation
```
90 changes: 90 additions & 0 deletions README.md
@@ -190,6 +190,29 @@
client = DataFog(operations="scan")
ocr_client = DataFog(operations="extract")
```

## Engine Selection

DataFog now supports multiple annotation engines through the `TextService` class. You can choose between different engines for PII detection:

```python
from datafog.services.text_service import TextService

# Use regex engine only (fastest, pattern-based detection)
regex_service = TextService(engine="regex")

# Use spaCy engine only (more comprehensive NLP-based detection)
spacy_service = TextService(engine="spacy")

# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
auto_service = TextService() # engine="auto" is the default
```

Each engine has different strengths:

- **regex**: Fast pattern matching, good for structured data like emails, phone numbers, credit cards, etc.
- **spacy**: NLP-based entity recognition, better for detecting names, organizations, locations, etc.
- **auto**: Best of both worlds - uses regex for speed, falls back to spaCy for comprehensive detection
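
To illustrate the pattern-based approach, here is a self-contained sketch of the kind of matching a regex engine performs. These are simplified stand-in patterns, not DataFog's actual ones, which live in its regex annotator module:

```python
import re

# Simplified stand-ins for the kind of patterns a regex PII engine uses;
# DataFog's real patterns are more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text):
    """Return {entity_type: [matches]} for every pattern that hits."""
    return {
        label: rx.findall(text)
        for label, rx in PATTERNS.items()
        if rx.findall(text)
    }


sample = "Reach me at jane.doe@example.com or 555-867-5309."
print(find_pii(sample))
# → {'EMAIL': ['jane.doe@example.com'], 'PHONE': ['555-867-5309']}
```

Because this is a single pass of compiled regular expressions over the text, it avoids model loading and inference entirely, which is where the speed advantage over spaCy comes from.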

## Text PII Annotation

Here's an example of how to annotate PII in a text document:
@@ -300,6 +323,73 @@
Output:

You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the `hash_type` parameter

## Performance

DataFog provides multiple annotation engines with different performance characteristics:

### Engine Selection

The `TextService` class supports three engine modes:

```python
# Use regex engine only (fastest, pattern-based detection)
regex_service = TextService(engine="regex")

# Use spaCy engine only (more comprehensive NLP-based detection)
spacy_service = TextService(engine="spacy")

# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
auto_service = TextService() # engine="auto" is the default
```

### Performance Comparison

Benchmark tests show that the regex engine is significantly faster than spaCy for PII detection:

| Engine | Processing Time (10KB text) | Entities Detected |
| ------ | --------------------------- | ---------------------------------------------------- |
| Regex | ~0.004 seconds | EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DOB, ZIP |
| SpaCy | ~0.48 seconds | PERSON, ORG, GPE, CARDINAL, FAC |
| Auto | ~0.004 seconds | Same as regex when patterns are found |

**Key findings:**

- The regex engine is approximately **123x faster** than spaCy for processing the same text
- The auto engine provides the best balance between speed and comprehensiveness:
  - Uses fast regex patterns first
  - Falls back to spaCy only when no regex patterns are matched
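
The fallback behavior described above can be sketched in a few lines (a simplified model of auto mode, not the actual `TextService` implementation):

```python
def annotate_auto(text, regex_annotate, spacy_annotate):
    """Sketch of 'auto' mode: run the cheap regex pass first and only
    pay spaCy's inference cost when regex finds nothing."""
    entities = regex_annotate(text)
    if entities:  # regex found something: skip spaCy entirely
        return entities
    return spacy_annotate(text)


# Hypothetical annotators standing in for the real engines:
fast = lambda t: [("EMAIL", "a@b.com")] if "@" in t else []
slow = lambda t: [("PERSON", "Ada Lovelace")] if "Ada" in t else []

print(annotate_auto("mail a@b.com", fast, slow))  # takes the regex path
print(annotate_auto("Ada was here", fast, slow))  # falls back to spaCy
```

The key design consequence: on text dominated by structured PII, auto mode runs at regex speed, since the spaCy branch is never reached.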

### When to Use Each Engine

- **Regex Engine**: Use when processing large volumes of text or when performance is critical
- **SpaCy Engine**: Use when you need to detect a wider range of named entities beyond structured PII
- **Auto Engine**: Recommended for most use cases as it combines the speed of regex with the capability to fall back to spaCy when needed

### When do I need spaCy?

While the regex engine is significantly faster (123x faster in our benchmarks), there are specific scenarios where you might want to use spaCy:

1. **Complex entity recognition**: When you need to identify entities not covered by regex patterns, such as organization names, locations, or product names that don't follow predictable formats.

2. **Context-aware detection**: When the meaning of text depends on surrounding context that regex cannot easily capture, such as distinguishing between a person's name and a company with the same name based on context.

3. **Multi-language support**: When processing text in languages other than English where regex patterns might be insufficient or need significant customization.

4. **Research and exploration**: When experimenting with NLP capabilities and need the full power of a dedicated NLP library with features like part-of-speech tagging, dependency parsing, etc.

5. **Unknown entity types**: When you don't know in advance what types of entities might be present in your text and need a more general-purpose entity recognition approach.

For high-performance production systems processing large volumes of text with known entity types (emails, phone numbers, credit cards, etc.), the regex engine is strongly recommended due to its significant speed advantage.

### Running Benchmarks Locally

You can run the performance benchmarks locally using pytest-benchmark:

```bash
pip install pytest-benchmark
pytest tests/benchmark_text_service.py -v
```
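
If you only want a rough, dependency-free sanity check rather than pytest-benchmark's full statistics, a quick timing harness (shown here with a stand-in regex workload, since absolute numbers depend on your machine and text) could look like:

```python
import re
import time


def time_it(fn, arg, repeats=100):
    """Average wall-clock seconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


# Stand-in workload: a single compiled pattern over a repeated sample text
email_rx = re.compile(r"[\w.+-]+@[\w-]+\.\w+")
text = "contact a@b.com " * 1000

regex_seconds = time_it(email_rx.findall, text)
print(f"regex pass: {regex_seconds:.6f} s per call")
```

For publishable numbers, prefer pytest-benchmark: it handles warmup, calibration, and outlier statistics that a naive loop does not.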

## Examples

For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:
@@ -0,0 +1,7 @@

```python
from datafog.processing.text_processing.regex_annotator.regex_annotator import (
    AnnotationResult,
    RegexAnnotator,
    Span,
)

__all__ = ["RegexAnnotator", "Span", "AnnotationResult"]
```