Commit caf8c06

Merge pull request #69 from DataFog/feat/regex-fallback (Feat/regex fallback)

2 parents: ef44a86 + 3a4381f

22 files changed: +2820 −37 lines

.github/workflows/benchmark.yml

Lines changed: 82 additions & 0 deletions

```yaml
name: Performance Benchmarks

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
  # Schedule benchmarks to run weekly
  schedule:
    - cron: "0 0 * * 0" # Run at midnight on Sundays

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # Fetch all history for proper comparison

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .
          pip install -r requirements-dev.txt
          pip install pytest-benchmark

      - name: Restore benchmark data
        uses: actions/cache@v3
        with:
          path: .benchmarks
          key: benchmark-${{ runner.os }}-${{ hashFiles('**/requirements*.txt') }}
          restore-keys: |
            benchmark-${{ runner.os }}-

      - name: Run benchmarks and save baseline
        run: |
          # Run benchmarks and save results
          pytest tests/benchmark_text_service.py -v --benchmark-autosave

      - name: Check for performance regression
        run: |
          # Compare against the previous benchmark if available
          # Fail if performance degrades by more than 10%
          # Directory where pytest-benchmark saves its results
          benchmark_dir=".benchmarks/Linux-CPython-3.10-64bit"
          if [ -d ".benchmarks" ]; then
            BASELINE=$(ls -t "$benchmark_dir" | head -n 2 | tail -n 1)
            CURRENT=$(ls -t "$benchmark_dir" | head -n 1)
            if [ -n "$BASELINE" ] && [ "$BASELINE" != "$CURRENT" ]; then
              # Set full paths to the benchmark files
              BASELINE_FILE="$benchmark_dir/$BASELINE"
              CURRENT_FILE="$benchmark_dir/$CURRENT"

              echo "Comparing current run ($CURRENT) against baseline ($BASELINE)"
              # First just show the comparison
              pytest tests/benchmark_text_service.py --benchmark-compare

              # Then check for significant regressions
              echo "Checking for performance regressions (>10% slower)..."
              # Use our Python script for benchmark comparison
              python scripts/compare_benchmarks.py "$BASELINE_FILE" "$CURRENT_FILE"
            else
              echo "No previous benchmark found for comparison or only one benchmark exists"
            fi
          else
            echo "No benchmarks directory found"
          fi

      - name: Upload benchmark results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: .benchmarks/

      - name: Alert on regression
        if: failure()
        run: |
          echo "::warning::Performance regression detected! Check benchmark results."
```
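The workflow delegates the actual threshold check to `scripts/compare_benchmarks.py`, whose contents are not part of this diff. As a rough sketch of the kind of comparison such a script could perform (the function names and example numbers below are invented for illustration), pytest-benchmark's saved JSON stores per-benchmark stats under a `benchmarks` key:

```python
import json

REGRESSION_THRESHOLD = 1.10  # fail if a benchmark's mean runtime grows >10%

def load_means(path):
    """Map benchmark name -> mean runtime from a pytest-benchmark JSON file."""
    with open(path) as f:
        data = json.load(f)
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}

def find_regressions(baseline, current):
    """Return (name, slowdown_ratio) for benchmarks past the threshold."""
    regressions = []
    for name, base_mean in baseline.items():
        cur_mean = current.get(name)
        if cur_mean is not None and cur_mean > base_mean * REGRESSION_THRESHOLD:
            regressions.append((name, cur_mean / base_mean))
    return regressions

# Demo with in-memory data (file-based runs go through load_means):
baseline = {"test_regex": 0.004, "test_spacy": 0.48}
current = {"test_regex": 0.0045, "test_spacy": 0.47}
print(find_regressions(baseline, current))  # test_regex regressed ~12.5%
```

A real script would exit non-zero when `find_regressions` returns anything, which is what lets the workflow step fail the build.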

.github/workflows/wheel_size.yml

Lines changed: 47 additions & 0 deletions

```yaml
name: Wheel Size Check

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  check-wheel-size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
          cache: "pip"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install build wheel

      - name: Build wheel
        run: python -m build --wheel

      - name: Check wheel size
        run: |
          WHEEL_PATH=$(find dist -name "*.whl")
          WHEEL_SIZE=$(du -m "$WHEEL_PATH" | cut -f1)
          echo "Wheel size: $WHEEL_SIZE MB"
          if [ "$WHEEL_SIZE" -ge 8 ]; then
            echo "::error::Wheel size exceeds 8 MB limit: $WHEEL_SIZE MB"
            exit 1
          else
            echo "::notice::Wheel size is within limit: $WHEEL_SIZE MB"
          fi

      - name: Upload wheel artifact
        uses: actions/upload-artifact@v3
        with:
          name: wheel
          path: dist/*.whl
```
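The same size gate can be checked locally without CI. A minimal sketch in Python (the helper below is illustrative, not part of DataFog; it uses a throwaway file in place of a real wheel):

```python
import os
import tempfile

MAX_WHEEL_MB = 8  # same limit the workflow enforces

def check_wheel_size(path, limit_mb=MAX_WHEEL_MB):
    """Return (size_mb, within_limit) for the file at `path`."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb, size_mb < limit_mb

# Demo with a 2 MB dummy file instead of a built wheel:
with tempfile.NamedTemporaryFile(suffix=".whl", delete=False) as f:
    f.write(b"\0" * (2 * 1024 * 1024))
size_mb, ok = check_wheel_size(f.name)
print(f"{size_mb:.1f} MB, within limit: {ok}")  # 2.0 MB, within limit: True
os.unlink(f.name)
```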

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -36,4 +36,6 @@ error_log.txt
 docs/*
 !docs/*.rst
 !docs/conf.py
-scratch.py
+scratch.py
+.coverage*
+.benchmarks
```

CHANGELOG.MD

Lines changed: 15 additions & 1 deletion

```diff
@@ -1,8 +1,22 @@
 # ChangeLog
 
+## [2025-05-02]
+
+### `datafog-python` [4.1.0]
+
+- Added engine selection functionality to the TextService class, allowing users to choose between 'regex', 'spacy', or 'auto' annotation engines
+- Enhanced TextService with an intelligent fallback mechanism in 'auto' mode that tries regex first and falls back to spaCy if no entities are found
+- Added comprehensive integration tests for the new engine selection feature
+- Implemented performance benchmarks showing the regex engine is ~123x faster than spaCy
+- Added a CI pipeline for continuous performance monitoring with regression detection
+- Added a wheel-size gate (< 8 MB) to the CI pipeline
+- Added 'When do I need spaCy?' guidance to the documentation
+- Created scripts for running benchmarks locally and comparing results
+- Improved documentation with performance metrics and engine selection guidance
+
 ## [2024-03-25]
 
-### `datafog-python` [2.3.2]
+### `datafog-python` [4.0.0]
 
 - Added datafog-python/examples/uploading-file-types.ipynb to show JSON uploading example (#16)
 - Added datafog-python/tests/regex_issue.py to show issue with regex recognizer creation
```

README.md

Lines changed: 90 additions & 0 deletions

````diff
@@ -190,6 +190,29 @@ client = DataFog(operations="scan")
 ocr_client = DataFog(operations="extract")
 ```
 
+## Engine Selection
+
+DataFog now supports multiple annotation engines through the `TextService` class. You can choose between different engines for PII detection:
+
+```python
+from datafog.services.text_service import TextService
+
+# Use regex engine only (fastest, pattern-based detection)
+regex_service = TextService(engine="regex")
+
+# Use spaCy engine only (more comprehensive NLP-based detection)
+spacy_service = TextService(engine="spacy")
+
+# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
+auto_service = TextService()  # engine="auto" is the default
+```
+
+Each engine has different strengths:
+
+- **regex**: Fast pattern matching, good for structured data like emails, phone numbers, credit cards, etc.
+- **spacy**: NLP-based entity recognition, better for detecting names, organizations, locations, etc.
+- **auto**: Best of both worlds - uses regex for speed, falls back to spaCy for comprehensive detection
+
 ## Text PII Annotation
 
 Here's an example of how to annotate PII in a text document:
````
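The 'auto' fallback the README describes can be pictured with a small stand-in. The real `TextService` internals are not shown in this diff; `regex_annotate` and `spacy_annotate` below are hypothetical stubs that only illustrate the control flow:

```python
import re

def regex_annotate(text):
    """Hypothetical stand-in: fast pattern-based detection of structured PII."""
    email_re = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    return [("EMAIL", m.group()) for m in email_re.finditer(text)]

def spacy_annotate(text):
    """Hypothetical stand-in for the slower NLP-based engine."""
    return [("PERSON", "Jane Doe")] if "Jane Doe" in text else []

def auto_annotate(text):
    """'auto' mode: try regex first, fall back to spaCy only if nothing is found."""
    entities = regex_annotate(text)
    return entities if entities else spacy_annotate(text)

print(auto_annotate("contact: jane@example.com"))  # regex hit, spaCy is skipped
print(auto_annotate("Jane Doe joined in May"))     # no regex hit, falls back to spaCy
```

The design choice this illustrates: the expensive engine is only invoked on the (rarer) inputs where the cheap engine finds nothing, which is why 'auto' benchmarks at regex speed on pattern-rich text.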
````diff
@@ -300,6 +323,73 @@ Output:
 
 You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the `hash_type` parameter
 
+## Performance
+
+DataFog provides multiple annotation engines with different performance characteristics:
+
+### Engine Selection
+
+The `TextService` class supports three engine modes:
+
+```python
+# Use regex engine only (fastest, pattern-based detection)
+regex_service = TextService(engine="regex")
+
+# Use spaCy engine only (more comprehensive NLP-based detection)
+spacy_service = TextService(engine="spacy")
+
+# Use auto mode (default) - tries regex first, falls back to spaCy if no entities found
+auto_service = TextService()  # engine="auto" is the default
+```
+
+### Performance Comparison
+
+Benchmark tests show that the regex engine is significantly faster than spaCy for PII detection:
+
+| Engine | Processing Time (10KB text) | Entities Detected                                    |
+| ------ | --------------------------- | ---------------------------------------------------- |
+| Regex  | ~0.004 seconds              | EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, DOB, ZIP |
+| SpaCy  | ~0.48 seconds               | PERSON, ORG, GPE, CARDINAL, FAC                      |
+| Auto   | ~0.004 seconds              | Same as regex when patterns are found                |
+
+**Key findings:**
+
+- The regex engine is approximately **123x faster** than spaCy for processing the same text
+- The auto engine provides the best balance between speed and comprehensiveness
+  - Uses fast regex patterns first
+  - Falls back to spaCy only when no regex patterns are matched
+
+### When to Use Each Engine
+
+- **Regex Engine**: Use when processing large volumes of text or when performance is critical
+- **SpaCy Engine**: Use when you need to detect a wider range of named entities beyond structured PII
+- **Auto Engine**: Recommended for most use cases as it combines the speed of regex with the capability to fall back to spaCy when needed
+
+### When do I need spaCy?
+
+While the regex engine is significantly faster (123x faster in our benchmarks), there are specific scenarios where you might want to use spaCy:
+
+1. **Complex entity recognition**: When you need to identify entities not covered by regex patterns, such as organization names, locations, or product names that don't follow predictable formats.
+
+2. **Context-aware detection**: When the meaning of text depends on surrounding context that regex cannot easily capture, such as distinguishing between a person's name and a company with the same name based on context.
+
+3. **Multi-language support**: When processing text in languages other than English where regex patterns might be insufficient or need significant customization.
+
+4. **Research and exploration**: When experimenting with NLP capabilities and you need the full power of a dedicated NLP library with features like part-of-speech tagging, dependency parsing, etc.
+
+5. **Unknown entity types**: When you don't know in advance what types of entities might be present in your text and need a more general-purpose entity recognition approach.
+
+For high-performance production systems processing large volumes of text with known entity types (emails, phone numbers, credit cards, etc.), the regex engine is strongly recommended due to its significant speed advantage.
+
+### Running Benchmarks Locally
+
+You can run the performance benchmarks locally using pytest-benchmark:
+
+```bash
+pip install pytest-benchmark
+pytest tests/benchmark_text_service.py -v
+```
+
 ## Examples
 
 For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:
````
Lines changed: 7 additions & 0 deletions

```python
from datafog.processing.text_processing.regex_annotator.regex_annotator import (
    AnnotationResult,
    RegexAnnotator,
    Span,
)

__all__ = ["RegexAnnotator", "Span", "AnnotationResult"]
```
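This module re-exports `RegexAnnotator`, `Span`, and `AnnotationResult`. As a rough illustration of the pattern-based approach behind them (the patterns and `annotate` function below are invented for this sketch, not DataFog's actual implementation):

```python
import re

# Illustrative patterns only - not DataFog's actual definitions
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def annotate(text):
    """Return (label, start, end) spans for every pattern match, in text order."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((label, m.start(), m.end()))
    return sorted(spans, key=lambda s: s[1])

print(annotate("SSN 123-45-6789, mail a@b.io"))
```

Precompiled patterns applied in a single pass over the text are what make this style of annotator so much cheaper than running a full NLP pipeline.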
