| Auto | ~0.004 seconds | Same as regex when patterns are found |

**Key findings:**

- The regex engine is approximately **123x faster** than spaCy for processing the same text
- The auto engine provides the best balance between speed and comprehensiveness
  - Uses fast regex patterns first
  - Falls back to spaCy only when no regex patterns are matched

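The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: `annotate_auto`, `regex_annotate`, and `fake_nlp_annotate` are hypothetical names, the email pattern is simplified, and the NLP step is a stand-in function so the example runs without spaCy installed.

```python
import re
from typing import Callable, Dict, List

Annotations = Dict[str, List[str]]

# Simplified email pattern, for illustration only (assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_annotate(text: str) -> Annotations:
    """Fast path: return structured PII found by regex, or an empty dict."""
    emails = EMAIL_RE.findall(text)
    return {"EMAIL": emails} if emails else {}


def fake_nlp_annotate(text: str) -> Annotations:
    """Stand-in for a spaCy-backed annotator (hypothetical, not real NER)."""
    return {"ORG": ["Acme Corp"]} if "Acme Corp" in text else {}


def annotate_auto(
    text: str,
    fast: Callable[[str], Annotations],
    fallback: Callable[[str], Annotations],
) -> Annotations:
    """Try the fast engine first; call the fallback only if it finds nothing."""
    result = fast(text)
    if result:
        return result
    return fallback(text)
```

With this shape, the slower engine is only paid for on inputs where the regex pass comes back empty, which is consistent with the "Auto" row reporting regex-like timings when patterns are found.
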
### When to Use Each Engine

- **Regex Engine**: Use when processing large volumes of text or when performance is critical
- **spaCy Engine**: Use when you need to detect a wider range of named entities beyond structured PII
- **Auto Engine**: Recommended for most use cases because it combines the speed of regex with a fallback to spaCy when no regex pattern matches

### When do I need spaCy?

While the regex engine is significantly faster (123x faster in our benchmarks), there are specific scenarios where you might want to use spaCy:

1. **Complex entity recognition**: When you need to identify entities not covered by regex patterns, such as organization names, locations, or product names that don't follow predictable formats.

2. **Context-aware detection**: When the meaning of text depends on surrounding context that regex cannot easily capture, such as distinguishing a person's name from a company with the same name.

3. **Multi-language support**: When processing text in languages other than English, where regex patterns might be insufficient or need significant customization.

4. **Research and exploration**: When you are experimenting with NLP capabilities and need the full power of a dedicated NLP library, with features such as part-of-speech tagging and dependency parsing.

5. **Unknown entity types**: When you don't know in advance what types of entities might be present in your text and need a more general-purpose entity recognition approach.

For high-performance production systems processing large volumes of text with known entity types (emails, phone numbers, credit cards, etc.), the regex engine is strongly recommended due to its significant speed advantage.

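To make the distinction concrete, here is a small sketch of regex-based detection for known, structured entity types. The patterns and the `find_structured_pii` helper are hypothetical and deliberately simplified; they are not the patterns the regex engine actually uses.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns (assumptions); production patterns
# would need to be stricter and cover more formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def find_structured_pii(text: str) -> Dict[str, List[str]]:
    """Return every match per entity type, omitting types with no matches."""
    found: Dict[str, List[str]] = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


sample = "Call 555-123-4567 or email jane@example.org; card 4111 1111 1111 1111."
structured = find_structured_pii(sample)

# Free-form entities such as organizations or locations have no fixed
# format, so these patterns find nothing in this sentence.
unstructured = find_structured_pii("Acme Corp opened an office in Berlin")
```

The second call returning an empty result is the kind of case where a statistical NER model such as spaCy's is worth its extra cost.
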
### Running Benchmarks Locally

You can run the performance benchmarks locally using pytest-benchmark:

```bash
pip install pytest-benchmark
pytest tests/benchmark_text_service.py -v
```

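For a quick sanity check outside pytest, a rough comparison can also be made with the standard library's `timeit`. The sketch below is an assumption-laden illustration, not the project's benchmark: `regex_detect`, `average_seconds`, and `speedup` are hypothetical helpers, and the speedup is taken here as the slower engine's mean time divided by the faster engine's mean time.

```python
import re
import timeit
from typing import Callable, List

# Simplified email pattern, for illustration only (assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_detect(text: str) -> List[str]:
    """Hypothetical stand-in for a regex-based annotator."""
    return EMAIL_RE.findall(text)


def average_seconds(func: Callable[[str], object], text: str, runs: int = 100) -> float:
    """Mean wall-clock time of func(text) over `runs` calls."""
    total = timeit.timeit(lambda: func(text), number=runs)
    return total / runs


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast engine is than the slow one."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    text = "Contact alice@example.com for details. " * 100
    print(f"regex mean: {average_seconds(regex_detect, text):.6f} s")
```

To compare engines, time each annotator on the same text with `average_seconds` and pass the two means to `speedup`.
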
## Examples

For more detailed examples, check out our Jupyter notebooks in the `examples/` directory: