A: A lightweight rule-based NER library. Extracts named entities via simplematch patterns, regex, and pre-built annotators for common entity types (email, names, locations, dates, numbers, etc.).
A: 16 built-in annotators: email, names, locations (countries/capitals/cities), temporal (datetime/duration), numbers, lookup, URL, phone, currency, organization, hashtag, date. Custom types via RuleNER or RegexNER.
A: Python 3.10–3.13.
A: simple_NER is lighter and rule-based — no ML models, no training data, easy to customize. spaCy/NLTK are better when statistical accuracy matters more than speed and interpretability.
A: Language support varies by annotator:
| Annotator | Language coverage |
|---|---|
| TemporalNER, NumberNER | Multi-language via ovos-date-parser / ovos-number-parser; pass `lang="de-de"` etc. |
| DateAnnotator | Written month names in EN/ES/FR/DE/PT/IT/NL; numeric formats are language-agnostic |
| CurrencyAnnotator | Currency words in EN/ES/FR/DE/PT/IT/NL; symbols/ISO codes are language-agnostic |
| OrganizationAnnotator | Company suffixes for DE/FR/ES/PT/IT/NL/BE + English; accented Latin characters in name regex |
| HashtagAnnotator | Fully Unicode-aware (`re.UNICODE`) — matches hashtags in any script |
| LocationNER | Language-agnostic (bundled JSON city/country names) |
| EmailAnnotator, URLAnnotator, PhoneAnnotator | Language-agnostic (structural patterns) |
| LookUpNER | Per-language via `.entity` resource files under `res/<lang>/` |
Pass `lang` to any annotator or to `create_pipeline(names, lang="de-de")` — the factory forwards it to all constructors.
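That forwarding behaviour can be sketched in a few lines. The class and registry names below are stand-ins for illustration, not the library's own:

```python
# Sketch: a factory that forwards lang to every annotator constructor.
class FakeAnnotator:
    """Stand-in for a real annotator; only records the language."""
    def __init__(self, lang="en-us"):
        self.lang = lang

def create_pipeline_sketch(names, lang="en-us"):
    # The registry maps friendly names to annotator classes.
    registry = {"temporal": FakeAnnotator, "numbers": FakeAnnotator}
    # One constructor call per requested name, lang passed through.
    return [registry[n](lang=lang) for n in names]

pipeline = create_pipeline_sketch(["temporal", "numbers"], lang="de-de")
print([a.lang for a in pipeline])  # ['de-de', 'de-de']
```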
A: Yes. SimpleNERIntentTransformer (simple_NER.opm) is an opm.transformer.intent plugin. It runs after intent matching and injects extracted entities into intent.match_data. Plugin ID: simple-ner-transformer.
```json
{
  "intent_transformers": {
    "simple-ner-transformer": {
      "annotators": ["email", "names", "locations", "temporal", "numbers"],
      "confidence_threshold": 0.5
    }
  }
}
```

A:

```shell
pip install simple_NER
```

TemporalNER and NumberNER degrade gracefully when these parsers are missing, but datetime/number extraction won't work:
```shell
pip install ovos-date-parser ovos-number-parser
```

A:

```shell
pip install --user simple_NER
```

A:

```python
from simple_NER.rules import RuleNER

ner = RuleNER()
ner.add_rule("product", "I want {product}")
for ent in ner.extract_entities("I want iphone"):
    print(ent.value)  # iphone
```

A:

```python
from simple_NER.annotators.factory import create_pipeline

pipeline = create_pipeline(["email", "names", "locations"])
for ent in pipeline.process(text):
    print(ent.value, ent.entity_type)
```

A:

```python
from simple_NER.annotators.names_ner import NamesNER

ner = NamesNER(confidence_threshold=0.9)  # fewer, more confident matches
```

A:

```python
for ent in entities:
    print(ent.value, ent.spans)  # e.g. [(8, 24)]
```

A:

```python
import json

data = {"text": text, "entities": [e.as_json() for e in entities]}
with open("results.json", "w") as f:
    json.dump(data, f, indent=2)
```

A:

```python
from simple_NER.utils.batch import BatchProcessor

processor = BatchProcessor(pipeline, batch_size=100)
results = processor.process_batch(texts, use_multiprocessing=True)
```

A: Use streaming — process one text at a time without storing all results:

```python
from simple_NER.utils.batch import StreamingProcessor

for text, entities in StreamingProcessor(pipeline).process_stream(gen()):
    handle(entities)
```

A:

```python
from simple_NER.utils.cache import LRUCache

cache = LRUCache(max_size=1000)
entities = cache.get(text)
if entities is None:  # cache miss: process once, then store
    entities = pipeline.process(text)
    cache.set(text, entities)
```

A:

```python
from simple_NER.annotators.factory import list_available_annotators

print(list_available_annotators())  # check exact key names
```

A: Use await or asyncio.run():

```python
entities = await pipeline.process_async(text)
# or
entities = asyncio.run(pipeline.process_async(text))
```

A: Create locale/<lang>/phone.rx (one regex per line). PhoneAnnotator calls BaseAnnotator._load_rx("phone", lang) at init time and merges those patterns into its compiled set alongside the built-in en-us patterns.
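As an illustration of what that loading amounts to, each non-empty line of the file becomes one compiled regex. The two patterns below are made-up examples for a hypothetical `locale/de-de/phone.rx`, and `load_rx` is a stand-in for the real loader:

```python
import re

# Illustrative contents of locale/de-de/phone.rx -- one regex per line.
PHONE_RX_LINES = [
    r"\+49[ \-]?\d{2,4}[ \-]?\d{5,8}",  # international German format
    r"0\d{2,4}[ \-/]?\d{5,8}",          # national German format
]

def load_rx(lines):
    """Stand-in for the loader: compile each non-empty line."""
    return [re.compile(p) for p in lines if p.strip()]

patterns = load_rx(PHONE_RX_LINES)
text = "Ruf mich an: +49 30 901820"
matches = [m.group() for p in patterns for m in p.finditer(text)]
print(matches)  # ['+49 30 901820']
```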
A: Create locale/<lang>/currency.intent with one simplematch template per line (e.g. {amount} euros). CurrencyAnnotator loads these via BaseAnnotator._load_intents("currency", lang) and converts them to regex via intent_to_regex().
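To show what such a template-to-regex conversion can look like, here is a self-contained stand-in for intent_to_regex(); the real implementation may differ:

```python
import re

def intent_to_regex(template):
    """Illustrative stand-in: turn a simplematch template such as
    '{amount} euros' into a compiled regex with named groups."""
    pattern = re.escape(template)
    # re.escape turns '{amount}' into '\{amount\}'; swap each
    # escaped placeholder for a named capture group.
    pattern = re.sub(r"\\{(\w+)\\}", r"(?P<\1>\\S+)", pattern)
    return re.compile(pattern)

rx = intent_to_regex("{amount} euros")
m = rx.search("it costs 50 euros")
print(m.group("amount"))  # 50
```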
A: Yes, since v0.9.0. NERPipeline._deduplicate applies a longest-span-wins strategy across all annotator types. When two entities overlap (e.g. $500 matched as both money and written_number), only the entity with the larger span is kept. Exact-span duplicates with different labels are also de-duplicated.
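The strategy can be sketched independently of the library, with entities reduced to (label, start, end) tuples. This is an illustration of longest-span-wins, not NERPipeline's actual code:

```python
def deduplicate(entities):
    """Longest-span-wins sketch: overlapping spans keep only the
    longer entity; exact-span duplicates collapse to one entry."""
    # Consider longest spans first so shorter overlaps get discarded.
    ordered = sorted(entities, key=lambda e: e[2] - e[1], reverse=True)
    kept = []
    for label, start, end in ordered:
        overlaps = any(start < k_end and end > k_start
                       for _, k_start, k_end in kept)
        if not overlaps:
            kept.append((label, start, end))
    return sorted(kept, key=lambda e: e[1])

# "$500" matched both as money (span 0-4) and written_number (1-4):
print(deduplicate([("money", 0, 4), ("written_number", 1, 4)]))
# [('money', 0, 4)]
```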
A: Yes. Pass label_confidence={"City": 0.7, "Country": 0.95} at construction time. Labels not listed fall back to the global confidence parameter. Same interface applies to LocationNER.
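A minimal sketch of that fallback logic, with entities reduced to (label, score) pairs; the function and the 0.65 global default here are illustrative, not the library's:

```python
def filter_by_confidence(entities, label_confidence, confidence=0.65):
    """Keep entities whose score meets the per-label threshold;
    labels without an entry fall back to the global threshold."""
    return [(label, score) for label, score in entities
            if score >= label_confidence.get(label, confidence)]

ents = [("City", 0.72), ("Country", 0.9), ("City", 0.6)]
print(filter_by_confidence(ents, {"City": 0.7, "Country": 0.95}))
# [('City', 0.72)]
```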
- Most annotators are English-only (see language support Q above).
- `LocationNER` is case-sensitive by default; pass `lowercase=True` to disable.
- Overlapping spans: use `dedup_strategy="keep_longest"` in `NERPipeline`.
- Rule-based extraction only matches patterns you define — add more rules for more variation.
- README — quick start
- API Reference
- Tutorials
- Issues
A: simple_NER integrates with ahocorasick-ner, which provides fast multi-entity matching via pre-built HuggingFace dataset loaders. Install the optional dependency:
```shell
pip install "ahocorasick-ner[datasets]"
```

Then use dataset NER classes in a simple_NER pipeline:

```python
from ahocorasick_ner.datasets import WikidataEntityNER, BC5CDRMedicalNER
from simple_NER.annotators.ahocorasick_wrapper import AhocorasickAnnotatorWrapper
from simple_NER.pipeline import NERPipeline

pipeline = NERPipeline()

# Extract animals from Wikidata
animals = WikidataEntityNER(entity_type="Animal", wikidata_qid="Q729")
pipeline.add_annotator(AhocorasickAnnotatorWrapper(animals))

# Extract diseases from biomedical literature
diseases = BC5CDRMedicalNER(entity_type="Disease")
pipeline.add_annotator(AhocorasickAnnotatorWrapper(diseases))

for entity in pipeline.process("Dogs can get arthritis and heart disease"):
    print(entity.entity_type, entity.value)
# Animal Dogs
# Disease arthritis
# Disease heart disease
```

Available dataset loaders (in ahocorasick-ner):
Wikidata Entities (easy-access subclasses — no QID needed):
- `WikidataAnimalNER` — animals (1M+ names, all languages)
- `WikidataPlantNER` — plants (500k+ names)
- `WikidataCountryNER` — countries (195 names)
- `WikidataCityNER` — cities (worldwide)
- `WikidataPersonNER` — person names (100M+ from Q5)
- `WikidataProfessionNER` — professions/occupations
- `WikidataDiseaseNER` — diseases and medical conditions
- `WikidataLanguageNER` — languages of the world
- `WikidataSportNER` — sports and athletic activities
- `WikidataBodyPartNER` — anatomical body parts
- `WikidataFamilyRelationNER` — family relationships
- Or generic: `WikidataEntityNER(entity_type="...", wikidata_qid="...")` for custom QIDs
Names & Locations:
- `PersonNamesNER` — person surnames (30+ countries/languages)
- `GeoNamesNER` — 280k+ cities and locations worldwide
Generic HuggingFace datasets:
- `GenericHFDatasetNER` — any HF dataset with entities in a column
- `BC5CDRMedicalNER` — diseases and chemicals from biomedical NER
Media & Entertainment (Jarbas/TigreGotico):
- `MovieActorNER` — 6.3M movie actor names
- `MovieDirectorNER` — 128k movie director names
- `MovieComposerNER` — 221k movie composer names
- `MetalArchivesBandsNER` — 4.6k metal band names
- `MetalArchivesTrackNER` — 205k metal tracks + albums
- `JazzNER` — jazz artists and genres
- `ProgRockNER` — prog rock artists and genres
- `MusicNER` — comprehensive music dataset (all genres combined)
- `EncyclopediaMetallvmNER`, `ImdbNER` — pre-built combined datasets
See ahocorasick-ner docs for full list.
A: Many dataset loaders support filtering by column values. This reduces the automaton size and improves matching speed.
Supported loaders:
- `MetalArchivesBandsNER(origin="Portugal")` — metal bands from a country
- `MetalArchivesTrackNER(band_origin="Sweden")` — tracks from bands in a country
- `SpotifyTracksNER(genre="rock")` — tracks from a genre
- `GenericHFDatasetNER(..., filter_column="col", filter_value="val")` — any HF dataset
Example:
```python
from ahocorasick_ner.datasets import MetalArchivesBandsNER, SpotifyTracksNER
from simple_NER.annotators.ahocorasick_wrapper import AhocorasickAnnotatorWrapper
from simple_NER.pipeline import NERPipeline

pipeline = NERPipeline()

# Only Portuguese bands
pt_bands = MetalArchivesBandsNER(origin="Portugal")
pipeline.add_annotator(AhocorasickAnnotatorWrapper(pt_bands))

# Only rock tracks
rock = SpotifyTracksNER(genre="rock")
pipeline.add_annotator(AhocorasickAnnotatorWrapper(rock))

for entity in pipeline.process("Moonspell and Queen"):
    print(entity.entity_type, entity.value)
# metal_band Moonspell
# track_name (Queen songs, if present in the rock subset)
```

Benefits:
- 50–90% smaller automaton (less memory)
- Faster matching on domain-specific data
- Semantic clarity (fewer false positives)
See DATASET_INTEGRATION.md for more examples.
| Version | Notes |
|---|---|
| 0.9.0 | Major refactor: async, caching, CLI, 16 annotators, OVOS plugin |
| 0.8.1 | Type hints, linting, tests |
| 0.4.x | Original release |
Use `AhocorasickAnnotatorWrapper` from `simple_NER.annotators.ahocorasick_wrapper`:

```python
from ahocorasick_ner.datasets import WikidataAnimalNER
from simple_NER.annotators.ahocorasick_wrapper import AhocorasickAnnotatorWrapper
from simple_NER.pipeline import NERPipeline

pipeline = NERPipeline()
pipeline.add_annotator(AhocorasickAnnotatorWrapper(WikidataAnimalNER()))
for entity in pipeline.process("I saw a dog and a cat"):
    print(entity.entity_type, entity.value)
```

See DATASET_INTEGRATION.md for the full guide.
By design. Single capitalised words at sentence boundaries (position 0, or after .!?) score 0.55 confidence — below the default threshold of 0.65 — to suppress false positives like "Send" or "Meeting". Multi-word names ("John Doe") always score 0.85 regardless of position. Lower confidence_threshold to capture sentence-initial single names if needed:
```python
ner = NamesNER(confidence_threshold=0.5)
```
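The scoring rule can be sketched as follows. The 0.8 score for non-initial single words is an assumption for illustration; the text above only documents the 0.55 and 0.85 values:

```python
import re

def name_confidence(text, start, value):
    """Sketch of the rule above: multi-word names always score 0.85;
    a single capitalised word scores 0.55 when it opens a sentence
    (position 0 or right after . ! ?), else higher (0.8 assumed)."""
    if " " in value:
        return 0.85
    sentence_initial = (start == 0 or
                        re.search(r"[.!?]\s*$", text[:start]) is not None)
    return 0.55 if sentence_initial else 0.8

text = "Send the report to John Doe. Meeting at noon."
print(name_confidence(text, 0, "Send"))       # 0.55 (sentence-initial)
print(name_confidence(text, 19, "John Doe"))  # 0.85 (multi-word)
print(name_confidence(text, 29, "Meeting"))   # 0.55 (after '. ')
```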