edits based on PR-review
* add documentation and doc strings
* change yaml field names to be more logical
David Pollack committed Jul 15, 2020
1 parent 9a4b9f3 commit b23e551
Showing 10 changed files with 70 additions and 24 deletions.
32 changes: 32 additions & 0 deletions docs/development.md
@@ -179,3 +179,35 @@ Edit [charts/presidio/values.yaml](../charts/presidio/values.yaml) to:
- Setup secret name (for private registries)
- Change presidio services version
- Change default scale


## NLP Engine Configuration

1. The NLP engine to deploy is selected at startup from the yaml configuration files in `presidio-analyzer/conf/`. The default engine is SpaCy with the large English model (`en_core_web_lg`), as set in `default.yaml`.

2. The format of the yaml file is as follows (a short parsing sketch appears after this list):

```yaml
nlp_engine_name: spacy # {spacy, stanza}
models:
  -
    lang_code: en # code corresponds to `supported_language` in any custom recognizers
    model_name: en_core_web_lg # the name of the SpaCy or Stanza model
  -
    lang_code: de # additional models are optional; add one item per language
    model_name: de
```

3. By default, we call the `load_predefined_recognizers` method of the `RecognizerRegistry` class to load both language-specific and language-agnostic recognizers.

4. Downloading additional models:
* SpaCy NLP Models: [models download page](https://spacy.io/usage/models)
* Stanza NLP Models: [models download page](https://stanfordnlp.github.io/stanza/available_models.html)

```sh
# download models - tldr
# spacy
python -m spacy download en_core_web_lg
# stanza
python -c 'import stanza; stanza.download("en");'
```
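
For reference, here is a minimal sketch (not part of the configuration files themselves) of how a file in this format can be turned into the `{lang_code: model_name}` mapping that the NLP engine classes expect, assuming PyYAML is installed; it mirrors the dictionary-comprehension pattern used in `presidio_analyzer/app.py`:

```python
import yaml  # PyYAML, assumed to be installed

# Load the engine configuration (the path shown here is the default config).
with open("presidio-analyzer/conf/default.yaml") as f:
    nlp_conf = yaml.safe_load(f)

nlp_engine_name = nlp_conf["nlp_engine_name"]  # "spacy" or "stanza"
# Map each language code to its model name, e.g. {"en": "en_core_web_lg"}.
nlp_engine_opts = {m["lang_code"]: m["model_name"] for m in nlp_conf["models"]}
print(nlp_engine_name, nlp_engine_opts)
```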
4 changes: 2 additions & 2 deletions presidio-analyzer/conf/default.yaml
@@ -1,6 +1,6 @@
 nlp_engine_name: spacy
 models:
   -
-    name: en
-    lang: en_core_web_lg
+    lang_code: en
+    model_name: en_core_web_lg

4 changes: 2 additions & 2 deletions presidio-analyzer/conf/spacy.yaml
@@ -1,5 +1,5 @@
 nlp_engine_name: spacy
 models:
   -
-    lang: en
-    name: en
+    lang_code: en
+    model_name: en_core_web_sm
8 changes: 4 additions & 4 deletions presidio-analyzer/conf/spacy_multilingual.yaml
@@ -1,8 +1,8 @@
 nlp_engine_name: spacy
 models:
   -
-    name: en
-    lang: en
+    lang_code: en
+    model_name: en
   -
-    name: de
-    lang: de
+    lang_code: de
+    model_name: de
4 changes: 2 additions & 2 deletions presidio-analyzer/conf/stanza.yaml
@@ -1,6 +1,6 @@
 nlp_engine_name: stanza
 models:
   -
-    lang: en
-    name: en
+    lang_code: en
+    model_name: en

8 changes: 4 additions & 4 deletions presidio-analyzer/conf/stanza_multilingual.yaml
@@ -1,9 +1,9 @@
 nlp_engine_name: stanza
 models:
   -
-    lang: en
-    name: en
+    lang_code: en
+    model_name: en
   -
-    lang: de
-    name: de
+    lang_code: de
+    model_name: de

13 changes: 11 additions & 2 deletions presidio-analyzer/presidio_analyzer/app.py
@@ -65,6 +65,15 @@ def serve_command_handler(
     nlp_conf_path="conf/default.yaml",
     max_workers=10,
 ):
+    """
+    :param enable_trace_pii: boolean to enable trace pii
+    :param env_grpc_port: boolean to use environment variables
+        for grpc ports (default: False)
+    :param grpc_port: port for grpc server (default: 3000)
+    :param nlp_conf_path: str to path of nlp engine configuration
+        (default: 'conf/default.yaml')
+    :param max_workers: int for number of workers of grpc server (default: 10)
+    """
     logger.info("Starting GRPC server")
     server = grpc.server(futures.ThreadPoolExecutor(max_workers=max_workers))
     logger.info("GRPC started")
@@ -79,11 +88,11 @@ def serve_command_handler(
     )
     nlp_conf = {
         "nlp_engine_name": "spacy",
-        "models": [{"lang": "en", "name": "en_core_web_lg"}],
+        "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
     }
     nlp_engine_name = nlp_conf["nlp_engine_name"]
     nlp_engine_class = NLP_ENGINES[nlp_engine_name]
-    nlp_engine_opts = {m["lang"]: m["name"] for m in nlp_conf["models"]}
+    nlp_engine_opts = {m["lang_code"]: m["model_name"] for m in nlp_conf["models"]}
     nlp_engine = nlp_engine_class(nlp_engine_opts)
     logger.info(f"{nlp_engine_class.__name__} created")

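
The hunk above hard-codes `nlp_conf` rather than reading it from `nlp_conf_path`. A hedged sketch, not part of this commit, of how the configuration file could be honored with a fall-back to the same default (assuming PyYAML):

```python
import os
import yaml  # PyYAML, assumed to be installed

def load_nlp_conf(nlp_conf_path):
    """Read an NLP engine configuration file, falling back to the spaCy default."""
    if os.path.exists(nlp_conf_path):
        with open(nlp_conf_path) as f:
            return yaml.safe_load(f)
    # Default mirrors the hard-coded dictionary above.
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
    }
```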
@@ -18,12 +18,12 @@ class SpacyNlpEngine(NlpEngine):

     def __init__(self, models=None):
         if not models:
-            models = {"en": "en"}
-        logger.debug(f"Loading NLP models: {models.values()}")
+            models = {"en": "en_core_web_lg"}
+        logger.debug(f"Loading SpaCy models: {models.values()}")
 
         self.nlp = {
-            lang: spacy.load(model_name, disable=['parser', 'tagger'])
-            for lang, model_name in models.items()
+            lang_code: spacy.load(model_name, disable=['parser', 'tagger'])
+            for lang_code, model_name in models.items()
         }
 
         for model_name in models.values():
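
The comprehension above keeps one spaCy pipeline per language code, keyed exactly like `self.nlp`. A standalone sketch of the same pattern; the model names are examples and must be downloaded first with `python -m spacy download <name>`:

```python
import spacy

# One NER-capable pipeline per language code (parser and tagger disabled).
models = {"en": "en_core_web_lg", "de": "de_core_news_sm"}
nlp = {
    lang_code: spacy.load(model_name, disable=["parser", "tagger"])
    for lang_code, model_name in models.items()
}

doc = nlp["en"]("Bill Gates founded Microsoft in Redmond.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Bill Gates', 'PERSON'), ('Microsoft', 'ORG'), ('Redmond', 'GPE')]
```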
@@ -24,14 +24,14 @@ class StanzaNlpEngine(SpacyNlpEngine):
     def __init__(self, models=None):
         if not models:
             models = {"en": "en"}
-        logger.debug(f"Loading NLP models: {models.values()}")
+        logger.debug(f"Loading Stanza models: {models.values()}")
 
         self.nlp = {
-            lang: StanzaLanguage(
+            lang_code: StanzaLanguage(
                 stanza.Pipeline(
                     model_name,
-                    processors="tokenize,mwt,pos,lemma,ner",
+                    processors="tokenize,pos,lemma,ner",
                 )
             )
-            for lang, model_name in models.items()
+            for lang_code, model_name in models.items()
         }
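
For a single language, the comprehension above builds roughly the following: a Stanza pipeline wrapped so downstream code can use the spaCy `Doc`/`ents` interface. A sketch only; the `spacy_stanza` import path is an assumption (the import is not shown in this diff), and the English model must be fetched beforehand with `stanza.download("en")`:

```python
import stanza
from spacy_stanza import StanzaLanguage  # assumed import path; not shown in the diff

# Wrap a Stanza pipeline so it behaves like a spaCy language model.
snlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,ner")
nlp = StanzaLanguage(snlp)

doc = nlp("Angela Merkel visited Paris.")
print([(ent.text, ent.label_) for ent in doc.ents])
```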
@@ -38,6 +38,11 @@ def __init__(
         supported_entity="CREDIT_CARD",
         replacement_pairs=None,
     ):
+        """
+        :param replacement_pairs: list of tuples to replace in the string.
+            (default: [("-", ""), (" ", "")])
+            i.e. remove dashes and spaces from the string during recognition.
+        """
         self.replacement_pairs = replacement_pairs \
             if replacement_pairs \
             else [("-", ""), (" ", "")]
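
A minimal sketch of how a `replacement_pairs` list like the default documented above is typically applied before validation; the helper name here is illustrative, not the recognizer's actual API:

```python
def sanitize_value(text, replacement_pairs=None):
    """Apply (search, replacement) pairs, e.g. strip dashes and spaces."""
    for search_string, replacement_string in replacement_pairs or [("-", ""), (" ", "")]:
        text = text.replace(search_string, replacement_string)
    return text

print(sanitize_value("4095-2609-9393-4932"))  # -> 4095260993934932
```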
