[WIP] analyzer - multiple languages and nlp engines (#312)
* analyzer - multiple languages and nlp engines

Initially this was my attempt to use stanza, an NLP engine from
Stanford. More generally, though, it is an update that makes it easier
to add NLP engines and custom recognizers. Specifically, I
standardized the format of the recognizers, removed the use of global
variables where possible, and removed a lot of hard-coded defaults.

I am planning to use Presidio for several non-English projects at
work, and these are several of the changes I made along the way.

Below is a summary of the changes:

* make spacy and/or stanza optional
  * remove the requirement of `en_core_web_lg` from the install
* allow predefined recognizers to take parameters
  * this makes it easy to reuse them as non-English recognizers
* create config files for the different NLP engines
* create tests for stanza
* make all spacy and stanza tests optional
* create a Dockerfile for an anaconda-based image
  * the conda build of PyTorch uses MKL and is much faster on CPU
* completely rewrite the IBAN recognizer
  * the current version only recognizes an IBAN if it is the entirety
    of the string; this version finds IBANs inside sentences
* fix some tests
* create a `run.sh` file, so Docker containers can be run without
  rebuilding them
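To illustrate the IBAN change, here is a minimal, self-contained sketch of the sentence-level approach: scan for IBAN-shaped candidates anywhere in the text and validate each with the ISO 13616 mod-97 checksum. The pattern and helper names are illustrative, not the actual Presidio implementation.

```python
import re

# Loose IBAN shape: two letters, two check digits, 11-30 alphanumerics.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def checksum_ok(iban: str) -> bool:
    # ISO 13616: move the first four chars to the end, map letters to
    # numbers (A=10 ... Z=35), and the result mod 97 must equal 1.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def find_ibans(text: str):
    """Find valid IBANs inside a sentence, not just full-string matches."""
    return [m.group() for m in IBAN_CANDIDATE.finditer(text)
            if checksum_ok(m.group())]

print(find_ibans("Please wire the deposit to GB82WEST12345698765432 today."))
```

A matcher anchored with `^...$` would return nothing for that sentence; scanning with `finditer` plus a checksum allows matches in running text while keeping precision.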

"Breaking" Changes:

* I would like to use [black](https://github.com/psf/black), but it is
  not very friendly with pylint. My suggestion is to drop pylint and
  use black instead.
* The default spacy model is `en` rather than `en_core_web_lg`, and no
  spacy models are downloaded by default. The idea is to let users
  choose the models they want. For non-English users, this saves a lot
  of time at installation, because you no longer need to download a
  large spacy model you aren't going to use.
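The recognizer parameterization mentioned above can be sketched schematically. This is an illustration of the idea, not the actual presidio-analyzer API; the class, entity, and pattern here are made up.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SimplePatternRecognizer:
    """Schematic stand-in for a pattern recognizer that takes its entity,
    patterns, and language as parameters instead of hard-coding English
    defaults."""
    supported_entity: str
    patterns: List[str]
    supported_language: str = "en"

    def analyze(self, text: str) -> List[Tuple[str, int, int]]:
        # Return (entity, start, end) spans for every pattern match.
        return [(self.supported_entity, m.start(), m.end())
                for p in self.patterns
                for m in re.finditer(p, text)]

# Because language and patterns are parameters, the same class serves a
# German deployment without subclassing:
de_plates = SimplePatternRecognizer(
    supported_entity="DE_LICENSE_PLATE",
    patterns=[r"\b[A-Z]{1,3}-[A-Z]{1,2} \d{1,4}\b"],
    supported_language="de",
)
print(de_plates.analyze("Kennzeichen B-AB 1234 wurde gemeldet."))
```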

Signed-off-by: David Pollack <d.pollack@solvemate.com>

* spacy required, spacy-stanza, update tests

* made spacy required
* used spacy-stanza for stanza models
* refactored tests to use pytest
* made the one test that relies on the big model optional

* refactor tests to pytest

All tests have been refactored to use pytest.  Previously, there was a
mix of unittest, pytest and miscellaneous global initializations.  This
commit moves everything to pytest.  There is now extensive use of
fixtures instead of global variables and parametrized tests instead of
duplicated code for each test.  The major difference is that
parametrized tests are not individually named.
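As a hypothetical illustration of that pattern, a fixture replaces a module-level global and `parametrize` replaces per-input test duplication (the function under test here is a toy stand-in, not Presidio code):

```python
import pytest

def normalize(text: str) -> str:
    # Toy stand-in for the objects the old tests built as globals.
    return " ".join(text.lower().split())

@pytest.fixture
def normalizer():
    # Fixtures construct per-test state instead of sharing globals.
    return normalize

@pytest.mark.parametrize(
    "raw, expected",
    [
        ("Hello  World", "hello world"),
        ("  MIXED Case\tinput ", "mixed case input"),
    ],
)
def test_normalize(normalizer, raw, expected):
    # One parametrized test replaces two near-identical functions; the
    # generated test IDs are derived from the parameters rather than
    # from individually named test functions.
    assert normalizer(raw) == expected
```

Run with `pipenv run pytest`; each parameter tuple shows up as a separate test case.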

* changes based on PR comments

* fixes to Dockerfiles

* remove sys.path.append

* fix pipeline errors (i.e. install spacy model)

This installs the big spacy model by default in the Docker image and
in the Azure pipeline.

* fix rebase errors

* use Pattern class

* update docs

* use PresidioLogger

* linting fixes

* move imports to top level

* edits based on PR-review

* add documentation and doc strings
* change yaml field names to be more logical

* fix pipelines based on PR comments
David Pollack authored Jul 22, 2020
1 parent 569b100 commit e5fe414
Showing 75 changed files with 3,486 additions and 4,706 deletions.
2 changes: 2 additions & 0 deletions Dockerfile.python.deps
@@ -25,6 +25,8 @@ RUN pip install pipenv
RUN pip install --upgrade setuptools
# Installing specified packages from Pipfile.lock
RUN bash -c 'PIPENV_VENV_IN_PROJECT=1 pipenv sync'
+# Install for tests, consider making this optional
+RUN pipenv run python -m spacy download en_core_web_lg

# Print to screen the installed packages for easy debugging
RUN pipenv run pip freeze
20 changes: 12 additions & 8 deletions build.sh
@@ -7,18 +7,22 @@

# Build the images

-export DOCKER_REGISTRY=presidio
-export PRESIDIO_LABEL=latest
+DOCKER_REGISTRY=${DOCKER_REGISTRY:-presidio}
+PRESIDIO_LABEL=${PRESIDIO_LABEL:-latest}
make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build-deps
make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build

# Run the containers

-docker network create mynetwork
-docker run --rm --name redis --network mynetwork -d -p 6379:6379 redis
-docker run --rm --name presidio-analyzer --network mynetwork -d -p 3000:3000 -e GRPC_PORT=3000 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-analyzer:${PRESIDIO_LABEL}
-docker run --rm --name presidio-anonymizer --network mynetwork -d -p 3001:3001 -e GRPC_PORT=3001 ${DOCKER_REGISTRY}/presidio-anonymizer:${PRESIDIO_LABEL}
-docker run --rm --name presidio-recognizers-store --network mynetwork -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=redis:6379 ${DOCKER_REGISTRY}/presidio-recognizers-store:${PRESIDIO_LABEL}
+NETWORKNAME=${NETWORKNAME:-presidio-network}
+if [[ ! "$(docker network ls)" =~ (^|[[:space:]])"$NETWORKNAME"($|[[:space:]]) ]]; then
+docker network create $NETWORKNAME
+fi
+docker run --rm --name redis --network $NETWORKNAME -d -p 6379:6379 redis
+docker run --rm --name presidio-analyzer --network $NETWORKNAME -d -p 3000:3000 -e GRPC_PORT=3000 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-analyzer:${PRESIDIO_LABEL}
+docker run --rm --name presidio-anonymizer --network $NETWORKNAME -d -p 3001:3001 -e GRPC_PORT=3001 ${DOCKER_REGISTRY}/presidio-anonymizer:${PRESIDIO_LABEL}
+docker run --rm --name presidio-recognizers-store --network $NETWORKNAME -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=redis:6379 ${DOCKER_REGISTRY}/presidio-recognizers-store:${PRESIDIO_LABEL}

+echo "waiting 30 seconds for analyzer model to load..."
sleep 30 # Wait for the analyzer model to load
-docker run --rm --name presidio-api --network mynetwork -d -p 8080:8080 -e WEB_PORT=8080 -e ANALYZER_SVC_ADDRESS=presidio-analyzer:3000 -e ANONYMIZER_SVC_ADDRESS=presidio-anonymizer:3001 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-api:${PRESIDIO_LABEL}
+docker run --rm --name presidio-api --network $NETWORKNAME -d -p 8080:8080 -e WEB_PORT=8080 -e ANALYZER_SVC_ADDRESS=presidio-analyzer:3000 -e ANONYMIZER_SVC_ADDRESS=presidio-anonymizer:3001 -e RECOGNIZERS_STORE_SVC_ADDRESS=presidio-recognizers-store:3004 ${DOCKER_REGISTRY}/presidio-api:${PRESIDIO_LABEL}
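The `${VAR:-default}` idiom used in build.sh above lets callers override the registry, label, and network name from the environment without editing the script; unset variables fall back to the defaults. A quick demonstration (shell sketch, independent of build.sh):

```shell
# Unset variables take the fallback...
unset DOCKER_REGISTRY
DOCKER_REGISTRY=${DOCKER_REGISTRY:-presidio}
echo "$DOCKER_REGISTRY"    # presidio

# ...while caller-supplied values win.
DOCKER_REGISTRY=myregistry.io
DOCKER_REGISTRY=${DOCKER_REGISTRY:-presidio}
echo "$DOCKER_REGISTRY"    # myregistry.io
```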
57 changes: 46 additions & 11 deletions docs/development.md
@@ -54,21 +54,24 @@ Most of Presidio's services are written in Go. The `presidio-analyzer` module, i
Additional installation instructions: https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv
3. Create virtualenv for the project and install all requirements in the Pipfile, including dev requirements. In the `presidio-analyzer` folder, run:
```
pipenv install --dev --sequential --skip-lock
```
-4. Run all tests
+4. Download spacy model
+```
+pipenv run python -m spacy download en_core_web_lg
+```
-```
-pipenv run pytest
-```
+5. Run all tests
+```
+pipenv run pytest
+```
-5. To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:
-   1. `pipenv run flake8 analyzer --exclude "*pb2*.py"`
-   2. `pipenv run pylint analyzer`
-   3. `pipenv run pip freeze`
+6. To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:
+   1. `pipenv run flake8 analyzer --exclude "*pb2*.py"`
+   2. `pipenv run pylint analyzer`
+   3. `pipenv run pip freeze`
#### Alternatively, activate the virtual environment and use the commands by starting a pipenv shell:
@@ -144,13 +147,13 @@ pipenv install --dev --sequential
3. If you want to experiment with `analyze` requests, navigate into the `analyzer` folder and start serving the analyzer service:

```sh
-pipenv run python __main__.py serve --grpc-port 3000
+pipenv run python app.py serve --grpc-port 3000
```

4. In a new `pipenv shell` window you can run `analyze` requests, for example:

```
-pipenv run python __main__.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
+pipenv run python app.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
```

## Load test
@@ -175,3 +178,35 @@ Edit [charts/presidio/values.yaml](../charts/presidio/values.yaml) to:
- Setup secret name (for private registries)
- Change presidio services version
- Change default scale


## NLP Engine Configuration

1. The NLP engines to deploy are set at start-up, based on the YAML configuration files in `presidio-analyzer/conf/`. The default NLP engine is the large English SpaCy model (`en_core_web_lg`), set in `default.yaml`.

2. The format of the yaml file is as follows:

```yaml
nlp_engine_name: spacy # {spacy, stanza}
models:
-
lang_code: en # code corresponds to `supported_language` in any custom recognizers
model_name: en_core_web_lg # the name of the SpaCy or Stanza model
-
    lang_code: de # additional models are optional; just add more items
model_name: de
```

3. By default, we call the `load_predefined_recognizers` method of the `RecognizerRegistry` class to load both language-specific and language-agnostic recognizers.

4. Downloading additional models:
* SpaCy NLP Models: [models download page](https://spacy.io/usage/models)
* Stanza NLP Models: [models download page](https://stanfordnlp.github.io/stanza/available_models.html)

```sh
# download models - tldr
# spacy
python -m spacy download en_core_web_lg
# stanza
python -c 'import stanza; stanza.download("en");'
```
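A configuration file in the format above could be consumed with a sketch like the following (assumes PyYAML is installed; the dictionary handling is illustrative, not the actual presidio-analyzer loading code):

```python
import yaml

# Inline copy of a conf/*.yaml file for demonstration purposes.
CONF = """
nlp_engine_name: spacy
models:
  -
    lang_code: en
    model_name: en_core_web_lg
  -
    lang_code: de
    model_name: de
"""

config = yaml.safe_load(CONF)
engine_name = config["nlp_engine_name"]        # which engine to construct
models = {m["lang_code"]: m["model_name"]      # lang_code -> model lookup
          for m in config["models"]}
print(engine_name, models)
```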
2 changes: 1 addition & 1 deletion docs/interpretability_logs.md
@@ -50,7 +50,7 @@ The `textual_explanation` field in `AnalysisExplanation` class allows you to add
Interpretability traces are enabled by default. Disable App Tracing by setting the `enabled` constructor parameter to `False`.
PII entities are not stored in the traces by default. Enable this either by setting the environment variable `ENABLE_TRACE_PII` to `True`, or directly on the command line, using the `enable-trace-pii` argument as follows:
```bash
-pipenv run python __main__.py serve --grpc-port 3001 --enable-trace-pii True
+pipenv run python app.py serve --grpc-port 3001 --enable-trace-pii True
```

## Notes
1 change: 1 addition & 0 deletions pipelines/templates/build-python-template.yaml
@@ -64,6 +64,7 @@ steps:
# regex
pipenv sync --dev --sequential
pipenv install --dev --skip-lock regex pytest-azurepipelines
+pipenv run python -m spacy download en_core_web_lg
- task: Bash@3
displayName: 'Lint'
inputs:
3 changes: 2 additions & 1 deletion presidio-analyzer/Dockerfile
@@ -19,6 +19,7 @@ FROM ${REGISTRY}/presidio-python-deps:${PRESIDIO_DEPS_LABEL}

ARG NAME=presidio-analyzer
ADD ./${NAME}/presidio_analyzer /usr/bin/${NAME}/presidio_analyzer
+ADD ./${NAME}/conf /usr/bin/${NAME}/presidio_analyzer/conf
WORKDIR /usr/bin/${NAME}/presidio_analyzer

-CMD pipenv run python __main__.py serve --env-grpc-port
+CMD pipenv run python app.py serve --env-grpc-port
3 changes: 2 additions & 1 deletion presidio-analyzer/Dockerfile.local
@@ -28,6 +28,7 @@ FROM ${REGISTRY}/presidio-python-deps:${PRESIDIO_DEPS_LABEL}

ARG NAME=presidio-analyzer
ADD ./${NAME}/presidio_analyzer /usr/bin/${NAME}/presidio_analyzer
+ADD ./${NAME}/conf /usr/bin/${NAME}/presidio_analyzer/conf
WORKDIR /usr/bin/${NAME}/presidio_analyzer

-CMD pipenv run python __main__.py serve --env-grpc-port
+CMD pipenv run python app.py serve --env-grpc-port
3 changes: 1 addition & 2 deletions presidio-analyzer/Pipfile
@@ -5,8 +5,7 @@ name = "pypi"

[packages]
cython = "*"
-spacy = "==2.2.3"
-en_core_web_lg = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz"}
+spacy = "==2.2.4"
regex = "*"
pyre2 = {file = "https://github.com/torosent/pyre2/archive/release/0.2.23.zip"}
grpcio = "*"
6 changes: 6 additions & 0 deletions presidio-analyzer/conf/default.yaml
@@ -0,0 +1,6 @@
nlp_engine_name: spacy
models:
-
lang_code: en
model_name: en_core_web_lg

5 changes: 5 additions & 0 deletions presidio-analyzer/conf/spacy.yaml
@@ -0,0 +1,5 @@
nlp_engine_name: spacy
models:
-
lang_code: en
model_name: en_core_web_sm
8 changes: 8 additions & 0 deletions presidio-analyzer/conf/spacy_multilingual.yaml
@@ -0,0 +1,8 @@
nlp_engine_name: spacy
models:
-
lang_code: en
model_name: en
-
lang_code: de
model_name: de
6 changes: 6 additions & 0 deletions presidio-analyzer/conf/stanza.yaml
@@ -0,0 +1,6 @@
nlp_engine_name: stanza
models:
-
lang_code: en
model_name: en

9 changes: 9 additions & 0 deletions presidio-analyzer/conf/stanza_multilingual.yaml
@@ -0,0 +1,9 @@
nlp_engine_name: stanza
models:
-
lang_code: en
model_name: en
-
lang_code: de
model_name: de

161 changes: 0 additions & 161 deletions presidio-analyzer/presidio_analyzer/__main__.py

This file was deleted.

