[WIP] analyzer - multiple languages and nlp engines (#312)
* analyzer - multiple languages and nlp engines

  Initially this was my attempt to use stanza, which is an nlp engine by Stanford. But generally, it's an update to allow for one to add NLP engines and custom recognizers more easily. Specifically, I standardized the format of the recognizers, removed use of global variables when possible, and removed a lot of hard-coding of defaults. I am thinking of using presidio for several non-english projects at work and these are several of the changes that I made.

  Below is a list of the changes in list form:

  * make spacy and/or stanza optional
  * remove requirement of en_core_web_lg from install
  * allow predefined recognizers to take parameters
    * this allows for easily using these as non-english recognizers
  * create config files for different NLP engines
  * create tests for stanza
  * make all spacy and stanza tests optional
  * create a Dockerfile for an anaconda-based image
    * pytorch is built with MKL and is much faster on cpu from conda
  * completely rewrote the IBAN recognizer
    * the current version only recognizes IBANs if they are the entirety of the string. This version will find IBANs in sentences.
  * fixed some tests
  * created a `run.sh` file, so just run dockers without rebuilding them

  "Breaking" Changes:

  * I would like to use [black](https://github.com/psf/black), but it's not super friendly with pylint. My suggestion is to drop pylint and use black instead.
  * Default spacy model is `en` rather than `en_core_web_lg` and no spacy models are downloaded by default. The idea is to let the user choose which models they want. For non-english users, it saves a lot of time at installation because you don't need to install the large spacy model that you aren't using.

  Signed-off-by: David Pollack <d.pollack@solvemate.com>

* spacy required, spacy-stanza, update tests
  * made spacy required
  * using spacy-stanza for stanza models
  * refactor tests to use pytest
  * make one test reliant on big model optional
* refactor tests to pytest

  All tests have been refactored to use pytest. Previously, there was a mix of unittest, pytest and miscellaneous global initializations. This commit moves everything to pytest. There is now extensive use of fixtures instead of global variables and parametrized tests instead of duplicated code for each test. The major difference is that parametrized tests are not individually named.
* changes based on PR comments
* fixes to Dockerfiles
* remove sys.path.append
* fix pipeline errors (i.e. install spacy model)

  this installs the big spacy model by default in the Docker and the Azure pipeline.
* fix rebase errors
* use Pattern class
* update docs
* use PresidioLogger
* linting fixes
* move imports to top level
* edits based on PR-review
* add documentation and doc strings
* change yaml field names to be more logical
* fix pipelines based on PR comments
David Pollack authored Jul 22, 2020
1 parent 569b100, commit e5fe414
Showing 75 changed files with 3,486 additions and 4,706 deletions.
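The commit message says the rewritten IBAN recognizer finds IBANs inside sentences rather than only when the IBAN is the whole string. The recognizer's code is not shown here, so the following is only a minimal sketch of that idea and not the commit's implementation. The names `IBAN_CANDIDATE`, `is_valid_iban` and `find_ibans` are hypothetical, and the pattern is a rough approximation that assumes upper-case IBANs grouped in blocks of four with optional single spaces. Candidates are confirmed with the standard mod-97 check.

```python
import re
from typing import List

# Rough candidate pattern (an assumption for illustration): country code,
# two check digits, then 4-character groups with optional single spaces,
# and an optional shorter trailing group.
IBAN_CANDIDATE = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def is_valid_iban(compact: str) -> bool:
    """Apply the mod-97 check to an IBAN written without spaces."""
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Map digits to themselves and letters to 10..35 (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> List[str]:
    """Return checksum-valid IBANs found anywhere in the text, without spaces."""
    found = []
    for match in IBAN_CANDIDATE.finditer(text):
        compact = match.group().replace(" ", "")
        if is_valid_iban(compact):
            found.append(compact)
    return found


if __name__ == "__main__":
    print(find_ibans("Please wire to GB82 WEST 1234 5698 7654 32, thanks."))
```

Because the pattern is greedy, an upper-case word directly after an IBAN could be pulled into the candidate and make the checksum fail. A production recognizer would more likely use per-country lengths and formats.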
    @@ -0,0 +1,6 @@
    nlp_engine_name: spacy
    models:
      -
        lang_code: en
        model_name: en_core_web_lg
    @@ -0,0 +1,5 @@
    nlp_engine_name: spacy
    models:
      -
        lang_code: en
        model_name: en_core_web_sm
    @@ -0,0 +1,8 @@
    nlp_engine_name: spacy
    models:
      -
        lang_code: en
        model_name: en
      -
        lang_code: de
        model_name: de
    @@ -0,0 +1,6 @@
    nlp_engine_name: stanza
    models:
      -
        lang_code: en
        model_name: en
    @@ -0,0 +1,9 @@
    nlp_engine_name: stanza
    models:
      -
        lang_code: en
        model_name: en
      -
        lang_code: de
        model_name: de
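The config files above share one shape: an `nlp_engine_name` plus a `models` list whose entries pair a `lang_code` with a `model_name`. As a rough illustration of how such a file could be consumed, the sketch below validates an already-parsed config (the dictionary a YAML parser would be expected to return) and builds a language-to-model lookup. It is not Presidio's loader. The function name `models_by_language` and the `SUPPORTED_ENGINES` set are assumptions made for this example.

```python
from typing import Any, Dict

# Engine names taken from the config files in this commit; treating them as
# the complete supported set is an assumption for this sketch.
SUPPORTED_ENGINES = {"spacy", "stanza"}


def models_by_language(config: Dict[str, Any]) -> Dict[str, str]:
    """Map each lang_code in a parsed engine config to its model_name."""
    engine = config.get("nlp_engine_name")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported nlp_engine_name: {engine!r}")

    mapping: Dict[str, str] = {}
    for entry in config.get("models") or []:
        lang = entry["lang_code"]
        if lang in mapping:
            raise ValueError(f"duplicate lang_code: {lang!r}")
        mapping[lang] = entry["model_name"]

    if not mapping:
        raise ValueError("config must list at least one model")
    return mapping


# Parsed form of the two-language stanza config shown above.
stanza_multi = {
    "nlp_engine_name": "stanza",
    "models": [
        {"lang_code": "en", "model_name": "en"},
        {"lang_code": "de", "model_name": "de"},
    ],
}

if __name__ == "__main__":
    print(models_by_language(stanza_multi))
```

Keeping the lookup keyed by `lang_code` means a caller can reject a request for a language with no configured model before any NLP engine is loaded.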
This file was deleted.