Analyzer - multiple languages and nlp engines #312
Commits on Jul 15, 2020
analyzer - multiple languages and nlp engines
Initially this was my attempt to use stanza, which is an NLP engine by Stanford. More generally, it's an update that makes it easier to add NLP engines and custom recognizers. Specifically, I standardized the format of the recognizers, removed use of global variables where possible, and removed a lot of hard-coded defaults. I am thinking of using presidio for several non-English projects at work, and these are several of the changes that I made.

Below is a list of the changes:

* make spacy and/or stanza optional
* remove requirement of en_core_web_lg from install
* allow predefined recognizers to take parameters
  * this allows for easily using these as non-English recognizers
* create config files for different NLP engines
* create tests for stanza
* make all spacy and stanza tests optional
* create a Dockerfile for an anaconda-based image
  * pytorch from conda is built with MKL and is much faster on CPU
* completely rewrote the IBAN recognizer
  * the current version only recognizes IBANs if they are the entirety of the string. This version will find IBANs in sentences.
* fixed some tests
* created a `run.sh` file, so you can just run the dockers without rebuilding them

"Breaking" Changes:

* I would like to use [black](https://github.com/psf/black), but it's not super friendly with pylint. My suggestion is to drop pylint and use black instead.
* Default spacy model is `en` rather than `en_core_web_lg`, and no spacy models are downloaded by default. The idea is to let the user choose which models they want. For non-English users, it saves a lot of time at installation because you don't need to install the large spacy model that you aren't using.

Signed-off-by: David Pollack <d.pollack@solvemate.com>
David Pollack committed Jul 15, 2020 (commit ceadb04)
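The first commit says the rewritten IBAN recognizer finds IBANs inside sentences rather than only when the IBAN is the whole string. The PR's actual implementation is not shown on this page, so the following is only a minimal sketch of that idea, assuming a loose candidate regex followed by the standard mod-97 checksum. The names `IBAN_CANDIDATE`, `is_valid_iban`, and `find_ibans` are hypothetical and are not presidio APIs.

```python
import re
from typing import List, Tuple

# Assumption: a loose candidate pattern (country code, two check digits,
# 11 to 30 alphanumerics). It ignores per-country lengths and spaced groups.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}\b")


def is_valid_iban(candidate: str) -> bool:
    """Apply the mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test the remainder."""
    rearranged = candidate[4:] + candidate[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, match) for every checksum-valid IBAN in the text."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CANDIDATE.finditer(text)
        if is_valid_iban(m.group())
    ]
```

Because the pattern is searched with `finditer` instead of being matched against the whole string, surrounding words do not prevent detection; the checksum then filters out lookalike tokens.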
spacy required, spacy-stanza, update tests
* made spacy required
* using spacy-stanza for stanza models
* refactor tests to use pytest
* make one test reliant on big model optional
David Pollack committed Jul 15, 2020 (commit 68cc8d9)
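The commits above mention making some tests optional when an engine or a large model is not installed. How the PR does this is not shown here; one common pytest pattern, sketched below under that assumption, is a `skipif` marker driven by whether the package can be found. `package_available` and `requires_stanza` are hypothetical names chosen for illustration.

```python
import importlib.util

import pytest


def package_available(name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Hypothetical marker: skip tests that need the optional "stanza" package.
requires_stanza = pytest.mark.skipif(
    not package_available("stanza"),
    reason="stanza is not installed",
)


@requires_stanza
def test_stanza_is_importable():
    # Imported inside the test so collection works without the package.
    import stanza

    assert stanza is not None
```

pytest also offers `pytest.importorskip("stanza")` for the same purpose inside a test or at module level.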
All tests have been refactored to use pytest. Previously, there was a mix of unittest, pytest and miscellaneous global initializations. This commit moves everything to pytest. There is now extensive use of fixtures instead of global variables and parametrized tests instead of duplicated code for each test. The major difference is that parametrized tests are not individually named.
David Pollack committed Jul 15, 2020 (commit 5487945)
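To illustrate the fixture and parametrization style this commit describes, here is a small sketch that is not taken from the PR's test suite. `count_digits` is a hypothetical stand-in for a recognizer, and `prefix` is a hypothetical fixture replacing what might otherwise be a module-level global. The sketch also shows that `pytest.param(..., id=...)` can give each parametrized case a readable name, which may soften the naming drawback noted above.

```python
import pytest


def count_digits(text: str) -> int:
    """Hypothetical function under test; stands in for a recognizer."""
    return sum(ch.isdigit() for ch in text)


@pytest.fixture(scope="module")
def prefix() -> str:
    """Shared setup that would otherwise live in a global variable."""
    return "id: "


@pytest.mark.parametrize(
    "text, expected",
    [
        pytest.param("abc", 0, id="no-digits"),
        pytest.param("a1b2", 2, id="two-digits"),
        pytest.param("2020", 4, id="all-digits"),
    ],
)
def test_count_digits(prefix, text, expected):
    # One test body covers every case; pytest reports each id separately.
    assert count_digits(prefix + text) == expected
```

Running `pytest -v` on such a file lists the cases by their ids, for example `test_count_digits[two-digits]`.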
David Pollack committed Jul 15, 2020 (commit b5fd25a)
David Pollack committed Jul 15, 2020 (commit bb84544)
David Pollack committed Jul 15, 2020 (commit af0caef)
fix pipeline errors (i.e. install spacy model)
This installs the big spacy model by default in the Docker image and the Azure pipeline.
David Pollack committed Jul 15, 2020 (commit d07f464)
David Pollack committed Jul 15, 2020 (commit d0ab7ed)
David Pollack committed Jul 15, 2020 (commit f281950)
David Pollack committed Jul 15, 2020 (commit 4200436)
David Pollack committed Jul 15, 2020 (commit 3bd2d30)
David Pollack committed Jul 15, 2020 (commit 33cfc20)
David Pollack committed Jul 15, 2020 (commit 9a4b9f3)
* add documentation and doc strings
* change yaml field names to be more logical
David Pollack committed Jul 15, 2020 (commit b23e551)
Commits on Jul 19, 2020
Merge remote-tracking branch 'upstream/master' into dhp/allow_multiple_langs
* fix merge conflicts with documentation
David Pollack committed Jul 19, 2020 (commit 59e3b2a)
Commits on Jul 22, 2020
fix pipelines based on PR comments
David Pollack committed Jul 22, 2020 (commit d7458e6)