Analyzer - multiple languages and nlp engines #312

Merged: 16 commits, Jul 22, 2020

Commits on Jul 15, 2020

  1. analyzer - multiple languages and nlp engines

    This began as an attempt to use stanza, an NLP engine from Stanford.
    More generally, it makes it easier to add NLP engines and custom
    recognizers.  Specifically, I standardized the format of the
    recognizers, removed global variables where possible, and removed
    much of the hard-coding of defaults.
    
    I am considering presidio for several non-English projects at work,
    and these are some of the changes I made toward that.
    
    Changes:
    
    * make spacy and/or stanza optional
      * remove requirement of en_core_web_lg from install
    * allow predefined recognizers to take parameters
      * this makes it easy to use them as non-English recognizers
    * create config files for different NLP engines
    * create tests for stanza
    * make all spacy and stanza tests optional
    * create a Dockerfile for an anaconda-based image
      * the conda build of pytorch is built with MKL and is much faster on CPU
    * completely rewrote the IBAN recognizer
      * the previous version only recognizes an IBAN when it is the
        entire string.  This version also finds IBANs inside sentences.
    * fixed some tests
    * created a `run.sh` file to run the Docker containers without rebuilding them
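
The IBAN item above describes moving from whole-string matching to finding IBANs inside longer text. A minimal sketch of that idea follows. It is not the recognizer from this commit: it pairs an unanchored candidate pattern with the mod-97 checksum. The names `IBAN_CANDIDATE`, `is_valid_iban`, and `find_ibans` are hypothetical, and the pattern is deliberately loose (uppercase only, no per-country length table).

```python
import re

# Assumption: a loose candidate pattern, not presidio's actual regex.
# Two letters, two check digits, then 11-30 alphanumerics that may each be
# preceded by a single space. There are no ^/$ anchors, so the pattern can
# match in the middle of a sentence.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def is_valid_iban(candidate: str) -> bool:
    """Check the mod-97 checksum of a candidate IBAN."""
    compact = candidate.replace(" ", "")
    if not 15 <= len(compact) <= 34:
        return False
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and test the remainder.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str):
    """Return (start, end, match) for every checksum-valid candidate."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CANDIDATE.finditer(text)
        if is_valid_iban(m.group())
    ]


sentence = "Please transfer to GB82 WEST 1234 5698 7654 32 by Friday."
matches = find_ibans(sentence)
```

Because the candidate pattern is greedy, uppercase words directly after an IBAN can be swallowed into the candidate and make the checksum fail; a fuller implementation would likely retry with shorter candidates or use per-country lengths.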
    
    "Breaking" Changes:
    
    * I would like to use [black](https://github.com/psf/black), but it's
      not super friendly with pylint.  My suggestion is to drop pylint and
      use black instead.
    * Default spacy model is `en` rather than `en_core_web_lg`, and no spacy
      models are downloaded by default.  The idea is to let users choose
      which models they want.  For non-English users, this saves a lot of
      time at installation because there is no need to install a large
      spacy model that goes unused.
    
    Signed-off-by: David Pollack <d.pollack@solvemate.com>
    David Pollack committed Jul 15, 2020
    ceadb04
  2. spacy required, spacy-stanza, update tests

    * made spacy required
    * using spacy-stanza for stanza models
    * refactor tests to use pytest
    * make the one test that relies on the big model optional
    David Pollack committed Jul 15, 2020
    68cc8d9
  3. refactor tests to pytest

    All tests have been refactored to use pytest.  Previously, there was a
    mix of unittest, pytest and miscellaneous global initializations.  This
    commit moves everything to pytest.  There is now extensive use of
    fixtures instead of global variables and parametrized tests instead of
    duplicated code for each test.  The major difference is that
    parametrized tests are not individually named.
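
The pattern described above, a fixture in place of a global plus one parametrized test in place of several near-identical ones, might look roughly like this. It is a sketch only: `DigitRunRecognizer` is a hypothetical stand-in, not a presidio class, and the cases are invented for illustration.

```python
import re

import pytest


class DigitRunRecognizer:
    """Hypothetical stand-in for a recognizer; not part of presidio."""

    def __init__(self, min_length):
        self.pattern = re.compile(r"\d{%d,}" % min_length)

    def analyze(self, text):
        return [m.group() for m in self.pattern.finditer(text)]


@pytest.fixture(scope="module")
def recognizer():
    # Built once per test module, replacing a module-level global.
    return DigitRunRecognizer(min_length=3)


@pytest.mark.parametrize(
    "text, expected",
    [
        pytest.param("call 5551234 now", ["5551234"], id="single-match"),
        pytest.param("no digits here", [], id="no-match"),
        pytest.param("12 and 3456", ["3456"], id="short-run-ignored"),
    ],
)
def test_analyze(recognizer, text, expected):
    # One test body covers every case listed above.
    assert recognizer.analyze(text) == expected
```

On the naming point: `pytest.param(..., id=...)` gives each case a readable id in the test report, which recovers some of what individually named tests provided.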
    David Pollack committed Jul 15, 2020
    5487945
  4. changes based on PR comments

    David Pollack committed Jul 15, 2020
    b5fd25a
  5. fixes to Dockerfiles

    David Pollack committed Jul 15, 2020
    bb84544
  6. remove sys.path.append

    David Pollack committed Jul 15, 2020
    af0caef
  7. fix pipeline errors (i.e. install spacy model)

    This installs the big spacy model by default in the Docker image and
    the Azure pipeline.
    David Pollack committed Jul 15, 2020
    d07f464
  8. fix rebase errors

    David Pollack committed Jul 15, 2020
    d0ab7ed
  9. use Pattern class

    David Pollack committed Jul 15, 2020
    f281950
  10. update docs

    David Pollack committed Jul 15, 2020
    4200436
  11. use PresidioLogger

    David Pollack committed Jul 15, 2020
    3bd2d30
  12. linting fixes

    David Pollack committed Jul 15, 2020
    33cfc20
  13. move imports to top level

    David Pollack committed Jul 15, 2020
    9a4b9f3
  14. edits based on PR-review

    * add documentation and doc strings
    * change yaml field names to be more logical
    David Pollack committed Jul 15, 2020
    b23e551

Commits on Jul 19, 2020

  1. Merge remote-tracking branch 'upstream/master' into dhp/allow_multiple_langs

    * fix merge conflicts with documentation
    David Pollack committed Jul 19, 2020
    59e3b2a

Commits on Jul 22, 2020

  1. fix pipelines based on PR comments

    David Pollack committed Jul 22, 2020
    d7458e6