This package provides data science tooling for developing new recognizers for Presidio. It is used to evaluate the system as a whole, as well as specific PII recognizers or PII detection models.
- Anyone interested in developing or evaluating a PII detection model, an existing Presidio instance or a Presidio PII recognizer.
- Anyone interested in generating new data based on previous datasets or sentence templates (e.g. to increase the coverage of entity values) for Named Entity Recognition models.
To install the package, clone the repo and install all dependencies, preferably in a virtual environment:
# Create conda env (optional)
conda create --name presidio python=3.8
conda activate presidio
# Install package+dependencies
pip install -r requirements.txt
python setup.py install
# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg
# Verify installation
pytest
Note that some dependencies (such as Flair) are not installed to reduce installation complexity.
- Data generator for PII recognizers and NER models
- Data representation layer for data generation, modeling and analysis
- Multiple Model/Recognizer evaluation files (e.g. for spaCy, Flair, CRF, Presidio API, Presidio Analyzer python package, specific Presidio recognizers)
- Training and modeling code for multiple models
- Helper functions for results analysis
See Data Generator README for more details.
The data generation process receives a file with templates (e.g. "My name is [FIRST_NAME]") and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. It also tokenizes the data and creates tags (IO, IOB or BILOU) and spans for the newly created samples.
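The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the package's actual generator: the templates, the fake PII records, and the helper names (`fill_template`, `iob_tags`, `generate`) are all hypothetical, and tokenization is plain whitespace splitting rather than a real tokenizer.

```python
import random
import re

# Hypothetical templates and fake PII records, for illustration only.
TEMPLATES = ["My name is [FIRST_NAME]", "[FIRST_NAME] lives in [CITY]"]
FAKE_PII = [
    {"FIRST_NAME": "Dana", "CITY": "New York"},
    {"FIRST_NAME": "Omar", "CITY": "Lisbon"},
]

PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, record):
    """Replace each [ENTITY] placeholder and record the character span of its value."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        entity_type = match.group(1)
        value = record[entity_type]
        spans.append({"entity_type": entity_type, "start": out_len,
                      "end": out_len + len(value), "value": value})
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def iob_tags(text, spans):
    """Whitespace-tokenize and assign IOB tags to tokens fully inside a span.

    Simplification: a token with attached punctuation (e.g. "Dana,") would
    not fall inside its span and would be tagged "O".
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            if match.start() >= span["start"] and match.end() <= span["end"]:
                prefix = "B" if match.start() == span["start"] else "I"
                tag = f"{prefix}-{span['entity_type']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


def generate(n, seed=0):
    """Sample n (template, record) pairs and build tagged synthetic samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        record = rng.choice(FAKE_PII)
        text, spans = fill_template(template, record)
        tokens, tags = iob_tags(text, spans)
        samples.append({"template": template, "text": text, "spans": spans,
                        "tokens": tokens, "tags": tags})
    return samples
```

Calling `generate(100)` returns a list of dictionaries, each holding the source template, the synthetic sentence, its character spans, and token-level tags. Keeping the template alongside each sample makes it possible to split the data by template later on.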
- For information on data generation/augmentation, see the data generator README.
- For an example of running the generation process, see this notebook.
- For an understanding of the underlying fake PII data used, see this exploratory data analysis notebook. Note that the generation process might not work off-the-shelf, as we are not sharing the fake PII datasets and templates used in this analysis due to copyright and other restrictions.
Once data is generated, it can be split into train/test/validation sets while ensuring that each template only exists in one set. See this notebook for more details.
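A template-aware split can be sketched as follows. This is an illustrative approach, not necessarily the one used in the notebook: the function name `split_by_template`, the `template_id` key, and the default ratios are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole templates to train/test/validation.

    Every sample is assumed to carry a "template_id" key (hypothetical).
    The ratios apply to the number of templates, not the number of samples.
    """
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)

    # Shuffle template ids deterministically, then cut the list in three.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    train_end = round(n * ratios[0])
    test_end = train_end + round(n * ratios[1])
    groups = {
        "train": template_ids[:train_end],
        "test": template_ids[train_end:test_end],
        "validation": template_ids[test_end:],
    }
    # Expand each group of template ids back into its samples.
    return {name: [s for tid in ids for s in by_template[tid]]
            for name, ids in groups.items()}
```

Splitting on templates rather than on individual sentences prevents a model from being scored on sentence structures it has already seen during training, which would inflate the results.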
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see data_objects.py.
The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall. The main logic lies in the Evaluator class. It provides a structured way of evaluating models and recognizers.
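To make the two metrics concrete, here is a minimal token-level sketch. It is not the Evaluator class itself: the function name `token_level_scores` is hypothetical, and the example simplifies by treating any tag other than "O" as PII, ignoring whether the predicted entity type matches the gold one.

```python
from collections import Counter


def token_level_scores(gold_tags, predicted_tags):
    """Return (precision, recall) for PII vs. non-PII at the token level."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag sequences must have equal length")

    counts = Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        gold_is_pii = gold != "O"
        predicted_is_pii = predicted != "O"
        if gold_is_pii and predicted_is_pii:
            counts["tp"] += 1  # PII token correctly flagged
        elif predicted_is_pii:
            counts["fp"] += 1  # non-PII token flagged as PII
        elif gold_is_pii:
            counts["fn"] += 1  # PII token missed

    flagged = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall
```

For PII detection, recall is often the more critical of the two, since a missed entity means personal data left unprotected, while a false positive only removes some non-sensitive text.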
Some evaluators were developed for analysis and reference. These include:
Allows you to evaluate an existing Presidio instance. See this notebook for details.
Evaluate one specific recognizer for precision and recall. This is similar to the analyzer evaluation, but focuses on a single type of PII recognizer. See presidio_recognizer_wrapper.py.
To train a CRF on a new dataset, see this notebook. To evaluate a CRF model, see the same notebook or this class.
There are three ways of interacting with spaCy models:
- Evaluate an existing trained model
- Train with pretrained embeddings
- Fine-tune an existing spaCy model
Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API. See this notebook for creating spaCy datasets.
To evaluate spaCy based models, see this notebook.
To train a new model, see the FlairTrainer object.
For experimenting with other embedding types, change the embeddings object in the train method.
To train a Flair model, run:
from presidio_evaluator.models import FlairTrainer

# Paths to the generated train/test/validation datasets
train_samples = "../data/generated_train.json"
test_samples = "../data/generated_test.json"
val_samples = "../data/generated_validation.json"

# Build the Flair corpus from the three datasets
trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)

# Read the corpus and train the model
corpus = trainer.read_corpus("")
trainer.train(corpus)
To evaluate an existing model, see this notebook.
- Blog post on NLP approaches to data anonymization
- Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Copyright notice:
Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.