
Commit

new faker based generator and package updates
omri374 committed Dec 26, 2021
1 parent 5ab101b commit f575731
Showing 85 changed files with 49,413 additions and 162,313 deletions.
44 changes: 27 additions & 17 deletions README.md
@@ -1,12 +1,15 @@
# Presidio-research
This package features data-science related tasks for developing new recognizers for [Presidio](https://github.com/microsoft/presidio).
It is used to evaluate the entire system, as well as specific PII recognizers and PII detection models.

## Who should use it?

- Anyone interested in **developing or evaluating a PII detection model**, an existing Presidio instance or a Presidio PII recognizer.
- Anyone interested in **generating new data based on previous datasets** or sentence templates (e.g. to increase the coverage of entity values) for Named Entity Recognition models.

## Getting started

To install the package, clone the repo and install all dependencies, preferably in a virtual environment:

``` sh
@@ -24,28 +27,27 @@
python -m spacy download en_core_web_lg
# Verify installation
pytest
```

Note that some dependencies (such as Flair) are not installed by default, to reduce installation complexity.

## What's in this package?

1. **Fake data generator** for PII recognizers and NER models
2. **Data representation layer** for data generation, modeling and analysis
3. Multiple **Model/Recognizer evaluation** files (e.g. for spaCy, Flair, CRF, the Presidio API, the Presidio Analyzer Python package, and specific Presidio recognizers)
4. **Training and modeling code** for multiple models
5. Helper functions for **results analysis**

## 1. Data generation

See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.

The data generation process receives a file with templates, e.g. `My name is [FIRST_NAME]` and a data frame with fake PII data.
Then, it creates new synthetic sentences by sampling templates and PII values. It also tokenizes the data and creates tags (IO, IOB, or BILOU) and spans for the newly created samples.
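
As a rough illustration of that process, here is a minimal, self-contained sketch in plain Python. This is not the package's actual API (which also handles punctuation, casing, and span bookkeeping); `fill_template` is a hypothetical helper:

```python
import random
import re

def fill_template(template, fake_values, rng=random):
    """Replace each [ENTITY] placeholder with a sampled fake value and
    derive token-level BILOU tags for the inserted spans."""
    tokens, tags = [], []
    # Split the template into placeholder and plain-text parts
    for part in re.split(r"(\[[A-Z_]+\])", template):
        placeholder = re.fullmatch(r"\[([A-Z_]+)\]", part)
        if placeholder:
            entity = placeholder.group(1)
            value_tokens = rng.choice(fake_values[entity]).split()
            if len(value_tokens) == 1:
                tags += [f"U-{entity}"]  # U = single-token (unit) entity
            else:
                tags += ([f"B-{entity}"]
                         + [f"I-{entity}"] * (len(value_tokens) - 2)
                         + [f"L-{entity}"])
            tokens += value_tokens
        else:
            words = part.split()
            tokens += words
            tags += ["O"] * len(words)  # O = outside any entity
    return tokens, tags

tokens, tags = fill_template(
    "My name is [FIRST_NAME] [LAST_NAME]",
    {"FIRST_NAME": ["Maria"], "LAST_NAME": ["de la Cruz"]},
)
# tokens: ['My', 'name', 'is', 'Maria', 'de', 'la', 'Cruz']
# tags:   ['O', 'O', 'O', 'U-FIRST_NAME', 'B-LAST_NAME', 'I-LAST_NAME', 'L-LAST_NAME']
```

Single-token values get a `U-` (unit) tag, while multi-token values are wrapped in `B-`/`I-`/`L-` tags, per the BILOU scheme.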

- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).

- For an example of running the generation process, see [this notebook](notebooks/data%20generation/Generate%20data.ipynb).

- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/PII%20EDA.ipynb).
Note that the generation process might not work off-the-shelf, as we are not sharing the fake PII datasets and templates used in this analysis due to copyright and other restrictions.
@@ -57,40 +59,47 @@ Once data is generated, it could be split into train/test/validation sets while
In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see [data_objects.py](presidio_evaluator/data_objects.py).
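
As a mental model, such an object bundles the raw text with its spans, tokens, and tags. The sketch below is a much-simplified stand-in (`LabeledSample` and this `Span` are hypothetical; the real classes in [data_objects.py](presidio_evaluator/data_objects.py) carry more fields and conversion logic):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    entity_type: str   # e.g. "PERSON"
    entity_value: str  # the surface form, e.g. "Maria"
    start: int         # character offset where the span starts
    end: int           # character offset where the span ends

@dataclass
class LabeledSample:
    full_text: str
    spans: List[Span] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)  # one tag per token (IO/IOB/BILOU)

sample = LabeledSample(
    full_text="My name is Maria",
    spans=[Span("PERSON", "Maria", 11, 16)],
    tokens=["My", "name", "is", "Maria"],
    tags=["O", "O", "O", "U-PERSON"],
)
# Spans index directly into the full text:
assert sample.full_text[sample.spans[0].start:sample.spans[0].end] == "Maria"
```

Keeping text, spans, tokens, and tags in one container is what makes it possible to convert the same sample into the formats different models expect.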

## 3. Recognizer evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, or a specific PII recognizer for precision and recall.
The main logic lies in the [Evaluator](presidio_evaluator/evaluation/evaluator.py) class. It provides a structured way of evaluating models and recognizers.
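
At its core, such an evaluation compares predicted entity labels against gold labels. The toy function below computes token-level precision and recall for one entity type (an illustration only; the actual `Evaluator` class adds per-entity breakdowns, confusion matrices, and error analysis):

```python
def precision_recall(gold, predicted, entity="PERSON"):
    """Token-level precision/recall for a single entity type."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == entity and p == entity for g, p in pairs)  # correctly found
    fp = sum(g != entity and p == entity for g, p in pairs)  # false alarms
    fn = sum(g == entity and p != entity for g, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold      = ["O", "PERSON", "PERSON", "O", "PERSON"]
predicted = ["O", "PERSON", "O",      "O", "PERSON"]
precision, recall = precision_recall(gold, predicted)
# precision = 1.0 (no false alarms), recall ≈ 0.67 (one PERSON token missed)
```

For PII detection, recall is often the more critical metric, since a missed entity can mean a data leak.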


### Ready model / engine wrappers

Some evaluator wrappers were developed for analysis and reference. These include:

#### Presidio analyzer evaluation

Allows you to evaluate an existing Presidio instance. [See this notebook for details](notebooks/Evaluate%20Presidio%20Analyzer.ipynb).

#### One recognizer evaluation

Evaluate one specific recognizer for precision and recall.
Similar to the analyzer evaluation, just focusing on one type of PII recognizer.
See [presidio_recognizer_wrapper.py](presidio_evaluator/models/presidio_recognizer_wrapper.py).

#### Conditional Random Fields

To train a CRF on a new dataset, see [this notebook](notebooks/models/Train%20CRF.ipynb).
To evaluate a CRF model, see the [same notebook](notebooks/models/Train%20CRF.ipynb) or [this class](presidio_evaluator/models/crf_model.py).
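
CRF taggers are typically driven by hand-crafted token features rather than embeddings. The sketch below shows the style of feature function such a model might use (`token_features` is a hypothetical example; the linked notebook and class define the actual features):

```python
def token_features(tokens, i):
    """Feature dict for the token at position i, in the style
    commonly fed to CRF sequence taggers."""
    token = tokens[i]
    return {
        "lower": token.lower(),      # normalized surface form
        "istitle": token.istitle(),  # capitalization is a strong PII cue
        "isdigit": token.isdigit(),  # useful for phone numbers, zip codes
        "prefix3": token[:3],
        "suffix3": token[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

features = [token_features(["My", "name", "is", "Maria"], i) for i in range(4)]
```

Each sentence becomes a sequence of such dicts, and the CRF learns tag transitions on top of them.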

#### spaCy based models

There are three ways of interacting with spaCy models:

1. Evaluate an existing trained model
2. Train with pretrained embeddings
3. Fine tune an existing spaCy model

Before interacting with spaCy models, the data needs to be adapted to fit spaCy's API.
See [this notebook for creating spaCy datasets](notebooks/models/Create%20datasets%20for%20Spacy%20training.ipynb).
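
The core of that adaptation is converting token-level tags into the character-offset entity annotations spaCy's training API expects. Below is a minimal sketch for BILOU-tagged, whitespace-joined tokens (`bilou_to_offsets` is a hypothetical helper; the linked notebook does this properly through the package's data objects):

```python
def bilou_to_offsets(tokens, tags):
    """Convert whitespace-joined tokens + BILOU tags into
    (text, [(start, end, label), ...]) character-offset annotations."""
    text, entities = "", []
    start = None
    for token, tag in zip(tokens, tags):
        if text:
            text += " "
        token_start = len(text)
        text += token
        prefix, _, label = tag.partition("-")
        if prefix == "U":                             # single-token entity
            entities.append((token_start, len(text), label))
        elif prefix == "B":                           # entity begins
            start = token_start
        elif prefix == "L":                           # entity ends
            entities.append((start, len(text), label))
    return text, entities

text, ents = bilou_to_offsets(
    ["My", "name", "is", "Maria", "de", "la", "Cruz"],
    ["O", "O", "O", "U-FIRST_NAME", "B-LAST_NAME", "I-LAST_NAME", "L-LAST_NAME"],
)
# text: 'My name is Maria de la Cruz'
# ents: [(11, 16, 'FIRST_NAME'), (17, 27, 'LAST_NAME')]
```

From here, the `(text, entities)` pairs can be turned into spaCy `Doc` objects and serialized for training.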

##### Evaluate an existing spaCy model

To evaluate spaCy based models, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).

#### Flair based models

To train a new model, see the [FlairTrainer](https://github.com/microsoft/presidio-research/blob/master/models/flair_train.py) object.
For experimenting with other embedding types, change the `embeddings` object in the `train` method.
To train a Flair model, run:

@@ -110,14 +119,15 @@ trainer.train(corpus)
To evaluate an existing model, see [this notebook](notebooks/models/Evaluate%20flair%20models.ipynb).

# For more information

- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)

# Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit <https://cla.opensource.microsoft.com>.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
@@ -130,5 +140,5 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Copyright notice:

Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/)
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/).
Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.