Support LLM based de-identification

**Is your feature request related to a problem? Please describe.**
LLMs generally perform well at PII detection and de-identification. Using an LLM to identify PII in text would let users easily extend Presidio to arbitrary PII entities, including PII that is a characteristic of a person rather than an identifier (e.g. "He recently got divorced" vs. "His SSN is 1234").

**Describe the solution you'd like**
Presidio currently supports multiple NER and NLP approaches for PII detection, and provides several `NlpEngine` implementations for transformers, Stanza and spaCy. Adding one for LLMs would be a simple way to integrate an LLM into Presidio. One possible approach is [spacy-llm](https://github.com/explosion/spacy-llm), which already integrates with many LLM frameworks and models and takes care of details such as identifying the span of a PII entity discovered by an LLM.
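To make the span-identification step concrete, here is a minimal, stdlib-only sketch of mapping the entity strings an LLM returns back to character offsets in the input text. It is not how spacy-llm or Presidio implement this; `Span`, `locate_spans` and `fake_llm` are hypothetical names, and `fake_llm` is a hard-coded stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Span:
    """A detected entity with character offsets into the original text."""
    entity_type: str
    start: int
    end: int
    text: str


# Assumed interface: the LLM returns (entity_type, surface_text) pairs.
LlmExtractor = Callable[[str], List[Tuple[str, str]]]


def locate_spans(text: str, extract: LlmExtractor) -> List[Span]:
    """Find every exact occurrence of each LLM-reported entity in the text."""
    spans: List[Span] = []
    for entity_type, surface in extract(text):
        if not surface:
            continue  # skip empty strings to avoid an endless search loop
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            spans.append(Span(entity_type, start, end, surface))
            start = text.find(surface, end)
    return sorted(spans, key=lambda s: (s.start, s.end))


def fake_llm(text: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a model call; returns fixed entities."""
    return [("MARITAL_STATUS", "recently got divorced"), ("US_SSN", "1234")]


if __name__ == "__main__":
    sample = "He recently got divorced. His SSN is 1234"
    for span in locate_spans(sample, fake_llm):
        print(span)
```

Exact string matching is the simplest choice here. A real integration would also need to handle entities the LLM paraphrases or hallucinates, which is one reason to lean on an existing library for this step.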


**Describe alternatives you've considered**
LLMs can be used at several steps of the de-identification pipeline: to generate fake data (we already have examples of this), to identify PII in text, or to perform end-to-end de-identification. While we could consider building all three capabilities, we should start with PII detection, since it conforms to the Presidio structure and lets us leverage the existing de-identification operators in presidio-anonymizer.
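The reason detection composes well with existing operators is that its output is just a list of typed character spans, which any downstream step can consume. A minimal sketch of that hand-off, assuming non-overlapping detections, might look like the following. `replace_detections` is a hypothetical helper written for illustration only, not the presidio-anonymizer API.

```python
from typing import List, Tuple

# (start, end, entity_type) offsets into the original text.
Detection = Tuple[int, int, str]


def replace_detections(text: str, detections: List[Detection]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder.

    Spans are processed from the end of the text backwards so that
    earlier offsets remain valid. Assumes spans do not overlap.
    """
    result = text
    for start, end, entity_type in sorted(detections, reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result


if __name__ == "__main__":
    sample = "He recently got divorced. His SSN is 1234"
    found = [(3, 24, "MARITAL_STATUS"), (37, 41, "US_SSN")]
    print(replace_detections(sample, found))
```

Because the replacement step only sees offsets and entity types, it does not matter whether a span came from a regex, an NER model or an LLM.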


**Additional context**

Contributions welcome! There are plenty of docs on how the `NlpEngine` is structured, and existing code samples for integrating NLP frameworks into Presidio.

- [NLP Engine docs](https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/)
- [Contribution guidelines](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md)

Support LLM based de-identification #1234

omri374 opened on Dec 17, 2023