Support LLM based de-identification #1234

Open

Description

Is your feature request related to a problem? Please describe.
LLMs usually perform well at PII detection and de-identification. Using an LLM to identify PII in text would let users extend Presidio to arbitrary PII entities, including PII that is a characteristic of a person rather than an identifier (e.g. "He recently got divorced" vs. "His SSN is 1234").

Describe the solution you'd like
Presidio currently supports multiple NER and NLP approaches for PII detection, and provides several NlpEngine implementations for transformers, stanza and spaCy. Adding one for LLMs would be a straightforward way to integrate an LLM into Presidio. One possible way to achieve this is to use spacy-llm, which already integrates with many LLM frameworks and models and takes care of details such as identifying the span of a PII entity discovered by an LLM.
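To illustrate the span-identification step mentioned above, here is a minimal standard-library sketch of mapping entity strings returned by an LLM back to character offsets in the input text. It is not spacy-llm's or Presidio's implementation: `PiiSpan`, `align_llm_entities`, and the `{"text": ..., "entity_type": ...}` response shape are all hypothetical names chosen for this example, and a real integration would likely need fuzzier matching.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PiiSpan:
    """Hypothetical result type: an entity label plus a character span."""
    entity_type: str
    start: int
    end: int


def align_llm_entities(text: str, llm_entities: List[Dict[str, str]]) -> List[PiiSpan]:
    """Map entity strings reported by an LLM to character offsets in `text`.

    Each item is assumed (for this sketch) to look like
    {"text": "...", "entity_type": "..."}. Entities that do not occur
    verbatim in `text` are skipped rather than guessed.
    """
    spans: List[PiiSpan] = []
    # Remember where the last match of each string ended, so that a string
    # reported twice is aligned to two successive occurrences.
    search_from: Dict[str, int] = {}
    for item in llm_entities:
        needle = item["text"]
        if not needle:
            continue
        start = text.find(needle, search_from.get(needle, 0))
        if start == -1:
            continue  # the LLM paraphrased or hallucinated this entity
        end = start + len(needle)
        search_from[needle] = end
        spans.append(PiiSpan(item["entity_type"], start, end))
    return sorted(spans, key=lambda s: s.start)


if __name__ == "__main__":
    sample = "He recently got divorced. His SSN is 1234."
    # Hard-coded stand-in for a parsed LLM response; no model is called here.
    fake_llm_output = [
        {"text": "1234", "entity_type": "US_SSN"},
        {"text": "got divorced", "entity_type": "MARITAL_STATUS"},
    ]
    for span in align_llm_entities(sample, fake_llm_output):
        print(span.entity_type, repr(sample[span.start:span.end]))
```

Exact string matching keeps the sketch simple, but it silently drops any entity the model rewrote, which is one reason to lean on an existing library for this step.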

Describe alternatives you've considered
LLMs could be used at several steps of the de-identification pipeline: generating fake data (for which we already have examples), identifying PII in text, and performing end-to-end de-identification. While we could consider building all three capabilities, we should start with PII detection, both to conform to the Presidio structure and to leverage the existing de-identification operators in presidio-anonymizer.
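The reason detection-first fits the existing structure is that a detector only needs to emit spans; any downstream operator can then act on them. The sketch below is a standard-library stand-in for that second stage, not presidio-anonymizer's API: `replace_spans` and the `(start, end, entity_type)` tuple shape are hypothetical, and the spans are assumed not to overlap.

```python
from typing import List, Tuple


def replace_spans(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with an <ENTITY_TYPE> placeholder.

    Assumes the spans do not overlap.
    """
    result = text
    # Apply replacements right to left so earlier offsets stay valid.
    for start, end, entity_type in sorted(spans, key=lambda s: s[0], reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result


if __name__ == "__main__":
    # The span would come from a detector, LLM-based or otherwise.
    print(replace_spans("His SSN is 1234", [(11, 15, "US_SSN")]))
```

Because the replacement step never looks at how a span was found, an LLM-based detector could be swapped in without touching the operators.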

Additional context

Contributions welcome! There are plenty of docs on how the NlpEngine is structured, as well as existing code samples for integrating NLP frameworks into Presidio.
