PHI Audio Redactor

A local tool that automatically detects and bleeps out Protected Health Information (PHI) from audio files.

What It Detects

  • Names (via NER + spelled-out names like L-I-S-A)
  • Dates (birthdates, appointment dates, etc.)
  • Social Security Numbers
  • Phone Numbers
  • Email Addresses
  • Medical Record Numbers
  • Addresses/Locations
  • Ages
  • ZIP Codes
  • Spelled-out names (e.g., "L-I-S-A" or "J O H N")
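
As a rough illustration of the pattern-based side of this detection, the snippet below shows the kind of regexes that could catch a few of these categories. These are example patterns only, not the actual rules in redact.py.

import re

# Illustrative patterns only -- PHIDetector.PATTERNS in redact.py may differ.
EXAMPLE_PATTERNS = {
    'SSN': r'\b\d{3}-\d{2}-\d{4}\b',                                   # 123-45-6789
    'PHONE': r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
    'ZIP': r'\b\d{5}(?:-\d{4})?\b',
    'SPELLED_NAME': r'\b(?:[A-Za-z][-\s]){2,}[A-Za-z]\b',              # L-I-S-A or J O H N
}

text = "My number is 555-123-4567 and my name is spelled L-I-S-A."
for label, pattern in EXAMPLE_PATTERNS.items():
    for match in re.finditer(pattern, text):
        print(label, match.group())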

Requirements

  • Python 3.9+
  • FFmpeg (for audio processing)
  • uv (recommended) or pip

Installation

1. Install FFmpeg

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt update && sudo apt install ffmpeg

Windows: Download from https://ffmpeg.org/download.html and add to PATH.

2. Install with uv (Recommended)

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/abdulmalik97/phi-redactor.git
cd phi-redactor
uv sync

# Download spaCy English model
uv pip install --python .venv/bin/python \
  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

Alternative: Install with pip

pip install -e .
python -m spacy download en_core_web_sm

Usage

Basic Usage

# With uv
uv run phi-redactor input_audio.mp3

# Or directly with Python
uv run python redact.py input_audio.mp3

This creates input_audio_redacted.mp3 with PHI bleeped out.

Command-Line Options

phi-redactor input.wav [OPTIONS]

Options:
  -o, --output FILE      Output file path (default: input_redacted.ext)
  -m, --model MODEL      Whisper model: tiny, base, small, medium, large (default: base)
  -f, --beep-freq HZ     Beep frequency in Hz (default: 1000)
  -t, --save-transcript  Save transcript to JSON file
  -j, --json             Output results as JSON
  -i, --interactive      Review and confirm each redaction before processing
  -r, --report FILE      Save redaction report to markdown file
  -h, --help             Show help message
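
To make the -f/--beep-freq option concrete: the beep is a sine tone at the requested frequency laid over each redacted span. Below is a minimal sketch using pydub (an assumption; redact.py may use a different audio library), with the timestamps coming from the detection step:

from pydub import AudioSegment
from pydub.generators import Sine

def bleep_span(audio, start_ms, end_ms, freq_hz=1000):
    """Replace [start_ms, end_ms) with a sine beep of the same duration."""
    beep = Sine(freq_hz).to_audio_segment(duration=end_ms - start_ms).apply_gain(-3)
    return audio[:start_ms] + beep + audio[end_ms:]

audio = AudioSegment.from_file("patient_call.mp3")
audio = bleep_span(audio, 12300, 13100, freq_hz=1000)   # e.g. a PERSON hit at 12.3s-13.1s
audio.export("patient_call_redacted.mp3", format="mp3")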

Examples

Basic redaction:

uv run phi-redactor patient_call.mp3

Interactive mode (review each detection):

uv run phi-redactor patient_call.mp3 -i

Save a report of what was redacted:

uv run phi-redactor patient_call.mp3 -r redaction_report.md

High-accuracy with transcript and report:

uv run phi-redactor call.wav -m medium -t -r report.md -o cleaned_call.wav

Output as JSON:

uv run phi-redactor call.mp3 -j > results.json

Interactive Mode

Use -i to review each detected PHI segment before redaction:

============================================================
INTERACTIVE MODE: Review detected PHI segments
============================================================
Commands: [y]es, [n]o, [a]ll (redact all remaining), [q]uit (cancel)
------------------------------------------------------------

[1/5] PERSON
  Time:  12.3s - 13.1s
  Context: ...my name is ["John Smith"] and I'm calling...
  Redact? [y/n/a/q]: y
  -> Will redact

[2/5] SPELLED_NAME
  Time:  25.8s - 27.2s
  Context: ...spelled ["L-I-S-A"] for the...
  Redact? [y/n/a/q]: y
  -> Will redact
...
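
Internally, an interactive pass like this is just a confirmation loop over the detected segments. A minimal sketch with hypothetical field names (not the tool's actual code):

def review(segments):
    """Ask y/n per segment; 'a' accepts everything remaining, 'q' cancels the run."""
    approved, accept_all = [], False
    for i, seg in enumerate(segments, 1):
        if accept_all:
            approved.append(seg)
            continue
        print(f"[{i}/{len(segments)}] {seg['type']}  {seg['start']:.1f}s - {seg['end']:.1f}s")
        answer = input("  Redact? [y/n/a/q]: ").strip().lower()
        if answer == 'q':
            return []
        if answer in ('y', 'a'):
            approved.append(seg)
            accept_all = answer == 'a'
    return approved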

Report Output

The -r flag generates a markdown report:

# PHI Redaction Report

**Input file:** patient_call.mp3
**Output file:** patient_call_redacted.mp3
**Total redactions:** 5

## Transcript

Hello, my name is John Smith...

## Redacted Segments

| # | Time | Duration | Type | Text |
|---|------|----------|------|------|
| 1 | 12.3s | 0.8s | PERSON | John Smith |
| 2 | 25.8s | 1.4s | SPELLED_NAME | J-O-H-N |
...
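
Producing a report like this is a small formatting step once the redacted segments are known; here is a sketch of building the table body (field names are assumptions):

def report_rows(segments):
    """Render the 'Redacted Segments' markdown table shown above."""
    lines = ["| # | Time | Duration | Type | Text |",
             "|---|------|----------|------|------|"]
    for i, s in enumerate(segments, 1):
        lines.append(f"| {i} | {s['start']:.1f}s | {s['end'] - s['start']:.1f}s "
                     f"| {s['type']} | {s['text']} |")
    return "\n".join(lines)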

Whisper Model Sizes

| Model  | Speed   | Accuracy | VRAM   |
|--------|---------|----------|--------|
| tiny   | Fastest | Lower    | ~1 GB  |
| base   | Fast    | Good     | ~1 GB  |
| small  | Medium  | Better   | ~2 GB  |
| medium | Slow    | High     | ~5 GB  |
| large  | Slowest | Highest  | ~10 GB |

For most use cases, base or small is recommended.

How It Works

  1. Transcription: Uses OpenAI's Whisper model to convert speech to text with word-level timestamps
  2. PHI Detection: Combines spaCy NER (Named Entity Recognition) with regex patterns to identify PHI
  3. Review (optional): Interactive mode lets you confirm each detection
  4. Redaction: Replaces detected PHI audio segments with a beep tone
  5. Report (optional): Generates a detailed report of all redactions
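
A stripped-down sketch of steps 1-2, using the openai-whisper and spaCy APIs the tool is built on. The real redact.py also applies regex patterns, handles spelled-out names, and overlays a sine beep on each detected span, so treat this only as an outline:

import whisper
import spacy

PHI_ENTITY_TYPES = {'PERSON', 'GPE', 'LOC', 'DATE'}   # spaCy labels treated as PHI here

# 1. Transcription with word-level timestamps
model = whisper.load_model("base")
result = model.transcribe("patient_call.mp3", word_timestamps=True)
words = [w for seg in result["segments"] for w in seg["words"]]   # each has 'word', 'start', 'end'
transcript = "".join(w["word"] for w in words)

# 2. PHI detection via spaCy NER (regex patterns would run over the same transcript)
nlp = spacy.load("en_core_web_sm")
spans = []
for ent in nlp(transcript).ents:
    if ent.label_ in PHI_ENTITY_TYPES:
        # Map the entity back to word timestamps (simplified: match on surface text)
        hits = [w for w in words if w["word"].strip(" ,.") in ent.text]
        if hits:
            spans.append({"type": ent.label_, "text": ent.text,
                          "start": hits[0]["start"], "end": hits[-1]["end"]})

# Steps 3-5 then review, bleep, and report each span
for s in spans:
    print(f"{s['type']}: '{s['text']}' at {s['start']:.1f}s-{s['end']:.1f}s")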

Customization

Adding Custom PHI Patterns

Edit redact.py and add patterns to PHIDetector.PATTERNS:

PATTERNS = {
    # ... existing patterns ...
    'PATIENT_ID': r'\b(?:patient|pt)[:\s#]*\d{6,10}\b',
    'CUSTOM': r'your-regex-here',
}
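
Patterns added this way are ordinary Python regexes run against the transcript, so a quick way to sanity-check a new entry before wiring it into redact.py is:

import re

new_pattern = r'\b(?:patient|pt)[:\s#]*\d{6,10}\b'   # the PATIENT_ID example above
sample = "This is patient #12345678, seen last Tuesday."
for m in re.finditer(new_pattern, sample, flags=re.IGNORECASE):
    print(m.group(), m.span())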

Adjusting Detection Sensitivity

Modify PHI_ENTITY_TYPES to include/exclude spaCy entity types:

PHI_ENTITY_TYPES = {'PERSON', 'GPE', 'LOC', 'DATE'}  # More conservative

Supported Audio Formats

Any format supported by FFmpeg:

  • MP3, WAV, M4A, MP4, FLAC, OGG, WMA, AAC, etc.

Limitations

  • Accuracy depends on audio quality and Whisper model size
  • May miss PHI in heavily accented speech or poor audio
  • Timing of bleeps may occasionally be slightly off
  • Does not detect PHI in background conversations

Privacy Note

All processing happens locally on your machine. No audio is sent to any external service.

License

MIT License - Use freely for any purpose.
