A local tool that automatically detects and bleeps out Protected Health Information (PHI) from audio files.

It detects the following PHI types:
- Names (via NER + spelled-out names like L-I-S-A)
- Dates (birthdates, appointment dates, etc.)
- Social Security Numbers
- Phone Numbers
- Email Addresses
- Medical Record Numbers
- Addresses/Locations
- Ages
- ZIP Codes
- Spelled-out names (e.g., "L-I-S-A" or "J O H N")
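Spelled-out names are worth calling out separately because a standard NER model will not tag "L-I-S-A" as a person, which is why the tool pairs NER with regex patterns. Below is a minimal, illustrative sketch of that kind of pattern; the actual regex used in redact.py may differ.

```python
import re

# Illustrative pattern only: two or more single letters separated by hyphens
# or spaces, followed by a final letter ("L-I-S-A", "J O H N"). A production
# pattern would likely add tighter constraints (uppercase only, length limits)
# to reduce false positives.
SPELLED_NAME = re.compile(r'\b(?:[A-Za-z][-\s]){2,}[A-Za-z]\b')

text = "My name is Lisa, spelled L-I-S-A, and my son is J O H N."
print([m.group() for m in SPELLED_NAME.finditer(text)])
# ['L-I-S-A', 'J O H N']
```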
Requirements:

- Python 3.9+
- FFmpeg (for audio processing)
- uv (recommended) or pip
To install FFmpeg:

macOS:
brew install ffmpeg

Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg

Windows: Download from https://ffmpeg.org/download.html and add it to PATH.
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
cd phi-redactor
uv sync
# Download spaCy English model
uv pip install --python .venv/bin/python \
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

Alternatively, install with pip:

pip install -e .
python -m spacy download en_core_web_sm
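To confirm the spaCy model installed correctly, a quick check from a Python shell (this snippet is only a sanity check, not part of the tool):

```python
import spacy

# Raises OSError if en_core_web_sm is not installed.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. John Smith called on March 3rd from Boston.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expect entries such as ('John Smith', 'PERSON') and ('March 3rd', 'DATE').
```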
Run it on an audio file:

# With uv
uv run phi-redactor input_audio.mp3
# Or directly with Python
uv run python redact.py input_audio.mp3

This creates input_audio_redacted.mp3 with PHI bleeped out.

Full usage:
phi-redactor input.wav [OPTIONS]
Options:
-o, --output FILE Output file path (default: input_redacted.ext)
-m, --model MODEL Whisper model: tiny, base, small, medium, large (default: base)
-f, --beep-freq HZ Beep frequency in Hz (default: 1000)
-t, --save-transcript Save transcript to JSON file
-j, --json Output results as JSON
-i, --interactive Review and confirm each redaction before processing
-r, --report FILE Save redaction report to markdown file
-h, --help Show help message

Basic redaction:
uv run phi-redactor patient_call.mp3

Interactive mode (review each detection):
uv run phi-redactor patient_call.mp3 -i

Save a report of what was redacted:
uv run phi-redactor patient_call.mp3 -r redaction_report.md

High-accuracy with transcript and report:
uv run phi-redactor call.wav -m medium -t -r report.md -o cleaned_call.wav

Output as JSON:
uv run phi-redactor call.mp3 -j > results.json
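The JSON schema is not documented here, so the field names below ("redactions", "type", "start", "end") are assumptions for illustration only; check the actual -j output before relying on them. A sketch of post-processing the results:

```python
import json

with open("results.json") as f:
    results = json.load(f)

# NOTE: "redactions", "type", "start", "end" are hypothetical field names,
# not confirmed against the real -j output.
for r in results.get("redactions", []):
    print(f'{r["type"]:>14}  {r["start"]:7.1f}s  ->  {r["end"]:7.1f}s')
```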
Use -i to review each detected PHI segment before redaction:

============================================================
INTERACTIVE MODE: Review detected PHI segments
============================================================
Commands: [y]es, [n]o, [a]ll (redact all remaining), [q]uit (cancel)
------------------------------------------------------------
[1/5] PERSON
Time: 12.3s - 13.1s
Context: ...my name is ["John Smith"] and I'm calling...
Redact? [y/n/a/q]: y
-> Will redact
[2/5] SPELLED_NAME
Time: 25.8s - 27.2s
Context: ...spelled ["L-I-S-A"] for the...
Redact? [y/n/a/q]: y
-> Will redact
...
The -r flag generates a markdown report:
# PHI Redaction Report
**Input file:** patient_call.mp3
**Output file:** patient_call_redacted.mp3
**Total redactions:** 5
## Transcript
Hello, my name is John Smith...
## Redacted Segments
| # | Time | Duration | Type | Text |
|---|------|----------|------|------|
| 1 | 12.3s | 0.8s | PERSON | John Smith |
| 2 | 25.8s | 1.4s | SPELLED_NAME | J-O-H-N |
...

Whisper model comparison:

| Model | Speed | Accuracy | VRAM |
|---|---|---|---|
| tiny | Fastest | Lower | ~1 GB |
| base | Fast | Good | ~1 GB |
| small | Medium | Better | ~2 GB |
| medium | Slow | High | ~5 GB |
| large | Slowest | Highest | ~10 GB |
For most use cases, base or small is recommended.

How it works:
- Transcription: Uses OpenAI's Whisper model to convert speech to text with word-level timestamps
- PHI Detection: Combines spaCy NER (Named Entity Recognition) with regex patterns to identify PHI
- Review (optional): Interactive mode lets you confirm each detection
- Redaction: Replaces detected PHI audio segments with a beep tone
- Report (optional): Generates a detailed report of all redactions
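A simplified sketch of that pipeline is shown below. This is not the actual redact.py code: it assumes openai-whisper with word_timestamps=True, spaCy's en_core_web_sm, and pydub for audio splicing, and it includes only one regex pattern; the real implementation covers many more PHI types and edge cases.

```python
import re
import spacy
import whisper                      # openai-whisper
from pydub import AudioSegment
from pydub.generators import Sine

PHI_LABELS = {"PERSON", "GPE", "LOC", "DATE"}      # spaCy entity types treated as PHI
SSN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")   # one example regex pattern

def redact(path: str, out_path: str, beep_freq: int = 1000) -> None:
    # 1. Transcribe with word-level timestamps.
    result = whisper.load_model("base").transcribe(path, word_timestamps=True)
    words = [w for seg in result["segments"] for w in seg["words"]]

    # Rebuild the transcript while remembering each word's character span.
    text, word_spans = "", []
    for w in words:
        token = w["word"].strip()
        start = len(text) + 1 if text else 0
        text = f"{text} {token}" if text else token
        word_spans.append((start, start + len(token), w["start"], w["end"]))

    # 2. Detect PHI character ranges: spaCy NER plus regex patterns.
    nlp = spacy.load("en_core_web_sm")
    hits = [(e.start_char, e.end_char) for e in nlp(text).ents if e.label_ in PHI_LABELS]
    hits += [m.span() for m in SSN.finditer(text)]

    # 3. Convert character ranges to time ranges via the word spans.
    times = []
    for c0, c1 in hits:
        overlap = [(t0, t1) for s0, s1, t0, t1 in word_spans if s0 < c1 and s1 > c0]
        if overlap:
            times.append((min(t for t, _ in overlap), max(t for _, t in overlap)))

    # 4. Splice a beep tone over each PHI time range (latest first, so
    #    millisecond rounding never shifts earlier offsets).
    audio = AudioSegment.from_file(path)
    for t0, t1 in sorted(times, reverse=True):
        beep = Sine(beep_freq).to_audio_segment(duration=int((t1 - t0) * 1000))
        audio = audio[: int(t0 * 1000)] + beep + audio[int(t1 * 1000):]
    audio.export(out_path, format=out_path.rsplit(".", 1)[-1])
```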
To detect additional PHI types, edit redact.py and add patterns to PHIDetector.PATTERNS:
PATTERNS = {
# ... existing patterns ...
'PATIENT_ID': r'\b(?:patient|pt)[:\s#]*\d{6,10}\b',
'CUSTOM': r'your-regex-here',
}
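A new pattern can be sanity-checked against a snippet of transcript before running a full redaction. For example, for the PATIENT_ID pattern above (the sample text here is made up):

```python
import re

PATIENT_ID = re.compile(r'\b(?:patient|pt)[:\s#]*\d{6,10}\b')

sample = "calling about patient #4821973, seen last Tuesday"
print([m.group() for m in PATIENT_ID.finditer(sample)])
# ['patient #4821973']
```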
Modify PHI_ENTITY_TYPES to include/exclude spaCy entity types:

PHI_ENTITY_TYPES = {'PERSON', 'GPE', 'LOC', 'DATE'} # More conservative

Any format supported by FFmpeg:
- MP3, WAV, M4A, MP4, FLAC, OGG, WMA, AAC, etc.
Limitations:

- Accuracy depends on audio quality and Whisper model size
- May miss PHI in heavily accented speech or poor audio
- Timing of bleeps may occasionally be slightly off
- Does not detect PHI in background conversations
All processing happens locally on your machine. No audio is sent to any external service.
MIT License - Use freely for any purpose.