# Professional Audio-to-Transcript Pipeline with Multiple Providers
A comprehensive, production-ready Python package that transforms video recordings into structured, actionable documentation. Features a unified CLI with interactive TUI, robust error handling, event streaming, and extensive transcription analysis including speaker diarization, topic detection, and sentiment analysis. Supports multiple transcription providers including Deepgram Nova 3, ElevenLabs, OpenAI Whisper, and NVIDIA Parakeet.
# Clone and install
git clone https://github.com/edlsh/audio-extraction-analysis.git
cd audio-extraction-analysis
uv sync
# Install FFmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg

# Option 1: Use Deepgram (cloud-based, full features)
# Get your API key from: https://console.deepgram.com/
export DEEPGRAM_API_KEY='your-key-here'
# Option 2: Use Whisper (local processing, no API key needed)
uv add openai-whisper torch
# Or create .env file for API keys
echo "DEEPGRAM_API_KEY=your-key-here" > .env

# Complete pipeline (concise default): video → audio → transcript → single analysis
audio-extraction-analysis process meeting.mp4
# Process from URL (YouTube, Vimeo, etc.)
audio-extraction-analysis process --url "https://youtube.com/watch?v=..." --output-dir ./results
# Custom output directory
audio-extraction-analysis process video.mp4 --output-dir ./results
# Full 5-file analysis output
audio-extraction-analysis process video.mp4 --analysis-style full --output-dir ./results

MP4/URL → Audio Extraction → Transcription → AI Analysis → Structured Output
   ↓            ↓                  ↓               ↓               ↓
 FFmpeg   Quality Presets     4 Providers    Smart Analysis  Actionable Docs
          (speech/high)       (Cloud/Local)  (GPT/Gemini)    (MD/HTML/JSON)
- Standardized Health Checks: Consistent response format across all providers
- Simplified Codebase: Reduced duplication with shared helpers and centralized exceptions
- Better CLI: Consolidated transcription options with shared argument helpers
- Improved Maintainability: ~175 lines of redundant code removed
- URL Ingestion: Direct processing from YouTube, Vimeo, and other platforms
- Multiple Providers: Deepgram, ElevenLabs, Whisper, Parakeet support
- Interactive TUI: Terminal UI with live progress and health monitoring
- Event Streaming: JSONL output for integration and monitoring
- Enhanced CI/CD: Full test coverage with black, ruff, and bandit checks
- Security Hardening: Path sanitization, input validation, secure temp files
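The security-hardening bullet above covers, among other things, rejecting path traversal in user-supplied paths. A minimal sketch of such a check (illustrative only; the function name and exact validation are assumptions, not the package's actual code):

```python
from pathlib import Path

def sanitize_output_path(user_path: str, base_dir: str) -> Path:
    """Resolve a user-supplied path and refuse anything that escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # is_relative_to requires Python 3.9+; this project targets 3.11+
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes output directory: {user_path}")
    return candidate

# Accepted: stays inside the base directory
safe = sanitize_output_path("results/meeting.md", "./output")
```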
# Full pipeline (recommended)
audio-extraction-analysis process video.mp4 # Complete workflow (concise)
audio-extraction-analysis process video.mp4 --output-dir ./results
audio-extraction-analysis process video.mp4 --analysis-style full --output-dir ./results # five files
# Interactive TUI (Terminal User Interface) - NEW!
audio-extraction-analysis tui # Launch interactive interface
audio-extraction-analysis tui --input video.mp4 # Pre-populate input file
audio-extraction-analysis tui --output-dir ./results # Pre-set output directory
# Individual steps
audio-extraction-analysis extract video.mp4 # Audio extraction only
audio-extraction-analysis transcribe audio.mp3 # Transcription only
# Event streaming (for monitoring/integration) - NEW!
audio-extraction-analysis process video.mp4 --jsonl # Stream events as JSONL
audio-extraction-analysis process video.mp4 --jsonl | jq '.type' # Monitor event types
# Help and info
audio-extraction-analysis --help # Show all commands
audio-extraction-analysis --version # Show version
# Markdown export (timestamps + speakers by default)
audio-extraction-analysis export-markdown audio.mp3 \
--output-dir ./output \
--template default \
--timestamps --speakers
# Or add Markdown export to existing commands
audio-extraction-analysis transcribe audio.mp3 --export-markdown --md-template detailed
audio-extraction-analysis process video.mp4 --export-markdown --md-no-speakers --md-template minimal

| Option | Values | Description |
|---|---|---|
| `--quality` | `speech`, `standard`, `high`, `compressed` | Audio extraction quality |
| `--language` | `en`, `es`, `fr`, `de`, etc. | Transcription language |
| `--output-dir` | Directory path | Where to save results |
| `--analysis-style` | `concise`, `full` | Output style: single analysis vs. 5 files |
| `--verbose` | Flag | Detailed logging |
| `--jsonl` | Flag | Stream events as JSONL for monitoring |
| `--provider` | `auto`, `deepgram`, `elevenlabs`, `whisper`, `parakeet` | Transcription provider selection |
For a guided, interactive experience with live progress monitoring, use the TUI mode:
# Launch interactive TUI
audio-extraction-analysis tui
# Or with pre-populated paths
audio-extraction-analysis tui --input video.mp4 --output-dir ./results

- Live Progress Monitoring: Real-time progress cards for extraction, transcription, and analysis with ETAs
- Log Streaming: Filterable, color-coded logs (DEBUG, INFO, WARNING, ERROR)
- Provider Health: Monitor transcription provider status (Deepgram, ElevenLabs, Whisper, Parakeet)
- Auto-Save Settings: Configuration and recent files persist across sessions
- Keyboard Shortcuts: Full keyboard navigation (press `h` or `?` for help)
- File Browser: Browse and select files with recent files quick access
| Key | Action |
|---|---|
| `h`, `?` | Show help screen |
| `q` | Quit application |
| `d` | Toggle dark mode |
| `Esc` | Go back / Close screen |
In Run Screen:
- `c` - Cancel pipeline
- `o` - Open output directory
- `a`/`d`/`i`/`w`/`e` - Filter logs (all/debug+/info+/warning+/error)
Welcome → Select File → Configure → Run → View Results
   ↓          ↓             ↓         ↓         ↓
 Start      Browse       Settings  Progress  Open Output
# Process a local video file
audio-extraction-analysis process team-meeting.mp4
# Process from URL (YouTube, Vimeo, etc.)
audio-extraction-analysis process --url "https://youtube.com/watch?v=VIDEO_ID"
# Process with custom settings
audio-extraction-analysis process interview.mp4 \
--output-dir ./transcripts \
--quality high \
--language en
# Extract audio only (for manual transcription)
audio-extraction-analysis extract presentation.mp4 --quality speech
# Transcribe existing audio file
audio-extraction-analysis transcribe recording.mp3 --language es

# Process multiple videos
for video in *.mp4; do
audio-extraction-analysis process "$video" --output-dir "./results/${video%.*}"
done

- `speech` (default): Optimized for meetings, interviews
- `standard`: Balanced quality for general content
- `high`: Maximum quality for archival
- `compressed`: Smaller files for quick tests
The TUI provides a visual, interactive interface for processing audio and video files with real-time progress monitoring.
- Live Progress Monitoring: Real-time progress bars with ETAs for each stage
- File Browser: Navigate filesystem with directory tree and recent files
- Configuration Screen: Visual settings editor with persistence
- Run Screen: Live progress cards and scrollable, color-coded logs
- Themes: Dark/light mode toggle (press `d`)
# Basic launch
audio-extraction-analysis tui
# With pre-populated input
audio-extraction-analysis tui --input video.mp4
# With custom output directory
audio-extraction-analysis tui --output-dir ./results

| Key | Action | Context |
|---|---|---|
| `q` | Quit application | Global |
| `d` | Toggle dark/light mode | Global |
| `?` or `h` | Show help | Global |
| `Enter` | Select/Confirm | Navigation |
| `Tab` | Switch panes | Home Screen |
| `/` | Filter files | Home Screen |
| `c` | Cancel pipeline | Run Screen |
| `o` | Open output folder | Run Screen (when complete) |
| `Escape` | Go back | All screens |
Welcome → File Selection → Configuration → Live Processing → Auto-open Results
Stream structured events in JSONL format for monitoring, logging, or integration with other tools.
# Stream events to stdout
audio-extraction-analysis process video.mp4 --jsonl
# Monitor specific event types
audio-extraction-analysis process video.mp4 --jsonl | jq 'select(.type=="stage_progress")'
# Save events to file
audio-extraction-analysis process video.mp4 --jsonl > events.jsonl
# Real-time monitoring with jq
audio-extraction-analysis process video.mp4 --jsonl | jq -r '
  if .type == "stage_start" then
    "▶ Starting: " + .stage
  elif .type == "stage_progress" then
    "⏳ Progress: " + .stage + " " + (.data.completed/.data.total*100|tostring) + "%"
  elif .type == "stage_end" then
    "✔ Completed: " + .stage
  else . end'

- `stage_start`: Stage beginning (extract, transcribe, analyze)
- `stage_progress`: Progress updates with completed/total counts
- `stage_end`: Stage completion with duration
- `artifact`: File created with path and metadata
- `log`, `warning`, `error`: Log messages at various levels
- `summary`: Final metrics and statistics
- `cancelled`: Pipeline cancellation
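Beyond jq, the JSONL stream is easy to consume from any language. A minimal Python sketch (the payload fields such as `data.duration` and `data.path` are assumptions inferred from the event types above, not a documented schema):

```python
import json

def summarize_events(jsonl_lines):
    """Collect per-stage durations and artifact paths from a pipeline event stream."""
    durations, artifacts = {}, []
    for line in jsonl_lines:
        event = json.loads(line)
        if event["type"] == "stage_end":
            durations[event["stage"]] = event["data"]["duration"]
        elif event["type"] == "artifact":
            artifacts.append(event["data"]["path"])
    return durations, artifacts

# Simulated stream; in practice pipe in:
#   audio-extraction-analysis process video.mp4 --jsonl
sample = [
    '{"type": "stage_start", "stage": "extract"}',
    '{"type": "stage_end", "stage": "extract", "data": {"duration": 12.4}}',
    '{"type": "artifact", "data": {"path": "./output/meeting.mp3"}}',
]
durations, artifacts = summarize_events(sample)
print(durations)   # {'extract': 12.4}
print(artifacts)   # ['./output/meeting.mp3']
```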
Depends on `--analysis-style`:

- `{video}.mp3` - Extracted audio
- `{video}_analysis.md` - Single comprehensive analysis
- `{video}_transcript.txt` - Provider-formatted transcript
Each processed video generates 5 structured markdown files:
| File | Purpose | Target Audience |
|---|---|---|
| `01_executive_summary.md` | High-level overview with metadata | Executives, managers |
| `02_chapter_overview.md` | Detailed content breakdown by topic | Project managers, team leads |
| `03_key_topics_and_intents.md` | Technical analysis of discussion themes | Analysts, researchers |
| `04_full_transcript_with_timestamps.md` | Complete searchable record | All stakeholders, archives |
| `05_key_insights_and_takeaways.md` | Strategic insights and action items | Decision makers, implementers |
./output/
├── meeting_2024.mp3                        # Extracted audio
├── 01_executive_summary.md                 # 2-3 KB overview
├── 02_chapter_overview.md                  # 4-5 KB chapters
├── 03_key_topics_and_intents.md            # 5-6 KB analysis
├── 04_full_transcript_with_timestamps.md   # 100+ KB transcript
└── 05_key_insights_and_takeaways.md        # 3-4 KB insights
When exporting markdown, files are organized as:
./output/
└── <source_name>/
    ├── transcript.md
    ├── metadata.json
    └── segments.json
# Required for cloud providers
export DEEPGRAM_API_KEY='your-api-key-here' # Get from console.deepgram.com
export ELEVENLABS_API_KEY='your-api-key-here' # Get from elevenlabs.io/api
# Optional - Whisper configuration (local processing)
export WHISPER_MODEL='base' # tiny, base, small, medium, large
export WHISPER_DEVICE='cuda' # cuda or cpu
export WHISPER_COMPUTE_TYPE='float16' # float16 or float32
# Optional - Parakeet configuration (NVIDIA models)
export PARAKEET_MODEL='stt_en_conformer_ctc_large' # stt_en_conformer_ctc_large, stt_en_conformer_transducer_large, stt_en_fastconformer_ctc_large
export PARAKEET_DEVICE='auto' # auto, cuda or cpu
export PARAKEET_BATCH_SIZE=8 # Batch size for processing
export PARAKEET_BEAM_SIZE=10 # Beam size for decoding
export PARAKEET_USE_FP16=true # Use FP16 for faster processing
export PARAKEET_CHUNK_LENGTH=30 # Audio chunk length in seconds
export PARAKEET_MODEL_CACHE_DIR='~/.cache/parakeet' # Model cache directory
# Optional - General configuration
export LOG_LEVEL='INFO' # DEBUG, INFO, WARNING, ERROR
export TEMP_DIR='/custom/temp/path' # Custom temporary directory
# Optional - TUI configuration persistence
# TUI settings are automatically saved to platform-specific config directory:
# - macOS: ~/Library/Application Support/audio-extraction-analysis/
# - Linux: ~/.config/audio-extraction-analysis/
# - Windows: %APPDATA%\audio-extraction-analysis\

# Create .env file in project root
echo "DEEPGRAM_API_KEY=your-key-here" > .env
echo "ELEVENLABS_API_KEY=your-key-here" >> .env
echo "WHISPER_MODEL=base" >> .env
echo "WHISPER_DEVICE=cuda" >> .env
echo "LOG_LEVEL=DEBUG" >> .env

- `en` - English (default)
- `es` - Spanish
- `fr` - French
- `de` - German
- `it` - Italian
- `pt` - Portuguese
- `auto` - Auto-detect
Whisper supports multiple model sizes with different performance characteristics:
| Model | Parameters | Disk Space | RAM Usage | VRAM Usage | Quality |
|---|---|---|---|---|---|
| tiny | 39M | 75MB | ~1GB | ~1GB | Basic |
| base | 74M | 142MB | ~1GB | ~1GB | Good |
| small | 244M | 461MB | ~2GB | ~2GB | Better |
| medium | 769M | 1.5GB | ~5GB | ~5GB | Great |
| large | 1.5B | 2.9GB | ~10GB | ~10GB | Best |
Set the model size with: `export WHISPER_MODEL=medium`
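The table can also guide programmatic model selection. A hypothetical helper (not part of the package) that picks the largest model fitting a memory budget, using the approximate figures above:

```python
# Approximate memory needs (GB) per model, from the table above.
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
MODEL_ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_whisper_model(available_gb: float) -> str:
    """Return the largest Whisper model whose approximate memory need fits."""
    fitting = [m for m in MODEL_ORDER if MODEL_MEMORY_GB[m] <= available_gb]
    return fitting[-1] if fitting else "tiny"

print(pick_whisper_model(6))   # medium
print(pick_whisper_model(16))  # large
```

The chosen name can then be exported as `WHISPER_MODEL` before running the pipeline.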
Parakeet supports multiple model architectures with different performance characteristics:
| Model | Type | Accuracy | Speed | Memory | Languages |
|---|---|---|---|---|---|
| stt_en_conformer_ctc_large | CTC | High | Fast | 4GB | English |
| stt_en_conformer_transducer_large | RNN-T | Highest | Medium | 6GB | English |
| stt_en_fastconformer_ctc_large | CTC | Medium | Fastest | 2GB | English |
Set the model with: `export PARAKEET_MODEL=stt_en_conformer_ctc_large`
Templates are defined in `src/formatters/templates.py`:

- `default`: Rich header, timestamps, speaker prefixes
- `minimal`: Title only + text
- `detailed`: Report-style header, stats, bold timestamps
Each template string accepts the following placeholders:

- Header: `{title}`, `{source}`, `{duration}`, `{processed_at}`, `{provider}`, `{segment_count}`, `{avg_confidence}`
- Segment: `{timestamp}`, `{speaker_prefix}`, `{text}`, `{confidence}`
To customize, add a new entry to `TEMPLATES` with the keys `header`, `segment`, `speaker_prefix`, and `timestamp_format`.
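A sketch of what a custom entry might look like (illustrative; the real `TEMPLATES` structure in `src/formatters/templates.py` may differ):

```python
# Hypothetical custom template entry using the documented keys.
TEMPLATES = {
    "meeting": {
        "header": "# {title}\nSource: {source} | Provider: {provider} | Duration: {duration}\n",
        "segment": "{timestamp} {speaker_prefix}{text}\n",
        "speaker_prefix": "**{speaker}**: ",
        "timestamp_format": "[%H:%M:%S]",
    },
}

# Rendering one segment with str.format-style placeholders:
tpl = TEMPLATES["meeting"]
line = tpl["segment"].format(
    timestamp="[00:01:15]",
    speaker_prefix=tpl["speaker_prefix"].format(speaker="Alice"),
    text="Let's review the action items.",
)
print(line)  # [00:01:15] **Alice**: Let's review the action items.
```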
- Automatic Chapter Detection: Identifies topic transitions and creates logical sections
- Speaker Separation: Maintains clear attribution with timestamps
- Topic Extraction: Identifies and ranks discussion themes by frequency and importance
- Intent Detection: Recognizes underlying purposes in discussions
- Sentiment Analysis: Tracks positive, neutral, and negative segments
- Insight Generation: Extracts actionable takeaways with supporting evidence
- Unified CLI with TUI: Simple commands and interactive interface for complex workflows
- Quality Presets: Optimized audio extraction for different needs
- Provider Health Checking: Automatic validation and fallback to working providers
- Circuit Breaker Pattern: Fault tolerance with automatic retry and failover
- Event Streaming: Real-time monitoring via JSONL events
- Path Sanitization: Security-hardened file operations
- Error Handling: Robust processing with detailed logging
- Fast Processing: 2-hour meetings processed in ~5-7 minutes
- Configuration Persistence: TUI settings saved across sessions
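The circuit-breaker bullet above can be illustrated with a minimal sketch (a simplified stand-in, not the package's actual implementation):

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; allow a trial call
    (half-open) once the cooldown has elapsed."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: provider temporarily disabled")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

With `--provider auto`, a caller could catch the open-circuit error and fail over to the next healthy provider.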
- Python 3.11+ (3.12 recommended)
- FFmpeg (for audio extraction)
- API Key for cloud providers (Deepgram or ElevenLabs) OR local models (Whisper/Parakeet)
- Internet connection (for cloud API calls)
1. Install FFmpeg: `brew install ffmpeg` (macOS) or `sudo apt install ffmpeg` (Ubuntu)
2. Clone the repository: `git clone <repository-url>`
3. Install the package with the desired features:

   # Basic installation (cloud providers only)
   uv sync
   # With TUI (Terminal User Interface)
   uv sync --extra tui
   # With Whisper (local transcription)
   uv add openai-whisper torch
   # With Parakeet (NVIDIA models)
   uv sync --extra parakeet
   # With all features (recommended for development)
   uv sync --dev --extra tui --extra parakeet --extra yaml --extra redis

4. Configure API keys (for cloud providers):
   - Deepgram: `export DEEPGRAM_API_KEY='your-key'` (get from console.deepgram.com)
   - ElevenLabs: `export ELEVENLABS_API_KEY='your-key'` (get from elevenlabs.io)
5. Test the install: `audio-extraction-analysis --version`
- Input: MP4, MOV, AVI, MKV, MP3, WAV, M4A
- Output: MP3 (audio), Markdown (transcripts)
- File size: Up to 2GB
- Duration: No limit
# Check file path and permissions
ls -la your-file.mp4
# Use absolute path if needed
audio-extraction-analysis process /full/path/to/video.mp4

# Set API key
export DEEPGRAM_API_KEY="your-key-here"
# Or create .env file
echo "DEEPGRAM_API_KEY=your-key-here" > .env
# Get API key from: https://console.deepgram.com/

# Install FFmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg
# Verify: ffmpeg -version

# Install Whisper and PyTorch
uv add openai-whisper torch
# For GPU acceleration (recommended):
uv add openai-whisper torch torchaudio --index-url https://download.pytorch.org/whl/cu118
# Verify installation:
python -c "import whisper; print('Whisper installed successfully')"

# Install Parakeet and NeMo dependencies
uv sync --extra parakeet
# Or directly: uv add "nemo-toolkit[asr]@1.20.0" --extra parakeet
# For GPU acceleration (recommended):
uv add "nemo-toolkit[asr]@1.20.0" torch torchaudio --extra parakeet
# Verify installation:
python -c "import nemo; print('Parakeet installed successfully')"

# Install TUI dependencies
uv sync --extra tui
# Or install Textual directly
uv add "textual>=0.47.0" --extra tui
# Verify TUI works
audio-extraction-analysis tui --help

# Check output directory permissions
ls -la /path/to/output/
# Create directory if needed
mkdir -p /path/to/output

- Use `--quality speech` for faster processing
- Process large files in smaller segments
- Ensure stable internet for API calls
- Monitor disk space for output files
- Input: 2-hour team meeting recording
- Output: Executive summary, action items, decisions
- Time: ~5-7 minutes processing
- Input: Multi-hour training video
- Output: Searchable reference, key concepts, Q&A
- Benefits: Reusable training materials
- Input: Interview recordings
- Output: Insights, pain points, feature requests
- Benefits: Structured feedback analysis
- Input: Long-form content
- Output: Chapter breakdown, topics, quotes
- Benefits: Content repurposing, highlights
- Accuracy: 95%+ with Deepgram Nova 3, 85%+ with Whisper large
- Speed: Real-time processing capability (cloud), 0.5-5x real-time (Whisper)
- Output: 5 files, 100-150KB total
- Languages: 10+ supported languages (Whisper supports 100+ languages)
For detailed information:
- CLI Help (`audio-extraction-analysis --help`) - Discover the latest command syntax directly from the tool
- HTML Dashboard Guide - Learn how to render interactive dashboards from analysis output
- Examples Directory - Browse runnable samples and generated markdown outputs
# Development setup
git clone https://github.com/edlsh/audio-extraction-analysis.git
cd audio-extraction-analysis
uv sync --dev
# Install Whisper for testing
uv add openai-whisper torch
# Install Parakeet for testing
uv add "nemo-toolkit[asr]@1.20.0" --extra parakeet
# Run tests (default - unit tests only)
pytest
# Run specific test profiles
./scripts/run_tests.sh --profile fast # Fast unit tests only
./scripts/run_tests.sh --profile integration # Integration tests
./scripts/run_tests.sh --profile e2e # End-to-end tests
./scripts/run_tests.sh --profile benchmark # Performance benchmarks
./scripts/run_tests.sh --profile all # Complete test suite
# Run with mock provider (no API keys needed)
AUDIO_TEST_MODE=1 pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Code quality checks (CI/CD compliant)
black src/ tests/ # Format code
ruff check src/ tests/ # Lint code
bandit -r src -ll # Security analysis
./scripts/run_static_checks.sh # Run all static checks

This project is provided as-is for professional use. Adapt and modify according to your organization's needs.
Transform your recordings into structured, actionable documentation.