OpenTranscribe is a powerful, containerized web application for transcribing and analyzing audio/video files using state-of-the-art AI models. Built with modern technologies and designed for scalability, it provides an end-to-end solution for speech-to-text conversion, speaker identification, and content analysis.
Note: This application is 99.9% created by AI using Windsurf and various commercial LLMs, demonstrating the power of AI-assisted development.
- High-Accuracy Speech Recognition: Powered by WhisperX with faster-whisper backend
- Word-Level Timestamps: Precise timing for every word using WAV2VEC2 alignment
- Multi-Language Support: Transcribe in multiple languages with automatic English translation
- Batch Processing: 70x realtime speed with large-v2 model on GPU
- Automatic Speaker Diarization: Identify different speakers using PyAnnote.audio
- Cross-Video Speaker Recognition: AI-powered voice fingerprinting to identify speakers across different media files
- Speaker Profile System: Create and manage global speaker profiles that persist across all transcriptions
- AI-Powered Speaker Suggestions: Automatic speaker identification with confidence scores and verification workflow
- Custom Speaker Labels: Edit and manage speaker names and information with intelligent suggestions
- Speaker Analytics: View speaking time distribution, cross-media appearances, and interaction patterns
- Universal Format Support: Audio (MP3, WAV, FLAC, M4A) and Video (MP4, MOV, AVI, MKV)
- Large File Support: Upload files up to 4GB for GoPro and high-quality video content
- Interactive Media Player: Click transcript to navigate playback
- Metadata Extraction: Comprehensive file information using ExifTool
- Subtitle Export: Generate SRT/VTT files for accessibility
- File Reprocessing: Re-run AI analysis while preserving user comments and annotations
- Hybrid Search: Combine keyword and semantic search capabilities
- Full-Text Indexing: Lightning-fast content search with OpenSearch
- Advanced Filtering: Filter by speaker, date, tags, duration, and more
- Smart Tagging: Organize content with custom tags and categories
- Collections System: Group related media files into organized collections for better project management
- Content Analysis: Word count, speaking time, and conversation flow
- Speaker Statistics: Individual speaker metrics and participation
- Sentiment Analysis: Understand tone and emotional content
- Automated Summaries: Generate concise summaries using local LLMs
- Time-Stamped Comments: Add annotations at specific moments
- User Management: Role-based access control (admin/user)
- Export Options: Download transcripts in multiple formats
- Real-Time Updates: Live progress tracking with detailed WebSocket notifications
- Enhanced Progress Tracking: 13 granular processing stages with descriptive messages
- Collection Management: Create, organize, and share collections of related media files
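The hybrid search feature above combines keyword and semantic retrieval. As an illustration only, an OpenSearch-style request for that pattern could be built like this; the field names (transcript_text, content_vector) are assumptions for the sketch, not OpenTranscribe's actual index mapping:

```python
import json


def build_hybrid_query(keywords: str, embedding: list[float], k: int = 10) -> dict:
    """Sketch of a hybrid query: a full-text match clause plus a k-NN
    vector clause, OR-ed together in a bool/should (OpenSearch style).
    Field names are illustrative, not the real schema."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"transcript_text": {"query": keywords}}},
                    {"knn": {"content_vector": {"vector": embedding, "k": k}}},
                ]
            }
        },
    }


# Usage: serialize and POST the body to the search endpoint.
query = build_hybrid_query("quarterly budget", [0.1, 0.2, 0.3], k=5)
print(json.dumps(query, indent=2))
```

Scoring then blends lexical relevance with vector similarity, which is what lets "find related concepts" work alongside exact keyword matches.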
- Svelte - Reactive UI framework with excellent performance
- TypeScript - Type-safe development with modern JavaScript
- Progressive Web App - Offline capabilities and native-like experience
- Responsive Design - Seamless experience across all devices
- FastAPI - High-performance async Python web framework
- SQLAlchemy 2.0 - Modern ORM with type safety
- Celery + Redis - Distributed task processing for AI workloads
- WebSocket - Real-time communication for live updates
- WhisperX - Advanced speech recognition with alignment
- PyAnnote.audio - Speaker diarization and voice analysis
- Faster-Whisper - Optimized inference engine
- Local LLMs - Privacy-focused text processing
- PostgreSQL - Reliable relational database
- MinIO - S3-compatible object storage
- OpenSearch - Full-text and vector search engine
- Docker - Containerized deployment
- NGINX - Production web server
# Required
- Docker and Docker Compose
- 8GB+ RAM (16GB+ recommended)
# Recommended for optimal performance
- NVIDIA GPU with CUDA support
Run this one-liner to download and set up OpenTranscribe using our pre-built Docker Hub images:
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
Then follow the on-screen instructions. The setup script will:
- Download the production Docker Compose file
- Configure environment variables including GPU support (default GPU device ID: 2)
- Help you set up your Hugging Face token (required for speaker diarization)
- Set up the management script (opentranscribe.sh)
Once setup is complete, start OpenTranscribe with:
cd opentranscribe
./opentranscribe.sh start
The Docker images are available on Docker Hub as separate repositories:
- davidamacey/opentranscribe-backend: Backend service (also used for celery-worker and flower)
- davidamacey/opentranscribe-frontend: Frontend service
Access the web interface at http://localhost:5173
- Clone the Repository
git clone https://github.com/davidamacey/OpenTranscribe.git
cd OpenTranscribe
# Make utility script executable
chmod +x opentr.sh
- Environment Configuration
# Copy environment template
cp .env.example .env
# Edit .env file with your settings (optional for development)
# Key variables:
# - HUGGINGFACE_TOKEN (required for speaker diarization)
# - GPU settings for optimal performance
- Start OpenTranscribe
# Start in development mode (with hot reload)
./opentr.sh start dev
# Or start in production mode
./opentr.sh start prod
- Access the Application
  - Web Interface: http://localhost:5173
  - API Documentation: http://localhost:8080/docs
  - Task Monitor: http://localhost:5555/flower
  - Search Engine: http://localhost:9200
  - File Storage: http://localhost:9091
The opentr.sh script provides comprehensive management for all application operations:
# Start the application
./opentr.sh start [dev|prod] # Start in development or production mode
./opentr.sh stop # Stop all services
./opentr.sh status # Show container status
./opentr.sh logs [service] # View logs (all or specific service)
# Service management
./opentr.sh restart-backend # Restart API and workers without database reset
./opentr.sh restart-frontend # Restart frontend only
./opentr.sh restart-all # Restart all services without data loss
# Container rebuilding (after code changes)
./opentr.sh rebuild-backend # Rebuild backend with new code
./opentr.sh rebuild-frontend # Rebuild frontend with new code
./opentr.sh build # Rebuild all containers
# Data operations (⚠️ DESTRUCTIVE)
./opentr.sh reset [dev|prod] # Complete reset - deletes ALL data!
./opentr.sh init-db # Initialize database without container reset
# Backup and restore
./opentr.sh backup # Create timestamped database backup
./opentr.sh restore [file] # Restore from backup file
# Maintenance
./opentr.sh clean # Remove unused containers and images
./opentr.sh health # Check service health status
./opentr.sh shell [service] # Open shell in container
# Available services: backend, frontend, postgres, redis, minio, opensearch, celery-worker
# View specific service logs
./opentr.sh logs backend # API server logs
./opentr.sh logs celery-worker # AI processing logs
./opentr.sh logs frontend # Frontend development logs
./opentr.sh logs postgres # Database logs
# Follow logs in real-time
./opentr.sh logs backend -f
-
User Registration
- Navigate to http://localhost:5173
- Create an account or use default admin credentials
- Set up your profile and preferences
-
Upload Your First File
- Click "Upload Files" or drag-and-drop media files (up to 4GB)
- Supported formats: MP3, WAV, MP4, MOV, and more
- Files are automatically queued for processing
-
Monitor Processing
- Watch detailed real-time progress with 13 processing stages
- View task status in Flower monitor
- Receive live WebSocket notifications for all status changes
-
Explore Your Transcript
- Click on transcript text to navigate media playback
- Edit speaker names and add custom labels
- Add time-stamped comments and annotations
- Reprocess files to improve accuracy while preserving your edits
Automatic Detection → AI Recognition → Profile Management → Cross-Media Tracking
- Speakers are automatically detected and assigned labels using advanced AI diarization
- AI suggests speaker identities based on voice fingerprinting across your media library
- Create global speaker profiles that persist across all your transcriptions
- Accept or reject AI suggestions with confidence scores to improve accuracy over time
- Track speaker appearances across multiple media files with detailed analytics
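Cross-media speaker matching of this kind typically reduces to comparing voice embeddings against stored profiles. A minimal sketch of that idea (the threshold value and profile structure are assumptions for illustration, not OpenTranscribe's internals):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two voice embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_speaker(query: list[float],
                    profiles: dict[str, list[float]],
                    threshold: float = 0.75):
    """Return (name, score) for the best-matching profile, or (None, score)
    if no profile clears the confidence threshold."""
    name, embedding = max(profiles.items(),
                          key=lambda kv: cosine_similarity(query, kv[1]))
    score = cosine_similarity(query, embedding)
    return (name, score) if score >= threshold else (None, score)
```

The returned score plays the role of the confidence value shown in the suggestion workflow; accepting or rejecting a suggestion would then update the stored profile embedding.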
Keyword Search → Semantic Search → Smart Filtering
- Search transcript content with advanced filters
- Use semantic search to find related concepts
- Organize content with custom tags and categories
Create Collections → Organize Files → Bulk Operations
- Group related media files into named collections
- Filter library view by specific collections
- Bulk add/remove files from collections
- Manage collection metadata and descriptions
Multiple Formats → Subtitle Files → API Access
- Export transcripts as TXT, JSON, or CSV
- Generate SRT/VTT subtitle files
- Access data programmatically via REST API
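The SRT files mentioned above follow a simple numbered-cue text format. A sketch of converting timed transcript segments into SRT; the (start, end, text) tuple shape is an assumption for the example, not OpenTranscribe's actual data model:

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    lines = []
    for index, (start, end, text) in enumerate(segments, 1):
        lines += [
            str(index),
            f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}",
            text,
            "",  # blank line separates cues
        ]
    return "\n".join(lines)


print(segments_to_srt([(0.0, 2.5, "Hello, world.")]))
```

VTT differs mainly in its `WEBVTT` header and the use of `.` instead of `,` in timestamps, so the same segment data serves both exports.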
OpenTranscribe/
├── backend/                # Python FastAPI backend
│   ├── app/                # Application modules
│   │   ├── api/            # REST API endpoints
│   │   ├── models/         # Database models
│   │   ├── services/       # Business logic
│   │   ├── tasks/          # Background AI processing
│   │   ├── utils/          # Common utilities
│   │   └── db/             # Database configuration
│   ├── scripts/            # Admin and maintenance scripts
│   ├── tests/              # Comprehensive test suite
│   └── README.md           # Backend documentation
├── frontend/               # Svelte frontend application
│   ├── src/                # Source code
│   │   ├── components/     # Reusable UI components
│   │   ├── routes/         # Page components
│   │   ├── stores/         # State management
│   │   └── styles/         # CSS and themes
│   └── README.md           # Frontend documentation
├── database/               # Database initialization
├── models_ai/              # AI model storage (runtime)
├── scripts/                # Utility scripts
├── docker-compose.yml      # Container orchestration
├── opentr.sh               # Main utility script
└── README.md               # This file
# Database
DATABASE_URL=postgresql://postgres:password@postgres:5432/opentranscribe
# Security
SECRET_KEY=your-super-secret-key-here
JWT_SECRET_KEY=your-jwt-secret-key
# Object Storage
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_BUCKET_NAME=transcribe-app
# Required for speaker diarization - see setup instructions below
HUGGINGFACE_TOKEN=your_huggingface_token_here
# Model configuration
WHISPER_MODEL=large-v2 # large-v2, medium, small, base
COMPUTE_TYPE=float16 # float16, int8
BATCH_SIZE=16 # Reduce if GPU memory limited
# Speaker detection
MIN_SPEAKERS=1 # Minimum speakers to detect
MAX_SPEAKERS=10 # Maximum speakers to detect
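A backend might load these variables along the following lines. This is a minimal sketch whose defaults mirror the values above; it is not the actual OpenTranscribe settings code:

```python
import os
from dataclasses import dataclass


@dataclass
class ModelSettings:
    """Illustrative settings object; defaults match the .env example."""
    whisper_model: str = "large-v2"
    compute_type: str = "float16"
    batch_size: int = 16
    min_speakers: int = 1
    max_speakers: int = 10

    @classmethod
    def from_env(cls) -> "ModelSettings":
        # Environment variables override the defaults when present.
        return cls(
            whisper_model=os.getenv("WHISPER_MODEL", cls.whisper_model),
            compute_type=os.getenv("COMPUTE_TYPE", cls.compute_type),
            batch_size=int(os.getenv("BATCH_SIZE", cls.batch_size)),
            min_speakers=int(os.getenv("MIN_SPEAKERS", cls.min_speakers)),
            max_speakers=int(os.getenv("MAX_SPEAKERS", cls.max_speakers)),
        )
```

Keeping the numeric casts in one place avoids scattering `int(...)` conversions through the task code.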
OpenTranscribe requires a HuggingFace token for speaker diarization and voice fingerprinting features. Follow these steps:
- Visit HuggingFace Settings > Access Tokens
- Click "New token" and select "Read" access
- Copy the generated token
You must accept the user agreements for these models:
- Segmentation Model - Click "Agree and access repository"
- Speaker Diarization Model - Click "Agree and access repository"
Add your token to the environment configuration:
For Production Installation:
# The setup script will prompt you for your token
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
For Manual Installation:
# Add to .env file
echo "HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" >> .env
Note: Without a valid HuggingFace token, speaker diarization will be disabled and speakers will not be automatically detected or identified across different media files.
# GPU settings
USE_GPU=true # Enable GPU acceleration
CUDA_VISIBLE_DEVICES=0 # GPU device selection
# Resource limits
MAX_UPLOAD_SIZE=4GB # Maximum file size (supports GoPro videos)
CELERY_WORKER_CONCURRENCY=2 # Concurrent tasks
For production use, ensure you:
- Security Configuration
# Generate strong secrets
openssl rand -hex 32    # For SECRET_KEY
openssl rand -hex 32    # For JWT_SECRET_KEY
# Set strong database passwords
# Configure proper firewall rules
# Set up SSL/TLS certificates
- Performance Optimization
# Use production environment
NODE_ENV=production
# Configure resource limits
# Set up monitoring and logging
# Configure backup strategies
- Reverse Proxy Setup
# Example NGINX configuration
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:5173;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /api {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
# Start development with hot reload
./opentr.sh start dev
# Backend development
cd backend/
pip install -r requirements.txt
pytest tests/ # Run tests
black app/ # Format code
flake8 app/ # Lint code
# Frontend development
cd frontend/
npm install
npm run dev # Development server
npm run test # Run tests
npm run lint # Lint code
# Backend tests
./opentr.sh shell backend
pytest tests/ # All tests
pytest tests/api/ # API tests only
pytest --cov=app tests/ # With coverage
# Frontend tests
cd frontend/
npm run test # Unit tests
npm run test:e2e # End-to-end tests
npm run test:components # Component tests
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
# Check GPU availability
nvidia-smi
# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Set CPU-only mode if needed
echo "USE_GPU=false" >> .env
# Reduce model size
echo "WHISPER_MODEL=medium" >> .env
echo "BATCH_SIZE=8" >> .env
echo "COMPUTE_TYPE=int8" >> .env
# Monitor memory usage
docker stats
- Use GPU acceleration (USE_GPU=true)
- Reduce model size (WHISPER_MODEL=medium)
- Increase batch size if you have spare GPU memory
- Split large files into smaller segments
# Reset database
./opentr.sh reset dev
# Check database logs
./opentr.sh logs postgres
# Verify database is running
./opentr.sh shell postgres
psql -U postgres -l
# Check service status
./opentr.sh status
# Clean up resources
./opentr.sh clean
# Full reset (⚠️ deletes all data)
./opentr.sh reset dev
- Documentation: Check README files in each component directory
- Issues: Report bugs on GitHub Issues
- Discussions: Ask questions in GitHub Discussions
- Monitoring: Use Flower dashboard for task debugging
- 8GB RAM
- 4 CPU cores
- 50GB disk space
- Any modern GPU (optional but recommended)
- 16GB+ RAM
- 8+ CPU cores
- 100GB+ SSD storage
- NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)
- High-speed internet for model downloads
- 32GB+ RAM
- 16+ CPU cores
- Multiple GPUs for parallel processing
- Fast NVMe storage
- Load balancer for multiple instances
# GPU optimization
COMPUTE_TYPE=float16 # Use half precision
BATCH_SIZE=32 # Increase for more GPU memory
WHISPER_MODEL=large-v2 # Best accuracy
# CPU optimization (if no GPU)
COMPUTE_TYPE=int8 # Use quantization
BATCH_SIZE=1 # Reduce memory usage
WHISPER_MODEL=base # Faster processing
- All processing happens locally - no data sent to external services
- Optional: Disable external model downloads for air-gapped environments
- User data is encrypted at rest and in transit
- Configurable data retention policies
- Role-based permissions (admin/user)
- File ownership validation
- API rate limiting
- Secure session management
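API rate limiting of the kind listed above is commonly implemented as a token bucket. A minimal sketch of the idea (the rate and capacity values are illustrative; this is not OpenTranscribe's actual limiter):

```python
import time


class TokenBucket:
    """Illustrative token-bucket rate limiter: tokens refill at a fixed
    rate up to a capacity; each allowed request spends one token."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if rate-limited."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a FastAPI deployment this logic would typically live in middleware keyed by user or IP, with the bucket state held in Redis rather than process memory.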
- All services run in isolated Docker network
- Configurable firewall rules
- Optional SSL/TLS termination
- Secure default configurations
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper - Foundation speech recognition model
- WhisperX - Enhanced alignment and diarization
- PyAnnote.audio - Speaker diarization capabilities
- FastAPI - Modern Python web framework
- Svelte - Reactive frontend framework
- Docker - Containerization platform
- Documentation: Complete documentation index
- API Reference: http://localhost:8080/docs (when running)
- Task Monitor: http://localhost:5555/flower (when running)
- Contributing: Contribution guidelines
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ using AI assistance and modern open-source technologies.
OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.