
OpenTranscribe Logo

AI-Powered Transcription and Media Analysis Platform

OpenTranscribe is a powerful, containerized web application for transcribing and analyzing audio/video files using state-of-the-art AI models. Built with modern technologies and designed for scalability, it provides an end-to-end solution for speech-to-text conversion, speaker identification, and content analysis.

Note: This application is 99.9% created by AI using Windsurf and various commercial LLMs, demonstrating the power of AI-assisted development.

✨ Key Features

🎧 Advanced Transcription

  • High-Accuracy Speech Recognition: Powered by WhisperX with faster-whisper backend
  • Word-Level Timestamps: Precise timing for every word using WAV2VEC2 alignment
  • Multi-Language Support: Transcribe in multiple languages with automatic English translation
  • Batch Processing: Up to 70x real-time speed with the large-v2 model on GPU

πŸ‘₯ Smart Speaker Management

  • Automatic Speaker Diarization: Identify different speakers using PyAnnote.audio
  • Cross-Video Speaker Recognition: AI-powered voice fingerprinting to identify speakers across different media files
  • Speaker Profile System: Create and manage global speaker profiles that persist across all transcriptions
  • AI-Powered Speaker Suggestions: Automatic speaker identification with confidence scores and verification workflow
  • Custom Speaker Labels: Edit and manage speaker names and information with intelligent suggestions
  • Speaker Analytics: View speaking time distribution, cross-media appearances, and interaction patterns

🎬 Rich Media Support

  • Universal Format Support: Audio (MP3, WAV, FLAC, M4A) and Video (MP4, MOV, AVI, MKV)
  • Large File Support: Upload files up to 4GB for GoPro and high-quality video content
  • Interactive Media Player: Click transcript to navigate playback
  • Metadata Extraction: Comprehensive file information using ExifTool
  • Subtitle Export: Generate SRT/VTT files for accessibility
  • File Reprocessing: Re-run AI analysis while preserving user comments and annotations

πŸ” Powerful Search & Discovery

  • Hybrid Search: Combine keyword and semantic search capabilities
  • Full-Text Indexing: Lightning-fast content search with OpenSearch
  • Advanced Filtering: Filter by speaker, date, tags, duration, and more
  • Smart Tagging: Organize content with custom tags and categories
  • Collections System: Group related media files into organized collections for better project management

πŸ“Š Analytics & Insights

  • Content Analysis: Word count, speaking time, and conversation flow
  • Speaker Statistics: Individual speaker metrics and participation
  • Sentiment Analysis: Understand tone and emotional content
  • Automated Summaries: Generate concise summaries using local LLMs

πŸ’¬ Collaboration Features

  • Time-Stamped Comments: Add annotations at specific moments
  • User Management: Role-based access control (admin/user)
  • Export Options: Download transcripts in multiple formats
  • Real-Time Updates: Live progress tracking with detailed WebSocket notifications
  • Enhanced Progress Tracking: 13 granular processing stages with descriptive messages
  • Collection Management: Create, organize, and share collections of related media files

πŸ› οΈ Technology Stack

Frontend

  • Svelte - Reactive UI framework with excellent performance
  • TypeScript - Type-safe development with modern JavaScript
  • Progressive Web App - Offline capabilities and native-like experience
  • Responsive Design - Seamless experience across all devices

Backend

  • FastAPI - High-performance async Python web framework
  • SQLAlchemy 2.0 - Modern ORM with type safety
  • Celery + Redis - Distributed task processing for AI workloads
  • WebSocket - Real-time communication for live updates

AI/ML Stack

  • WhisperX - Advanced speech recognition with alignment
  • PyAnnote.audio - Speaker diarization and voice analysis
  • Faster-Whisper - Optimized inference engine
  • Local LLMs - Privacy-focused text processing

Infrastructure

  • PostgreSQL - Reliable relational database
  • MinIO - S3-compatible object storage
  • OpenSearch - Full-text and vector search engine
  • Docker - Containerized deployment
  • NGINX - Production web server

πŸš€ Quick Start

Prerequisites

# Required
- Docker and Docker Compose
- 8GB+ RAM (16GB+ recommended)

# Recommended for optimal performance
- NVIDIA GPU with CUDA support
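
Before installing, you can quickly confirm the prerequisites are in place (the GPU check only applies if you have an NVIDIA card):

docker --version                 # Docker must be installed
docker compose version           # Compose v2 (older installs: docker-compose --version)
free -h                          # Check available RAM (8GB+ required)
nvidia-smi                       # Optional: verify the NVIDIA driver is visible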

Quick Installation (Using Docker Hub Images)

Run this one-liner to download and set up OpenTranscribe using our pre-built Docker Hub images:

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash

Then follow the on-screen instructions. The setup script will:

  • Download the production Docker Compose file
  • Configure environment variables including GPU support (default GPU device ID: 2)
  • Help you set up your Hugging Face token (required for speaker diarization)
  • Set up the management script (opentranscribe.sh)

Once setup is complete, start OpenTranscribe with:

cd opentranscribe
./opentranscribe.sh start

The Docker images are available on Docker Hub as separate repositories:

  • davidamacey/opentranscribe-backend: Backend service (also used for celery-worker and flower)
  • davidamacey/opentranscribe-frontend: Frontend service
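
If you prefer to pre-fetch the images manually, a minimal sketch (pulling without a tag assumes a latest tag exists on Docker Hub; the setup script uses whatever tags the production compose file pins):

docker pull davidamacey/opentranscribe-backend     # backend, celery-worker, and flower
docker pull davidamacey/opentranscribe-frontend    # web frontend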

Access the web interface at http://localhost:5173

Manual Installation (From Source)

  1. Clone the Repository

    git clone https://github.com/davidamacey/OpenTranscribe.git
    cd OpenTranscribe
    
    # Make utility script executable
    chmod +x opentr.sh
  2. Environment Configuration

    # Copy environment template
    cp .env.example .env
    
    # Edit .env file with your settings (optional for development)
    # Key variables:
    # - HUGGINGFACE_TOKEN (required for speaker diarization)
    # - GPU settings for optimal performance
  3. Start OpenTranscribe

    # Start in development mode (with hot reload)
    ./opentr.sh start dev
    
    # Or start in production mode
    ./opentr.sh start prod
  4. Access the Application

    Open http://localhost:5173 in your browser and log in or create an account.
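
Once the stack is up, a quick way to confirm everything started cleanly before uploading anything:

./opentr.sh status               # All containers should be running
./opentr.sh health               # Check service health status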

πŸ“‹ OpenTranscribe Utility Commands

The opentr.sh script provides comprehensive management for all application operations:

Basic Operations

# Start the application
./opentr.sh start [dev|prod]     # Start in development or production mode
./opentr.sh stop                 # Stop all services
./opentr.sh status               # Show container status
./opentr.sh logs [service]       # View logs (all or specific service)

Development Workflow

# Service management
./opentr.sh restart-backend      # Restart API and workers without database reset
./opentr.sh restart-frontend     # Restart frontend only
./opentr.sh restart-all          # Restart all services without data loss

# Container rebuilding (after code changes)
./opentr.sh rebuild-backend      # Rebuild backend with new code
./opentr.sh rebuild-frontend     # Rebuild frontend with new code
./opentr.sh build                # Rebuild all containers

Database Management

# Data operations (⚠️ DESTRUCTIVE)
./opentr.sh reset [dev|prod]     # Complete reset - deletes ALL data!
./opentr.sh init-db              # Initialize database without container reset

# Backup and restore
./opentr.sh backup               # Create timestamped database backup
./opentr.sh restore [file]       # Restore from backup file
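
If you want backups on a schedule, a minimal sketch using cron (the checkout path is a placeholder; adjust it to wherever you cloned the repository):

# Run a timestamped backup every night at 02:00
0 2 * * * cd /path/to/OpenTranscribe && ./opentr.sh backup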

System Administration

# Maintenance
./opentr.sh clean                # Remove unused containers and images
./opentr.sh health               # Check service health status
./opentr.sh shell [service]      # Open shell in container

# Available services: backend, frontend, postgres, redis, minio, opensearch, celery-worker

Monitoring and Debugging

# View specific service logs
./opentr.sh logs backend         # API server logs
./opentr.sh logs celery-worker   # AI processing logs
./opentr.sh logs frontend        # Frontend development logs
./opentr.sh logs postgres        # Database logs

# Follow logs in real-time
./opentr.sh logs backend -f

🎯 Usage Guide

Getting Started

  1. User Registration

    • Navigate to http://localhost:5173
    • Create an account or use default admin credentials
    • Set up your profile and preferences
  2. Upload Your First File

    • Click "Upload Files" or drag-and-drop media files (up to 4GB)
    • Supported formats: MP3, WAV, MP4, MOV, and more
    • Files are automatically queued for processing
  3. Monitor Processing

    • Watch detailed real-time progress with 13 processing stages
    • View task status in the Flower monitor, or follow worker logs from the terminal (see the example after this list)
    • Receive live WebSocket notifications for all status changes
  4. Explore Your Transcript

    • Click on transcript text to navigate media playback
    • Edit speaker names and add custom labels
    • Add time-stamped comments and annotations
    • Reprocess files to improve accuracy while preserving your edits
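
If you prefer the terminal, you can follow the AI processing logs while a file is being transcribed:

# Stream the Celery worker logs (Ctrl+C to stop following)
./opentr.sh logs celery-worker -f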

Advanced Features

Speaker Management

πŸ‘₯ Automatic Detection β†’ πŸ€– AI Recognition β†’ 🏷️ Profile Management β†’ πŸ” Cross-Media Tracking
  • Speakers are automatically detected and assigned labels using advanced AI diarization
  • AI suggests speaker identities based on voice fingerprinting across your media library
  • Create global speaker profiles that persist across all your transcriptions
  • Accept or reject AI suggestions with confidence scores to improve accuracy over time
  • Track speaker appearances across multiple media files with detailed analytics

Search and Discovery

πŸ” Keyword Search β†’ 🧠 Semantic Search β†’ 🏷️ Smart Filtering
  • Search transcript content with advanced filters
  • Use semantic search to find related concepts
  • Organize content with custom tags and categories

Collections Management

πŸ“ Create Collections β†’ πŸ“‚ Organize Files β†’ 🏷️ Bulk Operations
  • Group related media files into named collections
  • Filter library view by specific collections
  • Bulk add/remove files from collections
  • Manage collection metadata and descriptions

Export and Integration

πŸ“„ Multiple Formats β†’ πŸ“Ί Subtitle Files β†’ πŸ”— API Access
  • Export transcripts as TXT, JSON, or CSV
  • Generate SRT/VTT subtitle files
  • Access data programmatically via the REST API (a sketch follows below)
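
A minimal sketch of programmatic access. The endpoint path and auth header below are assumptions for illustration, not the documented routes; FastAPI serves interactive API documentation at /docs by default (unless disabled), which lists the real endpoints.

# Hypothetical example: replace the path with a real route from the /docs page.
# Port 8080 matches the backend port used in the reverse proxy example below.
curl -H "Authorization: Bearer $YOUR_API_TOKEN" \
     http://localhost:8080/api/files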

πŸ“ Project Structure

OpenTranscribe/
β”œβ”€β”€ πŸ“ backend/                 # Python FastAPI backend
β”‚   β”œβ”€β”€ πŸ“ app/                # Application modules
β”‚   β”‚   β”œβ”€β”€ πŸ“ api/            # REST API endpoints
β”‚   β”‚   β”œβ”€β”€ πŸ“ models/         # Database models
β”‚   β”‚   β”œβ”€β”€ πŸ“ services/       # Business logic
β”‚   β”‚   β”œβ”€β”€ πŸ“ tasks/          # Background AI processing
β”‚   β”‚   β”œβ”€β”€ πŸ“ utils/          # Common utilities
β”‚   β”‚   └── πŸ“ db/             # Database configuration
β”‚   β”œβ”€β”€ πŸ“ scripts/            # Admin and maintenance scripts
β”‚   β”œβ”€β”€ πŸ“ tests/              # Comprehensive test suite
β”‚   └── πŸ“„ README.md           # Backend documentation
β”œβ”€β”€ πŸ“ frontend/               # Svelte frontend application
β”‚   β”œβ”€β”€ πŸ“ src/                # Source code
β”‚   β”‚   β”œβ”€β”€ πŸ“ components/     # Reusable UI components
β”‚   β”‚   β”œβ”€β”€ πŸ“ routes/         # Page components
β”‚   β”‚   β”œβ”€β”€ πŸ“ stores/         # State management
β”‚   β”‚   └── πŸ“ styles/         # CSS and themes
β”‚   └── πŸ“„ README.md           # Frontend documentation
β”œβ”€β”€ πŸ“ database/               # Database initialization
β”œβ”€β”€ πŸ“ models_ai/              # AI model storage (runtime)
β”œβ”€β”€ πŸ“ scripts/                # Utility scripts
β”œβ”€β”€ πŸ“„ docker-compose.yml      # Container orchestration
β”œβ”€β”€ πŸ“„ opentr.sh               # Main utility script
└── πŸ“„ README.md               # This file

πŸ”§ Configuration

Environment Variables

Core Application

# Database
DATABASE_URL=postgresql://postgres:password@postgres:5432/opentranscribe

# Security
SECRET_KEY=your-super-secret-key-here
JWT_SECRET_KEY=your-jwt-secret-key

# Object Storage
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_BUCKET_NAME=transcribe-app

AI Processing

# Required for speaker diarization - see setup instructions below
HUGGINGFACE_TOKEN=your_huggingface_token_here

# Model configuration
WHISPER_MODEL=large-v2              # large-v2, medium, small, base
COMPUTE_TYPE=float16                # float16, int8
BATCH_SIZE=16                       # Reduce if GPU memory limited

# Speaker detection
MIN_SPEAKERS=1                      # Minimum speakers to detect
MAX_SPEAKERS=10                     # Maximum speakers to detect

πŸ”‘ HuggingFace Token Setup

OpenTranscribe requires a HuggingFace token for speaker diarization and voice fingerprinting features. Follow these steps:

1. Generate HuggingFace Token

  1. Visit HuggingFace Settings > Access Tokens (https://huggingface.co/settings/tokens)
  2. Click "New token" and select "Read" access
  3. Copy the generated token

2. Accept Model User Agreements

You must accept the user agreements for the gated PyAnnote.audio models used for speaker diarization on their HuggingFace model pages. Until the agreements are accepted with the account that issued your token, the diarization pipeline cannot download the model weights.

3. Configure Token

Add your token to the environment configuration:

For Production Installation:

# The setup script will prompt you for your token
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash

For Manual Installation:

# Add to .env file
echo "HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" >> .env

Note: Without a valid HuggingFace token, speaker diarization will be disabled and speakers will not be automatically detected or identified across different media files.
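
A quick way to sanity-check the token before starting a transcription: query the public HuggingFace account endpoint. A valid token returns your account details; an invalid one returns an error.

curl -s -H "Authorization: Bearer $HUGGINGFACE_TOKEN" \
     https://huggingface.co/api/whoami-v2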

Performance Tuning

# GPU settings
USE_GPU=true                        # Enable GPU acceleration
CUDA_VISIBLE_DEVICES=0              # GPU device selection

# Resource limits
MAX_UPLOAD_SIZE=4GB                 # Maximum file size (supports GoPro videos)
CELERY_WORKER_CONCURRENCY=2         # Concurrent tasks

Production Deployment

For production use, ensure you:

  1. Security Configuration

    # Generate strong secrets
    openssl rand -hex 32  # For SECRET_KEY
    openssl rand -hex 32  # For JWT_SECRET_KEY
    
    # Set strong database passwords
    # Configure proper firewall rules
    # Set up SSL/TLS certificates
  2. Performance Optimization

    # Use production environment
    NODE_ENV=production
    
    # Configure resource limits
    # Set up monitoring and logging
    # Configure backup strategies
  3. Reverse Proxy Setup

    # Example NGINX configuration
    server {
        listen 80;
        server_name your-domain.com;
        
        location / {
            proxy_pass http://localhost:5173;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
        
        location /api {
            proxy_pass http://localhost:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
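
Because live progress updates arrive over WebSockets, the reverse proxy also needs to pass connection upgrade headers. A sketch extending the configuration above; the /ws path is an assumption, so adjust it to wherever the backend exposes its WebSocket endpoint:

# Hypothetical WebSocket location; the path is an assumption
location /ws {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}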

πŸ§ͺ Development

Development Environment

# Start development with hot reload
./opentr.sh start dev

# Backend development
cd backend/
pip install -r requirements.txt
pytest tests/                    # Run tests
black app/                       # Format code
flake8 app/                      # Lint code

# Frontend development  
cd frontend/
npm install
npm run dev                      # Development server
npm run test                     # Run tests
npm run lint                     # Lint code

Testing

# Backend tests
./opentr.sh shell backend
pytest tests/                    # All tests
pytest tests/api/                # API tests only
pytest --cov=app tests/          # With coverage

# Frontend tests
cd frontend/
npm run test                     # Unit tests
npm run test:e2e                 # End-to-end tests
npm run test:components          # Component tests

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

πŸ” Troubleshooting

Common Issues

GPU Not Detected

# Check GPU availability
nvidia-smi

# Verify Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Set CPU-only mode if needed
echo "USE_GPU=false" >> .env

Memory Issues

# Reduce model size
echo "WHISPER_MODEL=medium" >> .env
echo "BATCH_SIZE=8" >> .env
echo "COMPUTE_TYPE=int8" >> .env

# Monitor memory usage
docker stats

Slow Transcription

  • Use GPU acceleration (USE_GPU=true)
  • Reduce model size (WHISPER_MODEL=medium)
  • Increase the batch size if you have spare GPU memory
  • Split large files into smaller segments

Database Connection Issues

# Reset database
./opentr.sh reset dev

# Check database logs
./opentr.sh logs postgres

# Verify database is running
./opentr.sh shell postgres psql -U postgres -l

Container Issues

# Check service status
./opentr.sh status

# Clean up resources
./opentr.sh clean

# Full reset (⚠️ deletes all data)
./opentr.sh reset dev

Getting Help

  • πŸ“š Documentation: Check README files in each component directory
  • πŸ› Issues: Report bugs on GitHub Issues
  • πŸ’¬ Discussions: Ask questions in GitHub Discussions
  • πŸ“Š Monitoring: Use Flower dashboard for task debugging

πŸ“ˆ Performance & Scalability

Hardware Recommendations

Minimum Requirements

  • 8GB RAM
  • 4 CPU cores
  • 50GB disk space
  • Any modern GPU (optional but recommended)

Recommended Configuration

  • 16GB+ RAM
  • 8+ CPU cores
  • 100GB+ SSD storage
  • NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)
  • High-speed internet for model downloads

Production Scale

  • 32GB+ RAM
  • 16+ CPU cores
  • Multiple GPUs for parallel processing
  • Fast NVMe storage
  • Load balancer for multiple instances

Performance Tuning

# GPU optimization
COMPUTE_TYPE=float16              # Use half precision
BATCH_SIZE=32                     # Increase if you have more GPU memory
WHISPER_MODEL=large-v2            # Best accuracy

# CPU optimization (if no GPU)
COMPUTE_TYPE=int8                 # Use quantization
BATCH_SIZE=1                      # Reduce memory usage
WHISPER_MODEL=base                # Faster processing

πŸ” Security Considerations

Data Privacy

  • All processing happens locally - no data sent to external services
  • Optional: Disable external model downloads for air-gapped environments (see the sketch after this list)
  • User data is encrypted at rest and in transit
  • Configurable data retention policies
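
A minimal sketch for air-gapped use, assuming the worker containers fetch models through huggingface_hub: pre-download the models while online (they are stored under models_ai/), then set the standard offline flag so no further downloads are attempted. Whether OpenTranscribe honors this flag end-to-end is an assumption; verify it on your deployment.

# Add to .env after the models have been downloaded once
HF_HUB_OFFLINE=1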

Access Control

  • Role-based permissions (admin/user)
  • File ownership validation
  • API rate limiting
  • Secure session management

Network Security

  • All services run in isolated Docker network
  • Configurable firewall rules
  • Optional SSL/TLS termination
  • Secure default configurations

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • OpenAI Whisper - Foundation speech recognition model
  • WhisperX - Enhanced alignment and diarization
  • PyAnnote.audio - Speaker diarization capabilities
  • FastAPI - Modern Python web framework
  • Svelte - Reactive frontend framework
  • Docker - Containerization platform

πŸ”— Useful Links


Built with ❀️ using AI assistance and modern open-source technologies.

OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.
