A Python tool to generate high-accuracy, line-by-line transcripts from any YouTube video with audio. While optimized for song lyrics, it works great for podcasts, interviews, lectures, and other speech content.
- Downloads and transcribes any YouTube video with audio
- Shows progress bars for download and transcription
- Intelligently formats content with natural line breaks
- Supports multiple Whisper model sizes for different accuracy levels
- Special optimization for music and lyrics transcription
- Parallel processing using multiple CPU cores
- GPU acceleration with CUDA support
- Fast mode for quicker transcription
- Beautifully animated progress bars
- Advanced logging with file output support
- Python 3.7+
- FFmpeg (required for audio processing)
- Windows: Download from the FFmpeg website or install with Chocolatey:
choco install ffmpeg
- macOS: Install using Homebrew:
brew install ffmpeg
- Linux: Install using your package manager, e.g., `apt install ffmpeg` or `dnf install ffmpeg`
- Clone or download this repository
- Install the required Python packages:
pip install -r requirements.txt
Basic usage:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
The tool works with any YouTube content containing speech or music:
- Song lyrics
- Podcasts
- Interviews
- Lectures
- Speeches
- Any other audio content
| Flag | Shortcut | Description | Default |
|---|---|---|---|
| `--output FILE` | `-o` | Save lyrics to this file | stdout |
| `--model SIZE` | `-m` | Choose model size (tiny, base, small, medium, large) | medium |
| `--keep-audio` | | Keep the downloaded audio file after processing | False |
| `--audio-output PATH` | | Specify path to save the downloaded audio | |
| `--min-line-duration N` | | Minimum pause duration (seconds) for line breaks | 1.0 |
| `--raw-data` | | Save raw transcription data as JSON | False |
| `--device TYPE` | | Device to use for inference (cpu or cuda) | auto |
| `--chunk-size N` | | Process audio in chunks of this size in seconds | |
| `--fast` | | Enable fast mode (sacrifices some accuracy for speed) | False |
| `--max-duration N` | | Maximum duration (seconds) to transcribe from video start | |
| `--verbose` | `-v` | Enable verbose logging for debugging | False |
| `--log-file FILE` | | Save logs to specified file in addition to console | |
| `--safe-mode` | | Use safe mode without word-level timestamps | False |
Transcribe content with higher accuracy:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -m large -o transcript.txt
Customize line breaks with longer pauses:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --min-line-duration 2.0 -o transcript.txt
Get raw data with timestamps:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --raw-data -o transcript.txt
For fastest transcription with slightly lower quality:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --fast
Use GPU acceleration (if you have a CUDA-compatible GPU):
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --device cuda
Process a long audio file in chunks:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --chunk-size 30
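Conceptually, chunked processing splits the audio timeline into fixed-length windows that can be transcribed independently. A minimal sketch of the boundary arithmetic (illustrative only, not the script's internals):

```python
def chunk_offsets(total_duration, chunk_size):
    """Yield (start, end) offsets in seconds covering the full duration,
    with a shorter final chunk if it doesn't divide evenly."""
    start = 0
    while start < total_duration:
        end = min(start + chunk_size, total_duration)
        yield (start, end)
        start = end

# A 75-second clip processed with --chunk-size 30:
print(list(chunk_offsets(75, 30)))  # [(0, 30), (30, 60), (60, 75)]
```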
Only transcribe the first minute of a video:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --max-duration 60
Fastest possible transcription (GPU + fast mode):
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --device cuda --fast
Enable detailed debug logs:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -v
Save logs to a file for later analysis:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --log-file transcription.log
Both verbose console output and file logging:
python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -v --log-file debug.log
This tool is specially optimized for creating lyrics for Genius.com:
- Lines are intelligently separated based on natural pauses in the song
- Default model is set to "medium" for better lyrics accuracy
- Output is properly formatted and cleaned for easy copy/paste to Genius
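The pause-based line separation can be illustrated with a small sketch. This is not the tool's actual code; it assumes Whisper-style word entries with `text`, `start`, and `end` fields:

```python
def break_into_lines(words, min_line_duration=1.0):
    """Group word-level timestamps into lines, starting a new line
    whenever the silent gap between words exceeds the threshold."""
    lines, current = [], []
    prev_end = None
    for word in words:
        if prev_end is not None and word["start"] - prev_end >= min_line_duration:
            lines.append(" ".join(current))
            current = []
        current.append(word["text"])
        prev_end = word["end"]
    if current:
        lines.append(" ".join(current))
    return lines

# Example: the 1.5 s pause before "up" triggers a line break
words = [
    {"text": "Never", "start": 0.0, "end": 0.4},
    {"text": "gonna", "start": 0.4, "end": 0.8},
    {"text": "give", "start": 0.8, "end": 1.1},
    {"text": "you", "start": 1.1, "end": 1.4},
    {"text": "up", "start": 2.9, "end": 3.2},
]
print(break_into_lines(words))  # ['Never gonna give you', 'up']
```

Raising `--min-line-duration` above the longest pause would merge everything back into a single line.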
- GPU Acceleration: If you have an NVIDIA GPU, use `--device cuda` for significantly faster transcription
- Parallel Processing: For long videos, use `--chunk-size 30` to process the audio in 30-second chunks
- Fast Mode: The `--fast` flag will:
  - Use beam size 1 for faster inference
  - Automatically downgrade from the medium model to the small model for better speed
  - Disable word-level timestamps for faster processing
- Limit Duration: For testing, or when you only need part of a song, use `--max-duration 60` to process only the first minute
- Model Selection: For quick drafts, use `-m small` or `-m base`
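As a rough illustration, the fast-mode adjustments above amount to a handful of settings changes (the option names here are hypothetical, not the script's real internals):

```python
def apply_fast_mode(options):
    """Trade some accuracy for speed, mirroring the --fast behaviour
    described above (hypothetical option names, for illustration)."""
    options["beam_size"] = 1              # narrower beam, faster decoding
    if options.get("model") == "medium":  # smaller model, quicker inference
        options["model"] = "small"
    options["word_timestamps"] = False    # skip per-word alignment
    return options

print(apply_fast_mode({"model": "medium", "beam_size": 5,
                       "word_timestamps": True}))
# {'model': 'medium' -> 'small', beam_size 1, word_timestamps False}
```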
For best results, install in a dedicated virtual environment:
# Create a virtual environment
python -m venv youtube_transcriber_env
# Activate it (Windows)
youtube_transcriber_env\Scripts\activate
# Activate it (Linux/Mac)
source youtube_transcriber_env/bin/activate
# Install dependencies
pip install -r requirements.txt
Alternatively, use uv for faster installation:
# Install uv
pip install uv
# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate # On Linux/Mac
# OR
.venv\Scripts\activate # On Windows
# Install dependencies
uv pip install -r requirements.txt
- The script requires an internet connection to download the YouTube video and (first time only) to download the Whisper model.
- For transcribing speech (interviews, lectures, etc.), the "base" or "small" model often provides sufficient accuracy
- For song lyrics, the "medium" or "large" model provides much better accuracy and is highly recommended
- Adjust the `--min-line-duration` parameter based on content type:
  - Fast speech or songs: try lower values (0.8-1.2 seconds)
  - Slow speech or songs: try higher values (2.0-3.0 seconds)
- Progress bars show download and transcription progress in real time
- Additional dependencies: parallel processing requires `librosa` (`pip install librosa`)
Contributions are welcome! If you'd like to contribute to this project:
- Fork the repository
- Create a new branch (`git checkout -b feature/your-feature-name`)
- Make your changes
- Commit your changes (`git commit -m 'Add some feature'`)
- Push to the branch (`git push origin feature/your-feature-name`)
- Open a Pull Request
Please ensure your code follows the existing style and includes appropriate documentation.
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper - The AI speech recognition model that powers the transcription
- yt-dlp - For YouTube video downloading capabilities
- PyTorch - For the underlying machine learning framework
- tqdm - For the progress bar visualizations