Skip to content

A Python tool that generates high-accuracy transcripts from any YouTube video with audio. Works with songs, podcasts, lectures, and interviews, featuring GPU acceleration and intelligent formatting.

License

Notifications You must be signed in to change notification settings

jaredescott/youtube_transcriber

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

YouTube Video Transcriber

A Python tool to generate high-accuracy, line-by-line transcripts from any YouTube video with audio. While optimized for song lyrics, it works great for podcasts, interviews, lectures, and other speech content.

Features

  • Downloads and transcribes any YouTube videos with audio
  • Shows progress bars for download and transcription
  • Intelligently formats content with natural line breaks
  • Supports multiple Whisper model sizes for different accuracy levels
  • Special optimization for music and lyrics transcription
  • Parallel processing using multiple CPU cores
  • GPU acceleration with CUDA support
  • Fast mode for quicker transcription
  • Beautifully animated progress bars
  • Advanced logging with file output support

Prerequisites

  • Python 3.7+
  • FFmpeg (required for audio processing)

Installing FFmpeg

  • Windows: Download from FFmpeg website or install with Chocolatey: choco install ffmpeg
  • macOS: Install using Homebrew: brew install ffmpeg
  • Linux: Install using package manager, e.g., apt install ffmpeg or dnf install ffmpeg

Installation

  1. Clone or download this repository
  2. Install the required Python packages:
pip install -r requirements.txt

Usage

Basic usage:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

The tool works with any YouTube content containing speech or music:

  • Song lyrics
  • Podcasts
  • Interviews
  • Lectures
  • Speeches
  • Any other audio content

Options

Flag Shortcut Description Default
--output FILE -o Save lyrics to this file stdout
--model SIZE -m Choose model size (tiny, base, small, medium, large) medium
--keep-audio Keep the downloaded audio file after processing False
--audio-output PATH Specify path to save the downloaded audio
--min-line-duration N Minimum pause duration (seconds) for line breaks 1.0
--raw-data Save raw transcription data as JSON False
--device TYPE Device to use for inference (cpu or cuda) auto
--chunk-size N Process audio in chunks of this size in seconds
--fast Enable fast mode (sacrifices some accuracy for speed) False
--max-duration N Maximum duration (seconds) to transcribe from video start
--verbose -v Enable verbose logging for debugging False
--log-file FILE Save logs to specified file in addition to console
--safe-mode Use safe mode without word-level timestamps False

Examples

Transcribe content with higher accuracy:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -m large -o transcript.txt

Customize line breaks with longer pauses:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --min-line-duration 2.0 -o transcript.txt

Get raw data with timestamps:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --raw-data -o transcript.txt

Performance Optimization Examples

For fastest transcription with slightly lower quality:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --fast

Use GPU acceleration (if you have a CUDA-compatible GPU):

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --device cuda

Process a long audio file in chunks:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --chunk-size 30

Only transcribe the first minute of a video:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --max-duration 60

Fastest possible transcription (GPU + fast mode):

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --device cuda --fast

Logging Options

Enable detailed debug logs:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -v

Save logs to a file for later analysis:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" --log-file transcription.log

Both verbose console output and file logging:

python transcribe.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ" -v --log-file debug.log

For Genius.com Users

This tool is specially optimized for creating lyrics for Genius.com:

  1. Lines are intelligently separated based on natural pauses in the song
  2. Default model is set to "medium" for better lyrics accuracy
  3. Output is properly formatted and cleaned for easy copy/paste to Genius

Performance Tips

  • GPU Acceleration: If you have a NVIDIA GPU, use --device cuda for significantly faster transcription
  • Parallel Processing: For long videos, use --chunk-size 30 to process in 30-second chunks
  • Fast Mode: The --fast flag will:
    • Use beam size 1 for faster inference
    • Automatically downgrade from medium to small model for better speed
    • Disable word-level timestamps for faster processing
  • Limit Duration: For testing or when you only need part of a song, use --max-duration 60 to only process the first minute
  • Model Selection: For quick drafts, use -m small or -m base

Virtual Environment Setup

For best results, install in a dedicated virtual environment:

# Create a virtual environment
python -m venv youtube_transcriber_env

# Activate it (Windows)
youtube_transcriber_env\Scripts\activate

# Activate it (Linux/Mac)
source youtube_transcriber_env/bin/activate

# Install dependencies
pip install -r requirements.txt

Alternatively, use uv for faster installation:

# Install uv
pip install uv

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate  # On Linux/Mac
# OR
.venv\Scripts\activate     # On Windows

# Install dependencies
uv pip install -r requirements.txt

Notes

  • The script requires an internet connection to download the YouTube video and (first time only) to download the Whisper model.
  • For transcribing speech (interviews, lectures, etc.), the "base" or "small" model often provides sufficient accuracy
  • For song lyrics, the "medium" or "large" model provides much better accuracy and is highly recommended
  • Adjust the --min-line-duration parameter based on content type:
    • Fast speech or songs: try lower values (0.8-1.2 seconds)
    • Slow speech or songs: try higher values (2.0-3.0 seconds)
  • Progress bars show download and transcription progress in real-time
  • Additional dependencies: For parallel processing, you need to install librosa (pip install librosa)

Contributing

Contributions are welcome! If you'd like to contribute to this project:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/your-feature-name)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some feature')
  5. Push to the branch (git push origin feature/your-feature-name)
  6. Open a Pull Request

Please ensure your code follows the existing style and includes appropriate documentation.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • OpenAI Whisper - The AI speech recognition model that powers the transcription
  • yt-dlp - For YouTube video downloading capabilities
  • PyTorch - For the underlying machine learning framework
  • tqdm - For the progress bar visualizations

About

A Python tool that generates high-accuracy transcripts from any YouTube video with audio. Works with songs, podcasts, lectures, and interviews, featuring GPU acceleration and intelligent formatting.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages