# yt-dbl

Dub any YouTube video into another language — with the original speaker's voice
```
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -t ru
```

> **Warning**
> Apple Silicon ONLY (M1–M4), tested on M4 Pro (48 GB)
One command: download, transcribe, translate (Claude), clone each speaker's voice (Qwen3-TTS), mix with the original background — done. All ML inference runs locally on your Mac's GPU via MLX
- **Human-quality voice cloning:** Qwen3-TTS per speaker, not a generic synth. Multiple speakers are diarized and voiced separately
- **LLM translation:** Claude handles idioms, context, and produces TTS-friendly text — not word-for-word machine translation
- **Background preserved:** BS-RoFormer separates vocals from music/sfx. Sidechain ducking mixes them back naturally
- **Production audio chain:** loudnorm (-16 LUFS), de-essing, pitch-preserving speed-up, equal-power crossfade
- **Checkpoint & resume:** every step saves state. Interrupted? `yt-dbl resume` continues where it stopped
- **Private:** everything local except the Claude API call
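The sidechain-ducking idea used at assembly time can be reproduced with plain ffmpeg. A minimal sketch, not yt-dbl's actual filter graph: the threshold/ratio values and file names are illustrative, and synthetic tones stand in for the separated stems.

```shell
# Two test tones stand in for the separated speech and background stems
ffmpeg -y -loglevel error -f lavfi -i "sine=frequency=440:duration=2" speech.wav
ffmpeg -y -loglevel error -f lavfi -i "sine=frequency=220:duration=2" background.wav

# Sidechain ducking: compress the background with the speech as the key,
# then mix the ducked background back under the speech
ffmpeg -y -loglevel error -i speech.wav -i background.wav -filter_complex \
  "[0:a]asplit=2[key][voice]; \
   [1:a][key]sidechaincompress=threshold=0.03:ratio=8:attack=20:release=300[bg]; \
   [bg][voice]amix=inputs=2:duration=longest[out]" \
  -map "[out]" mixed.wav
```

The key point is that the speech track is used twice — once as the sidechain key, once in the final mix — which is why it is `asplit` first.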
- **TTS (synthesis):** Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian
- **ASR (recognition):** auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
- macOS with Apple Silicon (M1–M4) — MLX needs Metal
- Python >= 3.12
- FFmpeg — audio extraction, postprocessing, final assembly
- yt-dlp — video download
- Anthropic API key — translation via Claude
```
brew install ffmpeg yt-dlp
```

Optional: `brew install ffmpeg-full` for pitch-preserving speed-up via rubberband. Without it, yt-dbl falls back to ffmpeg's `atempo` filter (works fine, just no pitch correction).
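The fallback path is easy to try in isolation. A sketch with a synthetic tone; file names are placeholders, and yt-dbl applies this per synthesized segment:

```shell
# 2-second test tone standing in for a synthesized segment
ffmpeg -y -loglevel error -f lavfi -i "sine=frequency=440:duration=2" segment.wav

# Speed it up 1.4x with atempo (the fallback filter); with ffmpeg-full
# installed, the equivalent rubberband invocation would be
#   -filter:a "rubberband=tempo=1.4"
ffmpeg -y -loglevel error -i segment.wav -filter:a "atempo=1.4" segment_fast.wav
```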
```
# From PyPI
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl
```

`--prerelease=allow` is needed because `mlx-audio` depends on a pre-release `transformers`. If `yt-dbl` is not found, run `uv tool update-shell && source ~/.zshrc`.
From source:

```
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync
```

Use `uv run yt-dbl` instead of `yt-dbl` when running from source.
```
echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc
```

Or use a `.env` file:

```
cp .env.example .env
```

```
YT_DBL_ANTHROPIC_API_KEY=sk-ant-...
```

Models (~8.2 GB) download automatically on first run, or fetch them ahead of time:

```
yt-dbl models download
```

Priority: CLI args > env vars (`YT_DBL_` prefix) > `.env` file > defaults

| Env variable | Default | Description |
|---|---|---|
| `YT_DBL_ANTHROPIC_API_KEY` | — | **Required** — Anthropic API key |
| `YT_DBL_TARGET_LANGUAGE` | `ru` | Target language (ISO 639-1) |
| `YT_DBL_OUTPUT_FORMAT` | `mp4` | `mp4` / `mkv` |
| `YT_DBL_SUBTITLE_MODE` | `softsub` | `softsub` / `hardsub` / `none` |
| `YT_DBL_BACKGROUND_VOLUME` | `0.15` | Background volume during speech (0.0–1.0) |
| `YT_DBL_MAX_SPEED_FACTOR` | `1.4` | Max TTS speed-up to fit timing (1.0–2.0) |
| `YT_DBL_MAX_LOADED_MODELS` | `0` (auto) | Max models in memory (0 = auto by RAM) |
| `YT_DBL_WORK_DIR` | `dubbed` | Output directory |
See `.env.example` for all 33 parameters.
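The settings priority (CLI args > env vars > `.env` > defaults) means a CLI flag always wins. A hypothetical run:

```shell
# The environment asks for Spanish...
export YT_DBL_TARGET_LANGUAGE=es

# ...but the -t flag has higher priority, so this job dubs to German
yt-dbl dub "https://youtu.be/VIDEO_ID" -t de
```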
```
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"         # dub to Russian (default)
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es                  # dub to Spanish
yt-dbl dub "https://youtu.be/VIDEO_ID" -o ./out               # custom output dir
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate  # re-run from a specific step
yt-dbl resume VIDEO_ID                                        # resume after interrupt
yt-dbl status VIDEO_ID                                        # check job progress
```

```
yt-dbl dub <URL> [options]
```

| Option | Description | Default |
|---|---|---|
| `-t, --target-language` | Target language | `ru` |
| `-o, --output-dir` | Output directory | `./dubbed` |
| `--bg-volume` | Background volume (0.0–1.0) | `0.15` |
| `--max-speed` | Max TTS speed-up (1.0–2.0) | `1.4` |
| `--max-models` | Max models in memory | auto |
| `--from-step` | Start from: `download` / `separate` / `transcribe` / `translate` / `synthesize` / `assemble` | — |
| `--no-subs` | Disable subtitles | `false` |
| `--sub-mode` | `softsub` / `hardsub` / `none` | `softsub` |
| `--format` | `mp4` / `mkv` | `mp4` |
```
yt-dbl resume <video_id> [--max-models N] [-o DIR]
yt-dbl status <video_id>
yt-dbl models list       # show models, download status, size
yt-dbl models download   # pre-download all models
```

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  YouTube URL                                  │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  1. DOWNLOAD                                  │
│                                                                               │
│   yt-dlp downloads the video, ffmpeg extracts the audio track                 │
│   Output: video.mp4, audio.wav (48 kHz, mono)                                 │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  2. SEPARATE                                  │
│                                                                               │
│   BS-RoFormer splits audio into vocals and background (PyTorch MPS)           │
│   Output: vocals.wav, background.wav                                          │
└───────────────────────────┬───────────────────────────────────────┬───────────┘
                            │                                       │
                       vocals.wav                            background.wav
                            │                                       │
                            ▼                                       │
┌──────────────────────────────────────────────────────┐            │
│                    3. TRANSCRIBE                     │            │
│                                                      │            │
│   VibeVoice-ASR (MLX, ~5.7 GB)                       │            │
│      → speech segments + speaker diarization         │            │
│   Qwen3-ForcedAligner (MLX, ~600 MB)                 │            │
│      → word-level timestamps                         │            │
│   + language auto-detection via Unicode scripts      │            │
│                                                      │            │
│   Output: segments.json                              │            │
└──────────────────────────┬───────────────────────────┘            │
                           │                                        │
                           ▼                                        │
┌──────────────────────────────────────────────────────┐            │
│                     4. TRANSLATE                     │            │
│                                                      │            │
│   Claude API (auto-batched by token budget)          │            │
│   TTS-friendly output: short phrases, spelled-out    │            │
│   numbers, no special characters                     │            │
│                                                      │            │
│   Output: translations.json, subtitles.srt           │            │
└──────────────────────────┬───────────────────────────┘            │
                           │                                        │
                           ▼                                        │
┌──────────────────────────────────────────────────────┐            │
│                    5. SYNTHESIZE                     │            │
│                                                      │            │
│   Qwen3-TTS (MLX, ~1.7 GB) — voice cloning           │            │
│   using a voice reference for each speaker           │            │
│   Postprocessing (parallel, ThreadPool):             │            │
│      • speed-up (rubberband or atempo)               │            │
│      • loudnorm (-16 LUFS, 2-pass)                   │            │
│      • de-essing                                     │            │
│                                                      │            │
│   Output: segment_0000.wav, segment_0001.wav ...     │            │
└──────────────────────────┬───────────────────────────┘            │
                           │                                        │
                           ▼                                        ▼
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  6. ASSEMBLE                                  │
│                                                                               │
│   Speech track (crossfade 50 ms, equal-power) + background (sidechain         │
│   ducking) + video (copy) + subtitles (softsub / hardsub / none)              │
│   All in a single ffmpeg call                                                 │
│                                                                               │
│   Output: result.mp4                                                          │
└───────────────────────────────────────┬───────────────────────────────────────┘
                                        │
                                        ▼
                              ┌───────────────────┐
                              │    result.mp4     │
                              └───────────────────┘
```
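The 2-pass loudnorm from step 5 can be sketched with bare ffmpeg. A hedged example using a synthetic tone; the `measured_*` values below are placeholders — in a real run they must be copied from the JSON that pass 1 prints:

```shell
# Test tone standing in for a synthesized segment
ffmpeg -y -loglevel error -f lavfi -i "sine=frequency=440:duration=2" seg.wav

# Pass 1: measure loudness; stats are printed as JSON
ffmpeg -y -i seg.wav -af "loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json" -f null -

# Pass 2: normalize using the measured values from pass 1
# (measured_* here are placeholders, not real measurements)
ffmpeg -y -loglevel error -i seg.wav -af \
  "loudnorm=I=-16:TP=-1.5:LRA=11:measured_I=-3.5:measured_TP=-1.0:measured_LRA=0.0:measured_thresh=-13.5:linear=true" \
  seg_norm.wav
```

Two passes matter because `loudnorm` in single-pass mode works dynamically; feeding back the measured values enables linear (non-pumping) normalization.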
LRU model manager — auto-selects how many models to keep loaded based on RAM:

```
RAM            Models   Batch (separation)
─────────────  ───────  ──────────────────
<= 16 GB       1        1
17–31 GB       2        2
32–47 GB       3        4
48+ GB         3        8
```
ASR (~5.7 GB) is unloaded before loading the Aligner to avoid holding both in memory
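The tiering can be expressed as a tiny helper — a sketch mirroring the table above, not yt-dbl's actual code. On macOS the installed RAM is available via `sysctl -n hw.memsize` (bytes):

```shell
# Map installed RAM (GB) to the model/batch tier from the table above
tier() {
  if [ "$1" -le 16 ]; then echo "1 model, batch 1"
  elif [ "$1" -le 31 ]; then echo "2 models, batch 2"
  elif [ "$1" -le 47 ]; then echo "3 models, batch 4"
  else echo "3 models, batch 8"
  fi
}

# On macOS: tier "$(( $(sysctl -n hw.memsize) / 1073741824 ))"
tier 48   # → 3 models, batch 8
```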
```
dubbed/
└── <video_id>/
    ├── state.json              ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4           ← original video
    │   └── audio.wav           ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav          ← isolated vocals
    │   └── background.wav      ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json       ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json   ← translated texts
    │   └── subtitles.srt       ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav  ← speaker voice reference
    │   ├── segment_0000.wav    ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json     ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav          ← assembled speech track
    └── result.mp4              ← final output (in job dir root)
```
| Model | Size | Task |
|---|---|---|
| VibeVoice-ASR | ~5.7 GB | ASR + speaker diarization |
| Qwen3-ForcedAligner | ~600 MB | Word-level alignment |
| Qwen3-TTS | ~1.7 GB | TTS + voice cloning |
| MelBand-RoFormer (BS-RoFormer) | ~200 MB | Vocal/background separation |
| Claude Sonnet 4.5 | — | Translation (API) |
All local models run on MLX (Metal GPU), total ~8.2 GB
```
just check    # lint + format + typecheck + tests
just test     # fast tests (parallel, coverage)
just test-e2e # E2E (needs ffmpeg + network)
just fix      # auto-fix lint
just format   # auto-format
```

License: MIT