Local browser-based transcription app for Apple Silicon Macs. It serves a FastAPI backend and vanilla HTML/CSS/JS frontend, then transcribes audio locally with mlx-whisper or the optional Microsoft VibeVoice ASR backend for diarized output.
- macOS on Apple Silicon
- Python 3.12+
uv- For
vibevoice-7b: enough memory for the 7B ASR model; browser.webmrecordings also need theffmpegcommand available onPATHfor audio decoding
uv syncThe VibeVoice ASR source needed by this app is vendored in vibevoice/ under the upstream MIT license, so no second VibeVoice checkout is required.
uv run python -m app.mainOpen http://127.0.0.1:8000.
Choose a language and model, then either record from the browser microphone or choose an audio file. The UI shows upload and transcription progress, result metadata, and lets you copy or save the transcript as .txt.
Configured model options live in config.yaml:
tiny,base,small,medium,large:mlx-whispermodels cached undertranscription.cache_dirvibevoice-7b: VibeVoice ASR with speaker segments, loaded through the local vendored VibeVoice package
Models download on first use and are reused from their cache afterward. VibeVoice also downloads its Hugging Face model on first real use and can require substantial memory; keep transcription.vibevoice_max_new_tokens conservative unless you know the target machine can handle more.
POST /api/transcriptions
Multipart form fields:
audio: audio file or browser recordinglanguage: configured language code, currentlydeorenmodel_size: configured model id, for examplesmall,large, orvibevoice-7b
Successful response:
{
"transcript": "Text...",
"language": "de",
"model_size": "small",
"duration_seconds": 12.3,
"segments": []
}Error response:
{
"error": {
"code": "unsupported_language",
"message": "Choose German or English."
}
}Invalid requests return 400, oversized uploads return 413, and transcription failures return 500.
Edit config.yaml for runtime settings:
transcription.cache_dir: local model cache directorytranscription.max_upload_size_mb: upload size limittranscription.upload_chunk_size_mb: temp-file streaming chunk sizetranscription.supported_languages: language picker and backend validationtranscription.supported_model_sizes: model picker, repo ids, and backend routingtranscription.vibevoice_*: VibeVoice device, dtype, attention mode, and generation settingslogging.level: structured JSON log levelui.*: browser labels, status messages, and progress timing heuristics
server.host and server.port control the bind address used by uv run python -m app.main.
uv run ruff format .
uv run ruff check .
uv run pytest -vProject layout:
app/main.py: FastAPI app, config loading, frontend template renderingapp/api/: transcription APIapp/services/: backend routing,mlx-whisper, and VibeVoice integrationapp/static/: browser UItests/: API, config, and service testsvibevoice/: vendored VibeVoice ASR model and processor code from Microsoft VibeVoice, MIT licensed
MIT License. Vendored VibeVoice code under vibevoice/ is also MIT licensed; see vibevoice/LICENSE.
