Multilingual speech understanding: ASR + emotion recognition + audio event detection. 50+ languages, 15x faster than Whisper, non-autoregressive.
Updated Jun 29, 2026 - C
Voxtral is a state-of-the-art model developed to handle both speech transcription and audio understanding with remarkable accuracy and efficiency. This demo interface lets you run the Voxtral model on powerful GPUs to evaluate its performance and see how it can be used for transcription and deeper analysis.
TASU: a new style of alignment for speech LLMs that uses only text training data, evaluated zero-shot on ASR and other speech understanding (SU) tasks.