Releases: fixie-ai/ultravox
v0.4.1
We're releasing Ultravox 0.4.1 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.
We'd love to hear feedback on your experience with Ultravox, along with feature suggestions.
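If you want to try the new weights straight from Hugging Face, here's a rough sketch using the Transformers pipeline. The model id and the input format below are assumptions based on our model cards, not a guaranteed API; check the `fixie-ai` org on Hugging Face for the exact repo name and usage.

```python
# Rough sketch: loading Ultravox from Hugging Face via the Transformers pipeline.
# The model id below is an assumption -- see the fixie-ai org on Hugging Face for
# the exact repo name, and the model card for the supported input format.
import numpy as np
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",  # assumed model id
    trust_remote_code=True,
)

# 16 kHz mono float32 audio; one second of silence here as a placeholder.
audio = np.zeros(16000, dtype=np.float32)
turns = [{"role": "system", "content": "You are a friendly and helpful assistant."}]

out = pipe({"audio": audio, "turns": turns, "sampling_rate": 16000}, max_new_tokens=30)
print(out)
```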
What's New
v0.4.1 improves upon 0.4 in the following ways:
- We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo. This has led to quality improvements (see the table below).
- We've added six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian. That brings the total number of supported languages to 15 (see the table below).
- We've increased the amount of English training data.
15 Languages Supported
Language | ISO Code |
---|---|
Arabic | ar |
Chinese | zh |
Dutch | nl |
English | en |
French | fr |
German | de |
Hindi | hi |
Italian | it |
Japanese | ja |
Portuguese | pt |
Russian | ru |
Spanish | es |
Swedish | sv |
Turkish | tr |
Ukrainian | uk |
Evals
Our primary method of evaluation is speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). `ca` is an example of model performance for languages not included in training.
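For context, the BLEU numbers below are corpus-level scores of model translations against reference translations. A minimal, illustrative sketch of computing such a score with the third-party `sacrebleu` package (not our exact eval harness, which lives in this repo) looks like this:

```python
# Minimal sketch of scoring a speech-translation output with corpus BLEU.
# The sentences here are made-up examples for an en_de pair.
import sacrebleu

hypotheses = ["Das Wetter ist heute schön."]            # model outputs
references = [["Das Wetter ist heute sehr schön."]]     # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {score.score:.2f}")  # higher is better
```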
Ultravox 70B
| | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
|---|---|---|
en_ar | 14.97 | 19.64 |
en_de | 30.30 | 32.47 |
es_en | 39.55 | 40.76 |
ru_en | 44.16 | 45.07 |
en_ca | 35.02 | 37.58 |
zh_en | 12.16 | 17.98 |
Ultravox 8B
| | Ultravox 0.4 8B | Ultravox 0.4.1 8B |
|---|---|---|
en_ar | 11.17 | 12.28 |
en_de | 25.47 | 27.13 |
es_en | 37.11 | 39.16 |
ru_en | 38.96 | 39.65 |
en_ca | 27.46 | 29.94 |
zh_en | 10.08 | 14.55 |
Training
This version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the size of the data and the overall training time. The speech adapter was trained on >10k hours of multilingual speech data. The training time on 8xH100s is about 24 hours for the 8B model and 3 days for the 70B model.
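As a rough illustration of that setup (frozen LLM core, trainable speech adapter), here is a hedged PyTorch sketch. The class and function names are placeholders for exposition, not the actual classes in this repo.

```python
# Illustrative sketch of "frozen backbone, trainable adapter" training.
# SpeechAdapter and trainable_parameters are placeholder names, not repo code.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Projects audio-encoder features into the LLM's embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_features)

def trainable_parameters(llm: nn.Module, adapter: nn.Module):
    # Freeze the pre-trained LLM core; only adapter weights receive gradients.
    for p in llm.parameters():
        p.requires_grad = False
    return [p for p in adapter.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(trainable_parameters(llama, adapter), lr=2e-5)
```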
What's Changed
- Bugfix: push_to_hub to use correct model to test by @farzadab in #98
- Integrating OAI evals post training by @farzadab in #85
- Make sure do_eval works without do_train by @farzadab in #100
- Add AutoProcessor registration by @petersalas in #102
- Support num epochs in config by @liPatrick in #90
- Assert dataset length when using epochs by @liPatrick in #104
- Add chunking to ds_tool by @liPatrick in #97
- max_duration for Mosaic jobs by @farzadab in #112
- Not uploading text_config when text_model_id is present by @farzadab in #108
- [70B-Part1] Prefetch weights separately by @farzadab in #106
- [Bugfix] Dot in output_dir causes evals to fail by @farzadab in #115
- Update oaieval dependency by @farzadab in #114
- Bugfix for path replace by @farzadab in #116
- [70B-Part2] Improved save model (that can work with FSDP) by @farzadab in #107
- [70B-Part3] FSDP Training by @farzadab in #109
- [70B-Part4] Config and init_empty_weights by @farzadab in #117
- Update README: use cases for Ultravox training by @farzadab in #118
- Create test for config_base.py by @farzadab in #119
- Using fixie-ai version of peoples_speech by @farzadab in #125
- Dataset Tool to add Timestamps by @farzadab in #121
New Contributors
- @petersalas made their first contribution in #102
Full Changelog: v0.4...v0.4.1
v0.4
Hey everyone,
We're releasing Ultravox 0.4 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox APIs, v0.4 is the new default.
There are two key differences between 0.3 and 0.4:
- We've upgraded the Whisper encoder from Whisper-small to Whisper-medium.
- We've trained on a larger set of multi-lingual data. Previous versions of Ultravox were only trained on English. Supported languages are now `ar`, `de`, `en`, `es`, `fr`, `it`, `ja`, `pt`, and `ru`.
v0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). `ca` and `zh` are examples of model performance for languages not included in training.
| | Ultravox 0.3 | Ultravox 0.4 |
|---|---|---|
en_ar | 9.07 | 28.07 |
en_de | 22.67 | 25.60 |
es_en | 24.10 | 31.03 |
ru_en | 22.52 | 38.96 |
en_ca | 24.87 | 27.49 |
zh_en | 4.26 | 10.08 |
This version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, Peoples Speech, and AnyInstruct. The training time on 8xH100s is roughly 170 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.3 was trained on ~2.5k hours of audio.
We'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.
What's Changed
- Update gradio demo to support text/voice conversation by @zqhuang211 in #75
- Offline batch inference mode by @liPatrick in #82
- Live reload for Gradio demo by @juberti in #89
- Working AutoProcessor.from_pretrained by @farzadab in #92
- Use bfloat16 by default on MPS by @juberti in #95
- Add retry and filter in ds tool by @liPatrick in #81
- Change tokenizer padding_side to left for eval by @zqhuang211 in #96
- Make v0.4 release by @zqhuang211 in #99
Full Changelog: v0.3...v0.4
v0.3
Hey everyone,
We're officially making Ultravox 0.3 available today. The weights have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We’re also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, please go here: https://fixie-ai.github.io/ultradox/
v0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better):
| | Ultravox 0.2 | Ultravox 0.3 |
|---|---|---|
en_de | 12.07 | 22.68 |
es_en | 15.17 | 24.10 |
This version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from both LibriSpeech and CommonVoice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we’re augmenting the ASR data sets with synthetic data in the form of generated continuations. The second change is that we’ve migrated to a Knowledge Distillation approach for calculating loss. Combined, both of these approaches result in much higher speech-to-text alignment in the adapter. You can learn more in their respective papers.
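For the curious, a minimal sketch of what a distillation loss of this flavor can look like is below. It assumes student logits come from the speech model on audio input and teacher logits from the frozen text LLM on the ground-truth transcript, aligned over the same token positions; the names and temperature are illustrative, not necessarily the exact setup used here.

```python
# Illustrative knowledge-distillation loss: KL(teacher || student) over the vocabulary.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    # Softened distributions; kl_div expects log-probs for the student input.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```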
The key benefit of better adapter alignment is that it makes it easier to customize Ultravox to particular needs and use cases by allowing it to extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining core capabilities across modalities. If this is something that interests you, please get in touch.
We’d love feedback on the model, so please let us know what works well and what doesn’t. To make testing easier, we built a new Gradio demo. To run it, simply run `just gradio` inside of the Ultravox folder.
What's Changed
- Remove legacy directory by @farzadab in #1
- Improved Evaluations by @farzadab in #2
- Audio Encoder to bfloat16 by @farzadab in #4
- Whisper encoder + No 30 second padding by @farzadab in #5
- Optionally include "passage" in BoolQ samples by @juberti in #6
- Add tts_tool, for converting a HF dataset to audio by @juberti in #12
- Add logging code by @juberti in #19
- Fixes the default HF model name by @cezarc1 in #13
- Update Hugging Face link by @simonw in #17
- Don't run tests on docs changes by @juberti in #21
- Local tokenizer and processor for more consistent CI by @farzadab in #16
- Tool for uploading to HF Hub by @farzadab in #15
- Remove mlflow dependency by @juberti in #23
- Switch from Pip to Poetry by @juberti in #24
- Tool for adding new synthetic columns by @farzadab in #14
- entails -> provides a rationale for by @farzadab in #27
- Add @file syntax to ds_tool by @juberti in #28
- datasets: Handle converting `int16` audio data in `VoiceSample` by @shaper in #26
- Allow for toggling training and eval on/off by @farzadab in #29
- Add Eleven and Fireworks support to ds_tool by @juberti in #31
- Don't fail basic inference due to missing OAI key by @juberti in #34
- BoolQ for Training and Eval by @farzadab in #30
- Extending `ds_tool` for SODA conversational dataset by @farzadab in #32
- Add streaming support, using HF TextStreamer by @juberti in #46
- Minor fixes to ds_tool and infer_tool by @juberti in #36
- SODA Dataset for Training by @farzadab in #35
- HF pipeline to run Ultravox independent of Ultravox repo by @farzadab in #49
- Runs Tags for filtering by @farzadab in #51
- More validations by @farzadab in #48
- CoVoST 2 dataset by @farzadab in #53
- Speech Translation Evals by @farzadab in #54
- Update ds_tool.py by @zqhuang211 in #52
- Llama3.1 by @farzadab in #56
- Make so infer_tools works with a single arg for filename by @cdiddy77 in #55
- HF Model loading fixes by @farzadab in #59
- Separate files for eval logs by @farzadab in #61
- Add "without any explanation" to ST prompt by @farzadab in #60
- Support KL loss by @zqhuang211 in #63
- [ds_tool] Tools with Audio by @farzadab in #62
- Add basic data_processing test by @juberti in #64
- Add generic dataset by @zqhuang211 in #67
- Fix TypeError: non-default argument 'template' follows default argument, and filter out audio by @liPatrick in #69
- Filter out audio in map sample by @liPatrick in #72
- Update caching to use prefix by @liPatrick in #76
- Add weighted sampling in InterleaveDataset by @zqhuang211 in #70
- Update default config to ultravox_v0.3 by @zqhuang211 in #84
New Contributors
- @cezarc1 made their first contribution in #13
- @simonw made their first contribution in #17
- @shaper made their first contribution in #26
- @cdiddy77 made their first contribution in #55
Full Changelog: https://github.com/fixie-ai/ultravox/commits/v0.3