
Releases: fixie-ai/ultravox

v0.4.1

12 Nov 22:48
812f58c

We're releasing Ultravox 0.4.1 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.
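
If you'd like to try the released weights directly, below is a minimal sketch using the `transformers` pipeline. The repo id (`fixie-ai/ultravox-v0_4_1-llama-3_1-8b`) and the input schema are based on the Hugging Face model card, not these notes; double-check them there before use.

```python
# A minimal sketch of running the released weights, assuming the repo id
# below and the input schema from the Hugging Face model card.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
    trust_remote_code=True,
)

# 16 kHz mono audio plus a chat-style turn list.
audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a friendly, concise assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```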

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions.

What's New

v0.4.1 improves upon 0.4 in the following ways:

  • We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo, which yields quality improvements (see the evals below).
  • We've added six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian, bringing the total to 15 supported languages (see the table below).
  • We've increased the amount of English training data.

15 Languages Supported

| Language   | ISO Code |
|------------|----------|
| Arabic     | ar       |
| Chinese    | zh       |
| Dutch      | nl       |
| English    | en       |
| French     | fr       |
| German     | de       |
| Hindi      | hi       |
| Italian    | it       |
| Japanese   | ja       |
| Portuguese | pt       |
| Russian    | ru       |
| Spanish    | es       |
| Swedish    | sv       |
| Turkish    | tr       |
| Ukrainian  | uk       |

Evals

Our primary method of evaluation is speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). ca is an example of model performance on a language not included in training.
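
For reference, here's a minimal sketch of how a corpus-level BLEU score like those below can be computed, assuming the sacrebleu package; the sentence pair is an illustrative placeholder, not from our eval set.

```python
# A minimal sketch of corpus-level BLEU scoring, assuming sacrebleu.
import sacrebleu

hypotheses = ["Das ist ein kurzer Test."]     # model translations (e.g. en_de)
references = [["Dies ist ein kurzer Test."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # higher is better
```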

Ultravox 70B

| Pair  | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
|-------|------------------|--------------------|
| en_ar | 14.97            | 19.64              |
| en_de | 30.30            | 32.47              |
| es_en | 39.55            | 40.76              |
| ru_en | 44.16            | 45.07              |
| en_ca | 35.02            | 37.58              |
| zh_en | 12.16            | 17.98              |

Ultravox 8B

| Pair  | Ultravox 0.4 8B | Ultravox 0.4.1 8B |
|-------|-----------------|-------------------|
| en_ar | 11.17           | 12.28             |
| en_de | 25.47           | 27.13             |
| es_en | 37.11           | 39.16             |
| ru_en | 38.96           | 39.65             |
| en_ca | 27.46           | 29.94             |
| zh_en | 10.08           | 14.55             |

Training

This version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the size of the data and the overall training time. The speech adapter was trained on >10k hours of multilingual speech data. The training time on 8xH100s is about 24 hours for the 8B model and 3 days for the 70B model.
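
As a rough illustration of the frozen-backbone setup described above, here's a minimal PyTorch sketch; the module names and dimensions are hypothetical stand-ins, not the actual classes in this repo.

```python
# A minimal sketch of freezing the LLM core and training only the speech
# adapter. Names and dimensions are hypothetical stand-ins.
import torch
from torch import nn

class ToySpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(4096, 4096)  # stands in for Llama 3.1
        self.audio_adapter = nn.Linear(1280, 4096)   # stands in for the adapter

model = ToySpeechLM()
for p in model.language_model.parameters():
    p.requires_grad = False  # the pre-trained core stays frozen

optimizer = torch.optim.AdamW(
    [p for p in model.audio_adapter.parameters() if p.requires_grad],
    lr=2e-5,  # illustrative value, not the actual training config
)
```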

What's Changed

New Contributors

Full Changelog: v0.4...v0.4.1

v0.4

27 Aug 01:12
b649b9f

Hey everyone,

We're releasing Ultravox 0.4 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox APIs, v0.4 is the new default.

There are two key differences between 0.3 and 0.4:

  • We've upgraded the Whisper encoder from Whisper-small to Whisper-medium.
  • We've trained on a larger set of multilingual data. Previous versions of Ultravox were trained only on English; supported languages are now ar, de, en, es, fr, it, ja, pt, ru.

v0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). ca and zh are examples of model performance on languages not included in training.

| Pair  | Ultravox 0.3 | Ultravox 0.4 |
|-------|--------------|--------------|
| en_ar | 9.07         | 28.07        |
| en_de | 22.67        | 25.60        |
| es_en | 24.10        | 31.03        |
| ru_en | 22.52        | 38.96        |
| en_ca | 24.87        | 27.49        |
| zh_en | 4.26         | 10.08        |

This version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, People's Speech, and AnyInstruct. The training time on 8xH100s is roughly 170 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.3 was trained on ~2.5k hours of audio.

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.

What's Changed

Full Changelog: v0.3...v0.4

v0.3

23 Aug 00:02
b4a4fc5

Hey everyone,

We're officially making Ultravox 0.3 available today. The weights have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We’re also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, please go here: https://fixie-ai.github.io/ultradox/

v0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better):

| Pair  | Ultravox 0.2 | Ultravox 0.3 |
|-------|--------------|--------------|
| en_de | 12.07        | 22.68        |
| es_en | 15.17        | 24.10        |

This version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from LibriSpeech and Common Voice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.

In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we're augmenting the ASR datasets with synthetic data in the form of generated continuations. The second is that we've migrated to a Knowledge Distillation approach for calculating loss. Combined, these two changes result in much tighter speech-to-text alignment in the adapter. You can learn more in their respective papers.
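
To make the distillation idea concrete, here's a minimal sketch of a KL-divergence distillation loss, assuming the text-input LLM acts as the teacher and the speech-input model as the student; it illustrates the general technique, not this repo's exact loss.

```python
# A minimal sketch of a KL-based knowledge-distillation loss. The teacher
# sees text, the student sees speech; both produce per-token vocab logits.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) over the vocabulary, scaled by T^2 as is standard.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Toy usage: batch of 2 sequences, 5 tokens each, vocab of 128.
student = torch.randn(2, 5, 128)
teacher = torch.randn(2, 5, 128)
print(kd_loss(student, teacher))
```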

The key benefit of tighter adapter alignment is easier customization: Ultravox can extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining core capabilities across modalities. If this is something that interests you, please get in touch.

We'd love feedback on the model, so please let us know what works well and what doesn't. To make testing easier, we built a new Gradio demo; to run it, execute `just gradio` inside the Ultravox folder.

What's Changed

New Contributors

Full Changelog: https://github.com/fixie-ai/ultravox/commits/v0.3