
Releases: fixie-ai/ultravox

v0.4.1

12 Nov 22:48
812f58c

We're releasing Ultravox 0.4.1 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox Realtime APIs, v0.4.1 is the new default.
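
If you'd like to try the released weights directly, below is a minimal sketch using the `transformers` pipeline. The repo id (`fixie-ai/ultravox-v0_4_1-llama-3_1-8b`) and the input schema are based on the Hugging Face model card, not these notes; double-check them there before use.

```python
# A minimal sketch of running the released weights, assuming the repo id
# below and the input schema from the Hugging Face model card.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
    trust_remote_code=True,
)

# 16 kHz mono audio plus a chat-style turn list.
audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a friendly, concise assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```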

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions.

What's New

v0.4.1 improves upon 0.4 in the following ways:

  • We've upgraded the Whisper encoder from Whisper-medium to Whisper-large-v3-turbo, which yields quality improvements (see the evals below).
  • We've added six new languages: Chinese, Dutch, Hindi, Swedish, Turkish, and Ukrainian, bringing the total to 15 supported languages (see the table below).
  • We've increased the amount of English training data.

15 Languages Supported

| Language   | ISO Code |
|------------|----------|
| Arabic     | ar       |
| Chinese    | zh       |
| Dutch      | nl       |
| English    | en       |
| French     | fr       |
| German     | de       |
| Hindi      | hi       |
| Italian    | it       |
| Japanese   | ja       |
| Portuguese | pt       |
| Russian    | ru       |
| Spanish    | es       |
| Swedish    | sv       |
| Turkish    | tr       |
| Ukrainian  | uk       |

Evals

Our primary method of evaluation is speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). ca is an example of model performance on a language not included in training.
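
For reference, here's a minimal sketch of how a corpus-level BLEU score like those below can be computed, assuming the sacrebleu package; the sentence pair is an illustrative placeholder, not from our eval set.

```python
# A minimal sketch of corpus-level BLEU scoring, assuming sacrebleu.
import sacrebleu

hypotheses = ["Das ist ein kurzer Test."]     # model translations (e.g. en_de)
references = [["Dies ist ein kurzer Test."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # higher is better
```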

Ultravox 70B

| Pair  | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
|-------|------------------|--------------------|
| en_ar | 14.97            | 19.64              |
| en_de | 30.30            | 32.47              |
| es_en | 39.55            | 40.76              |
| ru_en | 44.16            | 45.07              |
| en_ca | 35.02            | 37.58              |
| zh_en | 12.16            | 17.98              |

Ultravox 8B

| Pair  | Ultravox 0.4 8B | Ultravox 0.4.1 8B |
|-------|-----------------|-------------------|
| en_ar | 11.17           | 12.28             |
| en_de | 25.47           | 27.13             |
| es_en | 37.11           | 39.16             |
| ru_en | 38.96           | 39.65             |
| en_ca | 27.46           | 29.94             |
| zh_en | 10.08           | 14.55             |

Training

This version of Ultravox continues to use a frozen Llama 3.1 pre-trained core (for both 8B and 70B), but we've significantly increased the size of the data and the overall training time. The speech adapter was trained on >10k hours of multilingual speech data. The training time on 8xH100s is about 24 hours for the 8B model and 3 days for the 70B model.
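
As a rough illustration of the frozen-backbone setup described above, here's a minimal PyTorch sketch; the module names and dimensions are hypothetical stand-ins, not the actual classes in this repo.

```python
# A minimal sketch of freezing the LLM core and training only the speech
# adapter. Names and dimensions are hypothetical stand-ins.
import torch
from torch import nn

class ToySpeechLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(4096, 4096)  # stands in for Llama 3.1
        self.audio_adapter = nn.Linear(1280, 4096)   # stands in for the adapter

model = ToySpeechLM()
for p in model.language_model.parameters():
    p.requires_grad = False  # the pre-trained core stays frozen

optimizer = torch.optim.AdamW(
    [p for p in model.audio_adapter.parameters() if p.requires_grad],
    lr=2e-5,  # illustrative value, not the actual training config
)
```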

What's Changed

New Contributors

Full Changelog: v0.4...v0.4.1

v0.4

27 Aug 01:12
b649b9f

Hey everyone,

We're releasing Ultravox 0.4 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox APIs, v0.4 is the new default.

There are two key differences between 0.3 and 0.4:

  • We've upgraded the Whisper encoder from Whisper-small to Whisper-medium.
  • We've trained on a larger set of multilingual data. Previous versions of Ultravox were trained only on English; supported languages are now ar, de, en, es, fr, it, ja, pt, ru.

v0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). ca and zh are examples of model performance on languages not included in training.

| Pair  | Ultravox 0.3 | Ultravox 0.4 |
|-------|--------------|--------------|
| en_ar | 9.07         | 28.07        |
| en_de | 22.67        | 25.60        |
| es_en | 24.10        | 31.03        |
| ru_en | 22.52        | 38.96        |
| en_ca | 24.87        | 27.49        |
| zh_en | 4.26         | 10.08        |

This version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, People's Speech, and AnyInstruct. The training time on 8xH100s is roughly 170 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.3 was trained on ~2.5k hours of audio.

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.

What's Changed

Full Changelog: v0.3...v0.4

v0.3

23 Aug 00:02
b4a4fc5

Hey everyone,

We're officially making Ultravox 0.3 available today. The weights have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We’re also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, please go here: https://fixie-ai.github.io/ultradox/

v0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better):

| Pair  | Ultravox 0.2 | Ultravox 0.3 |
|-------|--------------|--------------|
| en_de | 12.07        | 22.68        |
| es_en | 15.17        | 24.10        |

This version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from LibriSpeech and Common Voice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.

In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we're augmenting the ASR datasets with synthetic data in the form of generated continuations. The second is that we've migrated to a Knowledge Distillation approach for calculating loss. Combined, these two changes result in much tighter speech-to-text alignment in the adapter. You can learn more in their respective papers.
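
To make the distillation idea concrete, here's a minimal sketch of a KL-divergence distillation loss, assuming the text-input LLM acts as the teacher and the speech-input model as the student; it illustrates the general technique, not this repo's exact loss.

```python
# A minimal sketch of a KL-based knowledge-distillation loss. The teacher
# sees text, the student sees speech; both produce per-token vocab logits.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) over the vocabulary, scaled by T^2 as is standard.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Toy usage: batch of 2 sequences, 5 tokens each, vocab of 128.
student = torch.randn(2, 5, 128)
teacher = torch.randn(2, 5, 128)
print(kd_loss(student, teacher))
```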

The key benefit of tighter adapter alignment is easier customization: Ultravox can extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining core capabilities across modalities. If this is something that interests you, please get in touch.

We'd love feedback on the model, so please let us know what works well and what doesn't. To make testing easier, we built a new Gradio demo; to run it, execute `just gradio` inside the Ultravox folder.

What's Changed

New Contributors

Full Changelog: https://github.com/fixie-ai/ultravox/commits/v0.3