@xenova xenova commented Jul 22, 2025

Example:

```
import { VoxtralForConditionalGeneration, VoxtralProcessor, TextStreamer, read_audio } from "@huggingface/transformers";

// Load the processor and model
const model_id = "onnx-community/Voxtral-Mini-3B-2507-ONNX";
const processor = await VoxtralProcessor.from_pretrained(model_id);
const model = await VoxtralForConditionalGeneration.from_pretrained(
    model_id,
    {
        dtype: {
            embed_tokens: "fp16", // "fp32", "fp16", "q8", "q4"
            audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
            decoder_model_merged: "q4", // "q4", "q4f16"
        },
    },
);

// Prepare the conversation
const conversation = [
    {
        "role": "user",
        "content": [
            { "type": "audio" },
            { "type": "audio" },
            { "type": "text", "text": "Describe these two audio clips in detail." },
        ],
    }
];
const text = processor.apply_chat_template(conversation, { tokenize: false });
const audio = await Promise.all([
    read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000),
    read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000),
]);
const inputs = await processor(text, audio);

// Generate the response
const generated_ids = await model.generate({
    ...inputs,
    max_new_tokens: 256,
    streamer: new TextStreamer(processor.tokenizer, { skip_special_tokens: true, skip_prompt: true }),
});

// Decode the generated tokens
const new_tokens = generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]);
const generated_texts = processor.batch_decode(
    new_tokens,
    { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// The first audio clip is a speech by a leader, likely a politician or a public figure, addressing a large audience. The speaker begins by encouraging the listeners to ask not what their country can do for them, but what they can do for their country. This is a call to action and a reminder of the individual's responsibility to contribute to the nation's well-being. The second audio clip is a passionate speech by a different leader, possibly a civil rights activist or a community organizer. This speaker expresses a dream of a nation that will rise up and live out the true meaning of its creed, suggesting a vision of a more just and equitable society.
```
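
For a simpler starting point, the same calls can be used with a single clip and without streaming. The following is a minimal sketch based only on the example above; the prompt text and variable names are illustrative, not part of the original example.

```
// Minimal single-audio sketch (illustrative), reusing the processor and model loaded above.
const single_conversation = [
    {
        "role": "user",
        "content": [
            { "type": "audio" },
            { "type": "text", "text": "Transcribe this audio clip." },
        ],
    }
];
const single_text = processor.apply_chat_template(single_conversation, { tokenize: false });
const single_audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000);
const single_inputs = await processor(single_text, [single_audio]);

// Generate without a streamer, then decode only the newly generated tokens
const single_ids = await model.generate({ ...single_inputs, max_new_tokens: 128 });
const single_new_tokens = single_ids.slice(null, [single_inputs.input_ids.dims.at(-1), null]);
console.log(processor.batch_decode(single_new_tokens, { skip_special_tokens: true })[0]);
```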

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova xenova merged commit beb1d17 into main Jul 22, 2025
4 checks passed
@xenova xenova deleted the add-voxtral branch July 22, 2025 20:44
xenova added a commit that referenced this pull request Jul 30, 2025
* ONNX Runtime improvements (experimental native webgpu; fix iOS) (#1231)

* customize the wasm paths

* update implementation

* allow using 'webgpu' in nodejs binding

* update version of onnxruntime-node

* Upgrade onnxruntime-web to same version as onnxruntime-node

* Update list of supported devices

---------

Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>

* customize the wasm paths (#1250)

* customize the wasm paths

* update implementation

* [internal] Add is_decoder option to session retrieval for preferred output location

* Update tests

* Formatting

* Bump ort versions

* Bump onnxruntime-node version

* Bump versions

* Bump ORT versions

* Bump versions

* Only check webgpu fp16 for non-node environments

* Fix

* Assume node supports webgpu

* Update ORT node support comment

* Relax test strictness

* Update conversion script versions

* Downgrade onnxslim

* cleanup

* Update package-lock.json

* Update onnxruntime versions

* Update post-build script

* Use built-in session release function

* Call garbage collection after each tokenizer test

* Do not double-throw error

* Fix race-condition in build process with file removal

* Update versions

* Bump jinja version

* [version] Update to 3.6.3

* Bump jinja version to support new features

* [version] Update to 3.6.3

* Add support for LFM2 models (#1367)

* Use prefix in lfm2 output location (#1369)

* Update package-lock.json

* Run `npm audit fix`

* Add special tokens in text-generation pipeline if tokenizer requires (#1370)

* Add special tokens in text-generation pipeline if tokenizer requires

* Fix logits processors tests

* Update bundles.test.js

* Update comment

* Formatting

* Add support for ModernBERT Decoder (#1371)

* Use from/to buffer instead of string

Actually fixes #1343

* Add support for Voxtral (#1373)

* Support longform voxtral processing (#1375)

* [version] Update to 3.7.0

* Add support for Arcee (#1377)

* Optimize tensor.slice() (#1381)

* Optimize tensor.slice()

The performance of `tensor.slice()` is very poor, especially for the 'logits' tensor with large dimensions.

```
const logits = outputs.logits.slice(null, -1, null);
```

This is because the current implementation of the `slice` method manually iterates
through each element and calculates indices, which is very time-consuming when
the tensor shape is large.

For cases like `slice(null, -1, null)`, where the slicing operation is
contiguous along certain dimensions, this can be optimized into a bulk copy
using `TypedArray.subarray()` and `TypedArray.set()` (a minimal sketch of this
idea follows the commit message below).

* nit

* Add a few more tensor slice unit tests

---------

Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
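
To make the contiguous-copy idea from that commit message concrete, here is a minimal, standalone sketch; it is not the transformers.js implementation, and the helper name and shapes are illustrative. For a row-major tensor of shape [batch, seq, vocab], taking the last sequence position (the `slice(null, -1, null)` case) copies one contiguous run of `vocab` elements per batch item, so `TypedArray.subarray()` and `TypedArray.set()` can replace per-element index arithmetic.

```
// Hypothetical helper (illustrative, not the library's API): take the last
// sequence position from a row-major [batch, seq, vocab] buffer.
function sliceLastPosition(data, [batch, seq, vocab]) {
    const out = new data.constructor(batch * vocab);
    for (let b = 0; b < batch; ++b) {
        // Elements at (b, seq - 1, :) are contiguous in memory, so a single
        // subarray + set replaces per-element index computation.
        const start = (b * seq + (seq - 1)) * vocab;
        out.set(data.subarray(start, start + vocab), b * vocab);
    }
    return out;
}

// Example: batch = 2, seq = 3, vocab = 4
const flat = Float32Array.from({ length: 2 * 3 * 4 }, (_, i) => i);
console.log(sliceLastPosition(flat, [2, 3, 4]));
// Float32Array [ 8, 9, 10, 11, 20, 21, 22, 23 ]
```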