[Model] Add Granite Speech Support #16246
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 70a0396 to 05ec868.
Thanks for opening this!
This is fine, it's somewhat like how Phi-4-multimodal is handled. Can you add this model to the examples so users know how to use it? Also please update the supported models page and the processor tests.
Force-pushed from 05ec868 to 090d2e5.
Can you also add this model to […]?
Also please add this model to the Supported Models doc page.
self.query = nn.Parameter(
    torch.zeros(1, self.num_queries,
                config.projector_config.hidden_size))
self.query.data.normal_(mean=0.0, std=1.0)
Is this initialization necessary in vLLM?
It shouldn't be! Will take a pass at removing things like this and switching out some of the linear layers to parallel versions in the next day or so
Removed it 🙂
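(For context, the surviving declaration is just the zero-initialized parameter, since vLLM's weight loading overwrites it with checkpoint values anyway. A minimal sketch based on the snippet above, not the exact diff:)

```python
# Sketch of the simplified declaration: the normal_() init from the HF
# implementation is dropped because vLLM loads checkpoint weights over
# this parameter during weight loading.
self.query = nn.Parameter(
    torch.zeros(1, self.num_queries,
                config.projector_config.hidden_size))
```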
Sorry for the ping before finishing the first requested changes, I think it may have automatically re-requested code owner review when I force pushed! That all sounds good to me, will work on it asap now that the transformers PR is merged.
Hey @DarkLight1337, I think this should be ready for another look when you have a moment! The bug fix for the lora name parsing #17196 is needed for this model to work properly, but things look aligned with the transformers PR when this PR is rebased on top of that one 🙂
@@ -285,6 +285,7 @@ def _test_processing_correctness_mistral(
     "Skywork/Skywork-R1V-38B",
     "fixie-ai/ultravox-v0_5-llama-3_2-1b",
     "openai/whisper-large-v3",
+    "ibm-granite/granite-speech-3.3-8b"
Keep this in alphabetical order of the model architecture (not organization)
def sample(
    self,
    logits: torch.Tensor,
    sampling_metadata: SamplingMetadata,
) -> Optional[SamplerOutput]:
    return self.language_model.sample(logits, sampling_metadata)
This method is not used anymore since #17084
def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int]:
    return {"audio": self.get_max_audio_tokens()}
This method is not used anymore now. We can remove it
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3bc301e to dbf3802.
Otherwise LGTM. Have you verified that the model works correctly on your end?
Force-pushed from dbf3802 to f772264.
Yup, things look right on my side of things! I went ahead and pulled the audio asset fixtures from the ultravox tests into conftest and added a generation test. Also realized the audio placeholder was missing for running online, so added that too 🙂
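(For readers following along: a generation test in that style might look roughly like the sketch below. The `vllm_runner` and `audio_assets` fixture names are assumptions borrowed from the ultravox tests, and the prompt placeholder is assumed; this is not the PR's actual test code.)

```python
# Hypothetical sketch of an audio generation smoke test; fixture names,
# helper signatures, and the prompt placeholder are assumptions.
import pytest

MODEL = "ibm-granite/granite-speech-3.3-8b"


@pytest.mark.parametrize("max_tokens", [32])
def test_audio_generation(vllm_runner, audio_assets, max_tokens):
    prompt = "<|audio|>can you transcribe the speech?"  # assumed placeholder
    audio = audio_assets[0].audio_and_sample_rate
    with vllm_runner(MODEL, enable_lora=True,
                     max_lora_rank=64) as vllm_model:
        outputs = vllm_model.generate_greedy([prompt],
                                             max_tokens,
                                             audios=[audio])
    assert outputs  # smoke check: decoding produced output
```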
LGTM then!
Head branch was pushed to by a user without write access
This PR adds support for Granite Speech models, and is a port of the corresponding PR in Transformers. The model uses a conformer-based encoder with a blip2 qformer-based projector to encode the audio, and masks it into a granite LLM. The model also uses an audio-specific lora adapter, which should only be enabled when the model is processing audio inputs; currently, this means the user needs to make a `LoraRequest` every time they send audio. It is probably a good idea to wait for the transformers PR to be merged so that everything is aligned, but I'm opening this PR in case anyone has feedback 🙂 Unfortunately, a model compatible with this PR is not publicly available yet; I am happy to submit a follow-up PR adding an example, docs, and tests once it is out.
Some quirks that are good to be aware of (kind-of-gross edge cases I am actively looking into):

- The (rank 64) lora is bundled in the same dir as the model. At least in offline mode, it seems that the lora is loaded, but the lora layers are adding zero tensors, which results in unchanged outputs; still looking into this.
- The model is very sensitive. I haven't optimized the conformer implementation yet, and if possible it would be great to hold off on optimizing the conformer layers until we also have tests for alignment with HF once the model is released, since the optimizations in the granite LLM already seem to shift things a bit, and I still need to run a quality benchmark (still looking into whatever is going on with the lora first!).
- Batching is a bit quirky because we don't use a feature attention mask and we zero-pad prior to calculating the Mel features in the HF processor (i.e., the padding indices end up at small negative numbers that depend on the batch, though they are masked out in transformers with a masked scatter, which is the most important thing). Since the static batch is submitted one instance at a time to the processor in vLLM, the features come out unpadded; this PR handles that after the fact by zero-padding the 3D Mel features and torch-splitting the result (see the sketch after this list), though maybe there is a better small negative value to use here.
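(To make the last quirk concrete, a minimal sketch of the pad-then-split workaround; the function name and tensor shapes are assumptions, not the PR's actual code.)

```python
import torch
import torch.nn.functional as F


def pad_and_split(mel_features: list[torch.Tensor]) -> list[torch.Tensor]:
    """Zero-pad per-instance mel features to a common length, then split.

    Assumes each input tensor has shape (num_frames_i, num_mel_bins);
    returns one (1, max_frames, num_mel_bins) tensor per instance.
    """
    max_len = max(f.shape[0] for f in mel_features)
    # F.pad pads the last dim first: (0, 0, 0, n) leaves the mel-bin dim
    # alone and appends n zero frames to the frame dimension.
    padded = torch.stack(
        [F.pad(f, (0, 0, 0, max_len - f.shape[0])) for f in mel_features])
    # Split the stacked (batch, max_len, num_mel_bins) tensor back into
    # per-instance views.
    return list(torch.split(padded, 1, dim=0))
```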
CC @DarkLight1337 @njhill @tlrmchlsmth