[Model] Add Granite Speech Support #16246
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 70a0396 to 05ec868.
Thanks for opening this!
This is fine, it's somewhat like how Phi-4-multimodal is handled. Can you add this model to the examples so users know how to use it? Also please update the supported models page and the processor tests.
Force-pushed from 05ec868 to 090d2e5.
Can you also add this model to […]?
Also please add this model to the Supported Models doc page.
self.query = nn.Parameter(
    torch.zeros(1, self.num_queries,
                config.projector_config.hidden_size))
self.query.data.normal_(mean=0.0, std=1.0)
Is this initialization necessary in vLLM?
It shouldn't be! Will take a pass at removing things like this and switching out some of the linear layers to parallel versions in the next day or so
Removed it 🙂
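(For context, the surviving declaration is just the zero-initialized parameter, since vLLM's weight loading overwrites it with checkpoint values anyway. A minimal sketch based on the snippet above, not the exact diff:)

```python
# Sketch of the simplified declaration: the normal_() init from the HF
# implementation is dropped because vLLM loads checkpoint weights over
# this parameter during weight loading.
self.query = nn.Parameter(
    torch.zeros(1, self.num_queries,
                config.projector_config.hidden_size))
```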
Sorry for the ping before finishing the first requested changes, I think it may have automatically re-requested code owner review when I force pushed! That all sounds good to me, will work on it asap now that the transformers PR is merged.
Hey @DarkLight1337, I think this should be ready for another look when you have a moment! The bug fix for the lora name parsing #17196 is needed for this model to work properly, but things look aligned with the transformers PR when this PR is rebased on top of that one 🙂
@@ -285,6 +285,7 @@ def _test_processing_correctness_mistral(
     "Skywork/Skywork-R1V-38B",
     "fixie-ai/ultravox-v0_5-llama-3_2-1b",
     "openai/whisper-large-v3",
+    "ibm-granite/granite-speech-3.3-8b"
Keep this in alphabetical order of the model architecture (not organization)
def sample(
    self,
    logits: torch.Tensor,
    sampling_metadata: SamplingMetadata,
) -> Optional[SamplerOutput]:
    return self.language_model.sample(logits, sampling_metadata)
This method is not used anymore since #17084
def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int]:
    return {"audio": self.get_max_audio_tokens()}
This method is not used anymore now. We can remove it
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3bc301e to dbf3802.
Otherwise LGTM. Have you verified that the model works correctly on your end?
Force-pushed from dbf3802 to f772264.
Yup, things look right on my side of things! I went ahead and pulled the audio asset fixtures from the ultravox tests into conftest and added a generation test. Also realized the audio placeholder was missing for running online, so added that too 🙂
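(For readers following along: a generation test in that style might look roughly like the sketch below. The `vllm_runner` and `audio_assets` fixture names are assumptions borrowed from the ultravox tests, and the prompt placeholder is assumed; this is not the PR's actual test code.)

```python
# Hypothetical sketch of an audio generation smoke test; fixture names,
# helper signatures, and the prompt placeholder are assumptions.
import pytest

MODEL = "ibm-granite/granite-speech-3.3-8b"


@pytest.mark.parametrize("max_tokens", [32])
def test_audio_generation(vllm_runner, audio_assets, max_tokens):
    prompt = "<|audio|>can you transcribe the speech?"  # assumed placeholder
    audio = audio_assets[0].audio_and_sample_rate
    with vllm_runner(MODEL, enable_lora=True,
                     max_lora_rank=64) as vllm_model:
        outputs = vllm_model.generate_greedy([prompt],
                                             max_tokens,
                                             audios=[audio])
    assert outputs  # smoke check: decoding produced output
```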
LGTM then!
Head branch was pushed to by a user without write access
This PR adds support for Granite Speech models, and is a port of the corresponding PR in Transformers. The model uses a conformer-based encoder with a blip2 qformer-based projector to encode the audio, and masks it into a granite LLM. The model also uses an audio-specific lora adapter, which should only be enabled when the model is processing audio inputs; currently, this means the user needs to make a `LoraRequest` every time they send audio. It is probably a good idea to wait for the transformers PR to be merged so that everything is aligned, but I'm opening this PR in case anyone has feedback 🙂 Unfortunately, a model compatible with this PR is not publicly available yet; I am happy to submit a follow-up PR adding an example, docs, and tests once it is out.
Some quirks that are good to be aware of (kind-of-gross edge cases I am actively looking into):

- The (rank 64) lora is bundled in the same dir as the model. At least in offline mode, it seems that the lora is loaded, but the lora layers are adding zero tensors, which results in unchanged outputs; still looking into this.
- The model is very sensitive. I haven't optimized the conformer implementation yet, and if possible it would be great to hold off on optimizing the conformer layers until we also have tests for alignment with HF once the model is released, since the optimizations in the granite LLM already seem to shift things a bit, and I still need to run a quality benchmark (still looking into whatever is going on with the lora first!).
- Batching is a bit quirky because we don't use a feature attention mask and we zero-pad prior to calculating the Mel features in the HF processor (i.e., the padding indices end up at small negative numbers that depend on the batch, though they are masked out in transformers with a masked scatter, which is the most important thing). Since the static batch is submitted one instance at a time to the processor in vLLM, the features come out unpadded; this PR handles that after the fact by zero-padding the 3D Mel features and torch-splitting the result (see the sketch after this list), though maybe there is a better small negative value to use here.
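(To make the last quirk concrete, a minimal sketch of the pad-then-split workaround; the function name and tensor shapes are assumptions, not the PR's actual code.)

```python
import torch
import torch.nn.functional as F


def pad_and_split(mel_features: list[torch.Tensor]) -> list[torch.Tensor]:
    """Zero-pad per-instance mel features to a common length, then split.

    Assumes each input tensor has shape (num_frames_i, num_mel_bins);
    returns one (1, max_frames, num_mel_bins) tensor per instance.
    """
    max_len = max(f.shape[0] for f in mel_features)
    # F.pad pads the last dim first: (0, 0, 0, n) leaves the mel-bin dim
    # alone and appends n zero frames to the frame dimension.
    padded = torch.stack(
        [F.pad(f, (0, 0, 0, max_len - f.shape[0])) for f in mel_features])
    # Split the stacked (batch, max_len, num_mel_bins) tensor back into
    # per-instance views.
    return list(torch.split(padded, 1, dim=0))
```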
CC @DarkLight1337 @njhill @tlrmchlsmth