
Conversation

@WyldeCat (Contributor) commented Oct 23, 2025

Purpose

Changes

New model for:
https://huggingface.co/Motif-Technologies/Motif-2-12.7B-Base
https://huggingface.co/Motif-Technologies/Motif-2.6b-v1.1-LC
https://huggingface.co/Motif-Technologies/Motif-2.6B

Co-author: @ca1207
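
For reference, a minimal smoke test for a newly supported checkpoint usually looks something like the sketch below (illustrative only, not the test plan used for this PR; the model name is taken from the links above):

from vllm import LLM, SamplingParams

# Load one of the newly supported checkpoints. trust_remote_code is needed
# while the model code still lives in the Hugging Face repo.
llm = LLM(model="Motif-Technologies/Motif-2.6B", trust_remote_code=True)

# Generate a short completion to confirm the model loads and runs end to end.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)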

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

This reverts commit 3125d79.
Signed-off-by: WyldeCat <skan1543@gmail.com>
Signed-off-by: WyldeCat <skan1543@gmail.com>
Signed-off-by: WyldeCat <skan1543@gmail.com>
@mergify bot commented Oct 23, 2025

Documentation preview: https://vllm--27396.org.readthedocs.build/en/27396/

@mergify mergify bot added the documentation, new-model, performance, and v1 labels on Oct 23, 2025
@WyldeCat WyldeCat changed the title from "Re-support Motif Model" to "Re-support MotifForCausalLM" on Oct 23, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request re-introduces support for the Motif model, adding a new PolyNorm layer with a corresponding CUDA kernel and a GroupedDifferentialAttention backend. The implementations of PolyNorm and the new model architecture appear correct and follow best practices. However, I've identified a critical issue in the caching logic of the new GroupedDifferentialAttentionBackend that could lead to incorrect behavior and needs to be addressed.

Comment on lines +680 to +701
if (
    self.kv_sharing_target_layer_name is None
    and key is not None
    and value is not None
):
    # Reshape the input keys and values and store them in the cache.
    # Skip this if sharing KV cache with an earlier attention layer.
    # NOTE(woosuk): Here, key and value are padded while slot_mapping is
    # not padded. However, we don't need to do key[:num_actual_tokens]
    # and value[:num_actual_tokens] because the reshape_and_cache_flash
    # op uses the slot_mapping's shape to determine the number of
    # actual tokens.
    reshape_and_cache_flash(
        key,
        value,
        key_cache,
        value_cache,
        attn_metadata.slot_mapping,
        self.kv_cache_dtype,
        layer._k_scale,
        layer._v_scale,
    )
Contributor comment (severity: critical):

The forward_single_attention method includes a block for caching key and value tensors. This is problematic because the main forward method already performs the necessary caching for the Grouped Differential Attention (GDA) splits (k1, v1 and k2, v2) using the populate_kv_cache method before forward_single_attention is ever called.

This leads to two significant issues:

  1. Redundant Caching: The same key-value pairs are cached multiple times, which is inefficient.
  2. Incorrect Caching: For cross-split attention computations (e.g., Attn(q1, K1, V2)), forward_single_attention is invoked with mismatched key and value tensors (like k1 and v2). The reshape_and_cache_flash call within this method is not designed to handle such cases and will likely corrupt the cache state.

The caching logic should be centralized within the forward method. The forward_single_attention method should then only be responsible for the attention computation, using the already populated cache, without performing any caching itself.

To resolve this, the caching block within forward_single_attention should be removed. The key and value arguments are still necessary for other parts of the function (e.g., descale_shape calculation), so they should remain in the function signature.
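
To make the suggested structure concrete, here is a toy sketch of the intended control flow (the method names populate_kv_cache and forward_single_attention come from the discussion above; the dict-based cache, shapes, and everything else are illustrative, not the actual backend code):

import torch

class ToyGDABackend:
    # Toy illustration: the KV cache is written exactly once in forward(),
    # and forward_single_attention is a pure computation over cached tensors.

    def __init__(self, scale: float):
        self.scale = scale

    def populate_kv_cache(self, cache: dict, name: str, k: torch.Tensor, v: torch.Tensor):
        # Centralized caching, called once per GDA split.
        cache[name] = (k, v)

    def forward_single_attention(self, q, k, v):
        # No cache writes here; mismatched (k, v) pairs can never corrupt state.
        weights = torch.softmax((q @ k.transpose(-1, -2)) * self.scale, dim=-1)
        return weights @ v

    def forward(self, cache, q1, q2, k1, v1, k2, v2):
        self.populate_kv_cache(cache, "split1", k1, v1)
        self.populate_kv_cache(cache, "split2", k2, v2)
        k1c, v1c = cache["split1"]
        k2c, v2c = cache["split2"]
        # Cross-split combinations such as Attn(q1, K1, V2) are now safe.
        return (
            self.forward_single_attention(q1, k1c, v1c),
            self.forward_single_attention(q1, k1c, v2c),
            self.forward_single_attention(q2, k2c, v1c),
            self.forward_single_attention(q2, k2c, v2c),
        )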

@WyldeCat WyldeCat changed the title from "Re-support MotifForCausalLM" to "[Model] Re-support MotifForCausalLM" on Oct 24, 2025
@WyldeCat (Contributor, Author) commented:

Hi @jeejeelee, I hope you’re doing well!
I just wanted to kindly check if there’s anything I should address with this PR.
Since you reviewed our previous PR, I thought I’d ask if everything looks okay or if there’s anything you’d like me to update.
Thanks a lot for your time!

@@ -0,0 +1,744 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator comment:

@DarkLight1337 Who can review this attention backend?

Contributor comment:

Hi @tlrmchlsmth @LucasWilkinson
Just a gentle reminder on the above item. Whenever you have a moment, could you please take a look?
I really appreciate your help.

@tdoublep (Member) commented Nov 7, 2025

How does this new attention backend differ from the V0 differential attention backend that was removed? Is it a more general version? Would we also be able to use it to re-support Phi 4 Flash Reasoning in V1?

@ca1207 (Contributor) commented Nov 7, 2025

@tdoublep
Yes, this is the generalized version of differential attention, which also supports grouped differential attention as proposed in this paper.
If you don’t specify num_q1_heads and num_q2_heads, it behaves the same as the original differential attention.
So I'm quite sure this attention can be used to re-support Phi 4 Flash Reasoning in V1.
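
For readers who haven't seen the mechanism, here is a minimal sketch of plain (ungrouped) differential attention with a shared value tensor, roughly following the DIFF Transformer formulation; the fixed lambda and shapes are simplifications, not this PR's implementation:

import torch

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    # Two attention maps over the same values; their weighted difference
    # suppresses attention noise that both maps share.
    scale = q1.shape[-1] ** -0.5
    a1 = torch.softmax((q1 @ k1.transpose(-1, -2)) * scale, dim=-1)
    a2 = torch.softmax((q2 @ k2.transpose(-1, -2)) * scale, dim=-1)
    return (a1 - lam * a2) @ v

# Grouped differential attention generalizes this by allowing the two groups
# to have different head counts (num_q1_heads / num_q2_heads); with equal
# group sizes it reduces to the original formulation above.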

Signed-off-by: ca1207 <ca1207zzz@gmail.com>
@mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WyldeCat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
@ca1207 (Contributor) commented Nov 11, 2025

Added tensor parallel support to the MotifModel and PolyNorm classes in 2cbf5d8.

@ca1207 (Contributor) commented Nov 11, 2025

Hi @tlrmchlsmth @LucasWilkinson
My apologies for following up persistently.
Please let me know if there are any issues with the PR — I’ll be happy to address them.
Thank you for your time and support.

Signed-off-by: ca1207 <ca1207zzz@gmail.com>
@mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WyldeCat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ca1207 (Contributor) commented Nov 21, 2025

@jeejeelee
Hi, I hope you’re doing well.

Our PR has been open for some time, and I noticed there hasn’t been any recent activity from the reviewers. I completely understand that everyone is busy, but I wanted to kindly ask if there are any suggestions or feedback that could help move the PR forward.

This PR is not only relevant to our model; it also enables differential attention support in v1. As you can see from the mentions above, I believe this could be quite valuable for users who are working on Phi-4 reasoning as well.

Since you kindly reviewed one of my previous PRs, I wanted to reach out and ask for your thoughts whenever you have a moment. Any guidance would be greatly appreciated.

Thank you very much for your time and consideration.

@DarkLight1337 (Member) commented:
@LucasWilkinson @yewentao256 do you have time to take a look?


Labels: documentation, needs-rebase, new-model, nvidia, performance, v1

5 participants