[Misc] Add triton_kernels dependency #27370
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Code Review
This pull request adds triton_kernels as a dependency to support mxfp4 fused MoE operations. The change itself is correct in identifying the necessary package and version. However, there is a critical concern regarding the packaging for Docker. The new dependency is only added to requirements/cuda.txt, which may not be sufficient for it to be included in the final production Docker images. This could lead to runtime failures. I've added a comment with details on how to address this potential issue.
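As a quick, illustrative way to verify the packaging concern above, one could run a check like the following inside the built image. This is a hedged sketch, not part of this PR or of vLLM's build tooling.

```python
# Illustrative check (not part of this PR): run inside the final Docker
# image to confirm that triton_kernels was actually installed there.
import importlib.util

if importlib.util.find_spec("triton_kernels") is None:
    raise SystemExit("triton_kernels is not installed in this image")
print("triton_kernels is available")
```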
)
from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk
from vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe import (
    BatchedOAITritonExperts,
BatchedOAITritonExperts was removed in PR #24588 and I missed removing it from the tests.
For context, even though the matmul_ogs kernel from OpenAI Triton Kernels supports batched mode, it was removed because it is simply a dense GEMM (it does not mask invalid tokens) and is not useful for the WideEP case.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
FYI - this is breaking the nightly installation.

It seems this PR also blocks wheel install (with the same uv pip install error message as above).

@varun-sundar-rabindranath can we revert for now to unblock? We can make our own wheel or copy the kernels over.

Yes. I am reverting this now 👍
Purpose
Add triton_kernels from https://github.com/triton-lang/triton/tree/main/python/triton_kernels as a dependency and pin it to tag v3.5.0.

Why v3.5.0:
triton_kernels is just a sub-directory in the Triton repo. vLLM now supports Torch 2.9, and Torch 2.9 ships with Triton 3.5.0.
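For illustration, a pin of that shape in requirements/cuda.txt could look like the line below. This is one plausible form (a pip VCS reference with a subdirectory pointing at the sub-package), not necessarily the exact line added by this PR.

```
# Hypothetical example of pinning the triton_kernels sub-package to the
# v3.5.0 tag of the Triton repository; the exact line in this PR may differ.
triton_kernels @ git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels
```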
Why add triton_kernels as a dependency?

We use the matmul_ogs function from triton_kernels for mxfp4 fused_moe operations on Hopper. At the moment, this code path is the fastest way to run mxfp4 models on Hopper, but users have to install triton_kernels manually to access it; with this change, they can use it out of the box.
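To illustrate the out-of-the-box point, here is a minimal sketch of the usual gating pattern for an optional dependency. It assumes the import path triton_kernels.matmul_ogs and uses placeholder backend names; it is not vLLM's actual code.

```python
# Minimal sketch (not vLLM's actual code): gate the fast mxfp4 path on
# whether the optional triton_kernels package is installed.
try:
    # Assumed import path for the matmul_ogs kernel from triton_kernels.
    from triton_kernels.matmul_ogs import matmul_ogs  # noqa: F401
    HAS_TRITON_KERNELS = True
except ImportError:
    HAS_TRITON_KERNELS = False


def pick_mxfp4_moe_backend() -> str:
    """Prefer the triton_kernels matmul_ogs path when available (e.g. on Hopper)."""
    if HAS_TRITON_KERNELS:
        return "triton_kernels_matmul_ogs"
    # Placeholder for whatever fallback implementation is configured.
    return "fallback_mxfp4_backend"
```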
Test Plan

Tried a fresh build locally and executed:
TP:

vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --no-enable-prefix-caching

DP:

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-120b --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching

Test Result
On Hopper, both commands default to using the Triton implementation for mxfp4.
Both commands produce reasonable gpt_oss eval metrics.
Solves issue #26582.