@jmkuebler jmkuebler commented Sep 15, 2025

This PR adds an optimized tile size to speed up decoding for models with head_dim 64 when running in FP8. This applies, for example, to GPT-OSS.
See vllm-project/vllm#24916 (optimization 3) for the improvement.

cc @LucasWilkinson

jmkuebler and others added 2 commits September 15, 2025 20:27
Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM; thanks for the contribution!

@LucasWilkinson LucasWilkinson merged commit 4695e6b into vllm-project:main Sep 19, 2025
1 check passed