fix block-size description #10938
Conversation
This reverts commit 69ba344.
Hey, while the GPU kernels might not support block sizes greater than 32, other accelerators do. On HPU, going below block size 128 is very detrimental to performance. I don't think the option to use larger block sizes should be removed; rather, a proper assertion and log message should be added.
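A minimal sketch of the kind of check being suggested, on the CUDA side (the placement and wording are assumptions, not vLLM's actual code):

```cpp
// Hypothetical sketch: validate the block size at kernel dispatch time
// instead of restricting the CLI option for every platform.
// TORCH_CHECK is PyTorch's standard C++ assertion macro; it raises a
// RuntimeError with the given message when the condition is false.
TORCH_CHECK(block_size == 8 || block_size == 16 || block_size == 32,
            "CUDA paged attention kernels support block sizes 8, 16, and 32; "
            "got ", block_size,
            ". Larger block sizes may still be valid on other backends (e.g. HPU).");
```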
This reverts commit 69ba344. Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Thanks!
Per the official vLLM docs at https://docs.vllm.ai/en/stable/models/engine_args.html, the --block-size parameter can take a value from {8, 16, 32, 64, 128}. However, I hit "RuntimeError: Unsupported block size: 128" when I tried --block-size 128.
I hunted through the code and found the following logic in csrc/attention/paged_attention_v2.cu:
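(The quoted snippet did not survive extraction; below is a sketch of the dispatch it referses to. The launcher macro name is abbreviated here and is an assumption.)

```cpp
// Sketch of the block-size dispatch in paged_attention_v2.cu:
// only 8, 16, and 32 are handled; any other value falls through to the
// default branch and raises the "Unsupported block size" error above.
switch (block_size) {
  case 8:
    CALL_V2_LAUNCHER(8);
    break;
  case 16:
    CALL_V2_LAUNCHER(16);
    break;
  case 32:
    CALL_V2_LAUNCHER(32);
    break;
  default:
    TORCH_CHECK(false, "Unsupported block size: ", block_size);
    break;
}
```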
Similar logic appears in csrc/attention/paged_attention_v1.cu.
This PR updates the parameter's allowed values to match the actual kernel logic.