[Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl. #2363
base: main
Conversation
Code Review
This pull request refactors the MLA v1 attention implementation by moving all preprocessing operations into the AscendMLAImpl class. This is a good architectural improvement that centralizes the attention logic and simplifies the model code. However, I've found two critical bugs in the new implementation in vllm_ascend/attention/mla_v1.py that will cause runtime errors due to incorrect function calls. Please see the detailed comments for fixes.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
In order to support fused kernels, multi-stream execution, communication optimization, etc., it is better to aggregate all operations of the Attention layer together. This PR refactors mla_v1 by moving all MLA preprocessing ops into the mla_v1 attention impl.
Later I will provide a diagram showing the structure of the refactored mla_v1.
Note that the new mla_v1 does not take torchair into consideration, so this PR can only be merged after the torchair-related mla_v1 is isolated into a new file. A rough sketch of the intended structure is given below.
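For context, here is a minimal, hypothetical PyTorch sketch (not the actual vllm_ascend code) of what "all MLA preprocessing inside the attention impl" can look like: the latent down-projection, up-projection, and head reshaping live in the impl's forward, so the model layer only hands over hidden states. All class, parameter, and shape names below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AscendMLAImplSketch(nn.Module):
    """Illustrative sketch only: MLA preprocessing (latent KV down-projection,
    up-projection, head split) is owned by the attention impl instead of the
    model code, making it easier to fuse kernels or overlap ops on a second
    stream later. Names and shapes are hypothetical, not the real API."""

    def __init__(self, hidden_size: int, kv_lora_rank: int,
                 num_heads: int, head_dim: int) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        # Preprocessing projections that previously lived in the model layer.
        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
        self.kv_down_proj = nn.Linear(hidden_size, kv_lora_rank, bias=False)
        self.kv_up_proj = nn.Linear(kv_lora_rank, 2 * num_heads * head_dim,
                                    bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # All preprocessing happens inside the impl: project to the compressed
        # latent, expand to K/V, split heads, then run attention.
        bsz, seq_len, _ = hidden_states.shape
        q = self.q_proj(hidden_states)
        kv_latent = self.kv_down_proj(hidden_states)   # compressed KV latent
        k, v = self.kv_up_proj(kv_latent).chunk(2, dim=-1)

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(bsz, seq_len, self.num_heads,
                          self.head_dim).transpose(1, 2)

        out = torch.nn.functional.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v))
        return out.transpose(1, 2).reshape(bsz, seq_len, -1)


if __name__ == "__main__":
    impl = AscendMLAImplSketch(hidden_size=256, kv_lora_rank=64,
                               num_heads=4, head_dim=64)
    x = torch.randn(2, 8, 256)
    print(impl(x).shape)  # torch.Size([2, 8, 256])
```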