[Perf]: Optimize qwen2-vl to reduce cudaMemcpyAsync #14377
Conversation
Thanks for this optimization! Can you please update qwen2.5-vl as well?
Branch updated from ae09649 to 1fbb69c. Head branch was pushed to by a user without write access.
@cynthieye Thank you for making this PR! Can you update this branch with our main branch? I think the CI error should have been fixed on main a while ago.
Left a few comments - otherwise LGTM!
@@ -259,6 +259,8 @@ def forward(
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    max_seqlen: int = None,
Shouldn't max_seqlen also be Optional[int]?
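A minimal sketch of what that would look like (the parameter list follows the diff above; the self parameter, return type, and comment are assumptions):

from typing import Optional

import torch

def forward(
    self,
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    # Annotating as Optional[int] matches the default value of None.
    max_seqlen: Optional[int] = None,
) -> torch.Tensor:
    ...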
    max_seqlen: int,
    seqlens: list[int],
Please modify the typing accordingly
    max_seqlen: int = None,
    seqlens: Optional[list[int]] = None,
ditto
    max_seqlen: int,
    seqlens: list[int],
ditto
    max_seqlen: int,
    seqlens: list[int],
I think it's probably a good idea to add a bit of documentation here to indicate that max_seqlen is only used for FA and seqlens is only used for xformers.
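For example, a short docstring along these lines would cover it (a sketch only; the exact wording and summary line are assumptions, and the parameter list follows the diffs above):

from typing import Optional

import torch

def forward(
    self,
    x: torch.Tensor,
    cu_seqlens: torch.Tensor,
    rotary_pos_emb: torch.Tensor,
    max_seqlen: Optional[int] = None,
    seqlens: Optional[list[int]] = None,
) -> torch.Tensor:
    """Run attention over the packed visual sequence.

    Args:
        max_seqlen: Maximum sequence length, only consumed by the
            flash-attn backend.
        seqlens: Per-sequence lengths, only consumed by the xformers
            backend.
    """
    ...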
Qwen2-VL logic optimization: during each forward pass, the xformers branch of Qwen2VisionTransformer calls the tensor tolist method multiple times (the flash-attn branch calls tensor item multiple times), which forces the GPU tensor to be copied to the CPU and triggers cudaMemcpyAsync, adding latency. Since these calls produce the same result every time for the same input, the conversion is now executed once and the remaining calls reuse the first result. After this optimization, QPS in our online environment improves by 15% on the xformers branch and by 7% on the flash-attn branch.
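A minimal sketch of the idea, assuming cu_seqlens is a cumulative-lengths tensor on the GPU; the helper name and the loop below are illustrative, not the PR's exact code:

import torch

def _seq_metadata(cu_seqlens: torch.Tensor) -> tuple[int, list[int]]:
    # Perform the GPU -> CPU copy exactly once per forward pass.
    seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()  # used by the xformers branch
    max_seqlen = max(seqlens)                               # used by the flash-attn branch
    return max_seqlen, seqlens

# The vision transformer then threads the precomputed values through every
# attention block instead of calling .tolist()/.item() per layer:
#
#   max_seqlen, seqlens = _seq_metadata(cu_seqlens)
#   for blk in self.blocks:
#       x = blk(x, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb,
#               max_seqlen=max_seqlen, seqlens=seqlens)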