[Models][Qwen3VL] Speedup fast_pos_embed_interpolate
#26647
Conversation
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
-weights = weights.to(
-    dtype=self.dtype, device=self.device, non_blocking=True
-)
+weights = weights.to(dtype=self.dtype)
weights will already be on self.device so no need to copy it again.
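A minimal sketch of why the device argument is redundant here (tensor names, shapes, and dtypes are illustrative, not taken from the model code): Tensor.to(dtype=...) returns a tensor on the same device the input already lives on, so only the dtype changes.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = torch.rand(4, 256, device=device)  # already on the target device

# .to(dtype=...) only changes the dtype; the result stays on weights.device,
# so also passing device=... would just request a no-op copy.
casted = weights.to(dtype=torch.bfloat16)
assert casted.device == weights.device
```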
Code Review
This pull request introduces several well-reasoned optimizations to the fast_pos_embed_interpolate function in Qwen3VL. The changes, including refactoring weight calculations, vectorizing index computations, and using more efficient PyTorch operations like .sum() instead of unbind() followed by manual addition, are mathematically sound and contribute to the reported 11% performance improvement. The code is now more concise and idiomatic. I've reviewed the changes and found them to be correct and beneficial for performance without sacrificing readability. This is a solid improvement.
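To illustrate the kind of change the review refers to, here is a small sketch (tensor names and shapes are assumptions, not the actual Qwen3VL code) comparing the unbind-and-add pattern with a single reduction:

```python
import torch

# Four bilinear corner contributions, each of shape (num_tokens, hidden_dim).
corner_embeds = torch.randn(4, 1024, 1152)

# Pattern described as replaced: unbind into four tensors and add them
# manually, which issues several small elementwise kernels.
e00, e01, e10, e11 = corner_embeds.unbind(dim=0)
out_manual = e00 + e01 + e10 + e11

# Pattern described as preferred: one reduction over the stacked dimension.
out_sum = corner_embeds.sum(dim=0)

torch.testing.assert_close(out_manual, out_sum)
```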
Thanks for the contribution!
Re: caching t, h, w - I had the same idea when I first cleaned up the code here: we could build a small cache on CPU for this, but I also wonder whether the h2d cost is worth the effort.
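For context, a rough sketch of the caching idea being floated here (the helper name, cache key, and placeholder index math are hypothetical, not code from this PR); the trade-off under discussion is that every cache hit would still pay a host-to-device copy before the values can be used on the GPU:

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=64)
def cached_pos_embed_inputs(t: int, h: int, w: int):
    # Hypothetical: precompute interpolation indices/weights on CPU once per
    # (t, h, w) grid so repeated image/video sizes could reuse them.
    idx = torch.arange(t * h * w)                # placeholder for the real index math
    weights = torch.full((t * h * w, 4), 0.25)   # placeholder bilinear weights
    return idx, weights

idx, weights = cached_pos_embed_inputs(2, 32, 32)
# Each cache hit still needs an h2d copy, which may offset the savings from
# skipping the recomputation.
if torch.cuda.is_available():
    idx = idx.cuda(non_blocking=True)
    weights = weights.cuda(non_blocking=True)
```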
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Head branch was pushed to by a user without write access
Thanks for the fast review. I also updated the tiling logic in 24b6717, which slightly improves things further. I've updated the benchmarks above.
Purpose
Follow-up to #25337 and #25347.
fast_pos_embed_interpolate launches many small CUDA ops, so this PR slightly simplifies and optimises the implementation. /cc @ywang96 @Isotr0py
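For readers unfamiliar with the function, here is a rough, self-contained sketch of the general idea (grid sizes, names, and the exact index math are assumptions for illustration, not the vLLM implementation): bilinearly resample a learned grid of position embeddings onto the target grid, with all indices and weights computed as batched tensor ops rather than many small per-corner ops.

```python
import torch

def bilinear_pos_embed_interpolate(pos_embed: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Hypothetical sketch: resample an (S, S, D) learned position-embedding grid
    to (h, w, D) with vectorized bilinear interpolation (no Python-level loops)."""
    S, _, D = pos_embed.shape
    # Target sampling coordinates in the source grid, computed in one shot.
    ys = torch.linspace(0, S - 1, h)
    xs = torch.linspace(0, S - 1, w)
    y0, x0 = ys.floor().long(), xs.floor().long()
    y1, x1 = (y0 + 1).clamp(max=S - 1), (x0 + 1).clamp(max=S - 1)
    wy1, wx1 = ys - y0, xs - x0
    wy0, wx0 = 1 - wy1, 1 - wx1

    # Gather the four corner embeddings for every output position and blend them.
    top = (pos_embed[y0][:, x0] * (wy0[:, None] * wx0[None, :])[..., None]
           + pos_embed[y0][:, x1] * (wy0[:, None] * wx1[None, :])[..., None])
    bot = (pos_embed[y1][:, x0] * (wy1[:, None] * wx0[None, :])[..., None]
           + pos_embed[y1][:, x1] * (wy1[:, None] * wx1[None, :])[..., None])
    return top + bot

# Example: interpolate a 24x24 grid of 1152-dim embeddings to a 32x48 grid.
out = bilinear_pos_embed_interpolate(torch.randn(24, 24, 1152), 32, 48)
print(out.shape)  # torch.Size([32, 48, 1152])
```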
Test Plan & Results
I verified that the new implementation doesn't change the computation.
A quick micro-benchmark on an H100 shows a 15% speedup of fast_pos_embed_interpolate, and I don't think it reduces the readability of the code.
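As a pointer for reproducing this kind of number, a minimal CUDA-event timing harness (the harness and iteration counts are assumptions, not the author's actual benchmark script):

```python
import torch

def benchmark_ms(fn, *args, iters: int = 100) -> float:
    """Return the mean wall time per call in milliseconds using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):      # warm-up iterations
        fn(*args)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```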