[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343
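The headline change is moving the fused MoE FP8 path to static (precomputed) scales. As a rough sketch of the general idea only (not this PR's kernel code; the dtype, function names, and shapes below are illustrative assumptions):

```python
import torch

def quantize_fp8_static(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 (e4m3) with a precomputed ("static") scale.
    Because the scale is known ahead of time, there is no per-batch
    max-reduction over `x`, and the scale can be folded into the
    surrounding kernel as a constant. Illustrative sketch only."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    q = (x / scale).clamp(finfo.min, finfo.max)
    return q.to(torch.float8_e4m3fn)

def quantize_fp8_dynamic(x: torch.Tensor):
    """Dynamic scaling for contrast: the scale is derived from the current
    batch, costing an extra reduction over `x` on every forward pass."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = x.abs().max() / finfo.max
    return quantize_fp8_static(x, scale), scale
```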
nit: Theoretically, we shouldn't use "cuda" in the model code. Since the GPU worker sets "cuda" as the default device in torch, `device="cuda"` is not necessary. It also hurts compatibility with non-CUDA devices.
This rule is violated for Mixtral and other MoE models unfortunately. 😢
Ok, I'll make a follow-up PR to remove the `device="cuda"` -- since we also specify it explicitly for the other parameters, I don't want to be inconsistent in this PR :)
Unrelated to this PR, I think we should have an `MoELayer` that is shared across models. All of these changes currently only impact `Mixtral`, but they could also be applied to other models. Since we keep all this generic logic in the model definitions, we lose the ability to run other MoE models with these features.
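A rough sketch of what such a shared layer could look like (the class name, shapes, and naive routing loop are assumptions for illustration only; a real layer would dispatch to the fused Triton MoE kernel and carry the FP8 scales from this PR):

```python
import torch
from torch import nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Model-agnostic MoE block (illustrative sketch, not an existing vLLM
    layer). Mixtral-style models could instantiate this instead of
    re-implementing routing, expert weights, and quantization scales in
    each model file."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Fused gate+up projection (w1/w3) and down projection (w2), per expert.
        self.w13 = nn.Parameter(
            torch.empty(num_experts, 2 * intermediate_size, hidden_size))
        self.w2 = nn.Parameter(
            torch.empty(num_experts, hidden_size, intermediate_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Naive reference routing; a real implementation would call the
        # fused MoE kernel instead of looping over experts in Python.
        logits = self.gate(x)                               # (tokens, experts)
        weights, experts = torch.topk(F.softmax(logits, dim=-1), self.top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.w13.shape[0]):
                mask = experts[:, k] == e
                if not mask.any():
                    continue
                h = x[mask] @ self.w13[e].t()
                gate_h, up_h = h.chunk(2, dim=-1)
                h = F.silu(gate_h) * up_h                   # SwiGLU MLP
                out[mask] += weights[mask, k, None] * (h @ self.w2[e].t())
        return out
```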