Fix gpt-oss model export #1861
Conversation
Pull Request Overview
This PR updates the QMoE (Quantized Mixture of Experts) quantization logic to distinguish between block-wise and tensor-level quantization based on whether the int4_block_size parameter is explicitly specified by the user. The key change is making block-wise quantization opt-in rather than automatic.
- Switches QMoE quantization from automatic block-size detection to explicit opt-in behavior
- Defaults to tensor-level quantization (using TensorRT-LLM) when `int4_block_size` is not specified
- Conditionally includes the `block_size` attribute in the QMoE operator based on the quantization method used (see the sketch below)
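A minimal sketch of the opt-in decision described above. The names `make_qmoe_weights`, `extra_options`, `int4_block_size`, and `moe_attrs` come from this PR; the scale computation itself is a simplified stand-in, not the builder's actual quantization code.

```python
import numpy as np

def make_qmoe_weights_sketch(weights: np.ndarray, extra_options: dict, moe_attrs: dict) -> np.ndarray:
    """Choose between block-wise and tensor-level int4 scales (illustrative only)."""
    if "int4_block_size" in extra_options:
        # Opt-in path: one scale per block of `block_size` values.
        # Assumes the tensor size is a multiple of block_size for brevity.
        block_size = int(extra_options["int4_block_size"])
        blocks = weights.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1) / 7.0    # int4 range is [-8, 7]
        moe_attrs["block_size"] = block_size          # attribute recorded only on this path
    else:
        # Default path: tensor-level quantization (TensorRT-LLM style),
        # one scale for the whole tensor and no block_size attribute.
        scales = np.array([np.abs(weights).max() / 7.0])
    return scales
```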
Force-pushed from 989dcd5 to d888824
Hi @apsonawane @kunal-vaishnavi, not sure if it's within the scope of this PR, but after building from this branch and exporting with DML int4, I tried to run the model and it fails.
Please merge main to resolve conflicts.
Force-pushed from d888824 to f6f9ff3
@LorenRd sorry for the late reply. I updated the exception since dml does not support block-wise quant, earlier we were checking for cpu specifically so this PR should not affect dml export. Were you able to run it earlier? |
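For context, a rough sketch of the kind of execution-provider check described here, assuming block-wise QMoE quantization is only available on CUDA; the condition and error message are illustrative, not the PR's literal code.

```python
def check_qmoe_block_quant_support(execution_provider: str, extra_options: dict) -> None:
    # Assumption: block-wise int4 QMoE quantization is treated as CUDA-only here.
    if "int4_block_size" in extra_options and execution_provider != "cuda":
        raise NotImplementedError(
            f"Block-wise QMoE quantization (int4_block_size) is not supported "
            f"on the '{execution_provider}' execution provider."
        )
```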
Force-pushed from 9fbd482 to 2620d09
This pull request updates the logic for handling the `block_size` attribute in QMoE (Quantized Mixture of Experts) model building and quantization. The changes ensure that block-wise quantization is only used when explicitly specified, defaulting to tensor-level quantization otherwise. The most important changes are:

**Quantization logic updates:**

* In `make_qmoe_weights`, block-wise quantization is now only used if `int4_block_size` is explicitly present in `extra_options`; otherwise, tensor-level quantization is used by default. The `block_size` attribute in `moe_attrs` is set accordingly.

**Operator construction improvements:**

* In `make_qmoe_op`, the `block_size` attribute is only included in the operator's attributes if it was explicitly set in `moe_attrs`, preventing unnecessary or default values from being passed.
* The direct passing of `block_size` as a parameter to `make_node` is removed; it is now only included via `extra_kwargs` when appropriate (see the sketch below).
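As a companion sketch, the conditional attribute forwarding in `make_qmoe_op` could look roughly like this, using `onnx.helper.make_node` (which turns keyword arguments into node attributes). The function name and `moe_attrs` come from the PR; the operator inputs and outputs here are placeholders.

```python
from onnx import helper

def make_qmoe_op_sketch(name: str, inputs: list[str], outputs: list[str], moe_attrs: dict):
    # Forward block_size as a node attribute only when block-wise quantization
    # stored it in moe_attrs; otherwise the attribute is omitted entirely
    # rather than passed with a default value.
    extra_kwargs = {}
    if "block_size" in moe_attrs:
        extra_kwargs["block_size"] = moe_attrs["block_size"]
    return helper.make_node(
        "QMoE",                      # com.microsoft contrib op
        inputs=inputs,
        outputs=outputs,
        name=name,
        domain="com.microsoft",
        **extra_kwargs,
    )
```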