Fix QMoE blockwise quantization support for TRT-RTX execution provider #1926
base: main
Conversation
@kunal-vaishnavi @baijumeswani for review
model = ir.from_proto(quant.model.model)

# Convert float32 scales to bfloat16 if io_dtype is bfloat16.
# MatMulNBitsQuantizer doesn't natively support bfloat16, so we saved weights as float32
In this PR, I added support for bfloat16 in the MatMulNBits quantizer. It will be part of the ORT 1.24 release. You can use a nightly ORT package for now and remove this conversion.
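For context, a minimal sketch of the kind of cast the current workaround performs, assuming the scales are available as plain numpy arrays; `ml_dtypes` provides a bfloat16 dtype for numpy, and the array names are illustrative:

```python
# Sketch only: cast float32 scale tensors to bfloat16 after quantization.
# Assumes scales are plain numpy arrays; names are illustrative.
import numpy as np
import ml_dtypes

def scales_to_bfloat16(scales_fp32: np.ndarray) -> np.ndarray:
    """Store float32 scales in bfloat16, as the comment above describes."""
    return scales_fp32.astype(ml_dtypes.bfloat16)

scales_fp32 = np.array([0.015, 0.031, 0.062], dtype=np.float32)
scales_bf16 = scales_to_bfloat16(scales_fp32)
print(scales_bf16.dtype)  # bfloat16
```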
| kwargs.get("scales3", ""), | ||
| kwargs.get("bias3", ""), | ||
| kwargs.get("zero_points1", ""), | ||
| kwargs.get("zero_points2", ""), |
This change can be reverted. The new op spec for QMoE includes optional zero points. Those optional inputs are stored as empty strings if unused and empty string inputs should not affect other models that don't support zero points.
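As an illustration of how ONNX encodes unused trailing optional inputs, here is a hedged sketch; the input list below is a placeholder and does not reproduce the real QMoE operator signature:

```python
# Sketch only: empty-string entries mark unused optional inputs in an ONNX node.
# The input ordering and names here are illustrative, not the QMoE op spec.
from onnx import helper

node = helper.make_node(
    "QMoE",
    inputs=[
        "hidden_states", "router_probs",
        "fc1_weights", "fc1_scales", "",  # optional fc1 bias unused -> ""
        "fc2_weights", "fc2_scales", "",  # optional fc2 bias unused -> ""
        "",                               # optional zero_points1 unused
        "",                               # optional zero_points2 unused
    ],
    outputs=["moe_output"],
    domain="com.microsoft",
    name="qmoe_example",
)
print(node.input)  # empty strings are preserved as placeholders
```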
def make_moe(self, layer_id, mlp, root_input):
-    if self.ep in {"cpu", "cuda"}:
+    if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
We use trt-rtx internally and substitute with NvTensorRtRtx only when the GenAI config is created.
Suggested change:
-    if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
+    if self.ep in {"cpu", "cuda", "trt-rtx"}:
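A small sketch of the naming convention described in the comment above, where "trt-rtx" is the internal name and "NvTensorRtRtx" only appears when the GenAI config is written; the mapping dict and helper are hypothetical, not the builder's actual code:

```python
# Sketch only: map the internal EP name to the provider name used in the
# GenAI config. The dict and function are hypothetical illustrations.
_EP_CONFIG_NAMES = {"trt-rtx": "NvTensorRtRtx"}

def ep_name_for_genai_config(ep: str) -> str:
    return _EP_CONFIG_NAMES.get(ep, ep)

assert ep_name_for_genai_config("trt-rtx") == "NvTensorRtRtx"
assert ep_name_for_genai_config("cuda") == "cuda"
```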
-        int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization.
+        int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization (MatMulNBits).
             Default value is 32.
+        int4_qmoe_block_size = 16/32/64/128/256: Specify the block size for QMoE expert weights quantization.
What is the reason for using a different block size in MatMulNBits and QMoE?
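To make the distinction between the two options concrete, a hypothetical sketch of how they might be read; the fallback from `int4_qmoe_block_size` to `int4_block_size` is an assumption for illustration, not confirmed behavior:

```python
# Sketch only: read the two block-size options independently, letting the
# QMoE option fall back to the MatMulNBits one when unset (assumed behavior).
def resolve_block_sizes(extra_options: dict) -> tuple[int, int]:
    matmul_block_size = int(extra_options.get("int4_block_size", 32))
    qmoe_block_size = int(extra_options.get("int4_qmoe_block_size", matmul_block_size))
    return matmul_block_size, qmoe_block_size

print(resolve_block_sizes({"int4_block_size": 64}))                               # (64, 64)
print(resolve_block_sizes({"int4_block_size": 64, "int4_qmoe_block_size": 128}))  # (64, 128)
```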
-        # via int4_block_size and when using CPU or WebGPU execution providers, since
-        # block_size is only supported for these EPs in the QMoE operator.
-        use_blockwise_quant = "int4_block_size" in self.extra_options and self.ep in ["cpu", "webgpu"]
+        # Get block size from quantization attributes
These changes look to be reverting to before this PR was made. There were issues discovered with the old approach that necessitated the linked PR.
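For reference, a sketch of how a block size can be recovered from the quantization artifacts themselves rather than passed as an extra option; the shapes and helper are illustrative assumptions, not the repository's implementation:

```python
# Sketch only: for blockwise int4 quantization there is one scale per
# (output column, block) pair, so the block size can be inferred from the
# scales shape. Shapes and names below are illustrative assumptions.
import numpy as np

def infer_block_size(in_features: int, scales: np.ndarray) -> int:
    num_blocks = scales.shape[-1]  # scales shape: (out_features, num_blocks)
    assert in_features % num_blocks == 0, "in_features must divide evenly into blocks"
    return in_features // num_blocks

scales = np.ones((4096, 128), dtype=np.float32)  # 128 blocks along the K dimension
print(infer_block_size(4096, scales))            # 32
```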
Linked PR: Add QMoE and BF16 support for TRT-RTX execution provider