Conversation

@apsonawane
Contributor

This pull request updates the logic for handling the `block_size` attribute in QMoE (Quantized Mixture of Experts) model building and quantization. The changes ensure that block-wise quantization is only used when explicitly specified, defaulting to tensor-level quantization otherwise. The most important changes are:

**Quantization logic updates:**

  • In `make_qmoe_weights`, block-wise quantization is now used only if `int4_block_size` is explicitly present in `extra_options`; otherwise, tensor-level quantization is used by default. The `block_size` attribute in `moe_attrs` is set accordingly.

**Operator construction improvements:**

  • In `make_qmoe_op`, the `block_size` attribute is included in the operator's attributes only if it was explicitly set in `moe_attrs`, preventing unnecessary or default values from being passed.
  • The direct passing of `block_size` as a parameter to `make_node` is removed; it is now included only via `extra_kwargs` when appropriate (see the sketch below).
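
For illustration, here is a minimal sketch of this control flow. The function names follow the description above, but the helper structure is assumed, not the builder's actual code:

```python
# Minimal sketch of the new block_size handling; function names mirror
# the description above, but the real builder code is structured differently.
from onnx import helper

def make_qmoe_weights(moe_attrs, extra_options):
    if "int4_block_size" in extra_options:
        # Block-wise quantization is opt-in: record the requested size.
        moe_attrs["block_size"] = int(extra_options["int4_block_size"])
    else:
        # Default path: tensor-level quantization, no block size set.
        moe_attrs["block_size"] = None

def make_qmoe_op(moe_attrs, inputs, outputs):
    extra_kwargs = {}
    if moe_attrs.get("block_size") is not None:
        # Attach block_size only when block-wise quant was requested,
        # so no default value leaks into the QMoE node's attributes.
        extra_kwargs["block_size"] = moe_attrs["block_size"]
    # block_size is no longer passed directly to make_node; it flows
    # through extra_kwargs only when appropriate.
    return helper.make_node(
        "QMoE", inputs, outputs, domain="com.microsoft", **extra_kwargs
    )
```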

Contributor

Copilot AI left a comment

Pull Request Overview

This PR updates the QMoE (Quantized Mixture of Experts) quantization logic to distinguish between block-wise and tensor-level quantization based on whether the int4_block_size parameter is explicitly specified by the user. The key change is making block-wise quantization opt-in rather than automatic.

  • Switches QMoE quantization from automatic block-size detection to explicit opt-in behavior
  • Defaults to tensor-level quantization (using TensorRT-LLM) when int4_block_size is not specified
  • Conditionally includes the block_size attribute in the QMoE operator based on the quantization method used
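
Concretely, assuming the hypothetical helpers sketched earlier, the opt-in boundary behaves like this (a usage sketch, not the builder's actual API):

```python
# Default: no int4_block_size given, so the node carries no block_size attribute.
moe_attrs = {}
make_qmoe_weights(moe_attrs, extra_options={})
default_node = make_qmoe_op(moe_attrs, ["x"], ["y"])

# Explicit opt-in: int4_block_size is present, so block_size=32 is emitted.
moe_attrs = {}
make_qmoe_weights(moe_attrs, extra_options={"int4_block_size": 32})
blockwise_node = make_qmoe_op(moe_attrs, ["x"], ["y"])
```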

@LorenRd
Contributor

LorenRd commented Nov 21, 2025

Hi @apsonawane @kunal-vaishnavi, not sure if this is in scope for this PR, but after building from this branch and exporting with DML int4, I tried to run the model and it fails:

```
onnxruntime-directml                     1.23.0
onnxruntime-genai-directml               0.11.0.dev0 C:\Users\ailab\Desktop\onnxruntime-genai\build\Windows\Release\wheel

RuntimeError: Load model from C:\Users\ailab\Desktop\xllm_lib\artifacts\models\gpt_oss_20b_onnx_dml_int4\model.onnx failed:Type Error: Type parameter (T) of Optype (SkipSimplifiedLayerNormalization) bound to different types (tensor(float16) and tensor(float) in node (/model/layers.1/input_layernorm/SkipLayerNorm).
```

@tianleiwu
Contributor

Please merge main to resolve conflicts.

@apsonawane
Contributor Author

apsonawane commented Dec 3, 2025

@LorenRd sorry for the late reply. I updated the exception since DML does not support block-wise quantization; earlier we were checking for CPU specifically, so this PR should not affect DML export. Were you able to run it earlier?
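
For context, a minimal sketch of the kind of guard described here, with a hypothetical `validate_qmoe_options` helper (not the actual code):

```python
def validate_qmoe_options(execution_provider, extra_options):
    # Previously this guard was keyed to the CPU provider specifically;
    # it now rejects block-wise quantization on providers that lack
    # support for it, such as DML.
    if "int4_block_size" in extra_options and execution_provider == "dml":
        raise NotImplementedError(
            "Block-wise QMoE quantization is not supported on DML; "
            "omit int4_block_size to use tensor-level quantization."
        )
```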

@apsonawane apsonawane enabled auto-merge (squash) December 4, 2025 08:41
@apsonawane apsonawane merged commit 39561d1 into main Dec 5, 2025
15 checks passed
@apsonawane apsonawane deleted the asonawane/fix branch December 5, 2025 09:12
apsonawane added a commit that referenced this pull request Dec 19, 2025
