
Conversation

anujj (Contributor) commented Dec 19, 2025

Add QMoE and BF16 support for TRT-RTX execution provider

  • Enable blockwise quantization for TRT-RTX/NvTensorRtRtx EPs
  • Add gpt_oss_swiglu_fusion option for separate gate/up weights
  • Add int4_qdq_block_size for MatMul quantization block size
  • Add BF16 precision support for TRT-RTX
  • Keep padding in QMoE weights for proper alignment
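
For orientation, a minimal sketch of how these options might be passed to the model builder as extra options; the option names come from the bullets above, while the value formats and the way they are forwarded are assumptions, not part of this PR:

# Sketch only: keys are the options named above; values and how they are
# forwarded (e.g. as --extra_options key=value pairs) are assumptions.
extra_options = {
    "int4_qdq_block_size": "32",    # block size for MatMul int4 quantization
    "gpt_oss_swiglu_fusion": "1",   # keep separate gate/up expert weights
}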

anujj marked this pull request as draft December 19, 2025 13:35
anujj (Contributor, Author) commented Jan 6, 2026

@kunal-vaishnavi @baijumeswani for review

anujj marked this pull request as ready for review January 6, 2026 08:37
model = ir.from_proto(quant.model.model)

# Convert float32 scales to bfloat16 if io_dtype is bfloat16.
# MatMulNBitsQuantizer doesn't natively support bfloat16, so we saved weights as float32
A reviewer (Contributor) commented:

In this PR, I added support for bfloat16 in the MatMulNBits quantizer. It will be part of the ORT 1.24 release. You can use a nightly ORT package for now and remove this conversion.
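
For illustration, a minimal sketch (not the PR's code) of the float32-to-bfloat16 scale conversion described in the quoted comment, operating on the raw ONNX proto and using torch for the cast since numpy has no native bfloat16 dtype:

import torch
from onnx import TensorProto, helper, numpy_helper

def scales_to_bf16(model_proto):
    # Assumption: quantization scales are float32 initializers whose names contain "scales".
    for init in model_proto.graph.initializer:
        if "scales" in init.name and init.data_type == TensorProto.FLOAT:
            fp32 = numpy_helper.to_array(init).copy()
            bf16 = torch.from_numpy(fp32).to(torch.bfloat16)
            # Reinterpret the bfloat16 bits as int16 so they can be serialized as raw bytes.
            raw = bf16.view(torch.int16).numpy().tobytes()
            init.CopyFrom(
                helper.make_tensor(init.name, TensorProto.BFLOAT16, list(init.dims), raw, raw=True)
            )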

kwargs.get("scales3", ""),
kwargs.get("bias3", ""),
kwargs.get("zero_points1", ""),
kwargs.get("zero_points2", ""),
A reviewer (Contributor) commented:

This change can be reverted. The new op spec for QMoE includes optional zero points. Those optional inputs are stored as empty strings if unused and empty string inputs should not affect other models that don't support zero points.
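
For reference, a small sketch of the empty-string convention for optional node inputs; the input ordering here is illustrative, not the actual QMoE op spec:

from onnx import helper

# Unused optional inputs are passed as "" and are simply ignored by the kernel.
node = helper.make_node(
    "QMoE",
    inputs=["input", "router_probs",
            "fc1_weights", "fc1_scales", "",   # zero_points1 unused
            "fc2_weights", "fc2_scales", ""],  # zero_points2 unused
    outputs=["output"],
    domain="com.microsoft",
)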


def make_moe(self, layer_id, mlp, root_input):
if self.ep in {"cpu", "cuda"}:
if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
A reviewer (Contributor) commented:

We use trt-rtx internally and substitute it with NvTensorRtRtx only when the GenAI config is created.

Suggested change
if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
if self.ep in {"cpu", "cuda", "trt-rtx"}:
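
A minimal sketch of that convention (the mapping name is hypothetical): the builder keeps "trt-rtx" internally and the GenAI config writer substitutes the registered EP name:

# Hypothetical mapping, applied only when the GenAI config is written.
GENAI_EP_NAMES = {"trt-rtx": "NvTensorRtRtx"}

def genai_ep_name(ep: str) -> str:
    return GENAI_EP_NAMES.get(ep, ep)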

int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization.
int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization (MatMulNBits).
Default value is 32.
int4_qmoe_block_size = 16/32/64/128/256: Specify the block size for QMoE expert weights quantization.
A reviewer (Contributor) commented:

What is the reason for using a different block size in MatMulNBits and QMoE?
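
For orientation, a sketch of how the two options might be read on the builder side; the function, variable names, and the QMoE default are assumptions:

def read_block_sizes(extra_options: dict) -> tuple[int, int]:
    # int4_block_size documents a default of 32; the QMoE default here is assumed.
    matmul_block_size = int(extra_options.get("int4_block_size", 32))
    qmoe_block_size = int(extra_options.get("int4_qmoe_block_size", 32))
    return matmul_block_size, qmoe_block_size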

# via int4_block_size and when using CPU or WebGPU execution providers, since
# block_size is only supported for these EPs in the QMoE operator.
use_blockwise_quant = "int4_block_size" in self.extra_options and self.ep in ["cpu", "webgpu"]
# Get block size from quantization attributes
A reviewer (Contributor) commented:

These changes look to be reverting to the state before the linked PR. There were issues discovered with the old approach that necessitated that PR.
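
For reference, a sketch of the gating described by the quoted lines, with the follow-up block-size lookup written as an assumption (note the caveat above that this approach had known issues):

def qmoe_blockwise_settings(extra_options: dict, ep: str):
    # block_size on QMoE is only honored by the CPU and WebGPU EPs, so blockwise
    # quantization is gated on both the option being set and the EP supporting it.
    use_blockwise_quant = "int4_block_size" in extra_options and ep in ["cpu", "webgpu"]
    block_size = int(extra_options["int4_block_size"]) if use_blockwise_quant else None
    return use_blockwise_quant, block_size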
