Fix QMoE blockwise quantization support for TRT-RTX execution provider #1926
base: main
Conversation
@kunal-vaishnavi @baijumeswani for review
model = ir.from_proto(quant.model.model)

# Convert float32 scales to bfloat16 if io_dtype is bfloat16.
# MatMulNBitsQuantizer doesn't natively support bfloat16, so we saved weights as float32
In this PR, I added support for bfloat16 in the MatMulNBits quantizer. It will be part of the ORT 1.24 release. You can use a nightly ORT package for now and remove this conversion.
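For context, a minimal sketch of the kind of cast the current workaround performs, assuming the scales are available as plain numpy arrays; `ml_dtypes` provides a bfloat16 dtype for numpy, and the array names are illustrative:

```python
# Sketch only: cast float32 scale tensors to bfloat16 after quantization.
# Assumes scales are plain numpy arrays; names are illustrative.
import numpy as np
import ml_dtypes

def scales_to_bfloat16(scales_fp32: np.ndarray) -> np.ndarray:
    """Store float32 scales in bfloat16, as the comment above describes."""
    return scales_fp32.astype(ml_dtypes.bfloat16)

scales_fp32 = np.array([0.015, 0.031, 0.062], dtype=np.float32)
scales_bf16 = scales_to_bfloat16(scales_fp32)
print(scales_bf16.dtype)  # bfloat16
```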
| kwargs.get("scales3", ""), | ||
| kwargs.get("bias3", ""), | ||
| kwargs.get("zero_points1", ""), | ||
| kwargs.get("zero_points2", ""), |
This change can be reverted. The new op spec for QMoE includes optional zero points. Those optional inputs are stored as empty strings if unused and empty string inputs should not affect other models that don't support zero points.
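As an illustration of how ONNX encodes unused trailing optional inputs, here is a hedged sketch; the input list below is a placeholder and does not reproduce the real QMoE operator signature:

```python
# Sketch only: empty-string entries mark unused optional inputs in an ONNX node.
# The input ordering and names here are illustrative, not the QMoE op spec.
from onnx import helper

node = helper.make_node(
    "QMoE",
    inputs=[
        "hidden_states", "router_probs",
        "fc1_weights", "fc1_scales", "",  # optional fc1 bias unused -> ""
        "fc2_weights", "fc2_scales", "",  # optional fc2 bias unused -> ""
        "",                               # optional zero_points1 unused
        "",                               # optional zero_points2 unused
    ],
    outputs=["moe_output"],
    domain="com.microsoft",
    name="qmoe_example",
)
print(node.input)  # empty strings are preserved as placeholders
```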
def make_moe(self, layer_id, mlp, root_input):
-    if self.ep in {"cpu", "cuda"}:
+    if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
We use trt-rtx internally and substitute with NvTensorRtRtx only when the GenAI config is created.
Suggested change:
-    if self.ep in {"cpu", "cuda", "NvTensorRtRtx", "trt-rtx"}:
+    if self.ep in {"cpu", "cuda", "trt-rtx"}:
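A small sketch of the naming convention described in the comment above, where "trt-rtx" is the internal name and "NvTensorRtRtx" only appears when the GenAI config is written; the mapping dict and helper are hypothetical, not the builder's actual code:

```python
# Sketch only: map the internal EP name to the provider name used in the
# GenAI config. The dict and function are hypothetical illustrations.
_EP_CONFIG_NAMES = {"trt-rtx": "NvTensorRtRtx"}

def ep_name_for_genai_config(ep: str) -> str:
    return _EP_CONFIG_NAMES.get(ep, ep)

assert ep_name_for_genai_config("trt-rtx") == "NvTensorRtRtx"
assert ep_name_for_genai_config("cuda") == "cuda"
```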
-        int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization.
+        int4_block_size = 16/32/64/128/256: Specify the block size for int4 quantization (MatMulNBits).
             Default value is 32.
+        int4_qmoe_block_size = 16/32/64/128/256: Specify the block size for QMoE expert weights quantization.
What is the reason for using a different block size in MatMulNBits and QMoE?
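To make the distinction between the two options concrete, a hypothetical sketch of how they might be read; the fallback from `int4_qmoe_block_size` to `int4_block_size` is an assumption for illustration, not confirmed behavior:

```python
# Sketch only: read the two block-size options independently, letting the
# QMoE option fall back to the MatMulNBits one when unset (assumed behavior).
def resolve_block_sizes(extra_options: dict) -> tuple[int, int]:
    matmul_block_size = int(extra_options.get("int4_block_size", 32))
    qmoe_block_size = int(extra_options.get("int4_qmoe_block_size", matmul_block_size))
    return matmul_block_size, qmoe_block_size

print(resolve_block_sizes({"int4_block_size": 64}))                               # (64, 64)
print(resolve_block_sizes({"int4_block_size": 64, "int4_qmoe_block_size": 128}))  # (64, 128)
```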
-        # via int4_block_size and when using CPU or WebGPU execution providers, since
-        # block_size is only supported for these EPs in the QMoE operator.
-        use_blockwise_quant = "int4_block_size" in self.extra_options and self.ep in ["cpu", "webgpu"]
+        # Get block size from quantization attributes
These changes look to be reverting to before this PR was made. There were issues discovered with the old approach that necessitated the linked PR.
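For reference, a sketch of how a block size can be recovered from the quantization artifacts themselves rather than passed as an extra option; the shapes and helper are illustrative assumptions, not the repository's implementation:

```python
# Sketch only: for blockwise int4 quantization there is one scale per
# (output column, block) pair, so the block size can be inferred from the
# scales shape. Shapes and names below are illustrative assumptions.
import numpy as np

def infer_block_size(in_features: int, scales: np.ndarray) -> int:
    num_blocks = scales.shape[-1]  # scales shape: (out_features, num_blocks)
    assert in_features % num_blocks == 0, "in_features must divide evenly into blocks"
    return in_features // num_blocks

scales = np.ones((4096, 128), dtype=np.float32)  # 128 blocks along the K dimension
print(infer_block_size(4096, scales))            # 32
```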
Linked PR: Add QMoE and BF16 support for TRT-RTX execution provider