Force-pushed from 426e1ac to 509e17c
tianleiwu reviewed Sep 23, 2025
```cpp
size_t expected_size = MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
return expected_size > 0;
// TEMPORARY: Disable direct Q4 GEMM
return false;
```
Contributor
Any reason to disable this optimized path?
Contributor (Author)
It was producing gibberish output after some tokens; that needs a deeper dive on our side.
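For context, a minimal sketch of the gate with the temporary disable removed; the signature, parameter types, and include path are assumptions, since only the function body appears in the diff:

```cpp
#include <cstddef>
#include <cstdint>

#include "core/mlas/inc/mlas_q4.h"  // MlasQ4GemmPackBSize (assumed include path)

// Hypothetical reconstruction of the capability gate without the temporary
// disable: MlasQ4GemmPackBSize returns 0 when the quant-type/shape combination
// is unsupported, so a non-zero packed-buffer size doubles as a feature check.
static bool CanUseMlasQ4Gemm(MLAS_BLK_QUANT_TYPE out_qtype, int64_t rows, int64_t cols) {
  size_t expected_size =
      MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
  return expected_size > 0;
}
```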
tianleiwu reviewed Sep 23, 2025
```cpp
} else if (num_tokens == 1) {
  // Single token decoding: use 1 thread (routing overhead not worth parallelizing)
  optimal_routing_threads = 1;
} else if (num_tokens < min_work_per_thread) {
```
Contributor
Are there any benchmark results showing that this change helps performance?
Contributor (Author)
There was not much performance difference; the improvement was no more than 10%.
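For illustration, a hedged sketch of the branch structure the snippet implies; the parameter list and the final fallback branch are assumptions, not the PR's code:

```cpp
#include <algorithm>

// Illustrative sketch, not the PR's actual code: choose a routing thread
// count from the token count. Only the two middle branches appear in the
// diff; everything else here is assumed.
int ChooseRoutingThreads(int num_tokens, int min_work_per_thread, int pool_threads) {
  if (num_tokens == 1) {
    // Single-token decoding: routing overhead is not worth parallelizing.
    return 1;
  }
  if (num_tokens < min_work_per_thread) {
    // Too little total work to amortize waking up extra pool threads.
    return 1;
  }
  // Otherwise give each thread at least min_work_per_thread tokens.
  return std::min(pool_threads, std::max(1, num_tokens / min_work_per_thread));
}
```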
tianleiwu reviewed Sep 23, 2025
```cpp
  num_expert_threads = std::min(num_expert_threads, 8);
} else if (total_work < 384) {
  // Very large workload - use more threads
  num_expert_threads = std::min(num_expert_threads, 12);
```
Contributor
Did we consider the number of CPU cores in the num_expert_threads computation? If I have a CPU with 8 cores, using more than 8 threads will slow things down.
Contributor (Author)
I have not; I will try it.
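A minimal sketch of the reviewer's suggestion, assuming std::thread::hardware_concurrency() as the core-count source; in ONNX Runtime the thread pool's own degree of parallelism would be the more natural bound, so the helper name and shape here are illustrative only:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical cap, per the review comment: never hand out more threads
// than the machine has hardware threads, since oversubscribing an
// 8-core CPU slows things down.
int CapByHardwareConcurrency(int num_expert_threads) {
  unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
  if (hw == 0) return num_expert_threads;             // fall back to the heuristic
  return std::min(num_expert_threads, static_cast<int>(hw));
}
```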
Force-pushed from 509e17c to 41133b7
Force-pushed from 41133b7 to 8289fcb
This pull request introduces several improvements and fixes to the Mixture-of-Experts (MoE) quantization kernel in ONNX Runtime, focusing on threading efficiency, correctness of block-wise quantization, and code maintainability. Notably, it disables direct Q4 GEMM for debugging, adds stricter validation for quantization parameters, and refines parallelization heuristics for both routing and expert computation to better support both batch and decoding scenarios.
Threading and Parallelization Improvements:
- Refined the parallelization heuristics in QMoECPU<T>::Compute to better balance parallelism for both small (decoding) and large (batch) workloads. This includes new adaptive thresholds and more granular control over thread allocation for different workload sizes. [1] [2] [3]

Block-wise Quantization and Dequantization Fixes:
- Extended ValidateBlockwiseQuantization to enforce that hidden_size and inter_size are divisible by block_size when block-wise quantization is enabled, preventing misconfiguration. [1] [2]
- Corrected blocks_per_row for block-wise scales in both FC1 and FC2 layers, ensuring proper indexing and memory access during dequantization (see the sketch below). [1] [2]
- Updated the dequantization path to use the corrected blocks_per_row calculation and to avoid unnecessary thread pool usage for small workloads. [1] [2]
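To make the blocks_per_row fix concrete, a hedged sketch of typical block-wise scale indexing, assuming a row-major layout and 8-bit symmetric quantization purely for illustration; names and layout are not the PR's exact code:

```cpp
#include <cstdint>

// Hypothetical block-wise dequantization of one weight row. With
// block_size elements sharing one scale, each row of hidden_size
// elements has blocks_per_row = hidden_size / block_size scales,
// which is why hidden_size must divide evenly by block_size.
void DequantizeRow(const int8_t* q, const float* scales, int64_t row,
                   int64_t hidden_size, int64_t block_size, float* out) {
  const int64_t blocks_per_row = hidden_size / block_size;
  for (int64_t col = 0; col < hidden_size; ++col) {
    // Scale index combines the row offset with the block this column falls in.
    const float scale = scales[row * blocks_per_row + col / block_size];
    out[col] = static_cast<float>(q[row * hidden_size + col]) * scale;
  }
}
```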
Debugging and Temporary Changes:
- Temporarily made CanUseMlasQ4Gemm return false to facilitate debugging of output issues.

These changes collectively improve the robustness, efficiency, and maintainability of the MoE quantization implementation.