Force-pushed from 426e1ac to 509e17c
tianleiwu reviewed Sep 23, 2025
```cpp
size_t expected_size = MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
return expected_size > 0;
// TEMPORARY: Disable direct Q4 GEMM
return false;
```
Contributor
Any reason to disable this optimized path?
Contributor (Author)
It was producing gibberish output after some tokens; that needs a deeper dive on our side.
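For context, a minimal sketch of the gate with the temporary disable removed; the signature, parameter types, and include path are assumptions, since only the function body appears in the diff:

```cpp
#include <cstddef>
#include <cstdint>

#include "core/mlas/inc/mlas_q4.h"  // MlasQ4GemmPackBSize (assumed include path)

// Hypothetical reconstruction of the capability gate without the temporary
// disable: MlasQ4GemmPackBSize returns 0 when the quant-type/shape combination
// is unsupported, so a non-zero packed-buffer size doubles as a feature check.
static bool CanUseMlasQ4Gemm(MLAS_BLK_QUANT_TYPE out_qtype, int64_t rows, int64_t cols) {
  size_t expected_size =
      MlasQ4GemmPackBSize(out_qtype, static_cast<size_t>(cols), static_cast<size_t>(rows));
  return expected_size > 0;
}
```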
tianleiwu reviewed Sep 23, 2025
```cpp
} else if (num_tokens == 1) {
  // Single token decoding: use 1 thread (routing overhead not worth parallelizing)
  optimal_routing_threads = 1;
} else if (num_tokens < min_work_per_thread) {
```
Contributor
Are there any benchmark results showing that this change helps performance?
Contributor (Author)
There was not much performance difference; the improvement was no more than 10%.
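For illustration, a hedged sketch of the branch structure the snippet implies; the parameter list and the final fallback branch are assumptions, not the PR's code:

```cpp
#include <algorithm>

// Illustrative sketch, not the PR's actual code: choose a routing thread
// count from the token count. Only the two middle branches appear in the
// diff; everything else here is assumed.
int ChooseRoutingThreads(int num_tokens, int min_work_per_thread, int pool_threads) {
  if (num_tokens == 1) {
    // Single-token decoding: routing overhead is not worth parallelizing.
    return 1;
  }
  if (num_tokens < min_work_per_thread) {
    // Too little total work to amortize waking up extra pool threads.
    return 1;
  }
  // Otherwise give each thread at least min_work_per_thread tokens.
  return std::min(pool_threads, std::max(1, num_tokens / min_work_per_thread));
}
```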
tianleiwu reviewed Sep 23, 2025
```cpp
  num_expert_threads = std::min(num_expert_threads, 8);
} else if (total_work < 384) {
  // Very large workload - use more threads
  num_expert_threads = std::min(num_expert_threads, 12);
```
Contributor
Did we consider the number of CPU cores in the num_expert_threads computation? If I have a CPU with 8 cores, using more than 8 threads will slow things down.
Contributor (Author)
I have not; I will try it.
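A minimal sketch of the reviewer's suggestion, assuming std::thread::hardware_concurrency() as the core-count source; in ONNX Runtime the thread pool's own degree of parallelism would be the more natural bound, so the helper name and shape here are illustrative only:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical cap, per the review comment: never hand out more threads
// than the machine has hardware threads, since oversubscribing an
// 8-core CPU slows things down.
int CapByHardwareConcurrency(int num_expert_threads) {
  unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
  if (hw == 0) return num_expert_threads;             // fall back to the heuristic
  return std::min(num_expert_threads, static_cast<int>(hw));
}
```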
Force-pushed from 509e17c to 41133b7
Force-pushed from 41133b7 to 8289fcb
This pull request introduces several improvements and fixes to the Mixture-of-Experts (MoE) quantization kernel in ONNX Runtime, focusing on threading efficiency, correctness of block-wise quantization, and code maintainability. Notably, it disables direct Q4 GEMM for debugging, adds stricter validation for quantization parameters, and refines parallelization heuristics for both routing and expert computation to better support both batch and decoding scenarios.
Threading and Parallelization Improvements:
- Refined the parallelization heuristics in QMoECPU<T>::Compute to better balance parallelism for both small (decoding) and large (batch) workloads. This includes new adaptive thresholds and more granular control over thread allocation for different workload sizes. [1] [2] [3]

Block-wise Quantization and Dequantization Fixes:
- Extended ValidateBlockwiseQuantization to enforce that hidden_size and inter_size are divisible by block_size when block-wise quantization is enabled, preventing misconfiguration. [1] [2]
- Corrected blocks_per_row for block-wise scales in both FC1 and FC2 layers, ensuring proper indexing and memory access during dequantization (see the sketch below). [1] [2]
- Updated the dequantization path to use the corrected blocks_per_row calculation and to avoid unnecessary thread pool usage for small workloads. [1] [2]
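To make the blocks_per_row fix concrete, a hedged sketch of typical block-wise scale indexing, assuming a row-major layout and 8-bit symmetric quantization purely for illustration; names and layout are not the PR's exact code:

```cpp
#include <cstdint>

// Hypothetical block-wise dequantization of one weight row. With
// block_size elements sharing one scale, each row of hidden_size
// elements has blocks_per_row = hidden_size / block_size scales,
// which is why hidden_size must divide evenly by block_size.
void DequantizeRow(const int8_t* q, const float* scales, int64_t row,
                   int64_t hidden_size, int64_t block_size, float* out) {
  const int64_t blocks_per_row = hidden_size / block_size;
  for (int64_t col = 0; col < hidden_size; ++col) {
    // Scale index combines the row offset with the block this column falls in.
    const float scale = scales[row * blocks_per_row + col / block_size];
    out[col] = static_cast<float>(q[row * hidden_size + col]) * scale;
  }
}
```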
Debugging and Temporary Changes:
- Temporarily made CanUseMlasQ4Gemm return false to facilitate debugging of output issues.

These changes collectively improve the robustness, efficiency, and maintainability of the MoE quantization implementation.