Closed
tianleiwu
reviewed
Sep 16, 2025
const int64_t thread_divisor = std::max(1, max_threads * 4);
const int64_t min_work_per_thread = std::max(int64_t{32}, static_cast<int64_t>(num_tokens / thread_divisor));
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / std::max(int64_t{1}, min_work_per_thread)), max_threads);
const int optimal_routing_threads = (tp == nullptr || num_tokens < min_work_per_thread) ? 1 : std::min(static_cast<int>(num_tokens / min_work_per_thread), max_threads);
Contributor
num_tokens / min_work_per_thread could be zero here, since min_work_per_thread >= 32 from the previous line: if num_tokens < 32, then num_tokens / min_work_per_thread = 0.
As a result, only one thread is used when num_tokens < 32. Is that expected if we want to optimize performance for decoding?
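To make the small-batch behavior concrete, here is a minimal sketch that mirrors the expression from the diff. The helper name RoutingThreads and the has_thread_pool flag (standing in for tp != nullptr) are hypothetical and not part of the PR; the sketch only illustrates the arithmetic under those assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the PR): reproduces the routing-thread
// expression from the diff. `has_thread_pool` stands in for `tp != nullptr`.
constexpr int RoutingThreads(int64_t num_tokens, int max_threads, bool has_thread_pool) {
  const int64_t thread_divisor = std::max(1, max_threads * 4);
  const int64_t min_work_per_thread =
      std::max(int64_t{32}, num_tokens / thread_divisor);
  // This guard catches every case in which the quotient below would be 0,
  // so the division itself never yields 0 threads.
  if (!has_thread_pool || num_tokens < min_work_per_thread) {
    return 1;
  }
  return std::min(static_cast<int>(num_tokens / min_work_per_thread), max_threads);
}

// Compile-time checks of the behavior discussed in the comment above.
static_assert(RoutingThreads(16, 8, true) == 1, "fewer than 32 tokens: one thread");
static_assert(RoutingThreads(32, 8, true) == 1, "32 / 32 == 1");
static_assert(RoutingThreads(64, 8, true) == 2, "64 / 32 == 2");
static_assert(RoutingThreads(1024, 8, true) == 8, "capped at max_threads");
```

With these inputs, any decode-sized batch below 32 tokens stays on a single thread, which is the case the question is about.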
tianleiwu
reviewed
Sep 16, 2025
fc1_gemm_done:

const int64_t activation_threshold = std::max(int64_t{4}, 256 / std::max(int64_t{1}, inter_size));
const int64_t activation_threshold = std::max(int64_t{4}, 256 / inter_size);
Contributor
How do you choose the magic numbers (4, 256) here?
Contributor
Author
256 is chosen because it fits well in the L1 cache, which improves CPU cache efficiency. The number 4 follows from inter_size: it is the minimum number of tokens required before parallel processing is considered.
inter_size = 64 --> 256/64 = 4.
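For reference, a small sketch of how the threshold behaves across a few inter_size values, using the variant from the diff that clamps the divisor to at least 1. The helper name ActivationThreshold is hypothetical and not part of the PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the PR): the guarded form of the threshold
// expression shown in the diff.
constexpr int64_t ActivationThreshold(int64_t inter_size) {
  // Clamping the divisor to >= 1 avoids division by zero when inter_size == 0.
  return std::max(int64_t{4}, int64_t{256} / std::max(int64_t{1}, inter_size));
}

static_assert(ActivationThreshold(64) == 4, "256 / 64 == 4");
static_assert(ActivationThreshold(16) == 16, "256 / 16 == 16");
static_assert(ActivationThreshold(512) == 4, "256 / 512 == 0, so the floor of 4 applies");
static_assert(ActivationThreshold(0) == 256, "divisor clamped to 1");
```

For any inter_size of 64 or more, the quotient is at most 4, so the floor of 4 decides the result; the 256 budget only matters for smaller inter_size values.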
Contributor
Author
Created a new PR: #26091. This PR is not required.
This pull request focuses on improving the robustness and reliability of the QMoECPU quantization code in moe_quantization_cpu.cc. The main changes add extensive input validation and bounds checking throughout the dequantization and routing logic, helping to prevent out-of-bounds memory access and potential crashes. Additionally, buffer size calculations are simplified for clarity and consistency.
The most important changes are:
Input Validation and Bounds Checking:
scale_idx bounds in all dequantization code paths (including 8-bit and 4-bit cases) to prevent out-of-bounds access to the scales array. [1] [2] [3]
Buffer Size Calculation Simplification: