[fix](llamafile-moe): fix the argument passed to llamafile_sgemm
#1735
base: main
Conversation
Summary of Changes
Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical argument-passing error within the Mixture-of-Experts (MoE) operator in llamafile.
Code Review
This pull request aims to fix an argument passed to llamafile_sgemm, specifically the ldc parameter. However, the analysis of the surrounding code suggests that the proposed changes are incorrect. The ldc parameter, representing the leading dimension (stride) of the output matrix, is being changed from the stride of the full matrix (e.g., config_.intermediate_size) to the width of the computed tile (m_block). This will likely cause incorrect memory writes because the subsequent code expects the sgemm output to be correctly placed within a larger, strided matrix. I have provided detailed comments on each of the three sgemm calls in forward_many, explaining why the original values for ldc appear correct and why the proposed changes are likely to introduce bugs. I recommend reverting these changes.
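For readers less familiar with the leading-dimension convention, the sketch below illustrates the point. It is written independently of this repository; `write_tile`, `full_width`, and `tile_width` are illustrative names, not code from the PR, and it does not call the real `llamafile_sgemm`. It only shows why a writer that emits an output tile into a larger row-major matrix must use the full matrix's row stride as `ldc`:

```cpp
#include <cstdio>
#include <vector>

// Minimal illustration (not llamafile_sgemm itself): write an m x n output
// tile into a row-major matrix whose rows are `ldc` floats apart. This is
// the addressing a GEMM-style kernel uses for its C/output argument.
static void write_tile(float* C, int m, int n, int ldc, float value) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            C[i * ldc + j] = value;  // row i of the tile starts at C + i * ldc
}

int main() {
    const int rows = 4, full_width = 8, tile_width = 2;
    std::vector<float> out(rows * full_width, 0.0f);

    // Correct: ldc is the stride of the full matrix, so tile row i lands at
    // the start of row i of `out`.
    write_tile(out.data(), rows, tile_width, /*ldc=*/full_width, 1.0f);

    // Incorrect: ldc equal to the tile width packs the tile contiguously, so
    // tile rows 1..3 would overwrite columns 2..7 of row 0 of `out` instead
    // of landing in later rows.
    // write_tile(out.data(), rows, tile_width, /*ldc=*/tile_width, 2.0f);

    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < full_width; ++j) std::printf("%.0f ", out[i * full_width + j]);
        std::printf("\n");
    }
    return 0;
}
```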
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
The ldc parameter for llamafile_sgemm appears to be incorrect. ldc should be the leading dimension (stride) of the full output matrix to allow sgemm to write its tiled result correctly.
The output buffer m_local_gate_output_ptr_[expert_idx] is later accessed (lines 650-652) as a row-major matrix with dimensions m_local_num_[expert_idx] x config_.intermediate_size. This implies a row stride of config_.intermediate_size.
By setting ldc to m_block, you are specifying the stride of the tile itself, not the containing matrix. This will likely lead to incorrect memory access and wrong results, as the sgemm output will be written contiguously instead of into the strided layout of the full matrix. The original value config_.intermediate_size seems correct.
This same issue applies to the up_proj and down_proj sgemm calls.
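To illustrate the mismatch described above, here is a tiny self-contained sketch. The constants are illustrative only (not values taken from the PR) and it does not call the real kernel; it just shows that when the downstream reader indexes with the full row stride, a tile written with ldc = m_block places every row after the first at the wrong offset:

```cpp
#include <cstdio>

int main() {
    // Illustrative sizes only, not values taken from the PR.
    const int intermediate_size = 8;  // row stride the downstream reader indexes with
    const int m_block = 2;            // tile width this PR passes as ldc

    const int row = 1, col = 0;
    // Offset the consumer computes when reading the full, strided matrix.
    const int expected_offset = row * intermediate_size + col;  // 8
    // Offset a writer using ldc == m_block would actually have written to.
    const int written_offset = row * m_block + col;             // 2

    std::printf("reader expects element (1,0) at offset %d, "
                "a tile-stride writer placed it at offset %d\n",
                expected_offset, written_offset);
    return 0;
}
```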
  up_proj_ptr, config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
The ldc parameter here seems incorrect for the same reason as in the gate_proj sgemm call. The output buffer m_local_up_output_ptr_ has an effective row stride of config_.intermediate_size. Setting ldc to m_block will likely cause memory corruption. The original value config_.intermediate_size should be used.
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_input_ptr,
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_output_ptr,
- config_.hidden_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
The ldc parameter for this down_proj sgemm call also appears incorrect. The full output matrix m_local_down_output_ptr_ has dimensions m_local_num_[expert_idx] x config_.hidden_size, so its row stride is config_.hidden_size. ldc should be set to config_.hidden_size to correctly write the output tile into the larger matrix. The change to m_block is likely a bug.
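One way to check this reasoning without running the full model is a toy placement test like the sketch below. It is a hypothetical stand-alone check, not code from this repository: it writes a tile with a candidate ldc and then reads it back with the stride the surrounding code actually uses, so a mismatched ldc is caught immediately.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-alone check (not from the PR): fill an m x n tile inside
// a row-major buffer using a candidate ldc, then verify each element is found
// where a reader indexing with the matrix's true row stride expects it.
static bool tile_placement_ok(int m, int n, int true_stride, int ldc) {
    std::vector<float> buf(static_cast<size_t>(m) * true_stride, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            buf[i * ldc + j] = static_cast<float>(i * 1000 + j);       // writer uses ldc
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            if (buf[i * true_stride + j] != static_cast<float>(i * 1000 + j))
                return false;                                           // reader uses the true stride
    return true;
}

int main() {
    // Illustrative sizes: hidden_size plays the role of the true row stride
    // of the down_proj output, m_block the tile width passed in this PR.
    const int m = 4, m_block = 2, hidden_size = 16;
    assert(tile_placement_ok(m, m_block, hidden_size, hidden_size));    // full-stride ldc: elements line up
    assert(!tile_placement_ok(m, m_block, hidden_size, m_block));       // tile-width ldc: elements misplaced
    return 0;
}
```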
Unsurprisingly, this PR accelerates GEMM and measurably raises end-to-end throughput.
Before the change: (benchmark screenshot)
After the change: (benchmark screenshot)
@chenht2022 PTAL. Thanks.
I think you should add some self-test scripts to verify the results. Have you run https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/examples/test_moe.py? You may also want to add a more detailed description of the change.
Got it, I will double-check this. Thank you for your time.
What does this PR do?
Fix the arguments passed to llamafile_sgemm in the MoE operator in llamafile so that the ldc parameter always receives the correct row stride of the output matrix.

Before submitting