Conversation

@SCDESPERTATE
Contributor

What does this PR do?

Fix the arguments passed to llamafile_sgemm in the MoE operator in llamafile so that the ldc parameter always receives the correct row stride of the output matrix.
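For readers unfamiliar with the BLAS-style convention, ldc is the leading dimension of the output matrix C: in a row-major layout, the number of elements between the starts of consecutive output rows. Below is a minimal, self-contained sketch of that convention; it is not the llamafile_sgemm signature, just the general pattern such routines follow.

```cpp
#include <cstddef>

// Naive row-major GEMM sketch: C[M x N] += A[M x K] * B[K x N].
// lda, ldb, ldc are the element counts between consecutive rows of A, B, C in
// memory, which lets the routine read/write a tile embedded in a larger buffer.
static void naive_sgemm(std::size_t M, std::size_t N, std::size_t K,
                        const float* A, std::size_t lda,
                        const float* B, std::size_t ldb,
                        float* C, std::size_t ldc) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] += acc;  // ldc = row stride of the containing output buffer
        }
    }
}
```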

Before submitting

@gemini-code-assist
Contributor

Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical argument passing error within the Mixture-of-Experts (MoE) operator in llamafile. The change ensures that the llamafile_sgemm function, responsible for matrix multiplications, receives the correct row-stride value for its output matrix, thereby preventing potential calculation inaccuracies and improving the robustness of the MoE implementation.

Highlights

  • llamafile_sgemm argument fix: Corrected the ldc parameter passed to llamafile_sgemm calls within the Mixture-of-Experts (MoE) operator, ensuring it consistently uses m_block for the output matrix's row-stride instead of config_.intermediate_size or config_.hidden_size.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an argument passed to llamafile_sgemm, specifically the ldc parameter. However, the analysis of the surrounding code suggests that the proposed changes are incorrect. The ldc parameter, representing the leading dimension (stride) of the output matrix, is being changed from the stride of the full matrix (e.g., config_.intermediate_size) to the width of the computed tile (m_block). This will likely cause incorrect memory writes because the subsequent code expects the sgemm output to be correctly placed within a larger, strided matrix. I have provided detailed comments on each of the three sgemm calls in forward_many, explaining why the original values for ldc appear correct and why the proposed changes are likely to introduce bugs. I recommend reverting these changes.

```diff
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
```

critical

The ldc parameter for llamafile_sgemm appears to be incorrect. ldc should be the leading dimension (stride) of the full output matrix to allow sgemm to write its tiled result correctly.

The output buffer m_local_gate_output_ptr_[expert_idx] is later accessed (lines 650-652) as a row-major matrix with dimensions m_local_num_[expert_idx] x config_.intermediate_size. This implies a row stride of config_.intermediate_size.

By setting ldc to m_block, you are specifying the stride of the tile itself, not the containing matrix. This will likely lead to incorrect memory access and wrong results, as the sgemm output will be written contiguously instead of into the strided layout of the full matrix. The original value config_.intermediate_size seems correct.

This same issue applies to the up_proj and down_proj sgemm calls.

Suggested change:
```cpp
config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
```
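To make the stride argument above concrete, here is a small illustrative sketch (hypothetical names and shapes, not the actual ktransformers buffers) of how the ldc value passed to a row-major GEMM decides where a computed tile lands inside a larger output buffer:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical shapes: a full row-major output buffer of num_rows x full_width
    // floats, of which the GEMM computes a tile_width-wide tile anchored at column 0.
    const std::size_t num_rows = 3, full_width = 8, tile_width = 2;

    // ldc decides where row i of the tile is written: with ldc == full_width the
    // tile rows stay aligned with the containing matrix; with ldc == tile_width
    // the rows are packed contiguously at the start of the buffer instead.
    auto write_tile = [&](std::size_t ldc) {
        std::vector<float> out(num_rows * full_width, 0.0f);
        for (std::size_t i = 0; i < num_rows; ++i)
            for (std::size_t j = 0; j < tile_width; ++j)
                out[i * ldc + j] = 1.0f;
        // A consumer that reads the buffer with a row stride of full_width (as the
        // code after the sgemm call reportedly does) sees zeros past row 0 whenever
        // ldc was the tile width.
        for (std::size_t i = 0; i < num_rows; ++i)
            std::printf("ldc=%zu: row %zu starts with %.0f\n", ldc, i, out[i * full_width]);
    };

    write_tile(full_width);  // prints 1 for every row
    write_tile(tile_width);  // prints 1 only for row 0
    return 0;
}
```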

```diff
  up_proj_ptr, config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
```

critical

The ldc parameter here seems incorrect for the same reason as in the gate_proj sgemm call. The output buffer m_local_up_output_ptr_ has an effective row stride of config_.intermediate_size. Setting ldc to m_block will likely cause memory corruption. The original value config_.intermediate_size should be used.

Suggested change:
```cpp
config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
```

```diff
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_input_ptr,
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_output_ptr,
- config_.hidden_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
```

critical

The ldc parameter for this down_proj sgemm call also appears incorrect. The full output matrix m_local_down_output_ptr_ has dimensions m_local_num_[expert_idx] x config_.hidden_size, so its row stride is config_.hidden_size. ldc should be set to config_.hidden_size to correctly write the output tile into the larger matrix. The change to m_block is likely a bug.

Suggested change:
```cpp
config_.hidden_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
```

@SCDESPERTATE
Contributor Author

As expected, this change speeds up the GEMM path and measurably raises end-to-end throughput.
Environment:

  • GPU: Quadro RTX 5000 x1
  • CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  • Model:
    • GPU: Qwen3-30B-A3B-Instruct-2507
    • CPU: Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF

Before the change:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    2.0       
Max request concurrency:                 256       
Successful requests:                     49        
Benchmark duration (s):                  208.04    
Total input tokens:                      26476     
Total generated tokens:                  2167      
Total generated tokens (retokenized):    1900      
Request throughput (req/s):              0.24      
Input token throughput (tok/s):          127.27    
Output token throughput (tok/s):         10.42     
Total token throughput (tok/s):          137.68    
Concurrency:                             42.04     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   178492.71 
Median E2E Latency (ms):                 184658.36 
---------------Time to First Token----------------
Mean TTFT (ms):                          93600.42  
Median TTFT (ms):                        94187.38  
P99 TTFT (ms):                           159513.79 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1999.87   
Median ITL (ms):                         346.83    
P95 ITL (ms):                            6894.01   
P99 ITL (ms):                            43198.38  
Max ITL (ms):                            144404.67 
==================================================

After the change:

============ Serving Benchmark Result ============                                                                                 
Backend:                                 sglang-oai                                                                                
Traffic request rate:                    2.0                                                                                       
Max request concurrency:                 256                                                                                       
Successful requests:                     49                                                                                        
Benchmark duration (s):                  151.74                                                                                    
Total input tokens:                      26476                                                                                     
Total generated tokens:                  2167                                                                                      
Total generated tokens (retokenized):    1892                                                                                      
Request throughput (req/s):              0.32                                                                                      
Input token throughput (tok/s):          174.48                                                                                    
Output token throughput (tok/s):         14.28                                                                                     
Total token throughput (tok/s):          188.76                                                                                    
Concurrency:                             39.59                                                                                     
----------------End-to-End Latency----------------                                                                                 
Mean E2E Latency (ms):                   122597.20                                                                                 
Median E2E Latency (ms):                 126372.90                                                                                 
---------------Time to First Token----------------                                                                                 
Mean TTFT (ms):                          63159.78                                                                                  
Median TTFT (ms):                        69371.40                                                                                  
P99 TTFT (ms):                           100054.40                                                                                 
---------------Inter-Token Latency----------------                                                                                 
Mean ITL (ms):                           1433.76                                                                                   
Median ITL (ms):                         422.25                                                                                    
P95 ITL (ms):                            4848.20                                                                                   
P99 ITL (ms):                            33468.57                                                                                  
Max ITL (ms):                            96660.79                                                                                  
==================================================

@SCDESPERTATE
Contributor Author

@chenht2022 PTAL. Thanks

@KMSorSMS
Collaborator

I think you should add some self-test scripts to verify the results. Have you run https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/examples/test_moe.py? It would also help to give a more detailed description of the change.

@SCDESPERTATE
Contributor Author

> I think you should add some self-test scripts to verify the results. Have you run https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/examples/test_moe.py? It would also help to give a more detailed description of the change.

Got it, I will double-check this. Thank you for your time.
