Conversation

@SCDESPERTATE
Contributor

What does this PR do?

Fix the arguments passed to llamafile_sgemm in the MoE operator in llamafile so that the ldc parameter always receives the correct row stride of the output matrix.
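For readers unfamiliar with the BLAS-style convention, ldc is the leading dimension of the output matrix C: in a row-major layout, the number of elements between the starts of consecutive output rows. Below is a minimal, self-contained sketch of that convention; it is not the llamafile_sgemm signature, just the general pattern such routines follow.

```cpp
#include <cstddef>

// Naive row-major GEMM sketch: C[M x N] += A[M x K] * B[K x N].
// lda, ldb, ldc are the element counts between consecutive rows of A, B, C in
// memory, which lets the routine read/write a tile embedded in a larger buffer.
static void naive_sgemm(std::size_t M, std::size_t N, std::size_t K,
                        const float* A, std::size_t lda,
                        const float* B, std::size_t ldb,
                        float* C, std::size_t ldc) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] += acc;  // ldc = row stride of the containing output buffer
        }
    }
}
```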

Before submitting

@gemini-code-assist
Contributor

Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical argument passing error within the Mixture-of-Experts (MoE) operator in llamafile. The change ensures that the llamafile_sgemm function, responsible for matrix multiplications, receives the correct row-stride value for its output matrix, thereby preventing potential calculation inaccuracies and improving the robustness of the MoE implementation.

Highlights

  • llamafile_sgemm argument fix: Corrected the ldc parameter passed to llamafile_sgemm calls within the Mixture-of-Experts (MoE) operator, ensuring it consistently uses m_block for the output matrix's row-stride instead of config_.intermediate_size or config_.hidden_size.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an argument passed to llamafile_sgemm, specifically the ldc parameter. However, the analysis of the surrounding code suggests that the proposed changes are incorrect. The ldc parameter, representing the leading dimension (stride) of the output matrix, is being changed from the stride of the full matrix (e.g., config_.intermediate_size) to the width of the computed tile (m_block). This will likely cause incorrect memory writes because the subsequent code expects the sgemm output to be correctly placed within a larger, strided matrix. I have provided detailed comments on each of the three sgemm calls in forward_many, explaining why the original values for ldc appear correct and why the proposed changes are likely to introduce bugs. I recommend reverting these changes.

```diff
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.gate_type), gate_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
```

critical

The ldc parameter for llamafile_sgemm appears to be incorrect. ldc should be the leading dimension (stride) of the full output matrix to allow sgemm to write its tiled result correctly.

The output buffer m_local_gate_output_ptr_[expert_idx] is later accessed (lines 650-652) as a row-major matrix with dimensions m_local_num_[expert_idx] x config_.intermediate_size. This implies a row stride of config_.intermediate_size.

By setting ldc to m_block, you are specifying the stride of the tile itself, not the containing matrix. This will likely lead to incorrect memory access and wrong results, as the sgemm output will be written contiguously instead of into the strided layout of the full matrix. The original value config_.intermediate_size seems correct.

This same issue applies to the up_proj and down_proj sgemm calls.

Suggested change:
```cpp
config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.gate_type,
```
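To make the stride argument above concrete, here is a small illustrative sketch (hypothetical names and shapes, not the actual ktransformers buffers) of how the ldc value passed to a row-major GEMM decides where a computed tile lands inside a larger output buffer:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical shapes: a full row-major output buffer of num_rows x full_width
    // floats, of which the GEMM computes a tile_width-wide tile anchored at column 0.
    const std::size_t num_rows = 3, full_width = 8, tile_width = 2;

    // ldc decides where row i of the tile is written: with ldc == full_width the
    // tile rows stay aligned with the containing matrix; with ldc == tile_width
    // the rows are packed contiguously at the start of the buffer instead.
    auto write_tile = [&](std::size_t ldc) {
        std::vector<float> out(num_rows * full_width, 0.0f);
        for (std::size_t i = 0; i < num_rows; ++i)
            for (std::size_t j = 0; j < tile_width; ++j)
                out[i * ldc + j] = 1.0f;
        // A consumer that reads the buffer with a row stride of full_width (as the
        // code after the sgemm call reportedly does) sees zeros past row 0 whenever
        // ldc was the tile width.
        for (std::size_t i = 0; i < num_rows; ++i)
            std::printf("ldc=%zu: row %zu starts with %.0f\n", ldc, i, out[i * full_width]);
    };

    write_tile(full_width);  // prints 1 for every row
    write_tile(tile_width);  // prints 1 only for row 0
    return 0;
}
```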

```diff
  up_proj_ptr, config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_input_ptr,
  config_.hidden_size / ggml_blck_size((ggml_type)config_.up_type), up_output_ptr,
- config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
```

critical

The ldc parameter here seems incorrect for the same reason as in the gate_proj sgemm call. The output buffer m_local_up_output_ptr_ has an effective row stride of config_.intermediate_size. Setting ldc to m_block will likely cause memory corruption. The original value config_.intermediate_size should be used.

Suggested change:
```cpp
config_.intermediate_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.up_type,
```

```diff
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_input_ptr,
  config_.intermediate_size / ggml_blck_size((ggml_type)config_.down_type), down_output_ptr,
- config_.hidden_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
+ m_block, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
```

critical

The ldc parameter for this down_proj sgemm call also appears incorrect. The full output matrix m_local_down_output_ptr_ has dimensions m_local_num_[expert_idx] x config_.hidden_size, so its row stride is config_.hidden_size. ldc should be set to config_.hidden_size to correctly write the output tile into the larger matrix. The change to m_block is likely a bug.

Suggested change:
```cpp
config_.hidden_size, 0, 1, GGML_TASK_TYPE_COMPUTE, (ggml_type)config_.down_type,
```

@SCDESPERTATE
Contributor Author

As expected, this change speeds up the GEMM path and measurably raises end-to-end throughput.
Environment:

  • GPU: Quadro RTX 5000 x1
  • CPU: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
  • Model:
    • GPU: Qwen3-30B-A3B-Instruct-2507
    • CPU: Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF

Before the change:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    2.0       
Max request concurrency:                 256       
Successful requests:                     49        
Benchmark duration (s):                  208.04    
Total input tokens:                      26476     
Total generated tokens:                  2167      
Total generated tokens (retokenized):    1900      
Request throughput (req/s):              0.24      
Input token throughput (tok/s):          127.27    
Output token throughput (tok/s):         10.42     
Total token throughput (tok/s):          137.68    
Concurrency:                             42.04     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   178492.71 
Median E2E Latency (ms):                 184658.36 
---------------Time to First Token----------------
Mean TTFT (ms):                          93600.42  
Median TTFT (ms):                        94187.38  
P99 TTFT (ms):                           159513.79 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           1999.87   
Median ITL (ms):                         346.83    
P95 ITL (ms):                            6894.01   
P99 ITL (ms):                            43198.38  
Max ITL (ms):                            144404.67 
==================================================

After the change:

============ Serving Benchmark Result ============                                                                                 
Backend:                                 sglang-oai                                                                                
Traffic request rate:                    2.0                                                                                       
Max request concurrency:                 256                                                                                       
Successful requests:                     49                                                                                        
Benchmark duration (s):                  151.74                                                                                    
Total input tokens:                      26476                                                                                     
Total generated tokens:                  2167                                                                                      
Total generated tokens (retokenized):    1892                                                                                      
Request throughput (req/s):              0.32                                                                                      
Input token throughput (tok/s):          174.48                                                                                    
Output token throughput (tok/s):         14.28                                                                                     
Total token throughput (tok/s):          188.76                                                                                    
Concurrency:                             39.59                                                                                     
----------------End-to-End Latency----------------                                                                                 
Mean E2E Latency (ms):                   122597.20                                                                                 
Median E2E Latency (ms):                 126372.90                                                                                 
---------------Time to First Token----------------                                                                                 
Mean TTFT (ms):                          63159.78                                                                                  
Median TTFT (ms):                        69371.40                                                                                  
P99 TTFT (ms):                           100054.40                                                                                 
---------------Inter-Token Latency----------------                                                                                 
Mean ITL (ms):                           1433.76                                                                                   
Median ITL (ms):                         422.25                                                                                    
P95 ITL (ms):                            4848.20                                                                                   
P99 ITL (ms):                            33468.57                                                                                  
Max ITL (ms):                            96660.79                                                                                  
==================================================

@SCDESPERTATE
Contributor Author

@chenht2022 PTAL. Thanks

@KMSorSMS
Collaborator

I think you should add some self-test scripts to verify the results. Have you run https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/examples/test_moe.py? It would also help to give a more detailed description of the change.

@SCDESPERTATE
Contributor Author

> I think you should add some self-test scripts to verify the results. Have you run https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/examples/test_moe.py? It would also help to give a more detailed description of the change.

Got it, I will double-check this. Thank you for your time.
