llama-bench: add --n-cpu-moe support #15952
Conversation
Force-pushed from 08a2164 to c7a04f2.
Nice, I had this on my TODO list as well :)
Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.
tools/llama-bench/llama-bench.cpp (outdated diff excerpt):

```cpp
if (field == "tensor_buft_overrides") {
    if (value.size() > width)
        value.erase(width);
```
Can we have the typical behavior of "strip last 3 characters + add '...'" instead to make it clear to the user we're truncating?
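A minimal sketch of the suggested truncation behavior (the helper name is assumed, not from the PR):

```cpp
#include <string>

// Truncate a cell value to the column width, replacing the last three
// characters with "..." so the user can see the value was cut.
static std::string truncate_with_ellipsis(std::string value, size_t width) {
    if (value.size() > width && width >= 3) {
        value.erase(width - 3);  // std::string::erase(pos) drops everything from pos
        value += "...";
    }
    return value;
}
```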
I was thinking about several ways to handle it:
- simple cut
- “nice” cut (with “…”)
- using a special string like “-” instead of the value
- renaming the “ot” column to “ncmoe”
- removing the “ot” column
I went with the simplest option just to start the discussion.
I think the last option (skipping that column) is the best.
However, since someone added “-ot” to llama-bench, I don’t want to introduce a regression.
Still, without the cut the table only renders correctly for short strings.
Is this column useful? The way I implemented --n-cpu-moe, it can have only one value, so each row will have the same value in that column.
Okay, final version: I think we should unify how parameters to llama-bench behave:
- make --n-cpu-moe a vector-type parameter, similar to all the other switches (like -fa, -p, -n and the like), so that it's suitable for comparison
- keep -ot for now, but:
- move all the parameters that don't take multiple options, i.e. don't take part in the comparison, above the table. It should look something like this:
```
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
model:            qwen3moe 30B.A3B Q3_K - Medium
model size:       13.70 GiB
model parameters: 30.53 B
override_tensors: blk\.0\.ffn_(up|down|gate)_exps=CPU;blk... (full expression here if ot was actually set)

| backend | ngl | test  | t/s             |
| CUDA    |  99 | pp512 | 1253.62 ± 20.28 |
| CUDA    |  99 | tg128 | 54.04 ± 0.27    |
```

This way, all parameters that are actually part of the benchmark are in the table, while the table isn't cluttered with repeated values that don't actually change during the test.
(if you don't want to implement all of this, just make --n-cpu-moe a vector parameter and move override_tensors above the table, and we can add another PR for the rest)
@ggerganov @slaren @CISC what do you think? does this sound like a reasonable idea?
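For reference, llama-bench's multi-value switches take comma-separated lists, so a vector-typed --n-cpu-moe boils down to parsing an integer list; a minimal sketch (the function name is assumed, not the PR's actual code):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated argument like "0,6,12,18" into the list of
// values to benchmark, one run per value.
static std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}
```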
n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.
> n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.
I agree that this would be the ideal behavior, and I will try to implement it that way.
n_cpu_moe will probably be part of struct test (not just an alias for ot in the argument parser), so the change will be bigger.
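Roughly the direction described above (a sketch; the surrounding fields are elided and the exact placement is an assumption):

```cpp
// n_cpu_moe becomes a real member of llama-bench's test record so it can
// take part in the cross-product of benchmark runs and get its own
// markdown column, instead of being an argument-parser alias for -ot.
struct test {
    // ... existing fields (model, n_gpu_layers, ...) ...
    int n_cpu_moe = 0;  // number of leading MoE layers whose expert FFNs stay on CPU
    // ... existing fields ...
};
```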
```
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 0,6,12,18
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
```

| model | size | params | backend | ngl | n_cpu_moe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 0 | pp512 | 433.44 ± 2.62 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 0 | tg128 | 10.85 ± 0.00 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 6 | pp512 | 388.82 ± 8.21 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 6 | tg128 | 13.56 ± 0.01 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 12 | pp512 | 1239.89 ± 11.49 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 12 | tg128 | 51.03 ± 1.88 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 18 | pp512 | 652.27 ± 20.82 |
| qwen3moe 30B.A3B Q3_K - Medium | 13.70 GiB | 30.53 B | CUDA | 99 | 18 | tg128 | 33.97 ± 1.17 |

build: c0bb1ae67 (6457)
Force-pushed from c7a04f2 to 9de1592.
tools/llama-bench/llama-bench.cpp (outdated diff excerpt):

```diff
 }

-int width = get_field_width(field);
+unsigned int width = get_field_width(field);
```
What's the reason to change this to unsigned?
I tried to fix the CI (but it’s still failing on mac for an unrelated reason)
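For context, the warning being worked around (an illustrative snippet, not the PR's exact code): std::string::size() returns an unsigned size_t, so comparing it against a signed int trips -Wsign-compare when warnings are treated as errors. An explicit cast would avoid changing the variable's type:

```cpp
#include <string>

void check(const std::string & value, int width) {
    // if (value.size() > width) { ... }     // warns: comparison of integer
    //                                       // expressions of different signedness
    if (value.size() > (size_t) width) {     // explicit cast keeps width an int
        // truncate ...
    }
}
```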
The CI issue is not related. Please revert this change if it is not necessary.
done
Support --n-cpu-moe in llama-bench the same way it is supported by llama-server. Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.
Force-pushed from 9de1592 to 33d9591.
Maybe we can add
* llama-bench: add --n-cpu-moe support

  Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.
looks like it's actually useful :)
Follow-up to #15077.

This changeset addresses three items (I can split it into atomic commits if needed):
1. Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.
2. Fix the table by trimming the tensor_buft_overrides output in markdown_printer.
3. Refactor the duplicated ffn_(up|down|gate)_exps regex into helper functions (sketched below).
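A rough sketch of what the shared helper could look like, based on the override string shown earlier in the thread (the helper name and the string-based representation are assumptions, not the PR's actual code):

```cpp
#include <string>

// Build the tensor-buffer-override pattern that pins the expert FFN
// tensors of the first n layers to the CPU, mirroring what llama-server
// does for --n-cpu-moe.
static std::string n_cpu_moe_pattern(int n_layers) {
    std::string pattern;
    for (int i = 0; i < n_layers; i++) {
        if (!pattern.empty()) {
            pattern += ";";
        }
        pattern += "blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps=CPU";
    }
    return pattern;
}
```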
Tested on Windows: