Skip to content

llama + spec: MTP Support #22673

Merged
am17an merged 28 commits into
ggml-org:masterfrom
am17an:mtp-clean
May 16, 2026
Merged

llama + spec: MTP Support #22673
am17an merged 28 commits into
ggml-org:masterfrom
am17an:mtp-clean

Conversation

@am17an
Copy link
Copy Markdown
Contributor

@am17an am17an commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than >2x speed-up over baseline. The design decisions I took to get to this stage are as follows:

Tip

MTP is compatible with Vision input and Tensor/Pipeline Parallelism

Note

Prompt processing (PP) speed typically takes a negative hit when MTP is enabled mainly due to Device-To-Host (D2H) embedding transfers. It's something to be optimized in the future.

Note

Parallel decoding with MTP is supported, but not fully optimized yet.

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-max-n 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}

MTP --spec-draft-max-n 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with spec-draft-n-max 16 with partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model with spec-draft-n-max 64 with no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF which I made by using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model

These are some sample commands to get started with MTP:

# MTP with draft size N (values for N: 2,3,...)
llama-server -hf [model-with-mtp] --spec-type draft-mtp --spec-draft-n-max 2

# add `--no-mmproj` to disable vision support if not needed (uses less memory)
llama-server ... --no-mmproj

# [ADVANCED]
# combine MTP + ngram-* (experimental, suitable for non-CUDA systems)
# use these combinations only if you know what you are doing 
llama-server -hf [model-with-mtp] \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64

# (same as above, but shorter)
llama-server -hf [model-with-mtp] --spec-default --spec-type draft-mtp --spec-draft-n-max 3

Models

Quality check

The results from 4 runs of the AIME2026 eval (4x30 questions in total) with MTP enabled, using llama-eval, are within expectation and match the reported value by Qwen team.

image

Full data: aime2026-qwen3.6-27b-mtp-q4_k-x4.json.html

Next Steps until merge

TODOs after merge

  • Improve ngram compatibility with mtp
  • Add recurrent state tests to CI
  • Re-enable --spec-draft-p-min support for mtp
  • Fix partial rollback for batch size > 1 + n_rs_seq (sample patch)
  • Improve multi-seq performance of the recurrent memory for n_rs_seq > 0 (currently the multi-seq states are not contiguous in memory so cannot be batched together)
  • Avoid D2H + H2D pre-norm embedding transfers somehow?
  • Metal drafting improvements metal: reuse K/V in flash-attn vec for spec-decode #23114 ?

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing. Also the convert_hf_to_gguf.py + model definitions. Writing bench for validation against vLLM.

@github-actions github-actions Bot added model Model specific testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend examples python python script changes server ggml changes relating to the ggml tensor library for machine learning labels May 4, 2026
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 4, 2026

Nice, I think this is a fresh start better than my WIP #18886 (that I still never find the time to continue)

There were some other attempts to add MTP support but they all heavily rely on host <--> device data copy. I assume you tried addressed this, right? (Maybe there was a discussion somewhere but I wasn't aware of)

Copy link
Copy Markdown
Contributor

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not a review, but opening some discussions)

Comment thread src/llama-memory-recurrent.h
Comment thread src/models/qwen35.cpp Outdated
Comment thread tools/server/server-context.cpp Outdated
@am17an
Copy link
Copy Markdown
Contributor Author

am17an commented May 4, 2026

@ngxson yes the h2d was discussed with GG, he's working on a refactor which will allow us to share tensors between two llama context

@pwilkin
Copy link
Copy Markdown
Member

pwilkin commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

@cmp-nct
Copy link
Copy Markdown
Contributor

cmp-nct commented May 4, 2026

in my opinion Qwen 3.6 is the most important thing that happened in open source models in a long time, this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting ?
So MTP is used until ngram is triggered - switching to ngram until rejection and back to MTP

ngram could be set to match only very strong and long candidates - for large repetitive paraphrasing
and MTP fills the gap

@Dampfinchen
Copy link
Copy Markdown

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory.

@mbednarek360
Copy link
Copy Markdown

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

@am17an
Copy link
Copy Markdown
Contributor Author

am17an commented May 4, 2026

@cmp-nct I'm not sure, but could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of overall memory used (it's just a single layer transformer + kv cache, much lighter than draft models)

@mbednarek360 I've only tested this on a small number of CUDA devices as of now, once it's ready to review I would have tested more devices/backends. In particular this PR relies on #22400 which is not implemented for vulkan for now, if you ask an LLM to add support for that you might get a little further Vulkan and Metal also tested

@nawoa
Copy link
Copy Markdown

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that MTP weights model are relatively small this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU with a cheaper low-VRAM "normal" GPU used for display output, etc... possibly prevent some degree of resource contention.

@cturan
Copy link
Copy Markdown

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this to become stable, here automated test results for my machine;

__
Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

mode model prefill tok/s avg generation tok/s avg MTP acceptance loaded VRAM
MTP enabled Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3 665.14 42.45 76.0% 24.96 GiB
MTP disabled, same GGUF Qwen3.6-27B-MTP-Q6_K.gguf, no spec 1315.46 22.97 n/a 22.47 GiB
Existing non-MTP Q6 Qwen3.6-27B-Q6_K.gguf, no spec 1260.12 22.39 n/a 22.59 GiB

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

@am17an
Copy link
Copy Markdown
Contributor Author

am17an commented May 4, 2026

@cturan Thanks for testing, I'm aware of the issue for the prefill and will work on a fix.

@iiLaurens
Copy link
Copy Markdown

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

@nybblr
Copy link
Copy Markdown

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)

@volkermauel
Copy link
Copy Markdown

only a quick test run, 1x 5090 qwen3.6-27b mtp 3, q4_0 quantized, kv also q4_0

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =������63.16 ms /����16 tokens (����3.95 ms per token,   253.34 tokens per second)
�������eval time =   56063.04 ms /  5913 tokens (����9.48 ms per token,   105.47 tokens per second)
������total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot������release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

prompt „create a flappy bird clone“

(I‘m not creative, sorry)

Great Speedup!

@alexandrupetraru
Copy link
Copy Markdown

this is a game changer, on Strix Halo with the q8 Qwen 3.6 35B3A jumping from 40 to 70 tg at low context and for the 27B from 12 to 25 tg(with layer split 7900 xtx and strix halo 50,50) for coding. We need this one to master asap together with turbo4, it performs very well and without any issues. Good job

@GloballyUniquePlaceholder
Copy link
Copy Markdown

On a 3060 Laptop 6GB vram + 64GB ram running your provided Qwen 3.6 35A3B gguf there is a reasonable speed up.

spec-draft-n-max average tk\s wall_s_total aggregate_accept_rate
n/a - no mtp 22.92 77.69 n/a
1 27.58 68.34 0.8835
2 29.39 66.00 0.815
3 27.78 67.96 0.7127
4 26.09 72.23 0.6421
raw results

spec-draft-n-max 4

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 4

python mtp-bench.py
  code_python        pred= 192 draft= 180 acc= 146 rate=0.811 tok/s=31.3
  code_cpp           pred= 192 draft= 216 acc= 136 rate=0.630 tok/s=22.7
  explain_concept    pred= 192 draft= 224 acc= 134 rate=0.598 tok/s=22.3
  summarize          pred=  53 draft=  52 acc=  39 rate=0.750 tok/s=33.3
  qa_factual         pred= 192 draft= 196 acc= 141 rate=0.719 tok/s=29.2
  translation        pred=  22 draft=  32 acc=  13 rate=0.406 tok/s=19.4
  creative_short     pred= 192 draft= 264 acc= 124 rate=0.470 tok/s=20.7
  stepwise_math      pred= 192 draft= 192 acc= 143 rate=0.745 tok/s=30.7
  long_code_review   pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=25.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1576,
  "total_draft_accepted": 1012,
  "aggregate_accept_rate": 0.6421,
  "wall_s_total": 72.23
}

spec-draft-n-max 3

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

python mtp-bench.py
  code_python        pred= 192 draft= 165 acc= 136 rate=0.824 tok/s=30.2
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=27.6
  explain_concept    pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=25.3
  summarize          pred=  53 draft=  48 acc=  36 rate=0.750 tok/s=32.5
  qa_factual         pred= 192 draft= 180 acc= 131 rate=0.728 tok/s=29.2
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=24.5
  creative_short     pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=23.2
  stepwise_math      pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=30.5
  long_code_review   pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=27.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1347,
  "total_draft_accepted": 960,
  "aggregate_accept_rate": 0.7127,
  "wall_s_total": 67.96
}

spec-draft-n-max 2

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

python mtp-bench.py
  code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=31.5
  code_cpp           pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=27.0
  explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=25.6
  summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=32.2
  qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.1
  translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=30.8
  creative_short     pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=25.9
  stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.3
  long_code_review   pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=29.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1070,
  "total_draft_accepted": 872,
  "aggregate_accept_rate": 0.815,
  "wall_s_total": 66.0
}

spec-draft-n-max 1

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 1

python mtp-bench.py
  code_python        pred= 192 draft=  96 acc=  94 rate=0.979 tok/s=28.3
  code_cpp           pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=26.2
  explain_concept    pred= 192 draft= 102 acc=  89 rate=0.873 tok/s=25.9
  summarize          pred=  56 draft=  29 acc=  26 rate=0.897 tok/s=30.6
  qa_factual         pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=28.5
  translation        pred=  22 draft=  12 acc=   9 rate=0.750 tok/s=27.0
  creative_short     pred= 192 draft= 104 acc=  86 rate=0.827 tok/s=24.9
  stepwise_math      pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.7
  long_code_review   pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1422,
  "total_draft": 747,
  "total_draft_accepted": 660,
  "aggregate_accept_rate": 0.8835,
  "wall_s_total": 68.34
}

no mtp

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}"

python mtp-bench.py
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.2
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=25.9
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=22.3
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=21.4
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.0
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 77.69
}

@ninjas28
Copy link
Copy Markdown

ninjas28 commented May 5, 2026

Crashes when using -sm tensor with llama-server launch command args -hf am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 -sm tensor -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3. Using -sm tensor without MTP works fine. This is on a triple GPU setup using ROCm.

srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 356
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 352, batch.n_tokens = 352, progress = 0.988764
/root/llama.cpp/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
/root/llama.cpp/build/bin/libggml-base.so.0(+0x1b25b)[0x74b4b4ca925b]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)[0x74b4b4ca96df]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x152)[0x74b4b4ca98b2]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41506)[0x74b4b4ccf506]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x3d579)[0x74b4b4ccb579]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41adb)[0x74b4b4ccfadb]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x474)[0x74b4b4cbff54]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111)[0x74b4b4cc6351]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe8)[0x74b4b44dac08]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context21handle_mtp_for_ubatchEiPKiS1_P11ggml_tensor+0x20d)[0x74b4b44da9bd]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x142)[0x74b4b44dac62]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
llama-server(+0xf846e)[0x63c5e42c046e]
llama-server(+0x172971)[0x63c5e433a971]
llama-server(+0x5842c)[0x63c5e422042c]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74b4b3c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74b4b3c29e40]
llama-server(+0x58cd5)[0x63c5e4220cd5]
Aborted```

@superjamie
Copy link
Copy Markdown

Tested on 3x RTX3060 12Gb. Sorry I don't have the VRAM for your Q8, I used RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF which was quantized with ik_llama's MTP.

Prompt: "Write a simple minimal hash table implementation in C99."

Three runs with no MTP, avg generation 18.51 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}'

prompt eval time =     177.62 ms /    24 tokens (    7.40 ms per token,   135.12 tokens per second)
       eval time =   99331.08 ms /  1837 tokens (   54.07 ms per token,    18.49 tokens per second)
      total time =   99508.70 ms /  1861 tokens

prompt eval time =     159.10 ms /    24 tokens (    6.63 ms per token,   150.85 tokens per second)
       eval time =  107505.42 ms /  1988 tokens (   54.08 ms per token,    18.49 tokens per second)
      total time =  107664.52 ms /  2012 tokens

prompt eval time =     158.43 ms /    24 tokens (    6.60 ms per token,   151.49 tokens per second)
       eval time =   48263.07 ms /   895 tokens (   53.93 ms per token,    18.54 tokens per second)
      total time =   48421.51 ms /   919 tokens

Three runs with MTP, avg generation 32.24 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}' \
 --spec-type mtp --spec-draft-n-max 3 --parallel 1

prompt eval time =     232.24 ms /    24 tokens (    9.68 ms per token,   103.34 tokens per second)
       eval time =   34610.94 ms /  1110 tokens (   31.18 ms per token,    32.07 tokens per second)
      total time =   34843.18 ms /  1134 tokens 
      
prompt eval time =     207.99 ms /    24 tokens (    8.67 ms per token,   115.39 tokens per second)
       eval time =   32110.05 ms /  1064 tokens (   30.18 ms per token,    33.14 tokens per second)
      total time =   32318.03 ms /  1088 tokens
      
prompt eval time =     208.50 ms /    24 tokens (    8.69 ms per token,   115.11 tokens per second)
       eval time =   39029.34 ms /  1230 tokens (   31.73 ms per token,    31.51 tokens per second)
      total time =   39237.84 ms /  1254 tokens 

Result 74% speedup. Wow!

Thank you for your work. You will make many users happy with this. What an exciting PR!

One small hiccup. On my initial attempt I got the error message:

load_model: MTP currently supports only n_parallel=1; got 4

Adding --parallel 1 fixed that.

@richardcb

This comment has been minimized.

@candrews
Copy link
Copy Markdown

--fit doesn't appear to take MTP into account - is that being worked on?

@AbdulrahmanHashem
Copy link
Copy Markdown

AbdulrahmanHashem commented May 20, 2026

@ggerganov @am17an somewhere between merged PRs #23234 and #23333 something made models need more ram to fit into the same setup
and i'm 100% sure of that

15 May 19:46 build_4_BFMTP
18 May 02:13 build_5 <--- not broken
19 May 23:32 build_7_Bugged <--- broken
20 May 22:02 build_8

it seems to be MTP related, i tested the none MTP model from unsloth and i didn't see the problem.

And the model has been doing this ever since that update happened
though i haven't tried normal models enough to see if they do that

build_5
never did this like at all.

image

@andyskw
Copy link
Copy Markdown

andyskw commented May 21, 2026

Adding the first gfx1150 (AMD Radeon 890M / Strix Point APU, Ryzen AI 9 HX 470) data point. Tested on unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL, all 42 layers fully on Vulkan0 (UMA, no CPU fallback — verified with --verbose).

Setup

  • gfx1150, RDNA 3.5, RADV STRIX1, Mesa 26.1.0, Vulkan 1.4.348 (subgroup_size 32, subgroup_clustered yes)
  • Ubuntu 24.04, kernel 6.17.0-23, 57 GiB UMA
  • llama.cpp commit ad27757 (master 2026-05-20), Vulkan build
  • Bench: am17an's gist (https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090), 9 prompts, max_tokens=192, seed=42, temp=1.0
./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -c 8192 -fa on -np 1 \
  -ctk q8_0 -ctv q8_0 --jinja --no-mmap \
  --spec-type draft-mtp --spec-draft-n-max 2

n-max sweep

n-max runs mean tps std accept speedup
baseline (no MTP) 3 21.50 0.24 1.000×
1 2 23.99 0.19 0.883 1.116×
2 2 25.77 0.04 0.803 1.199×
3 3 24.78 0.11 0.698 1.153×
4 2 22.51 0.01 0.588 1.047×
6 2 22.25 0.02 0.538 1.034×

Best config (mtp-d2-r1) — per-prompt detail

  code_python        pred= 192 draft= 147 acc= 117 rate=0.796 tok/s= 25.37
  code_cpp           pred= 192 draft= 137 acc= 122 rate=0.890 tok/s= 27.37
  explain_concept    pred= 192 draft= 150 acc= 116 rate=0.773 tok/s= 25.21
  summarize          pred= 192 draft= 143 acc= 119 rate=0.832 tok/s= 26.19
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s= 27.16
  translation        pred= 192 draft= 144 acc= 119 rate=0.826 tok/s= 26.22
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s= 23.39
  stepwise_math      pred= 192 draft= 140 acc= 121 rate=0.864 tok/s= 26.94
  long_code_review   pred= 192 draft= 157 acc= 112 rate=0.713 tok/s= 23.91
{
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1316,
  "total_draft_accepted": 1057,
  "aggregate_accept_rate": 0.8032,
  "wall_s_total": 73.0
}

Observations

  1. Sweet spot is n=2 on this hardware, not n=3-6. Per-step Vulkan verify overhead grows faster than accepted-token throughput beyond n=2.
  2. Consistent with what @sheevy reported earlier on Strix Halo for this same MoE model (51→60 tg/s ≈ 1.18×). The bigger Strix Halo speedups in this thread are on dense Qwen3.6-27B (~2.5×) or higher quants — MoE A3B has less MTP headroom because only ~3 GB of expert weights are read per token, so even baseline is already near memory-bw ceiling on these LPDDR5x APUs. Would predict 27B-dense to give ~2× on gfx1150 as well, by the same ratio.
  3. wave32 vs wave64: RADV_DEBUG=nosubgroupsizectrl to force wave64 = identical perf (25.82 vs 25.77 tok/s). Both subgroup modes are equally well-tuned on RDNA 3.5.
  4. Default config is near-optimal on gfx1150. Swept KV cache quant (f16/q8_0/q4_0 combos), ctx (4k/8k/16k), ubatch (256-2048), threads (8-24), and all RADV_PERFTEST/DEBUG flags. All within ±1% of mtp-d2 default. q4_0 KV is the one degradation worth flagging (-4% tps and -4pp accept rate). Default -ctk q8_0 -ctv q8_0 is correct.
  5. GDN shader lanes_per_column (spec constant in gated_delta_net.comp): patched ggml-vulkan.cpp locally to env-override and swept {1,2,4,8,16}. Clean U-curve with the upstream default of 8 at the peak. The auto-chosen value for S_V=128 + subgroup_clustered is correct on gfx1150.

Two minor things worth filing separately if reproducible

  • GGML_VK_GDN_LANES=32 (i.e., lanes_per_column == subgroup_size, which switches to the nocluster shader path) crashed the server mid-bench in this model. The --verbose log just shows the gen loop terminating with no error. I haven't followed up to reproduce reliably or get a backtrace, but might be worth a glance at the nocluster path for that boundary case.
  • -fa off with -ctv q8_0 segfaults silently. The error V cache quantization requires flash_attn is printed but exit is via SIGSEGV, not clean. A small return after that check would be friendlier.

@ggerganov
Copy link
Copy Markdown
Member

@AbdulrahmanHashem Do you have a consistent repro of the garbage generation? If yes, please open an issue with detailed information about your hardware, model, logs, etc.

@AbdulrahmanHashem
Copy link
Copy Markdown

AbdulrahmanHashem commented May 21, 2026

@AbdulrahmanHashem Do you have a consistent repro of the garbage generation? If yes, please open an issue with detailed information about your hardware, model, logs, etc.

sadly i can't get enough time to make the issue but i just made a build and it's alittle better with when it comes to using more VRAM but it still uses at least about 250 mb more ram at the moment

i tested it on my own projects and i haven't seen garbage again. Update : it just gave me garbage again.

@nipeone
Copy link
Copy Markdown

nipeone commented May 22, 2026

Great! The current branch has been merged into the main branch. But how do I use it after merging into the main branch? After compiling with the latest code of master branch, llama-server --spec-type does not support draft-mtp.

error while handling argument "--spec-type": unknown speculative decoding type without draft model

usage:
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
                                        type of speculative decoding to use when no draft model is provided
                                        (default: none)
                                        
                                        (env: LLAMA_ARG_SPEC_TYPE)

@kyuz0
Copy link
Copy Markdown

kyuz0 commented May 22, 2026

For somebody interested in a full comparison on a coding benchmark to see how much benefit MTP gives you, I run SWE Verified mini on Strix Halo and AMD R9700:

https://pi-local-coding-bench.dev/

In a nutshell, there is a measurable improvement in average time to complete tasks, more than I expected given the negative performance hit with prompt processing:

Strix Halo / Qwen 3.6 35B-A3B UD_Q8-K-XL:

image

R9700 / Qwen 3.6 27B UD_Q4-K-XL:

image

I also observed an improvement in task completion - is it just random or does MTP change some of the sampling strategy?

@ggerganov
Copy link
Copy Markdown
Member

I also observed an improvement in task completion - is it just random or does MTP change some of the sampling strategy?

It should be random. You need to run the eval multiple times to reduce the variance of the result.

@kyuz0
Copy link
Copy Markdown

kyuz0 commented May 22, 2026

It should be random. You need to run the eval multiple times to reduce the variance of the result.

Yup, that was what I thought. I need to find the time to do it, even single runs of the full benchmark can take half a day!

@ggerganov
Copy link
Copy Markdown
Member

Yes, you need to distribute it on many machines.

Btw, without MTP, the Qwen3.x models should support parallel processing efficiently. So depending on the max context needed for these tasks, you can run requests in parallel on a single server. The more requests you can batch, the better.

However, batching with MTP enabled using a recurrent model (i.e. Qwen3.x) is currently not optimized, so you won't benefit from parallel processing on a single machine in that case. The only way atm is to scale the machines.

@kyuz0
Copy link
Copy Markdown

kyuz0 commented May 22, 2026

Thanks @ggerganov , I'm definitely doing that... Interestingly a comment on my video on this seems to think instead MTP might actually improve performance:

Screenshot_20260522-155201

But, this will be clear after I've re-run all benchmarks.

@cmp-nct
Copy link
Copy Markdown
Contributor

cmp-nct commented May 22, 2026

Yes, you need to distribute it on many machines.

Btw, without MTP, the Qwen3.x models should support parallel processing efficiently. So depending on the max context needed for these tasks, you can run requests in parallel on a single server. The more requests you can batch, the better.

However, batching with MTP enabled using a recurrent model (i.e. Qwen3.x) is currently not optimized, so you won't benefit from parallel processing on a single machine in that case. The only way atm is to scale the machines.

What I would have liked to see is to restrict n_rs_seq to one slot only.
So we could run parallel workloads (with just ngram-mod) and then a single slot continuation on the same context with MTP enabled.
Currently enabling n_rs_seq will multiply it into all slots, that's half a GB vram per slot. So it can not selectively run on slot 0 only.
Though my use case is less standard, most people just run a parallel server or a single slot instance

@darkbasic
Copy link
Copy Markdown

@kyuz0 did you limit the parallel agents for the non-mtp use case? Because MTP currently limits you to 1 concurrent agent, while I noticed that qwen3.6-35B likes to use several parallel agents in opencode (as opposed to qwen3.5-122B) which makes things faster on its own.

@kyuz0
Copy link
Copy Markdown

kyuz0 commented May 22, 2026

@kyuz0 did you limit the parallel agents for the non-mtp use case? Because MTP currently limits you to 1 concurrent agent, while I noticed that qwen3.6-35B likes to use several parallel agents in opencode (as opposed to qwen3.5-122B) which makes things faster on its own.

@darkbasic the benchmark was done with pi, no sub agents, so both MTP and non-MTP had 1 agent thread.

@darkbasic
Copy link
Copy Markdown

@kyuz0 that explains it. Also keep in mind that opencode tends to bloat the context much more if it doesn't know the model. You can fake it to a known one by running llama-server with --alias gpt-5.5, but beware that the model must be smart enough to handle what the harness expects it to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apple Metal https://en.wikipedia.org/wiki/Metal_(API) examples ggml changes relating to the ggml tensor library for machine learning model Model specific Nvidia GPU Issues specific to Nvidia GPUs python python script changes server testing Everything test related Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.