Eval bug: draft-mtp changes deterministic output on Qwen3.6 MTP model #23335

@carbocation

Name and Version

llama-server --version
version: 9216 (1ff0fc138)
built with AppleClang 21.0.0.21000101 for Darwin arm64

Platform: macOS 26.5, arm64
GPU backend: Metal
Hardware: Apple M4 Pro
Model: Qwen3.6-27B-Q4_K_M.gguf
Model metadata includes:
general.architecture = qwen35
qwen35.nextn_predict_layers = 1

Operating systems

Mac

GGML backends

Metal

Hardware

M4 MacBook Pro

Models

unsloth/Qwen3.6-27B-MTP-GGUF (file: Qwen3.6-27B-Q4_K_M.gguf)

Problem description & steps to reproduce

Under deterministic sampling, draft-mtp changes the committed token stream for a Qwen3.6 MTP GGUF model. In llama-server, the same prompt and request body produce one token stream without speculative decoding and different token streams when --spec-type draft-mtp is enabled. This happens with an explicit fixed seed (seed: 1234) and temperature: 0.

The request body is identical for all runs:

curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"seed":1234,"cache_prompt":false,"stream":false,"return_tokens":true}'

The full logs are in the log output section below; the pattern is:

no speculative decoding: output A
draft-mtp width 1:       output B
draft-mtp width 2:       output C
draft-mtp width 3:       output C
draft-mtp width 4:       output C

So enabling draft-mtp changes the committed token stream at every tested --spec-draft-n-max value (1 through 4). The width-1 run is notable because every drafted token was reported as accepted, yet the output still differs from the no-MTP case:

draft_n: 31
draft_n_accepted: 31

Expected behavior: Speculative decoding should not change the committed token stream. With the same prompt, same sampler settings, explicit seed:1234, and temperature:0, draft-mtp should produce the same tokens as no speculative decoding.
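To make the expected invariant concrete, here is a minimal toy sketch in Python. It is not llama.cpp code: `target`, `draft`, and both decode functions are hypothetical stand-ins. It assumes greedy verification, where a drafted token is kept only if it equals the target's own next token, so the committed stream cannot depend on the draft width.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # hypothetical "greedy next token" function


def greedy_decode(target: NextFn, prompt: List[Token], n: int) -> List[Token]:
    """Plain greedy decoding: one target call per committed token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy_decode(
    target: NextFn, draft: NextFn, prompt: List[Token], n: int, width: int
) -> List[Token]:
    """Greedy speculative decoding: the draft only proposes, the target decides."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # The draft proposes up to `width` tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(width):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Verification: always commit the target's token, stop at the first mismatch.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if tok != expected:
                break
        else:
            # Every drafted token matched, so the target yields one extra token.
            out.append(target(out))
    return out[len(prompt):len(prompt) + n]


def toy_target(ctx: List[Token]) -> Token:
    # Arbitrary deterministic function standing in for argmax over target logits.
    return (sum(ctx) * 31 + len(ctx)) % 97


def toy_draft(ctx: List[Token]) -> Token:
    # Imperfect draft: deliberately wrong whenever the context length is a multiple of 3.
    guess = toy_target(ctx)
    return (guess + 1) % 97 if len(ctx) % 3 == 0 else guess


baseline = greedy_decode(toy_target, [1, 2, 3], 64)
for width in (1, 2, 3, 4):
    spec = speculative_greedy_decode(toy_target, toy_draft, [1, 2, 3], 64, width)
    print(width, spec == baseline)
```

In this toy, every committed token comes from the target evaluated on the already committed prefix, which is why all widths agree with the baseline. The real server additionally applies sampler state such as the repeat penalty, which this sketch does not model.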

Actual behavior: draft-mtp from #22673 produces committed tokens that differ from the no-speculative baseline. The divergent output depends on --spec-draft-n-max: width 1 yields output B, while widths 2 through 4 all yield output C. llama-completion does not appear to support --spec-type, so this repro is limited to llama-server.
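For anyone comparing runs, a small Python helper like the following can locate where two committed streams first disagree. The token lists are the first 12 entries of the "tokens" arrays in the logs of this issue; the function name `first_divergence` is only an illustrative choice.

```python
from typing import List, Optional


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the lists are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: identical, or one list is a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))


# First 12 committed tokens of each run, copied from the responses in the logs.
output_a = [16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785, 264, 3545]  # no speculative decoding
output_b = [16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 383, 264, 1156]   # draft-mtp width 1
output_c = [16, 13, 8509, 3992, 3983, 1542, 5774, 383, 264, 1156, 579, 3545]    # draft-mtp width 2, 3, 4

print("A vs B:", first_divergence(output_a, output_b))
print("A vs C:", first_divergence(output_a, output_c))
print("B vs C:", first_divergence(output_b, output_c))
```

On these prefixes, output B first departs from the baseline at token index 9, while output C departs at index 3, so the divergence begins within the first few generated tokens in both cases.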

(This issue corrects #23302, which did not use a fixed seed.)

First Bad Commit

No response

Relevant log output

Logs
Query is the same for all:
$ curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"seed":1234,"cache_prompt":false,"stream":false,"return_tokens":true}'

No MTP
* Server:
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-draft-n-max 1 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui

* Result
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite 
ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":706.884,"prompt_per_token_ms":21.420727272727273,"prompt_per_second":46.68375575058991,"predicted_n":64,"predicted_ms":5603.585,"predicted_per_token_ms":87.556015625,"predicted_per_second":11.421259782799762}}

MTP 1

* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui

* Result
{"index":0,"content":"1. Local on-device language models run directly on a user's hardware, such as smartphones or laptops, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the device's secure environment.\n3. Users benefit from reduced latency since the model does not","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,383,264,1156,579,11436,11,1680,430,33863,466,46286,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,3545,579,9475,4424,13,198,18,13,14206,8495,494,10723,37972,2394,279,1558,1503,524],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences 
about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":725.492,"prompt_per_token_ms":21.98460606060606,"prompt_per_second":45.48637338523375,"predicted_n":64,"predicted_ms":7248.6,"predicted_per_token_ms":113.259375,"predicted_per_second":8.829291173467979,"draft_n":31,"draft_n_accepted":31}}

MTP 2

* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui

* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":739.039,"prompt_per_token_ms":22.39512121212121,"prompt_per_second":44.6525826106606,"predicted_n":64,"predicted_ms":8248.707,"predicted_per_token_ms":128.886046875,"predicted_per_second":7.758791771849818,"draft_n":44,"draft_n_accepted":41}}

MTP 3

* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui

* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":754.674,"prompt_per_token_ms":22.86890909090909,"prompt_per_second":43.72749028057148,"predicted_n":64,"predicted_ms":12511.36,"predicted_per_token_ms":195.49,"predicted_per_second":5.1153511688577415,"draft_n":59,"draft_n_accepted":43}}

MTP 4

* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui

* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":770.078,"prompt_per_token_ms":23.33569696969697,"prompt_per_second":42.85280192396095,"predicted_n":64,"predicted_ms":15053.165,"predicted_per_token_ms":235.205703125,"predicted_per_second":4.251597587616956,"draft_n":70,"draft_n_accepted":45}}
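
The draft counters and throughput figures in the "timings" objects above can be summarized with a short Python sketch. The numbers are copied from the five responses (throughput rounded to three decimals); the dictionary layout and variable names are just illustrative, and each configuration was run only once, so the speeds are a single sample.

```python
# (draft_n, draft_n_accepted, predicted_per_second) per run, taken from the logs above.
# The baseline response has no draft counters, so they are stored as None.
runs = {
    "no MTP": (None, None, 11.421),
    "MTP 1": (31, 31, 8.829),
    "MTP 2": (44, 41, 7.759),
    "MTP 3": (59, 43, 5.115),
    "MTP 4": (70, 45, 4.252),
}

baseline_tps = runs["no MTP"][2]
acceptance = {}
relative_speed = {}

for name, (draft_n, accepted, tps) in runs.items():
    relative_speed[name] = tps / baseline_tps
    if draft_n is not None:
        acceptance[name] = accepted / draft_n

for name in runs:
    rate = acceptance.get(name)
    rate_text = "n/a" if rate is None else f"{rate:.3f}"
    print(f"{name:7s} acceptance={rate_text:>5s} speed_vs_baseline={relative_speed[name]:.2f}")
```

In these particular runs the reported acceptance rate falls as the draft width grows, and every draft-mtp configuration was slower than the baseline, in addition to producing different tokens.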
