The query is the same for all runs:
$ curl -s http://127.0.0.1:8080/completion \
-H 'Content-Type: application/json' \
-d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"seed":1234,"cache_prompt":false,"stream":false,"return_tokens":true}'
No MTP
* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-draft-n-max 1 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui
* Result
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite 
ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":706.884,"prompt_per_token_ms":21.420727272727273,"prompt_per_second":46.68375575058991,"predicted_n":64,"predicted_ms":5603.585,"predicted_per_token_ms":87.556015625,"predicted_per_second":11.421259782799762}}
MTP 1
* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui
* Result
{"index":0,"content":"1. Local on-device language models run directly on a user's hardware, such as smartphones or laptops, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the device's secure environment.\n3. Users benefit from reduced latency since the model does not","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,383,264,1156,579,11436,11,1680,430,33863,466,46286,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,3545,579,9475,4424,13,198,18,13,14206,8495,494,10723,37972,2394,279,1558,1503,524],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences 
about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":725.492,"prompt_per_token_ms":21.98460606060606,"prompt_per_second":45.48637338523375,"predicted_n":64,"predicted_ms":7248.6,"predicted_per_token_ms":113.259375,"predicted_per_second":8.829291173467979,"draft_n":31,"draft_n_accepted":31}}
MTP 2
* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui
* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":739.039,"prompt_per_token_ms":22.39512121212121,"prompt_per_second":44.6525826106606,"predicted_n":64,"predicted_ms":8248.707,"predicted_per_token_ms":128.886046875,"predicted_per_second":7.758791771849818,"draft_n":44,"draft_n_accepted":41}}
MTP 3
* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui
* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":754.674,"prompt_per_token_ms":22.86890909090909,"prompt_per_second":43.72749028057148,"predicted_n":64,"predicted_ms":12511.36,"predicted_per_token_ms":195.49,"predicted_per_second":5.1153511688577415,"draft_n":59,"draft_n_accepted":43}}
MTP 4
* Server
llama-server -m '/Users/james/Library/Group Containers/group.com.carbocation.shared/Models/19855312-ECD6-4E00-B909-AEF5D4984F10/Qwen3.6-27B-Q4_K_M.gguf' --ctx-size 32768 --batch-size 2048 --ubatch-size 2048 --gpu-layers all --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-n-min 0 --temp 0 --top-p 0.9 --top-k 40 --min-p 0 --host 127.0.0.1 --port 8080 --no-ui
* Result
{"index":0,"content":"1. Local language models run directly on a user's device, such as a smartphone or laptop, rather than relying on remote cloud servers.\n2. This architecture significantly enhances privacy by keeping sensitive data and conversations strictly within the user's hardware.\n3. Because they operate offline, these models eliminate latency issues associated with","tokens":[16,13,8509,3992,3983,1542,5774,383,264,1156,579,3545,11,1680,430,264,20853,466,20012,11,4598,1056,37281,383,8434,9158,15814,13,198,17,13,1061,17120,11602,54925,11992,539,9976,15739,795,321,20319,24660,2785,279,1156,579,11436,13,198,18,13,8938,781,14061,25331,11,1439,3983,21054,37972,4562,5634,440],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":1234,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about 
local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":770.078,"prompt_per_token_ms":23.33569696969697,"prompt_per_second":42.85280192396095,"predicted_n":64,"predicted_ms":15053.165,"predicted_per_token_ms":235.205703125,"predicted_per_second":4.251597587616956,"draft_n":70,"draft_n_accepted":45}}
Name and Version
llama-server --version
version: 9216 (1ff0fc138)
built with AppleClang 21.0.0.21000101 for Darwin arm64
Platform: macOS 26.5, arm64
GPU backend: Metal
Hardware: Apple M4 Pro
Model: Qwen3.6-27B-Q4_K_M.gguf
Model metadata includes:
general.architecture = qwen35
qwen35.nextn_predict_layers = 1

Operating systems
Mac
GGML backends
Metal
Hardware
M4 MacBook Pro
Models
unsloth/Qwen3.6-27B-MTP-GGUF (file: Qwen3.6-27B-Q4_K_M.gguf)
Problem description & steps to reproduce
`draft-mtp` changes the committed token stream for a Qwen3.6 MTP GGUF model under deterministic sampling. In `llama-server`, the same prompt and request body produce one token stream with no speculative decoding, and different token streams when `--spec-type draft-mtp` is enabled. This happens with an explicit fixed seed and `temperature: 0`.

The request body is identical for all runs (it is the `-d` payload of the curl command in the logs).
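The full logs are in the console log section, but the pattern is:

* No MTP: output begins "1. Local on-device language models run directly within a device's hardware".
* --spec-draft-n-max 1: draft_n 31, draft_n_accepted 31; output begins "1. Local on-device language models run directly on a user's hardware".
* --spec-draft-n-max 2, 3, and 4: draft_n 44, 59, and 70 with draft_n_accepted 41, 43, and 45; all three runs return the same tokens, beginning "1. Local language models run directly on a user's device".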
So, enabling draft-mtp changes the committed token stream with --spec-draft-n-max 1 or higher. The width-1 run is notable because every drafted token was reported as accepted (draft_n 31, draft_n_accepted 31), yet the output still differs from the no-MTP case.
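The acceptance counts reported in the `timings` objects can be turned into rates with a short sketch like the one below. The counts are copied from the four draft-mtp runs in the logs; the table layout and the `acceptance_rate` helper are illustrative names, not part of llama.cpp.

```python
# draft_n and draft_n_accepted copied from the "timings" objects in the logs,
# keyed by the --spec-draft-n-max value used for that run.
# The dictionary layout is an assumption made for this example.
DRAFT_COUNTS = {
    1: (31, 31),
    2: (44, 41),
    3: (59, 43),
    4: (70, 45),
}


def acceptance_rate(draft_n, accepted):
    """Fraction of drafted tokens that were accepted, or None if none were drafted."""
    if draft_n == 0:
        return None
    return accepted / draft_n


for n_max, (draft_n, accepted) in DRAFT_COUNTS.items():
    rate = acceptance_rate(draft_n, accepted)
    print(f"n_max={n_max}: {accepted}/{draft_n} accepted ({rate:.1%})")
```

The width-1 run comes out at 100%, which is why its divergence from the baseline cannot be explained by rejected drafts alone.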
Expected behavior: Speculative decoding should not change the committed token stream. With the same prompt, same sampler settings, explicit seed:1234, and temperature:0, `draft-mtp` should produce the same tokens as no speculative decoding.

Actual behavior: `draft-mtp` from #22673 produces different committed tokens from the no-speculative baseline. The exact divergent output depends on `--spec-draft-n-max`.

`llama-completion` does not appear to support `--spec-type`, so this repro is only in `llama-server`.

(This issue corrects #23302, which did not use a fixed seed.)
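To pin down where each run departs from the baseline, the `tokens` arrays returned with `return_tokens: true` can be compared position by position. The sketch below is only an illustration: it hard-codes the first 12 token ids of each run as they appear in the logs, and `first_divergence` is a hypothetical helper name.

```python
from itertools import zip_longest


def first_divergence(a, b):
    """Return the 0-based index where sequences a and b first differ, or None if equal."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


# First 12 committed token ids from each run's "tokens" array in the logs.
BASELINE = [16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785, 264, 3545]

# Keyed by the --spec-draft-n-max value; widths 2, 3, and 4 returned identical tokens.
MTP_RUNS = {
    1: [16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 383, 264, 1156],
    2: [16, 13, 8509, 3992, 3983, 1542, 5774, 383, 264, 1156, 579, 3545],
    3: [16, 13, 8509, 3992, 3983, 1542, 5774, 383, 264, 1156, 579, 3545],
    4: [16, 13, 8509, 3992, 3983, 1542, 5774, 383, 264, 1156, 579, 3545],
}

for n_max, tokens in MTP_RUNS.items():
    idx = first_divergence(BASELINE, tokens)
    print(f"n_max={n_max}: first divergence at token index {idx}")
```

On these prefixes the width-1 run first differs at index 9 (2785 in the baseline, 383 with draft-mtp), while widths 2 through 4 already differ at index 3 (383 versus 3992).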
First Bad Commit
No response
Relevant log output
Logs