MTP 3:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and prompts stored locally rather than transmitting them to external servers.\n3. By processing information on the chip, these models offer lower latency and","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,49219,9476,22756,4598,1056,75010,1070,310,8976,15814,13,198,18,13,3113,8427,1928,383,279,15911,11,1439,3983,2915,4570,37972,321],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":833.058,"prompt_per_token_ms":25.24418181818182,"prompt_per_second":39.6130881643295,"predicted_n":64,"predicted_ms":11325.616,"predicted_per_token_ms":176.96275,"predicted_per_second":5.650906758625756,"draft_n":75,"draft_n_accepted":37}}%
MTP 2:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":806.237,"prompt_per_token_ms":24.431424242424242,"prompt_per_second":40.930892529119845,"predicted_n":64,"predicted_ms":7704.204,"predicted_per_token_ms":120.3781875,"predicted_per_second":8.307152822017693,"draft_n":48,"draft_n_accepted":38}}%
No MTP:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":770.524,"prompt_per_token_ms":23.349212121212123,"prompt_per_second":42.827997570484506,"predicted_n":64,"predicted_ms":6273.78,"predicted_per_token_ms":98.0278125,"predicted_per_second":10.20118652550775}}%
Name and Version
llama-server --version
version: 9216 (1ff0fc138)
built with AppleClang 21.0.0.21000101 for Darwin arm64
Platform: macOS 26.5, arm64
GPU backend: Metal
Hardware: Apple M4 Pro
Model: Qwen3.6-27B-Q4_K_M.gguf
Model metadata includes:
general.architecture = qwen35
qwen35.nextn_predict_layers = 1
Operating systems
Mac
GGML backends
Metal
Hardware
M4 MacBook Pro
Models
unsloth/Qwen3.6-27B-MTP-GGUF, specifically the Qwen3.6-27B-Q4_K_M.gguf file
Problem description & steps to reproduce
I'm seeing llama-server produce different committed tokens when draft-mtp is enabled with --spec-draft-n-max 3. In contrast, with --spec-draft-n-max 2 the same prompt and sampler settings match the non-speculative baseline. I'm reporting this because speculative decoding should only affect performance; it should not change the committed token sequence. (This also affects my usage of llama.cpp through custom Swift bindings, with exactly the same input and output, but I'm describing it via llama-server since that's easier to reproduce.)

The request is:
The expected result (yielded by non-MTP and MTP <= 2):
The llama-server startup command (note the --spec-draft-n-max 3):
$ llama-server \
    -m "$MODEL" \
    --ctx-size 32768 \
    --batch-size 2048 \
    --ubatch-size 2048 \
    --gpu-layers all \
    --spec-type draft-mtp \
    --spec-draft-n-max 3 \
    --spec-draft-n-min 0 \
    --temp 0 \
    --top-p 0.9 \
    --top-k 40 \
    --min-p 0 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-ui

The output with MTP >= 3:
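So at token 38 (counting from 1) the runs diverge: the No MTP and MTP 2 runs commit token 20319, while the MTP 3 run commits token 49219. In the decoded content this is where "sensitive data and conversations" becomes "sensitive data and prompts".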
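The divergence point can be checked mechanically from the "tokens" arrays in the responses. A minimal Python sketch follows; `first_divergence` is a hypothetical helper, not part of llama.cpp, and the two lists are only the first 40 entries copied from the No MTP and MTP 3 responses:

```python
def first_divergence(a, b):
    """Return the 0-based index of the first differing token, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# First 40 committed tokens from the "No MTP" response.
baseline = [
    16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785,
    264, 3545, 579, 11436, 11, 38192, 279, 1144, 364, 6570,
    7361, 29270, 13, 198, 17, 13, 1061, 17120, 11602, 54925,
    1156, 11992, 539, 9976, 15739, 795, 321, 20319, 24660, 383,
]

# First 40 committed tokens from the "MTP 3" response.
mtp3 = [
    16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785,
    264, 3545, 579, 11436, 11, 38192, 279, 1144, 364, 6570,
    7361, 29270, 13, 198, 17, 13, 1061, 17120, 11602, 54925,
    1156, 11992, 539, 9976, 15739, 795, 321, 49219, 9476, 22756,
]

idx = first_divergence(baseline, mtp3)
# Report a 1-based position to match the wording in this report.
print(idx + 1, baseline[idx], mtp3[idx])  # prints: 38 20319 49219
```

The same helper returns None when the No MTP and MTP 2 token arrays are compared, since those two runs are identical.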
The expected behavior is that draft-mtp should commit exactly the same token sequence as the target model would produce without speculative decoding. Increasing --spec-draft-n-max from 2 to 3 should not alter deterministic output.
First Bad Commit
No response
Relevant log output
Logs