
Eval bug: draft-mtp changes deterministic output with --spec-draft-n-max 3 on Qwen3.6 MTP model #23302

@carbocation


Name and Version

llama-server --version
version: 9216 (1ff0fc138)
built with AppleClang 21.0.0.21000101 for Darwin arm64

Platform: macOS 26.5, arm64
GPU backend: Metal
Hardware: Apple M4 Pro
Model: Qwen3.6-27B-Q4_K_M.gguf
Model metadata includes:
general.architecture = qwen35
qwen35.nextn_predict_layers = 1

Operating systems

Mac

GGML backends

Metal

Hardware

M4 MacBook Pro

Models

unsloth/Qwen3.6-27B-MTP-GGUF (file: Qwen3.6-27B-Q4_K_M.gguf)

Problem description & steps to reproduce

I'm seeing llama-server commit different tokens when draft-mtp is enabled with --spec-draft-n-max 3. With --spec-draft-n-max 2, the same prompt and sampler settings match the non-speculative baseline. I'm reporting this because speculative decoding should only affect performance; it should not change the committed token sequence. (The same divergence shows up when I use llama.cpp through custom Swift bindings with exactly the same input and output, but I'm describing it via llama-server because that is easier to reproduce.)
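To make the expectation concrete, here is a minimal sketch of greedy speculative decoding on toy models. It is not llama.cpp's implementation: `toy_target`, `toy_draft`, and both decode functions are hypothetical stand-ins, and the target is called one position at a time rather than in a single verification batch. The point it illustrates is that, when every committed token is the target's own greedy choice for the committed prefix, the draft length can only change how many target steps are saved, not the output.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextFn, prompt: List[Token], n_predict: int) -> List[Token]:
    """Plain greedy decoding: one target call per committed token."""
    out: List[Token] = []
    while len(out) < n_predict:
        out.append(target(prompt + out))
    return out


def speculative_decode(target: NextFn, draft: NextFn, prompt: List[Token],
                       n_predict: int, n_max: int) -> List[Token]:
    """Greedy speculative decoding with up to n_max drafted tokens per round."""
    out: List[Token] = []
    while len(out) < n_predict:
        # The draft proposes up to n_max tokens, conditioned on its own guesses.
        proposed: List[Token] = []
        for _ in range(n_max):
            proposed.append(draft(prompt + out + proposed))
        # The target checks each proposal against its own greedy choice.
        for tok in proposed:
            expected = target(prompt + out)
            out.append(expected)  # always commit the target's token
            if expected != tok or len(out) >= n_predict:
                break  # first mismatch (or length limit) ends the round
        else:
            # Every proposal was accepted: the target adds one more token.
            out.append(target(prompt + out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical deterministic "model": depends on the whole context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical imperfect draft: disagrees with the target on some positions.
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 3 == 0 else guess


prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, 64)
for n_max in range(0, 6):
    assert speculative_decode(toy_target, toy_draft, prompt, 64, n_max) == baseline
```

In this sketch the sequences match for every `n_max`, which is the property the report expects from draft-mtp at temperature 0.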

The request is:

curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'

The expected result (produced without MTP and with --spec-draft-n-max <= 2; truncated at n_predict = 64):

1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.
2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.
3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based

The llama-server startup command (note the --spec-draft-n-max 3):

llama-server \
  -m "$MODEL" \
  --ctx-size 32768 \
  --batch-size 2048 \
  --ubatch-size 2048 \
  --gpu-layers all \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-n-min 0 \
  --temp 0 \
  --top-p 0.9 \
  --top-k 40 \
  --min-p 0 \
  --host 127.0.0.1 \
  --port 8080 \
  --no-ui

The output with --spec-draft-n-max >= 3:

1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.
2. This architecture significantly enhances user privacy by keeping sensitive data and prompts stored locally rather than transmitting them to external servers.
3. By processing information on the chip, these models offer lower latency and

So the outputs first diverge at the 38th generated token:

expected: 20319:" conversations"
actual:   49219:" prompts"
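The divergence point can be located mechanically from the `tokens` arrays returned with `"return_tokens": true`. The following is a small sketch; `first_divergence` is a hypothetical helper name, and the lists below are the first 40 token ids copied from the no-MTP and --spec-draft-n-max 3 responses in the logs.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(expected: Sequence[int], actual: Sequence[int]) -> Optional[int]:
    """Return the 0-based index of the first differing token, or None if equal."""
    for i, (e, a) in enumerate(zip_longest(expected, actual)):
        if e != a:  # also catches one sequence ending early (None vs. int)
            return i
    return None


# First 37 generated token ids, identical in both responses.
shared = [16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785, 264, 3545,
          579, 11436, 11, 38192, 279, 1144, 364, 6570, 7361, 29270, 13, 198,
          17, 13, 1061, 17120, 11602, 54925, 1156, 11992, 539, 9976, 15739,
          795, 321]

baseline = shared + [20319, 24660, 383]   # no MTP (and --spec-draft-n-max 2)
mtp3 = shared + [49219, 9476, 22756]      # --spec-draft-n-max 3

idx = first_divergence(baseline, mtp3)
print(f"first divergence at token {idx + 1}: "
      f"expected {baseline[idx]}, actual {mtp3[idx]}")
```

The helper reports a 0-based index, so the 38th token corresponds to index 37.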

The expected behavior is that draft-mtp should commit exactly the same token sequence as the target model would produce without speculative decoding. Increasing --spec-draft-n-max from 2 to 3 should not alter deterministic output.

First Bad Commit

No response

Relevant log output

MTP 3:

$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and prompts stored locally rather than transmitting them to external servers.\n3. By processing information on the chip, these models offer lower latency and","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,49219,9476,22756,4598,1056,75010,1070,310,8976,15814,13,198,18,13,3113,8427,1928,383,279,15911,11,1439,3983,2915,4570,37972,321],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":833.058,"prompt_per_token_ms":25.24418181818182,"prompt_per_second":39.6130881643295,"predicted_n":64,"predicted_ms":11325.616,"predicted_per_token_ms":176.96275,"predicted_per_second":5.650906758625756,"draft_n":75,"draft_n_accepted":37}}%

MTP 2:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":806.237,"prompt_per_token_ms":24.431424242424242,"prompt_per_second":40.930892529119845,"predicted_n":64,"predicted_ms":7704.204,"predicted_per_token_ms":120.3781875,"predicted_per_second":8.307152822017693,"draft_n":48,"draft_n_accepted":38}}%

No MTP:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":770.524,"prompt_per_token_ms":23.349212121212123,"prompt_per_second":42.827997570484506,"predicted_n":64,"predicted_ms":6273.78,"predicted_per_token_ms":98.0278125,"predicted_per_second":10.20118652550775}}%
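As a quick reading aid for the `timings` objects above, the sketch below computes the draft acceptance rate and the throughput relative to the no-MTP run. The numbers are copied from the three responses; the `runs` dictionary and the helper name `acceptance_rate` are only illustrative.

```python
# Values copied from the "timings" fields of the responses above.
runs = {
    "--spec-draft-n-max 3": {
        "draft_n": 75,
        "draft_n_accepted": 37,
        "predicted_per_second": 5.650906758625756,
    },
    "--spec-draft-n-max 2": {
        "draft_n": 48,
        "draft_n_accepted": 38,
        "predicted_per_second": 8.307152822017693,
    },
}
baseline_tps = 10.20118652550775  # predicted_per_second without MTP


def acceptance_rate(timings: dict) -> float:
    """Fraction of drafted tokens that were accepted."""
    return timings["draft_n_accepted"] / timings["draft_n"]


for name, t in runs.items():
    rel = t["predicted_per_second"] / baseline_tps
    print(f"{name}: acceptance {acceptance_rate(t):.1%}, "
          f"throughput {rel:.2f}x of the no-MTP run")
```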
