Eval bug: draft-mtp changes deterministic output with --spec-draft-n-max 3 on Qwen3.6 MTP model

### Name and Version

```sh
llama-server --version
version: 9216 (1ff0fc138)
built with AppleClang 21.0.0.21000101 for Darwin arm64

Platform: macOS 26.5, arm64
GPU backend: Metal
Hardware: Apple M4 Pro
Model: Qwen3.6-27B-Q4_K_M.gguf
Model metadata includes:
general.architecture = qwen35
qwen35.nextn_predict_layers = 1
```

### Operating systems

Mac

### GGML backends

Metal

### Hardware

MacBook Pro (Apple M4 Pro)

### Models

unsloth/Qwen3.6-27B-MTP-GGUF (file: Qwen3.6-27B-Q4_K_M.gguf)

### Problem description & steps to reproduce

`llama-server` produces different committed tokens when `draft-mtp` is enabled with `--spec-draft-n-max 3`. With `--spec-draft-n-max 2`, the same prompt and sampler settings match the non-speculative baseline. I'm reporting this because speculative decoding should only affect performance; it should not change the committed token sequence. (I see the same behavior, with identical input and output, when using llama.cpp through custom Swift bindings; I'm describing it via `llama-server` because that is easier to reproduce.)

The request is:

```sh
curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
```

The expected result (produced both without MTP and with `--spec-draft-n-max` <= 2):
```sh
1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.
2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.
3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based
```

The llama-server startup command (note the `--spec-draft-n-max 3`):

```sh
llama-server \
  -m "$MODEL" \
  --ctx-size 32768 \
  --batch-size 2048 \
  --ubatch-size 2048 \
  --gpu-layers all \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-n-min 0 \
  --temp 0 \
  --top-p 0.9 \
  --top-k 40 \
  --min-p 0 \
  --host 127.0.0.1 \
  --port 8080 \
  --no-ui
```

The output with `--spec-draft-n-max` >= 3:
```sh
1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.
2. This architecture significantly enhances user privacy by keeping sensitive data and prompts stored locally rather than transmitting them to external servers.
3. By processing information on the chip, these models offer lower latency and
```
The two outputs first diverge at generated token 38 (1-based):
```
expected: 20319:" conversations"
actual:   49219:" prompts"
```
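
The divergence point can be located mechanically from the `tokens` arrays in the logs. The sketch below is a hypothetical helper, not part of llama.cpp; the token IDs are copied from the logged responses (the shared 37-token prefix plus the next three tokens of each run).

```python
from itertools import zip_longest

# Shared prefix of the generated tokens, copied from the logged "tokens" arrays.
prefix = [
    16, 13, 8509, 383, 63393, 3992, 3983, 1542, 5774, 2785,
    264, 3545, 579, 11436, 11, 38192, 279, 1144, 364, 6570,
    7361, 29270, 13, 198, 17, 13, 1061, 17120, 11602, 54925,
    1156, 11992, 539, 9976, 15739, 795, 321,
]
baseline = prefix + [20319, 24660, 383]  # no MTP, and --spec-draft-n-max 2
mtp3 = prefix + [49219, 9476, 22756]     # --spec-draft-n-max 3


def first_divergence(a, b):
    """Return the 0-based index of the first differing position, or None if equal."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None


idx = first_divergence(baseline, mtp3)
if idx is not None:
    # Report a 1-based position to match the numbering used in this report.
    print(f"token {idx + 1}: expected {baseline[idx]}, actual {mtp3[idx]}")
```

Pasting the full `tokens` arrays from two responses into `baseline` and `mtp3` gives the same check for any other prompt.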

The expected behavior is that `draft-mtp` should commit exactly the same token sequence as the target model would produce without speculative decoding. Increasing `--spec-draft-n-max` from 2 to 3 should not alter deterministic output.
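
To illustrate why the draft length should not matter under greedy decoding, here is a minimal toy sketch of draft-then-verify. It is not llama.cpp's implementation: `target_next` and `draft_next` are made-up stand-ins for the target model and the MTP head, and sampler penalties are ignored. Every committed token comes from the target, so the result should be identical for any draft length.

```python
def target_next(ctx):
    """Toy deterministic 'target model': greedy next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Toy 'draft head': agrees with the target except on every third position."""
    t = target_next(ctx)
    return t if len(ctx) % 3 else (t + 1) % 50


def greedy(prompt, n):
    """Plain greedy decoding with the target only."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(target_next(out))
    return out[len(prompt):]


def speculative(prompt, n, n_draft):
    """Greedy speculative decoding with up to n_draft drafted tokens per step."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # Draft phase: propose n_draft tokens autoregressively.
        ctx, draft = list(out), []
        for _ in range(n_draft):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # Verify phase: always commit the target's token; stop at the first mismatch.
        for d in draft:
            t = target_next(out)
            out.append(t)
            if t != d:
                break
        else:
            # All drafts accepted (or none proposed): commit one extra target token.
            out.append(target_next(out))
    return out[len(prompt):][:n]


reference = greedy([1, 2, 3], 64)
for n_draft in range(6):
    assert speculative([1, 2, 3], 64, n_draft) == reference
```

The final loop is the invariant this report expects `draft-mtp` to satisfy: changing `n_draft` changes only how many target evaluations can be batched, never the committed sequence.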

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console
MTP 3:

$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and prompts stored locally rather than transmitting them to external servers.\n3. By processing information on the chip, these models offer lower latency and","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,49219,9476,22756,4598,1056,75010,1070,310,8976,15814,13,198,18,13,3113,8427,1928,383,279,15911,11,1439,3983,2915,4570,37972,321],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":833.058,"prompt_per_token_ms":25.24418181818182,"prompt_per_second":39.6130881643295,"predicted_n":64,"predicted_ms":11325.616,"predicted_per_token_ms":176.96275,"predicted_per_second":5.650906758625756,"draft_n":75,"draft_n_accepted":37}}%

MTP 2:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none,draft-mtp","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":806.237,"prompt_per_token_ms":24.431424242424242,"prompt_per_second":40.930892529119845,"predicted_n":64,"predicted_ms":7704.204,"predicted_per_token_ms":120.3781875,"predicted_per_second":8.307152822017693,"draft_n":48,"draft_n_accepted":38}}%

No MTP:
$ curl -s http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt":"<|im_start|>system\nYou are concise and helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","n_predict":64,"temperature":0,"top_p":0.9,"top_k":40,"min_p":0,"repeat_penalty":1.3,"repeat_last_n":16,"cache_prompt":false,"stream":false,"return_tokens":true}'
{"index":0,"content":"1. Local on-device language models run directly within a device's hardware, eliminating the need for constant internet connectivity.\n2. This architecture significantly enhances user privacy by keeping sensitive data and conversations strictly on the device.\n3. By processing information locally, these models reduce latency and provide faster response times compared to cloud-based","tokens":[16,13,8509,383,63393,3992,3983,1542,5774,2785,264,3545,579,11436,11,38192,279,1144,364,6570,7361,29270,13,198,17,13,1061,17120,11602,54925,1156,11992,539,9976,15739,795,321,20319,24660,383,279,3545,13,198,18,13,3113,8427,1928,22756,11,1439,3983,7698,37972,321,3300,10281,1965,2942,7463,310,9158,5792],"id_slot":3,"stop":true,"model":"Qwen3.6-27B-Q4_K_M.gguf","tokens_predicted":64,"tokens_evaluated":33,"generation_settings":{"seed":4294967295,"temperature":0.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.8999999761581421,"min_p":0.0,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":16,"repeat_penalty":1.2999999523162842,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":64,"n_predict":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<|im_start|>system\nYou are concise and 
helpful.<|im_end|>\n<|im_start|>user\nWrite ten sentences about local on-device language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":96,"timings":{"cache_n":0,"prompt_n":33,"prompt_ms":770.524,"prompt_per_token_ms":23.349212121212123,"prompt_per_second":42.827997570484506,"predicted_n":64,"predicted_ms":6273.78,"predicted_per_token_ms":98.0278125,"predicted_per_second":10.20118652550775}}%

```
</details>
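
For quick comparison, the draft statistics in the `timings` objects above can be summarized with a short script. This is a hypothetical helper; the numbers are copied from the three logged responses, and the run labels are mine.

```python
# Values copied from the "timings" objects in the logs above.
runs = {
    "draft-mtp, n-max 3": {
        "draft_n": 75, "draft_n_accepted": 37,
        "predicted_per_second": 5.650906758625756,
    },
    "draft-mtp, n-max 2": {
        "draft_n": 48, "draft_n_accepted": 38,
        "predicted_per_second": 8.307152822017693,
    },
    # The non-speculative response has no draft_* fields.
    "no MTP": {"predicted_per_second": 10.20118652550775},
}


def acceptance_rate(timings):
    """Fraction of drafted tokens that were accepted, or None without drafts."""
    drafted = timings.get("draft_n", 0)
    if drafted == 0:
        return None
    return timings.get("draft_n_accepted", 0) / drafted


for name, timings in runs.items():
    rate = acceptance_rate(timings)
    rate_text = "n/a" if rate is None else f"{rate:.1%}"
    print(f"{name}: acceptance {rate_text}, {timings['predicted_per_second']:.2f} tok/s")
```

In these logs the `n-max 3` run accepts 37 of 75 drafted tokens and the `n-max 2` run accepts 38 of 48.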

Eval bug: draft-mtp changes deterministic output with --spec-draft-n-max 3 on Qwen3.6 MTP model #23302