Description
Issue
It is expected that `llama_decode` should take more time when more tokens are present in the batch, but on my system (Apple M1 Max, 32 GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model, the increase is quite significant. I plotted some average latencies on my system for different values of `n_tokens`, using a modified version of the `speculative` example with timing added around `llama_decode(ctx_tgt, batch_tgt);`.
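For reference, the instrumentation I mean is essentially the following (a minimal sketch rather than the exact patch; it assumes an already-initialized `llama_context` and uses the `llama_batch` fields from the API version I'm on, so names may differ slightly elsewhere):

```cpp
// minimal sketch: time a single llama_decode call for a batch of n_tokens tokens in sequence 0
// assumes `ctx` is an initialized llama_context and `tokens` holds valid token ids
#include "llama.h"

#include <chrono>

static double decode_ms(llama_context * ctx, const llama_token * tokens, int n_tokens, llama_pos pos0) {
    llama_batch batch = llama_batch_init(n_tokens, 0, 1);

    for (int i = 0; i < n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = pos0 + i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n_tokens - 1); // request logits only for the last token
    }
    batch.n_tokens = n_tokens;

    const auto t0 = std::chrono::steady_clock::now();
    llama_decode(ctx, batch); // return value ignored for brevity
    const auto t1 = std::chrono::steady_clock::now();

    llama_batch_free(batch);

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling this with `n_tokens = 1` vs `n_tokens = 2` on a warmed-up context should be enough to see the jump described below.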
There is a more than 5x jump in the latency of `llama_decode` when `n_tokens` goes from 1 to 2 (which feels too high), but only a very gradual increase after that. This means that techniques like `speculative` and `lookup` decoding cannot give speed benefits for small draft sizes (`n_draft < 5`) even if the drafts are 100% correct: autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems weird to me. Decoding 2 tokens should take at most 2x the time, not 5x.
To reproduce:
The easiest way to see this is to run `main` with a one-word prompt. The `prompt eval time` will then be the time taken for the few prompt tokens, and `eval time` will show the throughput for the rest of the tokens. For example,

```
./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e
```

gives me:
```
llama_print_timings: load time = 385.80 ms
llama_print_timings: sample time = 8.03 ms / 100 runs ( 0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time = 85.81 ms / 2 tokens ( 42.90 ms per token, 23.31 tokens per second)
llama_print_timings: eval time = 1637.12 ms / 99 runs ( 16.54 ms per token, 60.47 tokens per second)
llama_print_timings: total time = 1744.09 ms
```
which shows ~86 ms for the initial forward pass with just 2 tokens, and ~16.5 ms per token for all subsequent tokens.
To see this effect in `speculative`, one can compare `--draft 0` with `--draft 1`. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, `--draft 0` gave better target-model timings than `--draft 1`, which shouldn't really happen IMO.
draft = 0 command:
```
./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color
```
Timings:
```
n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings: load time = 982.45 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 85.60 ms / 2 tokens ( 42.80 ms per token, 23.36 tokens per second)
llama_print_timings: eval time = 1653.63 ms / 101 runs ( 16.37 ms per token, 61.08 tokens per second)
llama_print_timings: total time = 3453.52 ms

target:

llama_print_timings: load time = 479.45 ms
llama_print_timings: sample time = 17.57 ms / 101 runs ( 0.17 ms per token, 5750.07 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 1676.51 ms / 102 runs ( 16.44 ms per token, 60.84 tokens per second)
llama_print_timings: total time = 4460.08 ms
```
draft = 1 command:
```
./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color
```
Timings:
```
n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings: load time = 960.89 ms
llama_print_timings: sample time = 124.45 ms / 1 runs ( 124.45 ms per token, 8.04 tokens per second)
llama_print_timings: prompt eval time = 85.81 ms / 2 tokens ( 42.91 ms per token, 23.31 tokens per second)
llama_print_timings: eval time = 1701.90 ms / 102 runs ( 16.69 ms per token, 59.93 tokens per second)
llama_print_timings: total time = 5584.70 ms

target:

llama_print_timings: load time = 431.73 ms
llama_print_timings: sample time = 19.67 ms / 102 runs ( 0.19 ms per token, 5184.77 tokens per second)
llama_print_timings: prompt eval time = 3076.34 ms / 72 tokens ( 42.73 ms per token, 23.40 tokens per second)
llama_print_timings: eval time = 520.40 ms / 31 runs ( 16.79 ms per token, 59.57 tokens per second)
llama_print_timings: total time = 6569.38 ms
```
So with `--draft 1` the target model is much slower, taking ~6.6 s in total compared to ~4.5 s with `--draft 0`, which is weird.
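If I am reading the timings right, the numbers are internally consistent with the 1→2 token jump above: with `--draft 1` the target verifies 36 batches of 2 tokens each (72 tokens at ~43 ms per token, i.e. ~85 ms per 2-token batch, matching the 3076 ms prompt eval time), whereas decoding those same tokens one at a time would cost ~16.5 ms each, i.e. ~33 ms per pair, and on top of that the draft model spends another ~17 ms per drafted token. So as long as a 2-token batch costs ~5x a 1-token batch rather than ~2x, `--draft 1` cannot win even with 100% acceptance.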