Description
Issue
It is expected that `llama_decode` should take more time when more tokens are present in the batch, but on my system (Apple M1 Max, 32 GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model, the increase is quite significant. I plotted some average latencies on my system for different values of `n_tokens`, using a modified version of the `speculative` example with timing added around `llama_decode(ctx_tgt, batch_tgt);`.
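For reference, the instrumentation I mean is essentially the following (a minimal sketch rather than the exact patch; it assumes an already-initialized `llama_context` and uses the `llama_batch` fields from the API version I'm on, so names may differ slightly elsewhere):

```cpp
// minimal sketch: time a single llama_decode call for a batch of n_tokens tokens in sequence 0
// assumes `ctx` is an initialized llama_context and `tokens` holds valid token ids
#include "llama.h"

#include <chrono>

static double decode_ms(llama_context * ctx, const llama_token * tokens, int n_tokens, llama_pos pos0) {
    llama_batch batch = llama_batch_init(n_tokens, 0, 1);

    for (int i = 0; i < n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = pos0 + i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == n_tokens - 1); // request logits only for the last token
    }
    batch.n_tokens = n_tokens;

    const auto t0 = std::chrono::steady_clock::now();
    llama_decode(ctx, batch); // return value ignored for brevity
    const auto t1 = std::chrono::steady_clock::now();

    llama_batch_free(batch);

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling this with `n_tokens = 1` vs `n_tokens = 2` on a warmed-up context should be enough to see the jump described below.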
There is a more than 5x jump in the latency of `llama_decode` when `n_tokens` goes from 1 to 2 (which feels too high), but only a very gradual increase after that. This means that techniques like `speculative` and `lookup` decoding cannot give speed benefits for small draft sizes (`n_draft < 5`) even if the drafts are 100% correct: autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems weird to me. Decoding 2 tokens should take at most 2x the time, not 5x.
To reproduce:
The easiest way to see this is to run `main` with a one-word prompt. The `prompt eval time` will then be the time taken for the few prompt tokens, and `eval time` will show the throughput for the rest of the tokens. For example,

```
./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e
```

gives me:
```
llama_print_timings: load time = 385.80 ms
llama_print_timings: sample time = 8.03 ms / 100 runs ( 0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time = 85.81 ms / 2 tokens ( 42.90 ms per token, 23.31 tokens per second)
llama_print_timings: eval time = 1637.12 ms / 99 runs ( 16.54 ms per token, 60.47 tokens per second)
llama_print_timings: total time = 1744.09 ms
```
which shows ~86 ms for the initial forward pass with just 2 tokens, and ~16.5 ms per token for all subsequent tokens.
To see this effect in `speculative`, one can compare `--draft 0` with `--draft 1`. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, `--draft 0` gave better target-model timings than `--draft 1`, which shouldn't really happen IMO.
draft = 0 command:
```
./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color
```
Timings:
```
n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings: load time = 982.45 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 85.60 ms / 2 tokens ( 42.80 ms per token, 23.36 tokens per second)
llama_print_timings: eval time = 1653.63 ms / 101 runs ( 16.37 ms per token, 61.08 tokens per second)
llama_print_timings: total time = 3453.52 ms

target:

llama_print_timings: load time = 479.45 ms
llama_print_timings: sample time = 17.57 ms / 101 runs ( 0.17 ms per token, 5750.07 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 1676.51 ms / 102 runs ( 16.44 ms per token, 60.84 tokens per second)
llama_print_timings: total time = 4460.08 ms
```
draft = 1 command:
```
./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color
```
Timings:
```
n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings: load time = 960.89 ms
llama_print_timings: sample time = 124.45 ms / 1 runs ( 124.45 ms per token, 8.04 tokens per second)
llama_print_timings: prompt eval time = 85.81 ms / 2 tokens ( 42.91 ms per token, 23.31 tokens per second)
llama_print_timings: eval time = 1701.90 ms / 102 runs ( 16.69 ms per token, 59.93 tokens per second)
llama_print_timings: total time = 5584.70 ms

target:

llama_print_timings: load time = 431.73 ms
llama_print_timings: sample time = 19.67 ms / 102 runs ( 0.19 ms per token, 5184.77 tokens per second)
llama_print_timings: prompt eval time = 3076.34 ms / 72 tokens ( 42.73 ms per token, 23.40 tokens per second)
llama_print_timings: eval time = 520.40 ms / 31 runs ( 16.79 ms per token, 59.57 tokens per second)
llama_print_timings: total time = 6569.38 ms
```
So with `--draft 1` the target model is much slower, taking ~6.6 s in total compared to ~4.5 s with `--draft 0`, which is weird.
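If I am reading the timings right, the numbers are internally consistent with the 1→2 token jump above: with `--draft 1` the target verifies 36 batches of 2 tokens each (72 tokens at ~43 ms per token, i.e. ~85 ms per 2-token batch, matching the 3076 ms prompt eval time), whereas decoding those same tokens one at a time would cost ~16.5 ms each, i.e. ~33 ms per pair, and on top of that the draft model spends another ~17 ms per drafted token. So as long as a 2-token batch costs ~5x a 1-token batch rather than ~2x, `--draft 1` cannot win even with 100% acceptance.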