
llama_decode is significantly slower if n_tokens > 1  #4624

Closed

@apoorvumang

Issue

It is expected that llama_decode takes more time when more tokens are present in the batch, but on my system (Apple M1 Max, 32 GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model the increase is quite significant. I plotted some average latencies on my system for different values of n_tokens, using a modified version of the speculative example with timing added around the call llama_decode(ctx_tgt, batch_tgt):

[plot: average llama_decode latency vs. n_tokens]
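
For reference, the timing was added roughly along these lines (a minimal sketch, not the exact patch; ctx_tgt and batch_tgt are the existing variables in examples/speculative):

#include <chrono>   // added at the top of speculative.cpp (plus <cstdio> for fprintf)

// ... inside the generation loop, around the existing target decode:
const auto t_dec_start = std::chrono::high_resolution_clock::now();

llama_decode(ctx_tgt, batch_tgt);

const auto t_dec_end = std::chrono::high_resolution_clock::now();
const double t_dec_ms = std::chrono::duration<double, std::milli>(t_dec_end - t_dec_start).count();

// batch_tgt.n_tokens is the number of tokens decoded in this call
fprintf(stderr, "llama_decode(tgt): n_tokens = %d, t = %.2f ms\n", batch_tgt.n_tokens, t_dec_ms);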

There is a more than 5x jump in the latency of llama_decode when n_tokens goes from 1 to 2 (which feels too high to me), but only a very gradual increase after that. This means that techniques like speculative and lookup decoding cannot give speed benefits for small draft sizes (n_draft < 5) even if the drafts are 100% correct, since autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.
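
To put rough numbers on it, using the latencies from the runs below: a 1-token decode costs ~16.5 ms, while a small multi-token batch costs ~85 ms or a bit more, so 5 sequential decodes take about 5 × 16.5 ≈ 83 ms, which is roughly the same as one 5-token batch (per the gradual part of the plot above). A perfectly accepted draft of fewer than ~5 tokens therefore saves essentially nothing.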

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems strange to me. Decoding 2 tokens should take at most 2x the time, not 5x.

To reproduce:

The easiest way to see this is to run main with a one-word prompt. The prompt eval time will be the time taken for the few prompt tokens, and the eval time will show the throughput for the rest of the tokens. For example, ./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e gives me:

llama_print_timings:        load time =     385.80 ms
llama_print_timings:      sample time =       8.03 ms /   100 runs   (    0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.90 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1637.12 ms /    99 runs   (   16.54 ms per token,    60.47 tokens per second)
llama_print_timings:       total time =    1744.09 ms

which shows ~85 ms for the initial forward pass with just 2 tokens, but only ~16.5 ms per token for every subsequent single-token pass.

To see this effect in speculative, compare --draft 0 with --draft 1, using the same model as both the draft model and the main model to ensure 100% acceptance. On my system, --draft 0 gave better target-model timings than --draft 1, which shouldn't really happen IMO.

draft = 0 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings:        load time =     982.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      85.60 ms /     2 tokens (   42.80 ms per token,    23.36 tokens per second)
llama_print_timings:        eval time =    1653.63 ms /   101 runs   (   16.37 ms per token,    61.08 tokens per second)
llama_print_timings:       total time =    3453.52 ms

target:

llama_print_timings:        load time =     479.45 ms
llama_print_timings:      sample time =      17.57 ms /   101 runs   (    0.17 ms per token,  5750.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1676.51 ms /   102 runs   (   16.44 ms per token,    60.84 tokens per second)
llama_print_timings:       total time =    4460.08 ms

draft = 1 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings:        load time =     960.89 ms
llama_print_timings:      sample time =     124.45 ms /     1 runs   (  124.45 ms per token,     8.04 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.91 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1701.90 ms /   102 runs   (   16.69 ms per token,    59.93 tokens per second)
llama_print_timings:       total time =    5584.70 ms

target:

llama_print_timings:        load time =     431.73 ms
llama_print_timings:      sample time =      19.67 ms /   102 runs   (    0.19 ms per token,  5184.77 tokens per second)
llama_print_timings: prompt eval time =    3076.34 ms /    72 tokens (   42.73 ms per token,    23.40 tokens per second)
llama_print_timings:        eval time =     520.40 ms /    31 runs   (   16.79 ms per token,    59.57 tokens per second)
llama_print_timings:       total time =    6569.38 ms

So with --draft 1 the target model is much slower, taking ~6.5 s compared to ~4.4 s with --draft 0, which is weird.
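
If I'm reading the timings right, the gap is entirely the multi-token-batch penalty described above: with --draft 1 the target model processes 36 two-token batches (counted as 72 "prompt eval" tokens at ~43 ms/token, ≈ 3.1 s in total), whereas decoding those same 72 tokens one at a time would take roughly 72 × 16.5 ms ≈ 1.2 s, which accounts for most of the ~2 s difference in total time.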
