
Bug: Weird output from llama-speculative #8499

Closed
@bong-furiosa

Description


What happened?

Hello, llama.cpp experts! Thank you for creating such an amazing LLM inference system. 😁
However, while using it, I encountered an unusual result when checking the speculative decoding output.
I believe the observed behavior is a bug, so I am reporting it as a bug issue on this GitHub project.

First, here is my system configuration.

  • OS: ubuntu 22.04
  • CUDA: 12.4
  • GPU: A100 80GB

Next, I will explain the steps I took to download and run the models up to the point where the bug occurred.
It was somewhat challenging to get the llama.cpp tools working.

# download draft model
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir=./llama-1.1b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-1.1b
# download target model
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir=./llama-7b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-7b
# run llama-speculative
./build/bin/llama-speculative -m ./llama-7b/ggml-model-f16.gguf -md ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -ngld 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5

And the printed result is as follows:

draft:

llama_print_timings:        load time =    4430.64 ms
llama_print_timings:      sample time =     897.28 ms /   555 runs   (    1.62 ms per token,   618.54 tokens per second)
llama_print_timings: prompt eval time =    9531.68 ms /   228 tokens (   41.81 ms per token,    23.92 tokens per second)
llama_print_timings:        eval time =    1968.11 ms /   444 runs   (    4.43 ms per token,   225.60 tokens per second)
llama_print_timings:       total time =   19874.43 ms /   672 tokens

target:

llama_print_timings:        load time =   26494.43 ms
llama_print_timings:      sample time =    1337.68 ms /   112 runs   (   11.94 ms per token,    83.73 tokens per second)
llama_print_timings: prompt eval time =    1840.43 ms /   673 tokens (    2.73 ms per token,   365.68 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   24380.18 ms /   674 tokens

Here, unlike #3649, the target model's eval timing reports inf tokens per second (0.00 ms over 1 run, so the throughput calculation presumably divides by zero).

I am currently comparing the generation-phase latency of the draft model and the target model in speculative decoding.
So far, I have used llama-bench and llama-cli to measure tokens per second for each model, and the results have differed (e.g. the latency ratio measured with llama-bench was significantly larger than the one measured with llama-cli).

Therefore I attempted an additional measurement with llama-speculative, but I obtained the unusual value of inf. I would like to ask for confirmation on whether this result is a bug or expected behavior of llama.cpp. 🙏
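
For context, the per-model measurements were made with commands roughly along the following lines (a sketch only: the prompt length, generation length, and seed shown here are illustrative, not necessarily the exact values I used; the model paths follow the setup above).

# measure per-model throughput with llama-bench (illustrative arguments)
./build/bin/llama-bench -m ./llama-7b/ggml-model-f16.gguf -p 512 -n 128 -ngl 100
./build/bin/llama-bench -m ./llama-1.1b/ggml-model-f16.gguf -p 512 -n 128 -ngl 100
# measure tokens per second with llama-cli, using the same sampling settings as the llama-speculative run
./build/bin/llama-cli -m ./llama-7b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20
./build/bin/llama-cli -m ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20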

Name and Version

./build/bin/llama-cli --version
version: 3392 (bda62d79) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response


Labels

bug-unconfirmed, medium severity, stale
