Description
What happened?
Hello, llama.cpp experts! Thank you for creating such an amazing LLM inference system. 😁
However, while using this system, I encountered an unusual result when checking the speculative decoding output.
I believe the observed behavior is a bug, so I am reporting it as a bug issue on this GitHub project.
First, here is the configuration of my system:
- OS: Ubuntu 22.04
- CUDA: 12.4
- GPU: A100 80GB
Next, I will explain the steps I took to download and run the models up to the point where the bug occurred.
It was somewhat challenging for me to get everything set up with llama.cpp, so I am listing the commands in full.
```sh
# download draft model
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir=./llama-1.1b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-1.1b

# download target model
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir=./llama-7b
./venv/bin/python3 convert_hf_to_gguf.py ./llama-7b

# run llama-speculative
./build/bin/llama-speculative -m ./llama-7b/ggml-model-f16.gguf -md ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -ngld 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
And the printed result is as follows:
```
draft:
llama_print_timings: load time = 4430.64 ms
llama_print_timings: sample time = 897.28 ms / 555 runs ( 1.62 ms per token, 618.54 tokens per second)
llama_print_timings: prompt eval time = 9531.68 ms / 228 tokens ( 41.81 ms per token, 23.92 tokens per second)
llama_print_timings: eval time = 1968.11 ms / 444 runs ( 4.43 ms per token, 225.60 tokens per second)
llama_print_timings: total time = 19874.43 ms / 672 tokens

target:
llama_print_timings: load time = 26494.43 ms
llama_print_timings: sample time = 1337.68 ms / 112 runs ( 11.94 ms per token, 83.73 tokens per second)
llama_print_timings: prompt eval time = 1840.43 ms / 673 tokens ( 2.73 ms per token, 365.68 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 24380.18 ms / 674 tokens
```
Here, unlike #3649, I got an `inf` eval time for the target model.
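If I read the timing line correctly, the `inf` seems to fall straight out of the arithmetic of the printed fields (this is only my interpretation of the report, not a confirmed reading of the llama.cpp timing code):

$$
\text{tokens per second} = \frac{n_\text{runs}}{t_\text{eval}} = \frac{1}{0.00\ \text{ms}} = \infty
$$

so the underlying question may be why the eval time of the target model is measured as 0.00 ms over 1 run in the first place.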
I am currently comparing the generation-phase latency of the draft model and the target model in speculative decoding. So far, I have used `llama-bench` and `llama-cli` to measure tokens per second for each model (roughly as sketched below), and the results have been different (e.g. the latency ratio measured with `llama-bench` was significantly larger than that measured with `llama-cli`). Therefore I attempted an additional measurement with `llama-speculative`, but I obtained the unusual value `inf`, as shown above. I would like to ask for confirmation on whether this measurement result is a bug or expected behavior of llama.cpp. 🙏
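For reference, this is roughly the form of the standalone per-model measurements I am comparing against, using the converted models from above (the exact flag values here are illustrative rather than the precise commands I ran):

```sh
# generation speed of each model on its own, as reported by llama-bench
./build/bin/llama-bench -m ./llama-7b/ggml-model-f16.gguf -n 128 -ngl 100
./build/bin/llama-bench -m ./llama-1.1b/ggml-model-f16.gguf -n 128 -ngl 100

# the same comparison with llama-cli, using the same prompt as the speculative run
./build/bin/llama-cli -m ./llama-7b/ggml-model-f16.gguf -p "Making cake is like" -n 128 -ngl 100 --temp 1.0
./build/bin/llama-cli -m ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -n 128 -ngl 100 --temp 1.0
```

The ratio of the tokens-per-second numbers from these runs is what I would like to cross-check against the draft/target timings printed by `llama-speculative`.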
Name and Version
```
./build/bin/llama-cli --version
version: 3392 (bda62d79) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
What operating system are you seeing the problem on?
Linux
Relevant log output
No response