
Bug: sample time becomes very long when using Llama-3 #7554

Closed
@kooWZ

Description


What happened?

I was running Llama-3 on a 3090 and hit the same performance problem as in #1376.
When using a grammar file, sample time becomes very long and GPU utilization drops from over 70% (without a grammar) to around 10%.
I tried two different fine-tuned versions of Llama-3 and the problem remains.
With Llama-2 there is no such problem, so I believe this is a bug in llama.cpp.
I offloaded all layers to the GPU and believe llama.cpp is configured properly. A repro invocation would look something like this (the flags are from llama.cpp's main example; the model and grammar file names are placeholders):
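
./main -m llama-3-8b-instruct.Q8_0.gguf -ngl 99 --grammar-file json.gbnf -p "Describe the RTX 3090 as JSON." -n 128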

Name and Version

version: 2998 (9588f19)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

Llama-3-8B-Instruct with grammar:
llama_print_timings:        load time =     195.81 ms
llama_print_timings:      sample time =    7656.05 ms /    90 runs   (   85.07 ms per token,    11.76 tokens per second)
llama_print_timings: prompt eval time =     192.27 ms /   410 tokens (    0.47 ms per token,  2132.44 tokens per second)
llama_print_timings:        eval time =     944.78 ms /    89 runs   (   10.62 ms per token,    94.20 tokens per second)
llama_print_timings:       total time =    9298.97 ms /   499 tokens

Llama-3-8B-Instruct without grammar:
llama_print_timings:        load time =     193.30 ms
llama_print_timings:      sample time =     387.66 ms /   233 runs   (    1.66 ms per token,   601.04 tokens per second)
llama_print_timings: prompt eval time =     192.93 ms /   410 tokens (    0.47 ms per token,  2125.09 tokens per second)
llama_print_timings:        eval time =    2355.86 ms /   232 runs   (   10.15 ms per token,    98.48 tokens per second)
llama_print_timings:       total time =    3277.20 ms /   642 tokens

Llama-2-7B with grammar:
llama_print_timings:        load time =     210.30 ms
llama_print_timings:      sample time =     354.68 ms /    54 runs   (    6.57 ms per token,   152.25 tokens per second)
llama_print_timings: prompt eval time =     209.69 ms /   464 tokens (    0.45 ms per token,  2212.84 tokens per second)
llama_print_timings:        eval time =     492.42 ms /    53 runs   (    9.29 ms per token,   107.63 tokens per second)
llama_print_timings:       total time =    1128.22 ms /   517 tokens

Llama-2-7B without grammar:
llama_print_timings:        load time =     194.85 ms
llama_print_timings:      sample time =     153.25 ms /   367 runs   (    0.42 ms per token,  2394.76 tokens per second)
llama_print_timings: prompt eval time =     194.44 ms /   464 tokens (    0.42 ms per token,  2386.38 tokens per second)
llama_print_timings:        eval time =    3512.26 ms /   366 runs   (    9.60 ms per token,   104.21 tokens per second)
llama_print_timings:       total time =    4094.80 ms /   830 tokens
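
The per-token sample time gap with a grammar (85.07 ms vs 6.57 ms) is consistent with grammar sampling doing work proportional to vocabulary size on every step: Llama-3's vocabulary (~128k tokens) is about 4x larger than Llama-2's (~32k), and each candidate token's text has to be checked against the grammar. A minimal sketch of that access pattern (hypothetical code, not the actual llama.cpp implementation):

// Hypothetical sketch of grammar-constrained sampling that validates
// every candidate token on each step. Not the llama.cpp source.
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for a compiled GBNF grammar; matches() is assumed to cost
// roughly O(token text length) per call.
struct Grammar {
    bool matches(const std::string& token_text) const {
        return !token_text.empty(); // placeholder acceptance test
    }
};

// Filter the candidate list before sampling. This loop runs once per
// generated token and touches the whole vocabulary, so per-token cost
// grows linearly with vocab size: ~32k entries for Llama-2 vs ~128k
// for Llama-3 already predicts a ~4x slowdown, before any difference
// in per-token match cost.
std::vector<int> grammar_filter(const std::vector<std::string>& vocab,
                                const Grammar& grammar) {
    std::vector<int> allowed;
    for (int id = 0; id < (int)vocab.size(); ++id) {
        if (grammar.matches(vocab[id])) {
            allowed.push_back(id);
        }
    }
    return allowed;
}

int main() {
    std::vector<std::string> vocab(128256, "tok"); // Llama-3-sized vocab
    Grammar g;
    printf("allowed candidates: %zu\n", grammar_filter(vocab, g).size());
}

Under this model, sampling cost scales with vocabulary size while eval time (the GPU-bound part) is unaffected, which matches the logs above: eval speed stays near 100 tokens per second in all four runs, but sampling dominates the total only for Llama-3 with a grammar.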

Labels: bug-unconfirmed, medium severity