What happened?
I was running Llama-3 on an RTX 3090 and hit the same performance problem as in #1376.
When using grammar files, sample time becomes very long and GPU utilization drops from 70%+ (without a grammar) to 10%.
I tried two different fine-tuned versions of Llama-3 and the problem persists in both.
With Llama-2 there is no such problem, so I believe this is a bug in llama.cpp.
I offloaded all layers to the GPU and I believe llama.cpp is configured correctly.
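For reference, this is roughly how I invoke it (the model path and prompt below are illustrative placeholders, not my exact command; json.gbnf is the JSON grammar bundled in llama.cpp's grammars/ directory):

./main -m ./models/Meta-Llama-3-8B-Instruct.Q8_0.gguf \
  -ngl 99 \
  --grammar-file grammars/json.gbnf \
  -p "List three colors as a JSON array:" -n 128

Running the same command without --grammar-file brings sample time back to normal.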
Name and Version
version: 2998 (9588f19)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
Llama-3-8B-Instruct with grammar:
llama_print_timings: load time = 195.81 ms
llama_print_timings: sample time = 7656.05 ms / 90 runs ( 85.07 ms per token, 11.76 tokens per second)
llama_print_timings: prompt eval time = 192.27 ms / 410 tokens ( 0.47 ms per token, 2132.44 tokens per second)
llama_print_timings: eval time = 944.78 ms / 89 runs ( 10.62 ms per token, 94.20 tokens per second)
llama_print_timings: total time = 9298.97 ms / 499 tokens
Llama-3-8B-Instruct without grammar:
llama_print_timings: load time = 193.30 ms
llama_print_timings: sample time = 387.66 ms / 233 runs ( 1.66 ms per token, 601.04 tokens per second)
llama_print_timings: prompt eval time = 192.93 ms / 410 tokens ( 0.47 ms per token, 2125.09 tokens per second)
llama_print_timings: eval time = 2355.86 ms / 232 runs ( 10.15 ms per token, 98.48 tokens per second)
llama_print_timings: total time = 3277.20 ms / 642 tokens
Llama-2-7B with grammar:
llama_print_timings: load time = 210.30 ms
llama_print_timings: sample time = 354.68 ms / 54 runs ( 6.57 ms per token, 152.25 tokens per second)
llama_print_timings: prompt eval time = 209.69 ms / 464 tokens ( 0.45 ms per token, 2212.84 tokens per second)
llama_print_timings: eval time = 492.42 ms / 53 runs ( 9.29 ms per token, 107.63 tokens per second)
llama_print_timings: total time = 1128.22 ms / 517 tokens
Llama-2-7B without grammar:
llama_print_timings: load time = 194.85 ms
llama_print_timings: sample time = 153.25 ms / 367 runs ( 0.42 ms per token, 2394.76 tokens per second)
llama_print_timings: prompt eval time = 194.44 ms / 464 tokens ( 0.42 ms per token, 2386.38 tokens per second)
llama_print_timings: eval time = 3512.26 ms / 366 runs ( 9.60 ms per token, 104.21 tokens per second)
llama_print_timings: total time = 4094.80 ms / 830 tokens
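To put the numbers side by side: with the grammar, per-token sample time on Llama-3 goes from 1.66 ms to 85.07 ms (roughly a 51x slowdown), while on Llama-2 it goes from 0.42 ms to 6.57 ms (roughly 16x), so the grammar overhead hits Llama-3 disproportionately hard.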