Description
I noticed that some of the responses I get from the llama.cpp server (latest master) are unnaturally fast for a 70B model, and it happens randomly. When this happens, the response quality is also worse. The model I'm using is https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/blob/main/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf, launched with `llama-server -m Meta-Llama-3-70B-Instruct-Q5_K_M.gguf -c 0 -t 24 -ngl 24`.
The model is only partially offloaded to the GPU (ROCm on Linux), so maybe llama.cpp somehow doesn't use all layers when it responds quickly.
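To quantify "unnaturally fast", one way I'd check is to query the server's `/completion` endpoint directly and look at the `timings` object it returns alongside the generated text, then compare the per-token speed between a normal run and a suspiciously fast one. This is just a sketch; the exact payload fields and port are what I'd expect from a default `llama-server` setup, so adjust if your configuration differs:

```sh
# Send a fixed prompt and print the server's own timing report
# (assumes the default host/port 127.0.0.1:8080; "timings" contains
# fields like predicted_per_second for generation speed).
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is the sky blue?", "n_predict": 64}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["timings"])'
```

If the fast, low-quality responses show a clearly higher `predicted_per_second` for the same prompt and settings, that would support the idea that less compute is being done on those runs rather than it just being a perception issue.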