Description
Hello,
I've noticed a significant speed reduction in prompt processing when comparing the latest llama.cpp builds to slightly older ones.
I think it has something to do with the batch size: at a batch size of 512 the speed is the same as it has always been, but with `-b 1024` it is significantly slower.
Comparison on the latest llama.cpp: `-n 180 -c 4096 -t 6 --gpu-layers 5 --ignore-eos -b 1024`, Mixtral IQ4_XS, Core i7 9750H, 32 GB RAM, RTX 2060
version: 2431 (4755afd)
llama_print_timings: load time = 2339,43 ms
llama_print_timings: sample time = 67,74 ms / 180 runs ( 0,38 ms per token, 2657,10 tokens per second)
llama_print_timings: prompt eval time = 72387,34 ms / 3602 tokens ( 20,10 ms per token, 49,76 tokens per second)
llama_print_timings: eval time = 44119,33 ms / 179 runs ( 246,48 ms per token, 4,06 tokens per second)
llama_print_timings: total time = 116631,73 ms / 3781 tokens
version: 2405 (5cdb371)
llama_print_timings: load time = 2482,92 ms
llama_print_timings: sample time = 69,55 ms / 180 runs ( 0,39 ms per token, 2587,99 tokens per second)
llama_print_timings: prompt eval time = 51669,64 ms / 3602 tokens ( 14,34 ms per token, 69,71 tokens per second)
llama_print_timings: eval time = 42287,08 ms / 179 runs ( 236,24 ms per token, 4,23 tokens per second)
llama_print_timings: total time = 94085,31 ms / 3781 tokens
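Recomputing the per-token figures from the two logs above confirms the regression is in prompt eval, not in sampling or generation (a quick sanity check, using the logged totals; the timings use comma decimal separators, written here with dots):

```python
# Derive ms/token for prompt eval from the logged totals:
# prompt eval time divided by the number of prompt tokens.
def ms_per_token(total_ms: float, n_tokens: int) -> float:
    return total_ms / n_tokens

new_build = ms_per_token(72387.34, 3602)  # version 2431 (4755afd)
old_build = ms_per_token(51669.64, 3602)  # version 2405 (5cdb371)

print(f"{new_build:.2f} ms/t vs {old_build:.2f} ms/t")
print(f"slowdown: {new_build / old_build:.0%} of old time")
```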
@slaren Do you think there is a commit that could have caused this? Listening to the coil whine of my laptop during prompt processing, there is a very noticeable difference in the sound. With the recent commit, it sounds like it is processing two 512-token batches instead of one 1024-token batch (there is a noticeable pause in the coil whine at some point), even though the terminal reports the usual 1024 batch size. With the older commit, there is no such pause and the sound is continuous for the whole 1024 tokens.
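To make the hypothesis concrete: if some internal cap now limits compute sub-batches to 512 tokens regardless of `-b`, a requested 1024-token batch would be processed in two passes, which would match the pause in the coil whine. This is a minimal sketch of that assumed splitting behavior, not actual llama.cpp code:

```python
# Hypothetical sketch: split a logical batch into sub-batches capped at
# n_ubatch tokens, as a batch scheduler with such a limit might do.
def split_batch(n_tokens: int, n_ubatch: int = 512) -> list[int]:
    chunks = []
    remaining = n_tokens
    while remaining > 0:
        chunk = min(remaining, n_ubatch)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# Under the assumed 512-token cap, -b 1024 becomes two passes of 512.
print(split_batch(1024))
# A 3602-token prompt would be seven 512-token chunks plus a remainder.
print(split_batch(3602))
```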
The speed difference is quite stark (20 ms/t vs. 14 ms/t). I hope you can take a look at this. Thank you!