Description
Name and Version
llama-cli b5050 vs b5017
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
llama-cli -m "modelname.gguf" -p "prompt" -ngl 50
Problem description & steps to reproduce
Under Windows 11, taking Mistral-Nemo-Instruct_2407 Q4_K_M as reference, performance went down from 4.7 tok/s with b5017 to 4.3 tok/s with b5050 (same Intel drivers, 32.0.1016651).
For gemma-3-4b-it-Q6-K, performance is the same in both builds at 9.7 tok/s.
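For anyone reproducing this, a more controlled comparison between the two builds can be made with llama-bench, which averages over repeated runs (the model path and offload count below are placeholders matching the command line above):

llama-bench -m "modelname.gguf" -ngl 50 -p 512 -n 128 -r 3

Running this with both the b5017 and b5050 binaries and comparing the reported t/s should show the same gap.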
While this may be anecdotal given that Iris Xe is a rather basic integrated GPU, b5017 was the first version where I noticed llama.cpp's Vulkan backend running faster than AVX2 on an i7-1165G7 (this did not use to be the case; Vulkan was noticeably slower before). Running inference on the Iris Xe is also quite energy efficient: the laptop can run inference at 100% GPU with the fans off.
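For that Vulkan-vs-CPU data point, a minimal sketch of how to repeat the comparison on the same machine (assuming a Vulkan build; -ngl 0 keeps all layers on the CPU while -ngl 50 offloads them to the Iris Xe):

llama-bench -m "modelname.gguf" -ngl 0,50

Note that -ngl 0 on a Vulkan build is only an approximation of the pure AVX2 path; a CPU-only build is the cleaner baseline.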
Reporting this in case the Vulkan wizards can take advantage of that data point!
With all the recent enhancements to Vulkan in llama.cpp, it's now rather comfortable to run small-ish models on regular laptops.
First Bad Commit
No response