Vulkan Intel Fixes, Optimizations and Debugging Flags #5301
Conversation
Optimize matmul for Intel ARC
Add Vulkan dequant test
The async functions are supposed to work in this way:
In short:
The thing is, unlike CUDA, Vulkan is meant for predictable, repeated work (like generating frames). I have significant overhead for submitting work to a queue, which is why I need to batch as much as possible into (best case) a single command buffer before I submit it. I am relatively certain that if I opened a command buffer, recorded a copy, closed the command buffer and submitted it to the queue each time these functions get called, I'd lose any advantage this would bring to the overhead. If it's not feasible to have a synchronize call at the end of a batch of async set/get calls, then I can't implement those functions. I'll revert the changes in that case.
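For illustration, here is a minimal, hypothetical Vulkan-Hpp sketch of the batching pattern described above (this is not the actual ggml-vulkan code; names like `flush_copies` and `PendingCopy` are made up, and error handling/fences are omitted): several buffer copies are recorded into one command buffer and submitted to the queue once, instead of paying the submission overhead per copy.

```cpp
#include <vulkan/vulkan.hpp>
#include <vector>

// Hypothetical example: one pending copy between two already-created buffers.
struct PendingCopy {
    vk::Buffer     src;
    vk::Buffer     dst;
    vk::BufferCopy region;
};

// Record all pending copies into a single command buffer and submit once.
void flush_copies(vk::Device device, vk::Queue queue, vk::CommandPool pool,
                  const std::vector<PendingCopy> & copies) {
    // Allocate one primary command buffer for the whole batch.
    vk::CommandBufferAllocateInfo alloc_info(pool, vk::CommandBufferLevel::ePrimary, 1);
    vk::CommandBuffer cmd = device.allocateCommandBuffers(alloc_info)[0];

    cmd.begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eOneTimeSubmit));
    for (const PendingCopy & c : copies) {
        cmd.copyBuffer(c.src, c.dst, c.region); // only recorded, nothing executes yet
    }
    cmd.end();

    // A single queue submission for the whole batch; this is the expensive step.
    vk::SubmitInfo submit{};
    submit.commandBufferCount = 1;
    submit.pCommandBuffers    = &cmd;
    queue.submit(submit);

    queue.waitIdle(); // stand-in for the backend's synchronize call

    device.freeCommandBuffers(pool, cmd);
}
```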
Ignoring the asynchronous functions is probably a good idea for now - the interface is not completely well defined yet (i.e. the problem with
Alright, thank you. I'll revert it for now and keep the functions around for future use once the interface is defined.
Thanks so much @0cc4m! Inference works on my Intel Xe iGPU on my Framework laptop now. Great work!
That's great to hear! Does it provide a speedup over CPU-only? I'd be interested in seeing benchmarks of that; I don't think I've seen anyone use an Xe iGPU with my code yet.
It's about the same as running it on my CPU, 5-6 t/s with Mistral q4_0. But I do notice a reduction in my fans spinning up.
That was a 2x improvement. Maybe you could create a fork as a playground where you could test optimizations for Intel Arc?
q4_0 isn't as well optimized in my code yet; I would recommend k-quants instead. They should be faster.
That was the result of me buying an A770 and checking why it was actually that slow before. It turns out Intel doesn't like my larger matmul shaders, so I disabled them, and that was the improvement. The next optimization won't be as easy.
On Intel i7-1165G7
* Fix Vulkan on Intel ARC
  Optimize matmul for Intel ARC
  Add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
I managed to find the cause of the incoherence of Vulkan on Intel ARC GPUs with k-quants. For some reason, the matrix multiplication shader and the dequant shaders work correctly when run individually, but in the actual compute graph the 16-bit float read from the struct was always 0. I think that's a driver bug, but I found a workaround: not using float16 values in the dequant shaders.
I managed to increase prompt processing speed on these GPUs a little as well, but it's not that good yet. I'll have to look into how to optimize for Intel GPUs in the future.
I also added the Vulkan debugging preprocessor flags to make and cmake.
@slaren I tried implementing async copies for the Vulkan backend, but due to the nature of how recording command buffers works in Vulkan, the work only gets dispatched when synchronize is called. That synchronize call was previously commented out in llama.cpp, since CUDA (and maybe others) don't need it? Is it an issue to always call synchronize there?
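To make the deferred-dispatch idea concrete, here is a minimal, hypothetical sketch (the names `DeferredQueue`, `set_tensor_async` and `synchronize` are illustrative, not the actual ggml-backend interface): the "async" call only stages and records the work, and nothing is executed until synchronize is called. A plain host memcpy stands in for what would be a command-buffer submission and fence wait in the real Vulkan backend.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical illustration of deferring async transfers until synchronize.
struct DeferredQueue {
    struct Copy {
        void *            dst;
        std::vector<char> staging; // staged copy of the source data
    };
    std::vector<Copy> pending;

    // "Async" upload: stage the data and remember the destination,
    // but do not dispatch anything yet.
    void set_tensor_async(void * dst, const void * src, size_t size) {
        Copy c;
        c.dst = dst;
        c.staging.assign((const char *) src, (const char *) src + size);
        pending.push_back(std::move(c));
    }

    // synchronize(): dispatch everything that was recorded, as one batch.
    // In the real backend this is where the command buffer would be closed,
    // submitted to the queue, and its fence waited on.
    void synchronize() {
        for (const Copy & c : pending) {
            std::memcpy(c.dst, c.staging.data(), c.staging.size());
        }
        pending.clear();
    }
};
```

If the caller never invokes synchronize at the end of a batch of async set/get calls, the staged work is never dispatched, which is exactly the problem described above.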