
Vulkan Intel Fixes, Optimizations and Debugging Flags #5301

Merged
6 commits merged into master from 0cc4m/vulkan-fixes on Feb 3, 2024

Conversation

@0cc4m (Collaborator) commented Feb 3, 2024

I managed to find the cause of the incoherent output from Vulkan on Intel ARC GPUs with k-quants. For some reason, the matrix multiplication shader and the dequant shaders each work individually, but in the actual compute graph the 16-bit float read from the struct was always 0. I think that's a driver bug, but I found a workaround: not using float16 values in the dequant shaders.
I managed to increase prompt processing speed on these GPUs a little as well, but it's not that good yet. I'll have to look into how to optimize for Intel GPUs in the future.

I also added the Vulkan debugging preprocessor flags to Make and CMake.
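
(For illustration only, not code from this PR: a minimal sketch of how such a compile-time debug flag typically gates extra tracing in the backend. The macro name GGML_VULKAN_DEBUG follows the backend's naming convention but is an assumption here, and the function shown is hypothetical.)

```cpp
#include <cstdio>

// Hypothetical backend entry point, used only to illustrate the pattern.
static void example_vk_buffer_write(size_t offset, size_t size) {
#ifdef GGML_VULKAN_DEBUG
    // Extra tracing, compiled in only when the Make/CMake debug flag
    // defines GGML_VULKAN_DEBUG.
    fprintf(stderr, "example_vk_buffer_write(offset=%zu, size=%zu)\n", offset, size);
#endif
    // ... record the actual Vulkan commands here ...
    (void) offset;
    (void) size;
}
```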

@slaren I tried implementing async copies for the Vulkan backend, but due to the way command buffer recording works in Vulkan, the work is only dispatched when synchronize is called. This was commented out in llama.cpp previously, since CUDA (and maybe others) don't need it? Is it an issue to always call synchronize there?

@0cc4m requested a review from slaren on February 3, 2024 at 11:01
@slaren (Member) commented Feb 3, 2024

The async functions are supposed to work in this way:

  • Each ggml_backend instance has an independent asynchronous queue
  • The async functions (including graph_compute) add work at the end of this queue and return immediately
  • The synchronize function waits until all the work in the queue is completed
  • The current copy_async function is flawed - it is not clear how it is supposed to behave when copying data between different backends, since only the dst backend is provided. This will change in the pipeline parallelism demo #4918.
  • When calling ggml_backend_tensor_copy_async to copy a tensor from the CPU backend (which is the only type of async copy you should see without multi-GPU support), set_async will be called instead anyway. (This is still flawed, since it assumes that the CPU backend is always synchronous, which is not always going to be the case, but it doesn't matter right now.)
  • It is never going to be ok to call synchronize immediately after copy_async/set_async/get_async - that would just be equivalent to calling the synchronous version instead. The copy should be queued at the end of the backend queue and executed as soon as possible - there is no point in making an async copy that only starts when synchronizing.

In short:

  • Ignore copy_async for now; it is flawed and will change in the near future
  • Implement set_async and get_async instead, to enqueue a copy at the end of the backend instance's queue that starts as soon as possible (a minimal sketch of these semantics follows below)
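
(For illustration only: a minimal C++ sketch of the queue semantics described above. All names are hypothetical and this is not ggml's actual API - async calls start work right away and return immediately, while synchronize blocks until everything enqueued so far has finished.)

```cpp
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for one ggml_backend instance.
struct backend_instance {
    std::vector<std::future<void>> in_flight;

    // set_async / get_async / graph_compute: kick the work off right away
    // and return immediately, without waiting for it to finish.
    void enqueue_async(std::function<void()> work) {
        in_flight.push_back(std::async(std::launch::async, std::move(work)));
    }

    // synchronize: wait until all previously enqueued work has completed.
    void synchronize() {
        for (auto & f : in_flight) {
            f.get();
        }
        in_flight.clear();
    }
};

int main() {
    backend_instance backend;
    backend.enqueue_async([] { /* copy input tensor to the device */ });
    backend.enqueue_async([] { /* run the compute graph */ });
    // ... enqueue more work; nothing above has been waited on yet ...
    backend.synchronize(); // only now do we block until everything is done
}
```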

@0cc4m (Collaborator, Author) commented Feb 3, 2024

> * It is never going to be ok to call `synchronize` immediately after `copy_async/set_async/get_async` - that would just be equivalent to calling the synchronous version instead. The copy should be queued at the end of the backend queue and executed as soon as possible - there is no point in making an async copy that only starts when synchronizing

The thing is, unlike CUDA, Vulkan is meant for predictable, repeated work (like generating frames). I have significant overhead for submitting work to a queue, which is why I need to batch as much as possible into (ideally) a single command buffer before I submit it. I am relatively certain that if I opened a command buffer, recorded a copy, closed the command buffer and submitted it to the queue each time these functions get called, the overhead would eat any advantage this would bring.

If it's not feasible to have a synchronize call at the end of a batch of async set/get calls then I can't implement those functions. I'll revert the changes in that case.
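
(For illustration only: a hedged Vulkan C-API sketch of the batching described above. The handles are assumed to be created elsewhere and this is not the backend's actual code - the point is that many copies are recorded into one command buffer so the comparatively expensive queue submission is paid once, not per copy.)

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Record a whole batch of buffer copies into a single command buffer and
// submit it with one vkQueueSubmit, instead of one submit per copy.
void submit_batched_copies(VkQueue queue, VkCommandBuffer cmd,
                           VkBuffer src, VkBuffer dst,
                           const std::vector<VkBufferCopy> & regions) {
    VkCommandBufferBeginInfo begin_info = {};
    begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(cmd, &begin_info);

    // All copies go into the same command buffer.
    vkCmdCopyBuffer(cmd, src, dst, (uint32_t) regions.size(), regions.data());

    vkEndCommandBuffer(cmd);

    // One queue submission for the whole batch.
    VkSubmitInfo submit_info = {};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_info.commandBufferCount = 1;
    submit_info.pCommandBuffers = &cmd;
    vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);

    // In the backend this wait would happen later, in synchronize();
    // it is shown here only to keep the sketch self-contained.
    vkQueueWaitIdle(queue);
}
```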

@slaren (Member) commented Feb 3, 2024

Ignoring the asynchronous functions is probably a good idea for now - the interface is not completely well defined yet (i.e. the problem with copy_async), and the performance improvement is not likely to be very significant even in the best case. In the future, implementing the async functions will enable pipeline parallelism, which can significantly improve batch performance with multiple GPUs - but that's not merged yet, and the first step to supporting this in Vulkan would be to implement multi-GPU support.

@0cc4m (Collaborator, Author) commented Feb 3, 2024

> Ignoring the asynchronous functions is probably a good idea for now - the interface is not completely well defined yet (i.e. the problem with copy_async), and the performance improvement is not likely to be very significant even in the best case. In the future, implementing the async functions will enable pipeline parallelism, which can significantly improve batch performance with multiple GPUs - but that's not merged yet, and the first step to supporting this in Vulkan would be to implement multi-GPU support.

Alright, thank you. I'll revert it for now and keep the functions around for future use once the interface is defined.

@0cc4m changed the title from "Vulkan Intel Fixes, Optimizations, Debugging Flags and Async Copies" to "Vulkan Intel Fixes, Optimizations and Debugging Flags" on Feb 3, 2024
@0cc4m merged commit e920ed3 into master on Feb 3, 2024
56 checks passed
@0cc4m deleted the 0cc4m/vulkan-fixes branch on February 3, 2024 at 17:15
@Titaniumtown

Thanks so much @0cc4m! Inference works on my Intel Xe iGPU on my Framework laptop now. Great work!

@0cc4m (Collaborator, Author) commented Feb 3, 2024

> Thanks so much @0cc4m! Inference works on my Intel Xe iGPU on my Framework laptop now. Great work!

That's great to hear! Does it provide a speedup over CPU-only? I'd be interested in seeing benchmarks of that; I don't think I've seen anyone use an Xe iGPU with my code yet.

@Titaniumtown

> > Thanks so much @0cc4m! Inference works on my Intel Xe iGPU on my Framework laptop now. Great work!

> That's great to hear! Does it provide a speedup over CPU-only? I'd be interested in seeing benchmarks of that; I don't think I've seen anyone use an Xe iGPU with my code yet.

It's about the same as running it on my CPU, 5-6 t/s with Mistral q4_0. But I do notice that my fans spin up less.

@characharm commented Feb 3, 2024

@0cc4m

> I managed to increase prompt processing speed on these GPUs a little as well,

That was a 2x improvement. Maybe you could create a fork as a playground where you'll test optimizations for Intel Arc?

@0cc4m (Collaborator, Author) commented Feb 3, 2024

> > > Thanks so much @0cc4m! Inference works on my Intel Xe iGPU on my Framework laptop now. Great work!

> > That's great to hear! Does it provide a speedup over CPU-only? I'd be interested in seeing benchmarks of that; I don't think I've seen anyone use an Xe iGPU with my code yet.

> It's about the same as running it on my CPU, 5-6 t/s with Mistral q4_0. But I do notice that my fans spin up less.

q4_0 isn't as well optimized in my code yet; I would recommend k-quants instead. They should be faster.

> @0cc4m

> > I managed to increase prompt processing speed on these GPUs a little as well,

> That was a 2x improvement. Maybe you could create a fork as a playground where you'll test optimizations for Intel Arc?

That was the result of me buying an A770 and checking why it was actually so slow. It turns out Intel doesn't like my larger matmul shaders, so I disabled them, and that was the improvement. The next optimization won't be as easy.

@easyfab commented Feb 4, 2024

On Intel i7-1165G7

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | ---: | --- | ---: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 99 | pp 512 | 25.20 ± 0.36 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 99 | tg 128 | 3.55 ± 0.05 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | Vulkan | 99 | pp 512 | 31.92 ± 0.21 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | Vulkan | 99 | tg 128 | 5.54 ± 0.01 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU AVX2 | 4 | pp 512 | 10.08 ± 0.19 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU AVX2 | 4 | tg 128 | 6.53 ± 0.06 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU AVX512 | 4 | pp 512 | 10.51 ± 0.60 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | CPU AVX512 | 4 | tg 128 | 6.70 ± 0.04 |

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Fix Vulkan on Intel ARC

Optimize matmul for Intel ARC

Add Vulkan dequant test

* Add Vulkan debug and validate flags to Make and CMakeLists.txt

* Enable asynchronous transfers in Vulkan backend

* Fix flake8

* Disable Vulkan async backend functions for now

* Also add Vulkan run tests command to Makefile and CMakeLists.txt