-
#12135 (comment) I found a way to improve Vulkan prompt processing performance significantly for Intel Arc, which should help a little. More software improvements are possible, but it's not easy to figure out how to optimize for Intel. Text generation (tg) should already be decent.
-
Hello @ky438, we are aware of the optimization issues with text generation.
-
The contributors to the other backends could say exactly the same thing, but their results are vastly different, which is probably why they do not.
Rather than telling me what I should or shouldn't do, why not try to identify the actual technical issues that are causing this poor performance?
…On Tue, Mar 25, 2025 at 07:41:53PM -0700, Neo Zhang Jianyu wrote:
@ky438
We are private contributors maintaining the SYCL backend on Intel GPUs.
You shouldn't complain so much, since we have spent our spare time over the past year maintaining it and making it work.
Yes, it works, even if not perfectly.
For BMG, we don't promise to optimize it on the marketing timeline.
I suggest you push Intel to contribute to this project on Intel GPUs.
-
Thanks, that is helpful advice.
…On Tue, Mar 25, 2025 at 07:44:36PM -0700, Neo Zhang Jianyu wrote:
If you want to see the best performance on Intel GPUs, please try OpenVINO.
-
My guess is missing flash attention and missing MMQ kernels. I intend to improve it a bit (I'm not from Intel). We just integrated SYCL CI/CD tests into this project, which had been missing. I hope things will improve from this point on. OpenGL is a different thing altogether.
-
Hi Romain,
Many thanks for this incredibly helpful message; that all makes sense.
This is a naive question, but do you have a sense of why, when running on
a modern server-class CPU system, cache hit rates throughout the memory
hierarchy are quite reasonable (as a specific data point, on Zen5 desktop
systems I see an L1D hit rate of 99% and an L2 hit rate of 92% for various
32B Q6 models), yet the kernel is memory bound on the configurations you have tried?
This gives me hope that you'll eventually be able to get rather high
performance out of Intel GPUs with DPAS, but I'm not familiar enough with
the data layout issues or the Intel GPU memory hierarchy.
Regards,
Kumi
…On Wed, Mar 26, 2025 at 07:18:28AM -0700, Romain Biessy wrote:
@ky438,
> I'm concerned that, as commented in #12035 "This solution is not the better solution on Intel GPU. There is still huge potential of Intel GPU. Need more study work in the feature." that such an approach is doomed to failure, as it is fundamentally flawed in some way, and I would like to understand how.
The key point of this PR is to reorder the quantization format so that the quantized data and metadata are separated, which will allow us to load the quantized data more efficiently. This is also something the ggml-cuda backend does, and it should be beneficial to Intel GPUs too. As far as I understand, the comment you are quoting implies that the way this is done in ggml-sycl is not currently optimal, and that's something we plan to improve.
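To make the layout idea concrete, here is a minimal C++ sketch assuming a simplified Q4-style block; the struct names and block geometry are illustrative, not the actual ggml definitions:

```cpp
#include <cstdint>
#include <vector>

// Interleaved (array-of-structs) layout: each block carries its scale next
// to its 4-bit quants, so streaming the quants also drags the metadata
// through the memory pipeline and breaks up wide, coalesced loads.
struct BlockQ4 {
    uint16_t scale;      // fp16 scale for this block of 32 weights
    uint8_t  quants[16]; // 32 x 4-bit quantized weights, packed 2 per byte
};

// Reordered (struct-of-arrays) layout: all quants contiguous, all scales
// contiguous, so a kernel can stream the quant bytes with wide contiguous
// reads and fetch the scales separately.
struct ReorderedQ4 {
    std::vector<uint8_t>  quants; // n_blocks * 16 bytes, contiguous
    std::vector<uint16_t> scales; // n_blocks scales, contiguous
};

ReorderedQ4 reorder(const std::vector<BlockQ4>& blocks) {
    ReorderedQ4 out;
    out.quants.reserve(blocks.size() * 16);
    out.scales.reserve(blocks.size());
    for (const auto& b : blocks) {
        out.quants.insert(out.quants.end(), b.quants, b.quants + 16);
        out.scales.push_back(b.scale);
    }
    return out;
}
```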
> Is part of the issue here that Intel's SYCL software stack is so bad that, far from being useful as a "high level tensor math compiler", it's a barrier to getting any kind of sane operation out of Intel's GPUs?
No, there's really nothing hinting at that currently, but if there were, we would report it to Intel and see what could be done. I think a lot can be improved with enough effort on ggml-sycl itself.
> When you say "potentially using the matrix engine" - why potentially? Wouldn't you expect performance to be astonishingly poor on Intel's current GPUs if they were unused?
I say potentially because the kernel is currently memory bound in the configurations we have tried. If we're still not able to improve that for some reason, using HMX won't help.
> When you say "SYCL backend", are you referring to the code in src/ggml-sycl, the "intel graphics compiler" as "SYCL backend", or both?
In this case I am referring to ggml-sycl and the whole SYCL stack for Intel devices, meaning the DPC++ compiler and runtime, IGC, and Level Zero.
> Finally, I noticed you added "support" for the DP4A instruction in src/ggml-sycl - isn't it _much_ more important to try and target the DPAS / DPASW instructions https://github.com/intel/intel-graphics-compiler/blob/master/documentation/visa/instructions/DPAS.md ? Either directly or via the SYCL joint_matrix extension?
My dp4a patch was a simple way to improve performance on Nvidia devices.
I agree using DPAS should be a good improvement, assuming we can ensure the kernel is fed with enough data, which is often easier said than done. We don't think the joint_matrix extension will be a good fit due to the way the data is laid out; we're planning to use DPAS instructions more "directly".
I hope this will alleviate your concerns.
-
Hi Romain,
Yes, I was talking about CPU cache hit rates, but for the case of CPU execution.
I'm still struggling to understand how these could be memory bound on
an Intel GPU; this isn't sparse LU factorization, this is dense
matrix multiplication of dense vectors, right? So a relatively "compute
intense" rather than "bandwidth intense" problem, one that should be amenable
to "cache blocking" and similar strategies?
Would back-of-the-envelope calculations be helpful here? I wonder whether, if we
take a specific model and an Intel Arc B580 GPU, we can determine the
performance we would expect to see for:
- Infinite I/O, actual B580 arithmetic
- Infinite arithmetic, actual B580 bandwidth
and compare these numbers to each other and to the observed performance,
which I suspect will be much lower than either. That is why I am guessing
that right now the Intel discrete GPUs are not "truly" memory or compute
bound (despite your comment about DP4A not changing observed performance); a sketch of such a calculation follows below.
Sorry, I realise this is just a very long way of saying "I wonder what
the realistically achievable performance on Intel Arc GPUs might be, and
how far away are we from that right now?"
Thanks again,
Kumi
…On Thu, Mar 27, 2025 at 04:19:33AM -0700, Romain Biessy wrote:
I think you are talking about cache hit rates on the CPU, right? When I say the kernel is memory bound, I mean that the limitation comes from the kernel not moving data from the global memory of the device to the registers (of the device) in the most efficient way. This is not something that can be measured by cache hit rates on the CPU. VTune can give you more details on this.
Also note that our approach is to try to optimize a different variant of the mul_mat_vec_q kernel than the one currently selected for Intel devices. We want to optimize [mul_mat_vec_q](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/mmvq.cpp#L6), which is the one selected by the CUDA backend but relies on shared memory. The equivalent of that for Intel GPUs is to use prefetch. This kernel is the one using dp4a, and we noticed that switching from a software implementation of dp4a to a hardware one had no impact, hence we suspected that this kernel is memory bound.
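For context, dp4a is a four-way 8-bit dot product with accumulate; a software fallback looks roughly like the following generic sketch (not the actual ggml-sycl implementation):

```cpp
#include <cstdint>
#include <cstring>

// Software dp4a: treat each 32-bit operand as four signed 8-bit lanes,
// multiply lane-wise, and accumulate into c. Hardware dp4a performs the
// same operation in a single instruction.
int32_t dp4a_sw(int32_t a, int32_t b, int32_t c) {
    int8_t va[4], vb[4];
    std::memcpy(va, &a, 4); // reinterpret the 32-bit words as 4 x int8
    std::memcpy(vb, &b, 4);
    for (int i = 0; i < 4; ++i) {
        c += int32_t(va[i]) * int32_t(vb[i]);
    }
    return c;
}
```

If swapping this loop for the hardware instruction changes nothing, arithmetic was never the bottleneck, which is what points to memory.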
-
What is the current status of support for Intel Arc GPUs such as the B580?
In testing, the B580's OpenGL performance is quite good, but llama.cpp text-generation performance via SYCL is extremely bad.
Does anyone understand exactly why Intel's GPUs perform so terribly on llama.cpp? I am guessing there are multiple defects in the software stack, starting from llama.cpp's SYCL backend and continuing down?