-
#12135 (comment) I found a way to improve Vulkan prompt processing performance significantly for Intel Arc, which should help a little. More software improvements are possible, but it's not easy to figure out how to optimize for Intel. Text generation (tg) should already be decent.
-
Hello @ky438, we are aware of the optimization issues with text generation.
-
The contributors to the other backends could say exactly the same thing, but their results are vastly different, which is probably why they do not.
Rather than telling me what I should or shouldn't do, why not try to identify the actual technical issues that are causing this poor performance?
…On Tue, Mar 25, 2025 at 07:41:53PM -0700, Neo Zhang Jianyu wrote:
@ky438
We are private contributors maintaining the SYCL backend on Intel GPUs.
You shouldn't complain so much, since we have spent our spare time over the past year maintaining it and making it work.
Yes, it works, even if not perfectly.
For BMG, we don't promise to optimize it on the marketing timeline.
I suggest you push Intel to contribute to this project on Intel GPUs.
-
Thanks, that is helpful advice.
…On Tue, Mar 25, 2025 at 07:44:36PM -0700, Neo Zhang Jianyu wrote:
If you want to see the best performance on Intel GPUs, please try OpenVINO.
-
My guess is missing flash attention and missing MMQ kernels. I intend to improve it a bit (I'm not from Intel). We just integrated SYCL CI/CD tests into this project, which had been missing. I hope things will improve from this point on. OpenGL is a different thing altogether.
-
Hi Romain,
Many thanks for this incredibly helpful message; that all makes sense.
This is a naive question, but do you have a sense of why, when running on
a modern server-class CPU system, cache hit rates throughout the memory
hierarchy are quite reasonable (as a specific data point, on Zen5 desktop
systems I see an L1D hit rate of 99% and an L2 hit rate of 92% for various
32B Q6 models), yet the kernel is memory bound on the configurations you have tried?
This gives me hope that you'll eventually be able to get rather high
performance out of Intel GPUs with DPAS, but I'm not familiar enough with
the data layout issues or the Intel GPU memory hierarchy.
Regards,
Kumi
…On Wed, Mar 26, 2025 at 07:18:28AM -0700, Romain Biessy wrote:
@ky438,
> I'm concerned that, as commented in #12035 "This solution is not the better solution on Intel GPU. There is still huge potential of Intel GPU. Need more study work in the feature." that such an approach is doomed to failure, as it is fundamentally flawed in some way, and I would like to understand how.
The key point of this PR is to reorder the quantization format so that the quantized data and metadata are separated, which will allow us to load the quantized data more efficiently. This is also something the ggml-cuda backend does, and it should be beneficial to Intel GPUs too. As far as I understand, the comment you are quoting implies that the way this is done in ggml-sycl is not currently optimal, and that's something we plan to improve.
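To make the layout idea concrete, here is a minimal C++ sketch assuming a simplified Q4-style block; the struct names and block geometry are illustrative, not the actual ggml definitions:

```cpp
#include <cstdint>
#include <vector>

// Interleaved (array-of-structs) layout: each block carries its scale next
// to its 4-bit quants, so streaming the quants also drags the metadata
// through the memory pipeline and breaks up wide, coalesced loads.
struct BlockQ4 {
    uint16_t scale;      // fp16 scale for this block of 32 weights
    uint8_t  quants[16]; // 32 x 4-bit quantized weights, packed 2 per byte
};

// Reordered (struct-of-arrays) layout: all quants contiguous, all scales
// contiguous, so a kernel can stream the quant bytes with wide contiguous
// reads and fetch the scales separately.
struct ReorderedQ4 {
    std::vector<uint8_t>  quants; // n_blocks * 16 bytes, contiguous
    std::vector<uint16_t> scales; // n_blocks scales, contiguous
};

ReorderedQ4 reorder(const std::vector<BlockQ4>& blocks) {
    ReorderedQ4 out;
    out.quants.reserve(blocks.size() * 16);
    out.scales.reserve(blocks.size());
    for (const auto& b : blocks) {
        out.quants.insert(out.quants.end(), b.quants, b.quants + 16);
        out.scales.push_back(b.scale);
    }
    return out;
}
```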
> Is part of the issue here that Intel's SYCL software stack is so bad that, far from being useful as a "high level tensor math compiler", it's a barrier to getting any kind of sane operation out of Intel's GPUs?
No, there's really nothing hinting at that currently, but if there were, we would report it to Intel and see what could be done. I think a lot can be improved with enough effort on ggml-sycl itself.
> When you say "potentially using the matrix engine" - why potentially? Wouldn't you expect performance to be astonishingly poor on Intel's current GPUs if they were unused?
I say potentially because the kernel is currently memory bound in the configurations we have tried. If we're still not able to improve that for some reason, using HMX won't help.
> When you say "SYCL backend", are you referring to the code in src/ggml-sycl, the "intel graphics compiler" as "SYCL backend", or both?
In this case I am referring to ggml-sycl and the whole SYCL stack for Intel devices, meaning the DPC++ compiler and runtime, IGC, and Level Zero.
> Finally, I noticed you added "support" for the DP4A instruction in src/ggml-sycl - isn't it _much_ more important to try and target the DPAS / DPASW instructions https://github.com/intel/intel-graphics-compiler/blob/master/documentation/visa/instructions/DPAS.md ? Either directly or via the SYCL joint_matrix extension?
My dp4a patch was a simple way to improve performance on Nvidia devices.
I agree using DPAS should be a good improvement, assuming we can ensure the kernel is fed with enough data, which is often easier said than done. We don't think the joint_matrix extension will be a good fit due to the way the data is laid out; we're planning to use DPAS instructions more "directly".
I hope this will alleviate your concerns.
-
Hi Romain,
Yes, I was talking about CPU cache hit rates, but for the case of CPU execution.
I'm still struggling to understand how these could be memory bound on
an Intel GPU; this isn't sparse LU factorization, this is dense
matrix multiplication of dense vectors, right? So a relatively "compute
intense" rather than "bandwidth intense" problem, one that should be amenable
to "cache blocking" and similar strategies?
Would back-of-the-envelope calculations be helpful here? I wonder whether, if we
take a specific model and an Intel Arc B580 GPU, we can determine the
performance we would expect to see for:
- Infinite I/O, actual B580 arithmetic
- Infinite arithmetic, actual B580 bandwidth
and compare these numbers to each other and to the observed performance,
which I suspect will be much lower than either. That is why I am guessing
that right now the Intel discrete GPUs are not "truly" memory or compute
bound (despite your comment about DP4A not changing observed performance); a sketch of such a calculation follows below.
Sorry, I realise this is just a very long way of saying "I wonder what
the realistically achievable performance on Intel Arc GPUs might be, and
how far away are we from that right now?"
Thanks again,
Kumi
…On Thu, Mar 27, 2025 at 04:19:33AM -0700, Romain Biessy wrote:
I think you are talking about cache hit rates on the CPU, right? When I say the kernel is memory bound, I mean that the limitation comes from the kernel not moving data from the global memory of the device to the registers (of the device) in the most efficient way. This is not something that can be measured by cache hit rates on the CPU. VTune can give you more details on this.
Also note that our approach is to try to optimize a different variant of the mul_mat_vec_q kernel than the one currently selected for Intel devices. We want to optimize [mul_mat_vec_q](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/mmvq.cpp#L6), which is the one selected by the CUDA backend but relies on shared memory. The equivalent of that for Intel GPUs is to use prefetch. This kernel is the one using dp4a, and we noticed that switching from a software implementation of dp4a to a hardware one had no impact, hence we suspected that this kernel is memory bound.
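For context, dp4a is a four-way 8-bit dot product with accumulate; a software fallback looks roughly like the following generic sketch (not the actual ggml-sycl implementation):

```cpp
#include <cstdint>
#include <cstring>

// Software dp4a: treat each 32-bit operand as four signed 8-bit lanes,
// multiply lane-wise, and accumulate into c. Hardware dp4a performs the
// same operation in a single instruction.
int32_t dp4a_sw(int32_t a, int32_t b, int32_t c) {
    int8_t va[4], vb[4];
    std::memcpy(va, &a, 4); // reinterpret the 32-bit words as 4 x int8
    std::memcpy(vb, &b, 4);
    for (int i = 0; i < 4; ++i) {
        c += int32_t(va[i]) * int32_t(vb[i]);
    }
    return c;
}
```

If swapping this loop for the hardware instruction changes nothing, arithmetic was never the bottleneck, which is what points to memory.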
-
What is the current status of support for Intel Arc GPUs such as the B580?
In testing, the B580's OpenGL performance is quite good, but llama.cpp text-generation performance via SYCL is extremely bad.
Does anyone understand exactly why Intel's GPUs perform so terribly on llama.cpp? I am guessing there are multiple defects in the software stack, starting from llama.cpp's SYCL backend and continuing down?