[User] GGML_ASSERT failure for opencl #3002
Comments
Seeing this issue as well; it only affects Llama 2. @rhogenson are you getting this error with Llama 1 models too?
Maybe related to #2736?
Thanks @KerfuffleV2. FWIW I'm on Nvidia, so I'm not sure whether ROCM is an option or whether this could be related to #2736.
Oh, apologies. I'm actually a moron, but the answer is still almost the same except now I'd say you should use CUDA! For all the same reasons.
No, you don't know "moron"; I have dreams about ggml being in Python so I can understand it… I may resort to CUDA, but I like that OpenCL is open source. What do you think the chances are of us figuring out the root of the issue? Also, kind of off-topic, but do you happen to have anything to say about this issue (leejet/stable-diffusion.cpp#48), also about OpenCL?
I'm afraid the chances of me figuring it out are pretty low. As soon as the ROCM stuff got merged, I switched to that. I can't speak for anyone else, but it seems like the use case for OpenCL is pretty limited right now: people on AMD hardware will use ROCM, people on Nvidia will mostly use CUDA, people on Mac will use Metal. The only users that would still use OpenCL now are... people with Intel GPUs? (And you, of course. :) Also, it seems like even when it is working, the performance is considerably lower than options like CUDA or ROCM.
Sorry, I don't really know anything about that.
Thank you for taking time out of your day to talk 🥇
I haven't tested this with a Llama 1 model, so I can't say whether this issue affects Llama 1 as well. However, I did some limited testing with other quantization levels, and it seems like it only breaks with the new k-quant quantizations; I didn't do much testing though, so that might not be true. Unfortunately NixOS doesn't support ROCM version 5.6 yet; we're still on 5.4. It should work on ROCM, as this assertion failure happens inside OpenCL-only code. Digging into the code, it looks like this is the result of a TODO that was never completed: the GGML OpenCL driver doesn't support matrix multiplication when the 2nd or 3rd dimensions are not the same. I was able to work around this by skipping the OpenCL matrix multiplication in that case and falling back to the slow CPU matrix multiplication. It's pretty slow though, so a real fix would need to implement this missing feature in the OpenCL code.
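Roughly, the guard I mean looks like the sketch below (illustrative only, not my exact patch; `ne02`/`ne03` are the higher dimensions of `src0` and `ne12`/`ne13` those of `src1`, following the usual ggml naming):

```c
// Illustrative sketch of the workaround (not the exact patch): in
// ggml_compute_forward_mul_mat, only hand the multiplication to the OpenCL
// path when the higher dimensions of src0 and src1 already match. With GQA
// models (e.g. Llama 2 70B) ne12 is a multiple of ne02, which the OpenCL
// path does not handle yet, so fall through to the plain CPU implementation.
#if defined(GGML_USE_CLBLAST)
    const bool needs_broadcast = (ne02 != ne12) || (ne03 != ne13);
    if (!needs_broadcast && ggml_cl_can_mul_mat(src0, src1, dst)) {
        if (params->ith == 0 && params->type == GGML_TASK_COMPUTE) {
            ggml_cl_mul_mat(src0, src1, dst, params->wdata, params->wsize);
        }
        return;
    }
#endif
```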
Can you point to a line number where it's "doing this, but it should be doing that"… or is it not that straightforward?
It's simple-ish, but it's been a while since I took linear algebra 😅 Take a look at line 11234 of ggml.c; it has a TODO there about this case.
Also look at the code for OpenBLAS just below it; a similar assert was removed in commit ef3f333, which added this broadcasting for the OpenBLAS implementation.
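To give a feel for the missing piece, here is a sketch of the broadcast bookkeeping that the OpenBLAS branch gained (paraphrased, not the actual diff from ef3f333): instead of asserting that the higher dimensions match, each slice of src1 is mapped back to a src0 slice by integer division, and one 2-D multiplication is done per pair of slices. The OpenCL path would need the same idea.

```c
// Paraphrased sketch of the broadcasting the OpenCL path is missing
// (names follow the ggml convention; nbXY are byte strides).
const int64_t r2 = ne12 / ne02;   // how many src1 slices share one src0 slice in dim 2
const int64_t r3 = ne13 / ne03;   // same for dim 3

for (int64_t i13 = 0; i13 < ne13; i13++) {
    for (int64_t i12 = 0; i12 < ne12; i12++) {
        const int64_t i03 = i13 / r3;  // src0 slice to reuse for this src1 slice
        const int64_t i02 = i12 / r2;

        const char  * src0_slice = (const char *) src0->data + i02*nb02 + i03*nb03;
        const float * src1_slice = (const float *) ((const char *) src1->data + i12*nb12 + i13*nb13);
        float       * dst_slice  =       (float *) ((      char *)  dst->data + i12*nb2  + i13*nb3);

        // ... perform one ne01 x ne11 matrix multiplication for this pair of slices ...
    }
}
```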
I implemented what's needed for Llama 2 inference on CLBlast. However, I'm having issues with the current llama.cpp code and cannot test it there, so I ported my changes to an earlier version of the codebase that uses the GGJT format. Diff here. Note that it is on top of other changes that need review on their own. Full set of changes here. I will port it back to the current code when the problems are resolved. Update: my issues with inference may be related to RoPE scaling. I'm now downloading a model with the normal 4096 context size.
I got Llama 2 70B working and tested my implementation. Output at 0 temperature is slightly different between the CPU and CLBlast builds, but both are okay. Prompt processing is significantly faster with CLBlast. I'm still having issues with Code Llama, though. My implementation is here: https://github.com/shibe2/llama.cpp/tree/cl-gqa
You might want to make a draft pull request rather than just commenting here. There's a better chance of people seeing it and giving feedback; it's kind of hidden away in this issue currently.
@KerfuffleV2 I intend to merge it in parts. First, fix bugs, then add functionality. I'm not confident that I fixed the bugs properly, and I want to put emphasis on that part. Currently pending resolution of #3307.
Code Llama 34B is fixed after applying #3315. CLBlast GQA works. The speed increase over the CPU build is similar to the Llama 2 70B case.
I am falling down the rabbit hole. Figuratively speaking, whenever I take a thing apart to fix bugs, there are more bugs inside! I don't think we can count on the people who wrote the code to fix all these bugs, so I'll probably have to dive even deeper into it and try to fix them myself. This may take a long time, so I've changed my plan: I will try to merge the broadcasting implementation soon and take time to work on the related bugs. I removed the bugfixes from my working branch; all you need to apply is the commit "CLBlast: Add broadcast support for matrix multiplication". I tested it with a few Llama 2 models. Output of a model that doesn't use GQA is unaffected by the change, and the larger, GQA models work. As I mentioned earlier, the CPU and CLBlast builds produce different outputs when inputs and parameters are the same. Furthermore, other bugs don't affect Llama 2 inference but prevent me from testing my implementation thoroughly.
Just to make sure, you’re using a fixed seed number?
@Happenedtostumblein No, I use 0 temperature. Outputs from the CPU and CLBlast builds are each consistent, but they differ from each other.
Pretty sure this is normal and also the case for CUDA, ROCM and Metal. |
My hunch is that in this case it is not normal, and that quality may degrade because of it. When I have time, I want to actually measure the quality of the outputs and compare the effects of quantization to the effects of computation errors.
You could try comparing the perplexity numbers on a small model. Running on the CPU is really slow but you could compare with the results in the main README. |
Update on the discrepancies between the CPU and CLBlast back-ends: it may be that CLBlast produces more precise results. Further investigation is needed, and if something worth fixing turns up, a separate issue report should be filed. For now, I will work on other issues, since this doesn't seem to be a problem in the CLBlast back-end.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I am trying to use the WizardCoder Python 34B model with llama.cpp, using the OpenCL drivers. I am using a Radeon 7900 XT.
Current Behavior
Llama.cpp crashes with the following output:
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
I'm building from the latest flake.nix file.
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
This failure is reproducible with the WizardCoder Python 34B model, q4_K_M quantization downloaded from here: https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF
I used the following command to build and run:
Compiling normally with OpenCL support should also reproduce it. I can see that the GGML_ASSERT failure is happening inside OpenCL-specific code, so it's important that this is the OpenCL build.
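For context, the assert that fires is of this general shape (paraphrased from the OpenCL matrix-multiplication path; the exact wording and line number depend on the revision):

```c
// Paraphrased: the OpenCL mul_mat code assumes the higher dimensions of the
// two operands are equal, which is not true for GQA models like this 34B one.
GGML_ASSERT(ne02 == ne12);
GGML_ASSERT(ne03 == ne13);
```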
Failure Logs
Environment info: