Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLBlast: q5_0, q5_1, q8_0 dequant kernels #1225

Merged
merged 5 commits into from
Apr 30, 2023

Conversation

0cc4m
Copy link
Collaborator

@0cc4m 0cc4m commented Apr 29, 2023

I had or still have an issue with q5_0 that I can't figure out. On Nvidia trying to transfer the quantized weights to the device leads to a CL_OUT_OF_RESOURCES error. On AMD and on POCL it leads to a segfault. It seems to have a problem with 22 byte structs, while 20 or 24 bytes are alright. I am not sure why this is the case.

As a workaround I copy the weights into a new struct and do the FP16 to FP32 conversion on CPU. This seems to have little overhead and works, but it should not be needed. If anyone knows what's up here please let me know.

I also moved the .cl file into the opencl.c as requested.

Copy link
Collaborator

@LostRuins LostRuins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ggerganov
Copy link
Owner

For Q5_0 you have to copy the qh bytes into a uint32_t instead of casting a pointer as we usually do:

llama.cpp/ggml-cuda.cu

Lines 134 to 137 in c3ca7a5

uint32_t qh;
memcpy(&qh, x[i].qh, sizeof(qh));

@0cc4m
Copy link
Collaborator Author

0cc4m commented Apr 30, 2023

@ggerganov I did, that happens during the transfer to the GPU.

https://github.com/0cc4m/koboldcpp/blob/369d903edabd5c0acd866337542cbb4150485940/ggml-opencl.c#L74-L79

It works just fine for Q5_1. Q5_0 has a different problem that I can only suspect is related to memory alignment.

ggml-opencl.c Outdated
cl_host_b = (cl_block_q5_0*) malloc(sizeof(cl_block_q5_0) * global / 32);
for (size_t i = 0; i < global / 32; i++) {
cl_host_b[i].d = ggml_fp16_to_fp32(b[i].d);
memcpy(&cl_host_b[i].qh, b[i].qh, sizeof(uint32_t) + QK5_0 / 2);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use 2x memcpy - one for qh and one for qs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@LostRuins
Copy link
Collaborator

Btw might want to add a #include <stdlib.h> as it appears to prevent compilation on Termux without it.

@ggerganov ggerganov merged commit 76a8849 into ggerganov:master Apr 30, 2023
@0cc4m 0cc4m deleted the clblast-further-dequant-kernels branch April 30, 2023 20:18
@SlyEcho
Copy link
Collaborator

SlyEcho commented May 13, 2023

It was an alignment / padding issue. the half and uint were together 8 bytes, not 6 as expeted which meant the whole struct was too big and the memory was accessed out of bounds.

In #1422 I added an __attribute__((packed)) to the CL struct. It's not an extension or anything, it's part of the OpenCL language.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants