
OpenCL: Fix element-wise multiplication #3656

Merged 1 commit into ggerganov:master on Oct 18, 2023

Conversation

shibe2 (Contributor) commented Oct 17, 2023

Tested in isolation. I also confirmed that LLaMA inference works, but I think it doesn't actually use ggml_cl_mul.

ggerganov (Owner)

but I think it doesn't actually use ggml_cl_mul.

Hm, that would be surprising - it should go through it for LLaMA:

llama.cpp/ggml.c

Lines 10038 to 10046 in cb33f43

#ifdef GGML_USE_CLBLAST
    if (src1->backend == GGML_BACKEND_GPU) {
        if (ith == 0) {
            ggml_cl_mul(src0, src1, dst);
        }
        return;
    }
#endif

If it does not get called, there might be some issue with the offloading when using OpenCL.

shibe2 (Contributor, Author) commented Oct 18, 2023

Hm, that would be surprising - it should go through it for LLaMA:
If it does not get called, there might be some issue with the offloading when using OpenCL

It would be called when src1 is already in VRAM, which implies that it must contain model parameters. Which LLaMA parameters are in F32 format and need to be a src of MUL?

ggerganov (Owner)

The attention norm should be F32 since it is a 1D tensor, so this op should be offloaded, I think:

llama.cpp/llama.cpp

Lines 3249 to 3253 in cb33f43

// cur = cur*attn_norm(broadcasted)
cur = ggml_mul(ctx0, cur, model.layers[il].attn_norm);
offload_func(cur);
ggml_set_name(cur, "attention_norm_0");

shibe2 (Contributor, Author) commented Oct 18, 2023

Yeah, I was wrong. ggml_cl_mul can indeed be used in LLaMA inference. It seems the particular case where src1 is 1D was not affected by the bug fixed here.

I first confirmed that ggml_cl_mul is being called. Then I ran the exact same command line before and after this change. In both cases the output looks okay.

This change fixes multiplication of higher-dimensional tensors. Different combinations of sizes and permutations were tested in isolation. In all cases, CPU and GPU now produce exactly the same results.

@ggerganov ggerganov merged commit 1117d06 into ggerganov:master Oct 18, 2023
32 checks passed