
CLBlast: Support broadcasting for matrix multiplication and GQA #3402

Merged
merged 1 commit into ggerganov:master on Oct 2, 2023

Conversation

Contributor

@shibe2 shibe2 commented Sep 29, 2023

Broadcast src0 into src1 across dimensions 2 and 3 when needed, in line with other back-ends.
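For illustration, here is a minimal sketch of that mapping in ggml's convention (the ne* extent names follow ggml's naming; the example values and the elided matmul body are only illustrative):

```c
// Dims 2 and 3: src0 has ne02 x ne03 slices, src1 has ne12 x ne13.
// Broadcasting requires ne12 % ne02 == 0 and ne13 % ne03 == 0.
// Example values for a GQA model with 32 heads and 8 KV heads:
const int64_t ne02 = 8,  ne03 = 1;  // src0 (e.g. K or V, one slice per KV head)
const int64_t ne12 = 32, ne13 = 1;  // src1 (e.g. Q, one slice per head)
const int64_t r2 = ne12 / ne02;     // broadcast ratio in dim 2 (= 4)
const int64_t r3 = ne13 / ne03;     // broadcast ratio in dim 3 (= 1)

for (int64_t i13 = 0; i13 < ne13; i13++) {
    for (int64_t i12 = 0; i12 < ne12; i12++) {
        // src1 slice (i12, i13) pairs with the src0 slice it broadcasts from:
        const int64_t i02 = i12 / r2;
        const int64_t i03 = i13 / r3;
        // ... 2D matmul of src0 slice (i02, i03) with src1 slice (i12, i13) ...
    }
}
```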

Fixes #3002
Fixes #3263

Successfully tested with Llama 2 models, both GQA and non-GQA. I've also done some limited testing of matrix multiplication in isolation. However, a number of existing bugs in ggml-opencl.cpp interfere with my testing:

  • 3D and 4D tensors are uploaded to device memory incorrectly (CLBlast: byte offset / element count confusion #3307)
  • Even if uploaded properly, data for 3D and 4D tensors is addressed incorrectly during computations
  • Special code path for vector dot product (i.e. when matrices have 1 row) produces incorrect results
  • Results of matrix multiplication by CPU and CLBlast back-ends differ significantly when src0 is quantized

Despite that, I believe that this change does not break inference with previously supported models, and the newly enabled functionality is on par with the existing one.

Broadcast src0 into src1 across dimensions 2 and 3 when needed.
This is required for models that use GQA.
Owner

@ggerganov ggerganov left a comment

Thanks for taking a look at the OpenCL backend.
My impression is that lately it has been getting less and less attention, so it is possible that there are multiple issues, as you describe.

Would be great to fix these, and this looks like a good step.

@shibe2
Contributor Author

shibe2 commented Oct 2, 2023

I've done some more testing, and my confidence in the correctness of the results has increased.

If anyone wants to review the code style and commit message, please do. If there are no suggestions on that front, this can be merged.

@0cc4m
Collaborator

0cc4m commented Oct 2, 2023

I can take a look later today.

Collaborator

@0cc4m 0cc4m left a comment

Looks good. Thank you for fixing this issue.

@0cc4m 0cc4m merged commit 665018c into ggerganov:master Oct 2, 2023
32 checks passed
@cmp-nct
Contributor

cmp-nct commented Oct 4, 2023

It's quite an awesome little addition.

I noticed the lack of broadcasting support in ggml-cuda and tried to add it much like this patch (except that in ggml-cuda it belongs in the op() function, since there is no dedicated multiplication one), but the resulting performance was a fraction of the CPU speed (in a multi-GPU test).
It synchronizes on each outer loop iteration, so with multiple queries that means tens of thousands of synchronizations per multiplication.
It looks more manageable in CLBlast.
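To illustrate the pattern (gpu_enqueue_matmul and gpu_synchronize here are hypothetical stand-ins, not llama.cpp functions):

```c
// (a) Synchronizing inside the broadcast loop stalls the device once
//     per slice -- tens of thousands of stalls at GQA slice counts:
for (int i = 0; i < n_slices; i++) {
    gpu_enqueue_matmul(&slices[i]); // hypothetical async launch
    gpu_synchronize();              // per-slice sync dominates the runtime
}

// (b) Enqueue every slice first, then synchronize once:
for (int i = 0; i < n_slices; i++) {
    gpu_enqueue_matmul(&slices[i]);
}
gpu_synchronize();
```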

(I should add that the CUDA/cuBLAS attempt was ~2 months ago; maybe the underlying code has changed a lot, but it didn't look so at first glance.)

@shibe2
Contributor Author

shibe2 commented Oct 4, 2023

I made it so that it loops over dimensions 2 and 3 of src1 and computes the corresponding coordinates in src0. But it would be better to have the outer loops over src0 and the inner loops over src1, as sketched below.
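Roughly, with the same r2/r3 broadcast ratios as in the description above, that alternative would look like this (a sketch only, not the merged code):

```c
// Outer loops over src0, so each src0 slice is prepared (uploaded,
// dequantized) once and then reused for its whole broadcast group.
for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        // ... prepare src0 slice (i02, i03) once here ...
        for (int64_t i13 = i03*r3; i13 < (i03 + 1)*r3; i13++) {
            for (int64_t i12 = i02*r2; i12 < (i02 + 1)*r2; i12++) {
                // ... multiply with src1 slice (i12, i13) ...
            }
        }
    }
}
```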

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 5, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)

Broadcast src0 into src1 across dimensions 2 and 3 when needed.
This is required for models that use GQA.