
CLBlast: Support broadcasting for matrix multiplication and GQA #3402

Merged
merged 1 commit into ggerganov:master on Oct 2, 2023

Conversation

Contributor

@shibe2 shibe2 commented Sep 29, 2023

Broadcast src0 into src1 across dimensions 2 and 3 when needed, in line with other back-ends.
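For illustration, here is a minimal sketch of that mapping in ggml's convention (the ne* extent names follow ggml's naming; the example values and the elided matmul body are only illustrative):

```c
// Dims 2 and 3: src0 has ne02 x ne03 slices, src1 has ne12 x ne13.
// Broadcasting requires ne12 % ne02 == 0 and ne13 % ne03 == 0.
// Example values for a GQA model with 32 heads and 8 KV heads:
const int64_t ne02 = 8,  ne03 = 1;  // src0 (e.g. K or V, one slice per KV head)
const int64_t ne12 = 32, ne13 = 1;  // src1 (e.g. Q, one slice per head)
const int64_t r2 = ne12 / ne02;     // broadcast ratio in dim 2 (= 4)
const int64_t r3 = ne13 / ne03;     // broadcast ratio in dim 3 (= 1)

for (int64_t i13 = 0; i13 < ne13; i13++) {
    for (int64_t i12 = 0; i12 < ne12; i12++) {
        // src1 slice (i12, i13) pairs with the src0 slice it broadcasts from:
        const int64_t i02 = i12 / r2;
        const int64_t i03 = i13 / r3;
        // ... 2D matmul of src0 slice (i02, i03) with src1 slice (i12, i13) ...
    }
}
```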

Fixes #3002
Fixes #3263

Successfully tested with Llama 2 models, both GQA and non-GQA. I've also done some limited testing of matrix multiplication in isolation. However, a number of existing bugs in ggml-opencl.cpp interfere with my testing:

  • 3D and 4D tensors are uploaded to device memory incorrectly (CLBlast: byte offset / element count confusion #3307)
  • Even if uploaded properly, data for 3D and 4D tensors is addressed incorrectly during computations
  • Special code path for vector dot product (i.e. when matrices have 1 row) produces incorrect results
  • Results of matrix multiplication by CPU and CLBlast back-ends differ significantly when src0 is quantized

Despite that, I believe that this change does not break inference with previously supported models, and the newly enabled functionality is on par with the existing one.

Broadcast src0 into src1 across dimensions 2 and 3 when needed.
This is required for models that use GQA.
Owner

@ggerganov ggerganov left a comment

Thanks for taking a look at the OpenCL backend.
My impression is that lately it has been getting less and less attention, so it is possible that there are multiple issues, as you describe.

Would be great to fix these, and this looks like a good step.

@shibe2
Contributor Author

shibe2 commented Oct 2, 2023

I've done some more testing, and my confidence in the correctness of the results has increased.

If anyone wants to review the code style and commit message, please do. If there are no suggestions on that front, this can be merged.

@0cc4m
Collaborator

0cc4m commented Oct 2, 2023

I can take a look later today.

Collaborator

@0cc4m 0cc4m left a comment

Looks good. Thank you for fixing this issue.

@0cc4m 0cc4m merged commit 665018c into ggerganov:master Oct 2, 2023
32 checks passed
@cmp-nct
Contributor

cmp-nct commented Oct 4, 2023

It's quite an awesome little addition.

I noticed the lack of broadcasting support in ggml-cuda and tried to add it much like this patch (except that in ggml-cuda it belongs in the op() function, since there is no dedicated multiplication one), but the resulting performance was a fraction of the CPU speed (in a multi-GPU test).
It synchronizes on each outer loop iteration, so with multiple queries that means tens of thousands of synchronizations per multiplication.
It looks more manageable in CLBlast.
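To illustrate the pattern (gpu_enqueue_matmul and gpu_synchronize here are hypothetical stand-ins, not llama.cpp functions):

```c
// (a) Synchronizing inside the broadcast loop stalls the device once
//     per slice -- tens of thousands of stalls at GQA slice counts:
for (int i = 0; i < n_slices; i++) {
    gpu_enqueue_matmul(&slices[i]); // hypothetical async launch
    gpu_synchronize();              // per-slice sync dominates the runtime
}

// (b) Enqueue every slice first, then synchronize once:
for (int i = 0; i < n_slices; i++) {
    gpu_enqueue_matmul(&slices[i]);
}
gpu_synchronize();
```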

(I should add that the CUDA/cuBLAS attempt was ~2 months ago; maybe the underlying code has changed a lot, but it didn't look so at first glance.)

@shibe2
Contributor Author

shibe2 commented Oct 4, 2023

I made it so that it loops over dimensions 2 and 3 of src1 and computes the corresponding coordinates in src0. But it would be better to have the outer loops over src0 and the inner loops over src1, as sketched below.
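Roughly, with the same r2/r3 broadcast ratios as in the description above, that alternative would look like this (a sketch only, not the merged code):

```c
// Outer loops over src0, so each src0 slice is prepared (uploaded,
// dequantized) once and then reused for its whole broadcast group.
for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        // ... prepare src0 slice (i02, i03) once here ...
        for (int64_t i13 = i03*r3; i13 < (i03 + 1)*r3; i13++) {
            for (int64_t i12 = i02*r2; i12 < (i02 + 1)*r2; i12++) {
                // ... multiply with src1 slice (i12, i13) ...
            }
        }
    }
}
```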

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 5, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)

Broadcast src0 into src1 across dimensions 2 and 3 when needed.
This is required for models that use GQA.