
cuBLAS: refactor and optimize f16 mat mul performance #1259

Merged: slaren merged 4 commits into ggerganov:master from slaren:cuda-mat-mul on May 1, 2023

Conversation

slaren (Collaborator) commented Apr 30, 2023

Moves all the cuBLAS-specific code from ggml.c to ggml-cuda.cu. This also makes ggml-cuda.h much simpler, since fewer definitions have to be exposed now.

Additionally, improves mat mul performance by using multiple streams where possible (when multiplying 3- or 4-dimensional tensors), and by performing f16 x f32 mat muls either as f16 x f16 or as f32 x f32, depending on which requires less data to be transferred to the GPU (both ideas are sketched below).
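
A minimal sketch of the two ideas, assuming hypothetical helper names and simplified shapes; the actual ggml-cuda.cu implementation differs in details such as pinned host buffers, error checking, and tensor strides:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

#define GGML_CUDA_MAX_STREAMS 8 // hypothetical constant for this sketch

// Idea 1: one stream per 2D slice of a 3D/4D tensor, so the uploads and GEMMs
// of independent slices can overlap. All pointers here are device pointers;
// uploads would use cudaMemcpyAsync on the same stream as the slice's GEMM.
void mul_mat_batched_f32(cublasHandle_t handle,
                         cudaStream_t streams[GGML_CUDA_MAX_STREAMS],
                         const float * src0, const float * src1, float * dst,
                         int m, int n, int k, int batches) {
    const float alpha = 1.0f, beta = 0.0f;
    for (int b = 0; b < batches; ++b) {
        cublasSetStream(handle, streams[b % GGML_CUDA_MAX_STREAMS]);
        // column-major GEMM on slice b: dst = src0^T * src1
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    m, n, k, &alpha,
                    src0 + (size_t)b*m*k, k,
                    src1 + (size_t)b*k*n, k, &beta,
                    dst  + (size_t)b*m*n, m);
    }
}

// Idea 2: pick the GEMM precision that moves fewer bytes to the GPU.
// Assumed cost model (not the exact PR logic): the f32 path uploads src0 in
// its native format plus src1 as f32; the f16 path uploads src0 as f16 plus
// src1 converted to f16 on the CPU. The CPU-side conversion cost is ignored.
bool mul_mat_use_f16(size_t src0_native_bytes, size_t src0_nelem, size_t src1_nelem) {
    const size_t f32_transfer = src0_native_bytes + src1_nelem * sizeof(float);
    const size_t f16_transfer = src0_nelem * 2    + src1_nelem * 2;
    return f16_transfer < f32_transfer;
}
```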

Overall, improves perplexity times with cuBLAS by ~15%.

🤖 Generated by Copilot at 4e54943

Summary

🚀🧹🛠️

This pull request improves the performance, compatibility, and readability of the GGML library and the llama model loader. It refactors the CUDA and BLAS code, simplifies the error checking and memory management, and exposes some useful functions and macros. The main files affected are ggml-cuda.h, ggml.c, ggml.h, llama-util.h, and llama.cpp.

ggml refactored
CUDA and BLAS streamlined
Winter of llama

Walkthrough

  • Refactored the code for using cuBLAS for matrix multiplication in GGML, by moving the CUDA-related functions and macros to ggml-cuda.h and calling them from ggml.c with conditional compilation (see the sketch after this list)
  • Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c
  • Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available to other source files that use the GGML library
  • Improved the code style and quality in ggml.c by removing unused variables, empty lines, and redundant conditional compilation
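
A hedged sketch of the conditional-compilation dispatch described in the first bullet; ggml_cuda_can_mul_mat and ggml_cuda_mul_mat stand in for whatever the header actually declares:

```cpp
#include "ggml.h"
#if defined(GGML_USE_CUBLAS)
#include "ggml-cuda.h" // declares the CUDA entry points; no CUDA types leak into ggml.c
#endif

static void compute_forward_mul_mat(const struct ggml_tensor * src0,
                                    const struct ggml_tensor * src1,
                                    struct ggml_tensor * dst) {
#if defined(GGML_USE_CUBLAS)
    // hypothetical names: ask the CUDA backend whether it can handle this op,
    // and hand the whole computation off if so
    if (ggml_cuda_can_mul_mat(src0, src1, dst)) {
        ggml_cuda_mul_mat(src0, src1, dst);
        return;
    }
#endif
    // ... CPU fallback path ...
    (void) src0; (void) src1; (void) dst;
}
```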

From #1233:

  • Enhanced the llama_buffer and llama_ctx_buffer structs in llama-util.h, by adding default constructors and deleting the copy and move constructors and assignment operators, to prevent memory leaks or errors such as double frees (see the sketch after this list)
  • Optimized the initialization of temporary buffers in the llama_model_loader struct in llama.cpp, by using the std::vector constructor instead of the resize method
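
A hedged sketch of the rule-of-five change, close in spirit to llama-util.h but not a verbatim copy:

```cpp
#include <cstddef>
#include <cstdint>

struct llama_buffer {
    uint8_t * addr = nullptr;
    size_t size = 0;

    llama_buffer() = default;

    void resize(size_t len) {
        delete[] addr;
        addr = new uint8_t[len];
        size = len;
    }

    ~llama_buffer() { delete[] addr; }

    // deleted copy/move: two buffers must never own the same allocation
    llama_buffer(const llama_buffer &) = delete;
    llama_buffer & operator=(const llama_buffer &) = delete;
    llama_buffer(llama_buffer &&) = delete;
    llama_buffer & operator=(llama_buffer &&) = delete;
};
```

This likely also explains the loader change in the second bullet: std::vector::resize requires the element type to be move-insertable, which the deleted move constructor forbids, while constructing the vector at its final size (std::vector<llama_buffer> bufs(n)) only requires default construction.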

slaren marked this pull request as draft April 30, 2023 22:13
slaren (Collaborator, Author) commented Apr 30, 2023

> Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c

Specifically, this adds vector versions of ggml_fp16_to_fp32 and ggml_fp32_to_fp16. ggml_fp32_to_fp16_row is vectorized with F16C (a sketch follows below). This was necessary because GGML_FP32_TO_FP16 isn't visible from ggml-cuda.cu, and ggml_fp32_to_fp16 is too slow without inlining.
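
A hedged sketch of an F16C-vectorized row conversion in the spirit of ggml_fp32_to_fp16_row; the real ggml implementation may handle tails and non-F16C fallbacks differently. Requires a CPU with F16C (compile with -mf16c):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

typedef uint16_t ggml_fp16_t; // storage type for IEEE half

void fp32_to_fp16_row(const float * x, ggml_fp16_t * y, size_t n) {
    size_t i = 0;
    // convert 8 floats at a time: one 256-bit load, one cvtps_ph, one 128-bit store
    for (; i + 8 <= n; i += 8) {
        __m256  v = _mm256_loadu_ps(x + i);
        __m128i h = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        _mm_storeu_si128((__m128i *)(y + i), h);
    }
    // scalar tail, still using the F16C scalar intrinsic
    for (; i < n; ++i) {
        y[i] = _cvtss_sh(x[i], _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    }
}
```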

> Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available for other source files that use the GGML library

GGML_ASSERT is now exposed in ggml.h. I did this to be able to use it from ggml-cuda.cu, but if this is not desirable I can remove it.

slaren marked this pull request as ready for review May 1, 2023 11:38
Review thread on ggml-cuda.cu (outdated diff):

```diff
- __half d;      // delta
- __half m;      // min
+ half d;        // delta
+ half m;        // min
  uint32_t qh;   // 5-th bit of quants
```
ggerganov (Owner) commented:
At some point, should sync the CUDA block_q5_1 with the CPU one:

```cpp
uint8_t qh[4]; // 5-th bit of quants
```

slaren (Collaborator, Author) replied:

I am not entirely sure why this isn't the case already; did you have any problems with alignment or anything else?

slaren (Collaborator, Author) added:

I updated it in the same way as q5_0 and didn't notice any issues.

ggerganov (Owner) replied:

For Q5_1 it works both ways.
For Q5_0, the uint32_t way does not work due to alignment issues, so we changed Q5_1 to uint8_t[4] for consistency. (See the sketch below.)
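
A hedged illustration of the alignment point, using field names that mirror ggml's block_q5_0; the exact sizes assume typical x86-64 struct layout rules:

```cpp
#include <cstdint>

typedef uint16_t ggml_fp16_t; // ggml's half-precision storage type

// uint32_t variant: the 4-byte qh must be 4-aligned, so the compiler inserts
// 2 bytes of padding after the 2-byte delta, and the block grows to 24 bytes,
// no longer matching the 22-byte serialized block.
struct q5_0_padded {
    ggml_fp16_t d;       // delta
    uint32_t    qh;      // 5-th bit of quants
    uint8_t     qs[16];  // nibbles / quants
};
static_assert(sizeof(q5_0_padded) == 24, "padding changes the block size");

// uint8_t[4] variant: byte arrays have no alignment requirement, so the block
// stays tightly packed at 22 bytes and can overlay memory-mapped model data
// at any offset.
struct q5_0_packed {
    ggml_fp16_t d;       // delta
    uint8_t     qh[4];   // 5-th bit of quants
    uint8_t     qs[16];  // nibbles / quants
};
static_assert(sizeof(q5_0_packed) == 22, "matches the on-disk layout");
```

This also suggests why Q5_1 "works both ways": its header is two f16 fields (delta and min, 4 bytes total), so a following uint32_t qh happens to land on a 4-byte boundary and both layouts end up the same size.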

slaren merged commit 58b367c into ggerganov:master May 1, 2023
slaren deleted the cuda-mat-mul branch May 1, 2023 16:11