
GPU-accelerated token generation (new quantization format) #1412

Merged: 9 commits into ggerganov:master on May 13, 2023

Conversation

@JohannesGaessler (Collaborator) commented May 12, 2023

Build instructions (Linux):

git clone https://github.com/JohannesGaessler/llama.cpp llama.cpp-johannesgaessler
cd llama.cpp-johannesgaessler                               
git fetch
git switch dequantize-matmul-4
make LLAMA_CUBLAS=1

For building on Windows, read the llama.cpp README.

This PR is a replacement for #1375 that works with the new quantization format; read that PR for more information. People with no git experience are already using that version, so I'm making a new branch and PR to avoid unnecessary confusion, especially with the breaking quantization changes.

The goals of this PR:

  • Provide an implementation of GPU-accelerated matrix vector multiplication for all quantization types.
  • Provide an implementation that works on as broad a range of hardware as possible.
  • Load weights directly into VRAM to reduce RAM usage.

Not the goals of this PR:

  • Squeeze out every last % of performance. GPU performance optimizations like varying block sizes seem to depend strongly on the specific GPU model, so optimizing this will likely take a long time and require feedback from many people. I would like to do that at a later time.
  • Accelerating operations other than matrix vector multiplications and avoiding the copying of vectors between CPU and GPU.
  • Multi-GPU support. As of right now I don't have a machine that I could test this on.
  • iGPU support. Again, I don't have a machine to test this on.

In other news, the quantization changes make a big difference for my kernel implementation. I can now get 14.53 t/s with a GTX 1070 for 7b which is 16% faster than with the old quantization method. I think the reason is memory coalescing when reading the vector values.

@JohannesGaessler (Collaborator, Author) commented May 12, 2023

I implemented q4_1 by making a dequantize_mul_mat_vec kernel template that accepts another kernel as a template parameter to determine how the dequantization should be done. I'm aware that the README tells you not to use templates, but I think this is the most elegant solution: it allows you to write the matrix vector multiplication code only once; you only need to provide different kernels for dequantization. The dequantization kernels can then also be reused to dequantize matrices for cuBLAS prompt processing. @ggerganov please provide feedback regarding the software design.
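
To illustrate the shape of this design, here is a minimal sketch; the block layout, launch configuration, reduction, and wrapper signature are simplified assumptions, not the exact code in this PR:

// Simplified sketch: the matrix-vector multiplication is written once and the
// dequantization step is supplied as a template parameter. The q8_0 block layout
// below is illustrative, not necessarily the exact ggml struct.
#include <cuda_runtime.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;            // scale
    int8_t qs[QK8_0];    // quantized values
} block_q8_0;

// signature shared by all dequantization "plug-ins"
typedef void (*dequantize_kernel_t)(const void * vx, int ib, int iqs, float & v0, float & v1);

// one possible dequantization kernel: converts two quants of block ib to floats
static __device__ void dequantize_q8_0(const void * vx, int ib, int iqs, float & v0, float & v1) {
    const block_q8_0 * x = (const block_q8_0 *) vx;
    v0 = x[ib].qs[iqs + 0] * x[ib].d;
    v1 = x[ib].qs[iqs + 1] * x[ib].d;
}

// generic matrix-vector multiplication; assumes one warp (32 threads) per row
// and ncols divisible by the quantization block size qk
template <int qk, dequantize_kernel_t dequantize_kernel>
static __global__ void dequantize_mul_mat_vec(const void * vx, const float * y, float * dst, const int ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    float partial = 0.0f;
    for (int i = 2*tid; i < ncols; i += 2*blockDim.x) {
        const int index = row*ncols + i;
        float v0, v1;
        dequantize_kernel(vx, index/qk, index%qk, v0, v1);
        partial += v0*y[i] + v1*y[i + 1];
    }

    // reduce the per-thread partial sums within the warp
    for (int mask = 16; mask > 0; mask >>= 1) {
        partial += __shfl_xor_sync(0xffffffff, partial, mask);
    }
    if (tid == 0) {
        dst[row] = partial;
    }
}

// example host-side wrapper for q8_0 (signature simplified); other quantization
// types only need to supply their own dequantization function
static void dequantize_mul_mat_vec_q8_0_cuda(const void * vx, const float * y, float * dst,
                                             const int nrows, const int ncols, cudaStream_t stream) {
    dequantize_mul_mat_vec<QK8_0, dequantize_q8_0><<<nrows, 32, 0, stream>>>(vx, y, dst, ncols);
}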

@Folko-Ven (Contributor)

Hello, thank you very much for your contribution to speeding up llama.cpp. As I understand it, the acceleration currently only works with NVIDIA GPUs? I would like to ask whether there is any way to keep the option of not copying weights to VRAM, for those of us who use an iGPU. New iGPUs are quite powerful; for example, the AMD 780M is more powerful than the 1650. There are also rumors that Intel Meteor Lake will be even more powerful.

@JohannesGaessler (Collaborator, Author)

I unfortunately don't have a machine on which I could test an iGPU implementation. I was thinking I would implement the kernels for discrete GPUs first and then someone else could make a follow-up PR that supports iGPUs.

@JohannesGaessler (Collaborator, Author)

@SlyEcho somehow managed to run the previous version that I implemented on AMD, but I don't know how he did it and I don't have an AMD GPU to test my implementation on.

@SlyEcho (Collaborator) commented May 12, 2023

You can just start from the master branch here:

git pull origin master

# apply the ROCm PR:
curl -L https://github.com/ggerganov/llama.cpp/pull/1087.diff | git apply -

# apply this PR:
curl -L https://github.com/ggerganov/llama.cpp/pull/1412.diff | git apply -

Then you can follow the instructions in #1087.

If there is a mess and you want to get back to the start, then just git reset --hard HEAD should put you back.

EDIT: I put some Docker instructions here: rocm.Dockerfile. With containers there is no need to install any drivers, SDKs, or special compilers.

@JohannesGaessler (Collaborator, Author) commented May 12, 2023

I've tried looking into how memory management is done in llama.cpp, but it's trickier than I thought. At least when using --mmap the current version is not too bad: after the initial load, the unused parts of the file should be automatically evicted from memory, although the GPU-accelerated parts of the model will need to be loaded into memory twice if the amount of RAM is smaller than the model size.

However, when using --mlock the entire input file is locked into memory, including the parts that are offloaded to the GPU anyway. This particular problem could be fixed by not memlocking the first part of the file, up until the point where the CPU weights start being loaded; the weights for e.g. normalization in the GPU layers are relatively small and will probably not be too relevant for performance. Overall I feel like the memory management in llama.cpp is relatively complicated; maybe this should be postponed to a later date? Can the people who wrote the memory management code perhaps give some insight?
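
As a rough sketch of that idea, assuming a POSIX mmap'd model file and that the GPU-offloaded tensors occupy a contiguous prefix of the mapping (the helper name is made up, and this is not how the llama.cpp loader is actually structured):

// Illustrative only: lock just the CPU-resident suffix of the mapped file and
// skip the prefix whose tensors were already copied to VRAM. The split point is
// rounded down to a page boundary, so slightly more than necessary may be locked.
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

static int lock_cpu_resident_part(void * mapping, size_t file_size, size_t gpu_prefix_bytes) {
    const size_t page   = (size_t) sysconf(_SC_PAGESIZE);
    const size_t offset = (gpu_prefix_bytes / page) * page;             // round down to a page boundary
    if (offset >= file_size) {
        return 0;                                                       // everything was offloaded
    }
    return mlock((uint8_t *) mapping + offset, file_size - offset);     // 0 on success, -1 on error
}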

@github-actions bot left a comment

clang-tidy made some suggestions

llama.cpp Outdated
@@ -879,6 +883,7 @@ static void llama_model_load_internal(
ggml_type memory_type,
bool use_mmap,
bool use_mlock,
int gpu_layers,

warning: unused parameter 'gpu_layers' [clang-diagnostic-unused-parameter]

        int gpu_layers,
            ^

@SlyEcho (Collaborator) commented May 12, 2023

The layers that are loaded to the GPU could be unmapped from the memory map. I don't think they lie on page boundaries, but since the loading is sequential it may be possible to at least partially unmap them?
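
Something along those lines might look like the following rough sketch (the function name is made up; it assumes a private, file-backed mapping where dropped pages can simply be re-read from disk if touched again):

// Illustrative only: release the resident pages backing a tensor that has been
// copied to VRAM. The range is shrunk to whole pages so that a boundary page
// shared with a neighbouring tensor is left untouched.
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

static void drop_offloaded_range(void * addr, size_t size) {
    const size_t    page  = (size_t) sysconf(_SC_PAGESIZE);
    const uintptr_t begin = ((uintptr_t) addr + page - 1) & ~(uintptr_t) (page - 1); // round up
    const uintptr_t end   = ((uintptr_t) addr + size)     & ~(uintptr_t) (page - 1); // round down
    if (end > begin) {
        madvise((void *) begin, (size_t) (end - begin), MADV_DONTNEED);
    }
}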

@JohannesGaessler (Collaborator, Author)

My current plan: I'll prioritize cleaning up the code and finishing the kernels for the various quantization types. After that, as far as I'm concerned, this can be merged (unless someone wants to suggest changes to the software design?). I'll probably be done tomorrow morning. I'll make follow-up PRs to fix things like the inefficient use of memory.

ggml-cuda.cu Outdated
@@ -173,6 +206,41 @@ static __global__ void dequantize_block_q8_0(const void * vx, float * y) {
}
}

template <int block_size, int qk, dequantize_kernel_t dequantize_kernel> static __global__ void dequantize_mul_mat_vec(const void * vx, const float * y, float * dst, const int ncols) {
Review comment (Collaborator):

Could the template<...> be on a separate line? It's getting very long otherwise.

@ggerganov (Owner)

@JohannesGaessler

Thank you for these contributions

In past discussions, I've been strongly opposed to changing ggml.h interface with GPU-specific elements, like the backend member in this PR. But given the huge inference speed-up, I am starting to change my opinion 😄
I'm OK to merge this.

I wish to hear @slaren's opinion, as he has worked on most of the CUDA backend and has ideas for offloading the tensors to the GPU. I'm open to alternative implementations that keep the ggml.h interface changes minimal and can potentially offer additional perplexity computation speed-up as well.

ggml.c Outdated
Comment on lines 7802 to 7804
if (ggml_cuda_can_mul_mat(src0, src1, dst) ||
ggml_cuda_can_dequantize_mul_mat_vec(src0, src1, dst) ||
src0->backend == GGML_BACKEND_CUDA) {
Review comment (Collaborator):

Is there any reason to put this here instead of leaving all the checks in ggml_cuda_can_mul_mat?

Reply (Collaborator, Author):

I initially thought it would be better this way: having a separate method for dequantize mul mat means that method can also be used for the logic inside ggml_cuda_mul_mat_q_f32. Since it turned out kind of convoluted, I figured I would just push it and see what other people have to say.

Reply (Collaborator):

I think it is better to keep the complexity in ggml_cuda_can_mul_mat than to spread it around, but I am not entirely sure whether you have another reason to split the check into two parts.

@slaren (Collaborator) commented May 12, 2023

I think this is good if it can bring a performance improvement right now. The design will need changes if we want to go further and keep as much of the computation on the GPU as possible, but we can iterate on it.

@Green-Sky (Collaborator) commented May 12, 2023

Load weights directly into VRAM to reduce RAM usage.

If this is done correctly, I can now run 65B q4_0 models o.O

32 GB RAM + 8 GB VRAM

@SlyEcho (Collaborator) commented May 12, 2023

This PR is pretty neat and makes my generation speed two to three times as fast (for 7B at least).

It really makes my GPU scream. Literally! I should make a video...

@JohannesGaessler (Collaborator, Author)

The GPU "screaming" is a sign of inefficiency though. Currently the GPU has to briefly stop between layers because parts of the model like the norms are still CPU only, so the GPU is rapidly turning on and off. The change in current induces Lorentz forces that cause vibrations, i.e. sounds. Theoretically the vibrations could damage the GPU via resonance catastrophe if you hit the eigenfrequencies of the components but I don't think that this is a realistic problem.

@Green-Sky (Collaborator) commented May 13, 2023

I also get coil whine with my RTX 2070 mobile 😄
Edit: the smaller the quantization, the faster the generation and the more coil whine. Also, with q4_0 I observe almost full power draw at only ~60% utilization.

@ggerganov merged commit 905d87b into ggerganov:master on May 13, 2023
@github-actions bot left a comment

clang-tidy made some suggestions

llama.cpp (resolved)
case GGML_TYPE_Q8_0:
return dequantize_mul_mat_vec_q8_0_cuda;
case GGML_TYPE_F16:
return dequantize_mul_mat_vec_q8_0_cuda;
Review comment (Collaborator):

Shouldn't this be convert_mul_mat_vec_f16_cuda instead?

Reply (Owner):

I think so - just pushed a fix, although this branch is never reached

@JohannesGaessler (Collaborator, Author)

My 3090 has arrived in the mail. I'll still need to run more detailed performance tests but it's looking good so far: 43.7 t/s for q4_0 7b and 26.6 t/s for q4_0 13b. Thanks everyone for the feedback and help.

@kevkid commented May 13, 2023

Has anyone got this to build on Windows? I can't seem to get it to build. Link to discussion: #1431

@Green-Sky (Collaborator)

@kevkid Check the CI, or grab it from the latest release: https://github.com/ggerganov/llama.cpp/releases/tag/master-bda4d7c

@kevkid commented May 13, 2023

@Green-Sky I got it to build, thank you for replying. It compiled, but is there a good way to ensure it's using the GPU? GPU usage never goes above 10%.
Edit: I am getting:

llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB

@moejay commented May 13, 2023

(quoting @kevkid's comment above)

I believe you need to add the -ngl N (--n-gpu-layers N) argument.

@Green-Sky (Collaborator)

@moejay is right; loading parts (or all) of the model onto the GPU is opt-in.

@kevkid commented May 13, 2023

(quoting @moejay's reply above)

Thank you.

@Green-Sky (Collaborator)

@JohannesGaessler or @ggerganov please add a hint to the

llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB

output about which CLI flag to set, or to look it up in the help. People have been complaining. :)

@JohannesGaessler (Collaborator, Author)

On the other hand, you can always just use --help and you will be enlightened.

@Green-Sky (Collaborator)

Sure, but we already had two people ask for help here in the last 8 hours 😆

@CRD716 (Contributor) commented May 14, 2023

There should also be some documentation on how many layers is best. In my initial testing, there seems to be a number of offloaded layers that is worse than offloading none before it starts getting faster (in my case with 65B it's anything under ~15 layers; with my 13B tests the threshold seems to be higher).

@regstuff

I've got an AMD GPU, and compiling with CLBlast flags enables me to use the GPU for prompt ingestion. Would CLBlast also enable me to use the GPU for generation?

@JohannesGaessler (Collaborator, Author)

there should also be some documentation on how many layers is best.

Almost always: as many as you can fit into VRAM. I don't think there is a feasible way to determine specifics for all possible hardware configurations.

Would CLBlast also enable me to use the GPU for generation?

In theory yes, but no one has implemented OpenCL token generation.

@zakkor (Contributor) commented May 14, 2023

What GPUs & OSes is this currently compatible with?

@Green-Sky (Collaborator)

What GPUs & OSes is this currently compatible with?

NVIDIA GPUs, on Windows and Linux.

Labels: performance (Speed related topics)