GPU-accelerated token generation (new quantization format) #1412
Conversation
I implemented q4_1 by making a …
Hello, thank you very much for your contribution to speeding up llama.cpp. As I understand it, the acceleration currently only works with Nvidia GPUs? I would like to ask if there is any way to keep the option of not copying weights to VRAM, for those of us who use iGPUs? New iGPUs are quite powerful; for example, the AMD 780M is more powerful than the 1650. There are also rumors that Intel Meteor Lake will be even more powerful.
I unfortunately don't have a machine on which I could test an iGPU implementation. I was thinking I would implement the kernels for discrete GPUs first and then someone else could make a follow-up PR that supports iGPUs.
@SlyEcho somehow managed to run the previous version that I implemented on AMD, but I don't know how he did it and I don't have an AMD GPU to test my implementation on.
You can just start from the master branch here:

    git pull origin master
    # apply the ROCm PR:
    curl -L https://github.com/ggerganov/llama.cpp/pull/1087.diff | git apply -
    # apply this PR:
    curl -L https://github.com/ggerganov/llama.cpp/pull/1412.diff | git apply -

Then you can follow the instructions in #1087. If there is a mess and you want to get back to the start then just …

EDIT: I put some Docker instructions here: rocm.Dockerfile; with containers there is no need to install any drivers, SDKs, or special compilers.
I've tried looking into how memory management is done in llama.cpp, but it's more tricky than I thought. At least I think that when using … However, when using …
clang-tidy made some suggestions
llama.cpp (outdated):

    @@ -879,6 +883,7 @@ static void llama_model_load_internal(
            ggml_type memory_type,
            bool use_mmap,
            bool use_mlock,
            int gpu_layers,
warning: unused parameter 'gpu_layers' [clang-diagnostic-unused-parameter]
int gpu_layers,
^
Force-pushed from 1e735d2 to f0af475
The layers that are loaded to the GPU could be unmapped from the memory map. I don't think they lie on page boundaries, but since the loading is sequential it may be possible to at least partially unmap them?
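For illustration, here is a minimal sketch of what such a partial unmap could look like (the helper is hypothetical and not part of this PR; it assumes a POSIX system). munmap only accepts page-aligned addresses and lengths, so only the whole pages inside the offloaded range can be released, which matches the concern about layers not lying on page boundaries:

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>

    // Hypothetical helper: release the pages that are fully covered by
    // [addr, addr + size) once the corresponding tensors have been copied to VRAM.
    // munmap() requires page-aligned addresses, so the start is rounded up and the
    // end rounded down; partially covered pages at either end stay mapped.
    static void unmap_offloaded_range(void * addr, size_t size) {
        const uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
        const uintptr_t begin = ((uintptr_t) addr + page - 1) & ~(page - 1); // round up
        const uintptr_t end   = ((uintptr_t) addr + size)     & ~(page - 1); // round down
        if (end > begin) {
            munmap((void *) begin, end - begin);
            // alternatively, madvise(MADV_DONTNEED) keeps the mapping intact but
            // lets the kernel reclaim the physical pages
        }
    }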
My current plan: I'll prioritize cleaning up the code and finishing the kernels for the various quantization types. After that, as far as I'm concerned, this can be merged (unless someone wants to suggest changes to the software design?). I'll probably be done tomorrow morning. I'll make follow-up PRs to fix things like the inefficient use of memory.
ggml-cuda.cu (outdated):

    @@ -173,6 +206,41 @@ static __global__ void dequantize_block_q8_0(const void * vx, float * y) {
    }
    }

    template <int block_size, int qk, dequantize_kernel_t dequantize_kernel> static __global__ void dequantize_mul_mat_vec(const void * vx, const float * y, float * dst, const int ncols) {
Could the template <...> be on a separate line? It's getting very long otherwise.
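For reference, the suggested reformatting just splits the signature quoted above across lines, along these lines:

    template <int block_size, int qk, dequantize_kernel_t dequantize_kernel>
    static __global__ void dequantize_mul_mat_vec(
            const void * vx, const float * y, float * dst, const int ncols) {
        // kernel body unchanged
    }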
Thank you for these contributions. In past discussions, I've been strongly opposed to changing … I wish to hear @slaren's opinion, as he has worked on most of the CUDA backend and he had ideas for offloading the tensors to the GPU. I'm open to alternative implementations that keep the …
ggml.c (outdated):

    if (ggml_cuda_can_mul_mat(src0, src1, dst) ||
        ggml_cuda_can_dequantize_mul_mat_vec(src0, src1, dst) ||
        src0->backend == GGML_BACKEND_CUDA) {
Is there any reason to put this here instead of leaving all the checks in ggml_cuda_can_mul_mat?
I was initially thinking it would be better this way: having a separate method for dequantize mul mat means you can also use that method for the logic inside ggml_cuda_mul_mat_q_f32. I figured I would just push it and see what other people have to say, since it turned out kind of convoluted.
I think that it is better to keep the complexity in ggml_cuda_can_mul_mat than to spread it all around, but I am not entirely sure if you have any reason to split the check in two parts.
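For concreteness, "keeping the complexity in ggml_cuda_can_mul_mat" could look roughly like the sketch below; the ggml_cuda_can_mul_mat_f32 name and the exact conditions are assumptions for illustration, not the actual code:

    // Hypothetical consolidation: fold the dequantize-mul-mat-vec and backend checks
    // into a single predicate so ggml.c only needs one call site.
    static bool ggml_cuda_can_mul_mat(const struct ggml_tensor * src0,
                                      const struct ggml_tensor * src1,
                                      const struct ggml_tensor * dst) {
        return ggml_cuda_can_mul_mat_f32(src0, src1, dst)            || // existing cuBLAS path (assumed name)
               ggml_cuda_can_dequantize_mul_mat_vec(src0, src1, dst) || // new mat-vec kernels from this PR
               src0->backend == GGML_BACKEND_CUDA;                      // weights already resident on the GPU
    }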
I think this is good if it brings a performance improvement right now. The design will need changes if we want to go further and keep as much of the computation on the GPU as possible, but we can iterate on it.
If this is done correctly, I can now run 65B q4_0 models o.O (32 GB RAM + 8 GB VRAM)
This PR is pretty neat and makes my generation speed … It really makes my GPU scream. Literally! I should make a video...
The GPU "screaming" is a sign of inefficiency though. Currently the GPU has to briefly stop between layers because parts of the model, like the norms, are still CPU-only, so the GPU is rapidly turning on and off. The change in current induces Lorentz forces that cause vibrations, i.e. sound. Theoretically the vibrations could damage the GPU via resonance catastrophe if you hit the eigenfrequencies of the components, but I don't think that this is a realistic problem.
Force-pushed from 991ef9e to bb0993e
I also get coil whine with my RTX 2070 mobile 😄
clang-tidy made some suggestions
    case GGML_TYPE_Q8_0:
        return dequantize_mul_mat_vec_q8_0_cuda;
    case GGML_TYPE_F16:
        return dequantize_mul_mat_vec_q8_0_cuda;
Shouldn't this be convert_mul_mat_vec_f16_cuda instead?
I think so - just pushed a fix, although this branch is never reached
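In other words, the corrected counterpart of the hunk quoted above returns the F16 conversion kernel in the F16 case:

    case GGML_TYPE_Q8_0:
        return dequantize_mul_mat_vec_q8_0_cuda;
    case GGML_TYPE_F16:
        return convert_mul_mat_vec_f16_cuda; // previously returned the q8_0 kernel by mistake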
My 3090 has arrived in the mail. I'll still need to run more detailed performance tests, but it's looking good so far: 43.7 t/s for q4_0 7B and 26.6 t/s for q4_0 13B. Thanks everyone for the feedback and help.
Has anyone got this to build on Windows? I can't seem to get it to build. Link to discussion: #1431
@kevkid Check the CI, or grab it from the latest release: https://github.com/ggerganov/llama.cpp/releases/tag/master-bda4d7c
@Green-Sky I got it to build, thank you for replying. It compiled, but is there a good way to ensure it's using the GPU? GPU usage never goes above 10%.
I believe you need to add the CLI flag that enables GPU layer offloading.
@moejay is right, loading parts (or all) of the model to the GPU is opt-in.
Thank you.
@JohannesGaessler or @ggerganov, please add a hint about which CLI flag to set, or where to look it up in the help. People have been complaining. :)
On the other hand, you can always just use …
Sure, but we already had two people ask for help here in the last 8 h 😆
There should also be some documentation on how many layers is best. In my initial testing there seems to be a range of offloaded layer counts that is worse than none at all (in my case with 65B it's anything under ~15 layers; with my 13B tests the threshold seems to be higher) before it starts getting faster.
I've got an AMD GPU, and compiling with CLBlast flags enables me to use the GPU for prompt ingestion. Would CLBlast also enable me to use the GPU for generation?
Almost always: as many as you can fit into VRAM. I don't think there is a feasible way to determine specifics for all possible hardware configurations (a rough estimate is sketched below).
In theory yes, but no one has implemented OpenCL token generation.
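On the question of how many layers to offload, a rough back-of-the-envelope approach is to divide the VRAM you are willing to spend by the approximate per-layer size of the quantized model. The numbers below are illustrative assumptions, not measurements, and some VRAM has to be left over for intermediate results:

    #include <cstdio>

    // Illustrative estimate only: real per-layer sizes depend on the model and
    // quantization format, and some VRAM is needed for scratch buffers.
    int main() {
        const double model_size_gb = 3.8; // approximate size of a 7B q4_0 model file (assumed)
        const int    n_layers      = 32;  // transformer layers in a 7B LLaMA model
        const double free_vram_gb  = 6.0; // VRAM you are willing to dedicate to weights (assumed)

        const double gb_per_layer = model_size_gb / n_layers;
        int layers = (int) (free_vram_gb / gb_per_layer);
        if (layers > n_layers) layers = n_layers;

        std::printf("offload roughly %d of %d layers\n", layers, n_layers);
        return 0;
    }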
What GPUs & OSes is this currently compatible with?
Nvidia, on Windows and Linux.
Build instructions (Linux): …
For building on Windows, read the llama.cpp README.
This PR is a replacement for #1375 that works with the new quantization format. Read that PR for more information. People with no git experience are already using that version so I'm making a new branch and PR to avoid unnecessary confusion, especially with the breaking quantization changes.
The goals of this PR: …
Not the goals of this PR: …
In other news, the quantization changes make a big difference for my kernel implementation: I can now get 14.53 t/s with a GTX 1070 for 7B, which is 16% faster than with the old quantization method. I think the reason is memory coalescing when reading the vector values.
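To illustrate what coalescing means here (a generic sketch, not the actual kernel from this PR): consecutive threads in a warp should read consecutive addresses, so their loads can be combined into a few wide memory transactions instead of one transaction per thread.

    // Generic illustration of coalesced vs. strided global-memory reads (not PR code).
    __global__ void coalesced_read(const float * x, float * out, const int n) {
        // thread i reads x[i]: neighbouring threads touch neighbouring addresses,
        // so a warp's 32 loads are served by a handful of wide transactions
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = x[i];
        }
    }

    __global__ void strided_read(const float * x, float * out, const int n, const int stride) {
        // thread i reads x[i * stride]: with a large stride each load hits a different
        // cache line, so the same warp needs many separate transactions
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) {
            out[i] = x[i * stride];
        }
    }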