
ggml-cuda.so is 90mb with -arch=all #7156

Closed · jart opened this issue May 9, 2024 · 20 comments

@jart
Contributor

jart commented May 9, 2024

The CUDA implementation for GGML_OP_FLASH_ATTN_EXT is as large as the rest of ggml-cuda combined.

master jart@luna:~/llama.cpp$ ls -Shal ggml-cuda/*.o
-rw-rw-r-- 1 jart jart 3.9M May  8 19:37 ggml-cuda/fattn.o
-rw-rw-r-- 1 jart jart 2.4M May  8 19:37 ggml-cuda/mmvq.o
-rw-rw-r-- 1 jart jart 335K May  8 19:37 ggml-cuda/mmq.o
-rw-rw-r-- 1 jart jart 316K May  8 19:37 ggml-cuda/binbcast.o
-rw-rw-r-- 1 jart jart 265K May  8 19:37 ggml-cuda/convert.o
-rw-rw-r-- 1 jart jart 197K May  8 19:37 ggml-cuda/softmax.o
-rw-rw-r-- 1 jart jart 193K May  8 19:37 ggml-cuda/cpy.o
-rw-rw-r-- 1 jart jart 143K May  8 19:37 ggml-cuda/dmmv.o
-rw-rw-r-- 1 jart jart 121K May  8 19:37 ggml-cuda/getrows.o
-rw-rw-r-- 1 jart jart 113K May  8 19:37 ggml-cuda/norm.o
-rw-rw-r-- 1 jart jart 109K May  8 19:37 ggml-cuda/rope.o
-rw-rw-r-- 1 jart jart  96K May  8 19:37 ggml-cuda/unary.o
-rw-rw-r-- 1 jart jart  85K May  8 19:37 ggml-cuda/im2col.o
-rw-rw-r-- 1 jart jart  72K May  8 19:37 ggml-cuda/argsort.o
-rw-rw-r-- 1 jart jart  71K May  8 19:37 ggml-cuda/pool2d.o
-rw-rw-r-- 1 jart jart  67K May  8 19:37 ggml-cuda/acc.o
-rw-rw-r-- 1 jart jart  67K May  8 19:37 ggml-cuda/alibi.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/upscale.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/concat.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/tsembd.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/diagmask.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/sumrows.o
-rw-rw-r-- 1 jart jart  66K May  8 19:37 ggml-cuda/pad.o
-rw-rw-r-- 1 jart jart  65K May  8 19:37 ggml-cuda/arange.o
-rw-rw-r-- 1 jart jart  65K May  8 19:37 ggml-cuda/clamp.o
-rw-rw-r-- 1 jart jart  65K May  8 19:37 ggml-cuda/scale.o
-rw-rw-r-- 1 jart jart  65K May  8 19:37 ggml-cuda/quantize.o

The heaviest function is this one:

// D == head size, VKQ_stride == num VKQ rows calculated in parallel:
template<int D, int ncols, int nwarps, int VKQ_stride, int parallel_blocks, typename KQ_acc_t>
__launch_bounds__(nwarps*WARP_SIZE, 1)
static __global__ void flash_attn_ext_f16(
const char * __restrict__ Q,

GPU support for flash attention can't be included in llamafile because we're up against a 4GB executable size limit on Windows.

For comparison, in December ggml-cuda.so built with -arch=all was 12mb. By February it was 16mb. By April it was 50mb. Now it's 90gb. On my project we've already started using gzip to compress the ggml-cuda dso. We've also reduced our support vector to -arch=all-major. Everything that can be done is being done on our end, since I'd like to be able to include everything if possible. However, this op seems like it could benefit from a refactoring.

@JohannesGaessler
Collaborator

By February it was 16mb. By April it was 50mb. Now it's 90gb.

I assume this is simply a typo and you mean 90mb.

We've also reduced our support vector to -arch=all-major.

When we (slaren, a user, and I) tested compiling for different CUDA architectures (months ago), we found that there is no measurable performance difference between compiling for the minimum needed CUDA architecture and the actual CUDA arch of the GPU. So, assuming you use CUDA 12, it should be sufficient to compile for CUDA architectures 5.2, 6.0, 6.1, and 7.0 with the current code.

Everything that can be done is being done on our end, since I'd like to be able to include everything if possible. However, this op seems like it could benefit from a refactoring.

The reasons the FlashAttention kernel needs so much space are that

  1. it is simply a large kernel that does many things at once in order to avoid having to write the KQ matrix to VRAM, and
  2. it makes heavy use of templating to compile many different versions of the kernel so that the compiler can optimize the code for specific combinations of head sizes and batch sizes.

The first reason is, I think, fundamentally unavoidable. The second reason can only be avoided if you accept a significant performance penalty or reduce the number of cases covered by the kernel. Intuitively, I would think that a kernel without templating would be at least 2x slower. What you could do on your end to reduce the file size without performance penalties is to compile the kernel only for the head size of the model with which you package the code; all other head sizes are never going to be used anyway. In a similar manner, you could compile only those kernels for quantized data that match the quantization format of the packaged model to reduce the file size for mmq.o, mmvq.o, and dmmv.o.
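
For illustration, a minimal hypothetical sketch of what compiling only selected head sizes could look like. This is not the actual ggml-cuda dispatch code; the FATTN_HEAD_SIZE_* macros and function names are invented for the example.

// Hypothetical sketch, not real ggml-cuda code: a compile-time allow-list of
// head sizes. Only the cases that survive preprocessing pull in a family of
// templated kernel instantiations, so every case you drop shrinks the object file.
#include <cstdio>

template <int D>  // D = head size; each distinct D is a separate instantiation
static void launch_flash_attn_for_head_size() {
    std::printf("launching FlashAttention kernels for head size %d\n", D);
}

static bool dispatch_flash_attn(int head_size) {
    switch (head_size) {
#ifdef FATTN_HEAD_SIZE_64       // enable only what the packaged model needs
        case  64: launch_flash_attn_for_head_size< 64>(); return true;
#endif
#ifdef FATTN_HEAD_SIZE_128
        case 128: launch_flash_attn_for_head_size<128>(); return true;
#endif
        default:  return false; // caller falls back to the non-FlashAttention path
    }
}

int main() {
    if (!dispatch_flash_attn(128)) {
        std::printf("head size not compiled in, using fallback\n");
    }
    return 0;
}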

@jart
Contributor Author

jart commented May 25, 2024

How would I do that? If I use -arch=compute_52 then it works on Linux with the CUDA SDK installed, although it has a long startup delay on the first run (presumably because the SDK is JIT-compiling something). However, it doesn't work for me when I run it on Windows. Maybe that's because it needs access to the installed CUDA SDK to JIT the code?

I tried the thing you suggested, namely:

nvcc \
  -gencode arch=compute_52,code=sm_52 \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_61,code=sm_61 \
  -gencode arch=compute_70,code=sm_70 \
  ...

But that doesn't work on my Linux machine with a compute_89 card.

I don't know how to make a portable CUDA DSO binary that's smaller than what -arch=all-major produces. Even that's too big now to meet llamafile's requirements: I want to be able to distribute 7B models to Windows users while squeaking under the 4GB file size limit. I can't do that if the flash attention feature needs >100mb of space in the binary.

Could we add a define called GGML_MINIMIZE_CODE_SIZE that causes it to be removed from the build?

@JohannesGaessler
Collaborator

I have not figured out how to do this manually using make, but with cmake you can specify an exact list of compute capabilities.

Could we add a define called GGML_MINIMIZE_CODE_SIZE that causes it to be removed from the build?

It would absolutely be possible to reduce the code size by just compiling less code, but as I said, this will inevitably come at the cost of either worse performance or fewer features.
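
To make the trade-off concrete, here is a toy sketch of the kind of opt-out being discussed. The GGML_MINIMIZE_CODE_SIZE name comes from the comment above; the function name and fallback behavior are invented for illustration and are not real ggml symbols.

// Toy illustration only. With the define set, the heavy fattn translation unit
// would simply not be built, and the op would be reported as unsupported so the
// graph falls back to the unfused attention path.
#include <cstdio>

#ifndef GGML_MINIMIZE_CODE_SIZE
static bool cuda_flash_attn_available() {
    return true;    // heavy templated FlashAttention kernels compiled in
}
#else
static bool cuda_flash_attn_available() {
    return false;   // FlashAttention excluded to keep the binary small
}
#endif

int main() {
    std::printf("FlashAttention compiled in: %s\n",
                cuda_flash_attn_available() ? "yes" : "no");
    return 0;
}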

@JohannesGaessler
Collaborator

Also FYI: the addition of KV cache quantization support for FlashAttention, as in #7527, will lead to similar increases in file size.

@jart
Contributor Author

jart commented May 25, 2024

llamafile does its best to faithfully support, without bias, all the features the llama.cpp developer community implements. But I can't do that if including a llama.cpp feature means I need to drop support for some of my users. Users come first.

So if I send a PR adding the #ifdef statements, will you merge it?

Please keep in mind, I'm only talking about cutting features from llamafile's prebuilt release binaries. llamafile releases include a --recompile flag that lets users who have the CUDA or ROCm SDKs installed automatically build a native ggml-cuda DSO, which is faster and includes flash attention.

The ifdef should tide us over until someone thinks of a tinier way to do it. If I can build a LISP IDE that boots as an OS in 436 bytes, then it should be possible to implement a native CUDA build of flash attention in fewer than 6,688,888 bytes.

@JohannesGaessler
Collaborator

So if I send a PR adding the #ifdef statements, will you merge it?

My personal opinion is that optimizing for code size is not worthwhile for llama.cpp itself. So I personally would only be willing to merge such a PR if it comes with a pledge that you will put in the effort to maintain the feature. Creating a fork could also be an option. But I am not the ultimate authority on what does and doesn't get merged when ggml is involved, so input from @ggerganov and @slaren would be appreciated.

The ifdef should tide us over until someone thinks of a tinier way to do it. If I can build a LISP IDE that boots as an OS in 436 bytes, then it should be possible to implement a native CUDA build of flash attention in fewer than 6,688,888 bytes.

Sorry to say, but since the CUDA compiler is proprietary and closed-source, I think your options will be rather limited. As I said before (apart from reducing the CUDA architectures), I think the best option would be to only compile kernels for those quantization formats and head sizes that are actually going to be used by the packaged model.

More generally speaking, I think that this is just a cursed problem in the first place. The general trend for llama.cpp/ggml has been to add more GPU code that covers more cases and improves performance at the cost of an increase in compile time/binary size. I see that trend continuing so my intuition is that trying to reduce the binary size with tricks will only provide a temporary solution.

github-actions bot added the stale label Jun 25, 2024
@henk717

henk717 commented Jun 26, 2024

The rate it's growing does raise concerns, so it may be good to establish a target size for the binary.
For example, let's take a more extreme case and say this binary grows to 1.5GB; combined with the CUDA 12 libs, this would push a self-extractor or setup file over 2GB, which means GitHub releases can no longer be used in downstream projects and they have to resort to other storage. And of course, as jart pointed out, for projects like llamafile these size issues are a bigger concern due to the 4GB executable limit on Windows.

Similarly, we already had one case where the compilation of our Koboldcpp Hugging Face space took so long that we could no longer compile it. For now these issues are manageable and solvable, in this case by tweaking the space to compile for fewer GPU targets, but this does leave all the spaces that were cloned potentially broken. On my own side I am planning to switch the HF spaces to use our precompiled binaries to avoid that issue completely in the future, but it's not unthinkable that it will eventually affect more free CIs if the compile time keeps growing massively.

So, to ensure that bigger issues are avoided, I do think it's good to keep a maximum binary size / compile time in mind as a counterbalance to the performance chase.

@JohannesGaessler
Collaborator

It's unfortunate that you're having issues, but if you want or need precompiled binaries that are both small and portable, then CUDA is simply the wrong tool.

I am not willing to invest the time and effort needed to implement and maintain changes that trade performance for smaller file sizes/lower compile time but I will happily provide instructions for how to do so if someone else is interested.

If file size/compile time becomes an issue for llama.cpp, the way to go would in my opinion be to make support for architectures older than Pascal and for q4_1, q5_0, q5_1 (and probably also very small quantization formats) opt-in via an option like LLAMA_CUDA_FULL. FlashAttention support for unusual head sizes could maybe also be made opt-in. I think making the default compile have suboptimal performance would be a bad solution.

@LostRuins
Collaborator

I understand that you prefer to focus solely on speed/performance, but I'd just like to point out that in just the last 3 days, the addition of int8 tensor core support alone has increased llama-b3209-bin-win-cuda-cu11.7.1-x64.zip from 168mb compressed to llama-b3233-bin-win-cuda-cu11.7.1-x64.zip at 310mb compressed, which is an increase greater than the total growth in binary size over the entirety of last year.

Granted, this decision is entirely within your prerogative, and indeed very few others could do the work you do; I just thought I could maybe bring a bit of perspective from an end-user point of view.

@slaren
Collaborator

slaren commented Jun 26, 2024

The next release should be significantly smaller. There was an issue that caused ggml to be linked multiple times.

@henk717

henk717 commented Jun 26, 2024

That's good to hear; the main worry stems from how fast it was growing and that eventually turning into sizes that GitHub release pages / Windows binaries can't handle. From what we saw, the growth over the past releases looked exponential, so hearing that it's due to a linking error is a relief.

If the majority of size-increasing changes have been made and it will eventually stabilize, then I don't see an issue. But if we were to get into the gigabyte sizes I can see issues starting to happen, hence my wanting to warn that it's good to keep those restrictions in mind.

The earlier mentioned Hugging Face compile-time issue I just solved for good on our end (cloned spaces will have to manually apply my new build file); it uses our precompiled binaries now, so compile time there will no longer be a factor. I saw GitHub CI allows up to 6 hours, so that should leave plenty of margin.

@JohannesGaessler
Collaborator

The dimensions for template parameters for FlashAttention are compute capability, head size, batch size, K type, and V type. The dimensions for MMQ are compute capability, weight type, batch size, and whether or not you need an out-of-bounds check in one of the dimensions. More than anything, the increase in compile time and binary size came from adding more dimensions, since this drastically increases the number of possible combinations. I am currently not aware of any extra dimensions that would be worth adding. I plan to at some point add code for training, which you could in a sense consider an extra dimension, but I think it would be completely fine to make the code for that opt-in since most people would not be using it.
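
As a toy illustration of how these dimensions multiply (the counts below are invented placeholders, not the real number of instantiations in ggml-cuda):

// Back-of-the-envelope illustration of combinatorial template growth.
// All counts are made up for the example.
#include <cstdio>

int main() {
    const int compute_caps = 2;  // e.g. older vs. newer architecture code paths
    const int head_sizes   = 6;  // e.g. 64, 80, 96, 112, 128, 256
    const int batch_sizes  = 4;
    const int k_types      = 3;  // f16 plus a couple of quantized KV formats
    const int v_types      = 3;

    // every new dimension multiplies the total: one more dimension with n
    // options means n times more kernels to compile
    const int kernels = compute_caps * head_sizes * batch_sizes * k_types * v_types;
    std::printf("FlashAttention kernel variants to compile: %d\n", kernels);  // 432
    return 0;
}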

@slaren
Collaborator

slaren commented Jun 26, 2024

I triggered a new release in https://github.com/ggerganov/llama.cpp/actions/runs/9683534847, and the compressed file size is now 82MB.

@jart
Contributor Author

jart commented Jun 26, 2024

I must admit I smiled when I read the koboldcpp v1.68 release notes the other day.

fattn

@giladgd
Contributor

giladgd commented Jun 27, 2024

The uncompressed file size of a CUDA build is now 199MB (ggml.dll on Windows, on build b3240), whereas the build was just 80MB on a release from a few days ago (b3166, for example).
I think there might be another issue besides the duplicate linking causing this (given that that is no longer the case), since the compressed .zip file is 80MB, so I suspect something else is duplicated in the build now.

@slaren
Collaborator

slaren commented Jun 27, 2024

I don't think that's unexpected; there is one more architecture to compile for, and a lot more kernels. The compression ratio is similar in both releases.

github-actions bot removed the stale label Jun 27, 2024
@LostRuins
Collaborator

@jart just wondering, have you found any solution to this issue for llamafile yet? There are #8495 and #8542 incoming, so we can expect another dozen MB of increase within the next few days to weeks. 90mb now looks amazingly pleasant in comparison.

@oldgithubman

Would there be any way to do just-in-time compilation of needed kernels?

@jart
Contributor Author

jart commented Jul 17, 2024

@LostRuins Right now I'm focusing on developing new features rather than copying code from upstream.

github-actions bot added the stale label Aug 17, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
