ggml-cuda.so is 90mb with -arch=all #7156
Comments
I assume this is simply a typo and you mean 90mb.
When we (slaren, a user, and I) tested compiling for different CUDA architectures a few months ago, we found that there is no measurable performance difference between compiling for the minimum needed CUDA architecture and the actual CUDA arch of the GPU. So, assuming you use CUDA 12, it should be sufficient to compile for CUDA architectures 5.2, 6.0, 6.1, and 7.0 with the current code.
The FlashAttention kernel needs so much space for two reasons: the kernel itself is intrinsically complex, and it is instantiated via templates for a large number of combinations of head sizes, batch sizes, and data types.
The first reason is, I think, fundamentally unavoidable. The second reason can only be avoided if you accept a significant performance penalty or reduce the number of cases covered by the kernel. Intuitively I would think that a kernel without templating would be at least 2x slower. What you could do on your end to reduce the file size without performance penalties is to compile the kernel only for the head size of the model with which you package the code; all other head sizes are never going to be used anyway. In a similar manner, you could compile only those kernels for quantized data that match the quantization format of the packaged model to reduce the file size further.
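As a purely hypothetical sketch of that idea (the define name and functions below are invented for illustration and are not existing llama.cpp code), restricting which head sizes get instantiated could look roughly like this:

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in for the real templated FlashAttention launcher: every distinct
// head_size value referenced below becomes a separate instantiation in the
// compiled binary.
template <int head_size>
void flash_attn_case() {
    printf("running FlashAttention path for head size %d\n", head_size);
}

void dispatch_flash_attn(int head_size) {
#ifdef GGML_CUDA_FA_ONLY_HEAD_SIZE
    // Hypothetical build flag: only a single head size is compiled in,
    // which is enough when the packaged model is known ahead of time.
    if (head_size == GGML_CUDA_FA_ONLY_HEAD_SIZE) {
        flash_attn_case<GGML_CUDA_FA_ONLY_HEAD_SIZE>();
        return;
    }
    fprintf(stderr, "head size %d not compiled into this build\n", head_size);
    exit(1);
#else
    // Default build: instantiate every supported head size (illustrative list).
    switch (head_size) {
        case  64: flash_attn_case< 64>(); break;
        case  80: flash_attn_case< 80>(); break;
        case 128: flash_attn_case<128>(); break;
        case 256: flash_attn_case<256>(); break;
        default:  fprintf(stderr, "unsupported head size %d\n", head_size); exit(1);
    }
#endif
}

int main() {
    dispatch_flash_attn(128);
    return 0;
}
```

Built with something like `-DGGML_CUDA_FA_ONLY_HEAD_SIZE=128`, only the one instantiation the packaged model needs ends up in the binary; the same pattern would apply to quantization formats.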
How would I do that? I tried the thing you suggested, namely compiling only for the minimum CUDA architectures. But that doesn't work on my Linux machine. I don't know how to make a portable CUDA DSO binary that's smaller than what -arch=all-major produces. Could we add a define for building a smaller set of kernels?
I have not figured out how to do this manually.
It would absolutely be possible to reduce the code size by just compiling less code, but as I said, this will inevitably come at the cost of either worse performance or fewer features.
Also FYI: the addition of KV cache quantization support for FlashAttention, as in #7527, will lead to similar increases in file size.
llamafile does its best to faithfully support, without bias, all the features the llama.cpp developer community implements. But I can't do that if including a llama.cpp feature means I need to drop support for some of my users. Users come first. So if I send a PR adding such a define, please keep in mind that I'm only talking about cutting features from llamafile's prebuilt release binaries; llamafile releases include a way for users with the CUDA toolchain installed to build the full ggml-cuda module locally. The ifdef should hold us off until someone thinks of a tinier way to do it. If I can build a LISP IDE that boots as an OS in 436 bytes, then it should be possible to implement a native CUDA build of flash attention in fewer than 6,688,888 bytes.
My personal opinion is that optimizing for code size is not worthwhile for llama.cpp itself. So I personally would only be willing to merge such a PR if it comes with a pledge that you will put in the effort to maintain the feature. Creating a fork could also be an option. But I am not the ultimate authority on what does and doesn't get merged when ggml is involved so input from @ggerganov and @slaren would be appreciated.
Sorry to say, but since the CUDA compiler is proprietary and closed-source I think your options will be rather limited. As I said before (apart from reducing the CUDA architectures), I think the best option would be to only compile kernels for those quantization formats and head sizes that are actually going to be used by the packaged model. More generally speaking, I think that this is just a cursed problem in the first place. The general trend for llama.cpp/ggml has been to add more GPU code that covers more cases and improves performance at the cost of an increase in compile time/binary size. I see that trend continuing, so my intuition is that trying to reduce the binary size with tricks will only provide a temporary solution.
The rate at which it's growing does raise concerns, so it may be good to establish a target size for the binary. Similarly, we already had one case where the compilation of our Koboldcpp Hugging Face space took so long that we could no longer compile it. For now these issues are manageable and solvable, in this case by tweaking the space to compile for fewer GPU targets, but this does leave all the spaces that got cloned potentially broken. On my own side I am planning to switch the HF spaces to use our precompiled binaries to avoid that issue completely in the future, but it's not unthinkable that it will eventually affect more free CIs if the compile time keeps growing massively. So to ensure that bigger issues are avoided, I do think it's good to keep a maximum size of the binaries / compile time in mind as a counterbalance to the performance chase.
It's unfortunate that you're having issues, but if you want or need precompiled binaries that are both small and portable, then CUDA is simply the wrong tool. I am not willing to invest the time and effort needed to implement and maintain changes that trade performance for smaller file sizes/lower compile time, but I will happily provide instructions for how to do so if someone else is interested. If file size/compile time becomes an issue for llama.cpp, the way to go would in my opinion be to make support for architectures older than Pascal and q4_1, q5_0, q5_1 (and probably also very small quantization formats) opt-in via a build option.
I understand that you prefer to focus solely on speed/performance, but I'd just like to point out that in just the last 3 days, the addition of int8 tensor core support alone has increased llama-b3209-bin-win-cuda-cu11.7.1-x64.zip from 168mb compressed to llama-b3233-bin-win-cuda-cu11.7.1-x64.zip at 310mb compressed, which is an increase greater than the total growth in binary size over the entirety of last year. Granted, this decision is entirely within your prerogative, and indeed very few others could do the work you do; I just thought I could bring a bit of perspective from an end-user point of view.
The next release should be significantly smaller. There was an issue that caused ggml to be linked multiple times.
That's good to hear. The main worry stems from how fast it was growing, and that eventually turning into sizes that GitHub release pages / Windows binaries can't handle. From what we saw it was exponential growth over the past releases, so hearing that it's due to a linking error is a relief. If the majority of size-increasing changes have been done and it will eventually stabilize, then I don't see an issue. But if we were to get into gigabyte sizes I can see issues starting to happen, hence my wanting to warn that it's good to keep those restrictions in mind. The earlier mentioned Hugging Face compile-time issue I just solved for good on our end (cloned spaces will have to manually apply my new build file); it uses our precompiled binaries now, so compile time there will no longer be a factor. I saw GitHub CI allows up to 6 hours, so that should have plenty of margin left.
The dimensions for template parameters for FlashAttention are compute capability, head size, batch size, K type, and V type. The dimensions for MMQ are compute capability, weight type, batch size, and whether or not you need an out-of-bounds check in one of the dimensions. More than anything, the increase in compile time and binary size came from adding more dimensions, since this drastically increases the number of possible combinations. I am currently not aware of any extra dimensions that would be worth adding. I plan to at some point add code for training, which you could in a sense consider an extra dimension, but I think it would be completely fine to make the code for that opt-in since most people would not be using it.
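To make the combinatorics concrete (the counts below are assumptions for illustration, not the actual numbers of instantiations in llama.cpp):

```cpp
#include <cstdio>

int main() {
    // Assumed counts per template dimension, for illustration only.
    const int compute_capabilities = 3; // e.g. Pascal, Turing, Ampere+
    const int head_sizes           = 5;
    const int batch_sizes          = 4;
    const int k_types              = 4; // f16 plus a few quantized KV formats
    const int v_types              = 4;

    // Every dimension multiplies the number of kernels nvcc has to emit,
    // which is why adding one more dimension blows up compile time and size.
    const int instances = compute_capabilities * head_sizes * batch_sizes * k_types * v_types;
    printf("FlashAttention kernel instances in this hypothetical build: %d\n", instances); // 960
    return 0;
}
```

Dropping a single dimension (or making it opt-in, as suggested for training) divides that count rather than shaving a constant off it.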
I triggered a new release in https://github.com/ggerganov/llama.cpp/actions/runs/9683534847, and the compressed file size is now 82MB.
I must admit I smiled when I read the koboldcpp v1.68 release notes the other day.
The uncompressed file size of a CUDA build is now 199MB.
I don't think that's unexpected; there is one more architecture to compile for, and a lot more kernels. The compression ratio is similar in both releases.
Would there be any way to do just-in-time compilation of needed kernels?
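For reference, CUDA does expose runtime compilation through NVRTC plus the driver API. A minimal standalone sketch of that general mechanism (not llama.cpp code, and it omits the caching and error handling a real integration would need):

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// A trivial kernel held as source text; a JIT approach would generate or
// select only the kernel variants actually needed at runtime.
static const char *kernel_src = R"(
extern "C" __global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
)";

int main() {
    // Compile the source to PTX at runtime with NVRTC.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernel_src, "scale.cu", 0, nullptr, nullptr);
    const char *opts[] = {"--gpu-architecture=compute_70"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the generated PTX with the driver API and launch the kernel.
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadData(&mod, ptx.data());
    CUfunction fn; cuModuleGetFunction(&fn, mod, "scale");

    const int n = 1024;
    CUdeviceptr d_x; cuMemAlloc(&d_x, n * sizeof(float));
    std::vector<float> h_x(n, 1.0f);
    cuMemcpyHtoD(d_x, h_x.data(), n * sizeof(float));

    float s = 2.0f; int n_arg = n;
    void *args[] = {&d_x, &s, &n_arg};
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuMemcpyDtoH(h_x.data(), d_x, n * sizeof(float));
    printf("x[0] = %f\n", h_x[0]); // expect 2.0

    cuMemFree(d_x);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

Linked with `-lnvrtc -lcuda`, this builds and launches the kernel at runtime; whether that approach pays off for something as heavily templated as FlashAttention is exactly the open question here.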
@LostRuins Right now I'm focusing on developing new features rather than copying code from upstream.
This issue was closed because it has been inactive for 14 days since being marked as stale.
The CUDA implementation for `GGML_OP_FLASH_ATTN_EXT` is as large as the rest of ggml-cuda combined. The heaviest function is this one:
llama.cpp/ggml-cuda/fattn.cu, lines 192 to 196 at 4426e29
GPU support for flash attention can't be included in llamafile because we deal with a 4GB limit on Windows.
For comparison, in December ggml-cuda.so built with `-arch=all` was 12mb. By February it was 16mb. By April it was 50mb. Now it's 90gb. On my project we've already started using gzip to compress the ggml-cuda DSO. We've also reduced our support vector to `-arch=all-major`. Everything that can be done is being done on our end, since I'd like to be able to include everything if possible. However, this op seems like it could benefit from a refactoring.