CUDA: assert when using batch size less than 129 #5140
Probably related to the following issue: #5086
As I noted in that PR, this is a fundamental issue with the way ggml-alloc works when the sizes of the tensors don't exactly match the sizes used to measure the buffer size, and fixing it will definitely require significant changes. I outlined the solution here:
This will require changing the way tensors are allocated, such that they are always assigned the same memory addresses regardless of their size. This will also introduce the restriction that all the graphs must have the exact same topology as the graph used to measure the buffer size, and that the tensor sizes must never be larger than the sizes in the measure graph. A consequence of that is that the way the K-shift is handled in llama.cpp will need to be changed. For now, if this use case is important, what we can do is add some margin to the sizes of the compute buffers (5-10% should do it). However, it is impossible for me to test every combination of model, batch size, and context size, all of which affect the sizes of the tensors used in the graph.
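To make the proposed scheme concrete, here is a minimal sketch in C (not ggml-alloc's actual code; the `fixed_alloc_t` type and its fields are made up for illustration, and alignment and free-block reuse are omitted): record a fixed offset per tensor during the measure pass, then return the same offset during evaluation while requiring that no tensor is larger than it was when measured.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TENSORS 1024

typedef struct {
    size_t offset[MAX_TENSORS];   // offset assigned at measure time
    size_t max_size[MAX_TENSORS]; // size observed at measure time
    size_t total;                 // total buffer size after measuring
    int    measuring;             // 1 while measuring, 0 while evaluating
} fixed_alloc_t;

// During measure: record an offset and the measured size for this tensor slot.
// During eval: hand back the recorded offset, enforcing the size limit.
static size_t fixed_alloc(fixed_alloc_t *a, int tensor_id, size_t size) {
    if (a->measuring) {
        a->offset[tensor_id]   = a->total;
        a->max_size[tensor_id] = size;
        a->total += size; // alignment omitted for brevity
    } else {
        // the eval graph must have the same topology as the measure graph,
        // and no tensor may be larger than it was when measured
        assert(size <= a->max_size[tensor_id]);
    }
    return a->offset[tensor_id];
}
```

Under a scheme like this, a graph with a different topology or a tensor larger than its measured size is rejected outright, instead of failing somewhere inside the allocator due to fragmentation.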
Yeah, it seems like a very tricky problem. Surprisingly, the strategy of greedily assigning the tensors to the best-fit block (i.e. the one with the least amount of bytes wasted) does not always lead to a solution, because AFAICT it depends on whether a block will be reused later on or not. Somehow it feels like there should be an elegant solution, but I can't see it. Would a 5-10% increase suffice? In this case it's more like 35% that is needed.
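For reference, the greedy best-fit strategy described above boils down to something like the following sketch (hypothetical structures, not the actual ggml-alloc code): among the free blocks that are large enough, pick the one that leaves the fewest bytes unused. It minimizes waste for the current request, but it cannot account for whether the leftover fragment will still be usable by later allocations, which is why it does not always find a valid packing.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    size_t offset;
    size_t size;
} free_block_t;

// Return the index of the free block that fits `size` with the least waste,
// or -1 if no block is large enough.
static int best_fit(const free_block_t *blocks, int n, size_t size) {
    int best = -1;
    size_t best_waste = (size_t)-1;
    for (int i = 0; i < n; i++) {
        if (blocks[i].size >= size) {
            size_t waste = blocks[i].size - size;
            if (waste < best_waste) {
                best_waste = waste;
                best = i;
            }
        }
    }
    return best;
}

int main(void) {
    free_block_t blocks[] = { {0, 256}, {512, 192}, {1024, 64} };
    int i = best_fit(blocks, 3, 160);
    printf("best fit for 160 bytes: block %d\n", i); // block 1 (192 bytes, 32 wasted)
    return 0;
}
```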
Hard to tell if it is going to be enough in every case, but in this case 10% seems enough. The change in size may affect previous allocations, which results in less fragmentation (closer to the fragmentation observed during measure).

```diff
diff --git a/ggml-alloc.c b/ggml-alloc.c
index 60141a34..c9dc20b7 100644
--- a/ggml-alloc.c
+++ b/ggml-alloc.c
@@ -335,7 +335,7 @@ bool ggml_tallocr_is_measure(ggml_tallocr_t alloc) {
 }
 
 size_t ggml_tallocr_max_size(ggml_tallocr_t alloc) {
-    return alloc->max_size;
+    return alloc->max_size + alloc->max_size/10;
 }
 
 // graph allocator
```
In the few cases that I tried, it does solve the problem. Let's push this for now and make a note to resolve it properly in the future.
Ok, do you want me to open a PR?
Yes, go ahead.
It's also possible that expanding the graphs in a different topological order, to minimize the number of tensors allocated at any moment, could reduce fragmentation and alleviate this issue. I experimented with this in ggerganov/ggml#462 (comment). Ultimately, I think that the only way to provide a strong guarantee that allocations will never fail is to add the restrictions mentioned earlier (same graph topology, and every tensor never bigger than in the measure graph). We cannot implement very complicated logic in either the graph expansion or the tensor allocation, because this is something that needs to be very fast, since it is done for every evaluation.
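As a toy illustration of the first point (hypothetical sizes, not ggml code): with one large branch and one small branch feeding a final node, evaluating and reducing the large branch before allocating the small one gives a lower peak of live tensor bytes than the opposite order, which is why the expansion order can matter for the allocator.

```c
#include <stdio.h>
#include <stddef.h>

// Each step allocates one tensor and then frees the inputs that are no
// longer needed; we track the peak number of live bytes across the steps.
typedef struct { size_t alloc, freed; } step_t;

static size_t peak_live(const step_t *steps, int n) {
    size_t live = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        live += steps[i].alloc;
        if (live > peak) peak = live;
        live -= steps[i].freed;
    }
    return peak;
}

int main(void) {
    // Graph: a = big(x) (400 B), c = reduce(a) (100 B, frees a),
    //        b = small(x) (100 B), out = c + b (100 B, frees c and b).
    // Order 1: a, c, b, out -> peak 500 B
    step_t big_branch_first[]   = { {400, 0}, {100, 400}, {100, 0}, {100, 200} };
    // Order 2: b, a, c, out -> peak 600 B (a and b are live at the same time)
    step_t small_branch_first[] = { {100, 0}, {400, 0}, {100, 400}, {100, 200} };

    printf("peak, big branch first:   %zu bytes\n", peak_live(big_branch_first, 4));
    printf("peak, small branch first: %zu bytes\n", peak_live(small_branch_first, 4));
    return 0;
}
```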
I can still hit the issue using e76627b.
@RodolfoCastanheira that's simply not possible; the error conditions referenced here no longer exist. Please open a new issue and explain what you are observing in detail.
You're right, I tested with an older build, my bad. |
This issue is stale because it has been open for 30 days with no activity. |
The model fully fits in VRAM with a lot of room to spare. Running on a 16 GB GPU, Mistral-7B quantized with any quantization:
This is reproducible for any batch size <= 128.