llamamodel: prevent CUDA OOM crash by allocating VRAM early #2393

Draft · wants to merge 5 commits into main

Conversation

@cebtenzzre (Member) commented May 30, 2024

This is a proposed fix for the issue where CUDA OOM can happen later than expected and crash GPT4All. The question is whether the benefit (falling back early instead of crashing later) is worth the load latency cost.

After a model is loaded onto a CUDA device, we run one full batch of (meaningless) input through it. Small batches don't use as much VRAM, and llama.cpp seems to allocate the full KV cache for the context regardless of where in the context the input lies, so n_batch matters a lot while n_past seems not to matter at all.
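To make that concrete, here is a minimal sketch of such a probe against the stock llama.cpp C API. The function name testModel and the exact error handling are assumptions for illustration, not the PR's actual code:

```cpp
#include "llama.h"

// Sketch: after the model and context are created on the CUDA device, decode one full
// batch of throwaway tokens so the backend allocates its worst-case compute buffers up
// front. If that allocation fails, the caller can fall back to CPU before the user
// ever sends a prompt.
static bool testModel(llama_context *ctx, int32_t n_batch) {
    llama_batch batch = llama_batch_init(n_batch, /*embd*/ 0, /*n_seq_max*/ 1);

    const llama_token bos = llama_token_bos(llama_get_model(ctx));
    for (int32_t i = 0; i < n_batch; i++) {
        batch.token   [i]    = bos;   // token values are meaningless; only the batch shape matters
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = false; // no logits needed for a warm-up pass
    }
    batch.n_tokens = n_batch;

    // On OOM the decode fails (or the backend aborts/throws, depending on how it was
    // built); either way the failure surfaces here instead of mid-generation.
    const bool ok = llama_decode(ctx, batch) == 0;

    llama_kv_cache_clear(ctx);   // discard the meaningless KV-cache entries
    llama_batch_free(batch);
    return ok;
}
```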

The call to testModel() shows up in the UI as the progress bar sitting near 100% before the load completes. With 24 layers of Llama 3 8B, this takes about 2 seconds on my GTX 970 and 0.3 seconds on my Tesla P40. The worst case under high memory pressure with a batch size of 512 (which I had to patch in, since the upper limit is normally 128) is about 11.2 seconds; at a batch size of 128 I have seen it take as long as 7.6 seconds.

Testing

You can test this PR by choosing a model that does not fit in your card's VRAM and finding a number of layers to offload that just barely doesn't fit. On the main branch, GPT4All can crash either during load or when you send input to it. With this PR, an exception is logged to the console during testModel() and GPT4All falls back to CPU, as it already does for Kompute.
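For illustration, a hypothetical caller-side sketch of that fallback, reusing testModel() from the sketch above. The helper name loadWithFallback and the zero-offload reload are stand-ins for gpt4all's real fallback path, which lives a layer higher in the application:

```cpp
#include "llama.h"
#include <cstdio>

// Hypothetical sketch: probe the freshly loaded GPU context with testModel(), and
// reload with zero offloaded layers (CPU only) if the probe fails.
static llama_context *loadWithFallback(const char *path, llama_model_params mparams,
                                       llama_context_params cparams, int32_t n_batch) {
    llama_model   *model = llama_load_model_from_file(path, mparams);
    llama_context *ctx   = model ? llama_new_context_with_model(model, cparams) : nullptr;

    if (ctx && !testModel(ctx, n_batch)) {
        fprintf(stderr, "llamamodel: device cannot hold the requested layers, falling back to CPU\n");
        llama_free(ctx);
        llama_free_model(model);
        mparams.n_gpu_layers = 0;   // assumption: a zero-offload reload stands in for the CPU fallback
        model = llama_load_model_from_file(path, mparams);
        ctx   = model ? llama_new_context_with_model(model, cparams) : nullptr;
    }
    return ctx;
}
```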

Signed-off-by: Jared Van Bortel <jared@nomic.ai> (all 5 commits)
@cebtenzzre cebtenzzre requested a review from manyoso May 30, 2024 22:18
@cebtenzzre cebtenzzre changed the title llamamodel: prevent CUDA OOM crash by eagerly allocating VRAM llamamodel: prevent CUDA OOM crash by allocating VRAM early May 30, 2024
@manyoso manyoso requested a review from apage43 May 31, 2024 17:30
@manyoso (Collaborator) commented May 31, 2024

The latency is quite unfortunate...

@cebtenzzre cebtenzzre marked this pull request as draft June 4, 2024 18:42
@cebtenzzre (Member, Author) commented
Marking as draft because the master branch of llama.cpp-mainline had to be rolled back in favor of higher-priority changes. The code itself is still reviewable.
