backend : offload large batches to GPU #6083
Conversation
This is the first step to allow the CUDA backend to free its resources when its ggml-backend objects are deleted. Currently, the CUDA backend allocates many resources as globals to support this feature.
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9e92acc0..13640f98 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -82,6 +82,10 @@
#define cudaGetDeviceProperties hipGetDeviceProperties
#define cudaGetErrorString hipGetErrorString
#define cudaGetLastError hipGetLastError
+#define cudaHostRegister hipHostRegister
+#define cudaHostRegisterPortable hipHostRegisterPortable
+#define cudaHostRegisterReadOnly hipHostRegisterReadOnly
+#define cudaHostUnregister hipHostUnregister
#define cudaLaunchHostFunc hipLaunchHostFunc
#ifdef GGML_HIP_UMA
#define cudaMalloc hipMallocManaged
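These new aliases matter because the PR registers the model's (possibly mmap'd) memory as pinned host memory so that host-to-device copies can use DMA, and the hipify-style macros let the same call compile for ROCm. A minimal sketch of how such a registration might look (the function names here are made up for illustration; the actual ggml-cuda code differs):

// Sketch only: pin a host memory region so uploads to the GPU are faster.
// The flags are the ones aliased in the diff above. Pinning is treated as an
// optimization: on failure we clear the error and fall back to pageable memory.
#include <cuda_runtime.h>
#include <cstdio>

static bool try_pin_host_memory(void * addr, size_t size) {
    cudaError_t err = cudaHostRegister(addr, size,
        cudaHostRegisterPortable | cudaHostRegisterReadOnly);
    if (err != cudaSuccess) {
        fprintf(stderr, "warning: failed to pin %zu bytes: %s\n",
                size, cudaGetErrorString(err));
        (void) cudaGetLastError(); // clear the error state
        return false;
    }
    return true;
}

static void unpin_host_memory(void * addr) {
    cudaHostUnregister(addr);
}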
Wow, I'm speechless. This is beyond incredible and a HUGE leap forward!
Speed before this PR:
That's indeed double the prompt processing speed! (5 layers offloaded with an RTX 2060 laptop and Mixtral.) Thank you so much Slaren!!
On my A6000 (using stock settings) there's a 0.31 tokens per second eval time regression for a 70B model. This 0.31 tps is consistent on just about every run. ./main -ngl 99 -m /mnt/40TB/AI/MiquMaid-v2-70B-DPO/ggml-model-Q4_K_M.gguf -p "Write a long story on why the sky is red."
A6000 + A4000. Again there's a regression of around 0.30 tps, and load time is also longer on this PR.
@USBhost should be fixed now. Interestingly, this was caused by an increase to
I just tried this PR. I'm not sure what fixed it, but I don't get the error reported here (#5701) with benchmark-matmult. It now completes with ROCm/7900XTX. With master I see the same abort error; with this PR it works fine.
@tbocek unfortunately that has not really been fixed.
Yeah, that fixed it, thanks! It feels just a tad faster than master. But that load time is still looking sus...
For some reason, my computer really doesn't like this PR though. After text generation, the terminal doesn't accept any input anymore and I can't start browsers. I have to restart it, which takes much longer than usual. I'm using Linux Pop!_OS 22.04 LTS.
@Dampfinchen try setting the environment variable GGML_CUDA_NO_PINNED.
Yep, that fixes it! Thanks!
You can also try
Using Radeon VII, I can confirm this does offer a major speedup on prompt processing, although it does seem to reduce the token generation speed by just a bit.
Tested with Vulkan, partial offload (7 layers, 7B model, Q6_K version, 478 tokens of prompt). On my low-end GPU (1060 3 GB) there seems to be almost no difference. Looks like this PR would only help with more layers offloaded (and on better hardware), but it works so far without problems.
Vulkan supports offloading large batches automatically, but it has its own implementation. Only the CUDA backend supports the functionality added by this PR. Other backends will need to implement a (very simple) offload_op function (lines 11391 to 11401 in dc93f5a).
However, for this to work properly, backends need to be able to execute many graphs with little overhead, since this will result in a very large number of graph splits (hundreds, at least one for each weight).
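As a rough illustration of the kind of callback being described (the exact signature and the threshold of 32 are assumptions for this sketch, not copied from the CUDA backend), a backend could report that an op is worth offloading only when the batch dimension is large enough to amortize the weight transfer:

// Sketch of an offload_op callback; signature and threshold are illustrative.
#include "ggml.h"
#include "ggml-backend.h"

#define EXAMPLE_MIN_OFFLOAD_BATCH 32  // assumed threshold, not the real value

static bool example_backend_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    // For matrix multiplications, ne[1] of the result is the batch (token) dimension.
    // Offloading only pays off when the batch is large enough to amortize copying
    // the weights to VRAM.
    switch (op->op) {
        case GGML_OP_MUL_MAT:
        case GGML_OP_MUL_MAT_ID:
            return op->ne[1] >= EXAMPLE_MIN_OFFLOAD_BATCH;
        default:
            return false;
    }
}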
Ok, thanks for explaining - I saw that and decided to test just in case.
I'm not seeing any meaningful difference in prompt processing, but with
Results:
dc93f5a to c0fe629
@8XXD8 this only affects prompt processing with partial offloading. Full offloading is unchanged. The issue with
I think this PR breaks imatrix when partially offloading; I am getting smaller imatrix files with lots of missing info for some tensors.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Yes, it does. When the weights are copied to the GPU, the name of the tensor is different (for example |
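The "imatrix : remove sched affix from weight names" change in this PR addresses that mismatch. Purely as an illustration of the idea (the "<backend>#<name>" format assumed below is a guess, not necessarily the scheduler's actual naming scheme), the importance matrix collector could normalize names like this:

// Illustrative only: map a scheduler-created copy's name back to the original
// weight name so imatrix statistics accumulate under a single key.
// Assumption: copies are named "<backend>#<original name>".
#include <string>

static std::string strip_sched_affix(const std::string & name) {
    const size_t pos = name.find_last_of('#');
    return pos == std::string::npos ? name : name.substr(pos + 1);
}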
reduce max inputs per split
more cleanup
1fc2eef to d4e9187
I think the ggml-ci cuda-v100 runner has some issue; the logs say
I think I fixed the drivers and restarted the job. Will review the PR tomorrow.
ggml-ci
d4e9187 to cc9299c
@0cc4m it should be possible to adapt the Vulkan backend now to use this and remove
Are there plans to also implement pre-loading the data for the next layer as the current one is being processed? Since prompt processing is compute bound, it should theoretically be possible to achieve ~100% GPU speed even at 0 GPU layers. The tradeoff would be that VRAM usage goes up, so you would be able to offload fewer layers, which in turn makes generation slower.
We should implement that for sure. With a large enough batch size we could reach close to the batch performance of full offload, which could have a significant impact. It's not an immediate priority for me right now, but I will work on this eventually if nobody does it before.
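Neither comment includes code, so here is a rough, hypothetical sketch of the double-buffering idea with two CUDA streams: the weights of layer il+1 are uploaded into a spare staging buffer while layer il is being computed. None of these names exist in ggml; buffer sizing, error checking, and integration with the scheduler are all omitted.

// Hypothetical double-buffered prefetch of layer weights (not ggml code).
#include <cuda_runtime.h>

struct layer_weights { const void * host; size_t size; };

static void process_layers(const layer_weights * layers, int n_layers, void * dev_buf[2],
                           void (*compute_layer)(int il, const void * weights, cudaStream_t s)) {
    cudaStream_t compute, upload;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&upload);

    cudaEvent_t upload_done[2], compute_done[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreate(&upload_done[i]);
        cudaEventCreate(&compute_done[i]);
        cudaEventRecord(compute_done[i], compute); // both staging buffers start out free
    }

    for (int il = 0; il < n_layers; ++il) {
        const int buf = il % 2;
        // The upload of layer il only has to wait until compute released this
        // staging buffer (last used by layer il-2), so it overlaps with layer il-1.
        cudaStreamWaitEvent(upload, compute_done[buf], 0);
        cudaMemcpyAsync(dev_buf[buf], layers[il].host, layers[il].size,
                        cudaMemcpyHostToDevice, upload);
        cudaEventRecord(upload_done[buf], upload);

        // Compute waits only for its own layer's upload.
        cudaStreamWaitEvent(compute, upload_done[buf], 0);
        compute_layer(il, dev_buf[buf], compute);
        cudaEventRecord(compute_done[buf], compute);
    }
    cudaStreamSynchronize(compute);

    for (int i = 0; i < 2; ++i) {
        cudaEventDestroy(upload_done[i]);
        cudaEventDestroy(compute_done[i]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(upload);
}

The VRAM cost is just the two staging buffers, which matches the tradeoff mentioned above: the more of the VRAM budget goes to prefetch buffers, the fewer layers can stay resident for generation.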
Regarding my previous comment: looking at some profiling data suggests that it won't be quite as simple. With a Ryzen 5950X, 3200 MHz dual-channel RAM, and an RTX 3090, the amount of time spent on memory transfers currently seems to be significantly larger than the amount of time spent on compute. Also, there are still significant gaps where the GPU is idling and the CPU seems to be doing some work.
I don't know what batch size you are using, but with a large enough batch size, I can already see over 50% utilization with
I was using a batch size of 512 for the
No, the area that I was showing was from the middle of the calculation. Also, I am seeing the same gaps with
The total runtime was 67.02 s so
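A back-of-the-envelope model of this tradeoff (the symbols and framing here are illustrative, not taken from the thread): offloading a weight matrix costs a one-off transfer per batch, while the useful work grows with the number of tokens,

\[
t_\text{transfer} \approx \frac{W}{B_\text{PCIe}}, \qquad
t_\text{compute,GPU} \approx \frac{2 P\, n_\text{batch}}{F_\text{GPU}}, \qquad
t_\text{compute,CPU} \approx \frac{2 P\, n_\text{batch}}{F_\text{CPU}},
\]

where \(W\) is the size in bytes of the weights copied, \(P\) the number of parameters they encode, \(B_\text{PCIe}\) the host-to-device bandwidth, and \(F\) the respective throughputs. Offloading pays off roughly when \(t_\text{transfer} + t_\text{compute,GPU} < t_\text{compute,CPU}\), which only holds for a sufficiently large \(n_\text{batch}\); when the transfer dominates, as in the profile described above, hiding it behind compute (prefetching) or increasing the batch size are the two available levers.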
It's the
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
diff --git a/llama.cpp b/llama.cpp
index cd7a7b8d..bd0847bb 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5428,6 +5428,10 @@ static void llm_build_kv_store(
cb(v_cache_view, "v_cache_view", il);
// important: storing RoPE-ed version of K in the KV cache!
+ k_cur = ggml_cast(ctx, k_cur, k_cache_view->type);
+ v_cur_t = ggml_cast(ctx, v_cur_t, v_cache_view->type);
+ ggml_build_forward_expand(graph, k_cur);
+ ggml_build_forward_expand(graph, v_cur_t);
ggml_build_forward_expand(graph, ggml_cpy(ctx, k_cur, k_cache_view));
ggml_build_forward_expand(graph, ggml_cpy(ctx, v_cur_t, v_cache_view));
}
* backend : offload large batches to GPU
* fix hip
* code cleanup
* fix CUDA split buffers
* Update ggml-backend-impl.h
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix memset without set_device
* imatrix : remove sched affix from weight names
* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup
* update backends
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Moves the logic of auto-offloading to the GPU when processing large batches to ggml_backend_sched. Currently only CUDA and Vulkan support this; this change will allow any backend to support the feature.
Instead of offloading only the matrix multiplications, the entire computation of the batch is offloaded. This reduces the amount of data that needs to be transferred between the GPU and CPU and improves performance significantly.
The weights are now copied to VRAM in the compute buffer, instead of the private CUDA pool buffer. As a result, the size of the compute buffers will increase significantly when offloading a model partially. However, the total VRAM usage should stay the same, or slightly lower.
Backends that wish to support this feature need to implement the offload_op function. Only the CUDA backend implements it at this point.
Additionally, the CUDA backend will now attempt to register the memory of the models as a host pinned buffer, even when using mmap. Previously, host buffers were only supported with mmap disabled. This further increases the performance of automatic offloading. The usage of host pinned memory can be disabled by defining the GGML_CUDA_NO_PINNED environment variable.
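As a small sketch of how the opt-out might be honored inside the backend (illustrative; the helper name is made up, though GGML_CUDA_NO_PINNED is the variable described above), the check can be a plain getenv:

// Sketch: pinned host memory is used unless GGML_CUDA_NO_PINNED is set.
#include <cstdlib>

static bool cuda_pinned_memory_enabled(void) {
    return std::getenv("GGML_CUDA_NO_PINNED") == nullptr;
}

For example, launching with GGML_CUDA_NO_PINNED=1 ./main ... would then fall back to pageable host memory.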
RTX 3090 Ti, CUDA under WSL:
Raw data (benchmark runs from builds 4755afd (2431), 46acb36 (2437), and 7664a45b (2441); 70B Q4_0)