The inference performance of 8xH100+nvlink is worse than that of 4xA100 pcie #4747
I didn't test or optimize the CUDA code for H100s or A100s. But I would very much suspect that on such fast GPUs, for a 7B q4_K_M model, the synchronization overhead is higher than any potential speed gain. Just run models on a single GPU if you can.
@JohannesGaessler Yes, it is a synchronization overhead issue. I just tested with a single A100 and it performed much better than 4 GPUs (72 vs. 31 tokens/second). Thanks a lot.
It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token, for each GPU.
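As an aside for readers, the effect is easy to reproduce outside of llama.cpp. The sketch below (my own illustration, not llama.cpp code) launches the same trivial CUDA kernel a couple of thousand times, once with a host-side synchronization after every launch and once with a single synchronization at the end; on a fast GPU the first variant is dominated almost entirely by launch and sync overhead rather than compute.

```cuda
// Hypothetical stand-alone microbenchmark (not llama.cpp code): compares many
// tiny kernel launches with a host sync after each one against the same
// launches with a single sync at the end.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny_kernel(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 1.0001f + 0.5f;   // trivial work: runtime is dominated by overhead
}

static double run(bool sync_every_launch, float *d, int n, int iters) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        tiny_kernel<<<n / 256, 256>>>(d);
        if (sync_every_launch) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 14;     // deliberately small tensor
    const int iters = 2000;    // "thousands of synchronizations per token"
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    printf("sync after every launch: %.2f ms\n", run(true,  d, n, iters));
    printf("one sync at the end:     %.2f ms\n", run(false, d, n, iters));
    cudaFree(d);
    return 0;
}
```

On a GPU as fast as an A100 or H100 the kernel itself is essentially free, so the gap between the two numbers is almost entirely the per-launch synchronization cost that multi-GPU tensor splitting pays thousands of times per token.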
Can TensorRT-LLM solve this issue?
I'd say that's slightly unrelated, because llama.cpp uses custom kernels for its custom quantizations. I don't know much about Nvidia's solution, but my guess is that it operates at fp16 and might support fp8 on the latest generation.

No, the solution is layer-wise splitting of tensors (#4055). I first stumbled upon this mechanism when I attempted to add broadcasted multiplication (for Falcon) into the GPU kernel and realized I was looking at ten thousand GPU synchronizations among my 2 GPUs for just one token. These synchronizations alone made it slower than CPU-bound computation of the same tensor.

The solution is to give up on the highly complex tensor splitting and instead split the computation by layers. This means a card does not have to synchronize hundreds to thousands of times; it just needs to receive one tensor at the beginning and deliver the result at the end. The EXL2 framework uses layer splitting for that reason. I recently asked the author, and he assumes that running a 7B model on 8 H100 cards is as fast as on 1 H100 card (no benefit, no slowdown).

So, in my opinion, the solution is to implement the simpler layer split into llama.cpp. However, that currently has no support, and I lack the time for a full implementation that might not even get accepted, as it would have to dig deep into offloading and the OP functions.
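To make the contrast concrete, here is a minimal two-GPU sketch of the layer-split idea (my own illustration under simplifying assumptions, not the llama.cpp implementation: fake_layer is just a stand-in for a transformer layer, and the layer count and activation size are made up). Each GPU runs its contiguous block of layers back to back, and the only inter-GPU traffic per token is a single activation copy at the boundary.

```cuda
// Conceptual layer-split sketch, not llama.cpp code: one activation hand-off
// per token instead of per-operation gathers and host syncs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fake_layer(float *act, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) act[i] = act[i] * 0.999f + 0.001f;   // stand-in for one transformer layer
}

int main() {
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);
    if (n_dev < 2) { printf("this sketch needs 2 GPUs\n"); return 0; }

    const int n = 4096;             // hidden-state size (illustrative)
    const int layers_per_gpu = 16;  // e.g. a 32-layer model split in half
    float *act0, *act1;
    cudaSetDevice(0); cudaMalloc(&act0, n * sizeof(float)); cudaMemset(act0, 0, n * sizeof(float));
    cudaSetDevice(1); cudaMalloc(&act1, n * sizeof(float));

    // One "token": GPU 0 runs its layers back to back with no host sync ...
    cudaSetDevice(0);
    for (int l = 0; l < layers_per_gpu; ++l) fake_layer<<<(n + 255) / 256, 256>>>(act0, n);
    // ... then a single activation hand-off to GPU 1 ...
    cudaMemcpyPeer(act1, 1, act0, 0, n * sizeof(float));
    // ... and GPU 1 finishes the remaining layers.
    cudaSetDevice(1);
    for (int l = 0; l < layers_per_gpu; ++l) fake_layer<<<(n + 255) / 256, 256>>>(act1, n);
    cudaDeviceSynchronize();
    printf("token finished with a single inter-GPU copy\n");

    cudaSetDevice(0); cudaFree(act0);
    cudaSetDevice(1); cudaFree(act1);
    return 0;
}
```

With the tensor (row) split, every matrix multiplication instead produces partial results on each GPU that have to be gathered and synchronized before the next operation, which is where the hundreds to thousands of sync points per token come from.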
Layer splitting will be added in #4766
Wow, great job. I've lobbied for that for quite a while.
My general stance on things like this that I don't consider a priority is that I won't implement it myself, but I will still review other people's PRs if they want to implement it.
I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second > 66 tokens/second). Can someone explain this? Thanks. MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
The new backend will resolve the parallelism problems, and once we have pipelining it should also significantly speed up large-context processing. Regarding your A100 and H100 results: those GPUs typically perform similarly to the 3090 and the 4090, so both of those results look too slow, assuming you use full GPU offload (-ngl).
@cmp-nct Here are the clock and power readings of the A100 system, from nvidia-smi -q -d CLOCK (two NVSMI log snapshots, Wed Jan 10 08:58:04 and 09:03:55 2024, each listing 4 attached GPUs at 00000000:52:00.0, 00000000:D5:00.0, and 00000000:D6:00.0; the per-GPU clock and power values did not survive the paste).
I do not have an A100 or H100 system as a reference; I'm using the slightly cheaper 4090/3090 :) The power target appears to be too low: an A100 should be 400W according to Google, and the H100 should be 350W. I found contradictory information, as some servers are at 350W and some at 400W.
@cmp-nct When will the new backend be released? Do you have a schedule? Thanks
The change that allows splitting models across multiple GPUs at the layer level has already been merged, and this is now the default behavior when using multiple GPUs with llama.cpp. There is another change in the works (#4918) that will enable pipeline parallelism to improve multi-GPU performance when processing large batches or prompts.
Just as Slaren said, that's the answer. Slaren made a beautiful implementation of it, and it already works great. With the pipeline feature, llama.cpp will be useful even in real power servers.
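For readers wondering what pipeline parallelism buys here, below is a rough two-GPU sketch of the idea (my own illustration with made-up sizes, not the code from #4918): a large batch is split into micro-batches so that while GPU 1 is still running its layers for micro-batch k, GPU 0 can already start on micro-batch k+1, with CUDA events ordering the hand-offs instead of host-side synchronization.

```cuda
// Conceptual pipelining sketch across 2 GPUs, not the llama.cpp implementation.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stage(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.999f + 0.001f;   // stand-in for one GPU's block of layers
}

int main() {
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);
    if (n_dev < 2) { printf("this sketch needs 2 GPUs\n"); return 0; }

    const int n = 1 << 16;         // activations of one micro-batch (illustrative size)
    const int micro_batches = 8;
    float *buf0, *buf1;
    cudaStream_t s0, s1;
    cudaEvent_t handoff[micro_batches], consumed[micro_batches];

    cudaSetDevice(0);
    cudaMalloc(&buf0, n * sizeof(float)); cudaStreamCreate(&s0);
    for (int k = 0; k < micro_batches; ++k) cudaEventCreateWithFlags(&handoff[k], cudaEventDisableTiming);
    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(float)); cudaStreamCreate(&s1);
    for (int k = 0; k < micro_batches; ++k) cudaEventCreateWithFlags(&consumed[k], cudaEventDisableTiming);

    for (int k = 0; k < micro_batches; ++k) {
        cudaSetDevice(0);
        // First half of the model for micro-batch k.
        stage<<<(n + 255) / 256, 256, 0, s0>>>(buf0, n);
        // Don't overwrite GPU 1's input before it has consumed the previous micro-batch.
        if (k > 0) cudaStreamWaitEvent(s0, consumed[k - 1], 0);
        cudaMemcpyPeerAsync(buf1, 1, buf0, 0, n * sizeof(float), s0);
        cudaEventRecord(handoff[k], s0);

        cudaSetDevice(1);
        // Second half of the model: starts as soon as *this* micro-batch arrives,
        // while GPU 0 already moves on to micro-batch k+1.
        cudaStreamWaitEvent(s1, handoff[k], 0);
        stage<<<(n + 255) / 256, 256, 0, s1>>>(buf1, n);
        cudaEventRecord(consumed[k], s1);
    }

    cudaSetDevice(0); cudaStreamSynchronize(s0);
    cudaSetDevice(1); cudaStreamSynchronize(s1);
    printf("pipelined %d micro-batches across 2 GPUs with no per-op host syncs\n", micro_batches);
    return 0;
}
```

Without the micro-batching, GPU 0 would sit idle while GPU 1 finishes the whole batch, which is why a plain layer split alone helps memory capacity and single-token latency but not large-batch throughput.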
I confirm the problem: the results with the H100 are worse than the results on the A100. I had 4 x A100 PCIe and switched to 4 x H100 hoping for better results with llama.cpp, but it's quite the opposite. Has anyone found the cause of this problem or a solution?
@jughurta Did you find a solution?
I tested llama.cpp on two systems, one with 4x A100 GPUs and the other with 8x H100 GPUs. The test results show that the inference performance of 8x H100 + NVLink (21 tokens per second) is worse than that of 4x A100 PCIe (31 tokens per second), which is very strange! Can anyone help explain this behavior? How can I improve the H100 results? Thanks