Description
I am getting a slower TPS when using multiple GPUs than when using a single GPU (selected via `CUDA_VISIBLE_DEVICES`).
| No. of GPUs | TPS (generation) |
|---|---|
| 1 | 13.48 |
| 2 | 10.14 |
| 3 | 9.69 |
| 4 | 9.23 |
I have done multiple runs, so each TPS value is an average.
The command and output are as follows (omitting the output of the 2- and 3-GPU runs):
Note: `--n-gpu-layers` is 76 for all runs so that the model fits onto a single A100. This should not affect the results: I observed the same slowdown with smaller models where all layers are offloaded to the GPU.
4 GPUs
```
$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76
<truncated>
Log start
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0
<truncated>
llama_print_timings: load time = 11896.31 ms
llama_print_timings: sample time = 126.25 ms / 128 runs ( 0.99 ms per token, 1013.87 tokens per second)
llama_print_timings: prompt eval time = 570.27 ms / 6 tokens ( 95.04 ms per token, 10.52 tokens per second)
llama_print_timings: eval time = 13757.15 ms / 127 runs ( 108.32 ms per token, 9.23 tokens per second)
llama_print_timings: total time = 14653.54 ms
```
1 GPU
```
$ CUDA_VISIBLE_DEVICES=GPU-0870b5a7-7e03-79d9-d3b2-e1277c9ca547 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76
<truncated>
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
<truncated>
llama_print_timings: load time = 11464.86 ms
llama_print_timings: sample time = 127.03 ms / 128 runs ( 0.99 ms per token, 1007.66 tokens per second)
llama_print_timings: prompt eval time = 584.76 ms / 6 tokens ( 97.46 ms per token, 10.26 tokens per second)
llama_print_timings: eval time = 9420.06 ms / 127 runs ( 74.17 ms per token, 13.48 tokens per second)
llama_print_timings: total time = 10333.01 ms
```
I read in `llama.cpp` (https://github.com/ggerganov/llama.cpp/blob/6eeb4d90839bac1e6085e5544654ab5c319ad09a/llama.cpp#L2041) that it seems to split up the tensors of each layer and spread them across the GPUs.
I suppose the slowdown comes from the synchronization steps required by this tensor split.
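To make my mental model concrete, here is a minimal, self-contained sketch (not the actual `llama.cpp` code, just my reading of it) of what a row-wise split looks like: each GPU gets a contiguous slice of rows of every weight matrix according to the split fractions, and the partial results then have to be combined for every token.

```cpp
// Illustrative sketch only: how a row-wise tensor split assigns contiguous
// row ranges of one weight matrix to each GPU. `tensor_split` is assumed to
// hold per-GPU fractions that sum to 1.
#include <cstdio>
#include <vector>

int main() {
    const int nrows = 8192;  // rows of one weight matrix
    const std::vector<float> tensor_split = {0.25f, 0.25f, 0.25f, 0.25f};

    float acc = 0.0f;
    for (size_t gpu = 0; gpu < tensor_split.size(); ++gpu) {
        const int row_begin = (int)(acc * nrows);
        acc += tensor_split[gpu];
        const int row_end = (gpu + 1 == tensor_split.size()) ? nrows : (int)(acc * nrows);
        // Each GPU multiplies its row slice against the full activations;
        // the partial outputs are then gathered on the main GPU, which is the
        // per-token synchronization point I suspect is the bottleneck.
        printf("GPU %zu: rows [%d, %d)\n", gpu, row_begin, row_end);
    }
    return 0;
}
```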
Could it be a faster strategy to load whole layers onto the GPUs and divide all layers across them?
For example, with 83 layers and 4 GPUs, GPU 0 could take 20 layers, and GPUs 1, 2, and 3 could take 21 layers each.
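A minimal sketch of the assignment I have in mind (hypothetical code, just to illustrate the split, not an existing `llama.cpp` API):

```cpp
// Distribute whole layers across GPUs as evenly as possible.
#include <cstdio>

int main() {
    const int n_layers = 83;
    const int n_gpus   = 4;

    const int base = n_layers / n_gpus;  // 20 in this example
    const int rem  = n_layers % n_gpus;  // 3: the last `rem` GPUs take one extra layer

    int layer = 0;
    for (int gpu = 0; gpu < n_gpus; ++gpu) {
        const int count = base + (gpu >= n_gpus - rem ? 1 : 0);  // 20, 21, 21, 21
        printf("GPU %d: layers [%d, %d)\n", gpu, layer, layer + count);
        layer += count;
    }
    return 0;
}
```

My assumption is that this would replace the per-tensor gather of partial results with a single hand-off of the hidden state at each GPU boundary.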
I would be more than happy to help implement this feature if it makes sense and if someone can point me in the right direction.