
Faster multi-gpu strategy? #3120

Closed

Description

@calvintwr

I am getting slower TPS when using multiple GPUs than when using a single GPU (selected via CUDA_VISIBLE_DEVICES).

| No. of GPUs | TPS (generation) |
|-------------|------------------|
| 1           | 13.48            |
| 2           | 10.14            |
| 3           | 9.69             |
| 4           | 9.23             |

I have done multiple runs, so the TPS is an average.

The command and output are as follows (omitting the output for the 2- and 3-GPU runs):

Note: --n-gpu-layers is 76 in all runs so that the model fits on a single A100. This should not affect the results; for smaller models where all layers are offloaded to the GPU, I observed the same slowdown.

4 GPUs

$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

Log start
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11896.31 ms
llama_print_timings:      sample time =   126.25 ms /   128 runs   (    0.99 ms per token,  1013.87 tokens per second)
llama_print_timings: prompt eval time =   570.27 ms /     6 tokens (   95.04 ms per token,    10.52 tokens per second)
llama_print_timings:        eval time = 13757.15 ms /   127 runs   (  108.32 ms per token,     9.23 tokens per second)
llama_print_timings:       total time = 14653.54 ms

1 GPU

$ CUDA_VISIBLE_DEVICES=GPU-0870b5a7-7e03-79d9-d3b2-e1277c9ca547 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11464.86 ms
llama_print_timings:      sample time =   127.03 ms /   128 runs   (    0.99 ms per token,  1007.66 tokens per second)
llama_print_timings: prompt eval time =   584.76 ms /     6 tokens (   97.46 ms per token,    10.26 tokens per second)
llama_print_timings:        eval time =  9420.06 ms /   127 runs   (   74.17 ms per token,    13.48 tokens per second)
llama_print_timings:       total time = 10333.01 ms

I read in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/6eeb4d90839bac1e6085e5544654ab5c319ad09a/llama.cpp#L2041) that it seems to split up the tensors of each layer and spread them across the GPUs.

I suppose the slowdown comes from the synchronization steps required by this tensor split.
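For context, here is a minimal sketch of my understanding of that split (a hypothetical illustration, not the actual llama.cpp code): each layer's weight matrices have their rows partitioned across the GPUs by split fractions, so every layer's matmul runs on all GPUs and the partial results have to be gathered per layer.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical illustration: split the rows of one layer's weight matrix
// across GPUs according to split fractions (similar in spirit to --tensor-split).
// Every GPU holds a slice of every layer, so each matmul touches all GPUs
// and the partial results must be synchronized for every layer.
int main() {
    const int n_rows = 8192; // rows of one weight matrix (made-up size)
    const std::vector<float> split = {0.25f, 0.25f, 0.25f, 0.25f};

    float acc = 0.0f;
    int row_begin = 0;
    for (size_t gpu = 0; gpu < split.size(); ++gpu) {
        acc += split[gpu];
        const int row_end = (gpu + 1 == split.size()) ? n_rows : (int)(acc * n_rows);
        printf("GPU %zu: rows [%d, %d) of every layer\n", gpu, row_begin, row_end);
        row_begin = row_end;
    }
    return 0;
}
```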

Could it be a faster strategy to load each layer as a whole onto a single GPU, and divide the layers across the GPUs?

For example, if there are 83 layers and 4 GPUs, GPU 0 can take 20 layers, and GPUs 1, 2, and 3 can take 21 layers each.
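A rough sketch of the layer assignment I have in mind (a hypothetical helper, not existing llama.cpp code); with 83 layers and 4 GPUs it produces 20/21/21/21:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch: assign whole layers to GPUs as evenly as possible.
// With 83 layers and 4 GPUs this gives 20, 21, 21, 21.
std::vector<int> layers_per_gpu(int n_layers, int n_gpus) {
    std::vector<int> counts(n_gpus, n_layers / n_gpus); // base share per GPU
    const int remainder = n_layers % n_gpus;             // leftover layers
    for (int i = 0; i < remainder; ++i) {
        counts[n_gpus - 1 - i] += 1;                     // give extras to the last GPUs
    }
    return counts;
}

int main() {
    const std::vector<int> counts = layers_per_gpu(83, 4);
    int first = 0;
    for (size_t gpu = 0; gpu < counts.size(); ++gpu) {
        printf("GPU %zu: layers %d-%d (%d layers)\n",
               gpu, first, first + counts[gpu] - 1, counts[gpu]);
        first += counts[gpu];
    }
    return 0;
}
```

Each GPU would then run its contiguous block of layers in sequence, so the only cross-GPU traffic would be passing the hidden state between blocks, rather than a per-layer gather.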

I would be more than happy to help implement this feature if it makes sense, and if I am pointed in the right direction.
