Closed
Description
Hi, thanks! I use vLLM to run inference on the llama-7B model on a single GPU, and with tensor parallelism on 2 GPUs and 4 GPUs. On a single GPU we found vLLM to be about 10x faster than HF, but with tensor parallelism there is no significant increase in token throughput. We understand that parallelism across GPUs expands the available memory so a larger batch of samples can be processed, while communication between the cards may reduce speed. Still, with 2 GPUs we would expect roughly a 1.5x speedup, yet throughput has stayed basically unchanged. Is this because our GPU KV cache usage is full, or is there another reason? Looking forward to your reply! A rough sketch of how we measure throughput is below.
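For reference, this is a minimal sketch of the kind of benchmark we run (the model path, prompt set, and batch size are placeholders; only `tensor_parallel_size` changes between runs):

```python
import time
from vllm import LLM, SamplingParams

# Placeholder prompts; our real test uses a larger set of actual samples.
prompts = ["Hello, my name is"] * 256
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# tensor_parallel_size=1 for a single GPU, 2 or 4 for tensor parallelism.
llm = LLM(model="huggyllama/llama-7b", tensor_parallel_size=2)

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

# Count generated tokens across all requests to compute throughput.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated_tokens / elapsed:.1f} tokens/s")
```

We compare the reported tokens/s between the single-GPU and multi-GPU runs; the number stays roughly flat as `tensor_parallel_size` increases.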