Tensor Parallelism vs Data Parallelism #367

Closed
@xcxhy

Description

Hi, thanks for the great work! I used vLLM to run inference with the LLaMA-7B model on a single GPU, and with tensor parallelism on 2 GPUs and 4 GPUs. On a single GPU, vLLM is about 10x faster than HF, but with tensor parallelism there is no significant increase in token throughput. We understand that data parallelism expands the available memory and lets us increase the batch of processed samples, while communication between GPUs may reduce speed. Even so, with 2 GPUs we would expect roughly a 1.5× speedup, but throughput has stayed basically unchanged. Is this because our GPU KV cache usage is full, or is there another reason? Looking forward to your reply!
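For concreteness, this is roughly the kind of comparison we are running, using vLLM's offline `LLM` API; the model path, prompt batch, and sampling parameters below are placeholders, not the exact benchmark:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder prompt batch for a throughput measurement.
prompts = ["Hello, my name is"] * 256
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Single-GPU baseline; for the 2-GPU run we instead pass
# tensor_parallel_size=2 (and 4 for the 4-GPU run).
llm = LLM(model="huggyllama/llama-7b")

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

# Count generated tokens to compute throughput in tokens/s.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated_tokens / elapsed:.1f} tokens/s")
```

The single-GPU and tensor-parallel runs differ only in `tensor_parallel_size`, and the measured tokens/s is what stays roughly flat across 1, 2, and 4 GPUs.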
