Proposal to improve performance
When we use tensor parallelism in vLLM, the driver worker needs to broadcast some metadata to all workers, such as the input metadata and the LoRA requests. This functionality is currently implemented in:
`def broadcast_tensor_dict(` in `vllm/vllm/distributed/communication_op.py`, line 143 at commit `9c7306a`.
In essence, it uses `torch.distributed.broadcast_object_list` to broadcast a Python object. This function has a lot of overhead. Roughly, the procedure is: pickle each object on the CPU, broadcast a tensor holding the pickled sizes, concatenate the pickled bytes into one tensor, move that tensor to the GPU (when using a NCCL group), broadcast it, then move it back to the CPU and unpickle on every receiving rank.
There are three layers of overhead (see the sketch after this list):
- device memory moves: pickle works only on CPU memory, so with a NCCL group the data has to be moved between CPU and GPU back and forth.
- the pickled data of multiple objects are concatenated, which costs one extra memory copy.
- two broadcast operations are needed: one to broadcast the size of each pickled object, and another to broadcast the data itself.
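For reference, here is a simplified sketch of what `torch.distributed.broadcast_object_list` does on a NCCL group; the function name and structure are illustrative, not the actual torch source:

```python
import pickle
import torch
import torch.distributed as dist

def broadcast_object_list_sketch(object_list, src, group):
    """Illustrative sketch of broadcast_object_list on a NCCL group."""
    rank = dist.get_rank()
    device = torch.device("cuda", torch.cuda.current_device())
    if rank == src:
        pickled = [pickle.dumps(obj) for obj in object_list]  # CPU-side pickling
        sizes = torch.tensor([len(p) for p in pickled],
                             dtype=torch.long, device=device)
    else:
        sizes = torch.empty(len(object_list), dtype=torch.long, device=device)
    dist.broadcast(sizes, src=src, group=group)  # broadcast #1: sizes
    if rank == src:
        # overhead 2: concatenate pickled payloads (extra memory copy),
        # overhead 1: move the bytes from CPU to GPU
        data = torch.frombuffer(bytearray(b"".join(pickled)),
                                dtype=torch.uint8).to(device)
    else:
        data = torch.empty(int(sizes.sum().item()),
                           dtype=torch.uint8, device=device)
    dist.broadcast(data, src=src, group=group)  # broadcast #2: payload
    if rank != src:
        raw = data.cpu().numpy().tobytes()  # overhead 1 again: GPU -> CPU
        offset = 0
        for i, n in enumerate(sizes.tolist()):
            object_list[i] = pickle.loads(raw[offset:offset + n])
            offset += n
```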
The current vLLM implementation packs the data into a list of size one, so overhead 2 is eliminated:
`vllm/vllm/distributed/communication_op.py`, lines 173 to 175 at commit `9c7306a`:

```python
torch.distributed.broadcast_object_list([metadata_list],
                                        src=src,
                                        group=group)
```
To remove overhead 1, we can use a CPU backend (gloo) to broadcast this kind of metadata.
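A minimal sketch of that idea: create a parallel gloo process group over the same ranks and use it for metadata only. The group creation assumes the default process group is already initialized, and `cpu_group` plus the example payload are names I made up:

```python
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called with the
# NCCL backend for tensor data; create a parallel gloo group over the
# same ranks so metadata broadcasts stay in CPU memory
cpu_group = dist.new_group(backend="gloo")

# with a gloo group, broadcast_object_list pickles and broadcasts
# entirely on the CPU, so no CPU<->GPU copies are involved
metadata_list = [{"num_seqs": 4, "is_prompt": True}]  # illustrative payload
dist.broadcast_object_list(metadata_list, src=0, group=cpu_group)
```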
In addition, if we know the rough size of the pickled object in advance, we can remove overhead 3 as well: only one broadcast is required, which is the optimal case for broadcasting a Python object. A sketch of this scheme follows.
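This is a minimal sketch of the single-broadcast scheme, assuming a gloo CPU group; `MAX_METADATA_BYTES` and `broadcast_metadata` are my own illustrative names, not vLLM's:

```python
import pickle
import torch
import torch.distributed as dist

MAX_METADATA_BYTES = 1 << 16  # assumed upper bound on the pickled size

def broadcast_metadata(obj=None, src=0, group=None):
    """Broadcast a small Python object with a single CPU broadcast.

    A fixed-size uint8 buffer (8-byte length header + payload) is
    broadcast once; no size-exchange round trip is needed because
    every rank already knows the buffer size.
    """
    rank = dist.get_rank()
    buf = torch.zeros(8 + MAX_METADATA_BYTES, dtype=torch.uint8)
    if rank == src:
        data = pickle.dumps(obj)
        assert len(data) <= MAX_METADATA_BYTES, "object exceeds assumed bound"
        packed = len(data).to_bytes(8, "little") + data
        buf[:len(packed)] = torch.frombuffer(bytearray(packed),
                                             dtype=torch.uint8)
    dist.broadcast(buf, src=src, group=group)  # the only broadcast
    if rank != src:
        raw = bytes(buf.numpy())
        n = int.from_bytes(raw[:8], "little")
        obj = pickle.loads(raw[8:8 + n])
    return obj
```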
I wrote some benchmark code in https://gist.github.com/youkaichao/b33fcd70286eb45a4a2d5a6dc32d096b and the results are in https://docs.google.com/spreadsheets/d/1c9xgR0fGvm6SROfk7vrjwOZdYnKQk9oOafWK4_KgOyo/edit?usp=sharing .
The short conclusion is:
- using the CPU (gloo) to broadcast the data indeed works better than NCCL (GPU). For small metadata, the broadcast time drops from about 400us to about 300us.
- if we can estimate the rough size in advance, the broadcast time can be further reduced to about 100us. That requires us to design the object to be broadcast accordingly.
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`