Description
Motivation.
Summary
Currently, vLLM interprocess communication can account for a considerable amount of overhead in some cases. This RFC aims to reduce that overhead with a shared-memory-based approach to interprocess communication.
Background
According to profiling results on our internal vision model in a TP>1 setting, the GPU stays idle during engine-to-worker communication.
The overhead has two major parts: 1. IPC between the engine and worker processes over a socket, and 2. serialization and deserialization through pickle.
A similar issue was reported in #16626.
Proposed Change.
After initial discussion with @ywang96 and @njhill, this change is proposed to address the following sources of communication overhead:
- IPC between engine and worker processes
- Serialization and deserialization before and after the IPC in 1.
- Extra multimodal data transmission: first from P0 to the engine, then from the engine to the workers
Design
Step 1.
To address 1, the engine and worker processes can transmit multimodal data through a shared memory buffer instead of a socket. There is an existing ShmRingBuffer class, but it only supports fixed-size chunks; because of that chunk-size limit, it is only used when the multimodal data is smaller than 16 MB by default, and larger payloads fall back to socket IPC, which is slow.
We can add or redesign a shared memory buffer implementation that stores variable-length multimodal data.
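A minimal sketch of what such a variable-length buffer could look like, assuming a single producer and length-prefixed chunks. The class name and API below are illustrative, not the existing ShmRingBuffer interface, and wrap-around plus reader synchronization are omitted for brevity:

```python
import struct
from typing import Optional
from multiprocessing import shared_memory

class VariableShmRingBuffer:
    """Single-producer buffer storing length-prefixed, variable-size chunks."""

    HEADER = struct.Struct("<I")  # 4-byte little-endian chunk length

    def __init__(self, size: int, name: Optional[str] = None):
        # Create the segment in the engine process; workers attach by name.
        self.shm = shared_memory.SharedMemory(
            name=name, create=name is None, size=size)
        self.size = self.shm.size
        self.write_off = 0  # producer cursor (wrap-around omitted in this sketch)

    def put(self, payload: bytes) -> int:
        """Append one chunk; return its offset for the consumer."""
        need = self.HEADER.size + len(payload)
        if self.write_off + need > self.size:
            raise MemoryError("buffer full (wrap-around not implemented)")
        off = self.write_off
        self.HEADER.pack_into(self.shm.buf, off, len(payload))
        self.shm.buf[off + self.HEADER.size:off + need] = payload
        self.write_off += need
        return off

    def get(self, off: int) -> bytes:
        """Read back the chunk stored at `off`, e.g. in a worker process."""
        (length,) = self.HEADER.unpack_from(self.shm.buf, off)
        start = off + self.HEADER.size
        return bytes(self.shm.buf[start:start + length])
```

A worker would attach with `VariableShmRingBuffer(size, name=engine_buf.shm.name)` and call `get(off)` with offsets received over the (now lightweight) RPC channel.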
Step 2.
To address 2, building on the above, we can skip the (de)serialization of multimodal data and keep only the mm_hashes in the RPC call. Similar to what happens when P0 gets a cache hit, we can set the multimodal data to None in the engine process and restore it from the shared memory buffer in the worker process. One assumption here is that multimodal data contains only numpy/torch tensors or other easy-to-serialize types.
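A rough sketch of the engine/worker handoff under this scheme; the request fields (`mm_data`, `mm_offsets`), the buffer object, and the helper names are illustrative assumptions, not vLLM's actual interfaces:

```python
import pickle

def strip_mm_data(request: dict, buf) -> dict:
    """Engine side: stash mm tensors in shared memory, keep only small metadata."""
    offsets = {}
    for mm_hash, tensor in request["mm_data"].items():
        # pickle stands in for whatever (ideally zero-copy) encoding is chosen;
        # the point is that the bytes travel through shared memory, not the socket.
        offsets[mm_hash] = buf.put(pickle.dumps(tensor))
    request["mm_data"] = None         # nothing heavy goes over the RPC channel
    request["mm_offsets"] = offsets   # mm_hash -> shm offset, cheap to send
    return request

def restore_mm_data(request: dict, buf) -> dict:
    """Worker side: rebuild mm_data from the shared memory buffer."""
    request["mm_data"] = {
        mm_hash: pickle.loads(buf.get(off))
        for mm_hash, off in request["mm_offsets"].items()
    }
    return request
```

Since mm data is assumed to be numpy/torch tensors, the encoding step could store raw buffers plus dtype/shape metadata instead of pickling, avoiding serialization cost entirely.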
Step 3.
To address 3, we can ideally replace the MirroredProcessingCache with the same shared memory buffer, avoiding the extra multimodal data transfer between all processes.
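For illustration, a dict-like cache keyed by mm_hash could wrap the same buffer so that P0, the engine, and the workers all resolve hashes to the same shared storage. `SharedMMCache` and its index are hypothetical; in a real implementation the index itself would have to live in shared memory with cross-process locking and eviction:

```python
class SharedMMCache:
    """Sketch of a cross-process mm cache standing in for MirroredProcessingCache."""

    def __init__(self, buf):
        self.buf = buf    # the variable-length shm buffer from Step 1
        self.index = {}   # mm_hash -> offset; must itself be shared in practice

    def put(self, mm_hash: str, payload: bytes) -> None:
        # Write each unique multimodal input at most once.
        if mm_hash not in self.index:
            self.index[mm_hash] = self.buf.put(payload)

    def get(self, mm_hash: str) -> bytes:
        return self.buf.get(self.index[mm_hash])
```

With this in place, P0 writes each input once and every other process reads it by hash, instead of the data being forwarded from P0 to the engine and again from the engine to each worker.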
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response