
Commit cd2ed05

DarkLight1337 authored and shanes-cerebras committed
[CLI][Doc] Formalize --mm-encoder-tp-mode (vllm-project#23190)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
1 parent 748425e commit cd2ed05

File tree

7 files changed, +3102 −1882 lines changed

docs/configuration/optimization.md

Lines changed: 45 additions & 0 deletions
@@ -129,6 +129,51 @@ Data parallelism replicates the entire model across multiple GPU sets and processes
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.

### Batch-level DP for Multi-Modal Encoders

By default, TP is used to shard the weights of multi-modal encoders just like for language decoders,
in order to reduce the memory and compute load on each GPU.

However, since multi-modal encoders are very small compared to language decoders,
there is relatively little gain from TP. On the other hand, TP incurs significant communication
overhead because an all-reduce is performed after every layer.

Given this, it may be advantageous to instead shard the batched input data across the TP ranks,
essentially performing batch-level DP. This has been shown to improve throughput by around 10% for
`tensor_parallel_size=8`. For vision encoders that use hardware-unoptimized Conv3D operations,
batch-level DP can provide a further 40% increase in throughput compared to regular TP.
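
To make the mechanism concrete, below is a minimal, self-contained sketch (illustrative only, not vLLM's internal code; `encode` and `batch_level_dp` are hypothetical names): each rank holds a full copy of the encoder and runs it on its own slice of the batch, so results only need to be gathered once at the end instead of all-reduced after every layer.

```python
# Illustrative sketch of batch-level DP for a multi-modal encoder
# (hypothetical helpers, not vLLM internals).

def encode(image: str) -> str:
    # Stand-in for the full encoder forward pass on one rank;
    # the encoder weights are replicated, not sharded.
    return f"embedding({image})"

def batch_level_dp(images: list[str], tp_size: int) -> list[str]:
    # Each rank takes every tp_size-th item of the batch...
    shards = [images[rank::tp_size] for rank in range(tp_size)]
    # ...runs the full encoder on its shard with no inter-layer communication...
    per_rank = [[encode(image) for image in shard] for shard in shards]
    # ...and the outputs are gathered once, restoring the original batch order
    # (this simple gather assumes the batch divides evenly across ranks).
    return [emb for step in zip(*per_rank) for emb in step]

print(batch_level_dp([f"img{i}" for i in range(8)], tp_size=4))
```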

Nevertheless, since the weights of the multi-modal encoder are replicated across the TP ranks,
there is a minor increase in memory consumption; this may cause OOM if you can only barely fit the model already.

You can enable batch-level DP by setting `mm_encoder_tp_mode="data"`, for example:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    # Create two EngineCore instances, one per DP rank
    data_parallel_size=2,
    # Within each EngineCore instance:
    # The vision encoder shards the input data across the TP=4 ranks
    # (not the DP=2 ranks), i.e. batch-level DP of size 4
    # The language decoder uses TP=4 to shard the weights as usual
    tensor_parallel_size=4,
    mm_encoder_tp_mode="data",
)
```
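
The same configuration can be expressed on the command line via the `--mm-encoder-tp-mode` flag that this commit formalizes. A sketch of the equivalent `vllm serve` invocation (the other flags follow vLLM's usual engine-argument spelling):

```bash
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --data-parallel-size 2 \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data
```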

!!! important
    Batch-level DP is not to be confused with API request-level DP
    (which is instead controlled by `data_parallel_size`).

The availability of batch-level DP depends on the model implementation.
Currently, the following models support `mm_encoder_tp_mode="data"`:

- Llama4 (<gh-pr:18368>)
- Qwen2.5-VL (<gh-pr:22742>)
- Step3 (<gh-pr:22697>)

## Input Processing

### Parallel Processing
