Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.

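For example, here is a hedged configuration sketch of combining the two strategies (the model choice is illustrative, not from this document):

```python
from vllm import LLM

# With TP=4 inside each replica and DP=2 across replicas, the MoE layers
# are sharded across 4 * 2 = 8 GPUs, per the note above.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative MoE model
    tensor_parallel_size=4,
    data_parallel_size=2,
)
```
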
### Batch-level DP for Multi-Modal Encoders

By default, TP is used to shard the weights of multi-modal encoders just like for language decoders,
in order to reduce the memory and compute load on each GPU.

However, since multi-modal encoders are very small compared to language decoders,
there is relatively little gain from TP. On the other hand, TP incurs significant communication
overhead because an all-reduce is performed after every layer.

Given this, it may be advantageous to instead replicate the encoder weights and shard the batched
input data across the TP ranks, essentially performing batch-level DP. This has been shown to improve
throughput by around 10% for `tensor_parallel_size=8`. For vision encoders that use hardware-unoptimized
Conv3D operations, batch-level DP can provide a further 40% increase in throughput compared to regular TP.

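To make the idea concrete, here is a minimal sketch of the sharding step (not vLLM's actual implementation; `shard_batch` is a hypothetical helper), in which each rank runs the full encoder on its own slice of the batch:

```python
import torch

def shard_batch(batch: torch.Tensor, tp_size: int, rank: int) -> torch.Tensor:
    """Return the slice of the batch this TP rank should encode."""
    # tensor_split always returns exactly `tp_size` pieces, as evenly sized
    # as possible (some may be empty when the batch is small).
    return torch.tensor_split(batch, tp_size, dim=0)[rank]

# Each rank keeps a full copy of the (small) encoder and runs it on its
# shard only; the per-rank outputs are then gathered, e.g. with all_gather,
# before being handed to the TP-sharded language decoder.
images = torch.randn(6, 3, 224, 224)  # a batch of 6 images
sizes = [shard_batch(images, tp_size=4, rank=r).shape[0] for r in range(4)]
print(sizes)  # [2, 2, 1, 1]
```
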
Nevertheless, since the weights of the multi-modal encoder are replicated across all TP ranks,
there is a minor increase in memory consumption, which may cause OOM if the model already barely fits.

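Because the encoder weights are replicated rather than sharded, each rank holds roughly `(1 - 1/tp_size)` more encoder memory than under regular TP. A back-of-envelope estimate, with all numbers assumed purely for illustration:

```python
# Rough estimate of the extra per-rank memory from replicating the encoder.
encoder_params = 0.7e9   # e.g. a ~0.7B-parameter vision encoder (assumed)
bytes_per_param = 2      # fp16/bf16 weights
tp_size = 8

sharded_bytes = encoder_params * bytes_per_param / tp_size
replicated_bytes = encoder_params * bytes_per_param
extra_gib = (replicated_bytes - sharded_bytes) / 2**30
print(f"~{extra_gib:.2f} GiB extra per rank")  # ~1.14 GiB with these numbers
```
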
You can enable batch-level DP by setting `mm_encoder_tp_mode="data"`, for example:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    # Create two EngineCore instances, one per DP rank
    data_parallel_size=2,
    # Within each EngineCore instance:
    # The vision encoder uses TP=4 (not DP=2) to shard the input data
    # The language decoder uses TP=4 to shard the weights as usual
    tensor_parallel_size=4,
    mm_encoder_tp_mode="data",
)
```
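
Note that batch-level DP reuses the decoder's TP group for the encoder, so no additional GPUs are required; the example above still runs on 2 × 4 = 8 GPUs in total. When serving from the command line, the corresponding flags should be `--data-parallel-size`, `--tensor-parallel-size`, and `--mm-encoder-tp-mode data` (flag names inferred from the Python arguments above).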

!!! important
    Batch-level DP is not to be confused with API request-level DP
    (which is instead controlled by `data_parallel_size`).

The availability of batch-level DP depends on the model implementation.
Currently, the following models support `mm_encoder_tp_mode="data"`:

- Llama4 (<gh-pr:18368>)
- Qwen2.5-VL (<gh-pr:22742>)
- Step3 (<gh-pr:22697>)

## Input Processing

### Parallel Processing