Data parallel and model parallel are widely-used distributed training
techniques. Previous posts have explained how to use
`DataParallel <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`_
- to train a neural network on multiple GPUs. ``DataParallel`` replicates the same
- model to all GPUs, where each GPU consumes a different partition of the input
- data. Although it can significantly accelerate the training process, it does not
- work for some use cases where the model is too large to fit into a single GPU. This
- post shows how to solve that problem by using model parallel and also shares
- some insights on how to speed up model parallel training.
+ to train a neural network on multiple GPUs. ``DataParallel`` replicates the
+ same model to all GPUs, where each GPU consumes a different partition of the
+ input data. Although it can significantly accelerate the training process, it
+ does not work for some use cases where the model is too large to fit into a
+ single GPU. This post shows how to solve that problem by using model parallel
+ and also shares some insights on how to speed up model parallel training.
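
For reference, ``DataParallel`` usage is a one-line wrapper around an existing
model (``MyModel`` and ``inputs`` below are placeholders, not names from this
tutorial)::

    import torch.nn as nn

    model = nn.DataParallel(MyModel().to('cuda:0'))  # replicas created on all visible GPUs
    output = model(inputs)                           # the batch is scattered across the replicas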

The high-level idea of model parallel is to place different sub-networks of a
model onto different devices, and implement the ``forward`` method accordingly
- to move intermediate outputs across devices. As only part of a model operates on
- any individual device, a set of devices can collectively serve a larger model.
- In this post, we will not try to construct huge models and squeeze them into a
- limited number of GPUs. Instead, this post focuses on showing the idea of model
- parallel. It is up to the readers to apply the ideas to real-world applications.
+ to move intermediate outputs across devices. As only part of a model operates
+ on any individual device, a set of devices can collectively serve a larger
+ model. In this post, we will not try to construct huge models and squeeze them
+ into a limited number of GPUs. Instead, this post focuses on showing the idea
+ of model parallel. It is up to the readers to apply the ideas to real-world
+ applications.
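
A minimal sketch of this pattern, with two linear layers pinned to two GPUs
(the device ids below assume two visible GPUs), could look like the following;
the toy model introduced next follows the same structure::

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        def __init__(self):
            super(ToyModel, self).__init__()
            self.net1 = nn.Linear(10, 10).to('cuda:0')   # first sub-network on GPU 0
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5).to('cuda:1')    # second sub-network on GPU 1

        def forward(self, x):
            # move the intermediate output from cuda:0 to cuda:1 between the layers
            x = self.relu(self.net1(x.to('cuda:0')))
            return self.net2(x.to('cuda:1'))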

Let us start with a toy model that contains two linear layers. To run this
model on two GPUs, simply put each linear layer on a different GPU, and move
@@ -173,15 +174,18 @@ def train(model):

stmt = "train(model)"

- setup = "from __main__ import train, ModelParallelResNet50;" + \
-         "model = ModelParallelResNet50()"
- mp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+ setup = "model = ModelParallelResNet50()"
+ # globals arg is only available in Python 3. In Python 2, use the following
+ # import __builtin__
+ # __builtin__.__dict__.update(locals())
+ mp_run_times = timeit.repeat(
+     stmt, setup, number=1, repeat=num_repeat, globals=globals())
mp_mean, mp_std = np.mean(mp_run_times), np.std(mp_run_times)

- setup = "from __main__ import train, num_classes;" + \
-         "import torchvision.models as models;" + \
+ setup = "import torchvision.models as models;" + \
        "model = models.resnet50(num_classes=num_classes).to('cuda:0')"
- rn_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+ rn_run_times = timeit.repeat(
+     stmt, setup, number=1, repeat=num_repeat, globals=globals())
rn_mean, rn_std = np.mean(rn_run_times), np.std(rn_run_times)
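
# As a standalone illustration of the ``globals`` argument used in the hunk
# above (the function name below is made up for the example): passing
# ``globals=globals()`` lets the ``stmt`` and ``setup`` strings resolve
# module-level names directly, so the ``from __main__ import ...`` prefixes in
# ``setup`` are no longer needed.

import timeit

def work():
    return sum(range(1000))

run_times = timeit.repeat("work()", number=1, repeat=5, globals=globals())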
@@ -212,18 +216,20 @@ def plot(means, stds, labels, fig_name):
# ``4.02/3.75-1=7%`` longer than the existing single-GPU implementation. So we
# can conclude there is roughly 7% overhead in copying tensors back and forth
# across the GPUs. There is room for improvement, as we know one of the two
- # GPUs is sitting idle throughout the execution. One option is to further divide
- # each batch into a pipeline of splits, such that when one split reaches the
- # second sub-network, the following split can be fed into the first sub-network.
- # In this way, two consecutive splits can run concurrently on two GPUs.
+ # GPUs is sitting idle throughout the execution. One option is to further
+ # divide each batch into a pipeline of splits, such that when one split reaches
+ # the second sub-network, the following split can be fed into the first
+ # sub-network. In this way, two consecutive splits can run concurrently on two
+ # GPUs.

######################################################################
# Speed Up by Pipelining Inputs
# =============================
#
# In the following experiments, we further divide each 120-image batch into
# 20-image splits. As PyTorch launches CUDA operations asynchronously, the
- # implementation does not need to spawn multiple threads to achieve concurrency.
+ # implementation does not need to spawn multiple threads to achieve
+ # concurrency.


class PipelineParallelResNet50(ModelParallelResNet50):
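
# The hunks here elide most of the class body; as a rough sketch, a pipelined
# ``forward`` of the kind described above could look like the following. It
# assumes the tutorial's earlier ``ModelParallelResNet50`` with ``seq1`` on
# ``cuda:0`` and ``seq2``/``fc`` on ``cuda:1``; the class name below is a
# hypothetical stand-in, not the tutorial's own definition.

class PipelineSketch(ModelParallelResNet50):

    def __init__(self, split_size=20, *args, **kwargs):
        super(PipelineSketch, self).__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        # prime the pipeline: push the first split through seq1 and move it to cuda:1
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # (A) s_prev runs seq2 and fc on cuda:1 ...
            s_prev = self.seq2(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

            # (B) ... while s_next runs seq1 on cuda:0; since CUDA calls are
            # asynchronous, (A) and (B) can overlap on the two GPUs
            s_prev = self.seq1(s_next).to('cuda:1')

        # drain the pipeline: the last split still has to go through seq2 and fc
        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)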
@@ -251,9 +257,9 @@ def forward(self, x):
        return torch.cat(ret)


- setup = "from __main__ import train, PipelineParallelResNet50;" + \
-         "model = PipelineParallelResNet50()"
- pp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+ setup = "model = PipelineParallelResNet50()"
+ pp_run_times = timeit.repeat(
+     stmt, setup, number=1, repeat=num_repeat, globals=globals())
pp_mean, pp_std = np.mean(pp_run_times), np.std(pp_run_times)

plot([mp_mean, rn_mean, pp_mean],
@@ -266,16 +272,17 @@ def forward(self, x):
# current streams on the source and the destination devices. If you create
# multiple streams, you have to make sure that copy operations are properly
# synchronized. Writing the source tensor or reading/writing the destination
- # tensor before finishing the copy operation can lead to undefined behavior. The
- # above implementation only uses default streams on both source and destination
- # devices, hence it is not necessary to enforce additional synchronizations.
+ # tensor before finishing the copy operation can lead to undefined behavior.
+ # The above implementation only uses default streams on both source and
+ # destination devices, hence it is not necessary to enforce additional
+ # synchronizations.
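
# The note above relies on the default streams. If the copy were issued on a
# non-default stream instead, the consumer would have to synchronize with it
# explicitly before touching the result; a minimal sketch (the tensor and
# stream names are hypothetical):

copy_stream = torch.cuda.Stream(device='cuda:1')
with torch.cuda.stream(copy_stream):
    # device-to-device copy issued on the non-default stream
    y = x.to('cuda:1', non_blocking=True)
# make the default stream on cuda:1 wait for the copy before any op consumes y
torch.cuda.current_stream(device='cuda:1').wait_stream(copy_stream)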
#
# .. figure:: /_static/img/model-parallel-images/mp_vs_rn_vs_pp.png
#    :alt:
#
- # The experiment result shows that pipelining inputs to model parallel ResNet50
- # speeds up the training process by roughly ``3.75/2.51-1=49%``. It is still
- # quite far away from the ideal 100% speedup. As we have introduced a new
+ # The experiment result shows that pipelining inputs to model parallel
+ # ResNet50 speeds up the training process by roughly ``3.75/2.51-1=49%``. It is
+ # still quite far away from the ideal 100% speedup. As we have introduced a new
# parameter ``split_size`` in our pipeline parallel implementation, it is
# unclear how the new parameter affects the overall training time. Intuitively
# speaking, using small ``split_size`` leads to many tiny CUDA kernel launches,
@@ -290,10 +297,9 @@ def forward(self, x):
split_sizes = [1, 3, 5, 8, 10, 12, 20, 40, 60]

for split_size in split_sizes:
-     setup = "from __main__ import train, PipelineParallelResNet50;" + \
-             "from __main__ import split_size;" + \
-             "model = PipelineParallelResNet50(split_size=split_size)"
-     pp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+     setup = "model = PipelineParallelResNet50(split_size=%d)" % split_size
+     pp_run_times = timeit.repeat(
+         stmt, setup, number=1, repeat=num_repeat, globals=globals())
    means.append(np.mean(pp_run_times))
    stds.append(np.std(pp_run_times))