
Commit dbf49e5

fix timeit sphinx conflicts
1 parent 83c527f commit dbf49e5

File tree

1 file changed: +41 −35 lines changed


intermediate_source/model_parallel_tutorial.py

+41-35
@@ -7,20 +7,21 @@
 Data parallel and model parallel are widely-used distributed training
 techniques. Previous posts have explained how to use
 `DataParallel <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`_
-to train a neural network on multiple GPUs. ``DataParallel`` replicates the same
-model to all GPUs, where each GPU consumes a different partition of the input
-data. Although it can significantly accelerate the training process, it does not
-work for some use cases where the model is large to fit into a single GPU. This
-post shows how to solve that problem by using model parallel and also shares
-some insights on how to speed up model parallel training.
+to train a neural network on multiple GPUs. ``DataParallel`` replicates the
+same model to all GPUs, where each GPU consumes a different partition of the
+input data. Although it can significantly accelerate the training process, it
+does not work for some use cases where the model is too large to fit into a
+single GPU. This post shows how to solve that problem by using model parallel
+and also shares some insights on how to speed up model parallel training.

 The high-level idea of model parallel is to place different sub-networks of a
 model onto different devices, and implement the ``forward`` method accordingly
-to move intermediate outputs across devices. As only part of a model operates on
-any individual device, a set of devices can collectively serve a larger model.
-In this post, we will not try to construct huge models and squeeze them into a
-limited number of GPUs. Instead, this post focuses on showing the idea of model
-parallel. It is up to the readers to apply the ideas to real-world applications.
+to move intermediate outputs across devices. As only part of a model operates
+on any individual device, a set of devices can collectively serve a larger
+model. In this post, we will not try to construct huge models and squeeze them
+into a limited number of GPUs. Instead, this post focuses on showing the idea
+of model parallel. It is up to the readers to apply the ideas to real-world
+applications.

 Let us start with a toy model that contains two linear layers. To run this
 model on two GPUs, simply put each linear layer on a different GPU, and move
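Note: the two-layer toy model this paragraph refers to is defined in a part of the tutorial that this diff does not touch. A minimal sketch of that kind of model, assuming two visible GPUs (the class name and layer sizes here are illustrative, not taken from the diff):

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        """Two linear layers placed on two different GPUs."""
        def __init__(self):
            super(ToyModel, self).__init__()
            self.net1 = nn.Linear(10, 10).to('cuda:0')   # first sub-network on GPU 0
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5).to('cuda:1')    # second sub-network on GPU 1

        def forward(self, x):
            # move the intermediate output across devices inside forward
            x = self.relu(self.net1(x.to('cuda:0')))
            return self.net2(x.to('cuda:1'))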
@@ -173,15 +174,18 @@ def train(model):

 stmt = "train(model)"

-setup = "from __main__ import train, ModelParallelResNet50;" + \
-        "model = ModelParallelResNet50()"
-mp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+setup = "model = ModelParallelResNet50()"
+# globals arg is only available in Python 3. In Python 2, use the following
+# import __builtin__
+# __builtin__.__dict__.update(locals())
+mp_run_times = timeit.repeat(
+    stmt, setup, number=1, repeat=num_repeat, globals=globals())
 mp_mean, mp_std = np.mean(mp_run_times), np.std(mp_run_times)

-setup = "from __main__ import train, num_classes;" + \
-        "import torchvision.models as models;" + \
+setup = "import torchvision.models as models;" + \
         "model = models.resnet50(num_classes=num_classes).to('cuda:0')"
-rn_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+rn_run_times = timeit.repeat(
+    stmt, setup, number=1, repeat=num_repeat, globals=globals())
 rn_mean, rn_std = np.mean(rn_run_times), np.std(rn_run_times)


@@ -212,18 +216,20 @@ def plot(means, stds, labels, fig_name):
 # ``4.02/3.75-1=7%`` longer than the existing single-GPU implementation. So we
 # can conclude there is roughly 7% overhead in copying tensors back and forth
 # across the GPUs. There are rooms for improvements, as we know one of the two
-# GPUs is sitting idle throughout the execution. One option is to further divide
-# each batch into a pipeline of splits, such that when one split reaches the
-# second sub-network, the following split can be fed into the first sub-network.
-# In this way, two consecutive splits can run concurrently on two GPUs.
+# GPUs is sitting idle throughout the execution. One option is to further
+# divide each batch into a pipeline of splits, such that when one split reaches
+# the second sub-network, the following split can be fed into the first
+# sub-network. In this way, two consecutive splits can run concurrently on two
+# GPUs.

 ######################################################################
 # Speed Up by Pipelining Inputs
 # =======================
 #
 # In the following experiments, we further divide each 120-image batch into
 # 20-image splits. As PyTorch launches CUDA operations asynchronizely, the
-# implementation does not need to spawn multiple threads to achieve concurrency.
+# implementation does not need to spawn multiple threads to achieve
+# concurrency.


 class PipelineParallelResNet50(ModelParallelResNet50):
@@ -251,9 +257,9 @@ def forward(self, x):
         return torch.cat(ret)


-setup = "from __main__ import train, PipelineParallelResNet50;" + \
-        "model = PipelineParallelResNet50()"
-pp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+setup = "model = PipelineParallelResNet50()"
+pp_run_times = timeit.repeat(
+    stmt, setup, number=1, repeat=num_repeat, globals=globals())
 pp_mean, pp_std = np.mean(pp_run_times), np.std(pp_run_times)

 plot([mp_mean, rn_mean, pp_mean],
@@ -266,16 +272,17 @@ def forward(self, x):
 # current streams on the source and the destination devices. If you create
 # multiple streams, you have to make sure that copy operations are properly
 # synchronized. Writing the source tensor or reading/writing the destination
-# tensor before finishing the copy operation can lead to undefined behavior. The
-# above implementation only uses default streams on both source and destination
-# devices, hence it is not necessary to enforce additional synchronizations.
+# tensor before finishing the copy operation can lead to undefined behavior.
+# The above implementation only uses default streams on both source and
+# destination devices, hence it is not necessary to enforce additional
+# synchronizations.
 #
 # .. figure:: /_static/img/model-parallel-images/mp_vs_rn_vs_pp.png
 #    :alt:
 #
-# The experiment result shows that, pipelining inputs to model parallel ResNet50
-# speeds up the training process by roughly ``3.75/2.51-1=49%``. It is still
-# quite far away from the ideal 100% speedup. As we have introduced a new
+# The experiment result shows that pipelining inputs to model parallel
+# ResNet50 speeds up the training process by roughly ``3.75/2.51-1=49%``. It is
+# still quite far away from the ideal 100% speedup. As we have introduced a new
 # parameter ``split_sizes`` in our pipeline parallel implementation, it is
 # unclear how the new parameter affects the overall training time. Intuitively
 # speaking, using small ``split_size`` leads to many tiny CUDA kernel launch,
@@ -290,10 +297,9 @@ def forward(self, x):
 split_sizes = [1, 3, 5, 8, 10, 12, 20, 40, 60]

 for split_size in split_sizes:
-    setup = "from __main__ import train, PipelineParallelResNet50;" + \
-            "from __main__ import split_size;" + \
-            "model = PipelineParallelResNet50(split_size=split_size)"
-    pp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
+    setup = "model = PipelineParallelResNet50(split_size=%d)" % split_size
+    pp_run_times = timeit.repeat(
+        stmt, setup, number=1, repeat=num_repeat, globals=globals())
     means.append(np.mean(pp_run_times))
     stds.append(np.std(pp_run_times))
