"""
PyTorch 1.0 Distributed Trainer with Amazon AWS
===============================================
**Author**: `Nathan Inkawhich <https://github.com/inkawhich>`_
**Edited by**: `Teng Li <https://github.com/teng-li>`_
"""
######################################################################
# In this tutorial we will show how to set up, code, and run a PyTorch 1.0
# distributed trainer across two multi-GPU Amazon AWS nodes. We will start
# by describing the AWS setup, then the PyTorch environment
# configuration, and finally the code for the distributed trainer.
# Hopefully you will find that there is actually very little code change
# required to extend your current training code to a distributed
# application, and most of the work is in the one-time environment setup.
#
######################################################################
# Amazon AWS Setup
# ----------------
#
# In this tutorial we will run distributed training across two multi-GPU
# nodes. In this section we will first cover how to create the nodes, then
# how to set up the security group so the nodes can communicate with
# each other.
#
# Creating the Nodes
# ~~~~~~~~~~~~~~~~~~
#
# In Amazon AWS, there are seven steps to creating an instance. To get
# started, login and select **Launch Instance**.
#
# **Step 1: Choose an Amazon Machine Image (AMI)** - Here we will select
# the ``Deep Learning AMI (Ubuntu) Version 14.0``. As described, this
# instance comes with many of the most popular deep learning frameworks
# installed and is preconfigured with CUDA, cuDNN, and NCCL. It is a very
# good starting point for this tutorial.
#
# **Step 2: Choose an Instance Type** - Now, select the GPU compute
# instance type ``p2.8xlarge``. Notice, each instance type has a different
# cost, but this one provides 8 NVIDIA Tesla K80 GPUs per node and a good
# architecture for multi-GPU distributed training.
#
# **Step 3: Configure Instance Details** - The only setting to change here
# is increasing the *Number of instances* to 2. All other configurations
# may be left at default.
#
# **Step 4: Add Storage** - Notice, by default these nodes do not come
# with a lot of storage (only 75 GB). For this tutorial, since we are only
# using the STL-10 dataset, this is plenty of storage. But, if you want to
# train on a larger dataset such as ImageNet, you will have to add much
# more storage just to fit the dataset and any trained models you wish to
# save.
#
# **Step 5: Add Tags** - Nothing to be done here, just move on.
#
# **Step 6: Configure Security Group** - This is a critical step in the
# configuration process. By default two nodes in the same security group
# would not be able to communicate in the distributed training setting.
# Here, we want to create a **new** security group for the two nodes to be
# in. However, we cannot finish configuring in this step. For now, just
# remember your new security group name (e.g. launch-wizard-12) then move
# on to Step 7.
#
# **Step 7: Review Instance Launch** - Here, review the instance then
# launch it. By default, this will automatically start initializing the
# two instances. You can monitor the initialization progress from the
# dashboard.
#
# Configure Security Group
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# Recall that we were not able to properly configure the security group
# when creating the instances. Once you have launched the instances, select
# the *Network & Security > Security Groups* tab in the EC2 dashboard.
# This will bring up a list of security groups you have access to. Select
# the new security group you created in Step 6 (i.e. launch-wizard-12),
# which will bring up tabs called *Description, Inbound, Outbound, and
# Tags*. First, select the *Inbound* tab and *Edit* to add a rule to allow
# "All Traffic" from "Sources" in the launch-wizard-12 security group.
# Then select the *Outbound* tab and do the exact same thing. Now, we have
# effectively allowed all Inbound and Outbound traffic of all types
# between nodes in the launch-wizard-12 security group.
#
# Necessary Information
# ~~~~~~~~~~~~~~~~~~~~~
#
# Before continuing, we must find and remember the IP addresses of both
# nodes. In the EC2 dashboard find your running instances. For both
# instances, write down the *IPv4 Public IP* and the *Private IPs*. For
# the remainder of the document, we will refer to these as the
# **node0-publicIP**, **node0-privateIP**, **node1-publicIP**, and
# **node1-privateIP**. The public IPs are the addresses we will use to SSH
# in, and the private IPs will be used for inter-node communication.
#
######################################################################
# Environment Setup
# -----------------
#
# The next critical step is the setup of each node. Unfortunately, we
# cannot configure both nodes at the same time, so this process must be
# done on each node separately. However, this is a one-time setup, so once
# you have the nodes configured properly you will not have to reconfigure
# for future distributed training projects.
#
# The first step, once logged onto the node, is to create a new conda
# environment with Python 3.6 and NumPy. Once created, activate the
# environment.
#
# ::
#
#    $ conda create -n nightly_pt python=3.6 numpy
#    $ source activate nightly_pt
#
# Next, we will install a nightly build of CUDA 9.0-enabled PyTorch with
# pip in the conda environment.
#
# ::
#
#    $ pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html
#
# We must also install torchvision so we can use the torchvision model and
# dataset. At this time, we must build torchvision from source as the pip
# installation will by default install an old version of PyTorch on top of
# the nightly build we just installed.
#
# ::
#
#    $ cd
#    $ git clone https://github.com/pytorch/vision.git
#    $ cd vision
#    $ python setup.py install
#
# And finally, a **VERY IMPORTANT** step is to set the network interface
# name for the NCCL socket. This is set with the environment variable
# ``NCCL_SOCKET_IFNAME``. To get the correct name, run the ``ifconfig``
# command on the node and look at the interface name that corresponds to
# the node's *privateIP* (e.g. ens3). Then set the environment variable as
#
# ::
#
#    $ export NCCL_SOCKET_IFNAME=ens3
#
# Remember, do this on both nodes. You may also consider adding the
# NCCL\_SOCKET\_IFNAME setting to your *.bashrc*. An important observation
# is that we did not set up a shared filesystem between the nodes.
# Therefore, each node will have to have a copy of the code and a copy of
# the datasets. For more information about setting up a shared network
# filesystem between nodes, see
# `here <https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/>`__.
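#
# As a quick sanity check before moving on, you may want to confirm from
# Python that the environment on each node looks right. The short sketch
# below is not part of the tutorial code; it simply assumes the conda
# environment created above is active, and prints the PyTorch version,
# GPU visibility, and the exported ``NCCL_SOCKET_IFNAME`` value.
#
# ::
#
#    import os
#    import torch
#
#    # Installed nightly build and CUDA/GPU visibility from the steps above
#    print("PyTorch version:", torch.__version__)
#    print("CUDA available: ", torch.cuda.is_available())
#    print("GPU count:      ", torch.cuda.device_count())
#
#    # The NCCL socket interface exported above (e.g. ens3)
#    print("NCCL_SOCKET_IFNAME:", os.environ.get("NCCL_SOCKET_IFNAME", "<not set>"))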
#
######################################################################
# Distributed Training Code
# -------------------------
#
# With the instances running and the environments setup we can now get
# into the training code. Most of the code here has been taken from the
# `PyTorch ImageNet
# Example <https://github.com/pytorch/examples/tree/master/imagenet>`__
# which also supports distributed training. This code provides a good
# starting point for a custom trainer as it has much of the boilerplate
# training loop, validation loop, and accuracy tracking functionality.
# However, you will notice that the argument parsing and other
# non-essential functions have been stripped out for simplicity.
#
# In this example we will use
# `torchvision.models.resnet18 <https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.resnet18>`__
# model and will train it on the
# `torchvision.datasets.STL10 <https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.STL10>`__
# dataset. To accommodate the dimensionality mismatch of STL-10 with
# Resnet18, we will resize each image to 224x224 with a transform. Notice,
# the choice of model and dataset is orthogonal to the distributed
# training code; you may use any dataset and model you wish and the
# process is the same. Let's get started by first handling the imports and
# talking about some helper functions. Then we will define the train and
# test functions, which have been largely taken from the ImageNet Example.
# At the end, we will build the main part of the code which handles the
# distributed training setup. And finally, we will discuss how to actually
# run the code.
#
######################################################################
# Imports
# ~~~~~~~
#
# The important distributed training specific imports here are
# `torch.nn.parallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__,
# `torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__,
# `torch.utils.data.distributed <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`__,
# and
# `torch.multiprocessing <https://pytorch.org/docs/stable/multiprocessing.html>`__.
# It is also important to set the multiprocessing start method to *spawn*
# or *forkserver* (only supported in Python 3),
# as the default is *fork* which may cause deadlocks when using multiple
# worker processes for dataloading.
#
import time
import sys
import torch
if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
import torch.nn as nn
import torch.nn.parallel
import torch.distributed as dist
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.multiprocessing import Pool, Process
######################################################################
# Helper Functions
# ~~~~~~~~~~~~~~~~
#
# We must also define some helper functions and classes that will make
# training easier. The ``AverageMeter`` class tracks training statistics
# like accuracy and iteration count. The ``accuracy`` function computes
# and returns the top-k accuracy of the model so we can track learning
# progress. Both are provided for training convenience but neither are
# distributed training specific.
#
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def accuracy(output, target, topk=(1,)):
    """Computes the precision@k for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
######################################################################
# Train Functions
# ~~~~~~~~~~~~~~~
#
# To simplify the main loop, it is best to separate a training epoch step
# into a function called ``train``. This function trains the input model
# for one epoch of the *train\_loader*. The only distributed training
# artifact in this function is setting the
# `non\_blocking <https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers>`__
# attributes of the data and label tensors to ``True`` before the forward
# pass. This allows asynchronous GPU copies of the data meaning transfers
# can be overlapped with computation. This function also outputs training
# statistics along the way so we can track progress throughout the epoch.
#
# The other function to define here is ``adjust_learning_rate``, which
# decays the initial learning rate at a fixed schedule. This is another
# boilerplate trainer function that is useful to train accurate models.
#
def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # switch to train mode
    model.train()

    end = time.time()
    for i, (input, target) in enumerate(train_loader):

        # measure data loading time
        data_time.update(time.time() - end)

        # Create non_blocking tensors for distributed training
        input = input.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)

        # compute output
        output = model(input)
        loss = criterion(output, target)

        # measure accuracy and record loss
        prec1, prec5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), input.size(0))
        top1.update(prec1[0], input.size(0))
        top5.update(prec5[0], input.size(0))

        # compute gradients in a backward pass
        optimizer.zero_grad()
        loss.backward()

        # Call step of optimizer to update model params
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % 10 == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                  'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                      epoch, i, len(train_loader), batch_time=batch_time,
                      data_time=data_time, loss=losses, top1=top1, top5=top5))


def adjust_learning_rate(initial_lr, optimizer, epoch):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    lr = initial_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
######################################################################
# Validation Function
# ~~~~~~~~~~~~~~~~~~~
#
# To track generalization performance and simplify the main loop further
# we can also extract the validation step into a function called
# ``validate``. This function runs a full validation step of the input
# model on the input validation dataloader and returns the top-1 accuracy
# of the model on the validation set. Again, you will notice the only
# distributed training feature here is setting ``non_blocking=True`` for
# the validation data and labels before they are passed to the model.
#
def validate(val_loader, model, criterion):
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        end = time.time()
        for i, (input, target) in enumerate(val_loader):

            input = input.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)

            # compute output
            output = model(input)
            loss = criterion(output, target)

            # measure accuracy and record loss
            prec1, prec5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), input.size(0))
            top1.update(prec1[0], input.size(0))
            top5.update(prec5[0], input.size(0))

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

            if i % 100 == 0:
                print('Test: [{0}/{1}]\t'
                      'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                      'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                      'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                      'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                          i, len(val_loader), batch_time=batch_time, loss=losses,
                          top1=top1, top5=top5))

        print(' * Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f}'
              .format(top1=top1, top5=top5))

    return top1.avg
######################################################################
# Inputs
# ~~~~~~
#
# With the helper functions out of the way, now we have reached the
# interesting part. Here is where we will define the inputs for the run.
# Some of the inputs are standard model training inputs such as batch size
# and number of training epochs, and some are specific to our distributed
# training task. The required inputs are:
#
# - **batch\_size** - batch size for *each* process in the distributed
#   training group. Total batch size across distributed model is
#   batch\_size\*world\_size
#
# - **workers** - number of worker processes used with the dataloaders in
#   each process
#
# - **num\_epochs** - total number of epochs to train for
#
# - **starting\_lr** - starting learning rate for training
#
# - **world\_size** - number of processes in the distributed training
#   environment
#
# - **dist\_backend** - backend to use for distributed training
#   communication (i.e. NCCL, Gloo, MPI, etc.). In this tutorial, since
#   we are using several multi-GPU nodes, NCCL is suggested.
#
# - **dist\_url** - URL to specify the initialization method of the
#   process group. This may contain the IP address and port of the rank0
#   process or be a non-existent file on a shared file system. Here,
#   since we do not have a shared file system, this will incorporate the
#   **node0-privateIP** and the port on node0 to use.
#
print("Collect Inputs...")
# Batch Size for training and testing
batch_size = 32
# Number of additional worker processes for dataloading
workers = 2
# Number of epochs to train for
num_epochs = 2
# Starting Learning Rate
starting_lr = 0.1
# Number of distributed processes
world_size = 4
# Distributed backend type
dist_backend = 'nccl'
# Url used to setup distributed training
dist_url = "tcp://172.31.22.234:23456"
######################################################################
# Initialize process group
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# One of the most important parts of distributed training in PyTorch is to
# properly setup the process group, which is the **first** step in
# initializing the ``torch.distributed`` package. To do this, we will use
# the ``torch.distributed.init_process_group`` function which takes
# several inputs. First, a *backend* input which specifies the backend to
# use (i.e. NCCL, Gloo, MPI, etc.). An *init\_method* input which is
# either a url containing the address and port of the rank0 machine or a
# path to a non-existent file on the shared file system. Note, to use the
# file init\_method, all machines must have access to the file; similarly,
# for the url method, all machines must be able to communicate on the
# network, so make sure to configure any firewalls and network settings to
# accommodate. The *init\_process\_group* function also takes *rank* and
# *world\_size* arguments which specify the rank of this process when run
# and the number of processes in the collective, respectively.
# The *init\_method* input can also be "env://". In this case, the address
# and port of the rank0 machine will be read from the following two
# environment variables respectively: MASTER_ADDR, MASTER_PORT. If *rank*
# and *world\_size* arguments are not specified in the *init\_process\_group*
# function, they both can be read from the following two environment
# variables respectively as well: RANK, WORLD_SIZE.
#
# Another important step, especially when each node has multiple GPUs, is
# to set the *local\_rank* of this process. For example, if you have two
# nodes, each with 8 GPUs and you wish to train with all of them then
# :math:`world\_size=16` and each node will have a process with local rank
# 0-7. This local\_rank is used to set the device (i.e. which GPU to use)
# for the process and later used to set the device when creating a
# distributed data parallel model. It is also recommended to use NCCL
# backend in this hypothetical environment as NCCL is preferred for
# multi-gpu nodes.
#
print("Initialize Process Group...")
# Initialize Process Group
# v1 - init with url
dist.init_process_group(backend=dist_backend, init_method=dist_url, rank=int(sys.argv[1]), world_size=world_size)
# v2 - init with file
# dist.init_process_group(backend="nccl", init_method="file:///home/ubuntu/pt-distributed-tutorial/trainfile", rank=int(sys.argv[1]), world_size=world_size)
# v3 - init with environment variables
# dist.init_process_group(backend="nccl", init_method="env://", rank=int(sys.argv[1]), world_size=world_size)
# Establish Local Rank and set device on this node
local_rank = int(sys.argv[2])
dp_device_ids = [local_rank]
torch.cuda.set_device(local_rank)
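######################################################################
# As an aside, if you prefer the ``env://`` initialization described
# above, the rank0 address and port (and optionally the rank and world
# size) come from environment variables instead of a URL. The sketch
# below is an alternative to the call above, not an additional step; the
# address and port values are the same placeholders used for ``dist_url``.
#
# ::
#
#    import os
#    import torch.distributed as dist
#
#    # Point every process at the rank0 machine
#    os.environ['MASTER_ADDR'] = '172.31.22.234'
#    os.environ['MASTER_PORT'] = '23456'
#    # Rank and world size may also be supplied through the environment
#    os.environ['RANK'] = '0'          # global rank of this process
#    os.environ['WORLD_SIZE'] = '4'
#
#    dist.init_process_group(backend='nccl', init_method='env://')
#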
######################################################################
# Initialize Model
# ~~~~~~~~~~~~~~~~
#
# The next major step is to initialize the model to be trained. Here, we
# will use a resnet18 model from ``torchvision.models`` but any model may
# be used. First, we initialize the model and place it in GPU memory.
# Next, we make the model ``DistributedDataParallel``, which handles the
# distribution of the data to and from the model and is critical for
# distributed training. The ``DistributedDataParallel`` module also
# handles the averaging of gradients across the world, so we do not have
# to explicitly average the gradients in the training step.
#
# It is important to note that this is a blocking function, meaning
# program execution will wait at this function until *world\_size*
# processes have joined the process group. Also, notice we pass our device
# ids list as a parameter which contains the local rank (i.e. GPU) we are
# using. Finally, we specify the loss function and optimizer to train the
# model with.
#
print("Initialize Model...")
# Construct Model
model = models.resnet18(pretrained=False).cuda()
# Make model DistributedDataParallel
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=dp_device_ids, output_device=local_rank)
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), starting_lr, momentum=0.9, weight_decay=1e-4)
######################################################################
# Initialize Dataloaders
# ~~~~~~~~~~~~~~~~~~~~~~
#
# The last step in preparation for the training is to specify which
# dataset to use. Here we use the `STL-10
# dataset <https://cs.stanford.edu/~acoates/stl10/>`__ from
# `torchvision.datasets.STL10 <https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.STL10>`__.
# The STL10 dataset is a 10 class dataset of 96x96px color images. For use
# with our model, we resize the images to 224x224px in the transform. One
# distributed training specific item in this section is the use of the
# ``DistributedSampler`` for the training set, which is designed to be
# used in conjunction with ``DistributedDataParallel`` models. This object
# handles the partitioning of the dataset across the distributed
# environment so that not all models are training on the same subset of
# data, which would be counterproductive. Finally, we create the
# ``DataLoader`` objects, which are responsible for feeding the data to the
# processes.
#
# The STL-10 dataset will automatically download on the nodes if they are
# not present. If you wish to use your own dataset, you should download the
# data, write your own dataset handler, and construct a dataloader for
# your dataset here (a minimal sketch of such a handler follows the
# dataloader code below).
#
print("Initialize Dataloaders...")
# Define the transform for the data. Notice, we must resize to 224x224 with this dataset and model.
transform = transforms.Compose(
[transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
# Initialize Datasets. STL10 will automatically download if not present
trainset = datasets.STL10(root='./data', split='train', download=True, transform=transform)
valset = datasets.STL10(root='./data', split='test', download=True, transform=transform)
# Create DistributedSampler to handle distributing the dataset across nodes when training
# This can only be called after torch.distributed.init_process_group is called
train_sampler = torch.utils.data.distributed.DistributedSampler(trainset)
# Create the Dataloaders to feed data to the training and validation steps
train_loader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=(train_sampler is None), num_workers=workers, pin_memory=False, sampler=train_sampler)
val_loader = torch.utils.data.DataLoader(valset, batch_size=batch_size, shuffle=False, num_workers=workers, pin_memory=False)
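######################################################################
# As mentioned above, STL-10 can be replaced by your own data by writing
# a dataset handler. The sketch below is a minimal, hypothetical example
# of such a handler; the assumed input (a list of image paths paired with
# integer labels) is an illustration, not part of this tutorial's data.
# A dataset built this way drops into the same ``DistributedSampler`` and
# ``DataLoader`` calls used for STL-10 above.
#
# ::
#
#    from PIL import Image
#    import torch.utils.data
#
#    class MyImageDataset(torch.utils.data.Dataset):
#        """Loads (image_path, integer_label) pairs and applies a transform."""
#        def __init__(self, samples, transform=None):
#            # samples: list of (path_to_image, integer_label) tuples
#            self.samples = samples
#            self.transform = transform
#
#        def __len__(self):
#            return len(self.samples)
#
#        def __getitem__(self, index):
#            path, label = self.samples[index]
#            image = Image.open(path).convert('RGB')
#            if self.transform is not None:
#                image = self.transform(image)
#            return image, label
#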
######################################################################
# Training Loop
# ~~~~~~~~~~~~~
#
# The last step is to define the training loop. We have already done most
# of the work for setting up the distributed training so this is not
# distributed training specific. The only detail is setting the current
# epoch count in the ``DistributedSampler``, as the sampler shuffles the
# data going to each process deterministically based on epoch. After
# updating the sampler, the loop runs a full training epoch, runs a full
# validation step then prints the performance of the current model against
# the best performing model so far. After training for num\_epochs, the
# loop exits and the tutorial is complete. Notice, since this is an
# exercise we are not saving models, but one may wish to keep track of the
# best performing model and then save it at the end of training (see
# `here <https://github.com/pytorch/examples/blob/master/imagenet/main.py#L184>`__
# and the sketch after the loop below).
#
best_prec1 = 0

for epoch in range(num_epochs):
    # Set epoch count for DistributedSampler
    train_sampler.set_epoch(epoch)

    # Adjust learning rate according to schedule
    adjust_learning_rate(starting_lr, optimizer, epoch)

    # train for one epoch
    print("\nBegin Training Epoch {}".format(epoch+1))
    train(train_loader, model, criterion, optimizer, epoch)

    # evaluate on validation set
    print("Begin Validation @ Epoch {}".format(epoch+1))
    prec1 = validate(val_loader, model, criterion)

    # remember best prec@1 and save checkpoint if desired
    # is_best = prec1 > best_prec1
    best_prec1 = max(prec1, best_prec1)

    print("Epoch Summary: ")
    print("\tEpoch Accuracy: {}".format(prec1))
    print("\tBest Accuracy: {}".format(best_prec1))
######################################################################
# Running the Code
# ----------------
#
# Unlike most of the other PyTorch tutorials, this code may not be run
# directly out of this notebook. To run, download the .py version of this
# file (or convert it using
# `this <https://gist.github.com/chsasank/7218ca16f8d022e02a9c0deb94a310fe>`__)
# and upload a copy to both nodes. The astute reader will have noticed
# that we hardcoded the **node0-privateIP** and :math:`world\_size=4` but
# take the *rank* and *local\_rank* inputs as arg[1] and arg[2] command
# line arguments, respectively. Once uploaded, open two SSH terminals into
# each node.
#
# - On the first terminal for node0, run ``$ python main.py 0 0``
#
# - On the second terminal for node0 run ``$ python main.py 1 1``
#
# - On the first terminal for node1, run ``$ python main.py 2 0``
#
# - On the second terminal for node1 run ``$ python main.py 3 1``
#
# The programs will start and wait after printing "Initialize Model..."
# for all four processes to join the process group. Notice the first
# argument is not repeated as this is the unique global rank of the
# process. The second argument is repeated as that is the local rank of
# the process running on the node. If you run ``nvidia-smi`` on each node,
# you will see two processes on each node, one running on GPU0 and one on
# GPU1.
#
# We have now completed the distributed training example! Hopefully you
# can see how you would use this tutorial to help train your own models on
# your own datasets, even if you are not using the exact same distributed
# environment. If you are using AWS, don't forget to **SHUT DOWN YOUR
# NODES** if you are not using them or you may find an uncomfortably large
# bill at the end of the month.
#
# **Where to go next**
#
# - Check out the `launcher
#   utility <https://pytorch.org/docs/stable/distributed.html#launch-utility>`__
#   for a different way of kicking off the run
#
# - Check out the `torch.multiprocessing.spawn
#   utility <https://pytorch.org/docs/master/multiprocessing.html#spawning-subprocesses>`__
#   for another easy way of kicking off multiple distributed processes
#   (a minimal sketch follows below).
#   `PyTorch ImageNet Example <https://github.com/pytorch/examples/tree/master/imagenet>`__
#   has it implemented and can demonstrate how to use it.
#
# - If possible, set up an NFS so you only need one copy of the dataset
#
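######################################################################
# For reference, here is a minimal, hypothetical sketch of the
# ``torch.multiprocessing.spawn`` approach mentioned above: one launcher
# script per node starts one worker per local GPU, and each worker
# derives its global rank from the node's rank. The ``node_rank`` and
# ``gpus_per_node`` names are illustrative only.
#
# ::
#
#    import torch
#    import torch.distributed as dist
#    import torch.multiprocessing as mp
#
#    def worker(local_rank, node_rank, gpus_per_node, world_size):
#        global_rank = node_rank * gpus_per_node + local_rank
#        dist.init_process_group(backend='nccl',
#                                init_method='tcp://172.31.22.234:23456',
#                                rank=global_rank, world_size=world_size)
#        torch.cuda.set_device(local_rank)
#        # ... build the model, dataloaders, and training loop here ...
#
#    if __name__ == '__main__':
#        # On node0 pass node_rank=0, on node1 pass node_rank=1
#        mp.spawn(worker, nprocs=2, args=(0, 2, 4))
#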