Demonstration of training a small ResNet on CIFAR10 to 94% test accuracy in 79 seconds as described in this blog series.
Instructions to reproduce on an AWS p3.2xlarge instance:
- set up an instance with AMI: Deep Learning AMI (Ubuntu) Version 11.0 (ami-c47c28bc in us-west-2)
- ssh into the instance: ssh -i $KEY_PAIR ubuntu@$PUBLIC_IP_ADDRESS -L 8901:localhost:8901
- on the remote machine
    - source activate pytorch_p36
    - pip install pydot (optional for network visualisation)
    - git clone https://github.com/davidcpage/cifar10-fast.git
    - jupyter notebook --no-browser --port=8901
 
- open the jupyter notebook url in a browser, open demo.ipynb and run all the cells
In my test, 35 out of 50 runs reached 94% test set accuracy with a median of 94.08%. Runtime for 24 epochs is roughly 79s.
A second notebook experiments.ipynb contains code to reproduce the main results from the posts.
NB: demo.ipynb also works on the latest Deep Learning AMI (Ubuntu) Version 16.0, but some examples in experiments.ipynb trigger a core dump when using TensorCores in versions after 11.0.
To reproduce DAWNBench timings, set up the AWS p3.2xlarge instance as above, but instead of launching a jupyter notebook on the remote machine, change directory to cifar10-fast and run python dawn.py from the command line. Timings in DAWNBench format will be saved to logs.tsv.
Note that DAWNBench timings do not include validation time, as in this FAQ, but do include initial preprocessing, as indicated here. DAWNBench timing is roughly 74 seconds, which breaks down as 79s (as above) - 7s (validation) + 2s (preprocessing).
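The breakdown can be checked with a few lines of Python. This is only a sketch of the arithmetic using the rounded figures quoted above; the variable names are illustrative and do not come from dawn.py.

```python
# Rounded figures quoted in this README, in seconds.
notebook_runtime = 79    # 24 epochs as timed in demo.ipynb
validation_time = 7      # not counted by DAWNBench
preprocessing_time = 2   # counted by DAWNBench

# DAWNBench timing = notebook runtime - validation + preprocessing
dawnbench_time = notebook_runtime - validation_time + preprocessing_time
print(f"DAWNBench timing: roughly {dawnbench_time}s")
```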
- Core functionality has moved to core.py whilst PyTorch-specific stuff is in torch_backend.py to allow easier experimentation with different frameworks.
- Stats (loss/accuracy) are collected on the GPU and bulk transferred to the CPU at the end of each epoch. This speeds up some experiments so timings in demo.ipynb and experiments.ipynb no longer match the blog posts.
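To illustrate the idea behind the stats change, here is a minimal PyTorch sketch of keeping per-batch losses on the device and copying them to the CPU once per epoch. It is not the code in torch_backend.py: run_epoch, the toy batches and the squared-error "loss" are hypothetical stand-ins, and the sketch falls back to the CPU when no GPU is available.

```python
import torch

def run_epoch(batches, device):
    """Return the per-batch losses for one epoch as a single CPU tensor."""
    losses = []  # tensors that stay on `device` for the whole epoch
    for inputs, targets in batches:
        inputs, targets = inputs.to(device), targets.to(device)
        # Stand-in for a real forward pass and loss computation.
        loss = ((inputs - targets) ** 2).mean()
        # Avoid loss.item() here: it would force a device-to-host
        # synchronisation on every batch.
        losses.append(loss.detach())
    # One bulk transfer to the CPU at the end of the epoch.
    return torch.stack(losses).cpu()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Toy data: batch i is filled with the value i, targets are all zeros.
batches = [(torch.full((4, 3), float(i)), torch.zeros(4, 3)) for i in range(5)]
epoch_losses = run_epoch(batches, device)
print(epoch_losses.tolist())
```

The per-batch alternative (calling .item() or .cpu() inside the loop) makes the host wait for the GPU on every step, which is the overhead the bulk transfer is meant to avoid.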