
Commit 144de17
Author: EC2 Default User
Parent: 3306b8b

updates to notebooks

5 files changed: +159 -379 lines changed

notebooks/part-1-horovod/cifar10-distributed.ipynb

Lines changed: 26 additions & 8 deletions
@@ -4,19 +4,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "# Part 1: Convert training script to use horovod"
+  "## Exercise 1: Convert training script to use horovod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "This notebook contains training script to train a classification deep neural network on the CIFAR-10 dataset.\n",
-  "The CIFAR-10 consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). \n",
-  "\n",
   "You'll need to make the following modifications to your training script to use horovod for distributed training.\n",
-  "Once these changes are made, you convert the jupyter notebook into a python training script by running:\n",
-  "<code> $ jupyter nbconvert --to script cifar10-distributed.ipynb </code>\n",
   "\n",
   "1. Run hvd.init()\n",
   "2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list.\n",
@@ -26,6 +21,18 @@
   "6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them."
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "Look for cells that say **Change X** and fill in those cells with the modifications - where **X** is the change number. There are a total of 8 changes.\n",
+  "Click on **Solution** to see the answers\n",
+  "\n",
+  "After you've finished making necessary changes, run the script by hitting *Run > Run All Cells*.\n",
+  "\n",
+  "**Confirm that that the script still runs after introducing the horovod API**"
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -172,10 +179,13 @@
   "momentum = 0.9\n",
   "weight_decay = 2e-4\n",
   "optimizer = 'sgd'\n",
+  "gpu_count = 1\n",
   "\n",
   "# Data directories and other options\n",
-  "gpu_count = 1\n",
-  "checkpoint_dir = 'ckpt_dir'\n",
+  "checkpoint_dir = '../ckpt_dir'\n",
+  "if not os.path.exists(checkpoint_dir):\n",
+  "    os.makedirs(checkpoint_dir)\n",
+  "\n",
   "train_dir = '../data/train'\n",
   "validation_dir = '../data/validation'\n",
   "eval_dir = '../data/eval'\n",
@@ -407,6 +417,14 @@
   "print('Test loss :', score[0])\n",
   "print('Test accuracy:', score[1])"
   ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "Note once these changes are made, you can convert the jupyter notebook into a python training script by running:\n",
+  "<code> $ jupyter nbconvert --to script notebook_name.ipynb </code>"
+  ]
  }
  ],
  "metadata": {

notebooks/part-1-horovod/cifar10-single-instance.ipynb

Lines changed: 57 additions & 28 deletions
@@ -1,5 +1,25 @@
  {
  "cells": [
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "## Single CPU/GPU training on the local instance\n",
+  "This Jupyter notebook contains code that trains a DNN on the CIFAR10 dataset.\n",
+  "The CIFAR-10 dataset consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class).\n",
+  "\n",
+  "This script was written for local training on a single instance. Run through this notebook either cell-by-cell or by hitting *Run > Run All Cells*\n",
+  "\n",
+  "Once you start to feel comfortable with what the script is doing, we'll then start to make changes to this script so that it can run on a cluster in a distributed fashion."
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 1:** Import essentials packages and define constants"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -16,6 +36,7 @@
   "from tensorflow.keras.optimizers import Adam, SGD\n",
   "from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint\n",
   "\n",
+  "# Import DNN model definition file\n",
   "from model_def import get_model\n",
   "\n",
   "HEIGHT = 32\n",
@@ -27,6 +48,13 @@
   "NUM_TEST_IMAGES = 10000"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 2:** Define functions used to load and prepare dataset for training. We incorporate 3 types of data augmentation schemes: random resize, random crop, random flip. Feel free to update this if you're comfortable. Leave the cell as it is if you aren't comfortable making changes."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -44,15 +72,8 @@
   "    # Randomly flip the image horizontally.\n",
   "    image = tf.image.random_flip_left_right(image)\n",
   "\n",
-  "    return image"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "metadata": {},
-  "outputs": [],
-  "source": [
+  "    return image\n",
+  "\n",
   "def make_batch(filenames, batch_size):\n",
   "    \"\"\"Read the images and labels from 'filenames'.\"\"\"\n",
   "    # Repeat infinitely.\n",
@@ -66,15 +87,8 @@
   "    iterator = dataset.make_one_shot_iterator()\n",
   "\n",
   "    image_batch, label_batch = iterator.get_next()\n",
-  "    return image_batch, label_batch"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "metadata": {},
-  "outputs": [],
-  "source": [
+  "    return image_batch, label_batch\n",
+  "\n",
   "def single_example_parser(serialized_example):\n",
   "    \"\"\"Parses a single tf.Example into image and label tensors.\"\"\"\n",
   "    # Dimensions of the images in the CIFAR-10 dataset.\n",
@@ -101,26 +115,39 @@
   "    return image, label"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 3:** \n",
+  "* Define hyperameters, directories for train, validation and test.\n",
+  "* Load model from model_def.py\n",
+  "* Compile model and fit"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
   "# Hyper-parameters\n",
-  "epochs = 15\n",
+  "epochs = 1\n",
   "lr = 0.01\n",
   "batch_size = 128\n",
   "momentum = 0.9\n",
   "weight_decay = 2e-4\n",
   "optimizer = 'sgd'\n",
+  "gpu_count = 1\n",
   "\n",
   "# Data directories and other options\n",
-  "gpu_count = 1\n",
-  "checkpoint_dir = 'ckpt_dir'\n",
-  "train_dir = '../data/train'\n",
-  "validation_dir = '../data/validation'\n",
-  "eval_dir = '../data/eval'\n",
+  "checkpoint_dir = '../ckpt_dir'\n",
+  "if not os.path.exists(checkpoint_dir):\n",
+  "    os.makedirs(checkpoint_dir)\n",
+  "    \n",
+  "train_dir = '../dataset/train'\n",
+  "validation_dir = '../dataset/validation'\n",
+  "eval_dir = '../dataset/eval'\n",
   "\n",
   "train_dataset = make_batch(train_dir+'/train.tfrecords', batch_size)\n",
   "val_dataset = make_batch(validation_dir+'/validation.tfrecords', batch_size)\n",
@@ -192,11 +219,13 @@
   ]
  },
  {
-  "cell_type": "code",
-  "execution_count": null,
+  "cell_type": "markdown",
   "metadata": {},
-  "outputs": [],
-  "source": []
+  "source": [
+  "\n",
+  "----\n",
+  "##### Now that you have a successfully working training script, open `cifar10-distributed.ipynb` and start converting it for distributed training"
+  ]
  }
  ],
  "metadata": {
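The Step 2 cell added above refers to three augmentations (random resize, random crop, random flip), but only the flip and the `return image` line are visible in the hunks. As a rough sketch, assuming the usual TF 1.x CIFAR-10 preprocessing and a hypothetical function name, the full helper would look something like this:

```python
# Assumed reconstruction of the augmentation helper; the function name and padding size are guesses.
import tensorflow as tf

HEIGHT, WIDTH, DEPTH = 32, 32, 3  # CIFAR-10 image dimensions

def train_preprocess_fn(image):
    """Randomly resize, crop, and flip a single training image (TF 1.x ops)."""
    # Resize the image to add a margin of pixels on each side.
    image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
    # Randomly crop a [HEIGHT, WIDTH] section back out of the padded image.
    image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
    # Randomly flip the image horizontally.
    image = tf.image.random_flip_left_right(image)
    return image
```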
