|
1 | 1 | {
|
2 | 2 | "cells": [
|
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "## Single CPU/GPU training on the local instance\n", |
| 8 | + "This Jupyter notebook contains code that trains a DNN on the CIFAR10 dataset.\n", |
| 9 | + "The CIFAR-10 dataset consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class).\n", |
| 10 | + "\n", |
| 11 | + "This script was written for local training on a single instance. Run through this notebook either cell-by-cell or by hitting *Run > Run All Cells*\n", |
| 12 | + "\n", |
| 13 | + "Once you start to feel comfortable with what the script is doing, we'll then start to make changes to this script so that it can run on a cluster in a distributed fashion." |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "markdown", |
| 18 | + "metadata": {}, |
| 19 | + "source": [ |
| 20 | + "**Step 1:** Import essentials packages and define constants" |
| 21 | + ] |
| 22 | + }, |
3 | 23 | {
|
4 | 24 | "cell_type": "code",
|
5 | 25 | "execution_count": null,
|
|
16 | 36 | "from tensorflow.keras.optimizers import Adam, SGD\n",
|
17 | 37 | "from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint\n",
|
18 | 38 | "\n",
|
| 39 | + "# Import DNN model definition file\n", |
19 | 40 | "from model_def import get_model\n",
|
20 | 41 | "\n",
|
21 | 42 | "HEIGHT = 32\n",
|
|
27 | 48 | "NUM_TEST_IMAGES = 10000"
|
28 | 49 | ]
|
29 | 50 | },
|
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "**Step 2:** Define functions used to load and prepare dataset for training. We incorporate 3 types of data augmentation schemes: random resize, random crop, random flip. Feel free to update this if you're comfortable. Leave the cell as it is if you aren't comfortable making changes." |
| 56 | + ] |
| 57 | + }, |
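For reference, the pad-crop-flip augmentation that the Step 2 cell describes usually looks like the sketch below. The body of `train_preprocess_fn` is only partially visible in this diff, so the padding amount and the `WIDTH`/`DEPTH` constants are assumptions rather than the notebook's exact code.

```python
import tensorflow as tf

# Assumed constants: HEIGHT is defined in Step 1; WIDTH and DEPTH are assumed to be 32 and 3 for CIFAR-10.
HEIGHT, WIDTH, DEPTH = 32, 32, 3

def train_preprocess_fn(image):
    # Pad the 32x32 image to 40x40 so the random crop below has room to shift.
    image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
    # Cut a random HEIGHT x WIDTH patch back out of the padded image.
    image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
    # Randomly flip the image horizontally (this line also appears in the cell below).
    image = tf.image.random_flip_left_right(image)
    return image
```

All three ops are cheap graph ops, so they can be mapped over the `tf.data` pipeline that `make_batch` builds without noticeably slowing training.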
30 | 58 | {
|
31 | 59 | "cell_type": "code",
|
32 | 60 | "execution_count": null,
|
|
44 | 72 | " # Randomly flip the image horizontally.\n",
|
45 | 73 | " image = tf.image.random_flip_left_right(image)\n",
|
46 | 74 | "\n",
|
47 |
| - " return image" |
48 |
| - ] |
49 |
| - }, |
50 |
| - { |
51 |
| - "cell_type": "code", |
52 |
| - "execution_count": null, |
53 |
| - "metadata": {}, |
54 |
| - "outputs": [], |
55 |
| - "source": [ |
| 75 | + " return image\n", |
| 76 | + "\n", |
56 | 77 | "def make_batch(filenames, batch_size):\n",
|
57 | 78 | " \"\"\"Read the images and labels from 'filenames'.\"\"\"\n",
|
58 | 79 | " # Repeat infinitely.\n",
|
|
66 | 87 | " iterator = dataset.make_one_shot_iterator()\n",
|
67 | 88 | "\n",
|
68 | 89 | " image_batch, label_batch = iterator.get_next()\n",
|
69 |
| - " return image_batch, label_batch" |
70 |
| - ] |
71 |
| - }, |
72 |
| - { |
73 |
| - "cell_type": "code", |
74 |
| - "execution_count": null, |
75 |
| - "metadata": {}, |
76 |
| - "outputs": [], |
77 |
| - "source": [ |
| 90 | + " return image_batch, label_batch\n", |
| 91 | + "\n", |
78 | 92 | "def single_example_parser(serialized_example):\n",
|
79 | 93 | " \"\"\"Parses a single tf.Example into image and label tensors.\"\"\"\n",
|
80 | 94 | " # Dimensions of the images in the CIFAR-10 dataset.\n",
|
|
101 | 115 | " return image, label"
|
102 | 116 | ]
|
103 | 117 | },
|
| 118 | + { |
| 119 | + "cell_type": "markdown", |
| 120 | + "metadata": {}, |
| 121 | + "source": [ |
| 122 | + "**Step 3:** \n", |
| 123 | + "* Define hyperameters, directories for train, validation and test.\n", |
| 124 | + "* Load model from model_def.py\n", |
| 125 | + "* Compile model and fit" |
| 126 | + ] |
| 127 | + }, |
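The compile-and-fit part of the next cell is largely elided in this diff, so here is a rough, non-authoritative sketch of how the pieces named in Step 3 fit together. The `get_model` argument list, the loss choice and the train/validation split sizes are assumptions.

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# Build the network defined in model_def.py (the argument list is assumed).
model = get_model(lr, weight_decay, optimizer, momentum)

# Plain SGD with momentum; weight decay is assumed to be applied inside the model definition.
opt = SGD(lr=lr, momentum=momentum)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',  # or sparse_categorical_crossentropy, depending on label format
              metrics=['accuracy'])

callbacks = [ModelCheckpoint(checkpoint_dir + '/checkpoint-{epoch}.h5'),
             TensorBoard(log_dir=checkpoint_dir)]

# make_batch() returns (image, label) tensor pairs, so pass them as x/y with explicit step counts.
train_images, train_labels = train_dataset
val_images, val_labels = val_dataset
model.fit(x=train_images, y=train_labels,
          steps_per_epoch=40000 // batch_size,       # assumed size of the training split
          validation_data=(val_images, val_labels),
          validation_steps=10000 // batch_size,      # assumed size of the validation split
          epochs=epochs,
          callbacks=callbacks)
```

With `epochs = 1` this is a quick smoke test that confirms the input pipeline, model and checkpointing all work before moving on to the distributed version.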
104 | 128 | {
|
105 | 129 | "cell_type": "code",
|
106 | 130 | "execution_count": null,
|
107 | 131 | "metadata": {},
|
108 | 132 | "outputs": [],
|
109 | 133 | "source": [
|
110 | 134 | "# Hyper-parameters\n",
|
111 |
| - "epochs = 15\n", |
| 135 | + "epochs = 1\n", |
112 | 136 | "lr = 0.01\n",
|
113 | 137 | "batch_size = 128\n",
|
114 | 138 | "momentum = 0.9\n",
|
115 | 139 | "weight_decay = 2e-4\n",
|
116 | 140 | "optimizer = 'sgd'\n",
|
| 141 | + "gpu_count = 1\n", |
117 | 142 | "\n",
|
118 | 143 | "# Data directories and other options\n",
|
119 |
| - "gpu_count = 1\n", |
120 |
| - "checkpoint_dir = 'ckpt_dir'\n", |
121 |
| - "train_dir = '../data/train'\n", |
122 |
| - "validation_dir = '../data/validation'\n", |
123 |
| - "eval_dir = '../data/eval'\n", |
| 144 | + "checkpoint_dir = '../ckpt_dir'\n", |
| 145 | + "if not os.path.exists(checkpoint_dir):\n", |
| 146 | + " os.makedirs(checkpoint_dir)\n", |
| 147 | + " \n", |
| 148 | + "train_dir = '../dataset/train'\n", |
| 149 | + "validation_dir = '../dataset/validation'\n", |
| 150 | + "eval_dir = '../dataset/eval'\n", |
124 | 151 | "\n",
|
125 | 152 | "train_dataset = make_batch(train_dir+'/train.tfrecords', batch_size)\n",
|
126 | 153 | "val_dataset = make_batch(validation_dir+'/validation.tfrecords', batch_size)\n",
|
|
192 | 219 | ]
|
193 | 220 | },
|
194 | 221 | {
|
195 |
| - "cell_type": "code", |
196 |
| - "execution_count": null, |
| 222 | + "cell_type": "markdown", |
197 | 223 | "metadata": {},
|
198 |
| - "outputs": [], |
199 |
| - "source": [] |
| 224 | + "source": [ |
| 225 | + "\n", |
| 226 | + "----\n", |
| 227 | + "##### Now that you have a successfully working training script, open `cifar10-distributed.ipynb` and start converting it for distributed training" |
| 228 | + ] |
200 | 229 | }
|
201 | 230 | ],
|
202 | 231 | "metadata": {
|
|