
Commit 144de17
Author: EC2 Default User
Parent: 3306b8b

updates to notebooks

5 files changed: +159 -379 lines changed

notebooks/part-1-horovod/cifar10-distributed.ipynb

Lines changed: 26 additions & 8 deletions
@@ -4,19 +4,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "# Part 1: Convert training script to use horovod"
+  "## Exercise 1: Convert training script to use horovod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-  "This notebook contains training script to train a classification deep neural network on the CIFAR-10 dataset.\n",
-  "The CIFAR-10 consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). \n",
-  "\n",
   "You'll need to make the following modifications to your training script to use horovod for distributed training.\n",
-  "Once these changes are made, you convert the jupyter notebook into a python training script by running:\n",
-  "<code> $ jupyter nbconvert --to script cifar10-distributed.ipynb </code>\n",
   "\n",
   "1. Run hvd.init()\n",
   "2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list.\n",
@@ -26,6 +21,18 @@
   "6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them."
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "Look for cells that say **Change X** and fill in those cells with the modifications - where **X** is the change number. There are a total of 8 changes.\n",
+  "Click on **Solution** to see the answers\n",
+  "\n",
+  "After you've finished making necessary changes, run the script by hitting *Run > Run All Cells*.\n",
+  "\n",
+  "**Confirm that that the script still runs after introducing the horovod API**"
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -172,10 +179,13 @@
   "momentum = 0.9\n",
   "weight_decay = 2e-4\n",
   "optimizer = 'sgd'\n",
+  "gpu_count = 1\n",
   "\n",
   "# Data directories and other options\n",
-  "gpu_count = 1\n",
-  "checkpoint_dir = 'ckpt_dir'\n",
+  "checkpoint_dir = '../ckpt_dir'\n",
+  "if not os.path.exists(checkpoint_dir):\n",
+  "    os.makedirs(checkpoint_dir)\n",
+  "\n",
   "train_dir = '../data/train'\n",
   "validation_dir = '../data/validation'\n",
   "eval_dir = '../data/eval'\n",
@@ -407,6 +417,14 @@
   "print('Test loss :', score[0])\n",
   "print('Test accuracy:', score[1])"
   ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "Note once these changes are made, you can convert the jupyter notebook into a python training script by running:\n",
+  "<code> $ jupyter nbconvert --to script notebook_name.ipynb </code>"
+  ]
  }
  ],
  "metadata": {

notebooks/part-1-horovod/cifar10-single-instance.ipynb

Lines changed: 57 additions & 28 deletions
@@ -1,5 +1,25 @@
  {
  "cells": [
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "## Single CPU/GPU training on the local instance\n",
+  "This Jupyter notebook contains code that trains a DNN on the CIFAR10 dataset.\n",
+  "The CIFAR-10 dataset consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class).\n",
+  "\n",
+  "This script was written for local training on a single instance. Run through this notebook either cell-by-cell or by hitting *Run > Run All Cells*\n",
+  "\n",
+  "Once you start to feel comfortable with what the script is doing, we'll then start to make changes to this script so that it can run on a cluster in a distributed fashion."
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 1:** Import essentials packages and define constants"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -16,6 +36,7 @@
   "from tensorflow.keras.optimizers import Adam, SGD\n",
   "from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint\n",
   "\n",
+  "# Import DNN model definition file\n",
   "from model_def import get_model\n",
   "\n",
   "HEIGHT = 32\n",
@@ -27,6 +48,13 @@
   "NUM_TEST_IMAGES = 10000"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 2:** Define functions used to load and prepare dataset for training. We incorporate 3 types of data augmentation schemes: random resize, random crop, random flip. Feel free to update this if you're comfortable. Leave the cell as it is if you aren't comfortable making changes."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -44,15 +72,8 @@
   "    # Randomly flip the image horizontally.\n",
   "    image = tf.image.random_flip_left_right(image)\n",
   "\n",
-  "    return image"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "metadata": {},
-  "outputs": [],
-  "source": [
+  "    return image\n",
+  "\n",
   "def make_batch(filenames, batch_size):\n",
   "    \"\"\"Read the images and labels from 'filenames'.\"\"\"\n",
   "    # Repeat infinitely.\n",
@@ -66,15 +87,8 @@
   "    iterator = dataset.make_one_shot_iterator()\n",
   "\n",
   "    image_batch, label_batch = iterator.get_next()\n",
-  "    return image_batch, label_batch"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "metadata": {},
-  "outputs": [],
-  "source": [
+  "    return image_batch, label_batch\n",
+  "\n",
   "def single_example_parser(serialized_example):\n",
   "    \"\"\"Parses a single tf.Example into image and label tensors.\"\"\"\n",
   "    # Dimensions of the images in the CIFAR-10 dataset.\n",
@@ -101,26 +115,39 @@
   "    return image, label"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+  "**Step 3:** \n",
+  "* Define hyperameters, directories for train, validation and test.\n",
+  "* Load model from model_def.py\n",
+  "* Compile model and fit"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
   "# Hyper-parameters\n",
-  "epochs = 15\n",
+  "epochs = 1\n",
   "lr = 0.01\n",
   "batch_size = 128\n",
   "momentum = 0.9\n",
   "weight_decay = 2e-4\n",
   "optimizer = 'sgd'\n",
+  "gpu_count = 1\n",
   "\n",
   "# Data directories and other options\n",
-  "gpu_count = 1\n",
-  "checkpoint_dir = 'ckpt_dir'\n",
-  "train_dir = '../data/train'\n",
-  "validation_dir = '../data/validation'\n",
-  "eval_dir = '../data/eval'\n",
+  "checkpoint_dir = '../ckpt_dir'\n",
+  "if not os.path.exists(checkpoint_dir):\n",
+  "    os.makedirs(checkpoint_dir)\n",
+  "    \n",
+  "train_dir = '../dataset/train'\n",
+  "validation_dir = '../dataset/validation'\n",
+  "eval_dir = '../dataset/eval'\n",
   "\n",
   "train_dataset = make_batch(train_dir+'/train.tfrecords', batch_size)\n",
   "val_dataset = make_batch(validation_dir+'/validation.tfrecords', batch_size)\n",
@@ -192,11 +219,13 @@
   ]
  },
  {
-  "cell_type": "code",
-  "execution_count": null,
+  "cell_type": "markdown",
   "metadata": {},
-  "outputs": [],
-  "source": []
+  "source": [
+  "\n",
+  "----\n",
+  "##### Now that you have a successfully working training script, open `cifar10-distributed.ipynb` and start converting it for distributed training"
+  ]
  }
  ],
  "metadata": {
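The Step 2 cell added above refers to three augmentations (random resize, random crop, random flip), but only the flip and the `return image` line are visible in the hunks. As a rough sketch, assuming the usual TF 1.x CIFAR-10 preprocessing and a hypothetical function name, the full helper would look something like this:

```python
# Assumed reconstruction of the augmentation helper; the function name and padding size are guesses.
import tensorflow as tf

HEIGHT, WIDTH, DEPTH = 32, 32, 3  # CIFAR-10 image dimensions

def train_preprocess_fn(image):
    """Randomly resize, crop, and flip a single training image (TF 1.x ops)."""
    # Resize the image to add a margin of pixels on each side.
    image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
    # Randomly crop a [HEIGHT, WIDTH] section back out of the padded image.
    image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
    # Randomly flip the image horizontally.
    image = tf.image.random_flip_left_right(image)
    return image
```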
