|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# MNIST: learning to recognize handwritten digits" |
| 8 | + ] |
| 9 | + }, |
| 10 | + { |
| 11 | + "cell_type": "markdown", |
| 12 | + "metadata": {}, |
| 13 | + "source": [ |
| 14 | + "## Dataset exploration" |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "Before starting a machine learning or data science task, it is always useful to familiarize yourself with the dataset and its context." |
| 22 | + ] |
| 23 | + }, |
| 24 | + { |
| 25 | + "cell_type": "markdown", |
| 26 | + "metadata": {}, |
| 27 | + "source": [ |
| 28 | + "### Required imports" |
| 29 | + ] |
| 30 | + }, |
| 31 | + { |
| 32 | + "cell_type": "code", |
| 33 | + "execution_count": null, |
| 34 | + "metadata": {}, |
| 35 | + "outputs": [], |
| 36 | + "source": [ |
| 37 | + "from collections import Counter\n", |
| 38 | + "from keras.datasets import mnist\n", |
| 39 | + "import matplotlib.pyplot as plt\n", |
| 40 | + "%matplotlib inline\n", |
| 41 | + "import numpy as np" |
| 42 | + ] |
| 43 | + }, |
| 44 | + { |
| 45 | + "cell_type": "markdown", |
| 46 | + "metadata": {}, |
| 47 | + "source": [ |
| 48 | + "### Obtaining the dataset" |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "Keras' `datasets` module provides a loader for the MNIST dataset used in this notebook. Download the training and test sets." |
| 56 | + ] |
| 57 | + }, |
| 58 | + { |
| 59 | + "cell_type": "code", |
| 60 | + "execution_count": null, |
| 61 | + "metadata": {}, |
| 62 | + "outputs": [], |
| 63 | + "source": [ |
| 64 | + "(x_train, y_train), (x_test, y_test) = mnist.load_data()" |
| 65 | + ] |
| 66 | + }, |
| 67 | + { |
| 68 | + "cell_type": "markdown", |
| 69 | + "metadata": {}, |
| 70 | + "source": [ |
| 71 | + "### Dimensions and types" |
| 72 | + ] |
| 73 | + }, |
| 74 | + { |
| 75 | + "cell_type": "markdown", |
| 76 | + "metadata": {}, |
| 77 | + "source": [ |
| 78 | + "Determine the shape and data type of the training and test sets." |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "metadata": {}, |
| 84 | + "source": [ |
| 85 | + "The training set has 60,000 examples, the test set 10,000. Each input is a $28 \\times 28$ matrix of unsigned 8-bit integers; each output is a single unsigned 8-bit integer." |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "### Data semantics" |
| 93 | + ] |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "markdown", |
| 97 | + "metadata": {}, |
| 98 | + "source": [ |
| 99 | + "Each input represents a scanned grayscale image of a handwritten digit; the output is the corresponding integer. Visualize the images and check the labels for the first few training examples." |
| 100 | + ] |
| 101 | + }, |
| 102 | + { |
| 103 | + "cell_type": "code", |
| 104 | + "execution_count": null, |
| 105 | + "metadata": {}, |
| 106 | + "outputs": [], |
| 107 | + "source": [ |
| 108 | + "rows = 5\n", |
| 109 | + "cols = 7\n", |
| 110 | + "figure, axes = plt.subplots(rows, cols, figsize=(5, 3))\n", |
| 111 | + "plt.subplots_adjust(wspace=0.1, hspace=0.1)\n", |
| 112 | + "for img_nr in range(rows*cols):\n", |
| 113 | + "    row = img_nr // cols\n", |
| 114 | + "    col = img_nr % cols\n", |
| 115 | + "    axes[row, col].get_xaxis().set_visible(False)\n", |
| 116 | + "    axes[row, col].get_yaxis().set_visible(False)\n", |
| 117 | + "    axes[row, col].imshow(x_train[img_nr], cmap='gray')" |
| 118 | + ] |
| 119 | + }, |
| 120 | + { |
| 121 | + "cell_type": "code", |
| 122 | + "execution_count": null, |
| 123 | + "metadata": {}, |
| 124 | + "outputs": [], |
| 125 | + "source": [ |
| 126 | + "y_train[:rows*cols].reshape(rows, cols)" |
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "markdown", |
| 131 | + "metadata": {}, |
| 132 | + "source": [ |
| 133 | + "So this proves that I'm certainly not the only one cursed with bad handwriting." |
| 134 | + ] |
| 135 | + }, |
| 136 | + { |
| 137 | + "cell_type": "markdown", |
| 138 | + "metadata": {}, |
| 139 | + "source": [ |
| 140 | + "### Data distribution" |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "metadata": {}, |
| 146 | + "source": [ |
| 147 | + "An important question is whether all digits are represented in the training and test sets, and how they are distributed. An imbalanced distribution may affect the accuracy of the trained model." |
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "markdown", |
| 152 | + "metadata": {}, |
| 153 | + "source": [ |
| 154 | + "Although some digits, such as 1, are overrepresented and others, such as 5, are underrepresented, the distribution seems reasonably uniform, so it is likely that no special care needs to be taken." |
| 155 | + ] |
| 156 | + } |
| 157 | + ], |
| 158 | + "metadata": { |
| 159 | + "kernelspec": { |
| 160 | + "display_name": "Python 3", |
| 161 | + "language": "python", |
| 162 | + "name": "python3" |
| 163 | + }, |
| 164 | + "language_info": { |
| 165 | + "codemirror_mode": { |
| 166 | + "name": "ipython", |
| 167 | + "version": 3 |
| 168 | + }, |
| 169 | + "file_extension": ".py", |
| 170 | + "mimetype": "text/x-python", |
| 171 | + "name": "python", |
| 172 | + "nbconvert_exporter": "python", |
| 173 | + "pygments_lexer": "ipython3", |
| 174 | + "version": "3.7.3" |
| 175 | + } |
| 176 | + }, |
| 177 | + "nbformat": 4, |
| 178 | + "nbformat_minor": 2 |
| 179 | +} |