Adding DataFrame inspection (checks for data abnormalities and furthe…

…r inspects individual columns with summary statistics and distribution plots)
ellisonbg · Aug 7, 2018 · 02d9500 · 02d9500
1 parent cb5f76a
commit 02d9500
Show file tree

Hide file tree

Showing 9 changed files with 65,297 additions and 357 deletions.
diff --git a/IntroductionToPlyto.ipynb b/IntroductionToPlyto.ipynb
diff --git a/PlytoData.ipynb b/PlytoData.ipynb
diff --git a/PlytoKeras.ipynb b/PlytoKeras.ipynb
@@ -0,0 +1,346 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction to Plyto with Keras\n",
+    "#### Python Machine Learning Visualization Toolkit\n",
+    "This notebook will demonstrate how to use our Keras Accuracy Loss function with the PlytoAPI to visualize model accuracy and loss throughout the training process of a machine learning algorithm, as well as a tutorial on how to create your own callback function"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<center><img src='style/icons/machinelearning-blue.svg'></center>\n",
+    "\n",
+    "Plyto toolbar item opens the model visualizer for this notebook!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using TensorFlow backend.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from plyto import PlytoAPI, KerasAccuracyLossCallback"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### How it works"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A callback function that takes a plyto instance as a parameter is passed as a callback to model.fit when using keras for machine learning. A plyto instance requires an Altair spec to define plots. Below is an example of a simple altair spec to create a line graph of samples versus accuracy."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# an array of altair specs with one plot of samples versus accuracy\n",
+    "spec = [\n",
+    "    {\n",
+    "        # specifies an altair spec\n",
+    "        \"$schema\": \"https://vega.github.io/schema/vega-lite/v2.json\",\n",
+    "        \"name\": \"accuracyGraph\",\n",
+    "        \n",
+    "        #size of the plot\n",
+    "        \"config\": {\n",
+    "            \"view\": {\n",
+    "                \"height\": 300,\n",
+    "                \"width\": 300\n",
+    "            }\n",
+    "        },\n",
+    "        \n",
+    "        # name of the dataset as defined in the backend of the callback function\n",
+    "        \"data\": {\n",
+    "            \"name\": \"dataSet\"\n",
+    "        },\n",
+    "        \n",
+    "        # visual encodings of the plot\n",
+    "        \"encoding\": {\n",
+    "            \"x\": {\n",
+    "                \"field\": \"samples\",\n",
+    "                \"type\": \"quantitative\"\n",
+    "            },\n",
+    "            \"y\": {\n",
+    "                \"field\": \"accuracy\",\n",
+    "                \"type\": \"quantitative\"\n",
+    "            }\n",
+    "        },\n",
+    "        \n",
+    "        \n",
+    "        \"mark\": \"line\"\n",
+    "    }\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Running a model\n",
+    "To demonstrate how plyto works, we will be looking at the MNIST handwritten digit data, which can be loaded from keras.datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from keras.datasets import mnist\n",
+    "from keras.models import Sequential\n",
+    "from keras.layers import Dense, Dropout, Activation, Flatten, Convolution2D, MaxPooling2D\n",
+    "from keras.utils import np_utils\n",
+    "\n",
+    "# Load pre-shuffled MNIST data into train and test sets\n",
+    "(X_train, y_train), (X_test, y_test) = mnist.load_data()\n",
+    " \n",
+    "# Preprocess input data\n",
+    "X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)\n",
+    "X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)\n",
+    "X_train = X_train.astype('float32')\n",
+    "X_test = X_test.astype('float32')\n",
+    "X_train /= 255\n",
+    "X_test /= 255\n",
+    " \n",
+    "# Preprocess class labels\n",
+    "Y_train = np_utils.to_categorical(y_train, 10)\n",
+    "Y_test = np_utils.to_categorical(y_test, 10)\n",
+    " \n",
+    "# Define model architecture\n",
+    "model = Sequential()\n",
+    " \n",
+    "model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))\n",
+    "model.add(Convolution2D(32, (3, 3), activation='tanh'))\n",
+    "model.add(MaxPooling2D(pool_size=(2,2)))\n",
+    "model.add(Dropout(0.25))\n",
+    " \n",
+    "model.add(Flatten())\n",
+    "model.add(Dense(128, activation='relu'))\n",
+    "model.add(Dropout(0.5))\n",
+    "model.add(Dense(10, activation='tanh'))\n",
+    "\n",
+    "# Compile the model\n",
+    "model.compile(loss='categorical_crossentropy',\n",
+    "              optimizer='adam',\n",
+    "              metrics=['accuracy'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "However you fit your model, simply add the callback as a parameter and open the Plyto model visualizer to see your statistics and plots update."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<keras.callbacks.History at 0x1167eab38>"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# pass the spec into PlytoAPI\n",
+    "plytoInstance = PlytoAPI(spec)\n",
+    "\n",
+    "# pass plyto instance to the callback function\n",
+    "callback = KerasAccuracyLossCallback(plytoInstance)\n",
+    "\n",
+    "model.fit(X_train, Y_train, batch_size=None, epochs=2, verbose=0, callbacks=[callback])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Note: if you are to stop and re-run the model, the plytoInstance and callback must be re-initialized. We recommend initializing them in the same cell as the call to model.fit() to ensure this works properly*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Writing your own callback function"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A callback function for Plyto is a class that takes a keras callback as a parameter. \n",
+    "\n",
+    "Within this custom function, you can define functions to execute at specific points in running the model.\n",
+    "\n",
+    "Keras Callbacks\n",
+    "- on_epoch_begin: called at the beginning of every epoch.\n",
+    "- on_epoch_end: called at the end of every epoch.\n",
+    "- on_batch_begin: called at the beginning of every batch.\n",
+    "- on_batch_end: called at the end of every batch.\n",
+    "- on_train_begin: called at the beginning of model training.\n",
+    "- on_train_end: called at the end of model training.\n",
+    "\n",
+    "For more information on using callbacks, visit [keras callback documentation](https://keras.io/callbacks/).\n",
+    "\n",
+    "For the progress bars in the status bar to work correctly, your callback function must have epochs, sample_amount, total_progress, current_progress, mode, and epoch_number. Further, total_runtime and starttime is required for the panel to display the runtime once the model is complete. Below is a base to work off of, only containing these variables for basic functionality and passing no altair spec for plots."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from keras.callbacks import Callback\n",
+    "from time import time\n",
+    "\n",
+    "\n",
+    "class Basic(Callback):\n",
+    "\n",
+    "    def __init__(self, plyto_instance):\n",
+    "        self.epochs = None\n",
+    "        self.sample_amount = 0\n",
+    "        self.total_progress = 0\n",
+    "        self.current_progress = 0\n",
+    "        self.mode = 0\n",
+    "        self.total_runtime = 0\n",
+    "        self.starttime = time()\n",
+    "        self.epoch_number = 1\n",
+    "        self.plyto = plyto_instance\n",
+    "\n",
+    "    def on_train_begin(self, logs={}):\n",
+    "        \"\"\"\n",
+    "        Get number of samples/steps per epoch and total number of epochs\n",
+    "        \"\"\"\n",
+    "        if \"samples\" in self.params:\n",
+    "            self.sample_amount = self.params[\"samples\"]\n",
+    "        elif \"nb_sample\" in self.params:\n",
+    "            self.sample_amount = self.params[\"nb_sample\"]\n",
+    "        else:\n",
+    "            self.sample_amount = self.params[\"steps\"]\n",
+    "            self.mode = 1\n",
+    "\n",
+    "        self.epochs = self.params[\"epochs\"]\n",
+    "        self.plyto.update_size(self.sample_amount)\n",
+    "        self.plyto.update_total_steps(self.epochs)\n",
+    "\n",
+    "    def on_epoch_begin(self, epoch, logs={}):\n",
+    "        \"\"\"\n",
+    "        Reset the current epoch's progress every new epoch\n",
+    "        \"\"\"\n",
+    "        self.current_progress = 0\n",
+    "\n",
+    "    def on_epoch_end(self, epoch, logs={}):\n",
+    "        self.epoch_number += 1\n",
+    "        self.plyto.update_current_step(self.epoch_number)\n",
+    "\n",
+    "    def on_batch_end(self, batch, logs={}):\n",
+    "        \"\"\"\n",
+    "        Update statistics and datasets\n",
+    "        \"\"\"\n",
+    "        if self.mode == 0:\n",
+    "            self.current_progress += logs.get(\"size\")\n",
+    "            self.total_progress += logs.get(\"size\")\n",
+    "        else:\n",
+    "            self.current_progress += 1\n",
+    "            self.total_progress += 1\n",
+    "        self.total_runtime = time() - self.starttime\n",
+    "        self.plyto.update_runtime(self.total_runtime)\n",
+    "        self.plyto.update_current_progress(self.current_progress)\n",
+    "        self.plyto.update_total_progress(self.total_progress)\n",
+    "        self.plyto.update_data_set(\n",
+    "            {\n",
+    "                \"samples\": self.total_progress,\n",
+    "            }\n",
+    "        )\n",
+    "        if self.total_progress % 256 == 0 or self.total_progress == (self.epochs * self.sample_amount):\n",
+    "            self.plyto.send_data()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<keras.callbacks.History at 0x1164a46d8>"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "basicPlatoInstance = PlytoAPI([])\n",
+    "basicCallback = Basic(basicPlatoInstance)\n",
+    "model.fit(X_train, Y_train, batch_size=None, epochs=2, verbose=0, callbacks=[basicCallback])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Only variables in the data set can be used in the Altair plotting spec and will be displayed in the panel. The basic example only includes the variale \"samples\", as seen when updating the Plyto data set with\n",
+    "\n",
+    "```python\n",
+    "self.plyto.update_data_set({\n",
+    "    \"samples\": self.total_progress\n",
+    "})\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/package.json b/package.json
@@ -26,6 +26,7 @@
     "@jupyterlab/mainmenu": "^0.6.2",
     "@jupyterlab/notebook": "^0.17.2",
     "@jupyterlab/statusbar": "^0.2.0",
+    "pandas": "0.0.3",
     "typestyle": "^2.0.1",
     "vega": "^4.2.0",
     "vega-embed": "^3.15.0",

diff --git a/plyto/__init__.py b/plyto/__init__.py
@@ -1,2 +1,3 @@
 from .plyto import PlytoAPI
-from .keras_accuracy_loss_callback import KerasAccuracyLossCallback
+from .keras_accuracy_loss_callback import KerasAccuracyLossCallback
+from .checkData import CheckData, CheckColumn