@@ -0,0 +1,361 @@
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Introduction\n", | ||
"\n", | ||
"$\\newcommand{\\G}{\\mathcal{G}}$\n", | ||
"$\\newcommand{\\V}{\\mathcal{V}}$\n", | ||
"$\\newcommand{\\E}{\\mathcal{E}}$\n", | ||
"$\\newcommand{\\R}{\\mathbb{R}}$\n", | ||
"\n", | ||
"This notebook shows how to apply our graph ConvNet ([paper] & [code]), or any other, to your structured or unstructured data. For this example, we assume that we have $n$ samples $x_i \\in \\R^{d_x}$ arranged in a data matrix $$X = [x_1, ..., x_n]^T \\in \\R^{n \\times d_x}.$$ Each sample $x_i$ is associated with a vector $y_i \\in \\R^{d_y}$ for a regression task or a label $y_i \\in \\{0,\\ldots,C\\}$ for a classification task.\n", | ||
"\n", | ||
"[paper]: https://arxiv.org/abs/1606.09375\n", | ||
"[code]: https://github.com/mdeff/cnn_graph\n", | ||
"\n", | ||
"From there, we'll structure our data with a graph $\\G = (\\V, \\E, A)$ where $\\V$ is the set of $d_x = |\\V|$ vertices, $\\E$ is the set of edges and $A \\in \\R^{d_x \\times d_x}$ is the adjacency matrix. That matrix represents the weight of each edge, i.e. $A_{i,j}$ is the weight of the edge connecting $v_i \\in \\V$ to $v_j \\in \\V$. The weights of that feature graph thus represent pairwise relationships between features $i$ and $j$. We call that regime **signal classification / regression**, as the samples $x_i$ to be classified or regressed are graph signals.\n", | ||
"\n", | ||
"Some applications of that regime:\n", | ||
"* Task classification for task fMRI: the graph is a functional or anatomical connectome. Graph signals are activations measured by fMRI while the subject is performing some task. The goal is to find which task.\n", | ||
"* Anomaly detection: given a transportation, energy or communication network and some traffic measures, predict whether something is going wrong.\n", | ||
"* Text classification: each document is modeled as a graph signal (bag-of-words or TF-IDF) and each node represents a word. The graph represents the vocabulary, where edge weights indicate the similarity between words." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Other modelling possibilities include:\n", | ||
"1. Using a data graph, i.e. an adjacency matrix $A \\in \\R^{n \\times n}$ which represents pairwise relationships between samples $x_i \\in \\R^{d_x}$. The problem is here to predict a graph signal $y \\in \\R^{n \\times d_y}$ given a graph characterized by $A$ and some graph signals $X \\in \\R^{n \\times d_x}$. We call that regime **node classification / regression**, as we classify or regress nodes instead of signals.\n", | ||
" 1. Example application: text classification where each document is modeled as a node and each hyperlink or citation is an edge between them. The graph signals may be bag-of-words or TF-IDF representations of the documents.\n", | ||
" 2. [Kipf & Weiling (2016)][kipf_weiling] uses a first-order approximation of our spectral graph convolution for semi-supervised node classification.\n", | ||
"2. Another problem of interest is whole graph classification, with or without signals on top. We'll call that third regime **graph classification / regression**. The problem here is to classify or regress a whole graph $A_i \\in \\R^{n \\times n}$ (with or without an associated data matrix $X_i \\in \\R^{n \\times d_x}$) into $y_i \\in \\R^{d_y}$. In case we have no signal, we can use a constant vector $X_i = 1_n$ of size $n$.\n", | ||
" 1. Example application: predict some characteristic of a chemical compound given its arrangement.\n", | ||
"\n", | ||
"[kipf_weiling]: https://arxiv.org/abs/1609.02907" | ||
] | ||
}, | ||
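{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook focuses on the first regime, **signal classification** on a feature graph. To make the difference between a feature graph and a data (sample) graph concrete, the self-contained sketch below builds both from the same toy data with plain NumPy and SciPy. The Gaussian kernel, the value of $k$ and the symmetrization are illustrative choices only, not necessarily those made by the `graph` module used later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustration only: contrast a feature graph (d x d) with a sample graph (n x n).\n",
"import numpy as np\n",
"import scipy.spatial.distance\n",
"\n",
"n_toy, d_toy = 50, 10\n",
"X_toy = np.random.normal(size=(n_toy, d_toy))\n",
"\n",
"def knn_similarity_graph(Z, k=4):\n",
"    # Dense kNN similarity graph between the rows of Z, with a Gaussian kernel.\n",
"    dist = scipy.spatial.distance.cdist(Z, Z)      # pairwise Euclidean distances\n",
"    sigma = np.mean(np.sort(dist, axis=1)[:, k])   # kernel width from the k-th neighbor\n",
"    W = np.exp(-dist**2 / sigma**2)\n",
"    np.fill_diagonal(W, 0)                         # no self-loops\n",
"    W[W < np.sort(W, axis=1)[:, [-k]]] = 0         # keep the k strongest edges per node\n",
"    return np.maximum(W, W.T)                      # symmetrize\n",
"\n",
"A_features = knn_similarity_graph(X_toy.T)  # d x d graph: signal classification regime\n",
"A_samples = knn_similarity_graph(X_toy)     # n x n graph: node classification regime\n",
"print(A_features.shape, A_samples.shape)"
]
},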
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"#import graph, coarsening, utils\n", | ||
"%run -n models.ipynb # import models\n", | ||
"import numpy as np\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"import shutil\n", | ||
"%matplotlib inline\n", | ||
"\n", | ||
"\n", | ||
"%load_ext autoreload\n", | ||
"%autoreload 1\n", | ||
"%aimport graph\n", | ||
"%aimport coarsening\n", | ||
"%aimport utils" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 1 Data\n", | ||
"\n", | ||
"For the purpose of the demo, let's create a random data matrix $X \\in \\R^{n \\times d_x}$ and somehow infer a label $y_i = f(x_i)$." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"d = 100 # Dimensionality.\n", | ||
"n = 10000 # Number of samples.\n", | ||
"c = 5 # Number of feature communities.\n", | ||
"\n", | ||
"# Data matrix, structured in communities (feature-wise).\n", | ||
"X = np.random.normal(0, 1, (n, d)).astype(np.float32)\n", | ||
"X += np.linspace(0, 1, c).repeat(d // c)\n", | ||
"\n", | ||
"# Noisy non-linear target.\n", | ||
"w = np.random.normal(0, .02, d)\n", | ||
"t = X.dot(w) + np.random.normal(0, .001, n)\n", | ||
"t = np.tanh(t)\n", | ||
"plt.figure(figsize=(15, 5))\n", | ||
"plt.plot(t, '.')\n", | ||
"\n", | ||
"# Classification.\n", | ||
"y = np.ones(t.shape, dtype=np.uint8)\n", | ||
"y[t > t.mean() + 0.4 * t.std()] = 0\n", | ||
"y[t < t.mean() - 0.4 * t.std()] = 2\n", | ||
"print('Class imbalance: ', np.unique(y, return_counts=True)[1])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Then split this dataset into training, validation and testing sets." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"n_train = n // 2\n", | ||
"n_val = n // 10\n", | ||
"\n", | ||
"X_train = X[:n_train]\n", | ||
"X_val = X[n_train:n_train+n_val]\n", | ||
"X_test = X[n_train+n_val:]\n", | ||
"\n", | ||
"y_train = y[:n_train]\n", | ||
"y_val = y[n_train:n_train+n_val]\n", | ||
"y_test = y[n_train+n_val:]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 2 Graph\n", | ||
"\n", | ||
"The second thing we need is a **graph between features**, i.e. an adjacency matrix $A \\in \\mathbb{R}^{d_x \\times d_x}$.\n", | ||
"Structuring data with graphs is very flexible: it can accomodate both structured and unstructured data.\n", | ||
"1. **Structured data**.\n", | ||
" 1. The data is structured by an Euclidean domain, e.g. $x_i$ represents an image, a sound or a video. We can use a classical ConvNet with 1D, 2D or 3D convolutions or a graph ConvNet with a line or grid graph (however losing the orientation).\n", | ||
" 2. The data is structured by a graph, e.g. the data lies on a transportation, energy, brain or social network.\n", | ||
"2. **Unstructured data**. We could use a fully connected network, but the learning and computational complexities are gonna be large. An alternative is to construct a sparse similarity graph between features (or between samples) and use a graph ConvNet, effectively structuring the data and drastically reducing the number of parameters through weight sharing. As for classical ConvNets, the number of parameters are independent of the input size.\n", | ||
"\n", | ||
"There are many ways, supervised or unsupervised, to construct a graph given some data. And better the graph, better the performance ! For this example we'll define the adjacency matrix as a simple similarity measure between features. Below are the choices one has to make when constructing such a graph.\n", | ||
"1. The distance function. We'll use the Euclidean distance $d_{ij} = \\|x_i - x_j\\|_2$.\n", | ||
"2. The kernel. We'll use the Gaussian kernel $a_{ij} = \\exp(d_{ij}^2 / \\sigma^2)$.\n", | ||
"3. The type of graph. We'll use a $k$ nearest neigbors (kNN) graph." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"dist, idx = graph.distance_scipy_spatial(X_train.T, k=10, metric='euclidean')\n", | ||
"A = graph.adjacency(dist, idx).astype(np.float32)\n", | ||
"\n", | ||
"assert A.shape == (d, d)\n", | ||
"print('d = |V| = {}, k|V| < |E| = {}'.format(d, A.nnz))\n", | ||
"plt.spy(A, markersize=2, color='black');" | ||
] | ||
}, | ||
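{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the cell below redoes the same construction from scratch with NumPy and SciPy only, to make the three choices above explicit. Setting $\\sigma^2$ to the squared mean distance to the $k$-th neighbor is an assumption made for this sketch; it is not necessarily identical to what `graph.adjacency` does internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustration only: the kNN feature graph built by hand (the model below uses A, not this).\n",
"import scipy.sparse\n",
"import scipy.spatial.distance\n",
"\n",
"k = 10\n",
"dist_full = scipy.spatial.distance.squareform(\n",
"    scipy.spatial.distance.pdist(X_train.T, metric='euclidean'))  # d x d distances\n",
"d_x = dist_full.shape[0]\n",
"idx_knn = np.argsort(dist_full, axis=1)[:, 1:k+1]                 # k nearest neighbors, self excluded\n",
"dist_knn = dist_full[np.arange(d_x)[:, np.newaxis], idx_knn]\n",
"sigma2 = np.mean(dist_knn[:, -1])**2                              # kernel width (assumption)\n",
"weights = np.exp(-dist_knn**2 / sigma2)                           # a_ij = exp(-d_ij^2 / sigma^2)\n",
"\n",
"rows = np.repeat(np.arange(d_x), k)\n",
"A_manual = scipy.sparse.coo_matrix((weights.ravel(), (rows, idx_knn.ravel())), shape=(d_x, d_x)).tocsr()\n",
"A_manual = A_manual.maximum(A_manual.T)                           # symmetrize\n",
"print('by hand: {} edges, library: {} edges'.format(A_manual.nnz, A.nnz))"
]
},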
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To be able to pool graph signals, we need first to coarsen the graph, i.e. to find which vertices to group together. At the end we'll have multiple graphs, like a pyramid, each at one level of resolution. The finest graph is where the input data lies, the coarsest graph is where the data at the output of the graph convolutional layers lie. That data, of reduced spatial dimensionality, can then be fed to a fully connected layer.\n", | ||
"\n", | ||
"The parameter here is the number of times to coarsen the graph. Each coarsening approximately reduces the size of the graph by a factor two. Thus if you want a pooling of size 4 in the first layer followed by a pooling of size 2 in the second, you'll need to coarsen $\\log_2(4+2) = 3$ times.\n", | ||
"\n", | ||
"After coarsening we rearrange the vertices (and add fake vertices) such that pooling a graph signal is analog to pooling a 1D signal. See the [paper] for details.\n", | ||
"\n", | ||
"[paper]: https://arxiv.org/abs/1606.09375" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"graphs, perm = coarsening.coarsen(A, levels=3, self_connections=False)\n", | ||
"\n", | ||
"X_train = coarsening.perm_data(X_train, perm)\n", | ||
"X_val = coarsening.perm_data(X_val, perm)\n", | ||
"X_test = coarsening.perm_data(X_test, perm)" | ||
] | ||
}, | ||
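{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why this reordering makes graph pooling analogous to 1D pooling, the sketch below pools the permuted training signals by hand with a simple reshape. Max pooling of size 4 is used purely for illustration; the actual pooling is performed inside the model by the block selected with `params['pool']` further down."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustration only: after coarsening.perm_data, vertices that get pooled together are\n",
"# consecutive, so pooling a graph signal reduces to pooling a 1D signal.\n",
"p = 4                                # pooling size of the first layer\n",
"n_vertices = X_train.shape[1]        # includes the added fake vertices\n",
"assert n_vertices % p == 0\n",
"X_pooled = X_train.reshape(-1, n_vertices // p, p).max(axis=2)\n",
"print('{} -> {}'.format(X_train.shape, X_pooled.shape))"
]
},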
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We finally need to compute the graph Laplacian $L$ for each of our graphs (the original and the coarsened versions), defined by their adjacency matrices $A$. The sole parameter here is the type of Laplacian, e.g. the combinatorial Laplacian, the normalized Laplacian or the random walk Laplacian." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"L = [graph.laplacian(A, normalized=True) for A in graphs]\n", | ||
"graph.plot_spectrum(L)" | ||
] | ||
}, | ||
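{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder, the normalized Laplacian is $L = I - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix, and its eigenvalues lie in $[0, 2]$, consistent with the spectrum plotted above. The sketch below builds that matrix by hand; setting the degree term of isolated (fake) vertices to zero is a choice made here for illustration and may differ from what `graph.laplacian` does."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustration only: the normalized Laplacian L = I - D^{-1/2} A D^{-1/2} built by hand.\n",
"import scipy.sparse\n",
"\n",
"A_0 = graphs[0]                                  # adjacency matrix of the finest graph\n",
"degrees = np.asarray(A_0.sum(axis=1)).squeeze()\n",
"with np.errstate(divide='ignore'):\n",
"    d_inv_sqrt = 1.0 / np.sqrt(degrees)\n",
"d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0             # isolated (fake) vertices\n",
"D_inv_sqrt = scipy.sparse.diags(d_inv_sqrt)\n",
"L_manual = scipy.sparse.identity(A_0.shape[0]) - D_inv_sqrt.dot(A_0).dot(D_inv_sqrt)\n",
"print(L_manual.shape)"
]
},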
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 3 Graph ConvNet\n", | ||
"\n", | ||
"Here we apply the graph convolutional neural network to signals lying on graphs. After designing the architecture and setting the hyper-parameters, the model takes as inputs the data matrix $X$, the target $y$ and a list of graph Laplacians $L$, one per coarsening level.\n", | ||
"\n", | ||
"The data, architecture and hyper-parameters are absolutely *not engineered to showcase performance*. Its sole purpose is to illustrate usage and functionality." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Clean up the data generated by TensorFlow.\n", | ||
"shutil.rmtree('summaries/demo', ignore_errors=True)\n", | ||
"shutil.rmtree('checkpoints/demo', ignore_errors=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"params = dict()\n", | ||
"params['dir_name'] = 'demo'\n", | ||
"params['num_epochs'] = 40\n", | ||
"params['batch_size'] = 100\n", | ||
"params['eval_frequency'] = 200\n", | ||
"\n", | ||
"# Building blocks.\n", | ||
"params['filter'] = 'chebyshev5'\n", | ||
"params['brelu'] = 'b1relu'\n", | ||
"params['pool'] = 'apool1'\n", | ||
"\n", | ||
"# Number of classes.\n", | ||
"C = y.max() + 1\n", | ||
"assert C == np.unique(y).size\n", | ||
"\n", | ||
"# Architecture.\n", | ||
"params['F'] = [32, 64] # Number of graph convolutional filters.\n", | ||
"params['K'] = [20, 20] # Polynomial orders.\n", | ||
"params['p'] = [4, 2] # Pooling sizes.\n", | ||
"params['M'] = [512, C] # Output dimensionality of fully connected layers.\n", | ||
"\n", | ||
"# Optimization.\n", | ||
"params['regularization'] = 5e-4\n", | ||
"params['dropout'] = 1\n", | ||
"params['learning_rate'] = 1e-3\n", | ||
"params['decay_rate'] = 0.95\n", | ||
"params['momentum'] = 0.9\n", | ||
"params['decay_steps'] = n_train / params['batch_size']" | ||
] | ||
}, | ||
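{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `'chebyshev5'` filter approximates spectral graph convolutions by a Chebyshev polynomial of order $K$ in the rescaled Laplacian $\\tilde{L} = 2 L / \\lambda_{max} - I$, i.e. $y = \\sum_{k=0}^{K-1} \\theta_k T_k(\\tilde{L}) x$ with $T_0(x) = x$, $T_1(x) = \\tilde{L} x$ and $T_k(x) = 2 \\tilde{L} T_{k-1}(x) - T_{k-2}(x)$ (see the [paper]). The sketch below evaluates that recurrence on a single graph signal with random, untrained coefficients, purely to illustrate what `params['K']` controls; the model learns the coefficients in TensorFlow.\n",
"\n",
"[paper]: https://arxiv.org/abs/1606.09375"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustration only: Chebyshev filtering of one graph signal,\n",
"# y = sum_k theta_k T_k(L_tilde) x, with random (untrained) coefficients.\n",
"import scipy.sparse\n",
"import scipy.sparse.linalg\n",
"\n",
"K_demo = params['K'][0]                 # polynomial order of the first layer\n",
"L_0 = L[0]                              # Laplacian of the finest graph\n",
"lmax = scipy.sparse.linalg.eigsh(L_0, k=1, which='LM', return_eigenvectors=False)[0]\n",
"L_tilde = 2 * L_0 / lmax - scipy.sparse.identity(L_0.shape[0], dtype=L_0.dtype)\n",
"\n",
"x = X_train[0]                          # one graph signal\n",
"theta = np.random.normal(0, 1, K_demo)  # the model learns these, per filter and per layer\n",
"\n",
"T_prev, T_curr = x, L_tilde.dot(x)      # T_0(x) and T_1(x)\n",
"y_filtered = theta[0] * T_prev + theta[1] * T_curr\n",
"for k in range(2, K_demo):\n",
"    T_prev, T_curr = T_curr, 2 * L_tilde.dot(T_curr) - T_prev  # Chebyshev recurrence\n",
"    y_filtered += theta[k] * T_curr\n",
"print(y_filtered.shape)"
]
},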
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"model = cgcnn(L, **params)\n", | ||
"accuracy, loss, t_step = model.fit(X_train, y_train, X_val, y_val)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 4 Evaluation\n", | ||
"\n", | ||
"We often want to monitor:\n", | ||
"1. The convergence, i.e. the training loss and the classification accuracy on the validation set.\n", | ||
"2. The performance, i.e. the classification accuracy on the testing set (to be compared with the training set accuracy to spot overfitting).\n", | ||
"\n", | ||
"The `model_perf` class in [utils.py](utils.py) can be used to compactly evaluate multiple models." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"fig, ax1 = plt.subplots(figsize=(15, 5))\n", | ||
"ax1.plot(accuracy, 'b.-')\n", | ||
"ax1.set_ylabel('validation accuracy', color='b')\n", | ||
"ax2 = ax1.twinx()\n", | ||
"ax2.plot(loss, 'g.-')\n", | ||
"ax2.set_ylabel('training loss', color='g')\n", | ||
"plt.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"print('Time per step: {:.2f} ms'.format(t_step*1000))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"res = model.evaluate(X_test, y_test)\n", | ||
"print(res[0])" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.4.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |