|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Feature extraction\n", |
| 8 | + "## Detecting cancer from histopathological images\n", |
| 9 | + "In this tutorial we will apply feature extractors to detect cancer in histopathological images of breast tissue. We will use selected images from the PatchCamelyon dataset https://github.com/basveeling/pcam.\n", |
| 10 | + "<img src=\"pictures/pcam.jpg\" style=\"max-width:100%; width: 100%; max-width: none\">\n", |
| 11 | + "\n", |
| 12 | + "### Load the dataset\n", |
| 13 | + "\n", |
| 14 | + "Run the code below to load the dataset from the file `histological_data.npz`.\n", |
| 15 | + "\n", |
| 16 | + "*Note: Download the dataset from* https://gin.g-node.org/MachineLearningBiomedApplications/data *and place it in the folder `datasets`*" |
| 17 | + ] |
| 18 | + }, |
| 19 | + { |
| 20 | + "cell_type": "code", |
| 21 | + "execution_count": null, |
| 22 | + "metadata": {}, |
| 23 | + "outputs": [], |
| 24 | + "source": [ |
| 25 | + "import numpy as np\n", |
| 26 | + "\n", |
| 27 | + "# Load dataset from .npz file\n", |
| 28 | + "data = np.load('datasets/histological_data.npz')\n", |
| 29 | + "\n", |
| 30 | + "# Train images and labels\n", |
| 31 | + "X_train = data['X_train']\n", |
| 32 | + "y_train = data['y_train'].astype('int')\n", |
| 33 | + "\n", |
| 34 | + "# Test images and labels\n", |
| 35 | + "X_test = data['X_test']\n", |
| 36 | + "y_test = data['y_test'].astype('int')\n", |
| 37 | + "\n", |
| 38 | + "# Print shapes here\n", |
| 39 | + "print('Training data - images:', X_train.shape)\n", |
| 40 | + "print('Training data - labels:', y_train.shape)\n", |
| 41 | + "print('Test data - images:', X_test.shape)\n", |
| 42 | + "print('Test data - labels:', y_test.shape)\n", |
| 43 | + "print('Labels: ', np.unique(y_test))" |
| 44 | + ] |
| 45 | + }, |
| 46 | + { |
| 47 | + "cell_type": "markdown", |
| 48 | + "metadata": {}, |
| 49 | + "source": [ |
| 50 | + "**Activity 1:** Answer the following questions:\n", |
| 51 | + "* How many training samples do we have?\n", |
| 52 | + "* How many test samples do we have?\n", |
| 53 | + "* What is the dimension of each sample image?\n", |
| 54 | + "* How many labels do we have?\n", |
| 55 | + "\n", |
| 56 | + "**Answer:** " |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "Let's now plot a few example histopathological images. Note that label 1 means presence of cancerous cells." |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": null, |
| 69 | + "metadata": {}, |
| 70 | + "outputs": [], |
| 71 | + "source": [ |
| 72 | + "import matplotlib.pyplot as plt\n", |
| 73 | + "\n", |
| 74 | + "id_images = [4, 5, 6, 7]\n", |
| 75 | + "\n", |
| 76 | + "plt.figure(figsize=(15, 8))\n", |
| 77 | + "for i in np.arange(0, 4):\n", |
| 78 | + "    plt.subplot(1, 4, i+1)\n", |
| 79 | + "    plt.imshow(X_train[id_images[i], :, :], cmap='gray')\n", |
| 80 | + "    plt.title('label: ' + str(y_train[id_images[i]]))" |
| 81 | + ] |
| 82 | + }, |
| 83 | + { |
| 84 | + "cell_type": "markdown", |
| 85 | + "metadata": {}, |
| 86 | + "source": [ |
| 87 | + "# Cancer detection using texture descriptors\n", |
| 88 | + "\n", |
| 89 | + "We will now calculate the texture descriptors using the **Grey-level co-occurrence matrix (GLCM)**. The matrix can be calculated using the `skimage` function `greycomatrix` (renamed `graycomatrix` from scikit-image 0.19 onwards).\n", |
| 90 | + "\n", |
| 91 | + "We will select one healthy and one cancerous sample image. The GLCM for the healthy sample has been generated and plotted for you. \n", |
| 92 | + "\n", |
| 93 | + "**Activity 2:** Do the same for the cancerous sample. Do the matrices look different? Can you think of a reason why?\n", |
| 94 | + "\n", |
| 95 | + "**Answer:** " |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "code", |
| 100 | + "execution_count": null, |
| 101 | + "metadata": {}, |
| 102 | + "outputs": [], |
| 103 | + "source": [ |
| 104 | + "# example images\n", |
| 105 | + "healthy = X_train[7, :, :] \n", |
| 106 | + "cancer = X_train[5, :, :] \n", |
| 107 | + "\n", |
| 108 | + "# calculate and plot GLCM\n", |
| 109 | + "from skimage.feature import greycomatrix\n", |
| 110 | + "\n", |
| 111 | + "plt.figure(figsize=(10,4))\n", |
| 112 | + "\n", |
| 113 | + "plt.subplot(121)\n", |
| 114 | + "glcm_healthy = greycomatrix(np.round(healthy*63).astype('uint8'), [3], [0],64)\n", |
| 115 | + "plt.imshow(glcm_healthy.reshape(64,64), cmap='gray')\n", |
| 116 | + "plt.title('GLCM healthy')\n", |
| 117 | + "\n", |
| 118 | + "plt.subplot(122)\n", |
| 119 | + "glcm_cancer = None\n", |
| 120 | + "\n", |
| 121 | + "_=plt.title('GLCM cancer')" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "metadata": {}, |
| 127 | + "source": [ |
| 128 | + "Now we can calculate some statistical properties from the GLCM. We can do that using the `skimage` function `greycoprops`. Print out different statistical measures for the healthy and cancerous tissue:\n", |
| 129 | + "* `'contrast'`\n", |
| 130 | + "* `'dissimilarity'`\n", |
| 131 | + "* `'homogeneity'`\n", |
| 132 | + "* `'energy'`\n", |
| 133 | + "* `'correlation'`\n", |
| 134 | + "\n", |
| 135 | + "**Activity 3:** Complete the code below to generate all five measures for both healthy and cancerous samples." |
| 136 | + ] |
| 137 | + }, |
| 138 | + { |
| 139 | + "cell_type": "code", |
| 140 | + "execution_count": null, |
| 141 | + "metadata": {}, |
| 142 | + "outputs": [], |
| 143 | + "source": [ |
| 144 | + "from skimage.feature import greycoprops\n", |
| 145 | + "properties = ['contrast', 'dissimilarity']\n", |
| 146 | + "\n", |
| 147 | + "for p in properties:\n", |
| 148 | + "    print(p+': ')\n", |
| 149 | + "    print(' healthy: ', np.round(greycoprops(glcm_healthy, p)[0,0],2))\n", |
| 150 | + "    print(' cancer: ', None)\n" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "markdown", |
| 155 | + "metadata": {}, |
| 156 | + "source": [ |
| 157 | + "## Exercise 1\n", |
| 158 | + "\n", |
| 159 | + "In this exercise you will train a logistic regression classifier to detect cancer using GLCM features. Complete the code below as follows:\n", |
| 160 | + "* Extract two GLCM features of your choice. To do that, complete the function `getGLCMfeatures`. Feature extraction code is given.\n", |
| 161 | + "* Fit the logistic regression model to the training data and calculate training performance using function `PerformanceMeasures`.\n", |
| 162 | + "* Evaluate performance on the test data using function `PerformanceMeasures`.\n", |
| 163 | + "* Amend the features extracted in function `getGLCMfeatures` to improve the performance of the model." |
| 164 | + ] |
| 165 | + }, |
| 166 | + { |
| 167 | + "cell_type": "code", |
| 168 | + "execution_count": null, |
| 169 | + "metadata": {}, |
| 170 | + "outputs": [], |
| 171 | + "source": [ |
| 172 | + "from sklearn.preprocessing import StandardScaler\n", |
| 173 | + "from sklearn.metrics import recall_score\n", |
| 174 | + "from sklearn.linear_model import LogisticRegression\n", |
| 175 | + "\n", |
| 176 | + "def getGLCMfeatures(im):\n", |
| 177 | + "    im = np.round(im*63).astype('uint8')\n", |
| 178 | + "    glcm = greycomatrix(im, [3], [0], 64)\n", |
| 179 | + "    feature1 = greycoprops(glcm, None)[0, 0]\n", |
| 180 | + "    feature2 = None\n", |
| 181 | + "    return feature1, feature2\n", |
| 182 | + "\n", |
| 183 | + "def PerformanceMeasures(model,X,y): \n", |
| 184 | + "\n", |
| 185 | + "    accuracy = model.score(X,y)\n", |
| 186 | + "    y_pred = model.predict(X)\n", |
| 187 | + "    sensitivity = recall_score(y,y_pred)\n", |
| 188 | + "    specificity = recall_score(y,y_pred,pos_label=0)\n", |
| 189 | + "\n", |
| 190 | + "    print('Accuracy: ', round(accuracy,2))\n", |
| 191 | + "    print('Sensitivity: ', round(sensitivity,2))\n", |
| 192 | + "    print('Specificity: ', round(specificity,2))\n", |
| 193 | + "\n", |
| 194 | + "# feature extraction\n", |
| 195 | + "X_train_features = []\n", |
| 196 | + "for im in X_train:\n", |
| 197 | + "    X_train_features.append(getGLCMfeatures(im))\n", |
| 198 | + "X_train_features = np.asarray(X_train_features)\n", |
| 199 | + "scaler = StandardScaler()\n", |
| 200 | + "X_train_features = scaler.fit_transform(X_train_features)\n", |
| 201 | + "\n", |
| 202 | + "# fit model\n", |
| 203 | + "model = None\n", |
| 204 | + "\n", |
| 205 | + "print('Training performance:')\n", |
| 206 | + "\n", |
| 207 | + "\n", |
| 208 | + "# test\n", |
| 209 | + "X_test_features = []\n", |
| 210 | + "for im in X_test:\n", |
| 211 | + "    X_test_features.append(getGLCMfeatures(im))\n", |
| 212 | + "X_test_features = np.asarray(X_test_features)\n", |
| 213 | + "X_test_features = scaler.transform(X_test_features)\n", |
| 214 | + "\n", |
| 215 | + "print('Test performance:')\n" |
| 216 | + ] |
| 217 | + }, |
| 218 | + { |
| 219 | + "cell_type": "markdown", |
| 220 | + "metadata": {}, |
| 221 | + "source": [ |
| 222 | + "# Cancer detection using localised feature descriptors\n", |
| 223 | + "\n", |
| 224 | + "Now we will try to train a classifier using the DAISY descriptor instead. First, let's extract the DAISY features from the histological images. \n", |
| 225 | + "\n", |
| 226 | + "\n", |
| 227 | + "In the lectures we have seen a number of feature extractors that are available in `skimage`, including `daisy`. \n", |
| 228 | + "\n", |
| 229 | + "**Activity 4:** Run the code below to perform feature extraction using the `skimage` function `daisy` and visualise your extracted features. \n", |
| 230 | + "* Change the parameters `step` and `radius` to see how the daisy extractor changes.\n", |
| 231 | + "* Set `step` to 60 and `radius` to 30. Then try to change the other parameters of the DAISY descriptor." |
| 232 | + ] |
| 233 | + }, |
| 234 | + { |
| 235 | + "cell_type": "code", |
| 236 | + "execution_count": null, |
| 237 | + "metadata": {}, |
| 238 | + "outputs": [], |
| 239 | + "source": [ |
| 240 | + "from skimage.feature import daisy\n", |
| 241 | + "\n", |
| 242 | + "# example feature extraction using daisy\n", |
| 243 | + "features_daisy, visualisation_daisy = daisy(healthy, step=50, radius=20, rings=2, histograms=8, orientations=8, visualize=True)\n", |
| 244 | + "plt.imshow(visualisation_daisy)\n", |
| 245 | + "plt.title('Daisy')\n", |
| 246 | + "# Extracted features\n", |
| 247 | + "print('Feature vector shape daisy: ', features_daisy.shape)" |
| 248 | + ] |
| 249 | + }, |
| 250 | + { |
| 251 | + "cell_type": "markdown", |
| 252 | + "metadata": {}, |
| 253 | + "source": [ |
| 254 | + "## Exercise 2 (optional)\n", |
| 255 | + "\n", |
| 256 | + "Train a classifier to detect cancer in histological images using features extracted by the DAISY descriptor.\n", |
| 257 | + "* Complete the function `daisy_feature_extractor`. *Hint: Flatten the features after extraction.*\n", |
| 258 | + "* Run the code below to extract the daisy features for training and test sets. This may take a while to run." |
| 259 | + ] |
| 260 | + }, |
| 261 | + { |
| 262 | + "cell_type": "code", |
| 263 | + "execution_count": null, |
| 264 | + "metadata": {}, |
| 265 | + "outputs": [], |
| 266 | + "source": [ |
| 267 | + "# Feature extractor\n", |
| 268 | + "def daisy_feature_extractor(image): \n", |
| 269 | + "    return None\n", |
| 270 | + "\n", |
| 271 | + "# Perform feature extraction for both training and test set\n", |
| 272 | + "\n", |
| 273 | + "X_train_features = []\n", |
| 274 | + "X_test_features = []\n", |
| 275 | + "\n", |
| 276 | + "# Go through all the images, perform feature extraction and then append them to the list\n", |
| 277 | + "for img in X_train:\n", |
| 278 | + "    X_train_features.append(daisy_feature_extractor(img))\n", |
| 279 | + "for img in X_test:\n", |
| 280 | + "    X_test_features.append(daisy_feature_extractor(img))\n", |
| 281 | + " \n", |
| 282 | + "# Make the lists back into numpy arrays\n", |
| 283 | + "X_train_features = np.asarray(X_train_features)\n", |
| 284 | + "X_test_features = np.asarray(X_test_features)\n", |
| 285 | + "\n", |
| 286 | + "# Print dimensions\n", |
| 287 | + "print('Feature matrix train: ', X_train_features.shape)\n", |
| 288 | + "print('Feature matrix test: ', X_test_features.shape)" |
| 289 | + ] |
| 290 | + }, |
| 291 | + { |
| 292 | + "cell_type": "markdown", |
| 293 | + "metadata": {}, |
| 294 | + "source": [ |
| 295 | + "* Train a random forest classifier to detect cancer\n", |
| 296 | + "* Evaluate training and test performance" |
| 297 | + ] |
| 298 | + }, |
| 299 | + { |
| 300 | + "cell_type": "code", |
| 301 | + "execution_count": null, |
| 302 | + "metadata": {}, |
| 303 | + "outputs": [], |
| 304 | + "source": [ |
| 305 | + "from sklearn.ensemble import RandomForestClassifier\n", |
| 306 | + "model = RandomForestClassifier(min_samples_leaf = 50) \n", |
| 307 | + "\n", |
| 308 | + "\n", |
| 309 | + "print('Training performance:')\n", |
| 310 | + "\n", |
| 311 | + "\n", |
| 312 | + "print('Test performance:')\n", |
| 313 | + "\n" |
| 314 | + ] |
| 315 | + }, |
| 316 | + { |
| 317 | + "cell_type": "markdown", |
| 318 | + "metadata": {}, |
| 319 | + "source": [ |
| 320 | + "* Compare the performance to that of the GLCM features" |
| 321 | + ] |
| 322 | + }, |
| 323 | + { |
| 324 | + "cell_type": "markdown", |
| 325 | + "metadata": {}, |
| 326 | + "source": [ |
| 327 | + "**Answer:** " |
| 328 | + ] |
| 329 | + } |
| 330 | + ], |
| 331 | + "metadata": { |
| 332 | + "kernelspec": { |
| 333 | + "display_name": "Python 3", |
| 334 | + "language": "python", |
| 335 | + "name": "python3" |
| 336 | + }, |
| 337 | + "language_info": { |
| 338 | + "codemirror_mode": { |
| 339 | + "name": "ipython", |
| 340 | + "version": 3 |
| 341 | + }, |
| 342 | + "file_extension": ".py", |
| 343 | + "mimetype": "text/x-python", |
| 344 | + "name": "python", |
| 345 | + "nbconvert_exporter": "python", |
| 346 | + "pygments_lexer": "ipython3", |
| 347 | + "version": "3.8.3" |
| 348 | + } |
| 349 | + }, |
| 350 | + "nbformat": 4, |
| 351 | + "nbformat_minor": 4 |
| 352 | +} |