|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "**[Machine Learning Micro-Course Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**\n", |
| 8 | + "\n", |
| 9 | + "---\n" |
| 10 | + ] |
| 11 | + }, |
| 12 | + { |
| 13 | + "cell_type": "markdown", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "## Recap\n", |
| 17 | + "You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run the cell below to set up your coding environment where the previous step left off." |
| 18 | + ] |
| 19 | + }, |
| 20 | + { |
| 21 | + "cell_type": "code", |
| 22 | + "execution_count": 1, |
| 23 | + "metadata": {}, |
| 24 | + "outputs": [ |
| 25 | + { |
| 26 | + "name": "stdout", |
| 27 | + "output_type": "stream", |
| 28 | + "text": [ |
| 29 | + "Validation MAE: 29,653\n", |
| 30 | + "\n", |
| 31 | + "Setup complete\n" |
| 32 | + ] |
| 33 | + } |
| 34 | + ], |
| 35 | + "source": [ |
| 36 | + "# Code you have previously used to load data\n", |
| 37 | + "import pandas as pd\n", |
| 38 | + "from sklearn.metrics import mean_absolute_error\n", |
| 39 | + "from sklearn.model_selection import train_test_split\n", |
| 40 | + "from sklearn.tree import DecisionTreeRegressor\n", |
| 41 | + "\n", |
| 42 | + "\n", |
| 43 | + "# Path of the file to read\n", |
| 44 | + "iowa_file_path = '../input/home-data-for-ml-course/train.csv'\n", |
| 45 | + "\n", |
| 46 | + "home_data = pd.read_csv(iowa_file_path)\n", |
| 47 | + "# Create target object and call it y\n", |
| 48 | + "y = home_data.SalePrice\n", |
| 49 | + "# Create X\n", |
| 50 | + "features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n", |
| 51 | + "X = home_data[features]\n", |
| 52 | + "\n", |
| 53 | + "# Split into validation and training data\n", |
| 54 | + "train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n", |
| 55 | + "\n", |
| 56 | + "# Specify Model\n", |
| 57 | + "iowa_model = DecisionTreeRegressor(random_state=1)\n", |
| 58 | + "# Fit Model\n", |
| 59 | + "iowa_model.fit(train_X, train_y)\n", |
| 60 | + "\n", |
| 61 | + "# Make validation predictions and calculate mean absolute error\n", |
| 62 | + "val_predictions = iowa_model.predict(val_X)\n", |
| 63 | + "val_mae = mean_absolute_error(val_y, val_predictions)\n", |
| 64 | + "print(\"Validation MAE: {:,.0f}\".format(val_mae))\n", |
| 65 | + "\n", |
| 66 | + "# Set up code checking\n", |
| 67 | + "from learntools.core import binder\n", |
| 68 | + "binder.bind(globals())\n", |
| 69 | + "from learntools.machine_learning.ex5 import *\n", |
| 70 | + "print(\"\\nSetup complete\")" |
| 71 | + ] |
| 72 | + }, |
| 73 | + { |
| 74 | + "cell_type": "markdown", |
| 75 | + "metadata": {}, |
| 76 | + "source": [ |
| 77 | + "# Exercises\n", |
| 78 | + "You could write the function `get_mae` yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below." |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "code", |
| 83 | + "execution_count": 2, |
| 84 | + "metadata": {}, |
| 85 | + "outputs": [], |
| 86 | + "source": [ |
| 87 | + "def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):\n", |
| 88 | + "    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)\n", |
| 89 | + "    model.fit(train_X, train_y)\n", |
| 90 | + "    preds_val = model.predict(val_X)\n", |
| 91 | + "    mae = mean_absolute_error(val_y, preds_val)\n", |
| 92 | + "    return mae" |
| 93 | + ] |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "markdown", |
| 97 | + "metadata": {}, |
| 98 | + "source": [ |
| 99 | + "## Step 1: Compare Different Tree Sizes\n", |
| 100 | + "Write a loop that tries each value of *max_leaf_nodes* from a set of candidate values.\n", |
| 101 | + "\n", |
| 102 | + "Call the *get_mae* function on each value of *max_leaf_nodes*. Store the results in a way that lets you select the value of *max_leaf_nodes* that gives the most accurate model on your data." |
| 103 | + ] |
| 104 | + }, |
| 105 | + { |
| 106 | + "cell_type": "code", |
| 107 | + "execution_count": 3, |
| 108 | + "metadata": {}, |
| 109 | + "outputs": [ |
| 110 | + { |
| 111 | + "data": { |
| 112 | + "application/javascript": [ |
| 113 | + "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 1, \"learnTutorialId\": 120, \"questionId\": \"1_BestTreeSize\", \"learnToolsVersion\": \"0.2.13\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" |
| 114 | + ], |
| 115 | + "text/plain": [ |
| 116 | + "<IPython.core.display.Javascript object>" |
| 117 | + ] |
| 118 | + }, |
| 119 | + "metadata": {}, |
| 120 | + "output_type": "display_data" |
| 121 | + }, |
| 122 | + { |
| 123 | + "data": { |
| 124 | + "text/markdown": [ |
| 125 | + "<span style=\"color:#33cc33\">Correct</span>" |
| 126 | + ], |
| 127 | + "text/plain": [ |
| 128 | + "Correct" |
| 129 | + ] |
| 130 | + }, |
| 131 | + "metadata": {}, |
| 132 | + "output_type": "display_data" |
| 133 | + } |
| 134 | + ], |
| 135 | + "source": [ |
| 136 | + "candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]\n", |
| 137 | + "# Write loop to find the ideal tree size from candidate_max_leaf_nodes\n", |
| 138 | + "value, mae = candidate_max_leaf_nodes[0], float('inf')\n", |
| 139 | + "for mln in candidate_max_leaf_nodes:\n", |
| 140 | + "    temp = get_mae(mln, train_X, val_X, train_y, val_y)\n", |
| 141 | + "    if temp < mae:\n", |
| 142 | + "        value, mae = mln, temp\n", |
| 143 | + "\n", |
| 144 | + "# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)\n", |
| 145 | + "best_tree_size = value\n", |
| 146 | + "\n", |
| 147 | + "step_1.check()" |
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "code", |
| 152 | + "execution_count": 4, |
| 153 | + "metadata": {}, |
| 154 | + "outputs": [], |
| 155 | + "source": [ |
| 156 | + "# The lines below will show you a hint or the solution.\n", |
| 157 | + "# step_1.hint() \n", |
| 158 | + "# step_1.solution()" |
| 159 | + ] |
| 160 | + }, |
| 161 | + { |
| 162 | + "cell_type": "markdown", |
| 163 | + "metadata": {}, |
| 164 | + "source": [ |
| 165 | + "## Step 2: Fit Model Using All Data\n", |
| 166 | + "You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions." |
| 167 | + ] |
| 168 | + }, |
| 169 | + { |
| 170 | + "cell_type": "code", |
| 171 | + "execution_count": 5, |
| 172 | + "metadata": {}, |
| 173 | + "outputs": [ |
| 174 | + { |
| 175 | + "data": { |
| 176 | + "application/javascript": [ |
| 177 | + "parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 0.5, \"interactionType\": 1, \"questionType\": 2, \"learnTutorialId\": 120, \"questionId\": \"2_FitModelWithAllData\", \"learnToolsVersion\": \"0.2.13\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")" |
| 178 | + ], |
| 179 | + "text/plain": [ |
| 180 | + "<IPython.core.display.Javascript object>" |
| 181 | + ] |
| 182 | + }, |
| 183 | + "metadata": {}, |
| 184 | + "output_type": "display_data" |
| 185 | + }, |
| 186 | + { |
| 187 | + "data": { |
| 188 | + "text/markdown": [ |
| 189 | + "<span style=\"color:#33cc33\">Correct</span>" |
| 190 | + ], |
| 191 | + "text/plain": [ |
| 192 | + "Correct" |
| 193 | + ] |
| 194 | + }, |
| 195 | + "metadata": {}, |
| 196 | + "output_type": "display_data" |
| 197 | + } |
| 198 | + ], |
| 199 | + "source": [ |
| 200 | + "# Build the final model with the best tree size found in Step 1\n", |
| 201 | + "final_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size, random_state=0)\n", |
| 202 | + "\n", |
| 203 | + "# Fit the final model on all of the data (X, y), not just the training split\n", |
| 204 | + "final_model.fit(X, y)\n", |
| 205 | + "step_2.check()" |
| 206 | + ] |
| 207 | + }, |
| 208 | + { |
| 209 | + "cell_type": "code", |
| 210 | + "execution_count": 6, |
| 211 | + "metadata": {}, |
| 212 | + "outputs": [], |
| 213 | + "source": [ |
| 214 | + "# step_2.hint()\n", |
| 215 | + "# step_2.solution()" |
| 216 | + ] |
| 217 | + }, |
| 218 | + { |
| 219 | + "cell_type": "markdown", |
| 220 | + "metadata": {}, |
| 221 | + "source": [ |
| 222 | + "You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.\n", |
| 223 | + "\n", |
| 224 | + "# Keep Going\n", |
| 225 | + "\n", |
| 226 | + "You are ready for **[Random Forests](https://www.kaggle.com/dansbecker/random-forests).**\n" |
| 227 | + ] |
| 228 | + }, |
| 229 | + { |
| 230 | + "cell_type": "markdown", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "---\n", |
| 234 | + "**[Machine Learning Micro-Course Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**\n", |
| 235 | + "\n" |
| 236 | + ] |
| 237 | + }, |
| 238 | + { |
| 239 | + "cell_type": "code", |
| 240 | + "execution_count": 7, |
| 241 | + "metadata": {}, |
| 242 | + "outputs": [], |
| 243 | + "source": [] |
| 244 | + } |
| 245 | + ], |
| 246 | + "metadata": { |
| 247 | + "kernelspec": { |
| 248 | + "display_name": "Python 3", |
| 249 | + "language": "python", |
| 250 | + "name": "python3" |
| 251 | + }, |
| 252 | + "language_info": { |
| 253 | + "codemirror_mode": { |
| 254 | + "name": "ipython", |
| 255 | + "version": 3 |
| 256 | + }, |
| 257 | + "file_extension": ".py", |
| 258 | + "mimetype": "text/x-python", |
| 259 | + "name": "python", |
| 260 | + "nbconvert_exporter": "python", |
| 261 | + "pygments_lexer": "ipython3", |
| 262 | + "version": "3.6.5" |
| 263 | + } |
| 264 | + }, |
| 265 | + "nbformat": 4, |
| 266 | + "nbformat_minor": 1 |
| 267 | +} |
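
A note on Step 1: the loop over `candidate_max_leaf_nodes` can also be written as a dictionary of scores plus `min`. The sketch below is one possible shape, not the course's official solution. `pick_best_size` and `toy_mae` are hypothetical names introduced here, and `toy_mae` is only a stand-in scoring function so the snippet runs without the Iowa data.

```python
def pick_best_size(candidates, score_fn):
    """Score every candidate and return (best_candidate, scores).

    The best candidate is the one with the lowest score (e.g. lowest MAE).
    """
    scores = {size: score_fn(size) for size in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical stand-in for get_mae; it is NOT a real model score.
def toy_mae(size):
    return abs(size - 60) + 10


best_size, all_scores = pick_best_size([5, 25, 50, 100, 250, 500], toy_mae)
print(best_size)
```

In the notebook itself, the same helper could presumably be called with `lambda n: get_mae(n, train_X, val_X, train_y, val_y)` in place of `toy_mae`. Keeping every score in a dictionary also makes it easy to inspect how the error changes across tree sizes.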