14 changes: 10 additions & 4 deletions python_scripts/ensemble_ex_04.py
@@ -34,10 +34,9 @@
# Write your code here.

# %% [markdown]
# Create a validation curve to assess the impact of the number of trees
# on the generalization performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
# to assess the generalization performance of the model.
# Create a validation curve using the training set to assess the impact of the
# number of trees on the performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.

# %%
# Write your code here.
@@ -60,3 +59,10 @@

# %%
# Write your code here.

# %% [markdown]
# Estimate the generalization performance of this model using the test set
# and `sklearn.metrics.mean_absolute_error`.

# %%
# Write your code here.
66 changes: 43 additions & 23 deletions python_scripts/ensemble_hyperparameters.py
@@ -61,22 +61,29 @@
cv_results[columns].sort_values(by="rank_test_score")

# %% [markdown]
# We can observe that in our grid-search, the largest `max_depth` together
# with the largest `n_estimators` led to the best generalization performance.
#
# We can observe that in our grid-search, the largest `max_depth` together with
# the largest `n_estimators` led, on average, to the best performance on the
# validation sets. Now we will estimate the generalization performance of the
# best model by refitting it with the full training set and using the test set
# for scoring on unseen data. The refit on the full training set is done by
# default (`refit=True`) when calling the `.fit` method of `GridSearchCV`.

# %%
error = -grid_search.score(data_test, target_test)
print(f"On average, our random forest regressor makes an error of {error:.2f} k$")

# %% [markdown]
# ## Gradient-boosting decision trees
#
# For gradient-boosting, parameters are coupled, so we cannot set the
# parameters one after the other anymore. The important parameters are
# `n_estimators`, `max_depth`, and `learning_rate`.
# For gradient-boosting, parameters are coupled, so we cannot set the parameters
# one after the other anymore. The important parameters are `n_estimators`,
# `max_depth`, and `learning_rate`.
#
# Let's first discuss the `max_depth` parameter.
# We saw in the section on gradient-boosting that the algorithm fits the error
# of the previous tree in the ensemble. Thus, fitting fully grown trees will
# be detrimental.
# Indeed, the first tree of the ensemble would perfectly fit (overfit) the data
# and thus no subsequent tree would be required, since there would be no
# residuals.
# Let's first discuss the `max_depth` parameter. We saw in the section on
# gradient-boosting that the algorithm fits the error of the previous tree in
# the ensemble. Thus, fitting fully grown trees will be detrimental. Indeed, the
# first tree of the ensemble would perfectly fit (overfit) the data and thus no
# subsequent tree would be required, since there would be no residuals.
# Therefore, the tree used in gradient-boosting should have a low depth,
# typically between 3 and 8 levels. Having very weak learners at each step
# helps reduce overfitting.
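
# %% [markdown]
# A minimal sketch of this point, assuming the `data_train` and `target_train`
# variables defined earlier in this notebook: a single fully grown tree drives
# the training residuals to (almost) zero, leaving nothing for subsequent trees
# to correct.

# %%
from sklearn.tree import DecisionTreeRegressor

fully_grown_tree = DecisionTreeRegressor(max_depth=None)  # unlimited depth
fully_grown_tree.fit(data_train, target_train)
residuals = target_train - fully_grown_tree.predict(data_train)
print(f"Mean absolute residual on the training set: {abs(residuals).mean():.4f}")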
@@ -85,16 +92,15 @@
# residuals will be corrected and fewer learners are required. Therefore,
# `n_estimators` should be increased if `max_depth` is lower.
#
# Finally, we have overlooked the impact of the `learning_rate` parameter
# until now. When fitting the residuals, we would like the tree
# to try to correct all possible errors or only a fraction of them.
# The learning-rate allows you to control this behaviour.
# A small learning-rate value would only correct the residuals of very few
# samples. If a large learning-rate is set (e.g., 1), we would fit the
# residuals of all samples. So, with a very low learning-rate, we will need
# more estimators to correct the overall error. However, a too large
# learning-rate tends to obtain an overfitted ensemble,
# similar to having a too large tree depth.
# Finally, we have overlooked the impact of the `learning_rate` parameter until
# now. When fitting the residuals, we would like the tree to try to correct all
# possible errors or only a fraction of them. The learning-rate allows you to
# control this behaviour. A small learning-rate value would only correct the
# residuals of very few samples. If a large learning-rate is set (e.g., 1), we
# would fit the residuals of all samples. So, with a very low learning-rate, we
# will need more estimators to correct the overall error. However, too large a
# learning-rate tends to produce an overfitted ensemble, similar to having too
# large a tree depth.
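
# %% [markdown]
# A minimal illustration of this trade-off, assuming the `data_train` and
# `target_train` variables from earlier in the notebook: with a fixed, small
# number of trees, a very low learning-rate leaves a much larger training error
# than a learning-rate of 1.

# %%
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

for lr in (0.05, 1.0):
    model = GradientBoostingRegressor(
        n_estimators=50, max_depth=3, learning_rate=lr
    )
    model.fit(data_train, target_train)
    train_error = mean_absolute_error(target_train, model.predict(data_train))
    print(f"learning_rate={lr}: training error = {train_error:.2f} k$")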

# %%
from sklearn.ensemble import GradientBoostingRegressor
@@ -121,3 +127,17 @@
# Here, we tune the `n_estimators` but be aware that using early-stopping as
# in the previous exercise will be better.
# ```
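
# %% [markdown]
# A minimal sketch of such early-stopping, assuming the same `data_train` and
# `target_train` as above: `n_iter_no_change` stops adding trees once the score
# on an internal validation split stops improving.

# %%
from sklearn.ensemble import GradientBoostingRegressor

early_stopping_gbdt = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    validation_fraction=0.1,  # share of the training data used for monitoring
)
early_stopping_gbdt.fit(data_train, target_train)
# the fitted attribute reports how many trees were actually added
print(f"Number of trees used: {early_stopping_gbdt.n_estimators_}")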

# %% [markdown]
# Now we estimate the generalization performance of the best model
# using the test set.

# %%
error = -grid_search.score(data_test, target_test)
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")

# %% [markdown]
# The score measured on the held-out test set is slightly better than the mean
# cross-validated score of the best model. The reason is that the final model is
# refitted on the whole training set, and therefore on more data than the inner
# cross-validated models of the grid search procedure.
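
# %% [markdown]
# One way to see this, reusing the fitted `grid_search` and the test data from
# above, and assuming the grid search was created with
# `scoring="neg_mean_absolute_error"`: compare the mean inner cross-validated
# error of the best candidate with the error on the held-out test set.

# %%
cv_error = -grid_search.best_score_  # mean error over the inner CV splits
test_error = -grid_search.score(data_test, target_test)  # refitted model
print(f"Inner CV error: {cv_error:.2f} k$ vs. test error: {test_error:.2f} k$")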
42 changes: 33 additions & 9 deletions python_scripts/ensemble_sol_04.py
@@ -36,17 +36,16 @@
gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.5)

# %% [markdown]
# Create a validation curve to assess the impact of the number of trees
# on the generalization performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
# to assess the generalization performance of the model.
# Create a validation curve using the training set to assess the impact of the
# number of trees on the performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.

# %%
# solution
from sklearn.model_selection import validation_curve

param_range = [1, 2, 5, 10, 20, 50, 100]
gbdt_train_scores, gbdt_test_scores = validation_curve(
gbdt_train_scores, gbdt_validation_scores = validation_curve(
gbdt,
data_train,
target_train,
@@ -55,7 +54,7 @@
scoring="neg_mean_absolute_error",
n_jobs=2,
)
gbdt_train_errors, gbdt_test_errors = -gbdt_train_scores, -gbdt_test_scores
gbdt_train_errors, gbdt_validation_errors = -gbdt_train_scores, -gbdt_validation_scores

# %% tags=["solution"]
import matplotlib.pyplot as plt
@@ -68,8 +67,8 @@
)
plt.errorbar(
param_range,
gbdt_test_errors.mean(axis=1),
yerr=gbdt_test_errors.std(axis=1),
gbdt_validation_errors.mean(axis=1),
yerr=gbdt_validation_errors.std(axis=1),
label="Cross-validation",
)

@@ -103,4 +102,29 @@
# %% [markdown] tags=["solution"]
# We see that the number of trees used is far below 1000 with the current
# dataset. Training the GBDT with the entire 1000 trees would have been
# useless.

# %% [markdown]
# Estimate the generalization performance of this model again with the
# `sklearn.metrics.mean_absolute_error` metric, but this time using the test
# set that we held out at the beginning of the notebook. Compare the resulting
# value with the values observed in the validation curve.

# %%
# solution
from sklearn.metrics import mean_absolute_error

error = mean_absolute_error(target_test, gbdt.predict(data_test))
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")

# %% [markdown] tags=["solution"]
# We observe that the MAE measured on the held-out test set is close to the
# validation error at the right-hand side of the validation curve. This is
# reassuring, as it means that both the cross-validation procedure and the outer
# train-test split roughly agree as approximations of the true generalization
# performance of the model. We can also observe that the final evaluation of the
# test error seems to be even slightly below the cross-validated test scores.
# This can be explained by the fact that the final model has been trained on the
# full training set while the cross-validation models have been trained on
# smaller subsets: in general, the larger the number of training points, the
# lower the test error.
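
# %% [markdown] tags=["solution"]
# A minimal sketch of this comparison, reusing the arrays computed for the
# validation curve above: the largest value of `param_range` is the closest
# setting to the final model.

# %%
cv_error_most_trees = gbdt_validation_errors.mean(axis=1)[-1]
print(
    f"Cross-validated error with {param_range[-1]} trees: "
    f"{cv_error_most_trees:.2f} k$ vs. held-out test error: {error:.2f} k$"
)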
1 change: 0 additions & 1 deletion python_scripts/linear_models_ex_05.py
@@ -19,7 +19,6 @@

# %%
import pandas as pd
from sklearn.model_selection import train_test_split

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
4 changes: 2 additions & 2 deletions python_scripts/linear_models_sol_05.py
@@ -18,7 +18,6 @@

# %%
import pandas as pd
from sklearn.model_selection import train_test_split

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
@@ -67,6 +66,7 @@
for C in Cs:
logistic_regression.set_params(logisticregression__C=C)
logistic_regression.fit(data_train, target_train)
accuracy = logistic_regression.score(data_test, target_test)

DecisionBoundaryDisplay.from_estimator(
logistic_regression,
@@ -78,7 +78,7 @@
sns.scatterplot(
data=penguins_test, x=culmen_columns[0], y=culmen_columns[1],
hue=target_column, palette=["tab:red", "tab:blue"])
plt.title(f"C: {C}")
plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
2 changes: 2 additions & 0 deletions python_scripts/logistic_regression.py
@@ -81,6 +81,8 @@
StandardScaler(), LogisticRegression(penalty="none")
)
logistic_regression.fit(data_train, target_train)
accuracy = logistic_regression.score(data_test, target_test)
print(f"Accuracy on test set: {accuracy:.3f}")

# %% [markdown]
#