
Commit 196f7c9

glemaitre and ogrisel committed
[ci skip] Scoring model as a good practice (#464)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
2fc9dd3
1 parent 57df756 commit 196f7c9

20 files changed (+344, -221 lines)

_images/ensemble_sol_04_7_0.png

Binary image file changed.

_sources/python_scripts/ensemble_ex_04.py

Lines changed: 10 additions & 4 deletions
@@ -34,10 +34,9 @@
 # Write your code here.
 
 # %% [markdown]
-# Create a validation curve to assess the impact of the number of trees
-# on the generalization performance of the model. Evaluate the list of parameters
-# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
-# to assess the generalization performance of the model.
+# Create a validation curve using the training set to assess the impact of the
+# number of trees on the performance of the model. Evaluate the list of parameters
+# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.
 
 # %%
 # Write your code here.
@@ -60,3 +59,10 @@
 
 # %%
 # Write your code here.
+
+# %% [markdown]
+# Estimate the generalization performance of this model using the test set
+# and `sklearn.metrics.mean_absolute_error`.
+
+# %%
+# Write your code here.
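
For orientation, the solution file diffed further below fills in these placeholders. A condensed sketch, assuming `gbdt` and the train/test splits are defined earlier in the notebook (the `param_name` argument is an assumption, since the diff does not show it):

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import validation_curve

# Validation curve computed on the training set only; scikit-learn's "neg_*"
# scorers return negated errors, so flip the sign to recover errors.
param_range = [1, 2, 5, 10, 20, 50, 100]
train_scores, validation_scores = validation_curve(
    gbdt,
    data_train,
    target_train,
    param_name="n_estimators",  # assumed: the parameter being varied
    param_range=param_range,
    scoring="neg_mean_absolute_error",
    n_jobs=2,
)
train_errors, validation_errors = -train_scores, -validation_scores

# Final evaluation on the held-out test set.
gbdt.fit(data_train, target_train)
error = mean_absolute_error(target_test, gbdt.predict(data_test))
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")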

_sources/python_scripts/ensemble_hyperparameters.py

Lines changed: 43 additions & 23 deletions
@@ -61,22 +61,29 @@
 cv_results[columns].sort_values(by="rank_test_score")
 
 # %% [markdown]
-# We can observe that in our grid-search, the largest `max_depth` together
-# with the largest `n_estimators` led to the best generalization performance.
-#
+# We can observe that in our grid-search, the largest `max_depth` together with
+# the largest `n_estimators` led, on average, to the best performance on the
+# validation sets. Now we will estimate the generalization performance of the
+# best model by refitting it with the full training set and using the test set
+# for scoring on unseen data. This refit is done by default when calling the
+# `.fit` method.
+
+# %%
+error = -grid_search.score(data_test, target_test)
+print(f"On average, our random forest regressor makes an error of {error:.2f} k$")
+
+# %% [markdown]
 # ## Gradient-boosting decision trees
 #
-# For gradient-boosting, parameters are coupled, so we cannot set the
-# parameters one after the other anymore. The important parameters are
-# `n_estimators`, `max_depth`, and `learning_rate`.
+# For gradient-boosting, parameters are coupled, so we cannot set the parameters
+# one after the other anymore. The important parameters are `n_estimators`,
+# `max_depth`, and `learning_rate`.
 #
-# Let's first discuss the `max_depth` parameter.
-# We saw in the section on gradient-boosting that the algorithm fits the error
-# of the previous tree in the ensemble. Thus, fitting fully grown trees will
-# be detrimental.
-# Indeed, the first tree of the ensemble would perfectly fit (overfit) the data
-# and thus no subsequent tree would be required, since there would be no
-# residuals.
+# Let's first discuss the `max_depth` parameter. We saw in the section on
+# gradient-boosting that the algorithm fits the error of the previous tree in
+# the ensemble. Thus, fitting fully grown trees will be detrimental. Indeed, the
+# first tree of the ensemble would perfectly fit (overfit) the data and thus no
+# subsequent tree would be required, since there would be no residuals.
 # Therefore, the tree used in gradient-boosting should have a low depth,
 # typically between 3 to 8 levels. Having very weak learners at each step will
 # help reducing overfitting.
@@ -85,16 +92,15 @@
 # residuals will be corrected and less learners are required. Therefore,
 # `n_estimators` should be increased if `max_depth` is lower.
 #
-# Finally, we have overlooked the impact of the `learning_rate` parameter
-# until now. When fitting the residuals, we would like the tree
-# to try to correct all possible errors or only a fraction of them.
-# The learning-rate allows you to control this behaviour.
-# A small learning-rate value would only correct the residuals of very few
-# samples. If a large learning-rate is set (e.g., 1), we would fit the
-# residuals of all samples. So, with a very low learning-rate, we will need
-# more estimators to correct the overall error. However, a too large
-# learning-rate tends to obtain an overfitted ensemble,
-# similar to having a too large tree depth.
+# Finally, we have overlooked the impact of the `learning_rate` parameter until
+# now. When fitting the residuals, we would like the tree to try to correct all
+# possible errors or only a fraction of them. The learning-rate allows you to
+# control this behaviour. A small learning-rate value would only correct the
+# residuals of very few samples. If a large learning-rate is set (e.g., 1), we
+# would fit the residuals of all samples. So, with a very low learning-rate, we
+# will need more estimators to correct the overall error. However, a too large
+# learning-rate tends to produce an overfitted ensemble, similar to having a
+# too large tree depth.
 
 # %%
 from sklearn.ensemble import GradientBoostingRegressor
@@ -121,3 +127,17 @@
 # Here, we tune the `n_estimators` but be aware that using early-stopping as
 # in the previous exercise will be better.
 # ```
+
+# %% [markdown]
+# Now we estimate the generalization performance of the best model using the
+# test set.
+
+# %%
+error = -grid_search.score(data_test, target_test)
+print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")
+
+# %% [markdown]
+# The mean test score on the held-out test set is slightly better than the score
+# of the best model. The reason is that the final model is refitted on the whole
+# training set and therefore, on more data than the inner cross-validated models
+# of the grid search procedure.
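
A note on the sign flip in `error = -grid_search.score(...)` above: scikit-learn scorers follow a "higher is better" convention, so a grid search configured with `neg_mean_absolute_error` returns a negated MAE from its `score` method. A minimal sketch of the pattern, with an illustrative `param_grid` standing in for the notebook's actual grid:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the notebook's real grid is defined in an earlier cell.
param_grid = {"n_estimators": [10, 100], "max_depth": [5, None]}
grid_search = GridSearchCV(
    RandomForestRegressor(),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # higher (less negative) is better
)
# With the default refit=True, the best parameter combination is refitted
# on the full training set at the end of the search.
grid_search.fit(data_train, target_train)
# .score applies the same scorer, hence the sign flip to report an error:
error = -grid_search.score(data_test, target_test)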

_sources/python_scripts/ensemble_sol_04.py

Lines changed: 33 additions & 9 deletions
@@ -36,17 +36,16 @@
 gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.5)
 
 # %% [markdown]
-# Create a validation curve to assess the impact of the number of trees
-# on the generalization performance of the model. Evaluate the list of parameters
-# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
-# to assess the generalization performance of the model.
+# Create a validation curve using the training set to assess the impact of the
+# number of trees on the performance of the model. Evaluate the list of parameters
+# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.
 
 # %%
 # solution
 from sklearn.model_selection import validation_curve
 
 param_range = [1, 2, 5, 10, 20, 50, 100]
-gbdt_train_scores, gbdt_test_scores = validation_curve(
+gbdt_train_scores, gbdt_validation_scores = validation_curve(
     gbdt,
     data_train,
     target_train,
@@ -55,7 +54,7 @@
     scoring="neg_mean_absolute_error",
     n_jobs=2,
 )
-gbdt_train_errors, gbdt_test_errors = -gbdt_train_scores, -gbdt_test_scores
+gbdt_train_errors, gbdt_validation_errors = -gbdt_train_scores, -gbdt_validation_scores
 
 # %% tags=["solution"]
 import matplotlib.pyplot as plt
@@ -68,8 +67,8 @@
 )
 plt.errorbar(
     param_range,
-    gbdt_test_errors.mean(axis=1),
-    yerr=gbdt_test_errors.std(axis=1),
+    gbdt_validation_errors.mean(axis=1),
+    yerr=gbdt_validation_errors.std(axis=1),
     label="Cross-validation",
 )
 
@@ -103,4 +102,29 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the GBDT with the entire 1000 trees would have been
-# useless.
+# useless.
+
+# %% [markdown]
+# Estimate the generalization performance of this model again using
+# the `sklearn.metrics.mean_absolute_error` metric but this time using
+# the test set that we held out at the beginning of the notebook.
+# Compare the resulting value with the values observed in the validation
+# curve.
+
+# %%
+# solution
+from sklearn.metrics import mean_absolute_error
+error = mean_absolute_error(target_test, gbdt.predict(data_test))
+print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")
+
+# %% [markdown] tags=["solution"]
+# We observe that the MAE value measured on the held-out test set is close to
+# the validation error measured at the right-hand side of the validation curve.
+# This is reassuring, as it means that both the cross-validation procedure and
+# the outer train-test split roughly agree as approximations of the true
+# generalization performance of the model. We can observe that the final
+# evaluation of the test error seems to be even slightly below the
+# cross-validated test scores. This can be explained because the final model has
+# been trained on the full training set while the cross-validation models have
+# been trained on smaller subsets: in general the larger the number of training
+# points, the lower the test error.
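
The solution's remark that far fewer than 1000 trees are used refers to the early stopping set up in the previous exercise. As background, a sketch of how `GradientBoostingRegressor` can stop adding trees once an internal validation score stops improving; the parameter values here are illustrative, not the exercise's exact settings:

from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    validation_fraction=0.1,  # inner split used to monitor improvement
)
gbdt.fit(data_train, target_train)
# The fitted attribute records how many trees were actually built.
print(f"{gbdt.n_estimators_} trees were used")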

_sources/python_scripts/linear_models_ex_05.py

Lines changed: 0 additions & 1 deletion
@@ -19,7 +19,6 @@
 
 # %%
 import pandas as pd
-from sklearn.model_selection import train_test_split
 
 penguins = pd.read_csv("../datasets/penguins_classification.csv")
 # only keep the Adelie and Chinstrap classes

_sources/python_scripts/linear_models_sol_05.py

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,6 @@
 
 # %%
 import pandas as pd
-from sklearn.model_selection import train_test_split
 
 penguins = pd.read_csv("../datasets/penguins_classification.csv")
 # only keep the Adelie and Chinstrap classes
@@ -67,6 +66,7 @@
 for C in Cs:
     logistic_regression.set_params(logisticregression__C=C)
     logistic_regression.fit(data_train, target_train)
+    accuracy = logistic_regression.score(data_test, target_test)
 
     DecisionBoundaryDisplay.from_estimator(
         logistic_regression,
@@ -78,7 +78,7 @@
     sns.scatterplot(
         data=penguins_test, x=culmen_columns[0], y=culmen_columns[1],
         hue=target_column, palette=["tab:red", "tab:blue"])
-    plt.title(f"C: {C}")
+    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
 
 # %% [markdown]
 # Look at the impact of the `C` hyperparameter on the magnitude of the weights.
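
The hunks above only show fragments of the plotting loop. For context, a hedged reconstruction of the full cell; the `Cs` values, the pipeline construction, and the `DecisionBoundaryDisplay` arguments are assumptions based on the surrounding notebook, not part of this diff:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed setup: a scaled logistic-regression pipeline and a range of C values.
logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
Cs = [0.01, 0.1, 1, 10]

for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # Line added by this commit: measure test accuracy for the current C.
    accuracy = logistic_regression.score(data_test, target_test)

    DecisionBoundaryDisplay.from_estimator(
        logistic_regression, data_test, response_method="predict",
        cmap="RdBu_r", alpha=0.5,
    )
    sns.scatterplot(
        data=penguins_test, x=culmen_columns[0], y=culmen_columns[1],
        hue=target_column, palette=["tab:red", "tab:blue"])
    # Line changed by this commit: report the accuracy in the figure title.
    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")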
