14 changes: 10 additions & 4 deletions python_scripts/ensemble_ex_04.py
@@ -34,10 +34,9 @@
# Write your code here.

# %% [markdown]
# Create a validation curve to assess the impact of the number of trees
# on the generalization performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
# to assess the generalization performance of the model.
# Create a validation curve using the training set to assess the impact of the
# number of trees on the performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.

# %%
# Write your code here.
@@ -60,3 +59,10 @@

# %%
# Write your code here.

# %% [markdown]
# Estimate the generalization performance of this model using the test set
# and `sklearn.metrics.mean_absolute_error`.

# %%
# Write your code here.
66 changes: 43 additions & 23 deletions python_scripts/ensemble_hyperparameters.py
@@ -61,22 +61,29 @@
cv_results[columns].sort_values(by="rank_test_score")

# %% [markdown]
# We can observe that in our grid-search, the largest `max_depth` together
# with the largest `n_estimators` led to the best generalization performance.
#
# We can observe that in our grid-search, the largest `max_depth` together with
# the largest `n_estimators` led, on average, to the best performance on the
# validation sets. Now we will estimate the generalization performance of the
# best model by refitting it with the full training set and using the test set
# for scoring on unseen data. The refit on the full training set is done by
# default (`refit=True`) when calling the `.fit` method of `GridSearchCV`.

# %%
error = -grid_search.score(data_test, target_test)
print(f"On average, our random forest regressor makes an error of {error:.2f} k$")

# %% [markdown]
# ## Gradient-boosting decision trees
#
# For gradient-boosting, parameters are coupled, so we cannot set the
# parameters one after the other anymore. The important parameters are
# `n_estimators`, `max_depth`, and `learning_rate`.
# For gradient-boosting, parameters are coupled, so we cannot set the parameters
# one after the other anymore. The important parameters are `n_estimators`,
# `max_depth`, and `learning_rate`.
#
# Let's first discuss the `max_depth` parameter.
# We saw in the section on gradient-boosting that the algorithm fits the error
# of the previous tree in the ensemble. Thus, fitting fully grown trees will
# be detrimental.
# Indeed, the first tree of the ensemble would perfectly fit (overfit) the data
# and thus no subsequent tree would be required, since there would be no
# residuals.
# Let's first discuss the `max_depth` parameter. We saw in the section on
# gradient-boosting that the algorithm fits the error of the previous tree in
# the ensemble. Thus, fitting fully grown trees will be detrimental. Indeed, the
# first tree of the ensemble would perfectly fit (overfit) the data and thus no
# subsequent tree would be required, since there would be no residuals.
# Therefore, the tree used in gradient-boosting should have a low depth,
# typically between 3 and 8 levels. Having very weak learners at each step
# helps reduce overfitting.
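
# %% [markdown]
# A minimal sketch of this point, assuming the `data_train` and `target_train`
# variables defined earlier in this notebook: a single fully grown tree drives
# the training residuals to (almost) zero, leaving nothing for subsequent trees
# to correct.

# %%
from sklearn.tree import DecisionTreeRegressor

fully_grown_tree = DecisionTreeRegressor(max_depth=None)  # unlimited depth
fully_grown_tree.fit(data_train, target_train)
residuals = target_train - fully_grown_tree.predict(data_train)
print(f"Mean absolute residual on the training set: {abs(residuals).mean():.4f}")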
@@ -85,16 +92,15 @@
# residuals will be corrected and fewer learners are required. Therefore,
# `n_estimators` should be increased if `max_depth` is lower.
#
# Finally, we have overlooked the impact of the `learning_rate` parameter
# until now. When fitting the residuals, we would like the tree
# to try to correct all possible errors or only a fraction of them.
# The learning-rate allows you to control this behaviour.
# A small learning-rate value would only correct the residuals of very few
# samples. If a large learning-rate is set (e.g., 1), we would fit the
# residuals of all samples. So, with a very low learning-rate, we will need
# more estimators to correct the overall error. However, a too large
# learning-rate tends to obtain an overfitted ensemble,
# similar to having a too large tree depth.
# Finally, we have overlooked the impact of the `learning_rate` parameter until
# now. When fitting the residuals, we would like the tree to try to correct all
# possible errors or only a fraction of them. The learning-rate allows you to
# control this behaviour. A small learning-rate value would only correct the
# residuals of very few samples. If a large learning-rate is set (e.g., 1), we
# would fit the residuals of all samples. So, with a very low learning-rate, we
# will need more estimators to correct the overall error. However, too large a
# learning-rate tends to produce an overfitted ensemble, similar to having too
# large a tree depth.
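
# %% [markdown]
# A minimal illustration of this trade-off, assuming the `data_train` and
# `target_train` variables from earlier in the notebook: with a fixed, small
# number of trees, a very low learning-rate leaves a much larger training error
# than a learning-rate of 1.

# %%
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

for lr in (0.05, 1.0):
    model = GradientBoostingRegressor(
        n_estimators=50, max_depth=3, learning_rate=lr
    )
    model.fit(data_train, target_train)
    train_error = mean_absolute_error(target_train, model.predict(data_train))
    print(f"learning_rate={lr}: training error = {train_error:.2f} k$")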

# %%
from sklearn.ensemble import GradientBoostingRegressor
@@ -121,3 +127,17 @@
# Here, we tune the `n_estimators` but be aware that using early-stopping as
# in the previous exercise will be better.
# ```
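
# %% [markdown]
# A minimal sketch of such early-stopping, assuming the same `data_train` and
# `target_train` as above: `n_iter_no_change` stops adding trees once the score
# on an internal validation split stops improving.

# %%
from sklearn.ensemble import GradientBoostingRegressor

early_stopping_gbdt = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    validation_fraction=0.1,  # share of the training data used for monitoring
)
early_stopping_gbdt.fit(data_train, target_train)
# the fitted attribute reports how many trees were actually added
print(f"Number of trees used: {early_stopping_gbdt.n_estimators_}")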

# %% [markdown]
# Now we estimate the generalization performance of the best model
# using the test set.

# %%
error = -grid_search.score(data_test, target_test)
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")

# %% [markdown]
# The score measured on the held-out test set is slightly better than the mean
# cross-validated score of the best model. The reason is that the final model is
# refitted on the whole training set, and therefore on more data than the inner
# cross-validated models of the grid search procedure.
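
# %% [markdown]
# One way to see this, reusing the fitted `grid_search` and the test data from
# above, and assuming the grid search was created with
# `scoring="neg_mean_absolute_error"`: compare the mean inner cross-validated
# error of the best candidate with the error on the held-out test set.

# %%
cv_error = -grid_search.best_score_  # mean error over the inner CV splits
test_error = -grid_search.score(data_test, target_test)  # refitted model
print(f"Inner CV error: {cv_error:.2f} k$ vs. test error: {test_error:.2f} k$")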
42 changes: 33 additions & 9 deletions python_scripts/ensemble_sol_04.py
@@ -36,17 +36,16 @@
gbdt = GradientBoostingRegressor(max_depth=5, learning_rate=0.5)

# %% [markdown]
# Create a validation curve to assess the impact of the number of trees
# on the generalization performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error
# to assess the generalization performance of the model.
# Create a validation curve using the training set to assess the impact of the
# number of trees on the performance of the model. Evaluate the list of parameters
# `param_range = [1, 2, 5, 10, 20, 50, 100]` and use the mean absolute error.

# %%
# solution
from sklearn.model_selection import validation_curve

param_range = [1, 2, 5, 10, 20, 50, 100]
gbdt_train_scores, gbdt_test_scores = validation_curve(
gbdt_train_scores, gbdt_validation_scores = validation_curve(
gbdt,
data_train,
target_train,
@@ -55,7 +54,7 @@
scoring="neg_mean_absolute_error",
n_jobs=2,
)
gbdt_train_errors, gbdt_test_errors = -gbdt_train_scores, -gbdt_test_scores
gbdt_train_errors, gbdt_validation_errors = -gbdt_train_scores, -gbdt_validation_scores

# %% tags=["solution"]
import matplotlib.pyplot as plt
@@ -68,8 +67,8 @@
)
plt.errorbar(
param_range,
gbdt_test_errors.mean(axis=1),
yerr=gbdt_test_errors.std(axis=1),
gbdt_validation_errors.mean(axis=1),
yerr=gbdt_validation_errors.std(axis=1),
label="Cross-validation",
)

@@ -103,4 +102,29 @@
# %% [markdown] tags=["solution"]
# We see that the number of trees used is far below 1000 with the current
# dataset. Training the GBDT with the entire 1000 trees would have been
# useless.

# %% [markdown]
# Estimate the generalization performance of this model again with the
# `sklearn.metrics.mean_absolute_error` metric, but this time using the test
# set that we held out at the beginning of the notebook. Compare the resulting
# value with the values observed in the validation curve.

# %%
# solution
from sklearn.metrics import mean_absolute_error

error = mean_absolute_error(target_test, gbdt.predict(data_test))
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")

# %% [markdown] tags=["solution"]
# We observe that the MAE measured on the held-out test set is close to the
# validation error at the right-hand side of the validation curve. This is
# reassuring, as it means that both the cross-validation procedure and the outer
# train-test split roughly agree as approximations of the true generalization
# performance of the model. We can also observe that the final evaluation of the
# test error seems to be even slightly below the cross-validated test scores.
# This can be explained by the fact that the final model has been trained on the
# full training set while the cross-validation models have been trained on
# smaller subsets: in general, the larger the number of training points, the
# lower the test error.
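
# %% [markdown] tags=["solution"]
# A minimal sketch of this comparison, reusing the arrays computed for the
# validation curve above: the largest value of `param_range` is the closest
# setting to the final model.

# %%
cv_error_most_trees = gbdt_validation_errors.mean(axis=1)[-1]
print(
    f"Cross-validated error with {param_range[-1]} trees: "
    f"{cv_error_most_trees:.2f} k$ vs. held-out test error: {error:.2f} k$"
)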
1 change: 0 additions & 1 deletion python_scripts/linear_models_ex_05.py
@@ -19,7 +19,6 @@

# %%
import pandas as pd
from sklearn.model_selection import train_test_split

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
4 changes: 2 additions & 2 deletions python_scripts/linear_models_sol_05.py
@@ -18,7 +18,6 @@

# %%
import pandas as pd
from sklearn.model_selection import train_test_split

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
@@ -67,6 +66,7 @@
for C in Cs:
logistic_regression.set_params(logisticregression__C=C)
logistic_regression.fit(data_train, target_train)
accuracy = logistic_regression.score(data_test, target_test)

DecisionBoundaryDisplay.from_estimator(
logistic_regression,
@@ -78,7 +78,7 @@
sns.scatterplot(
data=penguins_test, x=culmen_columns[0], y=culmen_columns[1],
hue=target_column, palette=["tab:red", "tab:blue"])
plt.title(f"C: {C}")
plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
2 changes: 2 additions & 0 deletions python_scripts/logistic_regression.py
@@ -81,6 +81,8 @@
StandardScaler(), LogisticRegression(penalty="none")
)
logistic_regression.fit(data_train, target_train)
accuracy = logistic_regression.score(data_test, target_test)
print(f"Accuracy on test set: {accuracy:.3f}")

# %% [markdown]
#