Description
The following notebooks fit and evaluate models on the full dataset, with no train-test split or cross-validation:
- linear_regression_without_sklearn.py
- linear_models_ex_01.py and its solution
- linear_regression_in_sklearn.py
- linear_models_ex_02.py and its solution
- linear_regression_non_linear_link.py
- linear_models_ex_04.py and its solution
- logistic_regression_non_linear.py
- trees_regression.py
- trees_ex_02.py and its solution
- ensemble_bagging.py
- ensemble_adaboost.py
This has been a source of confusion (see for instance this forum question).
We should add a warning to each of these notebooks, adapted to each case but similar to the one already in logistic_regression_non_linear.py:
Warning: Be aware that we fit and will check the boundary decision of the classifier on the same dataset without splitting the dataset into a training set and a testing set. While this is a bad practice, we use it for the sake of simplicity to depict the model behavior. Always use cross-validation when you want to assess the generalization performance of a machine-learning model.
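To make the point of that warning concrete, here is a minimal sketch of the gap between scoring on the fitting data and scoring with cross-validation. It is not taken from any of the notebooks: the synthetic `make_moons` dataset, the unconstrained `DecisionTreeClassifier`, and the variable names are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset (an assumption, not data used in the notebooks).
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# A fully grown tree can memorize the points it was fitted on.
model = DecisionTreeClassifier(random_state=0)

# Accuracy measured on the very same data used for fitting.
train_accuracy = model.fit(X, y).score(X, y)

# Accuracy measured on held-out folds with 5-fold cross-validation.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"Accuracy on the fitting data: {train_accuracy:.3f}")
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The first number describes how well the model reproduces data it has already seen; only the second one estimates generalization performance.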
Additionally, a warning should be added to the following notebooks:
- linear_models_ex_01.py and its solution
- linear_regression_in_sklearn.py
- linear_models_ex_02.py and its solution
- linear_regression_non_linear_link.py
where we remind the user that scoring the model on the full dataset is not necessarily wrong, but provides no information about underfitting or overfitting.
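As a rough illustration of what such a reminder could point to, the sketch below contrasts a single score computed on the full dataset with the train and test scores returned by `cross_validate`. The synthetic cubic data and the variable names are assumptions made for this example and do not come from the notebooks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic non-linear data (an assumption made for this sketch).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 3 + rng.normal(scale=2.0, size=100)

model = LinearRegression()

# A single R2 computed on the data used for fitting: a valid number,
# but it cannot tell underfitting from overfitting on its own.
full_data_r2 = model.fit(X, y).score(X, y)

# Cross-validation returns both train and test scores for each fold,
# which is what allows diagnosing under/over-fitting.
cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)
mean_train_r2 = cv_results["train_score"].mean()
mean_test_r2 = cv_results["test_score"].mean()

print(f"R2 on the full dataset: {full_data_r2:.3f}")
print(f"Mean train R2 across folds: {mean_train_r2:.3f}")
print(f"Mean test R2 across folds: {mean_test_r2:.3f}")
```

Low train and test scores together suggest underfitting, while a high train score paired with a much lower test score suggests overfitting. The single full-dataset score cannot distinguish between these situations.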
What do you think?