Merge pull request scikit-learn#5531 from arjoly/float-min_samples
[MRG +2 ]  min_samples_split and min_samples_leaf now accept a percentage
arjoly committed Oct 26, 2015
2 parents 6541f3f + a20e37a commit d9f3277
Showing 11 changed files with 450 additions and 251 deletions.
6 changes: 3 additions & 3 deletions doc/modules/ensemble.rst
@@ -165,20 +165,20 @@ in bias::
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
... random_state=0)

>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
... random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() # doctest: +ELLIPSIS
0.97...

>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
... min_samples_split=1, random_state=0)
... min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() # doctest: +ELLIPSIS
0.999...

>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
... min_samples_split=1, random_state=0)
... min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() > 0.999
True
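The doctests above replace the no-longer-valid ``min_samples_split=1`` with ``2``. The same estimators also accept the new float form this PR introduces; a minimal sketch (assuming a scikit-learn version with this feature, i.e. 0.17 or later):

```python
# Sketch: the percentage form of min_samples_split added by this PR.
# A float in (0, 1] is read as a fraction of the training samples.
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=0)

# 0.01 means ceil(0.01 * 1000) = 10 samples are required to split a node.
clf = RandomForestClassifier(n_estimators=10, min_samples_split=0.01,
                             random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

An integer keeps the old absolute-count behaviour, so existing code is unaffected.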
3 changes: 2 additions & 1 deletion doc/modules/tree.rst
@@ -343,7 +343,8 @@ Tips on practical use
* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
samples at a leaf node. A very small number will usually mean the tree
will overfit, whereas a large number will prevent the tree from learning
the data. Try ``min_samples_leaf=5`` as an initial value.
the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
varies greatly, these two parameters can instead be given as floats,
interpreted as a percentage of the training samples.
The main difference between the two is that ``min_samples_leaf`` guarantees
a minimum number of samples in a leaf, while ``min_samples_split`` can
create arbitrary small leaves, though ``min_samples_split`` is more common
23 changes: 19 additions & 4 deletions doc/whats_new.rst
@@ -21,8 +21,8 @@ New features
implementation supports kernel engineering, gradient-based hyperparameter optimization or
sampling of functions from GP prior and GP posterior. Extensive documentation and
examples are provided. By `Jan Hendrik Metzen`_.
- Added the :class:`ensemble.IsolationForest` class for anomaly detection based on

- Added the :class:`ensemble.IsolationForest` class for anomaly detection based on
random forests. By `Nicolas Goix`_.

Enhancements
@@ -39,8 +39,18 @@ Enhancements
method ``decision_path`` which returns the decision path of samples in
the tree. By `Arnaud Joly`_

- A new example has been added unveling the decision tree structure.
By `Arnaud Joly`_

- The random forest, extra tree and decision tree estimators now have a
method ``decision_path`` which returns the decision path of samples in
the tree. By `Arnaud Joly`_

- A new example has been added unveiling the decision tree structure.
By `Arnaud Joly`_

- Random forest, extra trees, decision trees and gradient boosting estimators
accept the parameters ``min_samples_split`` and ``min_samples_leaf``
provided as a percentage of the training samples. By
`yelite`_ and `Arnaud Joly`_
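The ceiling semantics behind this entry can be verified directly on the fitted tree structure; a sketch (not part of the diff) that inspects leaf sizes via the ``tree_`` attribute:

```python
# Sketch: min_samples_leaf given as a float is ceil(fraction * n_samples).
from math import ceil
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# 0.05 -> every leaf must hold at least ceil(0.05 * 200) = 10 samples.
tree = DecisionTreeClassifier(min_samples_leaf=0.05, random_state=0).fit(X, y)

t = tree.tree_
leaf_sizes = t.n_node_samples[t.children_left == -1]  # leaves have child == -1
print(leaf_sizes.min(), ceil(0.05 * 200))
```

Every leaf is guaranteed at least the ceiling, so the printed minimum is never below 10.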

Bug fixes
.........
@@ -65,6 +75,10 @@ Bug fixes
:class:`decomposition.KernelPCA`, :class:`manifold.LocallyLinearEmbedding`,
and :class:`manifold.SpectralEmbedding`. By `Peter Fischer`_.

- Random forest, extra trees, decision trees and gradient boosting
no longer accept ``min_samples_split=1``, as at least 2 samples
are required to split a decision tree node. By `Arnaud Joly`_
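The new restriction is easy to check; a sketch (the exact exception class can vary across scikit-learn versions, but it is a ``ValueError`` or a subclass of it in the versions I am aware of):

```python
# Sketch: min_samples_split=1 is rejected at fit time after this change.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)

try:
    DecisionTreeClassifier(min_samples_split=1).fit(X, y)
    rejected = False
except ValueError:  # a split needs at least 2 samples
    rejected = True

print(rejected)
```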

API changes summary
-------------------

@@ -3854,3 +3868,4 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Graham Clenaghan: https://github.com/gclenaghan
.. _Giorgio Patrini: https://github.com/giorgiop
.. _Elvis Dohmatob: https://github.com/dohmatob
.. _yelite: https://github.com/yelite
2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_regression.py
@@ -33,7 +33,7 @@

###############################################################################
# Fit regression model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 1,
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

141 changes: 76 additions & 65 deletions sklearn/ensemble/forest.py
@@ -777,36 +777,38 @@ class RandomForestClassifier(ForestClassifier):
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
Ignored if ``max_leaf_nodes`` is not None.
Note: this parameter is tree-specific.
min_samples_split : integer, optional (default=2)
The minimum number of samples required to split an internal node.
Note: this parameter is tree-specific.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is
discarded if after the split, one of the leaves would contain less then
``min_samples_leaf`` samples.
Note: this parameter is tree-specific.
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
Note: this parameter is tree-specific.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
If not None then ``max_depth`` will be ignored.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
@@ -834,7 +836,6 @@ class RandomForestClassifier(ForestClassifier):
new forest.
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional
Weights associated with classes in the form ``{class_label: weight}``.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
@@ -844,8 +845,9 @@ class RandomForestClassifier(ForestClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
The "balanced_subsample" mode is the same as "balanced" except that weights are
computed based on the bootstrap sample for every tree grown.
The "balanced_subsample" mode is the same as "balanced" except that
weights are computed based on the bootstrap sample for every tree
grown.
For multi-output, the weights of each column of y will be multiplied.
@@ -952,7 +954,6 @@ class RandomForestRegressor(ForestRegressor):
criterion : string, optional (default="mse")
The function to measure the quality of a split. The only supported
criterion is "mse" for the mean squared error.
Note: this parameter is tree-specific.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -969,36 +970,38 @@ class RandomForestRegressor(ForestRegressor):
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
Ignored if ``max_leaf_nodes`` is not None.
Note: this parameter is tree-specific.
min_samples_split : integer, optional (default=2)
The minimum number of samples required to split an internal node.
Note: this parameter is tree-specific.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is
discarded if after the split, one of the leaves would contain less then
``min_samples_leaf`` samples.
Note: this parameter is tree-specific.
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
Note: this parameter is tree-specific.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
If not None then ``max_depth`` will be ignored.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
@@ -1110,7 +1113,6 @@ class ExtraTreesClassifier(ForestClassifier):
criterion : string, optional (default="gini")
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "entropy" for the information gain.
Note: this parameter is tree-specific.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -1127,36 +1129,38 @@ class ExtraTreesClassifier(ForestClassifier):
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
Ignored if ``max_leaf_nodes`` is not None.
Note: this parameter is tree-specific.
min_samples_split : integer, optional (default=2)
The minimum number of samples required to split an internal node.
Note: this parameter is tree-specific.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is
discarded if after the split, one of the leaves would contain less then
``min_samples_leaf`` samples.
Note: this parameter is tree-specific.
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
Note: this parameter is tree-specific.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
If not None then ``max_depth`` will be ignored.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
@@ -1184,7 +1188,6 @@ class ExtraTreesClassifier(ForestClassifier):
new forest.
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional
Weights associated with classes in the form ``{class_label: weight}``.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
@@ -1266,7 +1269,8 @@ def __init__(self,
n_estimators=n_estimators,
estimator_params=("criterion", "max_depth", "min_samples_split",
"min_samples_leaf", "min_weight_fraction_leaf",
"max_features", "max_leaf_nodes", "random_state"),
"max_features", "max_leaf_nodes",
"random_state"),
bootstrap=bootstrap,
oob_score=oob_score,
n_jobs=n_jobs,
@@ -1302,7 +1306,6 @@ class ExtraTreesRegressor(ForestRegressor):
criterion : string, optional (default="mse")
The function to measure the quality of a split. The only supported
criterion is "mse" for the mean squared error.
Note: this parameter is tree-specific.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -1319,44 +1322,44 @@ class ExtraTreesRegressor(ForestRegressor):
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
Note: this parameter is tree-specific.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
Ignored if ``max_leaf_nodes`` is not None.
Note: this parameter is tree-specific.
min_samples_split : integer, optional (default=2)
The minimum number of samples required to split an internal node.
Note: this parameter is tree-specific.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is
discarded if after the split, one of the leaves would contain less then
``min_samples_leaf`` samples.
Note: this parameter is tree-specific.
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
leaf node.
Note: this parameter is tree-specific.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
If not None then ``max_depth`` will be ignored.
Note: this parameter is tree-specific.
bootstrap : boolean, optional (default=False)
Whether bootstrap samples are used when building trees.
Note: this parameter is tree-specific.
oob_score : bool
Whether to use out-of-bag samples to estimate
the generalization error.
Whether to use out-of-bag samples to estimate the generalization error.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both `fit` and `predict`.
@@ -1471,13 +1474,21 @@ class RandomTreesEmbedding(BaseForest):
min_samples_split samples.
Ignored if ``max_leaf_nodes`` is not None.
min_samples_split : integer, optional (default=2)
The minimum number of samples required to split an internal node.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` is the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
min_samples_leaf : integer, optional (default=1)
The minimum number of samples in newly created leaves. A split is
discarded if after the split, one of the leaves would contain less then
``min_samples_leaf`` samples.
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` is the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the input samples required to be at a
Expand Down