.. currentmodule:: imblearn.ensemble
In ensemble classifiers, bagging methods build several estimators on different randomly selected subsets of data. In scikit-learn, this classifier is named :class:`~sklearn.ensemble.BaggingClassifier`. However, this classifier does not allow each subset of data to be balanced. Therefore, when training on an imbalanced data set, this classifier will favor the majority classes:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,
...                            random_state=0)
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                        random_state=0)
>>> bc.fit(X_train, y_train)  # doctest: +ELLIPSIS
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
0.77...
In :class:`BalancedBaggingClassifier`, each bootstrap sample will be further resampled to achieve the desired ``sampling_strategy``. Therefore, :class:`BalancedBaggingClassifier` takes the same parameters as the scikit-learn :class:`~sklearn.ensemble.BaggingClassifier`. In addition, the sampling is controlled by the parameter ``sampler`` or, if one wants to use the :class:`~imblearn.under_sampling.RandomUnderSampler`, by the two parameters ``sampling_strategy`` and ``replacement``:
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                 sampling_strategy='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train)  # doctest: +ELLIPSIS
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
0.8...
Changing the ``sampler`` will give rise to different known implementations :cite:`maclin1997empirical`, :cite:`hido2009roughly`, :cite:`wang2009diversity`, as sketched below. The following example shows these different methods in practice: :ref:`sphx_glr_auto_examples_ensemble_plot_bagging_classifier.py`
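For instance, passing :class:`~imblearn.over_sampling.RandomOverSampler` as the ``sampler`` yields an over-sampling variant of bagging. A minimal sketch (the ``bbc_over`` name is illustrative and reuses the training data from above):

>>> from imblearn.over_sampling import RandomOverSampler
>>> bbc_over = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                      sampler=RandomOverSampler(),
...                                      random_state=0)
>>> bbc_over.fit(X_train, y_train)  # doctest: +ELLIPSIS
BalancedBaggingClassifier(...)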
:class:`BalancedRandomForestClassifier` is another ensemble method in which each tree of the forest is given a balanced bootstrap sample :cite:`chen2004using`. This class provides all the functionality of :class:`~sklearn.ensemble.RandomForestClassifier`:
>>> from imblearn.ensemble import BalancedRandomForestClassifier
>>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
>>> brf.fit(X_train, y_train)  # doctest: +ELLIPSIS
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
0.8...
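Since :class:`BalancedRandomForestClassifier` exposes the usual fitted attributes of a random forest, the impurity-based feature importances can be inspected after fitting; a minimal sketch:

>>> brf.feature_importances_  # doctest: +ELLIPSIS
array([...])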
Several methods taking advantage of boosting have been designed.
:class:`RUSBoostClassifier` randomly under-samples the dataset before performing each boosting iteration :cite:`seiffert2009rusboost`:
>>> from imblearn.ensemble import RUSBoostClassifier
>>> rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
...                               random_state=0)
>>> rusboost.fit(X_train, y_train)  # doctest: +ELLIPSIS
RUSBoostClassifier(...)
>>> y_pred = rusboost.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
0...
A specific method that uses :class:`~sklearn.ensemble.AdaBoostClassifier` as learners in the bagging classifier is called "EasyEnsemble". The :class:`EasyEnsembleClassifier` allows bagging AdaBoost learners that are trained on balanced bootstrap samples :cite:`liu2008exploratory`. Similarly to the :class:`BalancedBaggingClassifier` API, one can construct the ensemble as follows:
>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train)  # doctest: +ELLIPSIS
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)  # doctest: +ELLIPSIS
0.6...
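As with any bagging ensemble, the fitted learners are stored in the ``estimators_`` attribute; a minimal sketch, assuming the default number of estimators (10):

>>> len(eec.estimators_)  # one balanced AdaBoost learner per bootstrap sample
10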