DOC improve introduction to undersampling methods (scikit-learn-contr…
solegalli authored Jul 11, 2023
1 parent 87ef4fc commit ef2e75b
Showing 1 changed file with 35 additions and 11 deletions.
46 changes: 35 additions & 11 deletions doc/under_sampling.rst
@@ -6,7 +6,25 @@ Under-sampling

.. currentmodule:: imblearn.under_sampling

-You can refer to
+One way of handling imbalanced datasets is to reduce the number of observations from
+all classes but the minority class. The minority class is that with the least number
+of observations. The most well known algorithm in this group is random
+undersampling, where samples from the targeted classes are removed at random.
+
+But there are many other algorithms to help us reduce the number of observations in the
+dataset. These algorithms can be grouped based on their undersampling strategy into:
+
+- Prototype generation methods.
+- Prototype selection methods.
+
+And within the latter, we find:
+
+- Controlled undersampling
+- Cleaning methods
+
+We will discuss the different algorithms throughout this document.
+
+Check also
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.

.. _cluster_centroids:
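
The random undersampling described in the new introduction can be sketched in plain Python. This is only an illustration under stated assumptions, not imbalanced-learn's implementation: the helper name ``random_undersample`` and the toy data are hypothetical, and every class is reduced to the size of the minority class.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class keeps as many as the minority class.

    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(y):
        indices_by_class[label].append(index)
    # The minority class is the one with the fewest observations.
    n_minority = min(len(indices) for indices in indices_by_class.values())
    kept = []
    for indices in indices_by_class.values():
        # Sample without replacement; the minority class is kept whole.
        kept.extend(rng.sample(indices, n_minority))
    kept.sort()  # preserve the original order of the retained samples
    return [X[i] for i in kept], [y[i] for i in kept]


X = [[0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [5.1]]
y = [0, 0, 0, 0, 0, 1, 1]
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

After resampling, both classes hold two observations, and every retained row is one of the original rows. In imbalanced-learn itself, this strategy is provided by ``RandomUnderSampler`` in ``imblearn.under_sampling``.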
@@ -16,7 +34,7 @@ Prototype generation

Given an original data set :math:`S`, prototype generation algorithms will
generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not\subset
-S`. In other words, prototype generation technique will reduce the number of
+S`. In other words, prototype generation techniques will reduce the number of
samples in the targeted classes but the remaining samples are generated --- and
not selected --- from the original set.
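
To make the condition :math:`S' \not\subset S` concrete, here is a minimal sketch of centroid-based prototype generation. It assumes a tiny hand-written k-means (the helper ``kmeans_centroids``, its deterministic initialisation, and the toy data are hypothetical) and is not the library's code.

```python
def kmeans_centroids(points, k, n_iter=10):
    """Tiny Lloyd's algorithm; returns k centroids (generated, not selected).

    Hypothetical helper written for illustration only.
    """
    # Deterministic initialisation (an assumption made for reproducibility):
    # take k evenly spaced points from the sorted data.
    ordered = sorted(points)
    step = len(ordered) / k
    centroids = [ordered[int(i * step)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for point in points:
            # Assign each point to its closest centroid (squared Euclidean distance).
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids


# Six samples of the targeted class are reduced to two generated prototypes.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes = kmeans_centroids(majority, k=2)
print(prototypes)
```

The two prototypes are the means of the two point clouds, so neither of them appears in ``majority``: the reduced set is smaller than the original and is not a subset of it.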

@@ -61,16 +79,22 @@ original one.
Prototype selection
===================

-On the contrary to prototype generation algorithms, prototype selection
-algorithms will select samples from the original set :math:`S`. Therefore,
-:math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
+Prototype selection algorithms will select samples from the original set :math:`S`,
+generating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
+other words, :math:`S'` is a subset of :math:`S`.

+Prototype selection algorithms can be divided into two groups: (i) controlled
+under-sampling techniques and (ii) cleaning under-sampling techniques.
+
+Controlled under-sampling methods reduce the number of observations in the majority
+class or classes to an arbitrary number of samples specified by the user. Typically,
+they reduce the number of observations to the number of samples observed in the
+minority class.
+
-In addition, these algorithms can be divided into two groups: (i) the
-controlled under-sampling techniques and (ii) the cleaning under-sampling
-techniques. The first group of methods allows for an under-sampling strategy in
-which the number of samples in :math:`S'` is specified by the user. By
-contrast, cleaning under-sampling techniques do not allow this specification
-and are meant for cleaning the feature space.
+In contrast, cleaning under-sampling techniques "clean" the feature space by removing
+either "noisy" or "too easy to classify" observations, depending on the method. The
+final number of observations in each class varies with the cleaning method and can't be
+specified by the user.

.. _controlled_under_sampling:
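
The difference between the two groups of prototype selection methods can be sketched as follows. Both helpers are hypothetical: ``controlled_undersample`` keeps a user-chosen number of samples, while ``clean_majority`` applies a simplified nearest-neighbour rule picked only for illustration (it is not a specific imbalanced-learn method), so its output size depends on the data.

```python
import random


def controlled_undersample(samples, n_samples, seed=0):
    """Keep exactly n_samples observations, chosen at random (hypothetical helper)."""
    return random.Random(seed).sample(samples, n_samples)


def clean_majority(majority, minority):
    """Drop majority points whose nearest other point belongs to the minority class.

    Simplified 1-D cleaning rule, assumed here for illustration only.
    """
    labelled = [(x, "maj") for x in majority] + [(x, "min") for x in minority]
    kept = []
    for i, x in enumerate(majority):
        # Nearest neighbour among all other observations (absolute distance in 1-D).
        others = [pair for j, pair in enumerate(labelled) if j != i]
        _, nearest_label = min(others, key=lambda pair: abs(pair[0] - x))
        if nearest_label == "maj":
            kept.append(x)
    return kept


majority = [0.0, 0.1, 0.2, 0.3, 0.4, 2.05]
minority = [2.0, 2.1]

# Controlled: the caller decides the final size (here, the minority class size).
controlled = controlled_undersample(majority, n_samples=len(minority))
# Cleaning: the final size follows from the data, not from a parameter.
cleaned = clean_majority(majority, minority)
print(len(controlled), len(cleaned))
```

Here the controlled variant returns exactly two of the original majority samples, whereas the cleaning rule only removes the single majority point sitting among the minority samples and leaves five. In both cases the result is a subset of the original data.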
