DOC improve ENN documentation #1021

Merged
Changes from 2 commits
43 changes: 27 additions & 16 deletions doc/under_sampling.rst
@@ -237,14 +237,23 @@ figure illustrates this behaviour.

.. _edited_nearest_neighbors:

Edited data set using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Editing data using nearest neighbours
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
"edit" the dataset by removing samples which do not agree "enough" with their
neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
under-sampled, the nearest-neighbours are computed and if the selection
criterion is not fulfilled, the sample is removed::
Edited nearest neighbours
~~~~~~~~~~~~~~~~~~~~~~~~~

The edited nearest neighbours methodology uses KNN to identify the neighbours of the
targeted class samples, and then removes observations if any or most of their
neighbours are from a different class :cite:`wilson1972asymptotic`.

:class:`EditedNearestNeighbours` carries out the following steps:

1. Train a KNN using the entire dataset.
2. Find each observation's 3 closest neighbours (only for the targeted classes).
Review comment (Member): Almost sure that the 3 is parametrizable. We should instead refer to the sampler argument.

3. Remove observations if any or most of their neighbours belong to a different class.
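
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function name ``edited_nearest_neighbours_sketch``, the brute-force neighbour search and the toy data are all hypothetical, and for brevity only the stricter rule (remove a sample if *any* neighbour disagrees) is shown.

```python
from math import dist


def edited_nearest_neighbours_sketch(X, y, target_classes, n_neighbors=3):
    """Return the indices of the samples kept by a simplified ENN pass.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in target_classes:
            kept.append(i)  # samples from non-targeted classes are never removed
            continue
        # Steps 1 and 2: search the entire dataset (all classes) for the
        # closest samples, excluding the sample itself.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in others[:n_neighbors]]
        # Step 3 (strict variant): keep the sample only if no neighbour
        # belongs to a different class.
        if all(label == yi for label in neighbour_labels):
            kept.append(i)
    return kept


# Toy 1-D data: class 1 is targeted, class 0 is left untouched.
X = [(0,), (1,), (2,), (3,), (4,), (4.5,), (20,), (21,), (22,)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0]
kept = edited_nearest_neighbours_sketch(X, y, target_classes={1})
print(kept)
```

In this toy data the class-1 samples at 3 and 4 have the class-0 sample at 4.5 among their three closest neighbours, so the strict rule drops them.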

The code below shows the implementation::

>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
@@ -254,12 +263,12 @@ criterion is not fulfilled, the sample is removed::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 213), (2, 4568)]

Two selection criteria are currently available: (i) the majority (i.e.,
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
nearest-neighbors have to belong to the same class than the sample inspected to
keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
conservative than `kind_sel='mode'`, and more samples will be excluded in
the former strategy than the latest::

To paraphrase step 3, :class:`EditedNearestNeighbours` retains an observation from
a targeted class when **most** or **all** of its neighbours belong to the same class
as the observation itself. To select between these criteria, set ``kind_sel='mode'``
or ``kind_sel='all'``, respectively. Because ``kind_sel='all'`` is the stricter
criterion, it is less conservative than ``kind_sel='mode'`` and removes more samples::

>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,9 +279,11 @@ the former strategy than the latest::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]
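
The difference between the two criteria can be illustrated for a single sample. The sketch below is a simplified reading of the selection rule, not the library code: ``keep_sample`` is a hypothetical helper, and the way ties are broken in the ``'mode'`` branch is an assumption of this example.

```python
from collections import Counter


def keep_sample(own_label, neighbour_labels, kind_sel="all"):
    """Decide whether one sample survives the edit (hypothetical helper)."""
    if kind_sel == "all":
        # Every neighbour must share the sample's class.
        return all(label == own_label for label in neighbour_labels)
    if kind_sel == "mode":
        # The most common class among the neighbours must match the sample's
        # class. Tie-breaking follows Counter ordering here, which is an
        # assumption of this sketch.
        most_common_label = Counter(neighbour_labels).most_common(1)[0][0]
        return most_common_label == own_label
    raise ValueError(f"unknown kind_sel: {kind_sel!r}")


# A class-2 sample whose three neighbours are labelled 2, 2 and 1.
neighbour_labels = [2, 2, 1]
kept_by_mode = keep_sample(2, neighbour_labels, kind_sel="mode")
kept_by_all = keep_sample(2, neighbour_labels, kind_sel="all")
print(kept_by_mode, kept_by_all)
```

Any sample that passes the ``'all'`` test also passes the ``'mode'`` test, which is why ``'all'`` can only remove the same samples or more.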

The parameter ``n_neighbors`` allows to give a classifier subclassed from
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
the decision to keep a given sample or not.
The parameter ``n_neighbors`` accepts an integer, which sets the number of
neighbours to examine for each sample. It also accepts a classifier subclassed from
``KNeighborsMixin`` from scikit-learn. Note that if you pass a 3-KNN classifier,
only 2 neighbours are examined for the cleaning, because one of the 3 samples
returned is the sample being examined for undersampling.
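
The reason a 3-neighbour query leaves only 2 usable neighbours can be seen with a brute-force search. This is a minimal sketch, assuming the neighbour search is run on the same samples it was fitted on; ``k_nearest`` and the toy points are hypothetical and do not reflect the library internals.

```python
from math import dist


def k_nearest(points, query, k):
    """Return the indices of the k points closest to `query` (brute force)."""
    return sorted(range(len(points)), key=lambda j: dist(query, points[j]))[:k]


points = [(0.0,), (1.0,), (2.5,), (7.0,)]
sample_index = 1

# A "3-KNN" query for a sample that is itself part of the fitted data:
# the sample is at distance 0 from itself, so it comes back first.
neighbours = k_nearest(points, points[sample_index], k=3)

# Dropping the sample itself leaves only 2 neighbours for the cleaning step.
usable = [j for j in neighbours if j != sample_index]
print(neighbours, usable)
```

When an integer is passed instead, the description above says it is the number of neighbours examined, so no such adjustment is needed on the user's side.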

:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times