[MRG] DOC/FIX fix Tomek links example #255

glemaitre · 2017-03-20T12:47:08Z

Reference Issue

Fixes #250

What does this implement/fix? Explain your changes.

Any other comments?

pep8speaks · 2017-03-20T12:47:19Z

Hello @glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 20, 2017 at 13:13 Hours UTC

codecov · 2017-03-20T12:57:30Z

Codecov Report

Merging #255 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #255   +/-   ##
=======================================
  Coverage   98.27%   98.27%           
=======================================
  Files          58       58           
  Lines        3429     3429           
=======================================
  Hits         3370     3370           
  Misses         59       59


chkoar · 2017-03-20T13:10:09Z

examples/under-sampling/plot_tomek_links.py

-ax2.set_title('Tomek links')
+
+# create a synthetic dataset
+X = np.array([[0.31230513, 0.1216318], [0.68481731, 0.51935141],


Do we need that?

amueller · 2017-03-20T14:17:56Z

So this is a balanced dataset, right? Which class do the removed samples belong to?

glemaitre · 2017-03-20T14:29:12Z

The result is not necessarily a balanced dataset.
The removed samples belong to the majority class. The minority class samples are not evaluated.

And a Tomek link is defined as a pair of samples that are reciprocally nearest neighbours.

amueller · 2017-03-20T14:44:50Z

I meant the input here was balanced, right? So which one is the majority class?

How about

"""
===========
Tomek links
===========
An illustration of the Tomek links method.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour

print(__doc__)

rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
        0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)


# remove Tomek links
tl = TomekLinks(return_indices=True)
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),
                                   idx_resampled)
idx_class_0 = y_resampled == 0
plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1],
            c='g', alpha=.8, label='Class #0')
plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1],
            c='b', alpha=.8, label='Class #1')
plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1],
            c='r', alpha=.4, label='Removed samples')

plt.title('Under-sampling removing Tomek links')
plt.legend()

plt.show()

amueller · 2017-03-20T14:45:28Z

examples/under-sampling/plot_tomek_links.py

+idx_samples_removed = np.setdiff1d(np.flatnonzero(y == 1),
+                                   np.union1d(idx_class_0, idx_class_1))
+
+plt.scatter(X[idx_class_0, 0], X[idx_class_0, 1],


this should be X_resampled, I think.

This reverts commit a76693b.

glemaitre · 2017-03-20T15:11:04Z

Thanks for pointing these things out. I think this went in too quickly in the rush to merge, my bad.
Thanks for the example fix.

amueller · 2017-03-20T15:33:16Z

I'm preparing a lecture on this, I might have some nice examples later today. I'll post a notebook but I'm not sure if I have time for PRs.
I think it would be nice to point out that Tomek and ENN do basically the exact opposite of each other:

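import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from imblearn.under_sampling import (TomekLinks, CondensedNearestNeighbour,
                                     EditedNearestNeighbours)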

rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
        0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)

def plot_resampled(X_org, y_org, X_res, y_res, idx, ax=None):
    if ax is None:
        ax = plt.gca()
    idx_samples_removed = np.setdiff1d(np.arange(X_org.shape[0]),
                                       idx)
    idx_class_0 = y_res == 0
    ax.scatter(X_res[idx_class_0, 0], X_res[idx_class_0, 1],
                c='g', alpha=.8, label='Class 0')
    ax.scatter(X_res[~idx_class_0, 0], X_res[~idx_class_0, 1],
                c='b', alpha=.8, label='Class 1')
    ax.scatter(X_org[idx_samples_removed, 0], X_org[idx_samples_removed, 1],
                c='g', alpha=.4, s=10, label='Samples removed from Class 0')
    ax.legend()

fig, ax = plt.subplots(1, 2)
enn = EditedNearestNeighbours(return_indices=True)
X_resampled, y_resampled, idx_resampled = enn.fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[0])
ax[0].set_title("Edited Nearest Neighbor")
cnn = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = cnn.fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[1])
ax[1].set_title("Condensed Nearest Neighbor")
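The contrast between the two rules can also be seen without any plotting. What follows is a simplified sketch of the textbook ENN and CNN rules, not imbalanced-learn's implementation: the 1-D toy data, the helper names (`neighbours`, `enn_keep`, `cnn_keep`), and the choice to seed the CNN store with the first majority sample are all assumptions made for illustration.

```python
from collections import Counter


def neighbours(xs, i, candidates):
    """Candidate indices sorted by distance to xs[i], excluding i itself."""
    return sorted((j for j in candidates if j != i),
                  key=lambda j: abs(xs[i] - xs[j]))


def enn_keep(xs, ys, majority, k=3):
    """ENN sketch: drop majority samples whose k nearest neighbours mostly disagree."""
    everyone = range(len(xs))
    kept = []
    for i in everyone:
        if ys[i] == majority:
            votes = Counter(ys[j] for j in neighbours(xs, i, everyone)[:k])
            if votes.most_common(1)[0][0] != ys[i]:
                continue  # outvoted by the other class: edited out
        kept.append(i)
    return kept


def cnn_keep(xs, ys, majority):
    """CNN sketch: keep minority samples plus majority samples a 1-NN store gets wrong."""
    majority_idx = [i for i in range(len(xs)) if ys[i] == majority]
    store = [i for i in range(len(xs)) if ys[i] != majority]
    store.append(majority_idx[0])  # assumption: seed with the first majority sample
    changed = True
    while changed:  # repeat passes until the store stops growing
        changed = False
        for i in majority_idx:
            if i in store:
                continue
            nearest = neighbours(xs, i, store)[0]
            if ys[nearest] != ys[i]:  # misclassified by the store: keep it
                store.append(i)
                changed = True
    return sorted(store)


# Toy 1-D data (made up): class 0 is the majority, class 1 the minority.
xs = [0.0, 1.1, 2.3, 3.6, 4.6, 4.1, 5.0, 5.7]
ys = [0, 0, 0, 0, 0, 1, 1, 1]

enn = enn_keep(xs, ys, majority=0)
cnn = cnn_keep(xs, ys, majority=0)
print("removed by ENN:", sorted(set(range(len(xs))) - set(enn)))
print("removed by CNN:", sorted(set(range(len(xs))) - set(cnn)))
```

On this data ENN drops the majority sample sitting among the minority points (index 4), while CNN drops a majority sample well inside its own class (index 1) and keeps the ones near the boundary.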

amueller · 2017-03-20T15:35:16Z

Hm, I'm still seeing strange behavior if I use Tomek above instead of CNN. Shouldn't they look very similar? Instead, Tomek looks like ENN.

amueller · 2017-03-20T15:37:59Z

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

glemaitre · 2017-03-20T20:55:19Z

To be honest, I did not implement it. Fernando did, while implementing SMOTE. I will check the literature in more depth. I had in mind that the method cleans the boundaries, which is not what your example shows.

glemaitre · 2017-03-20T21:23:21Z

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

So after checking several articles, the same definition came back. A pair of samples is a Tomek link if:

they are from two different classes,
they are reciprocally nearest neighbours.

Long story short, the linked points should only be borderline or noisy points. When under-sampling, this method is used to remove the point of each link that is not in the minority class.
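The definition and the removal step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in imbalanced-learn: the function names (`nearest_neighbour`, `tomek_links`, `remove_majority_of_links`) and the toy points are made up, and distances are plain Euclidean.

```python
import math


def nearest_neighbour(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    nn = [nearest_neighbour(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_of_links(points, labels, minority_label):
    """Return the indices kept after dropping the non-minority end of each link."""
    dropped = {k
               for pair in tomek_links(points, labels)
               for k in pair
               if labels[k] != minority_label}
    return [k for k in range(len(points)) if k not in dropped]


# Toy data (made up): class 0 is the majority, class 1 the minority.
points = [(-1.0, 0.0), (0.0, 0.0), (0.2, 0.0),
          (1.0, 0.0), (1.1, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 0, 1, 1]

links = tomek_links(points, labels)
kept = remove_majority_of_links(points, labels, minority_label=1)
print("links:", links)
print("kept:", kept, "->", [labels[k] for k in kept])
```

Only the borderline majority sample at index 3 is removed, and the result still has three samples of class 0 against two of class 1, so the output is cleaned but not balanced.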

If the reference is bad, this is my fault. All the different articles refer to that article and I did not check the original reference. I still have to check the code to be sure that this is what we implement.

glemaitre · 2017-03-20T21:32:51Z

Oh, I just found the source of my confusion. I was running your example and did not get the behaviour I described earlier. But this is expected, since you were running CNN:

# remove Tomek links
tl = TomekLinks(return_indices=True)
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)

glemaitre · 2017-03-20T21:51:07Z

Getting back to:

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

In the article of Tomek (1976), the definition of a Tomek link is given in section Method 1 $1.

This definition should not be confused with the actual algorithm presented in the paper, which seems to use those links in the inner loop of CNN to add the boundary samples. That algorithm is not implemented in our code.

We could probably extend CNN to take these links into account. We should also correct the reference for the Tomek link, since the code covers only a small part of the paper, which is not even stated as a formal definition there. Really confusing. The other fun thing is that people use Tomek links in the opposite way to how he used them in CNN :)

Bottom line -> we really need a User Guide with simple explanations of the methods and an intuitive explanation of the behaviour of the algorithms. I still have to learn from the scikit-learn endeavour.

It is pretty late here, I will try to correct those things tomorrow.

amueller · 2017-03-20T22:26:12Z

Thank you for researching this. Your explanation makes sense. I looked at the main part, not Method 1 $1, so that was indeed the confusion. And your explanations are in line with the behavior I was observing. I was just expecting something like CNN.

DOC/FIX fix Tomek links example

20117c0

chkoar reviewed Mar 20, 2017


EXA/FIX remove useless data

b39537f

chkoar merged commit a76693b into scikit-learn-contrib:master Mar 20, 2017

amueller reviewed Mar 20, 2017


glemaitre added a commit that referenced this pull request Mar 20, 2017

Revert "DOC/FIX fix Tomek links example (#255)"

642694d

This reverts commit a76693b.

glemaitre mentioned this pull request Mar 20, 2017

Revert "[MRG] DOC/FIX fix Tomek links example" #258

Closed

christophe-rannou pushed a commit to christophe-rannou/imbalanced-learn that referenced this pull request Apr 3, 2017

DOC/FIX fix Tomek links example (scikit-learn-contrib#255)

1b23efa

glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017

DOC/FIX fix Tomek links example (scikit-learn-contrib#255)

c0841f4

glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017

DOC/FIX fix Tomek links example (scikit-learn-contrib#255)

2508f96
