
[MRG] DOC/FIX fix Tomek links example #255

Merged · 2 commits merged into scikit-learn-contrib:master on Mar 20, 2017

Conversation

glemaitre (Member)

Reference Issue

Fixes #250

What does this implement/fix? Explain your changes.

Any other comments?

@pep8speaks commented Mar 20, 2017

Hello @glemaitre! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 20, 2017 at 13:13 Hours UTC

@codecov bot commented Mar 20, 2017

Codecov Report

Merging #255 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #255   +/-   ##
=======================================
  Coverage   98.27%   98.27%           
=======================================
  Files          58       58           
  Lines        3429     3429           
=======================================
  Hits         3370     3370           
  Misses         59       59


ax2.set_title('Tomek links')

# create a synthetic dataset
X = np.array([[0.31230513, 0.1216318], [0.68481731, 0.51935141],
Member: Do we need that?

@chkoar chkoar merged commit a76693b into scikit-learn-contrib:master Mar 20, 2017
@amueller (Member)

So this is a balanced dataset, right? Which class do the removed samples belong to?

@glemaitre (Member, Author)

The result is not necessarily a balanced dataset.
The samples removed belong to the majority class; the minority class samples are not evaluated.

And a Tomek link is defined as a pair of samples that are reciprocal nearest neighbours of each other.
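For intuition, here is a minimal sketch of that definition (not imbalanced-learn's implementation; the toy data and names are assumptions for illustration):

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# toy two-class dataset, purely for illustration
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

nn = NearestNeighbors(n_neighbors=2).fit(X)
# column 0 of the result is each point itself; column 1 is its nearest other point
nearest = nn.kneighbors(X, return_distance=False)[:, 1]

# a Tomek link: two samples from different classes that are each
# other's nearest neighbour
tomek_links = [(i, j) for i, j in enumerate(nearest)
               if nearest[j] == i and y[i] != y[j] and i < j]
print(tomek_links)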

@amueller (Member)

I meant the input here was balanced, right? So which one is the majority class?

How about

"""
===========
Tomek links
===========
An illustration of the Tomek links method.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour

print(__doc__)

rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * n_samples_1 + [1] * n_samples_2)
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn)


# remove Tomek links
tl = TomekLinks(return_indices=True)
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),
                                   idx_resampled)
idx_class_0 = y_resampled == 0
plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1],
            c='g', alpha=.8, label='Class #0')
plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1],
            c='b', alpha=.8, label='Class #1')
plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1],
            c='r', alpha=.4, label='Removed samples')

plt.title('Under-sampling removing Tomek links')
plt.legend()

plt.show()

[figure: scatter plot 'Under-sampling removing Tomek links', showing Class #0, Class #1, and the removed samples]

idx_samples_removed = np.setdiff1d(np.flatnonzero(y == 1),
                                   np.union1d(idx_class_0, idx_class_1))

plt.scatter(X[idx_class_0, 0], X[idx_class_0, 1],
Member:
this should be X_resampled, I think.

glemaitre added a commit that referenced this pull request Mar 20, 2017
@glemaitre (Member, Author)

Thanks for pointing those things out. I think I rushed the merge, my bad.
Thanks for the example fix.

@amueller (Member)

I'm preparing a lecture on this, I might have some nice examples later today. I'll post a notebook but I'm not sure if I have time for PRs.
I think it would be nice to point out that Tomek and ENN do basically the exact opposite of each other:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from imblearn.under_sampling import (TomekLinks, CondensedNearestNeighbour,
                                     EditedNearestNeighbours)

rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * n_samples_1 + [1] * n_samples_2)
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn)

def plot_resampled(X_org, y_org, X_res, y_res, idx, ax=None):
    if ax is None:
        ax = plt.gca()
    idx_samples_removed = np.setdiff1d(np.arange(X_org.shape[0]),
                                       idx)
    idx_class_0 = y_res == 0
    ax.scatter(X_res[idx_class_0, 0], X_res[idx_class_0, 1],
               c='g', alpha=.8, label='Class 0')
    ax.scatter(X_res[~idx_class_0, 0], X_res[~idx_class_0, 1],
               c='b', alpha=.8, label='Class 1')
    ax.scatter(X_org[idx_samples_removed, 0], X_org[idx_samples_removed, 1],
               c='g', alpha=.4, s=10, label='Samples removed from Class 0')
    ax.legend()

fig, ax = plt.subplots(1, 2)
enn = EditedNearestNeighbours(return_indices=True)
X_resampled, y_resampled, idx_resampled = enn.fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[0])
ax[0].set_title("Edited Nearest Neighbor")
cnn = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = cnn.fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[1])
ax[1].set_title("Condensed Nearest Neighbor")

[figure: two panels, 'Edited Nearest Neighbor' and 'Condensed Nearest Neighbor', each showing the resampled classes and the samples removed from Class 0]

@amueller (Member)

Hm, I'm still seeing strange behavior if I use Tomek above instead of CNN. Shouldn't they look very similar? Instead, Tomek looks like ENN.

@amueller (Member)

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

@glemaitre (Member, Author) commented Mar 20, 2017 via email

@glemaitre (Member, Author)

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

So after checking several articles, the same definition comes back. A pair of samples is a Tomek link if:

  • they are from two different classes,
  • they are reciprocal nearest neighbours of each other.

Long story short, such pairs should only be borderline or noisy points. For under-sampling, this method has been used to remove the point of the link which is not in the minority class.

If the reference is bad, this is my fault: all the different articles refer to that article, and I did not check the original reference. I still have to check the code to be sure that we actually implement this.
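As a sanity check on that behaviour, here is a hedged sketch using the return_indices/fit_sample API shown earlier in this thread (the synthetic data is an assumption mirroring the example above):

import numpy as np

from imblearn.under_sampling import TomekLinks

# imbalanced toy data: class 0 is the majority class
rng = np.random.RandomState(0)
X = np.r_[1.5 * rng.randn(500, 2), 0.5 * rng.randn(50, 2) + [2, 2]]
y = np.array([0] * 500 + [1] * 50)

tl = TomekLinks(return_indices=True)
X_res, y_res, idx_kept = tl.fit_sample(X, y)

# samples dropped by TomekLinks should all come from the majority class
idx_removed = np.setdiff1d(np.arange(len(y)), idx_kept)
print(np.unique(y[idx_removed]))  # expected: [0]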

@glemaitre (Member, Author)

Oh, I just found the source of my confusion. I was running your example and was not getting the behaviour that I described earlier. But that is expected, since you were actually running CNN:

# remove Tomek links
tl = TomekLinks(return_indices=True)
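# NB: the next line rebinds `tl`, so CNN (not Tomek links) is what actually runs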
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)

@glemaitre (Member, Author) commented Mar 20, 2017

Getting back to:

I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.

In Tomek's 1976 article, the definition of a Tomek link is given in Method 1, §1.

This definition should not be confused with the actual algorithm presented in the paper, which seems to add those links in the inner loop of CNN in order to pick up the boundary samples. That algorithm is not what we implement.

We could probably extend CNN to take these links into account. We should also correct the reference for Tomek links, since the code covers only a small part of the paper, one which is not even stated there as a definition. Really confusing. The other funny thing is that people have used Tomek links in the opposite way to how he used them in CNN :)

Bottom line -> we really need a User Guide with simple explanations of each method and an intuitive description of the behaviour of the algorithms. I still have to learn from the scikit-learn endeavour.

It is pretty late here, I will try to correct those things tomorrow.

@amueller (Member)

Thank you for researching this. Your explanation makes sense. I looked at the main part, not Method 1, §1, so that was indeed the source of the confusion. And your explanation is in line with the behavior I was observing; I was just expecting something like CNN.

christophe-rannou pushed a commit to christophe-rannou/imbalanced-learn that referenced this pull request Apr 3, 2017
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
glemaitre added a commit to glemaitre/imbalanced-learn that referenced this pull request Jun 15, 2017
Development

Successfully merging this pull request may close these issues.

Tomek example unclear
4 participants