[MRG] DOC/FIX fix Tomek links example #255
Hello @glemaitre! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 20, 2017 at 13:13 Hours UTC
Codecov Report
@@           Coverage Diff           @@
##           master     #255   +/-   ##
=======================================
  Coverage   98.27%   98.27%
=======================================
  Files          58       58
  Lines        3429     3429
=======================================
  Hits         3370     3370
  Misses         59       59
ax2.set_title('Tomek links')

# create a synthetic dataset
X = np.array([[0.31230513, 0.1216318], [0.68481731, 0.51935141],
Do we need that?
So this is a balanced dataset, right? Which class do the removed samples belong to?
The result is not necessarily a balanced dataset. A Tomek link is a pair of samples of opposite classes that are reciprocal nearest neighbours.
I meant the input here was balanced, right? So which one is the majority class? How about:
"""
===========
Tomek links
===========
An illustration of the Tomek links method.
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour
print(__doc__)
rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
# remove Tomek links
tl = TomekLinks(return_indices=True)
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),
                                   idx_resampled)
idx_class_0 = y_resampled == 0
plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1],
            c='g', alpha=.8, label='Class #0')
plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1],
            c='b', alpha=.8, label='Class #1')
plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1],
            c='r', alpha=.4, label='Removed samples')
plt.title('Under-sampling removing Tomek links')
plt.legend()
plt.show()
idx_samples_removed = np.setdiff1d(np.flatnonzero(y == 1),
                                   np.union1d(idx_class_0, idx_class_1))

plt.scatter(X[idx_class_0, 0], X[idx_class_0, 1],
this should be X_resampled, I think.
This reverts commit a76693b.
Thanks for pointing those things out. I think I rushed the merge, my bad.
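I'm preparing a lecture on this, I might have some nice examples later today. I'll post a notebook but I'm not sure if I have time for PRs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from imblearn.under_sampling import (TomekLinks, CondensedNearestNeighbour,
                                     EditedNearestNeighbours)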
rng = np.random.RandomState(0)
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
def plot_resampled(X_org, y_org, X_res, y_res, idx, ax=None):
    if ax is None:
        ax = plt.gca()
    idx_samples_removed = np.setdiff1d(np.arange(X_org.shape[0]),
                                       idx)
    idx_class_0 = y_res == 0
    ax.scatter(X_res[idx_class_0, 0], X_res[idx_class_0, 1],
               c='g', alpha=.8, label='Class 0')
    ax.scatter(X_res[~idx_class_0, 0], X_res[~idx_class_0, 1],
               c='b', alpha=.8, label='Class 1')
    ax.scatter(X_org[idx_samples_removed, 0], X_org[idx_samples_removed, 1],
               c='g', alpha=.4, s=10, label='Samples removed from Class 0')
    ax.legend()


fig, ax = plt.subplots(1, 2)
enn = EditedNearestNeighbours(return_indices=True)
X_resampled, y_resampled, idx_resampled = enn.fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[0])
ax[0].set_title("Edited Nearest Neighbor")
X_resampled, y_resampled, idx_resampled = CondensedNearestNeighbour(
    return_indices=True).fit_sample(X_syn, y_syn)
plot_resampled(X_syn, y_syn, X_resampled, y_resampled, idx_resampled, ax=ax[1])
ax[1].set_title("Condensed Nearest Neighbor")
Hm, I'm still seeing strange behavior if I use Tomek above instead of CNN. Shouldn't they look very similar? Instead Tomek looks like ENN.
I'd appreciate it if you could clarify what method Tomek is. It really doesn't look like the one that's described in the paper you're linking.
To be honest, I did not implement it. Fernando did, while implementing SMOTE. I will check the literature in more depth. I had in mind that the method cleans the class boundaries, which is not what your example shows.
So after checking several articles, the same definition keeps coming back. A pair of samples is a Tomek link if:
Long story short, the linked samples should only be borderline or noisy points. For under-sampling, the method is used to remove the member of each link that does not belong to the minority class. If the reference is wrong, that is my fault: all the articles refer to that paper and I did not check the original reference. I still have to check the code to be sure that this is what we implement.
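For reference, the definition usually given in the literature is that two samples of different classes form a Tomek link when no third sample is closer to either of them than they are to each other, which amounts to being mutual nearest neighbours. The sketch below illustrates that reading on a toy dataset. It is plain Python and not the imblearn code; the helper names (`sq_dist`, `find_tomek_links`, `remove_majority_in_links`) and the data are made up for illustration, and ties in distance are simply resolved by index order.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_neighbour(X, i):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: sq_dist(X[i], X[j]))


def find_tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbour(X, i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Indices kept after dropping the majority-class member of each link."""
    drop = {k for pair in find_tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    return [k for k in range(len(X)) if k not in drop]


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (5.0, 5.0)]
y = [0, 0, 0, 1, 0, 0]

links = find_tomek_links(X, y)
kept = remove_majority_in_links(X, y, majority_label=0)
print(links)  # the only link is the borderline pair next to the minority point
print(kept)   # every sample except the majority member of that link
```

Samples 0 and 1 are also mutual nearest neighbours, but they share a label, so they are not a link; only the majority sample sitting right next to the minority point is removed.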
Oh, I just found the source of my confusion. I was running your example and did not get the behaviour I described earlier. But this is expected, since you were running CNN:
# remove Tomek links
tl = TomekLinks(return_indices=True)
tl = CondensedNearestNeighbour(return_indices=True)
X_resampled, y_resampled, idx_resampled = tl.fit_sample(X_syn, y_syn)
Getting back to:
In Tomek's article (1976), the definition of a Tomek link is given in section Method 1 $1. This definition should not be confused with the actual algorithm presented in the paper, which seems to use those links in the inner loop of CNN to add the boundary samples. That algorithm is not implemented in our code. We could probably extend CNN to take these links into account. We should also correct the reference for the Tomek links, since the code covers only a small part of the paper, one that is not even formally stated as a definition there. Really confusing. The other fun thing is that people use Tomek links in the opposite way to how he used them in CNN :)
Bottom line: we really need a User Guide with simple explanations of each method and an intuitive description of how the algorithm behaves. I still have to learn from the scikit-learn endeavour. It is pretty late here, I will try to correct those things tomorrow.
Thank you for researching this. Your explanation makes sense. I looked at the main part, not Method 1 $1, so that was indeed the confusion. And your explanations are in line with the behavior I was observing. I was just expecting something like CNN.
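To make the contrast with CNN concrete, here is a rough sketch of a condensed-nearest-neighbour style under-sampling pass on the same kind of toy data. It assumes the common under-sampling variant of Hart's rule: keep every minority sample, seed the store with one majority sample, then add any majority sample that the 1-NN rule on the current store misclassifies, repeating until nothing changes. The function names and data are hypothetical, the seed is deterministic (first majority sample) rather than random, and this is not meant to reproduce imblearn's `CondensedNearestNeighbour` or Tomek's modified algorithm.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_1nn(store, X, y, point):
    """Label of the stored sample closest to `point`."""
    nearest = min(store, key=lambda k: sq_dist(X[k], point))
    return y[nearest]


def condensed_nn(X, y, minority_label):
    """Indices kept by a CNN-style under-sampling of the majority class."""
    minority = [k for k in range(len(X)) if y[k] == minority_label]
    majority = [k for k in range(len(X)) if y[k] != minority_label]
    # Assumption: keep all minority samples and seed with one majority sample.
    store = minority + majority[:1]
    changed = True
    while changed:
        changed = False
        for k in majority:
            # Add a majority sample only if the current store misclassifies it.
            if k not in store and predict_1nn(store, X, y, X[k]) != y[k]:
                store.append(k)
                changed = True
    return sorted(store)


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (5.0, 5.0)]
y = [0, 0, 0, 1, 0, 0]

kept = condensed_nn(X, y, minority_label=1)
print(kept)
```

On this data the pass keeps the majority samples near the minority point and drops redundant interior ones (indices 1 and 5), whereas removing Tomek links drops the borderline majority sample instead, which matches the "opposite way" remark above.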
Reference Issue
Fixes #250
What does this implement/fix? Explain your changes.
Any other comments?