DOC update the datasets used in CondensedNearestNeighbour docstring #1041
Conversation
It seems some tests have failed, but these look unrelated to the suggested changes.
Yes, the test is failing due to a numerical issue.
@@ -110,17 +110,19 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
     Examples
     --------
     >>> from collections import Counter # doctest: +SKIP
-    >>> from sklearn.datasets import fetch_mldata # doctest: +SKIP
+    >>> from sklearn.datasets import make_classification # doctest: +SKIP
Instead, I would prefer that we use fetch_openml and still fetch the Pima dataset.
It should look like:

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import scale
X, y = fetch_openml("diabetes", version=1, return_X_y=True)
X = scale(X)

In this manner, we would still be dealing with the same dataset.
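For context, a minimal runnable sketch of the full example this suggestion points toward might look like the following. The CondensedNearestNeighbour import path and fit_resample call follow the imbalanced-learn API; the random_state value is an arbitrary illustrative choice, and the printed counts are discussed further down in this thread:

from collections import Counter
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import scale
from imblearn.under_sampling import CondensedNearestNeighbour

# Fetch the Pima "diabetes" dataset from OpenML and standardize the features
X, y = fetch_openml("diabetes", version=1, return_X_y=True)
X = scale(X)
print(f"Original dataset shape {Counter(y)}")

# Under-sample the majority class with condensed nearest neighbours;
# random_state=42 is an arbitrary choice for reproducibility
cnn = CondensedNearestNeighbour(random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print(f"Resampled dataset shape {Counter(y_res)}")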
Used this dataset now. As opposed to

Resampled dataset shape Counter({-1: 268, 1: 227})

in the original example, the counts have now changed to

Resampled dataset shape Counter({'tested_positive': 268, 'tested_negative': 181})

It also seems that omitting the scaling does not reproduce the counts of the original example. This might be due to a difference from the original dataset, which can no longer be found; note that the labels have changed as well.
Not a huge deal (we might not be applying exactly the same scaling), but this is closer to the original example. Updating the labels and counts would be enough here.
Thanks @eroell, LGTM.
Reference Issue
Fixes #1040.
What does this implement/fix? Explain your changes.
This is a suggestion to use the same dataset generator, sklearn.datasets.make_classification, as is used in the docstring of e.g. ClusterCentroids, instead of the dead sklearn.datasets.fetch_mldata.
Any other comments?
This is a suggestion, if you are interested in a low-cost fix. There may be more interesting datasets which could be used, or ones you find more illustrative.
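For reference, a sketch of what the make_classification-based example could look like, mirroring the generator parameters used in the ClusterCentroids docstring; the exact parameter values below are illustrative and not necessarily what was merged:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Generate an imbalanced two-class problem (parameters mirror the
# ClusterCentroids docstring example; values are illustrative)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print(f"Original dataset shape {Counter(y)}")

cnn = CondensedNearestNeighbour(random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print(f"Resampled dataset shape {Counter(y_res)}")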
By the way, pycodestyle gave me a few E501 warnings about lines longer than 79 characters, but your checks appear to use a maximum line length of 88, which I think is also fine.