
DOC udpate the datasets used in CondensedNearestNeighbour docstring #1041

Merged · 5 commits · Sep 7, 2023

Conversation

eroell (Contributor) commented Sep 5, 2023

Reference Issue

Fixes #1040.

What does this implement/fix? Explain your changes.

This suggests using the same dataset generator, sklearn.datasets.make_classification, that is already used in the docstrings of e.g. ClusterCentroids, instead of the dead sklearn.datasets.fetch_mldata.

Any other comments?

This is a suggestion, if you are interested in a low-cost fix. There may be more interesting datasets which could be used, or ones you find more illustrative.

By the way, pycodestyle gave me a few E501 warnings about lines longer than 79 characters, but your checks appear to use a maximum line length of 88, which seems fine.

eroell (Contributor, Author) commented Sep 6, 2023

Some tests have failed, but they appear to be unrelated to the suggested changes.

glemaitre (Member)

Yes, the test is failing due to a numerical issue: 315 / 404 is 0.779 while we expect > 0.78.
Not a big deal for this PR.

@@ -110,17 +110,19 @@ class CondensedNearestNeighbour(BaseCleaningSampler):
Examples
--------
>>> from collections import Counter # doctest: +SKIP
>>> from sklearn.datasets import fetch_mldata # doctest: +SKIP
>>> from sklearn.datasets import make_classification # doctest: +SKIP
glemaitre (Member) commented on this diff:

Instead I would prefer that we use fetch_openml and still fetch the pima dataset.
It should look like:

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import scale

X, y = fetch_openml("diabetes", version=1, return_X_y=True)
X = scale(X)

In this manner, we should still deal with the same dataset.

eroell (Contributor, Author) commented Sep 7, 2023

I use this dataset now. Whereas the original example reported
Resampled dataset shape Counter({-1: 268, 1: 227}),
the counts have now changed to
Resampled dataset shape Counter({'tested_positive': 268, 'tested_negative': 181}).

Omitting the scaling also does not reproduce the counts of the original example. This might be due to differences from the original dataset, which is no longer retrievable; note that the labels have changed as well.

glemaitre (Member) replied:
Not a huge deal (we might not be applying exactly the same scaling), but this is closer to the original example. Updating the labels and counts would be enough here.

@glemaitre glemaitre changed the title [MRG] suggested new dataset in doc of CondensedNearestNeighbour DOC udpate the datasets used in CondensedNearestNeighbour docstring Sep 7, 2023
@eroell eroell marked this pull request as draft September 7, 2023 09:53
@eroell eroell marked this pull request as ready for review September 7, 2023 11:07
@glemaitre glemaitre merged commit dcb476d into scikit-learn-contrib:master Sep 7, 2023
glemaitre (Member)

Thanks @eroell LGTM.

Development

Successfully merging this pull request may close these issues.

Dead dataset used in documentation
2 participants