-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] New Feature StratifiedKFold #3109
[REVIEW] New Feature StratifiedKFold #3109
Conversation
sync with upstream
sync with upstream
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #3109 +/- ##
===============================================
- Coverage 59.20% 58.99% -0.21%
===============================================
Files 142 142
Lines 8966 9002 +36
===============================================
+ Hits 5308 5311 +3
- Misses 3658 3691 +33
Continue to review full report at Codecov.
|
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
sync with upstream
Sync with upstream
Sync with Branch 0.18
@daxiongshu, is there any reason this PR is targeting |
This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d. |
This PR has been labeled |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Removing |
self.n_splits = n_splits | ||
self.shuffle = shuffle | ||
self.seed = random_state | ||
self.tpb = 64 # threads per bloc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this configurable or why do we save it as an attribute of the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I was hesitant to make it configurable or not. I think setting it to 64 would be just fine. I'll make the change.
""" | ||
self._check_array_shape(y) | ||
if isinstance(y, cudf.DataFrame) or isinstance(y, cudf.Series): | ||
data = y.values.ravel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason we can't use existing output functionality the need for ravel
?
|
||
|
||
@pytest.mark.parametrize("shuffle", [True, False]) | ||
@pytest.mark.parametrize("n_splits", [3, 5, 10]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this values? Seems like 3, 5, and 10 are testing essentially the same thing.
Would be good to have tests of invalid input, say 0 splits for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these values are the most common choices in kaggle competitions. Yes, I agree they mean the same thing. I'll add a separate test for the invalid number of folds.
rerun test |
rerun tests |
rerun tests |
rerun tests |
@gpucibot merge |
Add equivalent of [sklearn's StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to `cuml`. Authors: - Jiwei Liu (https://github.com/daxiongshu) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: rapidsai#3109
Add equivalent of sklearn's StratifiedKFold to
cuml
.