-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combined methods related to splitting data into one single method. Also fixed related issues. #5227
Merged
Merged
Changes from 2 commits
Commits
Show all changes
14 commits
Select commit
Hold shift + click to select a range
d2fb318
Fixed issue
antoniovs1029 e9454a0
Added test
antoniovs1029 8a747e5
Added LightGBM Fact Attribute to test
antoniovs1029 1f94584
Added test for Key type SamplingKeyColumn on TrainTestSplit
antoniovs1029 033ae83
Added test for CV and that SamplingKeyColumn isn't removed by running…
antoniovs1029 1638341
Added assertion for non-emptiness and fixed mini-issue with the MinMa…
antoniovs1029 5e9a8e4
Removed CreateStratificationColumn and use EnsureGroupPreservationCol…
antoniovs1029 cc9f6ed
Change MAML to also use EnsureGroupPreservationColumn method
antoniovs1029 37332c1
Change name EnsureGroupPreservationColumn to CreateGroupPreservationC…
antoniovs1029 177fa04
Fallback on EnvSeed when calling through TrainCatalog (fix Tensorflow…
antoniovs1029 e2a76c7
Normalize even the column was already normalized
antoniovs1029 c7e0437
Merge remote-tracking branch 'upstream/master' into forNimbusML
antoniovs1029 2cf9d6a
* Drop newSamplingKeyColumn after splitting dataset to:
antoniovs1029 3656236
Renamed CreatePreservationColumn to CreatSplitColumn and other simila…
antoniovs1029 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be added to
EnsureGroupPreservationColumn
as well?And more generally, would it make sense to unify the two methods? There is also a method called
GetSplitColumn
in line 285 in CrossValidationCommand.cs that does more or less the same thing. I think the same utility method should be called for maml, entry points and the C# API. #ResolvedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I took a quick look into the
EnsureGroupPreservationColumn
callers, and it doesn't seem that they drop the created column, so this wouldn't be an issue there. I'll look into it more later to confirm this.About merging these methods, I think it would be a good idea to maintain consistency across the codebase, although it wouldn't be necessary to fix the issue opened by @ganik to unblock NimbusML update. I'll look into this idea.
In reply to: 438552643 [](ancestors = 438552643)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that when
EnsureGroupPreservationColumn
was called, thesamplingKeyColumn
was never dropped, so the issue didn't apply on that case.Also, I've merged the methods you've mentioned, so now only
CreateGroupPreservationColumn
existsIn reply to: 438969558 [](ancestors = 438969558,438552643)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I've renamed
CreateGroupPreservationColumn
toCreateSplitColumn
and made similar renames elsewhere to make the code easier to read. Now "splitColumn" is the name of the temporary column we create for splitting, regardless if it's based-off a "samplingKeyColumn" (ML.NET) or a "stratificationColumn" (NimbusML, Maml, legacy TLC naming conventions...)Also now I do drop the new "splitColumn" when it's created while splitting through ML.NET's API, because not dropping it was causing other issues... see comment:
#5227 (comment)
Added 2 tests to cheked this is dropped, and updated my PR description to explain all the issues it's now fixing.
In reply to: 441992076 [](ancestors = 441992076,438969558,438552643)