-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to the sort routine #5776
Conversation
PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices); | ||
for (long i = 0; i < nullIndices.Length; i++) | ||
{ | ||
sortIndices.Append(nullIndices[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would cause null
rows to be at the top for a descending sort. I have to think about if this is acceptable or if we want to change this behavior. Just calling it out here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the moment, I think we can leave the null rows at the top, until we get feedback about it (if any). To move the null rows to the bottom, we'd need to thread the nullIndices
column into the Clone
API implementations, and I don't think it's worth it currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like Pandas puts them in the 'last' position by default, but you can change it with an option.
The unit test failure must be the descending sort unit test looking for |
PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices); | ||
for (long i = 0; i < nullIndices.Length; i++) | ||
{ | ||
sortIndices.Append(nullIndices[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an AppendRange
so we don't have to loop over each one and append?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, no. Does seem like a good API to add though.
|
||
/// <summary> | ||
/// Returns the indices of non-null values that, when applied, result in this column being sorted in ascending order | ||
/// </summary> | ||
internal virtual PrimitiveDataFrameColumn<long> GetAscendingSortIndices() => throw new NotImplementedException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do with the null
values in this API? Are they just dropped? When is that valuable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, they are just dropped. At the moment it's used for Median
. We get the ascending sort indices and use that to calculate the median. I don't have strong opinions about keeping/removing this API. My reasoning to keep it was that I didn't want to break anyone who's overriding this API, so I made a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning to keep it was that I didn't want to break anyone who's overriding this API,
It's internal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, my bad. For some reason I thought it was protected
. Fixed now. I added the nullIndices
as an out parameter to the existing method.
Codecov Report
@@ Coverage Diff @@
## main #5776 +/- ##
==========================================
+ Coverage 68.38% 68.46% +0.07%
==========================================
Files 1131 1131
Lines 241019 241553 +534
Branches 25024 25175 +151
==========================================
+ Hits 164822 165368 +546
+ Misses 69714 69712 -2
+ Partials 6483 6473 -10
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, when I onboarded arcade to corefxlab, there was a setting in arcade that forwarded the errors in CI to AzDO such that we could see them in AzDO/GH. @safern helped me set it up. Right now the CI failure here is telling me that some test in ML failed, with no indication of what happened. Would it be beneficial to have the failure information show up the same way we have it set up in runtime here? @michaelgsharp |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
Currently our sort routine doesn't handle
null
values in columns correctly. It assumes that all columns have the sameNullCount
. This PR removes this limitation.Fixes recent reports from @asmirnov82. Also fixes #5655