Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to the sort routine #5776

Merged
merged 3 commits into from
May 10, 2021
Merged

Conversation

pgovind
Copy link
Contributor

@pgovind pgovind commented Apr 29, 2021

Currently our sort routine doesn't handle null values in columns correctly. It assumes that all columns have the same NullCount. This PR removes this limitation.

Fixes recent reports from @asmirnov82. Also fixes #5655

@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Apr 29, 2021
@pgovind pgovind requested a review from eerhardt April 29, 2021 21:01
PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices);
for (long i = 0; i < nullIndices.Length; i++)
{
sortIndices.Append(nullIndices[i]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would cause null rows to be at the top for a descending sort. I have to think about if this is acceptable or if we want to change this behavior. Just calling it out here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the moment, I think we can leave the null rows at the top, until we get feedback about it (if any). To move the null rows to the bottom, we'd need to thread the nullIndices column into the Clone API implementations, and I don't think it's worth it currently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Pandas puts them in the 'last' position by default, but you can change it with an option.

@pgovind
Copy link
Contributor Author

pgovind commented May 1, 2021

The unit test failure must be the descending sort unit test looking for null at the expected index. I'll fix it early next week

PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices);
for (long i = 0; i < nullIndices.Length; i++)
{
sortIndices.Append(nullIndices[i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an AppendRange so we don't have to loop over each one and append?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, no. Does seem like a good API to add though.


/// <summary>
/// Returns the indices of non-null values that, when applied, result in this column being sorted in ascending order
/// </summary>
internal virtual PrimitiveDataFrameColumn<long> GetAscendingSortIndices() => throw new NotImplementedException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we do with the null values in this API? Are they just dropped? When is that valuable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, they are just dropped. At the moment it's used for Median. We get the ascending sort indices and use that to calculate the median. I don't have strong opinions about keeping/removing this API. My reasoning to keep it was that I didn't want to break anyone who's overriding this API, so I made a new one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning to keep it was that I didn't want to break anyone who's overriding this API,

It's internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, my bad. For some reason I thought it was protected. Fixed now. I added the nullIndices as an out parameter to the existing method.

@codecov
Copy link

codecov bot commented May 3, 2021

Codecov Report

Merging #5776 (bec63e7) into main (ebc431f) will increase coverage by 0.07%.
The diff coverage is 94.59%.

@@            Coverage Diff             @@
##             main    #5776      +/-   ##
==========================================
+ Coverage   68.38%   68.46%   +0.07%     
==========================================
  Files        1131     1131              
  Lines      241019   241553     +534     
  Branches    25024    25175     +151     
==========================================
+ Hits       164822   165368     +546     
+ Misses      69714    69712       -2     
+ Partials     6483     6473      -10     
Flag Coverage Δ
Debug 68.46% <94.59%> (+0.07%) ⬆️
production 63.08% <89.18%> (+0.05%) ⬆️
test 89.27% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrameColumn.cs 66.76% <0.00%> (+5.16%) ⬆️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 81.11% <0.00%> (+2.39%) ⬆️
src/Microsoft.Data.Analysis/DataFrame.cs 86.40% <100.00%> (+0.14%) ⬆️
...oft.Data.Analysis/PrimitiveDataFrameColumn.Sort.cs 84.81% <100.00%> (+2.72%) ⬆️
...c/Microsoft.Data.Analysis/StringDataFrameColumn.cs 66.58% <100.00%> (+1.69%) ⬆️
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 100.00% <100.00%> (+0.05%) ⬆️
src/Microsoft.ML.AutoML/Experiment/Experiment.cs 72.38% <0.00%> (-0.41%) ⬇️
...crosoft.ML.StandardTrainers/Optimizer/Optimizer.cs 73.12% <0.00%> (ø)
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 88.56% <0.00%> (ø)
... and 9 more

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@pgovind
Copy link
Contributor Author

pgovind commented May 5, 2021

IIRC, when I onboarded arcade to corefxlab, there was a setting in arcade that forwarded the errors in CI to AzDO such that we could see them in AzDO/GH. @safern helped me set it up. Right now the CI failure here is telling me that some test in ML failed, with no indication of what happened. Would it be beneficial to have the failure information show up the same way we have it set up in runtime here? @michaelgsharp

@pgovind
Copy link
Contributor Author

pgovind commented May 5, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@pgovind
Copy link
Contributor Author

pgovind commented May 6, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@pgovind
Copy link
Contributor Author

pgovind commented May 7, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@pgovind pgovind merged commit 43c49f6 into dotnet:main May 10, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorting DataFrame that has null values in numerical columns
2 participants