Improvements to the sort routine #5776

pgovind · 2021-04-29T21:01:20Z

Currently our sort routine doesn't handle null values in columns correctly. It assumes that all columns have the same NullCount. This PR removes this limitation.

Fixes recent reports from @asmirnov82. Also fixes #5655

pgovind · 2021-05-01T00:04:17Z

src/Microsoft.Data.Analysis/DataFrame.cs

+            PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices);
+            for (long i = 0; i < nullIndices.Length; i++)
+            {
+                sortIndices.Append(nullIndices[i]);


This would cause null rows to be at the top for a descending sort. I need to think about whether this is acceptable or whether we want to change this behavior. Just calling it out here.

For the moment, I think we can leave the null rows at the top, until we get feedback about it (if any). To move the null rows to the bottom, we'd need to thread the nullIndices column into the Clone API implementations, and I don't think it's worth it currently.

Looks like pandas puts them last by default (`na_position='last'`), but you can change that with the `na_position` option.
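
To make the trade-off in this thread concrete, here is a minimal sketch on plain lists, not the Microsoft.Data.Analysis implementation. `NullAwareSort`, `SortIndices`, and the `nullsLast` flag are hypothetical names chosen for illustration (the flag is loosely modeled on the pandas option mentioned above). It assumes the approach described in the diff: sort the non-null values, collect the null indices separately, then decide where to place them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullAwareSort
{
    // Hypothetical helper: returns the indices of the non-null values in
    // ascending order of value, and reports the null indices separately.
    public static List<long> GetAscendingSortIndices(
        IReadOnlyList<int?> values, out List<long> nullIndices)
    {
        var nonNull = new List<long>();
        nullIndices = new List<long>();
        for (int i = 0; i < values.Count; i++)
        {
            if (values[i].HasValue) nonNull.Add(i);
            else nullIndices.Add(i);
        }
        // Order the surviving indices by the value each one points at.
        return nonNull.OrderBy(i => values[(int)i].Value).ToList();
    }

    // Hypothetical helper: builds the final row order. The nulls are placed
    // explicitly, so a descending sort does not have to move them to the top.
    public static List<long> SortIndices(
        IReadOnlyList<int?> values, bool ascending, bool nullsLast)
    {
        List<long> sorted = GetAscendingSortIndices(values, out List<long> nullIndices);
        if (!ascending) sorted.Reverse(); // reverse only the non-null part

        var result = new List<long>(values.Count);
        if (!nullsLast) result.AddRange(nullIndices);
        result.AddRange(sorted);
        if (nullsLast) result.AddRange(nullIndices);
        return result;
    }
}
```

For the input `{ 3, null, 1, null, 2 }`, the non-null ascending indices are `2, 4, 0` and the null indices are `1, 3`. Appending the nulls and then reversing the whole list for a descending sort would give `3, 1, 0, 4, 2` (nulls at the top), whereas `SortIndices(values, ascending: false, nullsLast: true)` gives `0, 4, 2, 1, 3`.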

pgovind · 2021-05-01T00:12:59Z

The unit test failure must be the descending sort unit test looking for null at the expected index. I'll fix it early next week.

eerhardt · 2021-05-03T18:49:15Z

src/Microsoft.Data.Analysis/DataFrame.cs

+            PrimitiveDataFrameColumn<long> sortIndices = column.GetAscendingSortIndices(out Int64DataFrameColumn nullIndices);
+            for (long i = 0; i < nullIndices.Length; i++)
+            {
+                sortIndices.Append(nullIndices[i]);


Is there an AppendRange so we don't have to loop over each one and append?

Unfortunately, no. Does seem like a good API to add though.

eerhardt · 2021-05-03T18:59:58Z

src/Microsoft.Data.Analysis/DataFrameColumn.cs

+
+        /// <summary>
+        /// Returns the indices of non-null values that, when applied, result in this column being sorted in ascending order
+        /// </summary>
        internal virtual PrimitiveDataFrameColumn<long> GetAscendingSortIndices() => throw new NotImplementedException();


What do we do with the null values in this API? Are they just dropped? When is that valuable?

Yup, they are just dropped. At the moment it's used for Median. We get the ascending sort indices and use that to calculate the median. I don't have strong opinions about keeping/removing this API. My reasoning to keep it was that I didn't want to break anyone who's overriding this API, so I made a new one.

> My reasoning to keep it was that I didn't want to break anyone who's overriding this API,

It's internal.

Huh, my bad. For some reason I thought it was protected. Fixed now. I added the nullIndices as an out parameter to the existing method.
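
The Median use case mentioned above can be sketched as follows. This is an illustration on a plain list, not the library's code: `MedianSketch` and its `Median` method are hypothetical names, and the sketch assumes only what the thread states, namely that the sort indices cover non-null values and that the median is read off those indices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MedianSketch
{
    // Hypothetical helper: median of the non-null values, computed from
    // ascending sort indices. Null entries are dropped before sorting.
    public static double? Median(IReadOnlyList<double?> values)
    {
        // Indices of non-null entries, ordered by the value they point at.
        List<int> sortIndices = Enumerable.Range(0, values.Count)
            .Where(i => values[i].HasValue)
            .OrderBy(i => values[i].Value)
            .ToList();

        int n = sortIndices.Count;
        if (n == 0) return null; // no non-null values, so no median

        double upper = values[sortIndices[n / 2]].Value;
        if (n % 2 == 1) return upper; // odd count: the middle element

        // Even count: average the two middle elements.
        double lower = values[sortIndices[n / 2 - 1]].Value;
        return (lower + upper) / 2.0;
    }
}
```

Dropping the nulls is what makes the middle position meaningful here: for `{ 4, null, 1, 3, null, 2 }` the non-null values sort to 1, 2, 3, 4 and the median is 2.5, regardless of how many nulls the column holds.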

codecov · 2021-05-03T19:02:49Z

Codecov Report

Merging #5776 (bec63e7) into main (ebc431f) will increase coverage by 0.07%.
The diff coverage is 94.59%.

@@            Coverage Diff             @@
##             main    #5776      +/-   ##
==========================================
+ Coverage   68.38%   68.46%   +0.07%     
==========================================
  Files        1131     1131              
  Lines      241019   241553     +534     
  Branches    25024    25175     +151     
==========================================
+ Hits       164822   165368     +546     
+ Misses      69714    69712       -2     
+ Partials     6483     6473      -10

| Flag | Coverage Δ | |
|---|---|---|
| Debug | `68.46% <94.59%> (+0.07%)` | ⬆️ |
| production | `63.08% <89.18%> (+0.05%)` | ⬆️ |
| test | `89.27% <100.00%> (+0.06%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Microsoft.Data.Analysis/DataFrameColumn.cs | `66.76% <0.00%> (+5.16%)` | ⬆️ |
| ...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs | `81.11% <0.00%> (+2.39%)` | ⬆️ |
| src/Microsoft.Data.Analysis/DataFrame.cs | `86.40% <100.00%> (+0.14%)` | ⬆️ |
| ...oft.Data.Analysis/PrimitiveDataFrameColumn.Sort.cs | `84.81% <100.00%> (+2.72%)` | ⬆️ |
| ...c/Microsoft.Data.Analysis/StringDataFrameColumn.cs | `66.58% <100.00%> (+1.69%)` | ⬆️ |
| ...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs | `100.00% <100.00%> (+0.05%)` | ⬆️ |
| src/Microsoft.ML.AutoML/Experiment/Experiment.cs | `72.38% <0.00%> (-0.41%)` | ⬇️ |
| ...crosoft.ML.StandardTrainers/Optimizer/Optimizer.cs | `73.12% <0.00%> (ø)` | |
| ...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs | `88.56% <0.00%> (ø)` | |
| ... and 9 more | | |

pgovind · 2021-05-05T17:50:06Z

IIRC, when I onboarded arcade to corefxlab, there was a setting in arcade that forwarded the errors in CI to AzDO such that we could see them in AzDO/GH. @safern helped me set it up. Right now the CI failure here is telling me that some test in ML failed, with no indication of what happened. Would it be beneficial to have the failure information show up the same way we have it set up in runtime here? @michaelgsharp

pgovind · 2021-05-05T21:04:49Z

/azp run

azure-pipelines · 2021-05-05T21:05:31Z

Azure Pipelines successfully started running 2 pipeline(s).

pgovind · 2021-05-06T22:58:10Z

/azp run

azure-pipelines · 2021-05-06T22:58:22Z

Azure Pipelines successfully started running 2 pipeline(s).

pgovind · 2021-05-07T16:57:08Z

/azp run

azure-pipelines · 2021-05-07T16:57:21Z

Azure Pipelines successfully started running 2 pipeline(s).

Improvements to the sort routine

8b6b3f8

pgovind added the Microsoft.Data.Analysis label (All DataFrame related issues and PRs) Apr 29, 2021

pgovind requested a review from eerhardt April 29, 2021 21:01

pgovind commented May 1, 2021

Fix unit test

0df5f7a

eerhardt reviewed May 3, 2021

Fold into existing API

bec63e7

eerhardt approved these changes May 5, 2021

pgovind merged commit 43c49f6 into dotnet:main May 10, 2021

ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
