Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to the Merge routine #5778

Merged
merged 4 commits into from
May 6, 2021
Merged

Conversation

pgovind
Copy link

@pgovind pgovind commented Apr 30, 2021

Currently, we don't differentiate between null values and default values in columns. As a result, these values are considered equivalent in the Merge routine. This PR fixes that by handling them separately. Also added unit tests so we don't regress in the future.

Fixes #5765 and contributes to #5691

@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Apr 30, 2021
@pgovind pgovind requested a review from eerhardt April 30, 2021 23:51
@@ -1152,7 +1152,7 @@ public void TestGroupBy()
if (originalColumn.Name == "Bool")
continue;
DataFrameColumn headColumn = head.Columns[originalColumn.Name];
Assert.Equal(originalColumn[5], headColumn[verify[5]]);
Assert.Equal(originalColumn[7], headColumn[verify[5]]);
Copy link
Author

@pgovind pgovind Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 -> 7 because null is not equivalent to default in the GroupBy method anymore

@@ -1658,7 +1658,7 @@ public void TestMerge()
Assert.Equal(16, merge.Rows.Count);
Assert.Equal(merge.Columns.Count, left.Columns.Count + right.Columns.Count);
Assert.Null(merge.Columns["Int_left"][12]);
Assert.Null(merge.Columns["Int_left"][5]);
Assert.Null(merge.Columns["Int_left"][15]);
Copy link
Author

@pgovind pgovind Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 -> 15 because we don't preserve the ordering of rows in OuterJoin, so rows with null values in the Column we are merging on will always go to the end. I don't think this needs to be a requirement. If somebody requests it at some point, we can think about preserving the ordering then.

@codecov
Copy link

codecov bot commented May 1, 2021

Codecov Report

Merging #5778 (822385c) into main (9ece0ff) will increase coverage by 0.07%.
The diff coverage is 85.15%.

@@            Coverage Diff             @@
##             main    #5778      +/-   ##
==========================================
+ Coverage   68.33%   68.40%   +0.07%     
==========================================
  Files        1131     1131              
  Lines      241008   241156     +148     
  Branches    25025    25033       +8     
==========================================
+ Hits       164686   164957     +271     
+ Misses      69824    69716     -108     
+ Partials     6498     6483      -15     
Flag Coverage Δ
Debug 68.40% <85.15%> (+0.07%) ⬆️
production 63.03% <70.31%> (+0.06%) ⬆️
test 89.24% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rosoft.Data.Analysis/ArrowStringDataFrameColumn.cs 64.00% <0.00%> (-1.11%) ⬇️
src/Microsoft.Data.Analysis/DataFrameColumn.cs 61.60% <0.00%> (ø)
...c/Microsoft.Data.Analysis/StringDataFrameColumn.cs 63.54% <0.00%> (-1.36%) ⬇️
src/Microsoft.Data.Analysis/DataFrame.Join.cs 94.73% <89.41%> (-3.16%) ⬇️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 79.03% <100.00%> (+0.31%) ⬆️
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.94% <100.00%> (+<0.01%) ⬆️
...StandardTrainers/Standard/LinearModelParameters.cs 66.32% <0.00%> (+0.25%) ⬆️
src/Microsoft.ML.Data/Utils/LossFunctions.cs 67.35% <0.00%> (+0.51%) ⬆️
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 92.49% <0.00%> (+1.02%) ⬆️
...crosoft.ML.StandardTrainers/Optimizer/Optimizer.cs 73.12% <0.00%> (+1.15%) ⬆️
... and 6 more

for (long i = 0; i < matchedFullRows.Length; i++)
{
int rowIndex = matchedFullRows[i];
MatchRowsOnMergedDataFrame(merge, left, right, rowIndex, rowIndex, rowIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MatchRowsOnMergedDataFrame

Can you also add a test case for the scenario reported in the issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually a part of TestMergeEdgeCases_Left, but I added one now anyway

@pgovind
Copy link
Author

pgovind commented May 5, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@michaelgsharp
Copy link
Member

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@pgovind pgovind merged commit 750956d into dotnet:main May 6, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.Merge method incorrect behavior
3 participants