Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DataFrame to allow to store columns with size more than 2 Gb #6710

Merged
merged 5 commits into from
Jul 6, 2023
Merged

Fix DataFrame to allow to store columns with size more than 2 Gb #6710

merged 5 commits into from
Jul 6, 2023

Conversation

asmirnov82
Copy link
Contributor

@asmirnov82 asmirnov82 commented May 25, 2023

Fixes #6699

  1. Decrease limit from Int.Max value on buffer size to the maximum amount of bytes supported by .Net Arrays
  2. Fix calculation of items amount for filling remaining capacity of fully allocated chunk
  3. Fix calculation of buffer size during default growing strategy (double the initial size)

@codecov
Copy link

codecov bot commented May 25, 2023

Codecov Report

Merging #6710 (ee9cc4c) into main (184e661) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6710   +/-   ##
=======================================
  Coverage   68.87%   68.88%           
=======================================
  Files        1216     1216           
  Lines      250915   250916    +1     
  Branches    26259    26259           
=======================================
+ Hits       172825   172841   +16     
+ Misses      71265    71258    -7     
+ Partials     6825     6817    -8     
Flag Coverage Δ
Debug 68.88% <100.00%> (+<0.01%) ⬆️
production 63.38% <100.00%> (+<0.01%) ⬆️
test 88.90% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
test/Microsoft.Data.Analysis.Tests/BufferTests.cs 100.00% <ø> (ø)
src/Microsoft.Data.Analysis/DataFrameBuffer.cs 82.97% <100.00%> (+0.37%) ⬆️
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 83.04% <100.00%> (ø)
...Microsoft.Data.Analysis/ReadOnlyDataFrameBuffer.cs 47.36% <100.00%> (ø)

... and 7 files with indirect coverage changes

Copy link
Member

@michaelgsharp michaelgsharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@torronen torronen mentioned this pull request Jun 3, 2023
@michaelgsharp
Copy link
Member

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@asmirnov82
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 6710 in repo dotnet/machinelearning

@asmirnov82
Copy link
Contributor Author

@michaelgsharp, tests are failing due to "Unable to load DLL 'lib_lightgbm' or one of its dependencies: The specified module could not be found. (0x8007007E)" in ML.Fairlearn.Tests.GridSearchTest.TestGridSearchTrialRunner2. It doesn't look to be somehow related to changes done in this PR. More over, I see that the same error happens from time to time during build in many other my PRs. Is it possible to disable this test temporaly (until the fix for the test is found)?

@JakeRadMSFT
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@@ -188,6 +188,28 @@ public void TestArrowStringColumnClone()
Assert.Null(clone[i]);
}

/* Don't run tests during build as they fail, because build if build machine doesn't have enought memory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asmirnov82 are you sure this is the reason?

Can we add a flag so these still run locally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to do a #if and then set a flag in build definition

Copy link
Contributor

@JakeRadMSFT JakeRadMSFT Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asmirnov82 can you open an issue. I don't want to block on this.

@tannergooding said this might actually be a netfx thing? There might need to be something set to support > 2gb. Comment below.

"For the more than 2GB one, it will be allowed but will then fail on .NET Framework anyways due to GC limitations

You have to opt in with gcAllowVeryLargeObjects iirc, and that's only for 64-bit"

Copy link
Contributor Author

@asmirnov82 asmirnov82 Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JakeRadMSFT I fixed tests using X64FactAttribute. Please take a look at #6758 and #6759

@JakeRadMSFT JakeRadMSFT merged commit 26c2446 into dotnet:main Jul 6, 2023
23 checks passed
@asmirnov82 asmirnov82 deleted the 6699_support_large_columns branch July 7, 2023 05:48
@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame crashes on attempt to create column with size higher than 2,147,483,591 bytes
3 participants