-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better memory limiting in parquet ArrowWriter
#5484
Labels
enhancement
Any new improvement worthy of a entry in the changelog
parquet
Changes to the parquet crate
Comments
alamb
added
the
enhancement
Any new improvement worthy of a entry in the changelog
label
Mar 7, 2024
I think we should probably just remove the confusing |
tustvold
added a commit
to tustvold/arrow-rs
that referenced
this issue
Mar 8, 2024
tustvold
added a commit
to tustvold/arrow-rs
that referenced
this issue
Mar 8, 2024
tustvold
added a commit
that referenced
this issue
Mar 13, 2024
|
mwylde
added a commit
to ArroyoSystems/arrow-rs
that referenced
this issue
May 9, 2024
…apache#5485)" This reverts commit 19a3bb0.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
enhancement
Any new improvement worthy of a entry in the changelog
parquet
Changes to the parquet crate
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@DDtKey suggested in #5457 #5457 (review)
Describe the solution you'd like
The idea is to add an additional option to force the writer to flush when its buffered data hits a certain limit.
Describe alternatives you've considered
The challenge is how to enforce buffer limiting without slowing down encoding. One idea would be to checking memory usage after completing encoding each RecordBatch. This would be imprecise (the writer could go over), as noted by @tustvold , but the overage would be bounded to the size of one RecordBatch (which the user could control)
Since all writers
This might look like adding something like this to the ArrowWriter:
Since not all the parquet writers buffer their data like this, I think it doesn't make sense to put the buffer size on the
WriterProperties
struct.Additional context
@tustvold documented the current behavior better in #5457
The text was updated successfully, but these errors were encountered: