Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More accurate estimate on parquet row groups size #11258

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
fix testBinPackCombineMediumFiles test failure
  • Loading branch information
jinyang_li committed Oct 7, 2024
commit 5e646685cf116570e32ececf388bf15782b6049a
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,7 @@ public long length() {
if (!closed && recordCount > 0) {
// recordCount > 0 when there are records in the write store that have not been flushed to
// the Parquet file
length += writeStore.getBufferedSize();
length += estimateBufferSize();
}

return length;
Expand All @@ -192,7 +192,7 @@ public List<Long> splitOffsets() {

private long estimateBufferSize() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename estimateBufferSize to estimateBufferedSize for naming consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we add a few comments about the estimation logic? i.e. what the ratio totalRowGroupSize / totalBufferSize means as estimating the ratio of size reduction to match closely what will be written out encoded/compressed.

if (totalRowGroupSize == 0 || totalBufferSize == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also fallback to writeStore.getBufferedSize() when currentBufferSize == 0? Likely it'd be 0 since I don't see other case, but I feel that'd make it consistent on the fallbacks.

return currentBufferSize;
return writeStore.getBufferedSize();
}

return currentBufferSize * totalRowGroupSize / totalBufferSize;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -557,8 +557,8 @@ public void testBinPackCombineMixedFiles() {

@Test
public void testBinPackCombineMediumFiles() {
Table table = createTable(4);
shouldHaveFiles(table, 4);
Table table = createTable(6);
shouldHaveFiles(table, 6);
Comment on lines +560 to +561
Copy link
Author

@jinyangli34 jinyangli34 Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, the test table has smaller number of row groups. Small change like 4->3 may cause small tail when generating taskGroups
Size of taskGroups with 4 -> 3: [197078, 188542, 189047, 15617]

Using 6 -> 3 can have better tolerance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why changes to this test are being made

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4/3 may produce groups with [1.4, 1.3, 1.3] or [1.3, 1.3, 1.3, 0.1]
The testing data generates [197078, 188542, 189047, 15617] with small tail.
Using 4/2 or 6/3 can make test more robust.


List<Object[]> expectedRecords = currentData();
int targetSize = ((int) testDataSize(table) / 3);
Expand All @@ -578,7 +578,7 @@ public void testBinPackCombineMediumFiles() {

assertThat(result.rewrittenDataFilesCount())
.as("Action should delete 4 data files")
.isEqualTo(4);
.isEqualTo(6);
assertThat(result.addedDataFilesCount()).as("Action should add 3 data files").isEqualTo(3);
assertThat(result.rewrittenBytesCount()).isEqualTo(dataSizeBefore);

Expand Down