@Fokko Fokko commented Feb 7, 2024

On top of @HonahX's work in #388

@Fokko Fokko force-pushed the fd-fix-row-group-page-size branch from edac3df to 4cd240f Compare February 7, 2024 12:51
@sungwy sungwy added this to the PyIceberg 0.6.0 release milestone Feb 7, 2024
Comment on lines 137 to 144
PARQUET_ROW_GROUP_SIZE_BYTES = "write.parquet.row-group-size-bytes"
PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024 # 128 MB

PARQUET_ROW_GROUP_LIMIT = "write.parquet.row-group-limit"
PARQUET_ROW_GROUP_LIMIT_DEFAULT = 128 * 1024 * 1024 # 128 MB

PARQUET_PAGE_SIZE_BYTES = "write.parquet.page-size-bytes"
PARQUET_PAGE_SIZE_BYTES_DEFAULT = 1024 * 1024 # 1 MB

I think this should be fine for the initial PR just so people have the properties, but I do think we may want to benchmark these values more. I think in general, Arrow and DuckDB will benefit from smaller row group sizes because they are more aggressive on parallel reads. But of course we should measure that.
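
As a minimal sketch of how the properties quoted above could be looked up with their defaults: the keys and default values are copied from the diff, while `int_property` and `resolve_parquet_write_options` are hypothetical helpers written for illustration and are not PyIceberg's implementation. The sketch assumes table properties arrive as a string-to-string mapping.

```python
from typing import Dict, Mapping

# Keys and defaults copied from the diff quoted above.
PARQUET_ROW_GROUP_SIZE_BYTES = "write.parquet.row-group-size-bytes"
PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024
PARQUET_ROW_GROUP_LIMIT = "write.parquet.row-group-limit"
PARQUET_ROW_GROUP_LIMIT_DEFAULT = 128 * 1024 * 1024
PARQUET_PAGE_SIZE_BYTES = "write.parquet.page-size-bytes"
PARQUET_PAGE_SIZE_BYTES_DEFAULT = 1024 * 1024


def int_property(properties: Mapping[str, str], key: str, default: int) -> int:
    """Hypothetical helper: read a positive integer property or return the default."""
    raw = properties.get(key)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"Property {key!r} must be an integer, got {raw!r}") from exc
    if value <= 0:
        raise ValueError(f"Property {key!r} must be positive, got {value}")
    return value


def resolve_parquet_write_options(properties: Mapping[str, str]) -> Dict[str, int]:
    """Hypothetical helper: collect the three write settings, applying defaults."""
    return {
        "row_group_size_bytes": int_property(
            properties, PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT
        ),
        "row_group_limit": int_property(
            properties, PARQUET_ROW_GROUP_LIMIT, PARQUET_ROW_GROUP_LIMIT_DEFAULT
        ),
        "page_size_bytes": int_property(
            properties, PARQUET_PAGE_SIZE_BYTES, PARQUET_PAGE_SIZE_BYTES_DEFAULT
        ),
    }


# Example: override only the row group size (64 MB); the rest fall back to defaults.
options = resolve_parquet_write_options({PARQUET_ROW_GROUP_SIZE_BYTES: str(64 * 1024 * 1024)})
```

Keeping the defaults in one place like this would make it easy to swap in different values when benchmarking smaller row groups, as suggested in the comment above.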

@HonahX HonahX left a comment

LGTM!

@Fokko Fokko force-pushed the fd-fix-row-group-page-size branch from 4cd240f to 0165e4f Compare February 8, 2024 08:51
@Fokko Fokko merged commit 7a1fe28 into apache:main Feb 8, 2024
@Fokko Fokko deleted the fd-fix-row-group-page-size branch February 14, 2024 20:22