Environment
Delta-rs version: 0.16.4
Binding: python
Environment:
- Cloud provider: AWS
- OS: macOS
- Other:
Bug
What happened:
Compact produces parquet files that are larger than expected:
```python
from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("...")
dt.optimize.compact(
    writer_properties=WriterProperties(
        max_row_group_size=8192,
        write_batch_size=8192,
    )
)
```
The resulting parquet files have row groups with 1024 rows instead of 8192.
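For reference, the row-group sizes can be inspected directly with pyarrow (the file name below is a hypothetical placeholder):

```python
import pyarrow.parquet as pq

# Metadata of one of the compacted files (placeholder name).
md = pq.ParquetFile("part-00001-compacted.parquet").metadata
print([md.row_group(i).num_rows for i in range(md.num_row_groups)])
# Observed: every row group holds 1024 rows instead of 8192.
```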
What you expected to happen:
Most row groups in the compacted parquet files should contain 8192 rows.
How to reproduce it:
Call `dt.optimize.compact()` with `max_row_group_size` greater than 1024.
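A minimal end-to-end sketch, assuming a local table (the path and data are invented for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from deltalake import DeltaTable, write_deltalake, WriterProperties

table_uri = "/tmp/compact_repro"  # hypothetical local table

# Several small appends so compact() has multiple files to merge.
for i in range(10):
    data = pa.table({"x": pa.array(range(i * 10_000, (i + 1) * 10_000))})
    write_deltalake(table_uri, data, mode="append")

DeltaTable(table_uri).optimize.compact(
    writer_properties=WriterProperties(
        max_row_group_size=8192,
        write_batch_size=8192,
    )
)

# Inspect the compacted file(s): row groups come out at 1024 rows.
for uri in DeltaTable(table_uri).file_uris():
    md = pq.ParquetFile(uri).metadata
    print(uri, [md.row_group(i).num_rows for i in range(md.num_row_groups)])
```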
More details:
This is caused by the call to `self.arrow_writer.flush()` at the end of each batch in `core/src/operations/writer.rs`, introduced recently in #2318. Flushing starts a new row group for every batch, even when the batch has fewer rows than `max_row_group_size`. Since we read batches with `ParquetRecordBatchStreamBuilder` using its default configuration (batch size 1024), we end up with row groups of at most 1024 rows, no matter how large `max_row_group_size` is set.
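The mechanism is easy to demonstrate in isolation. As an analogy (this is pyarrow, not the delta-rs writer): pyarrow's `ParquetWriter` also closes the current row group on every `write_table()` call, so handing it 1024-row batches one call at a time yields 1024-row row groups no matter what `row_group_size` is set to, which is the same shape as flushing `ArrowWriter` after each batch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array(range(8192))})

with pq.ParquetWriter("/tmp/per_batch_flush.parquet", table.schema) as writer:
    for batch in table.to_batches(max_chunksize=1024):
        # One write call per 1024-row batch, analogous to flushing
        # after every batch: each call closes a row group.
        writer.write_table(pa.Table.from_batches([batch]), row_group_size=8192)

md = pq.ParquetFile("/tmp/per_batch_flush.parquet").metadata
print([md.row_group(i).num_rows for i in range(md.num_row_groups)])
# -> [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
```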
I don't think calling `flush` is necessary, since `ArrowWriter` flushes automatically once `max_row_group_size` rows have been buffered.
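Continuing the same pyarrow analogy: without the per-call cut, the writer buffers rows and splits row groups at the configured limit on its own, which is the behavior I'd expect from `ArrowWriter` once the per-batch `flush()` is removed:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array(range(8192))})

# Single write: the writer accumulates rows and cuts row groups only
# when row_group_size is reached.
with pq.ParquetWriter("/tmp/no_flush.parquet", table.schema) as writer:
    writer.write_table(table, row_group_size=8192)

md = pq.ParquetFile("/tmp/no_flush.parquet").metadata
print([md.row_group(i).num_rows for i in range(md.num_row_groups)])
# -> [8192]
```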
This negatively impacts our use cases by inflating our parquet file sizes, sometimes by as much as 4x (40 MB to 160 MB).