
Compacting produces smaller row groups than expected #2386

Closed
@PeterKeDer

Description

Environment

Delta-rs version: 0.16.4

Binding: python

Environment:

  • Cloud provider: AWS
  • OS: macOS
  • Other:

Bug

What happened:

Compacting produces parquet files that are larger than expected:

dt = DeltaTable("...")
dt.optimize.compact(
    writer_properties=WriterProperties(
        max_row_group_size=8192,
        write_batch_size=8192,
    )
)

The resulting parquet files have row groups with 1024 rows instead of 8192.

What you expected to happen:

Most row groups in the compacted parquet files should contain 8192 rows.

How to reproduce it:

Call dt.optimize.compact() with max_row_group_size greater than 1024.
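
The row group sizes of the resulting files can then be inspected with pyarrow. This is a minimal sketch assuming a local table path (the manual path join below is illustrative; remote stores would need a filesystem argument):

import pyarrow.parquet as pq
from deltalake import DeltaTable

dt = DeltaTable("...")
for f in dt.files():
    md = pq.ParquetFile(f"{dt.table_uri}/{f}").metadata
    # expected: mostly 8192-row groups; observed: groups of 1024 rows
    print(f, [md.row_group(i).num_rows for i in range(md.num_row_groups)])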

More details:

This is caused by calling self.arrow_writer.flush() at the end of each batch in core/src/operations/writer.rs, introduced recently in #2318. This starts a new row group for every batch, even when fewer than max_row_group_size rows have been written. Since we read batches using ParquetRecordBatchStreamBuilder with its default config (i.e. a batch size of 1024), we end up with row groups of at most 1024 rows, even if max_row_group_size is set to a larger value.

I don't think calling flush is necessary, since ArrowWriter already flushes a row group automatically once max_row_group_size rows have been buffered.
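
The mechanism can be illustrated with pyarrow (just an analogy, not the delta-rs writer itself): each separate write call on a ParquetWriter closes the current row group, much like flushing the ArrowWriter after every batch, while letting the writer accumulate rows up to an explicit row_group_size produces the expected large groups. File names here are hypothetical:

import pyarrow as pa
import pyarrow.parquet as pq

batches = [pa.record_batch({"x": list(range(1024))}) for _ in range(8)]

# One write call per 1024-row batch: each call starts a new row group,
# analogous to calling flush() after every batch in writer.rs.
with pq.ParquetWriter("per_batch.parquet", batches[0].schema) as w:
    for b in batches:
        w.write_batch(b)

# Writing all rows at once with row_group_size=8192 yields one full group.
pq.write_table(pa.Table.from_batches(batches), "combined.parquet",
               row_group_size=8192)

print(pq.ParquetFile("per_batch.parquet").metadata.num_row_groups)  # 8 groups of 1024
print(pq.ParquetFile("combined.parquet").metadata.num_row_groups)   # 1 group of 8192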

This negatively impacts our use cases by inflating our parquet file sizes, sometimes by up to 4x (from 40 MB to 160 MB).


Metadata

Labels: bug (Something isn't working)
