Parquet WriterProperties.max_row_group_size not wired up #257

@crepererum

Describe the bug
This property (that can be set via the WriterPropertiesBuilder):

max_row_group_size: usize,

can only be retrieved using this getter:

    /// Returns max size for a row group.
    pub fn max_row_group_size(&self) -> usize {
        self.max_row_group_size
    }

but this getter is never used. In fact, quickly trying out this property shows that it has no effect. I think it should probably be wired up here:

    /// Write a RecordBatch to writer
    ///
    /// *NOTE:* The writer currently does not support all Arrow data types
    pub fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // validate batch schema against writer's supplied schema
        if self.arrow_schema != batch.schema() {
            return Err(ParquetError::ArrowError(
                "Record batch schema does not match writer schema".to_string(),
            ));
        }
        // compute the definition and repetition levels of the batch
        let batch_level = LevelInfo::new_from_batch(batch);
        let mut row_group_writer = self.writer.next_row_group()?;
        for (array, field) in batch.columns().iter().zip(batch.schema().fields()) {
            let mut levels = batch_level.calculate_array_levels(array, field, false);
            // Reverse levels as we pop() them when writing arrays
            levels.reverse();
            write_leaves(&mut row_group_writer, array, &mut levels)?;
        }
        self.writer.close_row_group(row_group_writer)
    }

where the incoming RecordBatch is split into chunks of at most the configured size, each of which is then written to its own row group.
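
To make the proposed splitting concrete, here is a minimal sketch of the chunking arithmetic only. The helper name `row_group_ranges` is hypothetical and not part of the parquet crate. It assumes the writer would turn each (offset, length) range into a slice of the batch's columns (for example via `Array::slice`) and write every slice as its own row group.

```rust
/// Hypothetical helper (not part of the parquet crate): split `num_rows`
/// rows into (offset, length) ranges of at most `max_row_group_size` rows.
fn row_group_ranges(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    // A zero limit would never make progress, so reject it up front.
    assert!(max_row_group_size > 0, "max_row_group_size must be positive");

    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        // The last chunk may be shorter than the configured maximum.
        let len = max_row_group_size.min(num_rows - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

With 3 rows and a limit of 1, this yields the ranges (0, 1), (1, 1) and (2, 1), i.e. three row groups, which matches the expectation in the reproduction steps.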

To Reproduce
Steps to reproduce the behavior:

  1. Create a RecordBatch with 3 rows.
  2. Set WriterProperties.max_row_group_size to 1.
  3. Create a Parquet file.
  4. See that the Parquet file has only 1 row group (should be 3).

Expected behavior
Row groups written from Arrow record batches should respect WriterProperties.max_row_group_size.

Additional context
Commit in question is 508f25c10032857da34ea88cc8166f0741616a32.
