Description
Describe the bug
This property (that can be set via the WriterPropertiesBuilder):
arrow-rs/parquet/src/file/properties.rs
Line 99 in 508f25c

    max_row_group_size: usize,

can only be retrieved using this getter:
arrow-rs/parquet/src/file/properties.rs
Lines 132 to 135 in 508f25c

    /// Returns max size for a row group.
    pub fn max_row_group_size(&self) -> usize {
        self.max_row_group_size
    }

but this getter is never used. In fact, a quick test shows that setting this property has no effect. I think it should probably be wired up here:
arrow-rs/parquet/src/arrow/arrow_writer.rs
Lines 80 to 101 in 508f25c

    /// Write a RecordBatch to writer
    ///
    /// *NOTE:* The writer currently does not support all Arrow data types
    pub fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // validate batch schema against writer's supplied schema
        if self.arrow_schema != batch.schema() {
            return Err(ParquetError::ArrowError(
                "Record batch schema does not match writer schema".to_string(),
            ));
        }
        // compute the definition and repetition levels of the batch
        let batch_level = LevelInfo::new_from_batch(batch);
        let mut row_group_writer = self.writer.next_row_group()?;
        for (array, field) in batch.columns().iter().zip(batch.schema().fields()) {
            let mut levels = batch_level.calculate_array_levels(array, field, false);
            // Reverse levels as we pop() them when writing arrays
            levels.reverse();
            write_leaves(&mut row_group_writer, array, &mut levels)?;
        }
        self.writer.close_row_group(row_group_writer)
    }

where the incoming RecordBatch would be split into batches of at most the configured size, each of which would then be written to its own row group.
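To make the proposed splitting concrete, here is a minimal sketch of just the chunking arithmetic. It is not the actual arrow-rs implementation: `row_group_ranges` is a hypothetical helper name, and treating a limit of 0 as 1 is an assumption made only to keep the loop from running forever. Each resulting `(offset, length)` pair could then be used to slice the batch's columns before handing them to a fresh row group writer.

```rust
/// Hypothetical helper (not part of arrow-rs): split `num_rows` rows into
/// consecutive `(offset, length)` ranges of at most `max_row_group_size` rows.
fn row_group_ranges(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    // Assumption: a limit of 0 is treated as 1 so the loop always advances.
    let chunk = max_row_group_size.max(1);
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        // The last chunk may be shorter than the limit.
        let len = chunk.min(num_rows - offset);
        ranges.push((offset, len));
        offset += len;
    }
    ranges
}
```

For the reproduction case in this issue, `row_group_ranges(3, 1)` yields three ranges of one row each, which would correspond to three row groups.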
To Reproduce
Steps to reproduce the behavior:
- Create a RecordBatch with 3 rows.
- Set WriterProperties.max_row_group_size to 1.
- Create a parquet file.
- See that the parquet file has only 1 row group (should be 3).
Expected behavior
Row groups written from Arrow record batches should respect WriterProperties.max_row_group_size.
Additional context
Commit in question is 508f25c10032857da34ea88cc8166f0741616a32.