GenericColumnReader::read_records Yields Truncated Records #5150

@ogrman

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Prior to version 41 of the parquet crate we had access to the read_batch function, which was deprecated (and changed in behavior) in favor of read_records. What we are trying to do does not seem to be possible with the new API. We have a program that reads parquet files and concatenates them vertically, so that a number of parquet files with identical schemas become one file.

We did this by, for each input file and column:

loop {
    // Read up to BATCH_SIZE values, plus their definition and repetition levels.
    let (values_read, levels_read) = column_reader.read_batch(
        BATCH_SIZE,
        Some(&mut def_levels[..]),
        Some(&mut rep_levels[..]),
        &mut value_buffer[..],
    )?;

    // Nothing left in this column chunk.
    if values_read == 0 && levels_read == 0 {
        break;
    }

    // Write the batch straight back out under the identical schema.
    let values_written = column_writer.write_batch(
        &value_buffer[0..values_read],
        Some(&def_levels[0..levels_read]),
        Some(&rep_levels[0..levels_read]),
    )?;

    assert_eq!(values_written, values_read);
}

This simple loop turned many "small" files into one large file with the same schema. After this change, when we replace the call to read_batch with a call to read_records, we no longer get a complete batch, which means that we sometimes start writing a new batch while rep_levels is still 1 (i.e. in the middle of a record).

Describe the solution you'd like

A way to simply read a complete batch without fiddling around with realigning our buffers between writes. I am also open to suggestions as to why what we are doing would be better solved in a different way, but the code we have works great with previous versions of parquet and we are currently blocked from upgrading.

Describe alternatives you've considered

Manually finding the last element with rep_levels = 0 and stopping our writes there; doing some bookkeeping; writing a batch that excludes the tail of the buffers; copying that tail to the start of the buffers; and then reading fewer records on the next iteration according to how much space is already used in the buffers.
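The boundary-finding part of this workaround can be sketched with plain slices, independent of the parquet APIs (the function name and buffer layout here are illustrative, not from the crate): a record starts wherever rep_level == 0, so everything before the last 0 is a whole number of records and is safe to write; the remainder must be carried over to the front of the buffer.

```rust
/// Number of leading levels that form only complete records.
/// A record starts where rep_level == 0, so all levels before the
/// last rep_level == 0 entry belong to fully-buffered records.
fn complete_record_boundary(rep_levels: &[i16]) -> usize {
    rep_levels
        .iter()
        .rposition(|&r| r == 0)
        // No record start seen at all: nothing is safely writable yet.
        .unwrap_or(0)
}

fn main() {
    // Two complete records (indices 0..5) plus the head of a third (5..7).
    let rep_levels = [0, 1, 1, 0, 1, 0, 1];

    let boundary = complete_record_boundary(&rep_levels);
    assert_eq!(boundary, 5);

    // Write def_levels/rep_levels/values up to `boundary`, then copy the
    // tail to the front of each buffer and read at most
    // BATCH_SIZE - tail.len() more levels on the next iteration.
    let tail = &rep_levels[boundary..];
    assert_eq!(tail, &[0, 1]);
}
```

Note that when the buffer ends exactly on a record start (or contains no record start at all), the boundary is conservative: the trailing partial record stays buffered until the next read shows where it ends.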

Metadata

Labels

bug · enhancement (Any new improvement worthy of an entry in the changelog) · parquet (Changes to the parquet crate)
