GenericColumnReader::read_records Yields Truncated Records #5150

@ogrman

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Prior to version 41 of the parquet crate we had access to the read_batch function, which was deprecated (and changed in behavior) in favor of read_records. What we are trying to do does not seem to be possible with the new API. We have a program that reads parquet files and concatenates them vertically, so that a number of parquet files with identical schemas become one file.

We did this by, for each input file and column:

loop {
    // Read up to BATCH_SIZE values, plus their definition and repetition levels.
    let (values_read, levels_read) = column_reader.read_batch(
        BATCH_SIZE,
        Some(&mut def_levels[..]),
        Some(&mut rep_levels[..]),
        &mut value_buffer[..],
    )?;

    // Nothing left in this column chunk.
    if values_read == 0 && levels_read == 0 {
        break;
    }

    // Write the batch straight back out under the identical schema.
    let values_written = column_writer.write_batch(
        &value_buffer[0..values_read],
        Some(&def_levels[0..levels_read]),
        Some(&rep_levels[0..levels_read]),
    )?;

    assert_eq!(values_written, values_read);
}

This simple loop turned many "small" files into one large file with the same schema. After this change, when we replace the call to read_batch with a call to read_records, we no longer get a complete batch, which means that we sometimes start writing a new batch while rep_levels is still 1 (i.e. in the middle of a record).

Describe the solution you'd like

A way to simply read a complete batch without fiddling around with realigning our buffers between writes. I am also open to suggestions as to why what we are doing would be better solved in a different way, but the code we have works great with previous versions of parquet and we are currently blocked from upgrading.

Describe alternatives you've considered

Manually finding the last element with rep_levels = 0 and stopping our writes there; doing some bookkeeping; writing a batch that excludes the tail of the buffers; copying that tail to the start of the buffers; and then reading fewer records on the next iteration according to how much space is already used in the buffers.
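The boundary-finding part of this workaround can be sketched with plain slices, independent of the parquet APIs (the function name and buffer layout here are illustrative, not from the crate): a record starts wherever rep_level == 0, so everything before the last 0 is a whole number of records and is safe to write; the remainder must be carried over to the front of the buffer.

```rust
/// Number of leading levels that form only complete records.
/// A record starts where rep_level == 0, so all levels before the
/// last rep_level == 0 entry belong to fully-buffered records.
fn complete_record_boundary(rep_levels: &[i16]) -> usize {
    rep_levels
        .iter()
        .rposition(|&r| r == 0)
        // No record start seen at all: nothing is safely writable yet.
        .unwrap_or(0)
}

fn main() {
    // Two complete records (indices 0..5) plus the head of a third (5..7).
    let rep_levels = [0, 1, 1, 0, 1, 0, 1];

    let boundary = complete_record_boundary(&rep_levels);
    assert_eq!(boundary, 5);

    // Write def_levels/rep_levels/values up to `boundary`, then copy the
    // tail to the front of each buffer and read at most
    // BATCH_SIZE - tail.len() more levels on the next iteration.
    let tail = &rep_levels[boundary..];
    assert_eq!(tail, &[0, 1]);
}
```

Note that when the buffer ends exactly on a record start (or contains no record start at all), the boundary is conservative: the trailing partial record stays buffered until the next read shows where it ends.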

Metadata

Labels

bug · enhancement (Any new improvement worthy of an entry in the changelog) · parquet (Changes to the parquet crate)
