Improve Documentation of Parquet ChunkReader #4118

@zilder

Describe the bug
I'm not sure this is a bug, but arrow-rs version 37 seems to perform more read operations on Parquet files than version 19 (which we have been using so far), and some of the byte ranges overlap (see the output below). For context, we use a custom implementation of ChunkReader with ParquetRecordBatchReader (and with SerializedFileReader in v19) to access S3 storage. Here's a reduced implementation:

pub struct S3Request {
    client: Client,
    bucket: String,
    key: String,
    len: u64,
    rt: Runtime,
}

impl ChunkReader for S3Request {
    type T = ByteBuf;

    fn get_read(
        &self,
        start: u64,
        length: usize,
    ) -> Result<Self::T, parquet::errors::ParquetError> {
        let end = start + length as u64 - 1;
        println!("S3Request::get_read(): {}, {}", start, end);

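        // Surface S3 failures as ParquetError instead of panicking or unwrapping.
        let data = self.rt.block_on(async {
            let resp = self
                .client
                .get_object()
                .bucket(&self.bucket)
                .key(&self.key)
                .range(format!("bytes={}-{}", start, end))
                .send()
                .await
                .map_err(|e| parquet::errors::ParquetError::General(e.to_string()))?;

            // Buffer the whole response body in memory.
            resp.body
                .collect()
                .await
                .map_err(|e| parquet::errors::ParquetError::General(e.to_string()))
        })?;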

        Ok(ByteBuf(data))
    }
}

(I added println!("S3Request::get_read(): {}, {}", start, end); to track each read operation.)

In the output we get 8 read operations (v37):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 18, 121
S3Request::get_read(): 43, 121
S3Request::get_read(): 214, 331
S3Request::get_read(): 228, 331
S3Request::get_read(): 253, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+

With the same implementation, we get only 4 read operations using SerializedFileReader and ParquetFileArrowReader (v19):

S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 214, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+
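
The overlap between the two logs can be checked mechanically. The following is a minimal sketch, not part of the original reproduction: `ByteRange` and `redundant_reads` are hypothetical helper names, and the ranges are copied from the v37 output above, treated as inclusive `start, end` pairs. It flags every read that is fully covered by an earlier read in the same log.

```rust
/// Inclusive byte range, matching the `start, end` pairs printed by `get_read`.
/// (Hypothetical helper type, introduced only for this illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

impl ByteRange {
    /// True if `other` lies entirely inside `self`.
    fn contains(&self, other: &ByteRange) -> bool {
        self.start <= other.start && other.end <= self.end
    }

    /// Number of bytes in the inclusive range.
    fn len(&self) -> u64 {
        self.end - self.start + 1
    }
}

/// Returns every range that is fully covered by an earlier range in the log.
fn redundant_reads(log: &[ByteRange]) -> Vec<ByteRange> {
    let mut redundant = Vec::new();
    for (i, r) in log.iter().enumerate() {
        if log[..i].iter().any(|earlier| earlier.contains(r)) {
            redundant.push(*r);
        }
    }
    redundant
}

fn main() {
    // Ranges copied from the v37 output, in the order they were logged.
    let v37: Vec<ByteRange> = [
        (2359, 2366),
        (435, 2358),
        (4, 121),
        (18, 121),
        (43, 121),
        (214, 331),
        (228, 331),
        (253, 331),
    ]
    .iter()
    .map(|&(start, end)| ByteRange { start, end })
    .collect();

    let extra = redundant_reads(&v37);
    for r in &extra {
        println!("redundant: {}-{}", r.start, r.end);
    }
    let refetched: u64 = extra.iter().map(|r| r.len()).sum();
    println!("{} redundant reads, {} bytes re-fetched", extra.len(), refetched);
}
```

On the v37 log this flags the four reads that do not appear in the v19 log (18-121, 43-121, 228-331 and 253-331); each one is a suffix of a range that had already been fetched.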

Was that an intended change?
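
In the meantime, one possible stopgap on the caller's side would be to remember ranges that have already been fetched and serve contained sub-ranges from memory. The sketch below is only an illustration under stated assumptions: `RangeCache` and its `fetch` closure are hypothetical names, it deliberately does not depend on the parquet or AWS crates so that it runs standalone, and the cache is unbounded and holds its lock while fetching, which a real implementation would probably want to change. Wiring it into `get_read` is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical caching layer in front of a remote range fetch.
/// `fetch(start, length)` stands in for the S3 ranged GET.
struct RangeCache<F>
where
    F: Fn(u64, usize) -> Result<Vec<u8>, String>,
{
    fetch: F,
    /// Previously fetched ranges as (start offset, bytes). Unbounded in this sketch.
    cached: Mutex<Vec<(u64, Vec<u8>)>>,
}

impl<F> RangeCache<F>
where
    F: Fn(u64, usize) -> Result<Vec<u8>, String>,
{
    fn new(fetch: F) -> Self {
        Self { fetch, cached: Mutex::new(Vec::new()) }
    }

    /// Returns `length` bytes starting at `start`, fetching only when no
    /// previously fetched range fully contains the request.
    fn read(&self, start: u64, length: usize) -> Result<Vec<u8>, String> {
        let mut cached = self.cached.lock().map_err(|e| e.to_string())?;
        for (c_start, buf) in cached.iter() {
            let c_end = *c_start + buf.len() as u64; // exclusive end
            if *c_start <= start && start + length as u64 <= c_end {
                let offset = (start - *c_start) as usize;
                return Ok(buf[offset..offset + length].to_vec());
            }
        }
        // Cache miss: fetch from the backend and remember the result.
        let data = (self.fetch)(start, length)?;
        cached.push((start, data.clone()));
        Ok(data)
    }
}

fn main() {
    // Stand-in for the remote object: 400 bytes of dummy data.
    let file: Vec<u8> = (0..400u32).map(|i| (i % 251) as u8).collect();
    let fetches = AtomicUsize::new(0);

    let cache = RangeCache::new(|start: u64, length: usize| -> Result<Vec<u8>, String> {
        fetches.fetch_add(1, Ordering::SeqCst);
        let s = start as usize;
        file.get(s..s + length)
            .map(|b| b.to_vec())
            .ok_or_else(|| "range out of bounds".to_string())
    });

    // Replay the six column-chunk reads from the v37 log (inclusive ends).
    for (start, end) in [(4u64, 121u64), (18, 121), (43, 121), (214, 331), (228, 331), (253, 331)] {
        let length = (end - start + 1) as usize;
        let bytes = cache.read(start, length).expect("read failed");
        assert_eq!(bytes.len(), length);
    }
    println!("backend fetches: {}", fetches.load(Ordering::SeqCst));
}
```

With the six column-chunk reads from the v37 log replayed, only the two outer ranges (4-121 and 214-331) reach the backend. This trades memory for fewer round trips, so it only makes sense when the repeated ranges are small, as they are here.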

Labels: documentation, parquet
