Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ParquetMetaDataBuilder #6466

Merged
merged 7 commits into from
Oct 1, 2024
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 26, 2024

Which issue does this PR close?

Closes #6465

Rationale for this change

At the moment it is

  1. awkward to modify ParquetMetaData (you have to re-create it from its constitutent fields)
  2. and there is no way to avoid clone'ing when trying to modify it

See #6465 for more rationale

What changes are included in this PR?

  1. Add ParquetMetaDataBuilder
  2. Add documentation and examples
  3. Update code that modifies ParquetMetaData to use ParquetMetadataBuilder (which requires fewer clones) so might be marginally faster

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 26, 2024
Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.

let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
for (i, rg_meta) in row_groups.into_iter().enumerate() {

// Filter row groups based on the predicates
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cleanup of this code (which is modifying the ParquetMetaData) is the best example of why having this API makes sense -- it makes one fewer copies and also I think is quite a bit clearer

}

/// Adds a row group to the metadata
pub fn add_row_group(mut self, row_group: RowGroupMetaData) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor Author

alamb commented Sep 27, 2024

If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.

That is a good point -- I wasn't planing to deprecate the functions, though we could argume that deprecating new_with_page_index might be a good idea 🤔

@alamb
Copy link
Contributor Author

alamb commented Sep 27, 2024

Thank you for the review @wiedld

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this. ❤️

parquet/src/file/metadata/mod.rs Show resolved Hide resolved
}
}

if options.enable_page_index {
let mut columns_indexes = vec![];
let mut offset_indexes = vec![];

for rg in &mut filtered_row_groups {
for rg in metadata_builder.row_groups().iter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can build the metadata here (with the filtered row groups), pass it into ParquetMetaDataReader and then load the page indexes into the metadata. Let me give that a try.

Copy link
Contributor

@etseidl etseidl Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        if options.enable_page_index {
            let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
                .with_page_indexes(options.enable_page_index);
            reader.read_page_indexes(&chunk_reader)?;
            metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
        }

I forgot to do this in #6450.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow why this is needed. What scenario does it help (I can write a test to cover it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean replace

        if options.enable_page_index {
            let mut columns_indexes = vec![];
            let mut offset_indexes = vec![];

            for rg in metadata_builder.row_groups().iter() {
                let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
                let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
                columns_indexes.push(column_index);
                offset_indexes.push(offset_index);
            }
            metadata_builder = metadata_builder
                .set_column_index(Some(columns_indexes))
                .set_offset_index(Some(offset_indexes));
        }

with the above code snippet from my earlier comment. This should be a bit more efficient since read_page_indexes will fetch the necessary bytes from the file in a single read, rather than 2 reads per row group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented something slightly different:

Since there is already a ParquetMetaDataReader created at the beginning of the function, I made the change to simply read it when needed.

One thing that might be different is that it looks like the current code may only read the column index/page index for row groups that passed the "predicates" but the ParquetMetadataReader reads the index for all the row groups.

That being said, no tests fail, so i am not sure if it is a real problem or not

Copy link
Contributor

@etseidl etseidl Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the ParquetMetaData. The split path from before would only read the page indexes for the remaining row groups.

We could prune the page indexes at the same time we're pruning the row groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could prune the page indexes at the same time we're pruning the row groups.

Yeah, that is probably mirrors the intent the most closely. How about I'll back out c0432e6 and we can address improving this code as a follow on PR (along with tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #6491 and reverted c0432e6

@alamb
Copy link
Contributor Author

alamb commented Sep 28, 2024

If you are planning to eventually deprecate the non-builder ParquetMetaData::new*, then this would also need to be switched to the builder.

That is a good point -- I wasn't planing to deprecate the functions, though we could argume that deprecating new_with_page_index might be a good idea 🤔

I deprecated new_with_page_index in b6c63a8

@alamb alamb merged commit 31d6891 into apache:master Oct 1, 2024
16 checks passed
@alamb
Copy link
Contributor Author

alamb commented Oct 1, 2024

Thanks again for the reviews @wiedld and @etseidl

Follow on work is tracked in #6491

@alamb alamb deleted the alamb/metadata_builder_api branch October 1, 2024 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add builder style API for manipulating ParquetMetaData
3 participants