
Conversation

etseidl
Contributor

@etseidl etseidl commented Aug 20, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

Parsing the column index is very slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns.

This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the datafusion side.
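
To make the cost concrete, here is a simplified sketch (not the crate's exact definitions) of the two layouts: the thrift column index arrives as a struct of arrays, while the existing Index path converts it into a per-page array of structs, paying one small allocation per page for binary min/max values.

```rust
// Thrift form: struct of arrays, roughly mirroring parquet.thrift
struct ThriftColumnIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>, // one heap buffer per page for binary columns
    max_values: Vec<Vec<u8>>,
    null_counts: Option<Vec<i64>>,
}

// Existing Index path: array of structs, one entry per page
struct PageIndexSketch<T> {
    min: Option<T>,
    max: Option<T>,
    null_count: Option<i64>,
}

// The conversion builds a Vec of per-page structs (cloning the byte-array
// buffers along the way), which is where the decode time goes.
fn to_array_of_structs(index: &ThriftColumnIndex) -> Vec<PageIndexSketch<Vec<u8>>> {
    index
        .min_values
        .iter()
        .zip(&index.max_values)
        .zip(&index.null_pages)
        .map(|((min, max), &null_page)| PageIndexSketch {
            min: if null_page { None } else { Some(min.clone()) },
            max: if null_page { None } else { Some(max.clone()) },
            null_count: None, // null_counts handling elided
        })
        .collect()
}
```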

What changes are included in this PR?

A new ColumnIndexMetaData enum is added along with a type parameterized NativeColumnIndex struct.
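
Roughly, the new shape would look like the sketch below; the variant and field names here are assumptions based on this description rather than the final API, but the key idea is that the data stays in struct-of-arrays form with min/max decoded to the physical type.

```rust
// Sketch only: names and fields are assumptions, not the merged API
#[allow(non_camel_case_types)]
pub enum ColumnIndexMetaData {
    BOOLEAN(NativeColumnIndex<bool>),
    INT32(NativeColumnIndex<i32>),
    INT64(NativeColumnIndex<i64>),
    DOUBLE(NativeColumnIndex<f64>),
    // ...one variant per physical type, mirroring the existing Index enum
}

// Struct of arrays: decoding is a handful of Vec allocations per column
// rather than one small allocation per page
pub struct NativeColumnIndex<T> {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<T>,
    pub max_values: Vec<T>,
    pub null_counts: Option<Vec<i64>>,
}
```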

Are these changes tested?

No, this is an experiment only. If this work can be honed into an acceptable Index replacement, then tests will be added at that time.

Are there any user-facing changes?

Yes, this would be a radical change to the column indexes in ParquetMetaData.

@github-actions bot added the parquet (Changes to the parquet crate) label Aug 20, 2025
@etseidl
Contributor Author

etseidl commented Aug 20, 2025

This PR does not include the offset index changes from #8190. Running the metadata benchmark shows a substantial reduction in decode time.

old
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]

new with Index
open(page index)        time:   [698.00 µs 699.75 µs 701.66 µs]

new with ColumnIndexMetaData
open(page index)        time:   [422.47 µs 423.85 µs 425.18 µs]

Combining both PRs gives:

open(page index)        time:   [340.96 µs 341.60 µs 342.30 µs]

Over 5X faster than where we started.

@etseidl
Contributor Author

etseidl commented Aug 20, 2025

CI failures are real...these changes are not yet fully integrated so column index parsing is broken at the moment. 🚧


/// index
#[allow(non_camel_case_types)]
pub enum ColumnIndexMetaData {
Contributor Author

@etseidl etseidl Aug 20, 2025


I followed what was done with the Index enum, but I'll admit I don't love this. We could instead drop the type bounds on NativeColumnIndex and rename it to ColumnIndexMetaData (this name was picked to mirror OffsetIndexMetaData). Then we could add functions to convert the min/max values to the proper types.

Refactored since I made this comment. Translating the stats for primitive types speeds things up even further.

@etseidl
Contributor Author

etseidl commented Aug 21, 2025

@alamb can you take a look when you have time? 🙏 Do you think the ColumnIndexMetaData here would work better for datafusion?

@alamb
Contributor

alamb commented Aug 23, 2025

@alamb can you take a look when you have time? 🙏 Do you think the ColumnIndexMetaData here would work better for datafusion?

I think we now have mostly switched DataFusion to using StatisticsConverter, not Index directly, so I don't actually expect this change to have much of an impact on DataFusion.
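
For context, a rough sketch of that usage pattern, assuming the current StatisticsConverter API (exact paths and signatures may differ slightly); a change to how Index is represented internally stays invisible to callers going through this interface:

```rust
use arrow_array::ArrayRef;
use arrow_schema::Schema;
use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Minimum values for one column, one entry per row group, as an Arrow array
fn column_mins(
    metadata: &ParquetMetaData,
    arrow_schema: &Schema,
    column: &str,
) -> Result<ArrayRef> {
    let converter = StatisticsConverter::try_new(
        column,
        arrow_schema,
        metadata.file_metadata().schema_descr(),
    )?;
    converter.row_group_mins(metadata.row_groups().iter())
}
```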

I am looking at this PR now

Contributor

@alamb alamb left a comment


I like this a lot -- thank you @etseidl -- I found it much easier to understand than the existing Index structure


/// Common bits of the column index
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex {
Contributor


I really like how this structure now mirrors the actual parquet definition: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1154-L1214

👍

column_index.append(false, vec![1u8], vec![2u8, 3u8], 4);
let column_index = column_index.build_to_thrift();
let native_index = NativeIndex::<bool>::try_new(column_index).unwrap();
let native_index = PrimitiveColumnIndex::<bool> {
Contributor


this is so much nicer I think

@etseidl
Contributor Author

etseidl commented Aug 26, 2025

I've got a good start on the write path now, so I'd like to merge this. But I would like a few more eyes on this one, given the radical change to the form of the column index.

And while we're breaking things, do we actually like the column and page indexes as Vec<Vec<index>>? I'm wondering if tacking them individually onto the ColumnChunkMetaData as Option<index> would be more intuitive? Or perhaps wrap them in Arc and store them in both places? Or if the primary use of this data is via the StatisticsConverter, does this not really matter?
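
For discussion, a purely hypothetical sketch of that alternative; none of these fields exist today, and the Arc is only there to show the "store in both places" variant:

```rust
use std::sync::Arc;

// Placeholders standing in for the index types discussed in this PR
struct ColumnIndexMetaData;
struct OffsetIndexMetaData;

// Hypothetical: hang the indexes off each column chunk instead of (or in
// addition to) the parallel Vec<Vec<...>> collections on ParquetMetaData
struct ColumnChunkMetaDataSketch {
    // ...existing per-column-chunk fields elided...
    column_index: Option<Arc<ColumnIndexMetaData>>,
    offset_index: Option<Arc<OffsetIndexMetaData>>,
}
```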

CC @jhorstmann @tustvold @XiangpengHao

@adriangb you've also been in the metadata space, perhaps you have some opinions

@adriangb
Contributor

@adriangb you've also been in the metadata space, perhaps you have some opinions

I don't have a super strong opinion on the structure of the metadata as loaded from Thrift, but what I will say is I would love it if we could break up metadata loading into pieces. This is partly something DataFusion could do, partly up to Parquet itself. In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data. IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index types and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.

@alamb
Contributor

alamb commented Aug 27, 2025

And while we're breaking things, do we actually like the column and page indexes as Vec<Vec<index>>?

I do not like this structure and find it mind bending any time I have to deal with it

I'm wondering if tacking them individually onto the ColumnChunkMetaData as Option<index> would be more intuitive?

I do think this would be great (even though then the rust metadata doesn't match the physical representation in the Parquet file). If we could do the same with BloomFilters that would also resolve an outstanding wart with the API.

The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed

This is partly something DataFusion could do, partly up to Parquet itself. In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data.
IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index types and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.

I don't think we can change the location of the row group statistics (as that is part of the format)

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.
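
As a purely hypothetical illustration of that idea (all names invented; no such API exists in the crate today):

```rust
// Hypothetical only: skip statistics on the first pass, decode on demand later
pub struct FooterDecodeOptions {
    /// When true, leave per-column Statistics undecoded while parsing the footer
    pub skip_row_group_statistics: bool,
}

pub struct LazyFooter {
    /// Raw footer bytes kept around so statistics can be decoded later
    raw: Vec<u8>,
    options: FooterDecodeOptions,
}

/// Placeholder standing in for the real parquet Statistics type
pub struct ColumnStats;

impl LazyFooter {
    /// Decode statistics for a single (row group, column) pair only when a
    /// caller actually needs them
    pub fn column_statistics(&self, _row_group: usize, _column: usize) -> Option<ColumnStats> {
        // ...re-parse just the relevant thrift slice of self.raw here...
        None
    }
}
```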

This is similar to my understanding of what @adrian-thurston suggested in this DataFusion issue:

@etseidl
Contributor Author

etseidl commented Aug 27, 2025

I do think this would be great (even though then the rust metadata doesn't match the physical representation in the Parquet file). If we could do the same with BloomFilters that would also resolve an outstanding wart with the API.

The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed

I was thinking something along the lines of the ParquetMetaDataBuilder that would allow for adding the indexes to the appropriate column. Once the dust settles a bit I'll give this a try and see what the implications are for the StatisticsConverter. It should be a bit more natural to get the page stats out, but we'll see. It might also make projections a bit simpler. But the read side may be trickier if we only want a few columns for the column index...right now we pretty much read the entire page index to do fewer reads.

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

Yes, this sort of thing is exactly why I started on this remodel...to allow selectively reading bits of the metadata we want without having to parse the entire footer every time we touch a file.

I'm working on the page reading now, and really want to just skip over the page stats if they exist...I don't believe they are actually used at all. Do we still even write them?

Unrelated note...@alamb and @jhorstmann were right...better to do the reads with a trait rather than my weird TryFrom approach. I forgot about reading the page headers, where we don't know the size up front, so I need to use a Read to parse them rather than &[u8]. This will be corrected in an upcoming PR. Still learning...

@adriangb
Contributor

I don't think we can change the location of the row group statistics (as that is part of the format)

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

In my mind what would happen is:

  1. We create a new index type for row group stats that is stored more efficiently and can be optionally read (including reading single columns, etc., as if it were itself a parquet file).
  2. Users can opt in by not writing row group stats and writing the new index. They could write both for backwards compatibility with readers that don't support the new index type, but that might not be as performant on the read side for readers that do support the new index. Maybe not too bad if we can avoid parsing it as @etseidl is proposing but we'd still pay the IO price.

@etseidl
Contributor Author

etseidl commented Aug 27, 2025

Going to merge this now. We can continue to discuss the final form of the page indexes here or on #5854 (or add a new issue).

@etseidl etseidl merged commit f777584 into apache:gh5854_thrift_remodel Aug 27, 2025
16 checks passed
@alamb
Contributor

alamb commented Sep 5, 2025

I don't think we can change the location of the row group statistics (as that is part of the format)
What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

In my mind what would happen is:

  1. We create a new index type for row group stats that is stored more efficiently and can be optionally read (including reading single columns, etc., as if it were itself a parquet file).
  2. Users can opt in by not writing row group stats and writing the new index. They could write both for backwards compatibility with readers that don't support the new index type, but that might not be as performant on the read side for readers that do support the new index. Maybe not too bad if we can avoid parsing it as @etseidl is proposing but we'd still pay the IO price.

FWIW this seems like a good idea to me for a particular system, but seems overly specialized for inclusion in the general arrow-rs parquet implementation.

What I think does belong in the arrow-rs implementation is enough API hooks to make implementing such a specialized format feasible.

I think it would be highly interesting to have an example that shows how to "add a custom index that is more efficient for whatever usecase, and then only write minimal metadata in the main footer"

That could potentially satisfy the usecase of "we have files with 10000 columns and the statistics take too long to read"
