
Conversation

etseidl
Contributor

@etseidl etseidl commented Aug 20, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

Parsing the column index is very slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns.

This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the datafusion side.
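
To make the cost concrete, here is a simplified sketch (not the crate's exact definitions) of the two layouts: the thrift column index arrives as a struct of arrays, while the existing Index path converts it into a per-page array of structs, paying one small allocation per page for binary min/max values.

```rust
// Thrift form: struct of arrays, roughly mirroring parquet.thrift
struct ThriftColumnIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>, // one heap buffer per page for binary columns
    max_values: Vec<Vec<u8>>,
    null_counts: Option<Vec<i64>>,
}

// Existing Index path: array of structs, one entry per page
struct PageIndexSketch<T> {
    min: Option<T>,
    max: Option<T>,
    null_count: Option<i64>,
}

// The conversion builds a Vec of per-page structs (cloning the byte-array
// buffers along the way), which is where the decode time goes.
fn to_array_of_structs(index: &ThriftColumnIndex) -> Vec<PageIndexSketch<Vec<u8>>> {
    index
        .min_values
        .iter()
        .zip(&index.max_values)
        .zip(&index.null_pages)
        .map(|((min, max), &null_page)| PageIndexSketch {
            min: if null_page { None } else { Some(min.clone()) },
            max: if null_page { None } else { Some(max.clone()) },
            null_count: None, // null_counts handling elided
        })
        .collect()
}
```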

What changes are included in this PR?

A new ColumnIndexMetaData enum is added along with a type parameterized NativeColumnIndex struct.
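
Roughly, the new shape would look like the sketch below; the variant and field names here are assumptions based on this description rather than the final API, but the key idea is that the data stays in struct-of-arrays form with min/max decoded to the physical type.

```rust
// Sketch only: names and fields are assumptions, not the merged API
#[allow(non_camel_case_types)]
pub enum ColumnIndexMetaData {
    BOOLEAN(NativeColumnIndex<bool>),
    INT32(NativeColumnIndex<i32>),
    INT64(NativeColumnIndex<i64>),
    DOUBLE(NativeColumnIndex<f64>),
    // ...one variant per physical type, mirroring the existing Index enum
}

// Struct of arrays: decoding is a handful of Vec allocations per column
// rather than one small allocation per page
pub struct NativeColumnIndex<T> {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<T>,
    pub max_values: Vec<T>,
    pub null_counts: Option<Vec<i64>>,
}
```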

Are these changes tested?

No, this is an experiment only. If this work can be honed into an acceptable Index replacement, then tests will be added at that time.

Are there any user-facing changes?

Yes, this would be a radical change to the column indexes in ParquetMetaData.

@github-actions bot added the parquet (Changes to the parquet crate) label Aug 20, 2025
@etseidl
Contributor Author

etseidl commented Aug 20, 2025

This PR does not include the offset index changes from #8190. Running the metadata benchmark shows a substantial reduction in decode time.

old
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]

new with Index
open(page index)        time:   [698.00 µs 699.75 µs 701.66 µs]

new with ColumnIndexMetaData
open(page index)        time:   [422.47 µs 423.85 µs 425.18 µs]

Combining both PRs gives:

open(page index)        time:   [340.96 µs 341.60 µs 342.30 µs]

Over 5X faster than where we started.

@etseidl
Contributor Author

etseidl commented Aug 20, 2025

CI failures are real...these changes are not yet fully integrated so column index parsing is broken at the moment. 🚧


/// index
#[allow(non_camel_case_types)]
pub enum ColumnIndexMetaData {
Contributor Author

@etseidl etseidl Aug 20, 2025


I followed what was done with the Index enum, but I'll admit I don't love this. We could instead drop the type bounds on NativeColumnIndex and rename it to ColumnIndexMetaData (this name was picked to mirror OffsetIndexMetaData). Then we could add functions to convert the min/max values to the proper types.

Refactored since I made this comment. Translating the stats for primitive types speeds things up even further.

@etseidl
Contributor Author

etseidl commented Aug 21, 2025

@alamb can you take a look when you have time? 🙏 Do you think the ColumnIndexMetaData here would work better for datafusion?

@alamb
Contributor

alamb commented Aug 23, 2025

@alamb can you take a look when you have time? 🙏 Do you think the ColumnIndexMetaData here would work better for datafusion?

I think we now have mostly switched DataFusion to using StatisticsConverter, not Index directly, so I don't actually expect this change to have much of an impact on DataFusion.
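
For context, a rough sketch of that usage pattern, assuming the current StatisticsConverter API (exact paths and signatures may differ slightly); a change to how Index is represented internally stays invisible to callers going through this interface:

```rust
use arrow_array::ArrayRef;
use arrow_schema::Schema;
use parquet::arrow::arrow_reader::statistics::StatisticsConverter;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Minimum values for one column, one entry per row group, as an Arrow array
fn column_mins(
    metadata: &ParquetMetaData,
    arrow_schema: &Schema,
    column: &str,
) -> Result<ArrayRef> {
    let converter = StatisticsConverter::try_new(
        column,
        arrow_schema,
        metadata.file_metadata().schema_descr(),
    )?;
    converter.row_group_mins(metadata.row_groups().iter())
}
```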

I am looking at this PR now

Contributor

@alamb alamb left a comment


I like this a lot -- thank you @etseidl -- I found it much easier to understand than the existing Index structure


/// Common bits of the column index
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex {
Contributor


I really like how this structure now mirrors the actual parquet definition: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1154-L1214

👍

column_index.append(false, vec![1u8], vec![2u8, 3u8], 4);
let column_index = column_index.build_to_thrift();
let native_index = NativeIndex::<bool>::try_new(column_index).unwrap();
let native_index = PrimitiveColumnIndex::<bool> {
Contributor


this is so much nicer I think

@etseidl
Contributor Author

etseidl commented Aug 26, 2025

I've got a good start on the write path now, so I'd like to merge this. But I would like a few more eyes on this one, given the radical change to the form of the column index.

And while we're breaking things, do we actually like the column and page indexes as Vec<Vec<index>>? I'm wondering if tacking them individually onto the ColumnChunkMetaData as Option<index> would be more intuitive? Or perhaps wrap them in Arc and store them in both places? Or if the primary use of this data is via the StatisticsConverter, does this not really matter?
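
For discussion, a purely hypothetical sketch of that alternative; none of these fields exist today, and the Arc is only there to show the "store in both places" variant:

```rust
use std::sync::Arc;

// Placeholders standing in for the index types discussed in this PR
struct ColumnIndexMetaData;
struct OffsetIndexMetaData;

// Hypothetical: hang the indexes off each column chunk instead of (or in
// addition to) the parallel Vec<Vec<...>> collections on ParquetMetaData
struct ColumnChunkMetaDataSketch {
    // ...existing per-column-chunk fields elided...
    column_index: Option<Arc<ColumnIndexMetaData>>,
    offset_index: Option<Arc<OffsetIndexMetaData>>,
}
```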

CC @jhorstmann @tustvold @XiangpengHao

@adriangb you've also been in the metadata space, perhaps you have some opinions

@adriangb
Contributor

@adriangb you've also been in the metadata space, perhaps you have some opinions

I don't have a super strong opinion on the structure of the metadata as loaded from Thrift, but what I will say is I would love it if we could break up metadata loading into pieces. This is partly something DataFusion could do, partly up to Parquet itself. In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data. IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index types and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.

@alamb
Contributor

alamb commented Aug 27, 2025

And while we're breaking things, do we actually like the column and page indexes as Vec<Vec<index>>?

I do not like this structure and find it mind bending any time I have to deal with it

I'm wondering if tacking them individually onto the ColumnChunkMetaData as Option<index> would be more intuitive?

I do think this would be great (even though then the rust metadata doesn't match the physical representation in the Parquet file). If we could do the same with BloomFilters that would also resolve an outstanding wart with the API.

The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed

This is partly something DataFusion could do, partly up to Parquet itself. In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data.
IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index types and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.

I don't think we can change the location of the row group statistics (as that is part of the format)

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.
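
As a purely hypothetical illustration of that idea (all names invented; no such API exists in the crate today):

```rust
// Hypothetical only: skip statistics on the first pass, decode on demand later
pub struct FooterDecodeOptions {
    /// When true, leave per-column Statistics undecoded while parsing the footer
    pub skip_row_group_statistics: bool,
}

pub struct LazyFooter {
    /// Raw footer bytes kept around so statistics can be decoded later
    raw: Vec<u8>,
    options: FooterDecodeOptions,
}

/// Placeholder standing in for the real parquet Statistics type
pub struct ColumnStats;

impl LazyFooter {
    /// Decode statistics for a single (row group, column) pair only when a
    /// caller actually needs them
    pub fn column_statistics(&self, _row_group: usize, _column: usize) -> Option<ColumnStats> {
        // ...re-parse just the relevant thrift slice of self.raw here...
        None
    }
}
```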

This is similar to my understanding of what @adrian-thurston suggested in this DataFusion issue:

@etseidl
Contributor Author

etseidl commented Aug 27, 2025

I do think this would be great (even though then the rust metadata doesn't match the physical representation in the Parquet file). If we could do the same with BloomFilters that would also resolve an outstanding wart with the API.

The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed

I was thinking something along the lines of the ParquetMetaDataBuilder that would allow for adding the indexes to the appropriate column. Once the dust settles a bit I'll give this a try and see what the implications are for the StatisticsConverter. It should be a bit more natural to get the page stats out, but we'll see. It might also make projections a bit simpler. But the read side may be trickier if we only want a few columns for the column index...right now we pretty much read the entire page index to do fewer reads.

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

Yes, this sort of thing is exactly why I started on this remodel...to allow selectively reading bits of the metadata we want without having to parse the entire footer every time we touch a file.

I'm working on the page reading now, and really want to just skip over the page stats if they exist...I don't believe they are actually used at all. Do we still even write them?

Unrelated note...@alamb and @jhorstmann were right...better to do the reads with a trait rather than my weird TryFrom approach. I forgot about reading the page headers, where we don't know the size up front, so I need to use a Read to parse them rather than &[u8]. This will be corrected in an upcoming PR. Still learning...

@adriangb
Contributor

I don't think we can change the location of the row group statistics (as that is part of the format)

What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

In my mind what would happen is:

  1. We create a new index type for row group stats that is stored more efficiently and can be optionally read (including reading single columns, etc., as if it were itself a parquet file).
  2. Users can opt in by not writing row group stats and writing the new index. They could write both for backwards compatibility with readers that don't support the new index type, but that might not be as performant on the read side for readers that do support the new index. Maybe not too bad if we can avoid parsing it as @etseidl is proposing but we'd still pay the IO price.

@etseidl
Contributor Author

etseidl commented Aug 27, 2025

Going to merge this now. We can continue to discuss the final form of the page indexes here or on #5854 (or add a new issue).

@etseidl etseidl merged commit f777584 into apache:gh5854_thrift_remodel Aug 27, 2025
16 checks passed
@alamb
Contributor

alamb commented Sep 5, 2025

I don't think we can change the location of the row group statistics (as that is part of the format)
What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

In my mind what would happen is:

  1. We create a new index type for row group stats that is stored more efficiently and can be optionally read (including reading single columns, etc., as if it were itself a parquet file).
  2. Users can opt in by not writing row group stats and writing the new index. They could write both for backwards compatibility with readers that don't support the new index type, but that might not be as performant on the read side for readers that do support the new index. Maybe not too bad if we can avoid parsing it as @etseidl is proposing but we'd still pay the IO price.

FWIW this seems like a good idea to me for a particular system, but seems overly specialized for inclusion in the general arrow-rs parquet implementation.

What I think does belong in the arrow-rs implementation is enough API hooks to make implementing such a specialized format feasible.

I think it would be highly interesting to have an example that shows how to "add a custom index that is more efficient for whatever usecase, and then only write minimal metadata in the main footer"

That could potentially satisfy the usecase of "we have files with 10000 columns and the statistics take too long to read"
