[thrift-remodel] PoC new form for column index #8191
Conversation
This PR does not include the offset index changes from #8190. Running the metadata benchmark shows a substantial reduction in decode time.
Combining both PRs gives:
Over 5X faster than where we started.
CI failures are real... these changes are not yet fully integrated, so column index parsing is broken at the moment. 🚧
```rust
/// index
#[allow(non_camel_case_types)]
pub enum ColumnIndexMetaData {
```
I followed what was done with the Index enum, but I'll admit I don't love this. We could instead drop the type bounds on NativeColumnIndex and rename that ColumnIndexMetaData (this name was picked to mirror OffsetIndexMetaData). Then we could add functions to convert the min/max values to the proper types.
Refactored since I made this comment. Translating the stats for primitive types speeds things up even further.
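As a sketch of the design being discussed (hypothetical field names, not the actual PR code): an outer enum dispatches over a struct parameterized by the physical type, so min/max values are stored natively rather than as raw bytes.

```rust
// Hedged sketch of the enum-over-typed-struct design; fields are assumptions.

/// Per-type column index storing min/max as native values.
#[derive(Debug, Clone, PartialEq)]
pub struct PrimitiveColumnIndex<T> {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<T>,
    pub max_values: Vec<T>,
    pub null_counts: Option<Vec<i64>>,
}

/// Outer enum, named to mirror OffsetIndexMetaData.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnIndexMetaData {
    BOOLEAN(PrimitiveColumnIndex<bool>),
    INT32(PrimitiveColumnIndex<i32>),
    INT64(PrimitiveColumnIndex<i64>),
    DOUBLE(PrimitiveColumnIndex<f64>),
    // byte-array variants would likely store offsets into a shared buffer
}

fn main() {
    let idx = ColumnIndexMetaData::INT32(PrimitiveColumnIndex {
        null_pages: vec![false, false],
        min_values: vec![1, 10],
        max_values: vec![9, 20],
        null_counts: None,
    });
    // Consumers match on the enum to get typed min/max values directly.
    if let ColumnIndexMetaData::INT32(i) = &idx {
        assert_eq!(i.min_values, vec![1, 10]);
    }
}
```

This avoids the per-value decoding that a bytes-only representation would force on every consumer.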
@alamb can you take a look when you have time? 🙏 Do you think the
I think we now have mostly switched DataFusion to using it. I am looking at this PR now.
I like this a lot -- thank you @etseidl -- I found it much easier to understand than the existing Index structure.
```rust
/// Common bits of the column index
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex {
```
I really like how this structure now mirrors the actual parquet definition: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1154-L1214
👍
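For reference, the linked parquet.thrift ColumnIndex has roughly this shape, transcribed here into plain Rust as a sketch (not arrow-rs code):

```rust
// Sketch transcription of parquet.thrift's ColumnIndex struct-of-arrays layout.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BoundaryOrder {
    Unordered,
    Ascending,
    Descending,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex {
    pub null_pages: Vec<bool>,         // required: which pages are all-null
    pub min_values: Vec<Vec<u8>>,      // required: encoded per-page minimums
    pub max_values: Vec<Vec<u8>>,      // required: encoded per-page maximums
    pub boundary_order: BoundaryOrder, // required
    pub null_counts: Option<Vec<i64>>, // optional
    pub repetition_level_histograms: Option<Vec<i64>>, // optional
    pub definition_level_histograms: Option<Vec<i64>>, // optional
}

fn main() {
    let idx = ColumnIndex {
        null_pages: vec![false],
        min_values: vec![vec![1u8]],
        max_values: vec![vec![9u8]],
        boundary_order: BoundaryOrder::Ascending,
        null_counts: Some(vec![0]),
        repetition_level_histograms: None,
        definition_level_histograms: None,
    };
    // The parallel vectors all have one entry per page.
    assert_eq!(idx.min_values.len(), idx.max_values.len());
}
```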
```rust
column_index.append(false, vec![1u8], vec![2u8, 3u8], 4);
let column_index = column_index.build_to_thrift();
let native_index = NativeIndex::<bool>::try_new(column_index).unwrap();
let native_index = PrimitiveColumnIndex::<bool> {
```
this is so much nicer I think
I've got a good start on the write path now, so I'd like to merge this. But I would like a few more eyes on this one, given the radical change to the form of the column index. And while we're breaking things, do we actually like the column and page indexes as they are?
CC @jhorstmann @tustvold @XiangpengHao @adriangb: you've also been in the metadata space, perhaps you have some opinions.
I don't have a super strong opinion on the structure of the metadata as loaded from Thrift, but what I will say is I would love if we could break up metadata loading into pieces. This is partly something DataFusion could do, partly up to Parquet itself.
In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data. IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index type and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps, and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.
I do not like this structure and find it mind bending any time I have to deal with it.
I do think this would be great (even though then the Rust metadata wouldn't match the physical representation in the Parquet file). If we could do the same with BloomFilters, that would also resolve an outstanding wart with the API. The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed.
I don't think we can change the location of the row group statistics (as that is part of the format). What we could do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed. This is similar to my understanding of what @adrian-thurston suggested in this DataFusion issue:
I was thinking something along the lines of the
Yes, this sort of thing is exactly why I started on this remodel... to allow selectively reading the bits of the metadata we want without having to parse the entire footer every time we touch a file. I'm working on the page reading now, and really want to just skip over the page stats if they exist... I don't believe they are actually used at all. Do we still even write them?
Unrelated note... @alamb and @jhorstmann were right... better to do the reads with a trait rather than my weird
In my mind what would happen is:
Going to merge this now. We can continue to discuss the final form of the page indexes here or on #5854 (or add a new issue).
FWIW this seems like a good idea to me for a particular system, but seems overly specialized for inclusion in the general arrow-rs parquet implementation. What I think does belong in the arrow-rs implementation is enough API hooks to make implementing such a specialized format feasible.
I think it would be highly interesting to have an example that shows how to "add a custom index that is more efficient for whatever use case, and then only write minimal metadata in the main footer". That could potentially satisfy the use case of "we have files with 10000 columns and the statistics take too long to read".
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
Parsing the column index is very slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns.
This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the datafusion side.
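The struct-of-arrays to array-of-structs conversion described above can be illustrated with a toy sketch (hypothetical types, not arrow-rs code): every page's min and max becomes its own owned allocation.

```rust
// Toy illustration of why SoA -> AoS conversion is allocation-heavy for
// binary columns; types here are hypothetical, not arrow-rs code.

/// Struct of arrays, as in the parquet.thrift ColumnIndex.
struct ThriftColumnIndex {
    min_values: Vec<Vec<u8>>,
    max_values: Vec<Vec<u8>>,
}

/// Array of structs, as in the old per-page Index representation.
struct PageIndex {
    min: Vec<u8>,
    max: Vec<u8>,
}

fn to_aos(idx: &ThriftColumnIndex) -> Vec<PageIndex> {
    idx.min_values
        .iter()
        .zip(&idx.max_values)
        // Two fresh heap allocations per page, on top of the thrift decode.
        .map(|(min, max)| PageIndex {
            min: min.clone(),
            max: max.clone(),
        })
        .collect()
}

fn main() {
    let idx = ThriftColumnIndex {
        min_values: vec![vec![1], vec![2]],
        max_values: vec![vec![9], vec![8]],
    };
    let pages = to_aos(&idx);
    // 10,000 pages would mean ~20,000 small allocations just for min/max.
    assert_eq!(pages.len(), 2);
}
```

Keeping the struct-of-arrays form sidesteps this entirely: the decoded vectors are stored once and indexed per page.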
What changes are included in this PR?
A new ColumnIndexMetaData enum is added along with a type-parameterized NativeColumnIndex struct.
Are these changes tested?
No, this is an experiment only. If this work can be honed into an acceptable Index replacement, then tests will be added at that time.
Are there any user-facing changes?
Yes, this would be a radical change to the column indexes in ParquetMetaData.