@etseidl etseidl commented Oct 1, 2025

Which issue does this PR close?

Rationale for this change

See the issue; this change is needed to allow greater control over Parquet metadata handling. These changes speed up Thrift decoding by about 2X and enable further optimizations down the line, such as selectively decoding only the metadata that is needed.

What changes are included in this PR?

In short, the `format` module is no longer necessary. The parquet crate now decodes Thrift-encoded bytes straight to the metadata objects it defines (`ParquetMetaData` and children).

Are these changes tested?

Yes, but more tests should be added in follow-on PRs.

Are there any user-facing changes?

Yes, many.

etseidl and others added 30 commits August 6, 2025 14:36
…8048)

**Note: this targets a feature branch, not main**

# Which issue does this PR close?

- Part of #5854.
- related to #6129

# Rationale for this change

This is the first step towards a rework of Parquet metadata handling.
This PR attempts to remove as many references as possible to the
`parquet::format` module in the public API. This is done by creating new
enums and structs that mirror their `format` counterparts and using them
in publicly exposed structures like `FileMetaData`.

# What changes are included in this PR?

# Are these changes tested?

Current tests should suffice for now. More thorough tests will be added
as needed.

# Are there any user-facing changes?

Yes, public-facing interfaces should no longer expose `format`.
# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.
- Related to #6129

# Rationale for this change

Next step.

# What changes are included in this PR?

Attempt to make use of macros originally developed by @jhorstmann to
implement some thrift enums and unions in `basic.rs`. Also adds yet
another spin on a thrift decoder based heavily on
`TCompactSliceInputProtocol`. The current approach is to use this new
decoder with `TryFrom` impls on the data structures in question.
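The `TryFrom` pattern described above might look roughly like this. This is a minimal sketch; `ThriftSliceReader`, the enum values, and the error type are illustrative stand-ins, not the crate's actual types:

```rust
// Hypothetical slice-backed reader, loosely modeled on the
// TCompactSliceInputProtocol idea described above.
struct ThriftSliceReader<'a> {
    buf: &'a [u8],
}

impl<'a> ThriftSliceReader<'a> {
    /// Consume and return one byte from the underlying slice.
    fn read_byte(&mut self) -> Result<u8, String> {
        let (first, rest) = self.buf.split_first().ok_or_else(|| "eof".to_string())?;
        self.buf = rest;
        Ok(*first)
    }
}

// A metadata enum decoded via a TryFrom impl on the raw thrift value.
#[derive(Debug, PartialEq)]
enum Compression {
    Uncompressed,
    Snappy,
}

impl TryFrom<u8> for Compression {
    type Error = String;
    fn try_from(v: u8) -> Result<Self, String> {
        match v {
            0 => Ok(Compression::Uncompressed),
            1 => Ok(Compression::Snappy),
            other => Err(format!("unknown compression codec {other}")),
        }
    }
}

fn main() -> Result<(), String> {
    let mut reader = ThriftSliceReader { buf: &[1, 0] };
    assert_eq!(Compression::try_from(reader.read_byte()?)?, Compression::Snappy);
    assert_eq!(Compression::try_from(reader.read_byte()?)?, Compression::Uncompressed);
    Ok(())
}
```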

This PR does not yet complete the process far enough to parse a parquet
footer, but I'm putting it out here to get early feedback on the design.

Some structures already deviate enough from their thrift counterparts
that the macro based parsing is not practical. Likewise, the macro
approach doesn't work for structs that need lifetime annotations
(necessary for reading binary fields as slices).
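To illustrate the lifetime issue (all names here are hypothetical): a struct that borrows its binary fields as slices must carry a lifetime parameter tied to the input buffer, which a simple derive-style macro cannot easily generate:

```rust
// A struct whose binary fields are zero-copy views into the input buffer.
// The lifetime 'a ties the struct to the buffer it was decoded from.
struct BinaryStats<'a> {
    min_value: &'a [u8],
    max_value: &'a [u8],
}

/// Advance the cursor past `len` bytes and return them as a borrowed slice.
fn read_binary<'a>(buf: &mut &'a [u8], len: usize) -> &'a [u8] {
    let (head, tail) = buf.split_at(len);
    *buf = tail;
    head
}

fn main() {
    let data = [1u8, 2, 9, 9];
    let mut cursor: &[u8] = &data;
    // No allocation: both fields point back into `data`.
    let stats = BinaryStats {
        min_value: read_binary(&mut cursor, 2),
        max_value: read_binary(&mut cursor, 2),
    };
    assert_eq!(stats.min_value, &[1, 2]);
    assert_eq!(stats.max_value, &[9, 9]);
}
```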

# Are these changes tested?

Not yet.

# Are there any user-facing changes?

Yes.
…aData` (#8111)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

# What changes are included in this PR?
This PR completes reading of the `FileMetaData` and `RowGroupMetaData`
pieces of the `ParquetMetaData`. Column indexes and encryption will be
follow-on work.

This replaces the macro for generating structs with a more general one
that can take visibility and lifetime specifiers. Also (temporarily)
adds a new function `ParquetMetaDataReader::decode_file_metadata` which
should be a drop-in replacement for
`ParquetMetaDataReader::decode_metadata`.

Still todo:

1. Add some tests that verify this produces the same output as
`ParquetMetaDataReader::decode_metadata`
2. Read column indexes with new decoder
3. Read page headers with new decoder
4. Integrate with @alamb's push decoder work #8080

# Are these changes tested?

Not yet

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**


- Part of #5854.

# Rationale for this change

Speed

# What changes are included in this PR?

Still a work in progress, but begins the process of converting page
index parsing to the new thrift decoder.

# Are these changes tested?

This PR actually uses the new decoder when parsing the page indexes
using the existing machinery. As such all tests involving the page
indexes should apply to this code.

# Are there any user-facing changes?

Yes
# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Parsing the column index is _very_ slow. The largest part of the cost is
taking the thrift structure (which is a struct of arrays) and converting
it to an array of structs. This results in a large number of allocations
when dealing with binary columns.

This is an experiment in creating a new structure to hold the column
index info that is a little friendlier to parse. It may also be easier
to consume on the datafusion side.
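A minimal sketch of the struct-of-arrays idea, with hypothetical field names (this is not the actual `NativeColumnIndex` definition): per-page values live in contiguous vectors rather than one allocation per page.

```rust
// Struct-of-arrays column index sketch: one vector per statistic,
// indexed by page number, instead of a Vec of per-page structs.
struct NativeColumnIndex<T> {
    null_pages: Vec<bool>,
    min_values: Vec<T>,
    max_values: Vec<T>,
    null_counts: Option<Vec<i64>>,
}

impl<T: Copy> NativeColumnIndex<T> {
    /// Min value for a page, or None if the page is all nulls.
    fn min(&self, page: usize) -> Option<T> {
        (!self.null_pages[page]).then(|| self.min_values[page])
    }
}

fn main() {
    let idx = NativeColumnIndex {
        null_pages: vec![false, true],
        min_values: vec![1i64, 0],
        max_values: vec![10i64, 0],
        null_counts: None,
    };
    assert_eq!(idx.min(0), Some(1));
    assert_eq!(idx.min(1), None); // all-null page has no min
    assert_eq!(idx.max_values[0], 10);
    let _ = idx.null_counts;
}
```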

# What changes are included in this PR?
A new `ColumnIndexMetaData` enum is added along with a type
parameterized `NativeColumnIndex` struct.

# Are these changes tested?
No, this is an experiment only. If this work can be honed into an
acceptible `Index` replacement, then tests will be added at that time.

# Are there any user-facing changes?
Yes, this would be a radical change to the column indexes in
`ParquetMetaData`.
…ng of page indexes (#8190)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**


- Part of #5854.

# Rationale for this change

Add a custom parser for `PageLocation` as the decoding of this struct is
one of several hot spots.

# What changes are included in this PR?

This adds a faster means of obtaining the struct field ids to
`ThriftCompactInputProtocol`. For a small struct (3 fields) with all of
them required, we can save a good bit of time by bypassing
`ThriftCompactInputProtocol::read_field_begin`, which is very general and
can handle out-of-order fields, among other things. By adding a new
function `read_field_header`, we can avoid the costly branching that
occurs when calculating the new field id (as well as special handling
needed for boolean fields). Field validation is then handled on the
consuming side while decoding the `PageLocation` struct.

Note that to obtain the speed-up seen, we need to assume the fields will
always be in order, and the field ids will all be encoded as field
deltas. This is probably a fairly safe assumption, but there does exist
the possibility of custom thrift writers that use absolute field ids. If
we encounter such a writer in the wild, this change will need to be
reverted.
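For reference, here is a sketch of such a fast-path header read under the stated assumption (short-form, delta-encoded field ids only). The `read_field_header` below is a hand-written stand-in, not the crate's actual function:

```rust
/// Read one compact-protocol field header byte, assuming the short form:
/// high nibble = field-id delta (1..=15), low nibble = wire type.
/// Returns None on the STOP field (0x00).
fn read_field_header(buf: &mut &[u8], last_field_id: i16) -> Option<(i16, u8)> {
    let (first, rest) = buf.split_first()?;
    *buf = rest;
    if *first == 0 {
        return None; // STOP: end of struct
    }
    let delta = (first >> 4) as i16;
    let field_type = first & 0x0f;
    // The fast path skips the long form (delta == 0, absolute id follows);
    // a writer using absolute field ids would violate this assumption.
    assert!(delta != 0, "absolute field ids not supported by fast path");
    Some((last_field_id + delta, field_type))
}

fn main() {
    // 0x15 = delta 1, type 5; 0x26 = delta 2, type 6; 0x00 = STOP
    let mut data: &[u8] = &[0x15, 0x26, 0x00];
    assert_eq!(read_field_header(&mut data, 0), Some((1, 5)));
    assert_eq!(read_field_header(&mut data, 1), Some((3, 6)));
    assert_eq!(read_field_header(&mut data, 3), None);
}
```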

# Are these changes tested?

These changes should be covered by existing tests.

# Are there any user-facing changes?

None beyond the changes in this branch.
# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Begins adding custom thrift write support.

# What changes are included in this PR?

Adds traits to aid in writing of thrift and modifies thrift macros to
generate writing code.

# Are these changes tested?

Yes, this adds some roundtrip tests to validate that encoded data can be
decoded back to the same state.
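The roundtrip-test shape can be illustrated with the compact protocol's zigzag integer encoding (a self-contained example, not the PR's actual test code):

```rust
/// Zigzag-encode an i32 so small magnitudes (positive or negative)
/// map to small unsigned values, as in the thrift compact protocol.
fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

/// Inverse of zigzag_encode.
fn zigzag_decode(v: u32) -> i32 {
    ((v >> 1) as i32) ^ -((v & 1) as i32)
}

fn main() {
    // Roundtrip test: encoding then decoding must restore the original.
    for v in [0, 1, -1, 123_456, -123_456, i32::MAX, i32::MIN] {
        assert_eq!(zigzag_decode(zigzag_encode(v)), v);
    }
}
```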

# Are there any user-facing changes?

No
…ter decryption code (#8313)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Continuing the remodel

# What changes are included in this PR?

This begins the process of replacing the current footer parsing code
with the new version. As part of this, much of the decryption machinery
also needed to be changed.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

As I started on decoding thrift page headers, I found that the approach
I had been taking was not going to work. This PR begins the process of
abstracting the thrift reader to allow for other implementations.

# What changes are included in this PR?

In addition to reworking the reader itself, this PR moves away from the
previous `TryFrom` approach and instead adds a `ReadThrift` trait.
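A rough sketch of what such a trait-based abstraction can look like (all names are illustrative, and the toy fixed-width i32 encoding stands in for the real compact protocol):

```rust
// Abstraction over thrift input sources, so decode logic is written once
// and works for both slice-backed and stream-backed readers.
trait ThriftInput {
    fn read_i32(&mut self) -> Result<i32, String>;
}

// Types that know how to decode themselves from any ThriftInput.
trait ReadThrift: Sized {
    fn read_thrift<R: ThriftInput>(input: &mut R) -> Result<Self, String>;
}

// One concrete input: a borrowed byte slice.
struct SliceInput<'a>(&'a [u8]);

impl<'a> ThriftInput for SliceInput<'a> {
    fn read_i32(&mut self) -> Result<i32, String> {
        // Toy encoding: plain little-endian i32, no varint/zigzag.
        if self.0.len() < 4 {
            return Err("eof".to_string());
        }
        let (head, tail) = self.0.split_at(4);
        self.0 = tail;
        Ok(i32::from_le_bytes(head.try_into().unwrap()))
    }
}

#[derive(Debug, PartialEq)]
struct PageLocation {
    offset: i32,
    length: i32,
}

impl ReadThrift for PageLocation {
    fn read_thrift<R: ThriftInput>(input: &mut R) -> Result<Self, String> {
        Ok(PageLocation {
            offset: input.read_i32()?,
            length: input.read_i32()?,
        })
    }
}

fn main() -> Result<(), String> {
    let mut input = SliceInput(&[4, 0, 0, 0, 100, 0, 0, 0]);
    let loc = PageLocation::read_thrift(&mut input)?;
    assert_eq!(loc, PageLocation { offset: 4, length: 100 });
    Ok(())
}
```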

# Are these changes tested?
Should be covered by existing tests

# Are there any user-facing changes?
Yes
…ers (#8376)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This continues the remodel, moving on to `PageHeader` support.

# What changes are included in this PR?

Swaps out old `format` page header structs for new ones. This also adds
a `Read` based implementation of the thrift compact protocol reader (the
sizes of the Thrift encoded page headers are not knowable in advance, so
we need a way to read them from the thrift input stream used by the page
decoder).
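The need for a `Read`-based reader can be illustrated with the compact protocol's ULEB128 varints: since the total encoded size is unknown up front, bytes must be pulled from the stream one at a time. This is a sketch, not the crate's implementation:

```rust
use std::io::Read;

/// Decode one unsigned LEB128 varint from a stream, consuming exactly
/// the varint's bytes (the continuation bit tells us when to stop).
fn read_varint<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn main() -> std::io::Result<()> {
    // 300 encodes as [0xAC, 0x02]; 0xFF belongs to whatever follows.
    let mut stream: &[u8] = &[0xac, 0x02, 0xff];
    assert_eq!(read_varint(&mut stream)?, 300);
    // Only the varint's bytes were consumed from the stream.
    assert_eq!(stream, &[0xff]);
    Ok(())
}
```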

This PR also makes decoding of the `Statistics` in the page header
optional (defaults to `false`). We do not use them, and the decode takes
a good chunk of time.

# Are these changes tested?

These changes should be covered by existing tests

# Are there any user-facing changes?

Yes, page level stats are no longer decoded by default
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Closes #5854.

# Rationale for this change

Continues the remodel by implementing writing of the page index
structures.

# What changes are included in this PR?

This PR removes the old `parquet::file::page_index::Index` enum and
replaces it with the new `ColumnIndexMetaData` struct.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

Yes.
etseidl and others added 9 commits September 26, 2025 08:10
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This PR closes the loop: Parquet metadata is now completely handled by
the new code.

# What changes are included in this PR?

Changes the metadata builders to use the new structs rather than those
from `format`. As a consequence, the `close` methods no longer return a
`format::FileMetaData` but instead return a `ParquetMetaData`.

# Are these changes tested?

Covered by existing tests, but many tests were modified to deal with the
switch to `ParquetMetaData` mentioned above.

# Are there any user-facing changes?

Yes
…8476)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This should complete the major work of the remodel. Follow-on tasks
include performance tweaks, documentation, and adding the ability to
skip unneeded structures when decoding. None of the latter should
involve breaking changes.

# What changes are included in this PR?

The major change is removing all of the code to convert to
`parquet::format` structures. The only places those are still used are
in the benchmark and `bin` code, which are not strictly part of the
parquet crate. Once those are cleaned up we can deprecate the `format` module.

This also adds a markdown file documenting the use of the new Thrift
macros.

# Are these changes tested?

Should be covered by existing tests.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
@etseidl etseidl added this to the 57.0.0 milestone Oct 1, 2025
@etseidl etseidl added the parquet, api-change, and next-major-release labels Oct 1, 2025
etseidl commented Oct 1, 2025

Most recent benchmark (on Core i7 macbook)

group                             gh5854                                 main
-----                             ------                                 ----
decode parquet metadata           1.00     19.5±0.37µs        ? ?/sec    1.83     35.7±0.74µs        ? ?/sec
decode parquet metadata (wide)    1.00     90.7±3.11ms        ? ?/sec    2.51    227.8±2.72ms        ? ?/sec
open(default)                     1.00     21.2±0.64µs        ? ?/sec    1.72     36.3±0.66µs        ? ?/sec
open(page index)                  1.00    257.5±5.27µs        ? ?/sec    6.84  1761.1±29.66µs        ? ?/sec

etseidl commented Oct 1, 2025

Benchmark on Intel i7-12700K

group                             gh5854                                 main
-----                             ------                                 ----
decode parquet metadata           1.00      6.6±0.06µs        ? ?/sec    1.94     12.9±0.20µs        ? ?/sec
decode parquet metadata (wide)    1.00     33.0±0.38ms        ? ?/sec    2.23     73.7±1.54ms        ? ?/sec
open(default)                     1.00      7.1±0.09µs        ? ?/sec    1.91     13.5±0.50µs        ? ?/sec
open(page index)                  1.00    104.7±1.46µs        ? ?/sec    5.99    627.9±7.33µs        ? ?/sec

…8528)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This brings over changes to handling of geo-spatial statistics
introduced by @paleolimbot in #8520.

# What changes are included in this PR?

Primarily adds documentation and tests to changes already made. The only
significant change is adding a `Default` implementation for
`EdgeInterpolationAlgorithm`.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
alamb commented Oct 2, 2025

FYI I am starting to test this branch with DataFusion but it will take me a while as the arrow upgrade is quite substantial due to how DataTypes are represented

alamb commented Oct 2, 2025

Most recent benchmark (on Core i7 macbook)

[benchmark chart image]

alamb commented Oct 4, 2025

I got far enough to start testing DataFusion with this actual branch. I am not done yet but making good progress #8513 (comment)

Successfully merging this pull request may close these issues.

Use custom thrift decoder to improve speed of parsing parquet metadata