@etseidl etseidl commented Oct 13, 2025

Which issue does this PR close?

Rationale for this change

Earlier work had introduced some code duplication dealing with decoding of the ColumnMetaData Thrift struct. This PR addresses that, and also addresses earlier review comments (#8587 (comment)).

What changes are included in this PR?

This PR changes how some metadata structures are parsed, utilizing a flag for required fields rather than relying on Option::is_some. This allows for passing around partially initialized ColumnChunkMetaData structs which in turn allows for sharing of the ColumnMetaData parsing code between the encrypted and unencrypted code paths.
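
As a rough illustration of the required-field flag idea (not the code in this PR), the sketch below tracks which required fields have been seen in a bitmask instead of wrapping each field in an `Option`. The struct `PartialColumnMeta`, the field ids, and the helpers `read_fields` and `validate` are all hypothetical names chosen for the example. Because the mask is returned separately, a partially filled struct can be handed to a second decoding pass and validated only once at the end.

```rust
/// Hypothetical stand-in for a partially initialized column chunk struct.
/// Fields hold defaults until a decoding pass fills them in.
#[derive(Debug, Default)]
pub struct PartialColumnMeta {
    pub num_values: i64,
    pub total_compressed_size: i64,
    pub data_page_offset: i64,
}

// One bit per required field (the field set here is an assumption).
pub const HAS_NUM_VALUES: u8 = 1 << 0;
pub const HAS_COMPRESSED_SIZE: u8 = 1 << 1;
pub const HAS_DATA_PAGE_OFFSET: u8 = 1 << 2;
pub const ALL_REQUIRED: u8 = HAS_NUM_VALUES | HAS_COMPRESSED_SIZE | HAS_DATA_PAGE_OFFSET;

/// Apply already-decoded `(field_id, value)` pairs to `meta` and return a
/// bitmask of the required fields that were seen in this pass.
pub fn read_fields(meta: &mut PartialColumnMeta, fields: &[(i16, i64)]) -> u8 {
    let mut seen = 0u8;
    for &(id, value) in fields {
        match id {
            1 => {
                meta.num_values = value;
                seen |= HAS_NUM_VALUES;
            }
            2 => {
                meta.total_compressed_size = value;
                seen |= HAS_COMPRESSED_SIZE;
            }
            3 => {
                meta.data_page_offset = value;
                seen |= HAS_DATA_PAGE_OFFSET;
            }
            _ => {} // unknown field id: ignore it
        }
    }
    seen
}

/// Check that every required field was seen across all passes.
pub fn validate(seen: u8) -> Result<(), String> {
    if (seen & ALL_REQUIRED) == ALL_REQUIRED {
        Ok(())
    } else {
        Err(format!(
            "missing required fields, mask {:#05b}",
            ALL_REQUIRED & !seen
        ))
    }
}
```

A caller can OR together the masks returned by several `read_fields` calls on the same struct and call `validate` once, which is the property that lets two code paths share one parser in this sketch.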

This PR also moves the file/metadata/{encryption,thrift_gen}.rs files to a new file::metadata::thrift module.

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No. This PR only changes private APIs.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 13, 2025
/// Create a [`crate::file::statistics::Statistics`] from a thrift [`Statistics`] object.
pub(crate) fn convert_stats(
physical_type: Type,
fn convert_stats(
Contributor Author

I have a branch where this becomes read_stats and avoids the intermediate Statistics<'a> struct, but that didn't move the needle much on performance. I'll revisit this later.

alamb commented Oct 14, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor_thrift_module (5f4b18c) to 891d31d diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=refactor_thrift_module
Results will be posted here when complete

@alamb alamb left a comment

Looks great to me -- thanks @etseidl

I also kicked off some benchmarks to see whether the more efficient row group / column decoding gives a measurable improvement.

list_ident.size
));
}
let mut cols = Vec::with_capacity(list_ident.size as usize);

nice -- this gets rid of another allocation in the parsing code
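
For readers less familiar with the pattern in the snippet above, here is a minimal, self-contained sketch of validating a declared list size and then reserving the vector once with `Vec::with_capacity`. The function `decode_list` and its bounds check are hypothetical and are not taken from the PR. The only point is that a validated size lets the output vector be allocated a single time rather than regrown while elements are pushed.

```rust
/// Hypothetical helper: copy the first `declared_len` values into a new Vec.
/// `declared_len` plays the role of a size read from a list header, so it is
/// checked before it is trusted as a capacity.
pub fn decode_list(declared_len: i32, values: &[i64]) -> Result<Vec<i64>, String> {
    // Reject negative sizes and sizes larger than the available input.
    if declared_len < 0 || declared_len as usize > values.len() {
        return Err(format!("invalid list size {}", declared_len));
    }
    let n = declared_len as usize;

    // Reserve space for all elements up front: one allocation for the list.
    let mut out = Vec::with_capacity(n);
    for v in &values[..n] {
        out.push(*v);
    }
    Ok(out)
}
```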

alamb commented Oct 14, 2025

🤖: Benchmark completed

group                             main                                   refactor_thrift_module
-----                             ----                                   ----------------------
decode parquet metadata           1.02     11.0±0.11µs        ? ?/sec    1.00     10.7±0.05µs        ? ?/sec
decode parquet metadata (wide)    1.09     61.9±9.35ms        ? ?/sec    1.00     56.6±8.08ms        ? ?/sec
open(default)                     1.01     11.0±0.27µs        ? ?/sec    1.00     10.8±0.03µs        ? ?/sec
open(page index)                  1.00    203.1±1.55µs        ? ?/sec    1.00    203.4±2.34µs        ? ?/sec

etseidl commented Oct 14, 2025

Thanks for the fast review @alamb. I think this is it for 57.0.0 barring things like deprecating format (#8572) and the encodings API (#8587 (comment)), adding more tests and fixing documentation (#8571).

The next speedup will be skipping decoding of column chunk stats and encoding statistics. The latter can also become a bitmap to satisfy the only use I've heard of for the stats (enabling use of the dictionary for pruning). But enabling these optimizations is going to involve passing arguments down somehow (likely on the metadata readers) and may involve further breaking changes.
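
To make the bitmap idea concrete, here is a hedged sketch of how the encodings used by a column chunk's data pages could be folded into a single integer and then queried for "dictionary encoded only". The types `Encoding` and `EncodingMask` and the function `data_page_mask` are hypothetical and defined only for this example. They are not the parquet crate's API, and the enum lists only a subset of encodings, using the numeric values from the Parquet Thrift definition.

```rust
/// Hypothetical subset of Parquet encodings (discriminants follow parquet.thrift).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Encoding {
    Plain = 0,
    PlainDictionary = 2,
    Rle = 3,
    DeltaBinaryPacked = 5,
    RleDictionary = 8,
}

/// One bit per encoding; bit `n` is set if encoding `n` was used.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct EncodingMask(u32);

impl EncodingMask {
    /// Record that `e` was used.
    pub fn insert(&mut self, e: Encoding) {
        self.0 |= 1u32 << (e as u32);
    }

    /// True if `e` was recorded.
    pub fn contains(&self, e: Encoding) -> bool {
        (self.0 & (1u32 << (e as u32))) != 0
    }

    /// True if at least one encoding was recorded and all of them are
    /// dictionary encodings.
    pub fn all_dictionary(&self) -> bool {
        let dict = (1u32 << (Encoding::PlainDictionary as u32))
            | (1u32 << (Encoding::RleDictionary as u32));
        self.0 != 0 && (self.0 & !dict) == 0
    }
}

/// Fold the encodings seen on a chunk's data pages into one mask.
pub fn data_page_mask(data_page_encodings: &[Encoding]) -> EncodingMask {
    let mut mask = EncodingMask::default();
    for &e in data_page_encodings {
        mask.insert(e);
    }
    mask
}
```

In this sketch a reader that wants to prune using the dictionary would only need `all_dictionary()` to return true, so the per-page encoding statistics would not have to be kept as a list.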

@etseidl etseidl merged commit c94698c into apache:main Oct 14, 2025
16 checks passed
@etseidl etseidl deleted the refactor_thrift_module branch October 14, 2025 17:12