
Conversation


@etseidl etseidl commented Oct 10, 2025

Which issue does this PR close?

Rationale for this change

Reduce code clutter (too many [cfg(feature = "encryption")]s).

Also reduces the number of vector allocations using a strategy proposed by @jhorstmann (#8518 (comment)).

What changes are included in this PR?

Refactors most of the Thrift decoding involving encryption into a new module.

Replaces the encodings vector in ColumnChunkMetaData with a bitmask.
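The bitmask idea can be sketched roughly like this (names and helper functions are illustrative, not the PR's actual API): each `Encoding` discriminant maps to one bit of an `i32`, so set membership is a mask test instead of a vector scan.

```rust
// Illustrative sketch of storing a set of encodings as a bitmask.
// The discriminant values mirror the parquet-format Encoding enum
// (a subset shown); the helper names are hypothetical.
#[derive(Clone, Copy, Debug)]
enum Encoding {
    Plain = 0,
    Rle = 3,
    ByteStreamSplit = 9,
}

// Set the bit corresponding to an encoding.
fn insert(mask: i32, e: Encoding) -> i32 {
    mask | (1 << (e as i32))
}

// Test whether an encoding's bit is set.
fn contains(mask: i32, e: Encoding) -> bool {
    mask & (1 << (e as i32)) != 0
}

fn main() {
    let mut mask = 0;
    mask = insert(mask, Encoding::Plain);
    mask = insert(mask, Encoding::Rle);
    assert!(contains(mask, Encoding::Rle));
    assert!(!contains(mask, Encoding::ByteStreamSplit));
}
```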

Are these changes tested?

Covered by existing and new tests.

Are there any user-facing changes?

Yes. This changes the return type of ColumnChunkMetaData::encodings from &Vec<Encoding> to Vec<Encoding> (the vector is now created on demand rather than stored).

@etseidl etseidl added parquet Changes to the parquet crate api-change Changes to the arrow API performance next-major-release the PR has API changes and is waiting on the next major version labels Oct 10, 2025

etseidl commented Oct 10, 2025

This PR shows the value of reducing vector allocations. Two major offenders during decoding are the encodings and page_encoding_stats vectors in ColumnChunkMetaData. This PR gets rid of the former. The biggest impact is in the wide schema test.

default features
group                             encodings_default                      main
-----                             -----------------                      -------
decode parquet metadata           1.00     16.2±0.28µs        ? ?/sec    1.08     17.4±0.31µs        ? ?/sec
decode parquet metadata (wide)    1.00     62.0±1.92ms        ? ?/sec    1.20     74.2±1.58ms        ? ?/sec
open(default)                     1.00     16.9±0.25µs        ? ?/sec    1.07     18.1±0.41µs        ? ?/sec
open(page index)                  1.01    255.8±6.08µs        ? ?/sec    1.00    252.9±4.98µs        ? ?/sec

all features
group                             encodings_encr                         main
-----                             --------------                         ------------
decode parquet metadata           1.00     16.9±0.29µs        ? ?/sec    1.07     18.0±0.32µs        ? ?/sec
decode parquet metadata (wide)    1.00     67.6±1.37ms        ? ?/sec    1.16     78.5±1.69ms        ? ?/sec
open(default)                     1.00     17.5±0.35µs        ? ?/sec    1.07     18.8±0.39µs        ? ?/sec
open(page index)                  1.00    254.1±4.33µs        ? ?/sec    1.00    254.6±5.31µs        ? ?/sec


etseidl commented Oct 10, 2025

Note to reviewers: this PR is actually two things, a refactor and a performance enhancement. I've left them as two commits to make it easier to see what changed in each part.

Comment on lines +377 to +378
#[derive(Clone, Debug, Eq, PartialEq)]
struct ColumnMetaData<'a> {
Contributor Author

Most of this diff is code moved here from thrift_gen.rs. From this point on is code added to support the change to encodings.

/// All encodings used for this column.
pub fn encodings(&self) -> &Vec<Encoding> {
&self.encodings
pub fn encodings(&self) -> Vec<Encoding> {
Contributor Author

This is the breaking change

Contributor

How about we also mark this function as #[deprecated] to help people know they should use the more efficient encoding_mask if they are using it?
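For reference, the deprecation could look something like the sketch below. The struct body, version string, and `encodings_mask` accessor are simplified stand-ins for discussion, not the merged code.

```rust
// Sketch: deprecating the allocating accessor in favor of the mask.
// All names and the `since` version are assumptions.
struct ColumnChunkMetaData {
    encodings: i32, // bitmask, one bit per Encoding discriminant
}

impl ColumnChunkMetaData {
    #[deprecated(since = "57.0.0", note = "use encodings_mask instead")]
    pub fn encodings(&self) -> Vec<i32> {
        // Materialize the set bits on demand (the cost being deprecated away).
        (0..32).filter(|b| self.encodings & (1 << b) != 0).collect()
    }

    pub fn encodings_mask(&self) -> i32 {
        self.encodings
    }
}

fn main() {
    let c = ColumnChunkMetaData { encodings: 0b1001 };
    assert_eq!(c.encodings_mask(), 9);
}
```

Callers then get a compiler warning pointing them at the cheaper accessor.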

Contributor

I also recommend changing the signature to avoid forcing an allocation -- for example how about returning an iterator like this?

Suggested change
pub fn encodings(&self) -> Vec<Encoding> {
pub fn encodings(&self) -> impl Iterator<Item = &Encoding> {

Contributor Author

will do

Contributor Author

Done in c94c263


alamb commented Oct 10, 2025

> This PR shows the value of reducing vector allocations. Two major offenders during decoding are the encodings and page_encoding_stats vectors in ColumnChunkMetaData. This PR gets rid of the former. The biggest impact is in the wide schema test.

I have a good feeling that this thrift decoder is going to be the gift that keeps on giving performance wise 🎁


etseidl commented Oct 10, 2025

> I have a good feeling that this thrift decoder is going to be the gift that keeps on giving performance wise 🎁

There is more to come 😁

@alamb alamb left a comment

Thanks @etseidl -- this looks great!

I have some suggestions for the interface design of the Encodings mask, though I think we could do it as a follow-on PR as well


Self(ColumnChunkMetaData {
column_descr,
encodings: Vec::new(),
encodings: 0,
Contributor

No allocations


}

/// All encodings used for this column, returned as a bitmask.
pub fn encodings_mask(&self) -> i32 {
Contributor

I think this will be very hard to use as a user given the lack of documentation / export of how to interpret the values.

I left a suggestion above about how to improve it

Contributor Author

Thanks. Speed first, then usability 😆


}
}

const MAX_ENCODING: i32 = Encoding::BYTE_STREAM_SPLIT as i32;
Contributor

It is probably my OOP / old age, but I would find this API easier to deal with if it were wrapped in a type that had a documented API

Something like

struct EncodingMask(i32);

impl EncodingMask {
  fn new(val: i32) -> Self { 
   ...
  }
  fn as_i32(&self) -> i32 {
    self.0
  }
} 

// then you can implement From impls to/from i32 as well as a no allocation iterator

I am happy to help hack this out if you like the idea
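A rough cut of that wrapper, including a no-allocation iterator over the set bits, might look like this. All names and the `From` impl are assumptions for discussion; the actual `EncodingMask` API landed later in the PR.

```rust
/// Sketch of a documented wrapper over the bare i32 bitmask.
/// Hypothetical names; not the merged implementation.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct EncodingMask(i32);

impl EncodingMask {
    pub fn new(val: i32) -> Self {
        EncodingMask(val)
    }

    pub fn as_i32(&self) -> i32 {
        self.0
    }

    /// Iterate over set bit positions without allocating a Vec.
    pub fn bits(&self) -> impl Iterator<Item = u32> {
        let mask = self.0;
        (0..32u32).filter(move |&b| mask & (1i32 << b) != 0)
    }
}

impl From<i32> for EncodingMask {
    fn from(v: i32) -> Self {
        EncodingMask::new(v)
    }
}

fn main() {
    let m = EncodingMask::new(0b100_0000_0001); // bits 0 and 10
    assert_eq!(m.bits().collect::<Vec<_>>(), vec![0, 10]);
}
```

Callers that still want a `Vec<Encoding>` can collect the iterator; everyone else pays nothing.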

Contributor Author

Sounds good. I started down that path before but got lazy 😅. I'll add it to the queue.

Contributor

shall I file a ticket? (I personally get an endorphin hit for each ticket that I close, so want to offer you the same incentive 😆 )

Contributor Author

First stab in 591d1ec

@alamb please take a look and see if this is on the right track

Contributor

Since this is thrift parsing related, perhaps it would be better organized as a submodule of thrift_gen?

In other words, how about renaming it into parquet/src/file/metadata/thrift_gen/encryption.rs

Contributor Author

Sounds good. And rename thrift_gen to just thrift?

// specific language governing permissions and limitations
// under the License.

// a collection of generated structs used to parse thrift metadata
Contributor

technically speaking they aren't generated anymore

Contributor Author

True. Also this is just cut-and-paste, so I should update it to match what's actually in the module.

Also, we (I) need to rename thrift_gen at some point, since that's not strictly true either.


alamb commented Oct 10, 2025

I also queued up a benchmark run to confirm the performance improvement (I am sure it is there!)


etseidl commented Oct 10, 2025

Here's a before and after look at the allocation burden. Getting rid of the page encoding stats allocation should have about the same benefit as this change.

Before: [allocation profile screenshot]

After: [allocation profile screenshot]


alamb commented Oct 10, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing encryption_and_encodings (a886995) to b51a000 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=encryption_and_encodings
Results will be posted here when complete


alamb commented Oct 10, 2025

🤖: Benchmark completed


group                             encryption_and_encodings               main
-----                             ------------------------               ----
decode parquet metadata           1.00     11.0±0.03µs        ? ?/sec    1.08     12.0±0.03µs        ? ?/sec
decode parquet metadata (wide)    1.00     60.6±7.84ms        ? ?/sec    1.01     61.2±0.85ms        ? ?/sec
open(default)                     1.00     10.8±0.28µs        ? ?/sec    1.07     11.6±0.04µs        ? ?/sec
open(page index)                  1.00    201.1±0.91µs        ? ?/sec    1.00    201.5±0.67µs        ? ?/sec


alamb commented Oct 10, 2025

Benchmark results look consistent with yours (though I didn't run with just default features).


etseidl commented Oct 10, 2025

The wide results are surprising (and disappointing 😢). I'll test on a better machine.


etseidl commented Oct 10, 2025

Benchmark on my workstation makes me feel a little better

group                             encodings_encr                         main
-----                             --------------                         -------
decode parquet metadata           1.00      5.9±0.06µs        ? ?/sec    1.07      6.3±0.05µs        ? ?/sec
decode parquet metadata (wide)    1.00     25.4±0.48ms        ? ?/sec    1.10     27.8±1.08ms        ? ?/sec
open(default)                     1.00      6.2±0.10µs        ? ?/sec    1.05      6.5±0.06µs        ? ?/sec
open(page index)                  1.00    107.0±1.30µs        ? ?/sec    1.01    108.3±1.11µs        ? ?/sec

///
/// [`ColumnMetaData`]: https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L875
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct EncodingMask(i32);
Contributor

❤️

}

/// Sets list of encodings for this column chunk.
pub fn set_encodings(mut self, encodings: Vec<Encoding>) -> Self {
Contributor

As a follow-on PR it may be worth deprecating the old APIs.

let list_ident = prot.read_list_begin()?;
for _ in 0..list_ident.size {
let val = i32::read_thrift(prot)?;
if (0..=Self::MAX_ENCODING).contains(&val) {
Contributor

I've been thinking about this representation some more since I proposed it. The main use case is to detect whether any used encoding is not yet supported. I think this means we also need to handle bits outside the current MAX_ENCODING, and also iterate over all set bits in the encodings method. It's very unlikely there will ever be over 32 encodings, but maybe we should also clamp the shift value to 31 to avoid any overflow.
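One way to sketch that suggestion: clamp the shift so values of 31 or more collapse into a single overflow bit rather than overflowing the shift, and expose a check for bits beyond the known maximum. The function names and the exact clamping policy here are illustrative assumptions, not the PR's code.

```rust
// Sketch: fold any encoding value >= 31 into bit 31 so the shift
// amount never reaches the bit width, and flag bits above the
// highest known encoding as "unknown".
const MAX_ENCODING: i32 = 9; // BYTE_STREAM_SPLIT in current parquet-format

fn set_encoding_bit(mask: i32, val: i32) -> i32 {
    // clamp keeps the shift amount in 0..=31, avoiding a panic/UB on overflow
    let shift = val.clamp(0, 31);
    mask | (1i32 << shift)
}

fn has_unknown_encodings(mask: i32) -> bool {
    // any bit above MAX_ENCODING means an encoding we don't support
    (mask as u32) >> (MAX_ENCODING as u32 + 1) != 0
}

fn main() {
    let mut mask = 0;
    mask = set_encoding_bit(mask, 3); // RLE
    assert!(!has_unknown_encodings(mask));
    mask = set_encoding_bit(mask, 40); // clamped into bit 31
    assert!(has_unknown_encodings(mask));
}
```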

Contributor Author

Thanks @jhorstmann, that's an excellent point. I think for each enum/union we should figure out what we want to do when we encounter an unknown field. For now they all (with the exception of LogicalType, EdgeInterpolationAlgorithm, and ColumnOrder) return an error.

Encoding likely should return an error since the column will not be decodable. I think I'll change the line above to Encoding::read_thrift which will detect the error so we don't have to check for it here. The test on the range can then be removed as well.
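Letting `Encoding::read_thrift` detect the error amounts to failing the integer-to-enum conversion on an unknown discriminant, along these lines (a simplified, hypothetical sketch with only a subset of variants; the real decoder reads from a Thrift protocol object):

```rust
// Sketch: converting a raw i32 into a known encoding, returning an
// error on unknown values so the call-site range check can be dropped.
#[derive(Debug, PartialEq)]
enum Encoding {
    Plain,
    Rle,
    ByteStreamSplit,
}

fn encoding_from_i32(v: i32) -> Result<Encoding, String> {
    match v {
        0 => Ok(Encoding::Plain),
        3 => Ok(Encoding::Rle),
        9 => Ok(Encoding::ByteStreamSplit),
        other => Err(format!("unknown Encoding value: {other}")),
    }
}

fn main() {
    assert_eq!(encoding_from_i32(3), Ok(Encoding::Rle));
    assert!(encoding_from_i32(42).is_err());
}
```

An unreadable encoding means the column can't be decoded anyway, so surfacing the error at parse time loses nothing.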

Contributor Author

Oh, and I was wondering if i32 was too large, but we're already up to 10 bits with the current encodings, and with @alamb starting the discussion about adding more, I think we're likely to exhaust an i16 in the not-too-distant future.

@etseidl etseidl added this to the 57.0.0 milestone Oct 13, 2025

etseidl commented Oct 13, 2025

Merging this to move on to some refactoring.

@etseidl etseidl merged commit 891d31d into apache:main Oct 13, 2025
16 checks passed
@etseidl etseidl deleted the encryption_and_encodings branch October 13, 2025 21:24
etseidl added a commit that referenced this pull request Oct 14, 2025
#8599)

# Which issue does this PR close?

- Part of #5853.

# Rationale for this change

Earlier work had introduced some code duplication dealing with decoding
of the `ColumnMetaData` Thrift struct. This PR addresses that, and also
addresses earlier review comments
(#8587 (comment)).

# What changes are included in this PR?

This PR changes how some metadata structures are parsed, utilizing a
flag for required fields rather than relying on `Option::is_some`. This
allows for passing around partially initialized `ColumnChunkMetaData`
structs which in turn allows for sharing of the `ColumnMetaData` parsing
code between the encrypted and unencrypted code paths.

This PR also moves the `file/metadata/{encryption,thrift_gen}.rs` files
to a new `file::metadata::thrift` module.

# Are these changes tested?
Covered by existing tests.

# Are there any user-facing changes?
No, only makes changes to private APIs.


Development

Successfully merging this pull request may close these issues.

[parquet] Improve encoding mask API (wrap bare i32 in a struct w/ docs)
