
Conversation


@etseidl etseidl commented Oct 10, 2025

Which issue does this PR close?

Rationale for this change

Reduce code clutter (too many [cfg(feature = "encryption")]s).

Also reduces the number of vector allocations using a strategy proposed by @jhorstmann (#8518 (comment)).

What changes are included in this PR?

Refactors most of the Thrift decoding involving encryption into a new module.

Replaces the encodings vector in ColumnChunkMetaData with a bitmask.
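The bitmask idea can be sketched roughly like this (names and helper functions are illustrative, not the PR's actual API): each `Encoding` discriminant maps to one bit of an `i32`, so set membership is a mask test instead of a vector scan.

```rust
// Illustrative sketch of storing a set of encodings as a bitmask.
// The discriminant values mirror the parquet-format Encoding enum
// (a subset shown); the helper names are hypothetical.
#[derive(Clone, Copy, Debug)]
enum Encoding {
    Plain = 0,
    Rle = 3,
    ByteStreamSplit = 9,
}

// Set the bit corresponding to an encoding.
fn insert(mask: i32, e: Encoding) -> i32 {
    mask | (1 << (e as i32))
}

// Test whether an encoding's bit is set.
fn contains(mask: i32, e: Encoding) -> bool {
    mask & (1 << (e as i32)) != 0
}

fn main() {
    let mut mask = 0;
    mask = insert(mask, Encoding::Plain);
    mask = insert(mask, Encoding::Rle);
    assert!(contains(mask, Encoding::Rle));
    assert!(!contains(mask, Encoding::ByteStreamSplit));
}
```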

Are these changes tested?

Covered by existing and new tests.

Are there any user-facing changes?

Yes. This changes the return type of ColumnChunkMetaData::encodings from &Vec<Encoding> to Vec<Encoding> (the vector is now created on demand rather than stored).

@etseidl etseidl added parquet Changes to the parquet crate api-change Changes to the arrow API performance next-major-release the PR has API changes and is waiting on the next major version labels Oct 10, 2025

etseidl commented Oct 10, 2025

This PR shows the value of reducing vector allocations. Two major offenders during decoding are the encodings and page_encoding_stats vectors in ColumnChunkMetaData. This PR gets rid of the former. The biggest impact is in the wide schema test.

default features
group                             encodings_default                      main
-----                             -----------------                      -------
decode parquet metadata           1.00     16.2±0.28µs        ? ?/sec    1.08     17.4±0.31µs        ? ?/sec
decode parquet metadata (wide)    1.00     62.0±1.92ms        ? ?/sec    1.20     74.2±1.58ms        ? ?/sec
open(default)                     1.00     16.9±0.25µs        ? ?/sec    1.07     18.1±0.41µs        ? ?/sec
open(page index)                  1.01    255.8±6.08µs        ? ?/sec    1.00    252.9±4.98µs        ? ?/sec

all features
group                             encodings_encr                         main
-----                             --------------                         ------------
decode parquet metadata           1.00     16.9±0.29µs        ? ?/sec    1.07     18.0±0.32µs        ? ?/sec
decode parquet metadata (wide)    1.00     67.6±1.37ms        ? ?/sec    1.16     78.5±1.69ms        ? ?/sec
open(default)                     1.00     17.5±0.35µs        ? ?/sec    1.07     18.8±0.39µs        ? ?/sec
open(page index)                  1.00    254.1±4.33µs        ? ?/sec    1.00    254.6±5.31µs        ? ?/sec


etseidl commented Oct 10, 2025

Note to reviewers: this PR is actually two things, a refactor and a performance enhancement. I've left them as two commits to make it easier to see what changed in each part.

Comment on lines +377 to +378
#[derive(Clone, Debug, Eq, PartialEq)]
struct ColumnMetaData<'a> {
Contributor Author

Most of this diff is code moved here from thrift_gen.rs. From this point on is code added to support the change to encodings.

/// All encodings used for this column.
pub fn encodings(&self) -> &Vec<Encoding> {
&self.encodings
pub fn encodings(&self) -> Vec<Encoding> {
Contributor Author

This is the breaking change

Contributor

How about we also mark this function as #[deprecated] to help people know they should use the more efficient encoding_mask if they are using it?
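For reference, the deprecation could look something like the sketch below. The struct body, version string, and `encodings_mask` accessor are simplified stand-ins for discussion, not the merged code.

```rust
// Sketch: deprecating the allocating accessor in favor of the mask.
// All names and the `since` version are assumptions.
struct ColumnChunkMetaData {
    encodings: i32, // bitmask, one bit per Encoding discriminant
}

impl ColumnChunkMetaData {
    #[deprecated(since = "57.0.0", note = "use encodings_mask instead")]
    pub fn encodings(&self) -> Vec<i32> {
        // Materialize the set bits on demand (the cost being deprecated away).
        (0..32).filter(|b| self.encodings & (1 << b) != 0).collect()
    }

    pub fn encodings_mask(&self) -> i32 {
        self.encodings
    }
}

fn main() {
    let c = ColumnChunkMetaData { encodings: 0b1001 };
    assert_eq!(c.encodings_mask(), 9);
}
```

Callers then get a compiler warning pointing them at the cheaper accessor.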

Contributor

I also recommend changing the signature to avoid forcing an allocation -- for example how about returning an iterator like this?

Suggested change
pub fn encodings(&self) -> Vec<Encoding> {
pub fn encodings(&self) -> impl Iterator<Item = &Encoding> {

Contributor Author

will do

Contributor Author

Done in c94c263


alamb commented Oct 10, 2025

> This PR shows the value of reducing vector allocations. Two major offenders during decoding are the encodings and page_encoding_stats vectors in ColumnChunkMetaData. This PR gets rid of the former. The biggest impact is in the wide schema test.

I have a good feeling that this thrift decoder is going to be the gift that keeps on giving performance wise 🎁


etseidl commented Oct 10, 2025

> I have a good feeling that this thrift decoder is going to be the gift that keeps on giving performance wise 🎁

There is more to come 😁

@alamb alamb left a comment

Thanks @etseidl -- this looks great!

I have some suggestions for the interface design of the Encodings mask, though I think we could do it as a follow-on PR as well


Self(ColumnChunkMetaData {
column_descr,
encodings: Vec::new(),
encodings: 0,
Contributor

No allocations


}

/// All encodings used for this column, returned as a bitmask.
pub fn encodings_mask(&self) -> i32 {
Contributor

I think this will be very hard to use as a user given the lack of documentation / export of how to interpret the values.

I left a suggestion above about how to improve it

Contributor Author

Thanks. Speed first, then usability 😆


}
}

const MAX_ENCODING: i32 = Encoding::BYTE_STREAM_SPLIT as i32;
Contributor

It is probably my OOP / old age, but I would find this API easier to deal with if it were wrapped in a type that had a documented API

Something like

struct EncodingMask(i32);

impl EncodingMask {
  fn new(val: i32) -> Self { 
   ...
  }
  fn as_i32(&self) -> i32 {
    self.0
  }
} 

// then you can implement From impls to/from i32 as well as a no allocation iterator

I am happy to help hack this out if you like the idea
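A rough cut of that wrapper, including a no-allocation iterator over the set bits, might look like this. All names and the `From` impl are assumptions for discussion; the actual `EncodingMask` API landed later in the PR.

```rust
/// Sketch of a documented wrapper over the bare i32 bitmask.
/// Hypothetical names; not the merged implementation.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct EncodingMask(i32);

impl EncodingMask {
    pub fn new(val: i32) -> Self {
        EncodingMask(val)
    }

    pub fn as_i32(&self) -> i32 {
        self.0
    }

    /// Iterate over set bit positions without allocating a Vec.
    pub fn bits(&self) -> impl Iterator<Item = u32> {
        let mask = self.0;
        (0..32u32).filter(move |&b| mask & (1i32 << b) != 0)
    }
}

impl From<i32> for EncodingMask {
    fn from(v: i32) -> Self {
        EncodingMask::new(v)
    }
}

fn main() {
    let m = EncodingMask::new(0b100_0000_0001); // bits 0 and 10
    assert_eq!(m.bits().collect::<Vec<_>>(), vec![0, 10]);
}
```

Callers that still want a `Vec<Encoding>` can collect the iterator; everyone else pays nothing.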

Contributor Author

Sounds good. I started down that path before but got lazy 😅. I'll add it to the queue.

Contributor

shall I file a ticket? (I personally get an endorphin hit for each ticket that I close, so want to offer you the same incentive 😆 )

Contributor Author

First stab in 591d1ec

@alamb please take a look and see if this is on the right track

Contributor

Since this is thrift parsing related, perhaps it would be better organized as a submodule of thrift_gen?

In other words, how about renaming it into parquet/src/file/metadata/thrift_gen/encryption.rs

Contributor Author

Sounds good. And rename thrift_gen to just thrift?

// specific language governing permissions and limitations
// under the License.

// a collection of generated structs used to parse thrift metadata
Contributor

technically speaking they aren't generated anymore

Contributor Author

True. Also this is just cut-and-paste, so I should update it to match what's actually in the module.

Also, we (I) need to rename thrift_gen at some point, since that's not strictly true either.


alamb commented Oct 10, 2025

I also queued up a benchmark run to confirm the performance improvement (I am sure it is there!)


etseidl commented Oct 10, 2025

Here's a before and after look at the allocation burden. Getting rid of the page encoding stats allocation should have about the same benefit as this change.

Before: [allocation profile screenshot]

After: [allocation profile screenshot]


alamb commented Oct 10, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing encryption_and_encodings (a886995) to b51a000 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=encryption_and_encodings
Results will be posted here when complete


alamb commented Oct 10, 2025

🤖: Benchmark completed


group                             encryption_and_encodings               main
-----                             ------------------------               ----
decode parquet metadata           1.00     11.0±0.03µs        ? ?/sec    1.08     12.0±0.03µs        ? ?/sec
decode parquet metadata (wide)    1.00     60.6±7.84ms        ? ?/sec    1.01     61.2±0.85ms        ? ?/sec
open(default)                     1.00     10.8±0.28µs        ? ?/sec    1.07     11.6±0.04µs        ? ?/sec
open(page index)                  1.00    201.1±0.91µs        ? ?/sec    1.00    201.5±0.67µs        ? ?/sec


alamb commented Oct 10, 2025

Benchmark results look consistent with yours (though I didn't run with just default features).


etseidl commented Oct 10, 2025

The wide results are surprising (and disappointing 😢). I'll test on a better machine.


etseidl commented Oct 10, 2025

Benchmark on my workstation makes me feel a little better

group                             encodings_encr                         main
-----                             --------------                         -------
decode parquet metadata           1.00      5.9±0.06µs        ? ?/sec    1.07      6.3±0.05µs        ? ?/sec
decode parquet metadata (wide)    1.00     25.4±0.48ms        ? ?/sec    1.10     27.8±1.08ms        ? ?/sec
open(default)                     1.00      6.2±0.10µs        ? ?/sec    1.05      6.5±0.06µs        ? ?/sec
open(page index)                  1.00    107.0±1.30µs        ? ?/sec    1.01    108.3±1.11µs        ? ?/sec

///
/// [`ColumnMetaData`]: https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L875
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct EncodingMask(i32);
Contributor

❤️

}

/// Sets list of encodings for this column chunk.
pub fn set_encodings(mut self, encodings: Vec<Encoding>) -> Self {
Contributor

As a follow-on PR it may be worth deprecating the old APIs.

let list_ident = prot.read_list_begin()?;
for _ in 0..list_ident.size {
let val = i32::read_thrift(prot)?;
if (0..=Self::MAX_ENCODING).contains(&val) {
Contributor

I've been thinking about this representation some more since I proposed it. The main use case is to detect whether any used encoding is not yet supported. I think this means we also need to handle bits outside the current MAX_ENCODING, and also iterate over all set bits in the encodings method. It's very unlikely there will ever be over 32 encodings, but maybe we should also clamp the shift value to 31 to avoid any overflow.
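One way to sketch that suggestion: clamp the shift so values of 31 or more collapse into a single overflow bit rather than overflowing the shift, and expose a check for bits beyond the known maximum. The function names and the exact clamping policy here are illustrative assumptions, not the PR's code.

```rust
// Sketch: fold any encoding value >= 31 into bit 31 so the shift
// amount never reaches the bit width, and flag bits above the
// highest known encoding as "unknown".
const MAX_ENCODING: i32 = 9; // BYTE_STREAM_SPLIT in current parquet-format

fn set_encoding_bit(mask: i32, val: i32) -> i32 {
    // clamp keeps the shift amount in 0..=31, avoiding a panic/UB on overflow
    let shift = val.clamp(0, 31);
    mask | (1i32 << shift)
}

fn has_unknown_encodings(mask: i32) -> bool {
    // any bit above MAX_ENCODING means an encoding we don't support
    (mask as u32) >> (MAX_ENCODING as u32 + 1) != 0
}

fn main() {
    let mut mask = 0;
    mask = set_encoding_bit(mask, 3); // RLE
    assert!(!has_unknown_encodings(mask));
    mask = set_encoding_bit(mask, 40); // clamped into bit 31
    assert!(has_unknown_encodings(mask));
}
```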

Contributor Author

Thanks @jhorstmann, that's an excellent point. I think for each enum/union we should figure out what we want to do when we encounter an unknown field. For now they all (with the exception of LogicalType, EdgeInterpolationAlgorithm, and ColumnOrder) return an error.

Encoding likely should return an error since the column will not be decodable. I think I'll change the line above to Encoding::read_thrift which will detect the error so we don't have to check for it here. The test on the range can then be removed as well.
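Letting `Encoding::read_thrift` detect the error amounts to failing the integer-to-enum conversion on an unknown discriminant, along these lines (a simplified, hypothetical sketch with only a subset of variants; the real decoder reads from a Thrift protocol object):

```rust
// Sketch: converting a raw i32 into a known encoding, returning an
// error on unknown values so the call-site range check can be dropped.
#[derive(Debug, PartialEq)]
enum Encoding {
    Plain,
    Rle,
    ByteStreamSplit,
}

fn encoding_from_i32(v: i32) -> Result<Encoding, String> {
    match v {
        0 => Ok(Encoding::Plain),
        3 => Ok(Encoding::Rle),
        9 => Ok(Encoding::ByteStreamSplit),
        other => Err(format!("unknown Encoding value: {other}")),
    }
}

fn main() {
    assert_eq!(encoding_from_i32(3), Ok(Encoding::Rle));
    assert!(encoding_from_i32(42).is_err());
}
```

An unreadable encoding means the column can't be decoded anyway, so surfacing the error at parse time loses nothing.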

Contributor Author

Oh, and I was wondering if i32 was too large, but we're already up to 10 bits with the current encodings, and with @alamb starting the discussion about adding more, I think we're likely to exhaust an i16 in the not-too-distant future.

@etseidl etseidl added this to the 57.0.0 milestone Oct 13, 2025

etseidl commented Oct 13, 2025

Merging this to move on to some refactoring.

@etseidl etseidl merged commit 891d31d into apache:main Oct 13, 2025
16 checks passed
@etseidl etseidl deleted the encryption_and_encodings branch October 13, 2025 21:24
etseidl added a commit that referenced this pull request Oct 14, 2025
#8599)

# Which issue does this PR close?

- Part of #5853.

# Rationale for this change

Earlier work had introduced some code duplication dealing with decoding
of the `ColumnMetaData` Thrift struct. This PR addresses that, and also
addresses earlier review comments
(#8587 (comment)).

# What changes are included in this PR?

This PR changes how some metadata structures are parsed, utilizing a
flag for required fields rather than relying on `Option::is_some`. This
allows for passing around partially initialized `ColumnChunkMetaData`
structs which in turn allows for sharing of the `ColumnMetaData` parsing
code between the encrypted and unencrypted code paths.

This PR also moves the `file/metadata/{encryption,thrift_gen}.rs` files
to a new `file::metadata::thrift` module.

# Are these changes tested?
Covered by existing tests.

# Are there any user-facing changes?
No, only makes changes to private APIs.


Development

Successfully merging this pull request may close these issues.

[parquet] Improve encoding mask API (wrap bare i32 in a struct w/ docs)
