
Conversation

etseidl
Contributor

@etseidl etseidl commented Aug 6, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

Next step.

What changes are included in this PR?

Attempt to make use of macros originally developed by @jhorstmann to implement some thrift enums and unions in basic.rs. Also adds yet another spin on a thrift decoder based heavily on TCompactSliceInputProtocol. The current approach is to use this new decoder with TryFrom impls on the data structures in question.

This PR does not yet complete the process far enough to parse a parquet footer, but I'm putting it out here to get early feedback on the design.

Some structures already deviate enough from their thrift counterparts that the macro based parsing is not practical. Likewise, the macro approach doesn't work for structs that need lifetime annotations (necessary for reading binary fields as slices).
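
As a rough illustration of the shape of this approach, here is a minimal, self-contained sketch of the TryFrom-on-a-slice-reader pattern. The names (`SliceReader`, `MinMaxStats`) are made up for illustration and are not the actual types in this PR; the real decoder reads thrift compact-protocol field headers rather than fixed-size fields.

```rust
use std::convert::TryFrom;

// Hypothetical slice-based reader; in the PR, ThriftCompactInputProtocol plays this role.
struct SliceReader<'a> {
    buf: &'a [u8],
}

impl<'a> SliceReader<'a> {
    /// Borrow the next `n` bytes directly out of the input buffer.
    fn read_bytes(&mut self, n: usize) -> Result<&'a [u8], String> {
        if self.buf.len() < n {
            return Err("unexpected end of input".to_string());
        }
        let buf = self.buf; // copy the reference out so the result borrows for 'a
        let (head, tail) = buf.split_at(n);
        self.buf = tail;
        Ok(head)
    }
}

// A hypothetical target struct that borrows its binary fields from the input buffer;
// this borrowing is why some structs need the lifetime annotations mentioned above.
#[derive(Debug)]
struct MinMaxStats<'a> {
    min: &'a [u8],
    max: &'a [u8],
}

impl<'a> TryFrom<&mut SliceReader<'a>> for MinMaxStats<'a> {
    type Error = String;

    fn try_from(reader: &mut SliceReader<'a>) -> Result<Self, Self::Error> {
        // Real code would read thrift field headers here; fixed sizes keep the sketch short.
        Ok(MinMaxStats {
            min: reader.read_bytes(4)?,
            max: reader.read_bytes(4)?,
        })
    }
}

fn main() {
    let data = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let mut reader = SliceReader { buf: &data };
    let stats = MinMaxStats::try_from(&mut reader).unwrap();
    assert_eq!(stats.min, &data[..4]);
    assert_eq!(stats.max, &data[4..]);
}
```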

Are these changes tested?

Not yet.

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 6, 2025
@etseidl
Contributor Author

etseidl commented Aug 6, 2025

@jhorstmann if you could take a look at what I've done and provide any ideas for improvements, especially to the macros, I'd appreciate it. Right now I'm having trouble with thrift unions that mix empty structs with stateful structs. I'd like the resulting rust enums to use unit variants for the empty structs just to cut down on code ugliness. I have a macro thrift_union_all_empty that does what I want for enums with all unit variants.
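
For concreteness, the shape I'm after is something like this (just a sketch of the target enum, not generated code):

```rust
// Sketch of the desired generated enum: unit variants for empty thrift structs,
// struct variants for the stateful ones (other variants elided).
pub enum LogicalType {
    String,                                 // StringType is an empty thrift struct
    Decimal { scale: i32, precision: i32 }, // DecimalType carries state
}
```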

@alamb, the next step after this is to get a complete decode of (most of) FileMetaData working so we can start doing some benchmarking.

@etseidl etseidl added the api-change Changes to the arrow API label Aug 6, 2025
// ----------------------------------------------------------------------
// Mirrors thrift union `crate::format::LogicalType`

// private structs for decoding logical type
Contributor Author

@etseidl etseidl Aug 6, 2025

These are only used when parsing the LogicalType union, so they're kept private to this module. The LogicalType rust enum uses struct variants, so these don't need to be exposed.

Float16,
/// A Variant value.
Variant,
Variant {
Contributor Author

finally catching up to the latest parquet spec

@jhorstmann
Contributor

I'd like the resulting rust enums to use unit variants for the empty structs just to cut down on code ugliness. I have a macro thrift_union_all_empty that does what I want for enums with all unit variants.

Maybe the field type could be made optional in the macro definition like this:

union LogicalType {
  1:  STRING // was StringType STRING
...
  5:  DecimalType DECIMAL // DecimalType contains scale and precision
}

But that loses the nice feature of being able to do a clean diff of everything inside the macro against the actual thrift definitions from the spec. I don't think a declarative macro could keep track of the fact that StringType (which could be defined anywhere) is an empty struct. The issue is similar to the one about lifetimes: ideally I'd want to generate a lifetime parameter if any of the fields need one.
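
A rough sketch of that optional-type idea in muncher style (illustration only, not the real compact-thrift macros; it only builds the enum and ignores the field ids and any decoding):

```rust
// Sketch only: a TT-muncher that accepts union fields with or without an explicit
// struct type and emits unit vs. tuple variants accordingly.
macro_rules! thrift_union_sketch {
    // entry point: start munching the field list with an empty accumulator
    (union $name:ident { $($fields:tt)* }) => {
        thrift_union_sketch!(@munch $name [] $($fields)*);
    };
    // field with an explicit struct type -> tuple variant
    (@munch $name:ident [$($acc:tt)*] $_id:literal : $ty:ident $variant:ident ; $($rest:tt)*) => {
        thrift_union_sketch!(@munch $name [$($acc)* $variant($ty),] $($rest)*);
    };
    // field without a type (an empty thrift struct) -> unit variant
    (@munch $name:ident [$($acc:tt)*] $_id:literal : $variant:ident ; $($rest:tt)*) => {
        thrift_union_sketch!(@munch $name [$($acc)* $variant,] $($rest)*);
    };
    // nothing left to munch: emit the enum
    (@munch $name:ident [$($acc:tt)*]) => {
        #[derive(Debug)]
        #[allow(non_camel_case_types)] // the sketch reuses thrift's upper-case field names
        pub enum $name { $($acc)* }
    };
}

#[derive(Debug)]
pub struct DecimalType { pub scale: i32, pub precision: i32 }

thrift_union_sketch! {
    union LogicalType {
        1: STRING;                // no type given -> unit variant
        5: DecimalType DECIMAL;   // typed -> tuple variant wrapping DecimalType
    }
}
```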

@etseidl
Contributor Author

etseidl commented Aug 7, 2025

Maybe the field type could be made optional in the macro definition like this:

I tried going down that path, but my rust macro skills weren't up to implementing it 😅.

@etseidl
Contributor Author

etseidl commented Aug 8, 2025

FWIW, here are some benchmark numbers. In a separate branch I've completed reading (mostly) directly into rust structures. These numbers are from a modified metadata bench.
"open(default)" opens a SerializedFileReader to get the ParquetMetaData, "parquet metadata" just uses ParquetMetaDataReader::decode_metadata to parse the footer (this also returns a ParquetMetaData), "decode file metadata" uses the existing thrift code to do a raw decode to format::FileMetaData, and "decode new" uses the new parser to go straight to ParquetMetaData. The "(wide)" variants are for a synthetic 1000-column schema.

open(default)               time:   [37.663 µs 37.840 µs 38.011 µs]
parquet metadata            time:   [36.357 µs 36.453 µs 36.564 µs]
decode file metadata        time:   [22.177 µs 22.224 µs 22.279 µs]
decode new                  time:   [20.758 µs 20.797 µs 20.836 µs]
parquet metadata (wide)     time:   [219.16 ms 219.86 ms 220.56 ms]
decode file metadata (wide) time:   [110.76 ms 111.26 ms 111.82 ms]
decode new (wide)           time:   [78.140 ms 78.468 ms 78.802 ms]

It's encouraging to see that fully decoding to ParquetMetaData with the new code is faster than the current decoder's raw decode to the format objects.

I'm finding some unfortunate dependencies on knowledge of the schema that lead to still having to parse into intermediate structures before creating the final results. It's true that the schema is usually the first thing encoded in the metadata, but there is no guarantee that this will be so; thrift structures could conceivably be written out of order, even though this is pretty unlikely. I'll probably add a fast path that assumes the schema will be available when decoding the row group metadata...that will avoid some more unnecessary duplication of effort.

Contributor

@alamb alamb left a comment

I skimmed this PR and I think it looks quite cool -- thank you @etseidl 🙏

My biggest suggestion / concern with the endeavor to write a custom thrift decoder is the ability of future coders / maintainers to update the implementation as the parquet.thrift file evolves (e.g. imagine adding a new encoding or index type, and now we want to update the rust implementation)

Maybe we (and when I say "we" I really mean "you" / @jhorstmann 😆) could write a tutorial / example of how to parse the various thrift structures using these structures / macros 🤔


impl<'a> TryFrom<&mut ThriftCompactInputProtocol<'a>> for LogicalType {
    type Error = ParquetError;
    fn try_from(prot: &mut ThriftCompactInputProtocol<'a>) -> Result<Self> {
Contributor

is this an example of what it takes to parse a thrift definition using this new structure?

Contributor Author

Yes, for now it's patterned on how the thrift compiler works. I know it's not the most maintainable way of doing things. I'd like to hide as much complexity in the macros as possible, but where the crate structures differ too much from the format structures, there's really no way around some hand written parsing code. I'm open to suggestions as to how to make this more maintainable.

One early thought I had was to try to reproduce the parser cudf uses (example here). They use a tuple of functors, one for each field in a struct/union and then run recursively through them for each field_id that's pulled out of the thrift stream. Once you get the hang of it, adding new structs is pretty easy. I'd really like to reproduce that here, but I don't know if that way of processing is possible in rust.

In the meantime, once I've got processing of the ParquetMetaData as fast as I can, I want to swing back and macro-ize as much as possible. I like the model @jhorstmann came up with of just pasting snippets of thrift IDL to generate code. That should be much more maintainable.

@etseidl
Contributor Author

etseidl commented Aug 8, 2025

Maybe we (and when I say "we" I really mean "you" / @jhorstmann 😆) could write a tutorial / example of how to parse the various thrift structures using these structures / macros 🤔

For sure. Once an approach is settled on I'll document the process in a markdown file.

@jhorstmann
Contributor

I added some comments to the macros at https://github.com/jhorstmann/compact-thrift/blob/main/runtime/src/macros.rs, feel free to copy those over into this PR. The main pattern behind the macros is also described in The Little Book of Rust Macros - Incremental TT Munchers; that whole site and the Macros By Example section in the Rust Reference were very helpful while developing. I was planning to do a presentation on those patterns in our internal rust meetup, but haven't gotten around to preparing anything for that yet.
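
For anyone following along, the core of that pattern in its smallest form (a generic illustration, not code from compact-thrift):

```rust
// Minimal incremental TT muncher: each rule peels one token tree off the front
// of the input and recurses on the remainder; here it just counts them.
macro_rules! count_tts {
    () => { 0usize };
    ($_head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

fn main() {
    // `(c d)` is a single token tree, so four token trees in total
    assert_eq!(count_tts!(a b (c d) e), 4);
}
```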

@etseidl
Contributor Author

etseidl commented Aug 11, 2025

Thanks @jhorstmann I'll take a look and see what I can steal. 😄

I'm about to merge this so that I can move on to the meat of the changes...getting ParquetMetaData straight from the thrift. I've done more mods to your macros to deal with my way of reading things and also figured out a way to add lifetimes where needed. I'll get that all pushed today or tomorrow.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Aug 11, 2025
@github-actions github-actions bot removed arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Aug 11, 2025
@etseidl etseidl merged commit 50f3323 into apache:gh5854_thrift_remodel Aug 11, 2025
16 checks passed