-
Notifications
You must be signed in to change notification settings - Fork 1k
[thrift-remodel] Redo thrift enums and unions #8072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[thrift-remodel] Redo thrift enums and unions #8072
Conversation
@jhorstmann if you could take a look at what I've done and provide any ideas for improvements, esp to the macros, I'd appreciate it. Right now I'm having trouble with thrift unions that mix empty structs with stateful structs. I'd like the resulting rust enums to use unit variants for the empty structs just to cut down on code ugliness. I have a macro @alamb, the next step after this is to get a complete decode of (most of) |
// ---------------------------------------------------------------------- | ||
// Mirrors thrift union `crate::format::LogicalType` | ||
|
||
// private structs for decoding logical type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are only used when parsing the LogicalType
union, so they're kept private to this module. The LogicalType
rust enum uses struct variants, so these don't need to be exposed.
Float16, | ||
/// A Variant value. | ||
Variant, | ||
Variant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally catching up to the latest parquet spec
Maybe the field type could be made optional in the macro definition like this:
But that looses the nice feature of being able to do a clean diff of everything inside the macro against the actual thrift definitions from the spec. I don't think a declarative macro could keep track that |
I tried down that path, but my rust macro skills weren't up to implementing it 😅. |
FWIW, here are some benchmark numbers. In a separate branch I've completed reading (mostly) directly to rust structures. This is from a modified metadata bench.
It's encouraging to see that the time to fully decode to I'm finding some unfortunate dependencies on knowledge of the schema that lead to still having to parse to intermediate structures before creating the final results. It's true that the schema is usually the first thing encoded in the metadata, but there is no guarantee that this will be so. Thrift structures could conceivably be written out of order, even though this is pretty unlikely. I'll probably add a fast path that assumes the schema will be available when decoding the row group metadata...that will save some more unnecessary duplication of effort. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skimmed this PR and I think it looks quite cool -- thank you @etseidl 🙏
My biggest suggestion / concern with the endeavor to write a custom thrift decoder is the ability of future coders / maintainers to update the implementation as the parquet.thrift file evolves (e.g. imagine adding a new encoding or index type, and now we want to update the rust implementation)
Maybe we (and when I say "we" I really mean "you" / @jhorstmann 😆) could write a tutoral / example of how to parse the various thrift structures using these structures / macros 🤔
|
||
impl<'a> TryFrom<&mut ThriftCompactInputProtocol<'a>> for LogicalType { | ||
type Error = ParquetError; | ||
fn try_from(prot: &mut ThriftCompactInputProtocol<'a>) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this an example of what it takes to parse a thrift definition using this new structure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for now it's patterned on how the thrift compiler works. I know it's not the most maintainable way of doing things. I'd like to hide as much complexity in the macros as possible, but where the crate structures differ too much from the format structures, there's really no way around some hand written parsing code. I'm open to suggestions as to how to make this more maintainable.
One early thought I had was to try to reproduce the parser cudf uses (example here). They use a tuple of functors, one for each field in a struct/union and then run recursively through them for each field_id that's pulled out of the thrift stream. Once you get the hang of it, adding new structs is pretty easy. I'd really like to reproduce that here, but I don't know if that way of processing is possible in rust.
In the mean time, once I've got processing of the ParquetMetaData
as fast as I can, I want to swing back and macro-ize as much as possible. I like the model @jhorstmann came up with of just pasting snippets of thrift IDL to generate code. That should be much more maintainable.
For sure. Once an approach is settled on I'll document the process in a markdown file. |
I added some comments to the macros at https://github.com/jhorstmann/compact-thrift/blob/main/runtime/src/macros.rs, feel free to copy those over into this PR. The main pattern behind the macros is also described in the The Little Book of Rust Macros - Incremental TT Munchers, that whole site and the Macros By Example section in the Rust Reference was very helpful while developing. I was planning to do a presentation on those patterns in our internal rust meetup, but haven't gotten around to preparing anything for that yet. |
Thanks @jhorstmann I'll take a look and see what I can steal. 😄 I'm about to merge this so that I can move on to the meat of the changes...getting |
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
Next step.
What changes are included in this PR?
Attempt to make use of macros originally developed by @jhorstmann to implement some thrift enums and unions in
basic.rs
. Also adds yet another spin on a thrift decoder based heavily onTCompactSliceInputProtocol
. The current approach is to use this new decoder withTryFrom
impls on the data structures in question.This PR does not yet complete the process far enough to parse a parquet footer, but I'm putting it out here to get early feedback on the design.
Some structures already deviate enough from their thrift counterparts that the macro based parsing is not practical. Likewise, the macro approach doesn't work for structs that need lifetime annotations (necessary for reading binary fields as slices).
Are these changes tested?
Not yet.
Are there any user-facing changes?
Yes.