Support `DurationType` in cudf parquet reader via arrow:schema #15617
Some early comments. That's a lot of code for one data type 😄. Do we want to include all the Google code, or add a library dependency? Or follow the thrift/protobuf model and roll our own parser?
Do you have an idea of how long it takes to parse the schema? I'm wondering if it's better to make it optional (like adding a `use_arrow_schema` reader option).
Github ate some of my comments
Thank you for looking at this @etseidl. I am also thinking of making it optional in the updates I am working on, and of removing all the pushed flatbuffer code and adding it as a dependency instead.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove the |
Sounds good to me 👍 Thanks for helping me understand, @etseidl!
@mroeschke Apologies for the back and forth but since we have decided to remove the option from Python side altogether (above discussion) , I am going to remove the 2nd part of the tests (which was using the |
We can just do this.
Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Description
This PR adds support for reading and using the `arrow:schema` struct from the serialized `arrow:ipc` message written to the key-value metadata section of the Parquet file under the `ARROW:schema` key. This allows cudf to read and interoperate with Arrow for non-standard Parquet types (`DurationType` in this PR).

Arrow uses Google flatbuffers (inside Schema.fbs) to serialize the `arrow:Schema` structure (containing column descriptors) and puts it (padded for 8-byte alignment) into the header of an empty `ipc:Message` (also a flatbuffer-serialized structure, inside Message.fbs). The `ipc:Message` is prepended with two integers containing a `validity` message and the `size of the header` (the `arrow:Schema` + padding). The final message is encoded as a base64 string and written to the Parquet file footer key-value metadata using the `"ARROW:schema"` key.

In this PR, we base64-decode the `ipc:Message`, then decode the `validity` message and the header size, and offset pointers to the `arrow:Schema` flatbuffer. We then use Flatbuffer structs to walk the `arrow:Schema` and collect information on columns of interest as an unordered_map (using column name as key). This unordered_map is used inside the `select_columns` function to build cudf Table columns and get the correct `dtype`.

Closes #13410
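
For readers who want to see the framing described above in isolation, here is a minimal Python sketch of the decode steps (base64-decode, read the two leading integers, slice out the header bytes). It is not the code in this PR, which is C++ and walks a real flatbuffer. The sketch assumes both integers are 32-bit little-endian and that the first one (the `validity` message above) is the continuation marker `0xFFFFFFFF`, as in the current Arrow IPC encapsulated message format. The names `split_ipc_message` and `make_fake_message` are hypothetical helpers, and the payload is placeholder bytes rather than a serialized `arrow:Schema`.

```python
import base64
import struct

# Assumption: the first integer is the Arrow IPC continuation marker.
CONTINUATION = 0xFFFFFFFF
PREFIX_SIZE = 8  # two 32-bit integers


def split_ipc_message(encoded: str) -> bytes:
    """Return the header bytes (schema flatbuffer + padding) of a base64 message."""
    raw = base64.b64decode(encoded)
    if len(raw) < PREFIX_SIZE:
        raise ValueError("message too short for the two-integer prefix")
    # "<Ii": little-endian unsigned 32-bit marker, then signed 32-bit size.
    marker, header_size = struct.unpack_from("<Ii", raw, 0)
    if marker != CONTINUATION:
        raise ValueError("unexpected validity/continuation marker")
    if header_size < 0 or PREFIX_SIZE + header_size > len(raw):
        raise ValueError("header size exceeds message length")
    return raw[PREFIX_SIZE:PREFIX_SIZE + header_size]


def make_fake_message(payload: bytes) -> str:
    """Build a synthetic message with the same framing, for illustration only."""
    padded = payload + b"\x00" * (-len(payload) % 8)  # pad to 8-byte alignment
    prefix = struct.pack("<Ii", CONTINUATION, len(padded))
    return base64.b64encode(prefix + padded).decode("ascii")


if __name__ == "__main__":
    encoded = make_fake_message(b"schema-bytes")
    header = split_ipc_message(encoded)
    print(len(header), header[:12])
```

In a real reader the returned bytes would then be handed to a flatbuffer parser to walk the schema fields. Rejecting a bad marker or an out-of-range size up front keeps a malformed metadata string from causing out-of-bounds reads later.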
Checklist