Support `DurationType` in cudf parquet reader via `arrow:schema` #15617

mhaseeb123 · 2024-04-30T03:19:40Z

Description

This PR adds support for reading and using the arrow:schema struct from the serialized arrow:ipc message stored in the key-value metadata section of the Parquet file under the ARROW:schema key. This allows cudf to read and interoperate with Arrow for non-standard Parquet types (DurationType in this PR).

Arrow uses Google flatbuffers (defined in Schema.fbs) to serialize the arrow:Schema structure (containing column descriptors) and puts it (padded for 8-byte alignment) into the header of an empty ipc:Message (also a flatbuffer-serialized structure, defined in Message.fbs). The ipc:Message is prepended with two integers containing a validity message and the size of the header (the arrow:Schema + padding). The final message is encoded as a base64 string and written to the Parquet file footer key-value metadata using the "ARROW:schema" key.
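
As a rough illustration of the framing described above (not the Arrow or cudf implementation), the sketch below pads an already-serialized message to an 8-byte boundary, prepends the two integers, and base64-encodes the result. The function name `frame_schema_message` is hypothetical, and the sketch assumes the two integers are little-endian 32-bit values with the first one set to the Arrow IPC continuation marker `0xFFFFFFFF`.

```python
import base64
import struct

# Assumption: the leading "validity" integer is the Arrow IPC continuation marker.
CONTINUATION_MARKER = 0xFFFFFFFF


def frame_schema_message(message_bytes: bytes) -> str:
    """Frame opaque flatbuffer bytes the way the description outlines.

    `message_bytes` stands in for the flatbuffer-serialized ipc:Message;
    producing real flatbuffer bytes is out of scope for this sketch.
    """
    # Pad with zero bytes so the header length is a multiple of 8.
    padding = (-len(message_bytes)) % 8
    header = message_bytes + b"\x00" * padding

    # Two little-endian 32-bit integers: marker, then padded header size.
    prefix = struct.pack("<Ii", CONTINUATION_MARKER, len(header))

    # The framed bytes are stored as a base64 string under "ARROW:schema".
    return base64.b64encode(prefix + header).decode("ascii")
```

The returned string is what would be written as the value of the "ARROW:schema" key; the recorded size includes the padding, matching the "arrow:Schema + padding" wording above.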

In this PR, we base64-decode the ipc:Message, then decode the validity message and the header size, and advance the pointer to the arrow:Schema flatbuffer. We then use Flatbuffer structs to walk the arrow:Schema and collect information on columns of interest as an unordered_map (using column name as key). This unordered_map is used inside the select_columns function to build cudf Table columns and get the correct dtype.
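
The first half of that decode path can be sketched in a few lines of Python. This is only an illustration of the prefix handling, not the C++ code in this PR: the function name `extract_schema_header` is hypothetical, the integers are assumed to be little-endian 32-bit values, and the first one is assumed to be the Arrow IPC continuation marker `0xFFFFFFFF`. Walking the flatbuffer itself is left out.

```python
import base64
import struct

# Assumption: the leading "validity" integer is the Arrow IPC continuation marker.
CONTINUATION_MARKER = 0xFFFFFFFF
PREFIX_SIZE = 8  # two 32-bit integers


def extract_schema_header(b64_value) -> bytes:
    """Return the raw header bytes from an "ARROW:schema" metadata value."""
    raw = base64.b64decode(b64_value)
    if len(raw) < PREFIX_SIZE:
        raise ValueError("payload is shorter than the 8-byte prefix")

    # Decode the validity marker and the header size.
    marker, header_size = struct.unpack_from("<Ii", raw, 0)
    if marker != CONTINUATION_MARKER:
        raise ValueError("unexpected validity marker")
    if header_size < 0 or PREFIX_SIZE + header_size > len(raw):
        raise ValueError("header size does not fit in the payload")

    # Skip the prefix; these bytes are what a flatbuffer reader would walk.
    return raw[PREFIX_SIZE:PREFIX_SIZE + header_size]
```

Rejecting a bad marker or an out-of-range size up front means a malformed metadata value can be reported (or ignored) before any flatbuffer access is attempted.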

Closes #13410

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mhaseeb123 · 2024-04-30T03:19:56Z

CC: @GregoryKimball @vuule

etseidl

Some early comments. That's a lot of code for one data type 😄. Do we want to include all the google code, or add a library dependency? Or follow the thrift/protobuf model and roll our own parser?

Have you an idea of how long it takes to parse the schema? I'm wondering if it's better to make it optional (like add a use_arrow_schema reader option).

cpp/src/io/parquet/reader_impl.cpp

cpp/src/io/parquet/reader_impl_helpers.cpp

cpp/src/io/parquet/reader_impl_helpers.hpp

etseidl

GitHub ate some of my comments ☹️ A note about the column naming.

cpp/src/io/parquet/reader_impl_helpers.cpp

mhaseeb123 · 2024-04-30T18:48:20Z

> Some early comments. That's a lot of code for one data type 😄. Do we want to include all the google code, or add a library dependency? Or follow the thrift/protobuf model and roll our own parser?
>
> Have you an idea of how long it takes to parse the schema? I'm wondering if it's better to make it optional (like add a use_arrow_schema reader option).

Thank you for looking at this @etseidl. I am also thinking of making it optional in the updates I am working on, and of removing all the pushed flatbuffer code and adding it as a dependency instead.

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

cpp/src/io/parquet/reader_impl_helpers.cpp

cpp/src/io/parquet/reader_impl_helpers.hpp

mhaseeb123 · 2024-05-14T02:07:47Z

Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove use_arrow_schema from the public Python APIs, as I also agree with keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

galipremsagar · 2024-05-14T02:08:41Z

> Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove use_arrow_schema from the public Python APIs, as I also agree with keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

Sounds good to me 👍 Thanks for helping me understand, @etseidl!

mhaseeb123 · 2024-05-14T02:10:09Z

@mroeschke

> Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove use_arrow_schema from the public Python APIs, as I also agree with keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

@mroeschke Apologies for the back and forth, but since we have decided to remove the option from the Python side altogether (see the discussion above), I am going to remove the second part of the tests (which was using assert_neq) as well.

python/cudf/cudf/tests/test_parquet.py

galipremsagar

We can just do this.

python/cudf/cudf/_lib/parquet.pyx

Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>

galipremsagar · 2024-05-14T02:24:34Z

/okay to test

mhaseeb123 · 2024-05-14T18:56:08Z

/ok to test

.pre-commit-config.yaml

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

galipremsagar · 2024-05-14T19:45:51Z

/okay to test

mhaseeb123 · 2024-05-15T06:59:45Z

/ok to test

mhaseeb123 · 2024-05-15T16:17:40Z

/merge

Read duration type in cudf parquet via arrow:schema

053f7da

github-actions bot added the libcudf label Apr 30, 2024

mhaseeb123 added the cuIO, 2 - In Progress, improvement, breaking, and Reliability labels Apr 30, 2024

mhaseeb123 added 2 commits April 30, 2024 03:44

reverting an inadvertently removed code line.

aa4e9bb

clang-format changes

6c67c28

etseidl reviewed Apr 30, 2024

cpp/src/io/parquet/reader_impl_helpers.cpp

mhaseeb123 and others added 5 commits April 30, 2024 11:50

Update cpp/src/io/parquet/reader_impl_helpers.cpp

0e6fc4a

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

Co-walk arrow and parquet schema

a6eca13

fixing copyrights

ced5dd9

fix the hardcoded if conditions for duration type

b192352

add boolean check for arrow type columns

18d5e6c

etseidl reviewed May 1, 2024

cpp/src/io/parquet/reader_impl_helpers.cpp

mhaseeb123 changed the title ~~Support DurationType in cudf parquet via arrow:schema~~ Support DurationType in cudf parquet reader via arrow:schema May 1, 2024

add basic testing for duration type

8f55983

github-actions bot added the Python label May 1, 2024

mhaseeb123 added 4 commits May 1, 2024 19:33

revert clangd induced formatting

6883c7e

more reverting clangd

ab5cacd

remove raw for loops, verify equal fields at each schema level

649148c

Remove flatbuffer files. Add flatbuffers via CMake

416dbbd

github-actions bot added the CMake label May 2, 2024

mhaseeb123 commented May 2, 2024

cpp/src/io/parquet/reader_impl_helpers.hpp

mhaseeb123 added 2 commits May 2, 2024 23:41

Make arrow schema use in PQ reader optional. Add tests.

c5a7b0e

minor updates for better readability

6f18766

Remove use_arrow_schema from public Python APIs.

a80f562

mhaseeb123 requested a review from galipremsagar May 14, 2024 02:13

galipremsagar reviewed May 14, 2024

python/cudf/cudf/tests/test_parquet.py

galipremsagar requested changes May 14, 2024

python/cudf/cudf/_lib/parquet.pyx

mhaseeb123 and others added 2 commits May 13, 2024 19:19

Remove use_arrow_schema from Cython API args as well

4e368d8

Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>

Throw some Nulls in python tests

93ec789

galipremsagar approved these changes May 14, 2024

mhaseeb123 requested a review from mroeschke May 14, 2024 02:24

Merge branch 'branch-24.06' into arrow-schema-support-pq-reader

09eadcf

mhaseeb123 added the 5 - Ready to Merge label and removed the 3 - Ready for Review label May 14, 2024

Merge remote-tracking branch 'upstream/branch-24.06' into arrow-schema-support-pq-reader

1d94cc8

mhaseeb123 added the 4 - Needs Review label and removed the 5 - Ready to Merge label May 14, 2024

bdice approved these changes May 14, 2024

.pre-commit-config.yaml

Update .pre-commit-config.yaml

50d0b77

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

galipremsagar added the 5 - Ready to Merge label and removed the 4 - Needs Review label May 14, 2024

Merge branch 'branch-24.06' into arrow-schema-support-pq-reader

56b2edc

rapids-bot bot merged commit c5c95b7 into rapidsai:branch-24.06 May 15, 2024
75 checks passed

mhaseeb123 deleted the arrow-schema-support-pq-reader branch May 15, 2024 17:04

mhaseeb123 mentioned this pull request May 23, 2024

[FEA] Support arrow:Schema in Parquet writer for faithful roundtrip with Arrow via Parquet #15847

Closed
