Rewrite remaining Python Arrow interop conversions using the C Data Interface #16548

vyasr · 2024-08-13T20:00:30Z

Description

This PR rewrites all remaining parts of the Python interop code that previously used Arrow C++ types so that they use the C Data Interface instead. With this change, we no longer require pyarrow in that part of the Cython code. There are further improvements that we should make to streamline the internals, but I would like to keep this changeset minimal, since getting it merged unblocks work on multiple fronts that can then proceed in parallel.
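For readers unfamiliar with the C Data Interface, the sketch below mirrors the `struct ArrowSchema` layout from the Arrow specification using only `ctypes`, which is roughly why a consumer does not need Arrow C++ or pyarrow to describe a type. It is an illustration, not code from this PR: the helper `describe_int32_field` is hypothetical, and the `release` callback is left NULL, so the struct shown here is not a valid exported schema.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """ctypes mirror of `struct ArrowSchema` from the Arrow C Data Interface."""


# Fields are assigned after the class exists because the struct refers to itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),  # in C: void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]

ARROW_FLAG_NULLABLE = 2  # flag value defined by the specification


def describe_int32_field(name: bytes) -> ArrowSchema:
    """Hypothetical helper: fill in a childless, nullable int32 field."""
    schema = ArrowSchema()
    schema.format = b"i"  # "i" is the format string for int32
    schema.name = name
    schema.flags = ARROW_FLAG_NULLABLE
    schema.n_children = 0
    # `release` stays NULL here; a real producer must set it before exporting.
    return schema


schema = describe_int32_field(b"year")
print(schema.format, schema.name, schema.flags, schema.n_children)
```

A date32 column would use the format string `tdD` in place of `i`; everything else in the struct stays the same.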

Contributes to #15193

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vyasr · 2024-08-13T20:01:29Z

@zeroshade nice work on to_arrow_host! Everything looks good here, just one small bug with nested list types that would have required a somewhat convoluted test case to catch.

bdice

Things seem fine to me, I don't have any significant comments. The complexity of this stuff is so high and there's a lot of "raw" objects like PyCapsule and void* everywhere... but I assume that's just the cost we have to pay for dealing with a C data interface.
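To make the "raw" objects mentioned above concrete, here is a minimal sketch of what a PyCapsule is: a named wrapper around a `void*`. It drives the CPython capsule API through `ctypes.pythonapi` and uses the capsule name `arrow_schema`, which the Arrow PyCapsule interface reserves for `ArrowSchema` pointers. The payload is a plain integer standing in for a real struct, and no destructor is registered; both are simplifications for illustration, not how this PR's Cython code is written.

```python
import ctypes

# Capsule name that the Arrow PyCapsule interface assigns to ArrowSchema pointers.
SCHEMA_CAPSULE_NAME = b"arrow_schema"

# ctypes.pythonapi exposes the CPython C API; signatures must be declared by hand.
capsule_new = ctypes.pythonapi.PyCapsule_New
capsule_new.restype = ctypes.py_object
capsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

capsule_is_valid = ctypes.pythonapi.PyCapsule_IsValid
capsule_is_valid.restype = ctypes.c_int
capsule_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]

capsule_get_pointer = ctypes.pythonapi.PyCapsule_GetPointer
capsule_get_pointer.restype = ctypes.c_void_p
capsule_get_pointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Stand-in payload: a real producer would point at an ArrowSchema struct.
payload = ctypes.c_int64(2024)
address = ctypes.addressof(payload)

# Wrap the raw pointer; None means no destructor (a real producer supplies one).
capsule = capsule_new(address, SCHEMA_CAPSULE_NAME, None)

# A consumer checks the name before trusting the pointer it unwraps.
if capsule_is_valid(capsule, SCHEMA_CAPSULE_NAME):
    pointer = capsule_get_pointer(capsule, SCHEMA_CAPSULE_NAME)
    recovered = ctypes.c_int64.from_address(pointer).value
    print(type(capsule).__name__, recovered)
```

The name check is the only type safety the capsule offers, which is part of why this style of code feels low level.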

cpp/src/interop/to_arrow_schema.cpp

zeroshade · 2024-08-13T21:22:35Z

> @zeroshade nice work on to_arrow_host! Everything looks good here, just one small bug with nested list types that would have required a somewhat convoluted test case to catch.

Thanks! Could we add that "somewhat convoluted test case" to the unit tests so that it doesn't get missed in future changes? 😄

vyasr · 2024-08-15T17:22:46Z

@zeroshade @paleolimbot I believe that this PR has uncovered a bug in nanoarrow. I raised apache/arrow-nanoarrow#587 to discuss that, but let me know if anything seems wrong here in how we've implemented things in libcudf. The failing test can be run with python -m pytest python/cudf/cudf/pylibcudf_tests/test_datetime.py::test_extract_year.

vyasr · 2024-08-15T18:31:32Z

Based on @paleolimbot's comment on apache/arrow-nanoarrow#587, perhaps we should be adding DATE32 and DATE64 to this switch statement? I assumed that @zeroshade had a reason for putting TIMESTAMP_DAYS where it is. Matt, was that just a typo? Or should it map to a different storage type in that case, not int64?
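As background for the storage-type question, the sketch below shows what the physical values look like: Arrow's date32 is a 32-bit count of days since 1970-01-01, and date64 is a 64-bit count of milliseconds since the same epoch. The function names are hypothetical and the code uses only the standard library; it illustrates the encoding and is not the libcudf or nanoarrow implementation.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
MILLISECONDS_PER_DAY = 86_400_000


def to_date32_storage(d: date) -> bytes:
    """Encode a date as date32 storage: little-endian int32 days since the epoch."""
    return struct.pack("<i", (d - EPOCH).days)


def to_date64_storage(d: date) -> bytes:
    """Encode a date as date64 storage: little-endian int64 milliseconds since the epoch."""
    return struct.pack("<q", (d - EPOCH).days * MILLISECONDS_PER_DAY)


def year_from_date32_storage(raw: bytes) -> int:
    """Decode int32 day counts back to a calendar year."""
    (days,) = struct.unpack("<i", raw)
    return (EPOCH + timedelta(days=days)).year


raw32 = to_date32_storage(date(2024, 8, 15))
raw64 = to_date64_storage(date(2024, 8, 15))
print(len(raw32), len(raw64), year_from_date32_storage(raw32))
```

Reading a 4-byte day count as if it were an 8-byte value would misinterpret the buffer, which is the kind of mismatch a wrong storage-type mapping can produce.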

paleolimbot · 2024-08-15T19:13:25Z

I think so! (Also will be fixed by apache/arrow-nanoarrow#588, which basically adds the same switch statement to nanoarrow so that you and/or future users don't have to deal with that 🙂 )

vyasr · 2024-08-15T19:40:18Z

It seems like we need to map TIMESTAMP_DAYS to NANOARROW_TYPE_INT32 as a storage type in id_to_arrow_storage_type, but we also need to keep the id_to_arrow_type mapping for TIMESTAMP_DAYS because to_arrow_schema relies on that. It seems strange to keep both around, but it doesn't seem incorrect to me. @zeroshade let me know if that strikes you as something we should streamline.

That said, adding the mapping does appear sufficient to resolve the issue.

zeroshade · 2024-08-15T19:52:37Z

> Based on @paleolimbot's comment on apache/arrow-nanoarrow#587, perhaps we should be adding DATE32 and DATE64 to this switch statement? I assumed that @zeroshade had a reason for putting TIMESTAMP_DAYS where it is. Matt, was that just a typo? Or should it map to a different storage type in that case, not int64?

Not a typo. As you surmised, we need to map TIMESTAMP_DAYS -> DATE32 for schema purposes, but to INT32 for the storage type of the Array.

If you can think of a way to streamline it, that would be awesome, but otherwise this sounds like the right solution until we can simply call it on the nanoarrow side via the PR that @paleolimbot linked to.
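The two-mapping arrangement discussed above can be sketched as follows. The enums are hypothetical Python stand-ins for `cudf::type_id` and nanoarrow's type enum, and the functions only borrow the names `id_to_arrow_type` and `id_to_arrow_storage_type` from the thread; the fallback from storage type to schema type is an assumption made for this illustration, not a description of the libcudf code.

```python
from enum import Enum, auto


class CudfTypeId(Enum):
    """Hypothetical stand-in for a few cudf::type_id values."""
    INT32 = auto()
    INT64 = auto()
    TIMESTAMP_DAYS = auto()


class ArrowType(Enum):
    """Hypothetical stand-in for a few nanoarrow type identifiers."""
    INT32 = auto()
    INT64 = auto()
    DATE32 = auto()


# Logical type reported in the schema.
SCHEMA_TYPES = {
    CudfTypeId.INT32: ArrowType.INT32,
    CudfTypeId.INT64: ArrowType.INT64,
    CudfTypeId.TIMESTAMP_DAYS: ArrowType.DATE32,
}

# Physical layout used when initializing the array, listed only where it differs.
STORAGE_OVERRIDES = {
    CudfTypeId.TIMESTAMP_DAYS: ArrowType.INT32,
}


def id_to_arrow_type(type_id: CudfTypeId) -> ArrowType:
    """Return the logical Arrow type used for schema purposes."""
    return SCHEMA_TYPES[type_id]


def id_to_arrow_storage_type(type_id: CudfTypeId) -> ArrowType:
    """Return the storage type, falling back to the schema type (an assumption)."""
    return STORAGE_OVERRIDES.get(type_id, SCHEMA_TYPES[type_id])


print(id_to_arrow_type(CudfTypeId.TIMESTAMP_DAYS).name,
      id_to_arrow_storage_type(CudfTypeId.TIMESTAMP_DAYS).name)
```

Keeping the overrides in a separate table makes it explicit that only a few types have a storage layout that differs from their logical type.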

vyasr · 2024-08-16T00:13:53Z

/merge


vyasr added 7 commits August 13, 2024 19:54

First pass at enabling conversion using the C data interface

60e24dc

Fix metadata passing

40d4ef5

Cleanup

f1275ed

Use new interface for scalars and fix a small bug

a76ad94

More cleanup

e76b459

Change from_arrow scalar implementation to just call the column version

8c094af

Remove pyarrow Cython/build dependency altogether from interop

71cbe33

vyasr added the 3 - Ready for Review, improvement, and non-breaking labels Aug 13, 2024

vyasr self-assigned this Aug 13, 2024

vyasr requested review from a team as code owners August 13, 2024 20:00

vyasr requested review from isVoid, charlesbluca, PointKernel and davidwendt August 13, 2024 20:00

github-actions bot added the libcudf, Python, CMake, and pylibcudf labels Aug 13, 2024

bdice approved these changes Aug 13, 2024


PointKernel reviewed Aug 13, 2024


cpp/src/interop/to_arrow_schema.cpp (outdated, resolved)

vyasr mentioned this pull request Aug 14, 2024

Reenable arrow tests #16556

Merged

3 tasks

vyasr added 3 commits August 14, 2024 22:05

Switch to using the child column index for clarity

51b65d5

Disable checking field nullability by default

849c09e

Fix handling of null empty list scalar

9f7bda2

PointKernel approved these changes Aug 15, 2024


Add storage type mapping for TIMESTAMP_DAYS

c1f8fb9

Merge branch 'branch-24.10' into feat/to_arrow_python

06de6f6

rapids-bot bot merged commit f955dd7 into rapidsai:branch-24.10 Aug 16, 2024
82 checks passed

vyasr deleted the feat/to_arrow_python branch August 16, 2024 00:18

This was referenced Aug 16, 2024

Setup pylibcudf package #16299

Merged

Consider changing the column_metadata expectations when converting list types to arrow #16600

Open

[FEA] Consider default nullability setting when converting cudf data to arrow C Data interface #16621

Open

vyasr added a commit to vyasr/cudf that referenced this pull request Oct 14, 2024

Revert "Rewrite remaining Python Arrow interop conversions using the …

6dad588

…C Data Interface (rapidsai#16548)" This reverts commit f955dd7.
