Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite remaining Python Arrow interop conversions using the C Data Interface #16548

Merged
merged 12 commits into from
Aug 16, 2024

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Aug 13, 2024

Description

This PR rewrites all remaining parts of the Python interop code previously using Arrow C++ types to instead use the C Data Interface. With this change, we no longer require pyarrow in that part of the Cython code. There are further improvements that we should make to streamline the internals, but I would like to keep this changeset minimal since getting it merged unblocks progress on multiple fronts so that we can progress further in parallel.

Contributes to #15193

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 13, 2024
@vyasr vyasr self-assigned this Aug 13, 2024
@vyasr vyasr requested review from a team as code owners August 13, 2024 20:00
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Aug 13, 2024
@vyasr
Copy link
Contributor Author

vyasr commented Aug 13, 2024

@zeroshade nice work on to_arrow_host! Everything looks good here, just one small bug with nested list types that would have required a somewhat convoluted test case to catch.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things seem fine to me, I don't have any significant comments. The complexity of this stuff is so high and there's a lot of "raw" objects like PyCapsule and void* everywhere... but I assume that's just the cost we have to pay for dealing with a C data interface.

@zeroshade
Copy link
Contributor

@zeroshade nice work on to_arrow_host! Everything looks good here, just one small bug with nested list types that would have required a somewhat convoluted test case to catch.

Thanks! Could we add that "somewhat convoluted test case" to the unit tests so that it doesn't get missed in future changes? 😄

@vyasr vyasr mentioned this pull request Aug 14, 2024
3 tasks
@vyasr
Copy link
Contributor Author

vyasr commented Aug 15, 2024

@zeroshade @paleolimbot I believe that this PR has uncovered a bug in nanoarrow. I raised apache/arrow-nanoarrow#587 to discuss that, but let me know if anything seems wrong here in how we've implemented things in libcudf. The failing test can be run with python -m pytest python/cudf/cudf/pylibcudf_tests/test_datetime.py::test_extract_year.

@vyasr
Copy link
Contributor Author

vyasr commented Aug 15, 2024

Based on @paleolimbot's comment here, perhaps what we should be doing is adding DATE32 and DATE64 to this switch statement? I assumed that @zeroshade had a reason for putting TIMESTAMP_DAYS where it is, was that just a typo perhaps Matt? Or maybe in that case it should be mapping to a different storage type, not in64?

@paleolimbot
Copy link

I think so! (Also will be fixed by apache/arrow-nanoarrow#588, which basically adds the same switch statement to nanoarrow so that you and/or future users don't have to deal with that 🙂 )

@vyasr
Copy link
Contributor Author

vyasr commented Aug 15, 2024

It seems like we need to map TIMESTAMP_DAYS to NANOARROW_TYPE_INT32 as a storage type in id_to_arrow_storage_type, but we also need to keep the id_to_arrow_type mapping for TIMESTAMP_DAYS because to_arrow_schema relies on that. It seems strange to keep both around, but it doesn't seem incorrect to me. @zeroshade let me know if that strikes you as something we should streamline.

That said, adding the mapping does appear sufficient to resolve the issue.

@zeroshade
Copy link
Contributor

Based on @paleolimbot's comment apache/arrow-nanoarrow#587 (comment), perhaps what we should be doing is adding DATE32 and DATE64 to this switch statement? I assumed that @zeroshade had a reason for putting TIMESTAMP_DAYS where it is, was that just a typo perhaps Matt? Or maybe in that case it should be mapping to a different storage type, not in64?

Not a typo, as you surmised we need to map TIMESTAMP_DAYS -> DATE32 for schema purposes, but to INT32 for the storage type of Array.

If you can think of a way to streamline it, then that would be awesome, but otherwise this sounds like the right solution until we can incorporate just calling it on the nanoarrow side via the PR that @paleolimbot linked to

@vyasr
Copy link
Contributor Author

vyasr commented Aug 16, 2024

/merge

@rapids-bot rapids-bot bot merged commit f955dd7 into rapidsai:branch-24.10 Aug 16, 2024
82 checks passed
@vyasr vyasr deleted the feat/to_arrow_python branch August 16, 2024 00:18
vyasr added a commit to vyasr/cudf that referenced this pull request Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants