
Add PyCapsule support for Arrow import and export #825

Merged · 19 commits into apache:main · Aug 30, 2024

Conversation

@timsaucer (Contributor) commented on Aug 20, 2024

Which issue does this PR close?

Closes #752

Rationale for this change

User requested.

What changes are included in this PR?

With this change you can import any Arrow table that implements the PyCapsule Interface using the `SessionContext.from_arrow_table` function. Additionally, PyCapsule export of `DataFrame` is added, so any Python-based project that uses Arrow with the PyCapsule interface can directly consume a DataFusion `DataFrame`.

The DataFrame will be executed at the point of export.

You can see a minimal example in the issue ticket.

Are there any user-facing changes?

This PR adds `SessionContext.from_arrow`, which serves the same purpose as `from_arrow_table` except that it takes any object that implements the required PyCapsule functions. `from_arrow_table` is now an alias for `from_arrow`.
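A minimal sketch of the round trip described above (the export step assumes pyarrow 14 or newer, where `pa.table()` accepts any object exposing `__arrow_c_stream__`):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Import: any object exposing __arrow_c_stream__ or __arrow_c_array__
# works here; a pyarrow Table is the simplest example.
df = ctx.from_arrow(pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]}))

# Export: pa.table() consumes the DataFrame through __arrow_c_stream__,
# which executes the query at this point.
round_tripped = pa.table(df)
```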

Still to do:

  • Add support for the requested schema
  • Add user examples
  • Add unit tests

@kylebarron (Contributor) commented:

> Add support for the requested schema

You can see my casting implementation here, but note that it's slightly different, as I'm using an `ArrayReader`, not a `RecordBatchReader`.
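As a point of reference, a consumer can exercise the requested-schema path from Python like this (a sketch; `RecordBatchReader.from_stream` is available in recent pyarrow, and a producer is free to honor the request, typically by casting):

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})              # "a" defaults to int64
requested = pa.schema([pa.field("a", pa.int32())])

# from_stream forwards `requested` as the requested_schema capsule of
# table.__arrow_c_stream__(); pyarrow satisfies the request by casting.
reader = pa.RecordBatchReader.from_stream(table, schema=requested)
print(reader.schema)
```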

@timsaucer (Contributor, Author) commented:

Ok, I added the requested schema and unit tests for export. All that's left, I think, is unit tests on the import. I'm not sure if I should keep nanoarrow in the unit tests; if so, I need to add it to the requirements files.

@timsaucer marked this pull request as ready for review on August 25, 2024 at 12:26.
Comment on lines 62 to 64
important to note that this will cause the DataFrame execution to happen, which may be
a time consuming task. That is, you will cause a :py:func:`datafusion.dataframe.DataFrame.collect`
operation call to occur.
Contributor:
I'd suggest putting this into an "admonition" box with a warning color to make it clearer. I'm not sure how to do that in Sphinx, but this is what I'm referring to in mkdocs-material: https://squidfunk.github.io/mkdocs-material/reference/admonitions/#supported-types
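For reference, Sphinx has this built in as the `.. warning::` directive, which works inside docstrings. A sketch of how the docstring in question might carry it (the surrounding signature is illustrative):

```python
def __arrow_c_stream__(self, requested_schema=None):
    """Export the DataFrame as an Arrow C stream.

    .. warning::

        This triggers DataFrame execution, which may be time consuming.
        It is equivalent to calling
        :py:func:`datafusion.dataframe.DataFrame.collect`.
    """
    ...
```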

Comment on lines +593 to +594
``__arrow_c_stream__`` or ``__arrow_c_array__``. For the latter, it must return
a struct array. Common examples of sources from pyarrow include
Contributor:

It applies to both: to transfer a table, each must emit a struct array. More generally, any Arrow array can be passed through an `__arrow_c_stream__`. Canonically, to transfer a DataFrame you have a stream of struct arrays, where each one is unpacked to become the columns of a `RecordBatch`. But it doesn't have to be a struct array: you can also transfer a Series through an `__arrow_c_stream__`, where each batch in the stream iterator is just a primitive array.
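To make that concrete with pyarrow (these dunders exist on pyarrow objects since the PyCapsule interface landed in pyarrow 14; `ChunkedArray` gained stream export in a later release):

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# A table exports a stream whose items are struct arrays, each unpacked
# into the columns of a RecordBatch on the consumer side.
stream_capsule = table.__arrow_c_stream__()

# A single RecordBatch exports __arrow_c_array__ as one struct array.
schema_capsule, array_capsule = table.to_batches()[0].__arrow_c_array__()

# A Series-like value also travels as a stream, but each item in the
# stream is a primitive array rather than a struct array.
chunk_stream = table.column("a").__arrow_c_stream__()
```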

merged_schema.project(&project_indices)
}

fn record_batch_into_schema(
Contributor:

I am surprised that neither arrow-rs nor DataFusion has such a utility for converting a record batch, but I did take a quick look around and didn't find anything.

Contributor:

Well, there is `cast`. `cast` works on struct arrays, so you could make a simple wrapper around `cast` that works on a `RecordBatch` by creating a struct array from the record batch. This is what I do in pyo3-arrow.

The main difference is that `cast` doesn't also project. It's not clear to me whether the PyCapsule Interface intends to support projection or not. I don't think anyone has asked.
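A pyarrow-side sketch of that wrapper (assuming a pyarrow recent enough to have `RecordBatch.to_struct_array`/`from_struct_array`, and that struct-to-struct casts are supported for the types involved):

```python
import pyarrow as pa

def record_batch_cast(batch: pa.RecordBatch, target: pa.Schema) -> pa.RecordBatch:
    # View the batch as a single struct array, cast it, then rebuild
    # the batch from the casted struct array.
    struct = batch.to_struct_array()
    casted = struct.cast(pa.struct(list(target)))
    # Note: cast alone does not project; selecting a subset of columns
    # would be a separate step.
    return pa.RecordBatch.from_struct_array(casted)

batch = pa.record_batch({"a": [1, 2, 3]})
print(record_batch_cast(batch, pa.schema([pa.field("a", pa.int32())])))
```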

Contributor:

Since the user isn't calling the PyCapsule interface directly, it's also not clear what the user API for requesting a projection via PyCapsules would look like.

@timsaucer (Contributor, Author) commented:

I applied your suggestions. Does either of you want a re-review before we ask to merge?

@Michael-J-Ward (Contributor) left a comment:

Excellent.

@andygrove merged commit 69ed7fe into apache:main on Aug 30, 2024. 15 checks passed.
Linked issue: Implement Arrow PyCapsule Interface (#752)