feat(python): Support PyCapsule Interface in DataFrame & Series constructors #17693

Merged
merged 12 commits into from
Jul 25, 2024

kylebarron
Contributor

@kylebarron kylebarron commented Jul 18, 2024

Progress towards the import side of #12530.

This adds a check in the DataFrame and Series constructors for input objects that define an __arrow_c_array__ or __arrow_c_stream__ method. As a result, polars can import a variety of Arrow-based objects via the Arrow PyCapsule interface.

For reference, this table shows the pyarrow objects that implement each method. The pyarrow objects are only examples; crucially, this also works with any other Python Arrow implementation, such as ibis.Table, pandas.DataFrame (v2.2 and later), and nanoarrow objects.

| Relevant pyarrow object | dunder attribute | Series | DataFrame |
| --- | --- | --- | --- |
| pyarrow.Array | __arrow_c_array__ | | |
| pyarrow.RecordBatch | __arrow_c_array__ | ✅ (as struct) | |
| pyarrow.ChunkedArray | __arrow_c_stream__ | | |
| pyarrow.Table | __arrow_c_stream__ | ✅ (as struct) | |
| pyarrow.RecordBatchReader | __arrow_c_stream__ | ✅ (as struct) | |

Note that this short-circuits pyarrow-specific handling. If desired, this could be checked after known pyarrow objects.

The code can be cleaned up a bit (and some unwraps removed/fixed) but it's working, so I figure it's worth putting this up for feedback on the overall approach.

import polars as pl
import pyarrow as pa

table = pa.table({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})


# Add an indirection class to ensure that this example is indeed using the pycapsule
# interface, and not the direct `pyarrow.Table` conversion
class PyCapsuleStreamHolder:
    capsule: object

    def __init__(self, capsule: object) -> None:
        self.capsule = capsule

    def __arrow_c_stream__(self, requested_schema: object = None) -> object:
        return self.capsule


s = pl.Series(PyCapsuleStreamHolder(table.__arrow_c_stream__(None)))
s
# shape: (4,)
# Series: '' [struct[2]]
# [
# 	{1,"a"}
# 	{2,"b"}
# 	{3,"c"}
# 	{4,"d"}
# ]

pdf = pl.DataFrame(PyCapsuleStreamHolder(table.__arrow_c_stream__(None)))
pdf
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# │ 4   ┆ d   │
# └─────┴─────┘


class PyCapsuleArrayHolder:
    capsules: object

    def __init__(self, capsules: object) -> None:
        self.capsules = capsules

    def __arrow_c_array__(self, requested_schema: object = None) -> object:
        return self.capsules


record_batch = table.to_batches()[0]
record_batch
# pyarrow.RecordBatch
# a: int64
# b: string
# ----
# a: [1,2,3,4]
# b: ["a","b","c","d"]

s = pl.Series(PyCapsuleArrayHolder(record_batch.__arrow_c_array__(None)))
s
# shape: (4,)
# Series: '' [struct[2]]
# [
# 	{1,"a"}
# 	{2,"b"}
# 	{3,"c"}
# 	{4,"d"}
# ]

pdf = pl.DataFrame(PyCapsuleArrayHolder(record_batch.__arrow_c_array__(None)))
pdf
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# │ 4   ┆ d   │
# └─────┴─────┘

@eitsupi
Contributor

eitsupi commented Jul 18, 2024

Wouldn't it be simpler to import it as a Series and then convert it to a DataFrame with to_frame().unnest(), instead of implementing the C Stream interface directly in the DataFrame? (Like this: pola-rs/r-polars#1078)

@kylebarron kylebarron changed the title Support PyCapsule Interface in DataFrame constructor feat(python) Support PyCapsule Interface in DataFrame & Series constructors Jul 18, 2024
@kylebarron
Contributor Author

Wouldn't it be simpler to import it as a Series and then convert it to a DataFrame with to_frame().unnest(), instead of implementing the C Stream interface directly in the DataFrame?

Thanks for the advice! That is indeed easier, because we only have to touch the capsules from the Series impl.

@ritchie46
Member

Thanks @kylebarron. Can you fix the tests?

@kylebarron kylebarron changed the title feat(python) Support PyCapsule Interface in DataFrame & Series constructors feat(python): Support PyCapsule Interface in DataFrame & Series constructors Jul 20, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars and removed title needs formatting labels Jul 20, 2024

codecov bot commented Jul 21, 2024

Codecov Report

Attention: Patch coverage is 83.01887% with 18 lines in your changes missing coverage. Please review.

Project coverage is 80.49%. Comparing base (1df3b0b) to head (1618d1d).
Report is 30 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| py-polars/src/series/import.rs | 82.02% | 16 Missing ⚠️ |
| py-polars/polars/_typing.py | 50.00% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17693      +/-   ##
==========================================
+ Coverage   80.40%   80.49%   +0.09%     
==========================================
  Files        1502     1504       +2     
  Lines      197041   197139      +98     
  Branches     2794     2810      +16     
==========================================
+ Hits       158439   158696     +257     
+ Misses      38088    37921     -167     
- Partials      514      522       +8     


@kylebarron
Contributor Author

I had to implement a workaround for empty streams because impl TryFrom<(&Field, Vec<Box<dyn Array>>)> for Series fails for an empty Vec<> of chunks.

It would be ideal if that impl were fixed; I tried, but it makes quite a few assumptions that the Vec is non-empty. For now, the import code just calls Series::new_empty instead.

@kylebarron
Contributor Author

Note that this short-circuits pyarrow-specific handling. If desired, this could be checked after known pyarrow objects.

The checks for PyCapsule objects were also moved after the pyarrow- and pandas-specific checks. Since those are already in place, pyarrow and pandas objects will always be imported the same way (through the existing pyarrow- and pandas-specific APIs) regardless of their versions, while objects from any other library will go through the new PyCapsule API.
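
A rough sketch of that check ordering, in plain Python. The function name `choose_import_path`, the returned labels, the module-name test, and the relative order of the array and stream checks are all assumptions for illustration; the actual constructor code is not shown in this thread.

```python
def choose_import_path(obj: object) -> str:
    """Hypothetical dispatcher: library-specific checks first, PyCapsule last."""
    top_level_module = type(obj).__module__.split(".")[0]

    # Known libraries keep their existing, dedicated import paths.
    if top_level_module == "pyarrow":
        return "pyarrow"
    if top_level_module == "pandas":
        return "pandas"

    # Anything else that exposes the Arrow PyCapsule dunder methods
    # goes through the generic interface.
    if hasattr(obj, "__arrow_c_array__"):
        return "pycapsule_array"
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule_stream"
    return "other"


class FakePandasFrame:
    # Pretend to live in pandas while also exposing the stream method.
    __module__ = "pandas.core.frame"

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError


class ThirdPartyStream:
    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError
```

Here `choose_import_path(FakePandasFrame())` takes the pandas path even though the object also defines `__arrow_c_stream__`, while `ThirdPartyStream` falls through to the PyCapsule path.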

) -> PyResult<(arrow::datatypes::Field, Box<dyn Array>)> {
validate_pycapsule_name(schema_capsule, "arrow_schema")?;
validate_pycapsule_name(array_capsule, "arrow_array")?;

Member

Can you add a // SAFETY comment explaining which invariants must hold here?

Contributor Author

Take a look and see if those safety comments are ok

@ritchie46 ritchie46 merged commit 1d5ef5c into pola-rs:main Jul 25, 2024
16 of 17 checks passed
@kylebarron kylebarron deleted the kyle/c-stream-import branch July 25, 2024 15:44