
[Python] Add Python protocol for the Arrow C (Data/Stream) Interface #35531

Closed
@jorisvandenbossche

Description

Context: we want Arrow to be usable as the format for sharing data between (Python) libraries/applications, ideally in a generic way that doesn't require hardcoding support for specific libraries.
We already have __arrow_array__ for objects that know how to convert themselves to a pyarrow.Array or ChunkedArray. But this protocol produces actual pyarrow objects (so a better name might have been __pyarrow_array__ ..), and is thus tied to the pyarrow library (it also only covers arrays, not tables/batches). For projects that have an (optional) dependency on pyarrow that is fine, but we want to avoid making pyarrow a requirement (e.g. for nanoarrow). Meanwhile, we also have the Arrow C Data Interface as a more generic way to share Arrow data in memory, focusing on the actual Arrow spec without relying on a specific library implementation.

Right now, the way to use the C Interface is through the _export_to_c and _import_from_c methods.
But those methods are 1) private, advanced APIs (although we could of course decide to make them "official", since many projects are already using them, and document them that way), and 2) again specific to pyarrow (I don't think other projects have adopted the same names).
So other projects (polars, datafusion, duckdb, etc.) use those to convert from pyarrow to their own representation. But those projects don't have a similar API to use the C Data Interface to share their data with one another (e.g. with pyarrow, or polars with duckdb, ...).
If we had a standard Python protocol (dunder) method for this, libraries could implement support for consuming (and producing) objects that expose their data through the Arrow C Interface without having to hardcode for specific implementations (as those libraries currently do for pyarrow).

The most generic protocol would be one supporting the Stream interface, and that could look something like this:

class MyArrowCompatibleObject:

    def __arrow_c_stream__(self) -> PyCapsule:
        """
        Return a PyCapsule wrapping an ArrowArrayStream struct.
        """
        ...

In addition, we could have variants that do the same for the other structs, such as __arrow_c_data__ or __arrow_c_array__, and __arrow_c_schema__.
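To illustrate the consumer side of such a protocol, here is a hypothetical sketch. The dunder names follow the proposal above; the import_arrow_data function and its dispatch logic are invented for illustration, not part of any existing API:

```python
# Hypothetical consumer-side dispatch, assuming the dunder names proposed
# above. No pyarrow import is needed: any producer object implementing the
# protocol is accepted via duck typing.
def import_arrow_data(obj):
    if hasattr(obj, "__arrow_c_stream__"):
        # The capsule wraps an ArrowArrayStream struct; a real consumer
        # would hand it to its own C-level importer.
        return ("stream", obj.__arrow_c_stream__())
    if hasattr(obj, "__arrow_c_array__"):
        # Single-array variant, if the protocol gains one.
        return ("array", obj.__arrow_c_array__())
    raise TypeError("object does not expose the Arrow C (Data/Stream) Interface")


# A minimal fake producer standing in for a pyarrow/polars/duckdb object:
class FakeProducer:
    def __arrow_c_stream__(self):
        return "capsule-stand-in"  # a real implementation returns a PyCapsule


kind, capsule = import_arrow_data(FakeProducer())
assert kind == "stream"
```

This is the key benefit: the consuming library dispatches on the protocol method, not on the producing library's type.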

Some design questions:

  • For the mechanics of the method, I would propose to use PyCapsules instead of raw pointers as described here: [Python] Use PyCapsule for communicating C Data Interface pointers at the Python level #34031
  • Which set of protocol methods do we need? Is only a stream version sufficient (since a single array can always be put in a stream of one array)? Or would it be useful (and simpler for some applications) to also have an Array version?
    • But what would an array version return exactly? (since it needs to return both the ArrowArray and the ArrowSchema)
  • With the ongoing discussion about generalizing the C Interface to other devices (GH-34971: [Format] Add non-CPU version of C Data Interface #34972), should we focus here on the current interfaces, or should we directly use the Device versions?
  • Do we want to distinguish between an array and a tabular version? From the C Interface point of view, that's all the same: it's just an ArrowArray. But for example, we currently define _export_to_c on RecordBatch and RecordBatchReader, where you know this will always return a StructArray representation of one batch, vs the same method on Array, where it can return an array of any type. It could be nice to distinguish those use cases for consumers.
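On the PyCapsule mechanics from the first bullet, a stdlib-only sketch of how a producer could wrap a struct pointer in a named capsule and how a consumer retrieves it. The capsule name "arrow_array_stream" and the placeholder struct are assumptions for illustration; neither is decided in this issue:

```python
import ctypes

# Bind the CPython capsule C API via ctypes (CPython-specific).
capi = ctypes.pythonapi
capi.PyCapsule_New.restype = ctypes.py_object
capi.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
capi.PyCapsule_GetPointer.restype = ctypes.c_void_p
capi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Stand-in for an ArrowArrayStream struct (the real struct carries
# get_schema/get_next/get_last_error/release callbacks plus private_data).
stream_struct = ctypes.create_string_buffer(64)

# Producer side: wrap the struct pointer in a named capsule instead of
# handing out a raw integer address; "arrow_array_stream" is an assumed
# name, not fixed by this issue.
name = b"arrow_array_stream"
capsule = capi.PyCapsule_New(ctypes.addressof(stream_struct), name, None)

# Consumer side: unwrap the pointer, with the name acting as a type check
# (a mismatched name raises instead of silently yielding a bad pointer).
ptr = capi.PyCapsule_GetPointer(capsule, name)
assert ptr == ctypes.addressof(stream_struct)
```

Compared to passing raw pointers as Python integers, the capsule carries a name for sanity checking and can own a destructor that releases the struct if the consumer never imports it.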
