
[Python] Add Python protocol for the Arrow C (Data/Stream) Interface #35531

Closed
@jorisvandenbossche

Description

Context: we want Arrow to be usable as the format for sharing data between (Python) libraries/applications, ideally in a generic way that doesn't require hardcoding support for specific libraries.
We already have __arrow_array__ for objects that know how to convert themselves to a pyarrow.Array or ChunkedArray. But this protocol produces actual pyarrow objects (so a better name might have been __pyarrow_array__ ..), and is thus tied to the pyarrow library (it also only covers arrays, not tables/batches). For projects that have an (optional) dependency on pyarrow that is fine, but we want to avoid making pyarrow a requirement (e.g. for nanoarrow). Meanwhile, we also have the Arrow C Data Interface as a more generic way to share Arrow data in memory, focusing on the actual Arrow spec without relying on a specific library implementation.

Right now, the way to use the C Interface is through the _export_to_c and _import_from_c methods.
But those methods are 1) private, advanced APIs (although we could of course decide to make them "official", since many projects are already using them, and document them that way), and 2) again specific to pyarrow (I don't think other projects have adopted the same names).
So other projects (polars, datafusion, duckdb, etc.) use those to convert from pyarrow to their own representation. But those projects don't have a similar API to use the C Data Interface to share their data with one another (e.g. with pyarrow, or polars with duckdb, ...).
If we had a standard Python protocol (dunder) method for this, libraries could implement support for consuming (and producing) objects that expose their data through the Arrow C Interface without having to hardcode for specific implementations (as those libraries currently do for pyarrow).

The most generic protocol would be one supporting the Stream interface, and that could look something like this:

class MyArrowCompatibleObject:

    def __arrow_c_stream__(self) -> PyCapsule:
        """
        Return a PyCapsule wrapping an ArrowArrayStream struct.
        """
        ...

In addition, we could have variants that do the same for the other structs, such as __arrow_c_data__ or __arrow_c_array__, and __arrow_c_schema__.
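To illustrate the consumer side of such a protocol, here is a hypothetical sketch. The dunder names follow the proposal above; the import_arrow_data function and its dispatch logic are invented for illustration, not part of any existing API:

```python
# Hypothetical consumer-side dispatch, assuming the dunder names proposed
# above. No pyarrow import is needed: any producer object implementing the
# protocol is accepted via duck typing.
def import_arrow_data(obj):
    if hasattr(obj, "__arrow_c_stream__"):
        # The capsule wraps an ArrowArrayStream struct; a real consumer
        # would hand it to its own C-level importer.
        return ("stream", obj.__arrow_c_stream__())
    if hasattr(obj, "__arrow_c_array__"):
        # Single-array variant, if the protocol gains one.
        return ("array", obj.__arrow_c_array__())
    raise TypeError("object does not expose the Arrow C (Data/Stream) Interface")


# A minimal fake producer standing in for a pyarrow/polars/duckdb object:
class FakeProducer:
    def __arrow_c_stream__(self):
        return "capsule-stand-in"  # a real implementation returns a PyCapsule


kind, capsule = import_arrow_data(FakeProducer())
assert kind == "stream"
```

This is the key benefit: the consuming library dispatches on the protocol method, not on the producing library's type.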

Some design questions:

  • For the mechanics of the method, I would propose to use PyCapsules instead of raw pointers as described here: [Python] Use PyCapsule for communicating C Data Interface pointers at the Python level #34031
  • Which set of protocol methods do we need? Is only a stream version sufficient (since a single array can always be put in a stream of one array)? Or would it be useful (and simpler for some applications) to also have an Array version?
    • But what would an array version return exactly? (since it needs to return both the ArrowArray and the ArrowSchema)
  • With the ongoing discussion about generalizing the C Interface to other devices (GH-34971: [Format] Add non-CPU version of C Data Interface #34972), should we focus here on the current interfaces, or should we directly use the Device versions?
  • Do we want to distinguish between an array and a tabular version? From the C Interface point of view, that's all the same: it's just an ArrowArray. But for example, we currently define _export_to_c on RecordBatch and RecordBatchReader, where you know this will always return a StructArray representation of one batch, vs the same method on Array, where it can return an array of any type. It could be nice to distinguish those use cases for consumers.
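On the PyCapsule mechanics from the first bullet, a stdlib-only sketch of how a producer could wrap a struct pointer in a named capsule and how a consumer retrieves it. The capsule name "arrow_array_stream" and the placeholder struct are assumptions for illustration; neither is decided in this issue:

```python
import ctypes

# Bind the CPython capsule C API via ctypes (CPython-specific).
capi = ctypes.pythonapi
capi.PyCapsule_New.restype = ctypes.py_object
capi.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
capi.PyCapsule_GetPointer.restype = ctypes.c_void_p
capi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Stand-in for an ArrowArrayStream struct (the real struct carries
# get_schema/get_next/get_last_error/release callbacks plus private_data).
stream_struct = ctypes.create_string_buffer(64)

# Producer side: wrap the struct pointer in a named capsule instead of
# handing out a raw integer address; "arrow_array_stream" is an assumed
# name, not fixed by this issue.
name = b"arrow_array_stream"
capsule = capi.PyCapsule_New(ctypes.addressof(stream_struct), name, None)

# Consumer side: unwrap the pointer, with the name acting as a type check
# (a mismatched name raises instead of silently yielding a bad pointer).
ptr = capi.PyCapsule_GetPointer(capsule, name)
assert ptr == ctypes.addressof(stream_struct)
```

Compared to passing raw pointers as Python integers, the capsule carries a name for sanity checking and can own a destructor that releases the struct if the consumer never imports it.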
