[Python] Public API to consume objects supporting the PyCapsule Arrow C Data Interface #38010
Description
#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).
In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders to give you the PyCapsule).
#37797 already added this to the pa.array(..), pa.record_batch(..) and pa.schema(..) constructors, such that you can for example create a pyarrow array with pa.array(obj) given any object obj that supports the interface by defining __arrow_c_array__.
But that's not fully complete: we certainly need a way to construct a RecordBatchReader as well, where we don't have such a factory function available. For this, we could add a from_ function (similar to the existing from_batches) like RecordBatchReader.from_stream?
- [Python] RecordBatchReader constructor from stream object implementing the PyCapsule Protocol #39217
(In addition, there are also the Table, Field and DataType constructors, but those all have factory functions that could support this, similar to pa.array(..) et al.)
Secondly, I am also wondering if we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, if you are a library that has data in Arrow-compatible memory and you want to convert this to pyarrow through the C Data Interface, you might want to pass a PyCapsule directly if your library doesn't expose a Python class that represents that data. That would avoid having to create a small wrapper class with just the dunder to pass to the pyarrow constructor (although this is of course not difficult).
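For reference, the "small wrapper class" workaround mentioned above could look roughly like the sketch below. The class name CapsuleArrayWrapper is hypothetical and not a pyarrow API; it assumes the library already holds an "arrow_schema" capsule and an "arrow_array" capsule and only needs to hand them over.

```python
class CapsuleArrayWrapper:
    """Hypothetical helper: exposes two existing PyCapsules via the dunder."""

    def __init__(self, schema_capsule, array_capsule):
        self._schema_capsule = schema_capsule
        self._array_capsule = array_capsule

    def __arrow_c_array__(self, requested_schema=None):
        # requested_schema is ignored in this sketch; the capsules are
        # returned as-is, in (schema, array) order.
        return self._schema_capsule, self._array_capsule
```

With real capsules, pa.array(CapsuleArrayWrapper(schema_capsule, array_capsule)) would go through the same path as any other object defining the dunder. Note that a consumer takes ownership of the exported data, so the wrapped capsules should be treated as single-use.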