Skip to content

[Python] Public API to consume objects supporting the PyCapsule Arrow C Data Interface #38010

Closed
@jorisvandenbossche

Description

#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).

In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders to give you the PyCapsule).
#37797 already added this to the pa.array(..), pa.record_batch(..) and pa.schema(..) constructors, such that you can for example create a pyarrow array with pa.array(obj) given any object obj that supports the interface by defining __arrow_c_array__.

But that's not fully complete: we certainly need a way to construct a RecordBatchReader as well, where we don't have such a factory function available. For this, we could add a from_ function (similar to the existing from_batches) like RecordBatchReader.from_stream?

(in addition there is also the Table, Field and DataType constructors, both those all have factory functions that could support this, similar to pa.array(..) et al)


Secondly, I am also wondering if we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, if you are a library that has data in Arrow compatible memory, and you want to convert this to pyarrow through the C Data Interface, you might want to use a PyCapsule directly if your library doesn't expose a Python class that represents that data (to avoid that you need to create a small wrapper class just with the dunder to pass to the pyarrow constructor, although this is of course not difficult).

Metadata

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions