
[Python] Arrow PyCapsule Protocol: standard way to get the schema of a "data" (array or stream) object? #39689

Open
@jorisvandenbossche


Follow-up discussion on the Arrow PyCapsule Protocol semantics added in #37797 (and overview issue promoting it: #39195). Current docs: https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html

This topic also came up on the PR itself. I brought it up in #37797 (review), and we then mostly discussed it in the thread at #37797 (comment), where we eventually removed __arrow_c_schema__ from the array.
Rephrasing my question from the PR discussion:

Should "data" objects also expose their schema by adding __arrow_c_schema__? (in addition to __arrow_c_array/stream__, on the same object)

In the merged implementation of the protocol in pyarrow itself, we cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have __arrow_c_array/stream__, and the DataType/Field/Schema classes have __arrow_c_schema__.
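To make that separation concrete, here is a minimal sketch of how a consumer could tell which role an object plays by checking which protocol dunders it exposes. The helper `protocol_roles` and the two stand-in classes are hypothetical names used only for illustration; real producers would return PyCapsules from these methods.

```python
# Dunder names defined by the Arrow PyCapsule Protocol.
SCHEMA_DUNDER = "__arrow_c_schema__"
DATA_DUNDERS = ("__arrow_c_array__", "__arrow_c_stream__")


def protocol_roles(obj):
    """Return the set of PyCapsule protocol dunders that obj exposes.

    Hypothetical helper, not part of pyarrow.
    """
    return {name for name in (SCHEMA_DUNDER, *DATA_DUNDERS) if hasattr(obj, name)}


# Hypothetical stand-ins for a schema-like and a data-like object.
class SchemaLike:
    def __arrow_c_schema__(self):
        raise NotImplementedError("a real producer returns an 'arrow_schema' capsule")


class ArrayLike:
    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("a real producer returns a (schema, array) capsule pair")


schema_roles = protocol_roles(SchemaLike())
array_roles = protocol_roles(ArrayLike())
```

With the clean separation described above, the two sets never overlap; the question in this issue is whether they should be allowed to.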

But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object.

For example, take the two cases for which I have an open PR to add the protocol:

  • A pandas.DataFrame does have a .dtypes attribute, but that's not a custom object that can expose the schema protocol; it's just a plain Series with data types as the values (pandas-dev/pandas#56587).
  • The interchange protocol DataFrame object only exposes column names, and you need to access a column itself to get the dtype, which is then a plain Python tuple. So again it is not something to which the dunder could be added, and it is also not at the dataframe level (data-apis/dataframe-api#342).

Personally, I think it would be useful to be able to inspect the schema of a "data" object before asking for the actual data. For pyarrow objects you could check the .type or .schema attributes and then call __arrow_c_schema__ on the result, but that again puts something library-specific in the middle, which we want to avoid.

Summarizing the different arguments from our earlier thread about having __arrow_c_schema__ on an array/stream object:

Pro:

  • Library-agnostic way to get the schema of an Arrow(Array/Stream)Exportable object, before getting the actual data
  • Reasons you might want to do this:
    • To be able to inspect the schema without data conversions, because getting the data is not necessarily zero-copy (for libraries that are not exactly 1:1 aligned with the Arrow format)
    • If you want to pass a requested_schema, you first need to know the schema you would get, before you can create your desired schema to pass to __arrow_c_array/stream__
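The "inspect first, then request" workflow from the Pro list could look roughly like the sketch below. Everything here is an assumption for illustration: `LazyStreamProducer` and `export_with_large_strings` are hypothetical names, plain dicts stand in for the real PyCapsules, and type names are simplified to strings. Only the dunder names and the `requested_schema` parameter come from the protocol itself.

```python
class LazyStreamProducer:
    """Hypothetical producer whose data export is expensive (not zero-copy)."""

    def __init__(self, columns):
        self._columns = columns   # e.g. {"a": "int64", "b": "string"}
        self.export_count = 0     # counts data exports, for illustration only

    def __arrow_c_schema__(self):
        # Stand-in for an "arrow_schema" capsule: cheap, no data is touched.
        return dict(self._columns)

    def __arrow_c_stream__(self, requested_schema=None):
        # Stand-in for an "arrow_array_stream" capsule.
        self.export_count += 1
        schema = requested_schema if requested_schema is not None else dict(self._columns)
        return {"schema": schema, "batches": []}


def export_with_large_strings(obj):
    """Inspect the schema first, then request a modified schema for the data."""
    if not hasattr(obj, "__arrow_c_schema__"):
        # Without the schema dunder, the consumer can only export blindly.
        return obj.__arrow_c_stream__()
    current = obj.__arrow_c_schema__()
    requested = {
        name: ("large_string" if typ == "string" else typ)
        for name, typ in current.items()
    }
    return obj.__arrow_c_stream__(requested_schema=requested)


producer = LazyStreamProducer({"a": "int64", "b": "string"})
result = export_with_large_strings(producer)
```

The point of the sketch is that the data is exported exactly once, already in the desired layout, and the schema lookup never triggers a conversion.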

Con:

  • Being able to pass an array or stream where a schema is expected is a bit too loose (Quote from Antoine); e.g. it is weird that passing an Array or RecordBatch to pa.schema(..) would work and return a schema (although sidenote from myself: if we want, we can still disallow this, and only accept objects that only have __arrow_c_schema__ in pa.schema(..))
  • Getting the schema of a stream may involve I/O and is a fallible operation, so I think that's more reason to separate them (Quote from David)
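The sidenote in the first Con item (only accepting objects that have __arrow_c_schema__ and nothing else where a schema is expected) could be sketched as follows. This is not how pa.schema(..) is implemented; `require_schema_only` and the stand-in classes are hypothetical, and a string stands in for the schema capsule.

```python
DATA_DUNDERS = ("__arrow_c_array__", "__arrow_c_stream__")


def require_schema_only(obj):
    """Return obj's schema export, rejecting objects that also export data.

    Hypothetical strict check, for illustration only.
    """
    if not hasattr(obj, "__arrow_c_schema__"):
        raise TypeError("object does not implement __arrow_c_schema__")
    for name in DATA_DUNDERS:
        if hasattr(obj, name):
            raise TypeError(f"expected a schema-like object, but it also defines {name}")
    return obj.__arrow_c_schema__()


class PureSchema:
    def __arrow_c_schema__(self):
        return "schema-capsule-stand-in"


class DataWithSchema(PureSchema):
    # A "data" object that additionally exposes its schema.
    def __arrow_c_array__(self, requested_schema=None):
        return ("schema-capsule-stand-in", "array-capsule-stand-in")


accepted = require_schema_only(PureSchema())
try:
    require_schema_only(DataWithSchema())
    rejected = False
except TypeError:
    rejected = True
```

A check like this would let data objects carry __arrow_c_schema__ for inspection while schema constructors stay strict about what they accept.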

I think it would be nice if we could give projects some guidance about what the best practice is.
(Right now I am planning to add __arrow_c_schema__ in the above-mentioned PRs because those projects don't have a "schema" object, but ideally I can follow a recommendation, so that consumer libraries can rely on a consistent expectation of whether a schema is available or not.)

cc @wjones127 @pitrou @lidavidm

and also cc @kylebarron and @WillAyd as I know you both have been experimenting with the capsule protocol and might have some user experience with it
