Description
Follow-up discussion on the Arrow PyCapsule Protocol semantics added in #37797 (and overview issue promoting it: #39195). Current docs: https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
This topic came up on the PR itself as well: I brought it up in #37797 (review), and then we mostly discussed it (eventually removing `__arrow_c_schema__` from the array) in the thread at #37797 (comment).
Rephrasing my question from the PR discussion:

> Should "data" objects also expose their schema by adding `__arrow_c_schema__` (in addition to `__arrow_c_array__`/`__arrow_c_stream__`, on the same object)?
So in the merged implementation of the protocol in pyarrow itself, we cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have `__arrow_c_array__`/`__arrow_c_stream__`, and the DataType/Field/Schema classes have `__arrow_c_schema__`.
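To make that separation concrete, a quick check (assuming pyarrow >= 14.0, where the protocol was introduced; the last assertion reflects the status quo at the time of writing):

```python
import pyarrow as pa

arr = pa.array([1, 2, 3])
tbl = pa.table({"a": [1, 2, 3]})
sch = pa.schema([("a", pa.int64())])

# "data" objects expose the data dunders...
assert hasattr(arr, "__arrow_c_array__")
assert hasattr(tbl, "__arrow_c_stream__")

# ...and "type-like" objects expose the schema dunder
assert hasattr(sch, "__arrow_c_schema__")
assert hasattr(pa.int64(), "__arrow_c_schema__")
assert hasattr(pa.field("a", pa.int64()), "__arrow_c_schema__")

# but the data objects do not (currently) expose __arrow_c_schema__
assert not hasattr(arr, "__arrow_c_schema__")
```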
But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object.
For example, for two cases for which I have an open PR to add the protocol:

- a `pandas.DataFrame` does have a `.dtypes` attribute, but that's not a custom object that can expose the schema protocol (it's just a plain Series with the data types as values) (pandas-dev/pandas#56587);
- the interchange protocol DataFrame object only exposes column names, and you need to access a column itself to get its dtype, which then is a plain Python tuple (so again not something to which the dunder could be added, and it is also not at the dataframe level) (data-apis/dataframe-api#342).
Personally I think it would be useful to have the ability to inspect the schema of a "data" object before asking for the actual data. For pyarrow objects you could check the `.type` or `.schema` attributes and then get `__arrow_c_schema__` from those, but that again puts something library-specific in the middle, which is what we want to avoid.
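Concretely, using pyarrow as the example producer (the commented line is the library-agnostic spelling this issue is asking about):

```python
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3]})

# Library-specific route: the consumer has to know about the .schema
# attribute (or .type for an Array) before it can use the protocol.
capsule = tbl.schema.__arrow_c_schema__()

# Library-agnostic route this issue is about: for any exporter `obj`,
# without knowing whether it is a pyarrow Table, a pandas DataFrame, ...
# capsule = obj.__arrow_c_schema__()
```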
Summarizing the different arguments from our earlier thread about having `__arrow_c_schema__` on an array/stream object:
Pro:
- Library-agnostic way to get the schema of an Arrow(Array/Stream)Exportable object, before getting the actual data
- Reasons you might want to do this:
  - To be able to inspect the schema without data conversions, because getting the data is not necessarily zero-copy (for libraries that are not exactly 1:1 aligned with the Arrow format)
  - If you want to pass a `requested_schema`, you first need to know the schema you would get, before you can create your desired schema to pass to `__arrow_c_array__`/`__arrow_c_stream__` (see the sketch after this list)
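A sketch of that `requested_schema` use case (a hypothetical consumer helper; it assumes the producer object exposes `__arrow_c_schema__` next to `__arrow_c_array__`, and uses pyarrow only on the consuming side):

```python
import pyarrow as pa

def export_with_large_strings(obj):
    """Hypothetical consumer that wants string columns as large_string."""
    # Step 1: inspect the advertised schema without materializing the data.
    # pa.schema() consumes any object implementing __arrow_c_schema__, so
    # this step is only library-agnostic if the data object exposes it.
    schema = pa.schema(obj)
    # Step 2: derive the schema we actually want from the advertised one
    requested = pa.schema(
        [
            pa.field(f.name, pa.large_string(), f.nullable)
            if f.type == pa.string()
            else f
            for f in schema
        ]
    )
    # Step 3: now we can ask for the data with a requested_schema capsule;
    # per the spec this returns a (schema_capsule, array_capsule) tuple
    return obj.__arrow_c_array__(
        requested_schema=requested.__arrow_c_schema__()
    )
```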
Con:
- Being able to pass an array or stream where a schema is expected is a bit too loose (Quote from Antoine); e.g. it is weird that passing an Array or RecordBatch to `pa.schema(..)` would work and return a schema (although sidenote from myself: if we want, we can still disallow this, and only accept objects that *only* have `__arrow_c_schema__` in `pa.schema(..)`; see the sketch after this list)
- Getting the schema of a stream may involve I/O and is a fallible operation, so I think that's more reason to separate them (Quote from David)
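For the sidenote above, such a restriction could look roughly like this (a sketch of a possible guard, not current pyarrow behavior):

```python
import pyarrow as pa

def schema_only(obj):
    # Refuse "data" objects even if they (also) expose __arrow_c_schema__,
    # so that only dedicated schema-like objects are accepted here.
    if hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__"):
        raise TypeError(
            "expected a schema-like object, got a data object; "
            "pass its schema instead"
        )
    return pa.schema(obj)  # consumes __arrow_c_schema__
```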
I think it would be nice if we can have some guidance for projects about what the best practice is.
(right now I was planning to add `__arrow_c_schema__` in the above-mentioned PRs, because those projects don't have a "schema" object, but ideally I can follow a recommendation, so that consumer libraries can rely on such an expectation of a schema being available or not)
cc @wjones127 @pitrou @lidavidm
and also cc @kylebarron and @WillAyd as I know you both have been experimenting with the capsule protocol and might have some user experience with it