
ENH: support the Arrow PyCapsule Interface on pandas.Series (export) #59518

Open
kylebarron opened this issue Aug 14, 2024 · 14 comments · Fixed by #59587
Labels
Arrow (pyarrow functionality), Enhancement

Comments

@kylebarron
Contributor

kylebarron commented Aug 14, 2024

Problem Description

Similar to the existing DataFrame export added in #56587, I would like pandas to export Series via the Arrow PyCapsule Interface.

Feature Description

Implementation would be quite similar to #56587. I'm not sure what the ideal API is to convert a pandas Series to a pyarrow Array/ChunkedArray.
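
For illustration, a rough sketch of what this could enable (the DataFrame line already works since #56587; the Series lines are hypothetical and only indicate the requested shape of the API):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
capsule = df.__arrow_c_stream__()  # exists today: DataFrame export added in #56587

ser = df["a"]
# Requested here: an analogous export on Series, e.g. (hypothetically)
#   ser.__arrow_c_stream__()  and/or  ser.__arrow_c_array__()
# so that any PyCapsule-aware consumer can ingest a single column directly.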

Alternative Solutions

Require users to wrap a Series in a DataFrame before passing it to an external consumer?

Additional Context

pyarrow has implemented the PyCapsule Interface on ChunkedArray for the last couple of versions. Ref apache/arrow#40818

If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

cc @jorisvandenbossche as he implemented #56587

@kylebarron added the Enhancement and Needs Triage labels on Aug 14, 2024
@mroeschke added the Arrow label and removed the Needs Triage label on Aug 14, 2024
@jorisvandenbossche
Member

Yes, this is on my to-do list for the coming weeks! :)

If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

For our numpy-based dtypes, this is indeed the case. But the problem is that for the pyarrow-based columns, we actually store chunked arrays.

So when implementing one protocol, it should be __arrow_c_stream__ I think. I am not entirely sure how helpful it would be to also implement __arrow_c_array__ and have it either work only when there is a single chunk (erroring otherwise) or concatenate implicitly when there are multiple chunks.
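
For illustration, a minimal sketch of what a stream export on Series could look like, assuming it delegates to pyarrow's ChunkedArray.__arrow_c_stream__ (available per apache/arrow#40818); this is not the actual pandas implementation, just the shape of the idea:

import pandas as pd
import pyarrow as pa

def series_arrow_c_stream(ser: pd.Series, requested_schema=None):
    if hasattr(ser.array, "_pa_array"):
        # pyarrow-backed dtypes already hold a ChunkedArray internally
        ca = ser.array._pa_array
    else:
        # numpy-backed dtypes convert to a single-chunk ChunkedArray
        ca = pa.chunked_array([pa.Array.from_pandas(ser)])
    # Delegate capsule creation (and requested_schema handling) to pyarrow
    return ca.__arrow_c_stream__(requested_schema)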

@kylebarron
Contributor Author

So when implementing one protocol, it should be __arrow_c_stream__ I think.

Yes, that makes sense and I agree. I'd suggest always exporting a stream.

@WillAyd
Member

WillAyd commented Aug 15, 2024

Why wouldn't we support both methods? The fact that pyarrow-backed arrays use chunked storage behind the scenes is an implementation detail; for everything else (and from a front-end perspective) array makes more sense than array_stream.

@WillAyd
Member

WillAyd commented Aug 15, 2024

N.B. I'm also not sure of the reasoning that went into us using chunked array storage for pyarrow-backed types.

@kylebarron
Contributor Author

kylebarron commented Aug 15, 2024

I think the question is: what should data consumers be able to infer from the presence of one or more of these methods? See apache/arrow#40648

  • If an object implements __arrow_c_array__, then the object's internal storage should already be contiguous. You shouldn't implement __arrow_c_array__ if calling it would (even sometimes) require a concatenation to happen.
  • If an object's internal storage may be one or more arrays, then it should implement __arrow_c_stream__, where the stream will sometimes include only a single array, but where it can be zero-copy as much as possible (in the pandas case: always, except for non-Arrow storage). See the consumer-side sketch below.
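
For illustration, a consumer-side sketch of that dispatch logic, assuming pyarrow as the consumer (pa.array() consumes __arrow_c_array__ since pyarrow 14; pa.chunked_array() consuming __arrow_c_stream__ is assumed here, per apache/arrow#40818):

import pyarrow as pa

def ingest(obj):
    if hasattr(obj, "__arrow_c_array__"):
        # The producer guarantees contiguous storage: import as a single Array
        return pa.array(obj)
    if hasattr(obj, "__arrow_c_stream__"):
        # The producer may hold one or more chunks: import as a ChunkedArray
        return pa.chunked_array(obj)
    raise TypeError("object does not export Arrow data via the PyCapsule Interface")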

@jorisvandenbossche
Member

There is still some discussion about using array vs stream interface, so reopening this issue.

@WillAyd's comment from #59587 (comment):

To talk more practically, I am wondering about a scenario where we have a Series that holds a chunked array. In this PR, we convert that to a single array before exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk.

But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate its chunks? In most cases it seems highly unlikely that this is possible (unless all other Series in the DataFrame share the same chunk layout), so you get a rather interesting scenario where iterating by DataFrame could be far more expensive than iterating the Series.

The more I think it through, the more I am leaning towards -1 on supporting the stream interface for Series; I think we should just expose it as an array for now.

For the specific example of a Series with multiple chunks that is then put in a DataFrame: if you access it through the DataFrame's stream interface, you will also get multiple chunks.

In pyarrow, a Table is also not required to have equally chunked columns, but when converting to a stream of batches it will merge the chunking of the different columns (by default, I think, in such a way that everything can be a zero-copy slice of the original data, i.e. the smallest batches needed to achieve that across all columns).

Code example to illustrate:

In [21]: pd.options.future.infer_string = True

In [22]: ser = pd.concat([pd.Series(["a"]), pd.Series(["b"])], ignore_index=True)

In [23]: ser.array._pa_array
Out[23]: 
<pyarrow.lib.ChunkedArray object at 0x7f096d29b650>
[
  [
    "a"
  ],
  [
    "b"
  ]
]

In [24]: df = pd.DataFrame({"A": [1, 2], "B": ser})

In [25]: df
Out[25]: 
   A  B
0  1  a
1  2  b

In [26]: table = pa.Table.from_pandas(df)

In [27]: table
Out[27]: 
pyarrow.Table
A: int64
B: large_string
----
A: [[1,2]]
B: [["a"],["b"]]

In [28]: table["B"]
Out[28]: 
<pyarrow.lib.ChunkedArray object at 0x7f096cdd7650>
[
  [
    "a"
  ],
  [
    "b"
  ]
]

In [29]: list(table.to_reader())
Out[29]: 
[pyarrow.RecordBatch
 A: int64
 B: large_string
 ----
 A: [1]
 B: ["a"],
 pyarrow.RecordBatch
 A: int64
 B: large_string
 ----
 A: [2]
 B: ["b"]]

So in practice, while we are discussing the usage of the stream interface for Series, I think it would effectively apply to both DataFrame and Series (or would there be a reason to handle those differently? Maybe for tabular data people generally expect streams more, in contrast to single columns/arrays?)

@WillAyd
Member

WillAyd commented Aug 27, 2024

@jorisvandenbossche that's super cool - I did not realize Arrow created RecordBatches in that fashion. And that's zero-copy for the integer array in your example?

In that case, maybe we should export both the stream and array interfaces. I guess there's some ambiguity for a consumer as to whether that is a zero-copy exchange or not, but maybe that doesn't matter(?)

@WillAyd
Member

WillAyd commented Aug 27, 2024

Another consideration is how we want to act as a consumer of Arrow data, not just as a producer. If we push developers towards preferring stream data in this interface, the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.

I'm not sure that is a bad thing, but it's different from where we are today.

@kylebarron
Contributor Author

My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented. Otherwise, implementing the C Array interface as well could allow consumers to unknowingly cause pandas to concatenate a series into an array.

IMO, the C Array interface should only be implemented if a pandas Series always stores a contiguous array under the hood. It's easy for consumers of a C stream to see if the stream only holds a single array (assuming the consumer materializes the full stream)
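
For illustration, a small consumer-side sketch of that check (again assuming pa.chunked_array() accepts stream-protocol exporters, per apache/arrow#40818):

import pyarrow as pa

def materialize(obj):
    ca = pa.chunked_array(obj)  # consumes the producer's __arrow_c_stream__
    if ca.num_chunks == 1:
        return ca.chunk(0)      # zero-copy: the producer was already contiguous
    return ca                   # multiple chunks: concatenating is now an explicit choice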

@jorisvandenbossche
Member

And that's zero-copy for the integer array in your example?

AFAIK yes

the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.

That's already the case for dtypes backed by pyarrow, I think. If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
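
For illustration, with a ChunkedArray constructed directly as a stand-in for what a chunked Parquet read would produce:

import pandas as pd
import pyarrow as pa

ca = pa.chunked_array([["a"], ["b", "c"]])          # two chunks
ser = pd.Series(pd.arrays.ArrowExtensionArray(ca))  # pyarrow-backed Series
print(ser.array._pa_array.num_chunks)               # 2: the chunking is preserved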

@WillAyd
Member

WillAyd commented Aug 27, 2024

My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented.

Thanks, this is good input. I'm bought in!

If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.

Cool! I don't think I voiced it well, but my major concern is around our usage of NumPy by default. For the I/O operation you described, I think the dtype_mapper will in many cases still end up copying that chunked data into a single array to fit into the NumPy view of the world.

Not trying to solve that here, but I think with I/O methods there is some control over how you want that conversion to happen. With this interface, I don't think we get that control? So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?

@jorisvandenbossche
Member

So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?

We don't yet have the consumption part in pandas (I am planning to look into that in the near future, but that shouldn't stop anyone from doing it already), but if we have a method to generally convert Arrow data to a pandas DataFrame, it could in theory have a keyword similar to other IO methods to determine which dtypes to use.
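
For comparison, existing IO readers already expose that choice via the dtype_backend keyword (the file path below is just a placeholder); a future Arrow-consumption method could offer something similar:

import pandas as pd

df_numpy = pd.read_parquet("example.parquet")                           # numpy-backed dtypes (default)
df_arrow = pd.read_parquet("example.parquet", dtype_backend="pyarrow")  # pyarrow-backed dtypes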

@WillAyd
Member

WillAyd commented Aug 27, 2024

Makes sense. So for where we are on main now I think we just need to make it so that the implementation doesn't always copy (?)

@jorisvandenbossche
Member

Yes, and fix requested_schema handling
