Support Arrow C Stream interface containing stream of Array #6586

Open
kylebarron opened this issue Oct 18, 2024 · 0 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is not currently possible to use arrow-rs's FFI to exchange something like an ArrayStream or ChunkedArray when those arrays do not represent RecordBatches. ffi_stream::ArrowArrayStreamReader will error if the data type of the stream is not Struct.

This makes it impossible in the general case to interop with a pyarrow.ChunkedArray or polars.Series (via Python).

The Arrow C Stream Interface does support non-struct array types. get_next() of ArrowArrayStream returns an ArrowArray, and an ArrowArray can be any generic Arrow array. That Arrow array is often a StructArray, with the understanding that the StructArray represents a RecordBatch, but it doesn't have to be.

Here:

    let result = unsafe {
        from_ffi_and_data_type(array, DataType::Struct(self.schema().fields().clone()))
    };
    Some(result.map(|data| RecordBatch::from(StructArray::from(data))))

you assume that the data type of the stream is struct (and also assume that you can interpret the C Schema as a Schema), but that isn't required by the spec. To be more generic, you can use the data type of the C Schema directly.
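To make the difference concrete, here is a minimal toy sketch in plain Rust. It does not use arrow-rs at all: `DataType`, `ImportedArray`, `import_as_record_batch`, and `import_as_array` are hypothetical stand-ins I made up for illustration. The strict function mirrors a reader that rejects any non-struct stream, while the generic one simply trusts the data type declared by the stream's schema.

```rust
// Toy model only: none of these types or functions are arrow-rs APIs.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Struct(Vec<(String, DataType)>),
}

/// Hypothetical stand-in for one chunk returned by the stream's `get_next()`.
#[derive(Debug, Clone, PartialEq)]
struct ImportedArray {
    data_type: DataType,
    len: usize,
}

/// Strict import: only accepts streams whose declared type is a struct,
/// which is the record-batch assumption described above.
fn import_as_record_batch(stream_type: &DataType, len: usize) -> Result<ImportedArray, String> {
    match stream_type {
        DataType::Struct(_) => Ok(ImportedArray { data_type: stream_type.clone(), len }),
        other => Err(format!("expected a Struct stream, got {:?}", other)),
    }
}

/// Generic import: uses whatever data type the stream's schema declares,
/// so struct and non-struct streams are both accepted.
fn import_as_array(stream_type: &DataType, len: usize) -> Result<ImportedArray, String> {
    Ok(ImportedArray { data_type: stream_type.clone(), len })
}
```

With this model, an `Int32` stream fails `import_as_record_batch` but succeeds with `import_as_array`, while a struct-typed stream works with both.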

Describe the solution you'd like

Some way to transfer a stream of Array via FFI.

Describe alternatives you've considered

There's currently no way to exchange a stream of generic arrays with arrow-rs, as far as I can tell.

Additional context

For full disclosure, I've already implemented this in my own library, pyo3-arrow. I have an ArrayReader trait to parallel arrow::RecordBatchReader, and I vendored a derived copy of ffi_stream.rs to make this interop possible (without necessarily materializing the entire stream as a ChunkedArray).
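As a rough illustration of the shape such a trait could take, here is a simplified, self-contained sketch. It is not the actual pyo3-arrow definition and uses no arrow-rs types: `DataType`, `Array`, `VecArrayReader`, `total_len`, and the `data_type()` method are all hypothetical names chosen for this example. The idea is an iterator of array chunks that also exposes the stream's shared data type, so a consumer can process chunks one at a time.

```rust
// Hypothetical sketch: simplified stand-ins, not pyo3-arrow or arrow-rs definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

/// Stand-in for a single array chunk; only its type and length are modeled.
#[derive(Debug, Clone, PartialEq)]
struct Array {
    data_type: DataType,
    len: usize,
}

/// An iterator of array chunks that also reports the stream's data type.
trait ArrayReader: Iterator<Item = Result<Array, String>> {
    fn data_type(&self) -> &DataType;
}

/// A reader backed by an in-memory list of chunks.
struct VecArrayReader {
    data_type: DataType,
    chunks: std::vec::IntoIter<Array>,
}

impl VecArrayReader {
    fn new(data_type: DataType, chunks: Vec<Array>) -> Self {
        Self { data_type, chunks: chunks.into_iter() }
    }
}

impl Iterator for VecArrayReader {
    type Item = Result<Array, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let chunk = self.chunks.next()?;
        // Every chunk must match the data type declared for the stream.
        if chunk.data_type != self.data_type {
            return Some(Err(format!(
                "chunk type {:?} does not match stream type {:?}",
                chunk.data_type, self.data_type
            )));
        }
        Some(Ok(chunk))
    }
}

impl ArrayReader for VecArrayReader {
    fn data_type(&self) -> &DataType {
        &self.data_type
    }
}

/// Consumes the stream chunk by chunk without collecting the chunks first.
fn total_len(reader: impl ArrayReader) -> Result<usize, String> {
    let mut total = 0;
    for chunk in reader {
        total += chunk?.len;
    }
    Ok(total)
}
```

A consumer such as `total_len` never holds more than one chunk at a time, which is the point of streaming rather than building a ChunkedArray up front.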

I'm currently fine with my vendored copy of FFI, but others may have the same issue.

Previous discussion in #5295 (comment)

kylebarron added the enhancement label on Oct 18, 2024
kylebarron changed the title from "Support reading Arrow C Stream interface that does not yield RecordBatch" to "Support Arrow C Stream interface containing stream of Array" on Oct 18, 2024