Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ffi_enforce_no_offset feature as a workaround hack to unavailable offset support in Arrow Java #5958

Closed
wants to merge 1 commit into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 25, 2024

Which issue does this PR close?

Closes #5959.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 25, 2024
@viirya
Copy link
Member Author

viirya commented Jun 25, 2024

cc @andygrove @alamb

viirya added a commit to viirya/arrow-datafusion-comet that referenced this pull request Jun 25, 2024
@tustvold
Copy link
Contributor

I'm not sure about this, we'd be committing to supporting a workaround for arrow Java for at least 3 months (next breaking release). It feels like we should just fix this in arrow Java, no?

@viirya
Copy link
Member Author

viirya commented Jun 25, 2024

I'm not sure about this, we'd be committing to supporting a workaround for arrow Java for at least 3 months (next breaking release). It feels like we should just fix this in arrow Java, no?

The change for Arrow Java is considered to be big one (it doesn't support offset at all in all arrays and C Data interface). We don't have the bandwidth to fix it, but can simply rely on Arrow Java community to fix it.

Instead of keeping using a forked branch of arrow-rs and DataFusion just for the issue, after discussed with @andygrove, we'd like to have this workaround in arrow-rs hopefully, and guard it with a feature with clear comment.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this, we'd be committing to supporting a workaround for arrow Java for at least 3 months (next breaking release). It feels like we should just fix this in arrow Java, no?

I believe the rationale for putting this into arrow-rs is that this issue is blocking the first release of DataFusion comet.

The only alternative I can think of is for comet to rely on a patched version of arrow-rs.

I agree that this is a hack and that the real fix is in Arrow java -- this is tracked by apache/arrow#42156 and it seems there is agreement upstream that it should be fixed in Arrow. However, even once it is fixed in Arrow java it will take some non trivial time to make it into Spark.

I think this is one of those times when the workaround is non ideal, but I think it is worth it for compatibility and help the community along.

Maybe we could file a ticket to track removing the flag once the arrow java fix has propagates sufficiently in the ecosystem

# of data buffer, we assume the first element of the offset buffer is always 0
# (this is how Arrow Java and arrow-rs do). This is a hacky solution and
# should be removed once Arrow Java supports `offset` field in the FFI output.
ffi_enforce_no_offset = ["arrow-data/ffi_enforce_no_offset"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also potentially call this flag something like workaround_java_bug or something that makes it clearer when it would be used

However, given this documentation I think this name is ok too

@@ -450,7 +459,16 @@ impl<'a> ImportedArrowArray<'a> {
let start = (unsafe { *offset_buffer.add(0) }) as usize;
// get last offset
let end = (unsafe { *offset_buffer.add(len / size_of::<i64>() - 1) }) as usize;
end - start

if cfg!(feature = "ffi_enforce_no_offset") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if cfg!(feature = "ffi_enforce_no_offset") {
// Workaround bug in Arrow Java
// https://github.com/apache/arrow/issues/42156
if cfg!(feature = "ffi_enforce_no_offset") {

@@ -152,7 +152,9 @@ impl FFI_ArrowArray {
_ => None,
};

let offset = if let Some(offset) = offset_offset {
let offset = if cfg!(feature = "ffi_enforce_no_offset") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let offset = if cfg!(feature = "ffi_enforce_no_offset") {
// Workaround bug in Arrow Java
// https://github.com/apache/arrow/issues/42156
let offset = if cfg!(feature = "ffi_enforce_no_offset") {

@tustvold
Copy link
Contributor

tustvold commented Jun 25, 2024

I'm a little confused by this, since we moved away from ArrayData we no longer store offsets in arrays other than BooleanArray
https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/list_array.rs#L489

https://github.com/apache/arrow-rs/blob/master/arrow-array%2Fsrc%2Farray%2Fbyte_array.rs#L456

The only way to end up with an offset is manually slicing ArrayData directly. If you convert back to an array using make_array it'll "materialize" the offset into the underlying buffer(s).

I wonder if comet could simply consistently use Arrays to avoid this class of issue, or call make_array at FFI boundaries?

TBC I'm not against hacky workarounds on principle but I wonder if there are less hacky ways to unblock comet

@viirya
Copy link
Member Author

viirya commented Jun 26, 2024

I wonder if comet could simply consistently use Arrays to avoid this class of issue, or call make_array at FFI boundaries?

Hm? Comet never uses ArrayData. Come only uses Arrays. The slicing is happened on arrays (e.g., string array) and of course it is materialized the offset to underlying buffer (offset buffer for string array).

Such sliced offset buffer, when exposed through C Data interface, will make the consumer to calculate a incorrect length of data buffer (fixed in #5895).

After #5895, the offset of offset buffer is represented in offset field in C Data interface. It makes the sliced array could be imported again at least by this crate.

However, it is not enough when working with (current) Arrow Java.

Arrow Java doesn't support this offset field. Moreover, it doesn't support offset in all its Arrays. So even we correctly fill offset field, Arrow Java doesn't read it and materializes on its buffer or arrays. And, when Arrow Java re-expose the array through C Data interface to arrow-rs, unexpected results happen.

This workaround is proposed for that. Before a complete support of offset in Arrow Java, this helps Comet pass sliced arrays through arrow-rs -> Arrow Java -> arrow-rs...

@tustvold
Copy link
Contributor

tustvold commented Jun 26, 2024

I've filed #5964 which I think fixes the actual issue at play here., namely a regression caused by #5741.

I suspect there is still an issue in arrow-java, in that the way it calculates the length of the data buffer is simply wrong, and was incorrectly changed to match in arrow-rs in #5741. However, I think we should fix the regression caused by #5741 and then open further discussions about any remaining compatibility issues with arrow-java and how we resolve them

@viirya
Copy link
Member Author

viirya commented Jun 26, 2024

Closing this as I made incorrect change to arrow-rs to match Arrow Java. #5964 should fix the issue.

@viirya viirya closed this Jun 26, 2024
andygrove pushed a commit to apache/datafusion-comet that referenced this pull request Jun 29, 2024
* fix: Fix range out of index error by using custom arrow-rs repo

* Add custom Java Arrow classes

* Add a hack

* Update

* Update

* Update to use apache/arrow-rs#5958

* Use tustvold's branch

* Use official arrow-rs repo
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…#584)

* fix: Fix range out of index error by using custom arrow-rs repo

* Add custom Java Arrow classes

* Add a hack

* Update

* Update

* Update to use apache/arrow-rs#5958

* Use tustvold's branch

* Use official arrow-rs repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A new feature as a workaround hack to unavailable offset support in Arrow Java
3 participants