Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix range out of index error with a temporary workaround #584

Merged
merged 8 commits into from
Jun 29, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 18, 2024

Which issue does this PR close?

Closes #540.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya viirya marked this pull request as draft June 18, 2024 05:45
@viirya
Copy link
Member Author

viirya commented Jun 18, 2024

I will add a custom Java importer implementation to use the fixed offset field from arrow-rs.

@viirya
Copy link
Member Author

viirya commented Jun 19, 2024

Looks like I need to copy more Arrow Java classes to make it work...

@viirya
Copy link
Member Author

viirya commented Jun 20, 2024

Looks like I need to copy more Arrow Java classes to make it work...

Instead of a real fix which requires adding offset support to Arrow Java, I added a hacky workaround in my arrow-rs repo, and added custom Java importer implementation.

I re-ran the query and verified this workaround fixes the range out of index error.

@viirya viirya marked this pull request as ready for review June 20, 2024 07:20
Comment on lines +240 to +245
// HACK: For the issue https://github.com/apache/datafusion-comet/issues/540
// As Arrow Java doesn't support `offset` in C Data interface, we cannot correctly import
// a slice of string from arrow-rs to Java Arrow and then export it to arrow-rs again.
// So we add this hack to always take full length of data buffer by assuming the first offset
// is always 0 which is true for Arrow Java and arrow-rs.
final int len = end;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the workaround added in Arrow Java side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar workaround is also added in the custom arrow-rs branch.

Copy link
Member Author

@viirya viirya Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workaround assumes that the first element in offset buffer is always 0. This is true for arrow-rs and Arrow Java although Arrow spec just recommends it but not enforces it. As we only use arrow-rs and Arrow Java, this workaround should work for us without correctness issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the changes in the custom arrow-rs branch different to the fix that went into arrow-rs master branch recently?

Copy link
Member Author

@viirya viirya Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix into arrow-rs repo is real fix that correctly reflecting offset of offset buffer into offset field in C Data interface. But only the fix is not enough, we still need the offset support in Arrow Java.

So the workaround here is different but just hacky solution. Instead filing correct offset in C Data interface, we calculate correct data buffer length in a hacky way by assuming the first offset value in offset buffer is always 0 (this is held in both arrow-rs and Arrow Java). So we get correct data buffer length even offset buffer is a slice (i.e., offset-ed in arrow-rs).

@viirya viirya changed the title fix: Fix range out of index error by using custom arrow-rs repo fix: Fix range out of index error with a temporary workaround Jun 20, 2024
@viirya
Copy link
Member Author

viirya commented Jun 20, 2024

cc @andygrove This uses a temporary workaround to fix the issue. I'm not sure how long it will take to have the Arrow Java fix, so I think a workaround but not a real fix might be good for us to move forward.

@andygrove
Copy link
Member

Looks like I need to copy more Arrow Java classes to make it work...

Instead of a real fix which requires adding offset support to Arrow Java, I added a hacky workaround in my arrow-rs repo, and added custom Java importer implementation.

I re-ran the query and verified this workaround fixes the range out of index error.

I also ran the benchmarks and can confirm that this fixes the issue.

@viirya
Copy link
Member Author

viirya commented Jun 25, 2024

I proposed the workaround fix to arrow-rs at apache/arrow-rs#5958. If it could be merged in the upstream, we can just use official arrow-rs and DataFusion repos/releases.

@viirya
Copy link
Member Author

viirya commented Jun 25, 2024

@andygrove Could you verify again if the latest change still resolves the issue? I changed it to use the commit proposed for arrow-rs apache/arrow-rs#5958 behind a feature flag.

Thanks.

@codecov-commenter
Copy link

codecov-commenter commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 0% with 192 lines in your changes missing coverage. Please review.

Project coverage is 34.31%. Comparing base (2fc45a2) to head (7c6f7f4).
Report is 19 commits behind head on main.

Files Patch % Lines
...g/apache/arrow/c/CometBufferImportTypeVisitor.java 0.00% 132 Missing ⚠️
...in/java/org/apache/arrow/c/CometArrayImporter.java 0.00% 58 Missing ⚠️
...rc/main/java/org/apache/arrow/c/ArrowImporter.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #584      +/-   ##
============================================
+ Coverage     34.13%   34.31%   +0.17%     
  Complexity      809      809              
============================================
  Files           106      108       +2     
  Lines         38586    38742     +156     
  Branches       8566     8568       +2     
============================================
+ Hits          13172    13294     +122     
- Misses        22674    22708      +34     
  Partials       2740     2740              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@viirya
Copy link
Member Author

viirya commented Jun 26, 2024

I changed the arrow-rs repo to use the change at apache/arrow-rs#5964. See if everything in CI can pass.

@andygrove
Copy link
Member

@viirya I confirmed that this fixes the issue when I run the benchmarks locally.

@viirya
Copy link
Member Author

viirya commented Jun 27, 2024

@viirya I confirmed that this fixes the issue when I run the benchmarks locally.

Thank you, @andygrove. Then we can merge this and wait for new arrow-rs and DataFusion releases.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Hopefully we only need to use the forks for 2-3 weeks.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can depend on tustvold's repo because the branch could get deleted. Could you create a branch in your repo instead?

core/Cargo.toml Outdated
arrow-string = { version = "52.0.0" }
parquet = { version = "52.0.0", default-features = false, features = ["experimental"] }
arrow = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139", features = ["prettyprint", "ffi", "chrono-tz"] }
arrow-array = { git = "https://github.com/tustvold/arrow-rs.git", rev = "78b6139" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the PR is merged, can we switch this to a revision of the official arrow-rs repo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can point to arrow-rs repo.

@viirya
Copy link
Member Author

viirya commented Jun 29, 2024

@andygrove This uses official arrow-rs repo now.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya

@andygrove andygrove merged commit 27288a0 into apache:main Jun 29, 2024
66 checks passed
@viirya
Copy link
Member Author

viirya commented Jun 29, 2024

Thanks @andygrove

@viirya viirya deleted the fix_range_index2 branch June 29, 2024 17:36
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…#584)

* fix: Fix range out of index error by using custom arrow-rs repo

* Add custom Java Arrow classes

* Add a hack

* Update

* Update

* Update to use apache/arrow-rs#5958

* Use tustvold's branch

* Use official arrow-rs repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

range end index 294912 out of range for slice of length 147456
3 participants