Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface #40043

Merged
merged 7 commits into from
Apr 2, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 12, 2024

Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow BaseVariableWidthVector class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow spec for variable size binary layout:

The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is 0).

What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

Are these changes tested?

Added test cases.

Are there any user-facing changes?

No

Copy link

⚠️ GitHub issue #40038 has been automatically assigned in GitHub to PR creator.

Comment on lines 237 to 239
if (conflictPolicy == ConflictPolicy.CONFLICT_REPLACE && vectors.containsKey(childName)) {
vectors.getAll(childName).forEach(c -> c.close());
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This address test failure in TestStructVector. When duplicate field names exist in initializeChildrenFromFields call, the initial offset buffer in previous vectors will cause memory leak. So these duplicate vectors are closed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be done in putVector instead?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 12, 2024
Comment on lines 105 to 107
if (!init) {
offsetAllocationSizeInBytes = curSize;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid changing offsetAllocationSizeInBytes to the initial 4 byte size. Otherwise, reallocOffsetBuffer will take it and cause some tests failed.

@@ -50,6 +50,7 @@ public void testZeroRowResultSet() throws Exception {
assertNotNull("VectorSchemaRoot from first next() result should never be null", root);
assertEquals("VectorSchemaRoot from empty ResultSet should have zero rows", 0, root.getRowCount());
assertFalse("hasNext() should return false on empty ResultSets after initial next() call", iter.hasNext());
root.close();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to release initial offset buffer for empty array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use a try instead of calling close explicitly?

Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run vector tests locally and they are passed. The tests in other modules might still have failure. If any, I will look at them later.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

Ah, okay, I see. I've not found the ticket. Looks like both cases (empty or single zero element) are acceptable. I will close this. Thanks for the info.

@viirya viirya closed this Feb 12, 2024
@lidavidm
Copy link
Member

Are you using arrow2? It's possible it was never updated to fix this, but arrow-rs appears to have been fixed

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

I'm using arrow-rs. Although its array data allows empty offset, the issue happens in ffi module which doesn't allow empty offsets for now. I proposed a fix there.

@lidavidm
Copy link
Member

Hmm. While we allow that for IPC, I think I'm wrong about C Data: we explicitly specify this is not allowable https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.buffers

@pitrou presumably Java should actually be fixed here, but what do other implementations do/we appear to be missing tests for this case?

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

The buffer pointers MAY be null only in two situations:
for the null bitmap buffer, if ArrowArray.null_count is 0;
for any buffer, if the size in bytes of the corresponding buffer would be 0.

Isn't it allowed for second situation?

@lidavidm
Copy link
Member

Ah, fair. It reads as a bit ambiguous to me: would the size of the buffer be 0, or is it that it should not be 0, but we allowed 0 as an exception before? I guess both interpretations lead to the buffer being 0-sized, though, and so it applies. Maybe we should spell out the case...

Regardless, it does seem we're missing an integration test then.

@pitrou
Copy link
Member

pitrou commented Feb 12, 2024

@pitrou presumably Java should actually be fixed here, but what do other implementations do/we appear to be missing tests for this case?

Yes, I think it should be fixed. Zero-length columns are tested in the integration test, but whether they actually miss an offsets buffer depends on how the JSON reader behaves.

@lidavidm
Copy link
Member

Sorry, this is for the C Data integration tests

@lidavidm
Copy link
Member

Ah ok, now I see what you mean

@@ -50,6 +50,7 @@ public void testZeroRowResultSet() throws Exception {
assertNotNull("VectorSchemaRoot from first next() result should never be null", root);
assertEquals("VectorSchemaRoot from empty ResultSet should have zero rows", 0, root.getRowCount());
assertFalse("hasNext() should return false on empty ResultSets after initial next() call", iter.hasNext());
root.close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use a try instead of calling close explicitly?

Comment on lines 237 to 239
if (conflictPolicy == ConflictPolicy.CONFLICT_REPLACE && vectors.containsKey(childName)) {
vectors.getAll(childName).forEach(c -> c.close());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be done in putVector instead?

@pitrou
Copy link
Member

pitrou commented Feb 12, 2024

It seems this fix might be a bit heavy-handed if it adds a heap allocation for every empty var-width vector? AFAIU, Arrow Java uses empty vectors quite liberally.

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

Ah, fair. It reads as a bit ambiguous to me: would the size of the buffer be 0, or is it that it should not be 0, but we allowed 0 as an exception before? I guess both interpretations lead to the buffer being 0-sized, though, and so it applies. Maybe we should spell out the case...

I think it should mean that the buffer size is 0, the buffer pointers may be null, for the second situation.

@pitrou
Copy link
Member

pitrou commented Feb 12, 2024

I think it should mean that the buffer size is 0, the buffer pointers may be null, for the second situation.

Yes, but it's the buffer size, not the array size. For an empty array, the buffer size should be 4 (or 8 for a large offsets type), so the buffer can't be null.

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

It seems this fix might be a bit heavy-handed if it adds a heap allocation for every empty var-width vector? AFAIU, Arrow Java uses empty vectors quite liberally.

Yea, and I believe that @lidavidm meant that these empty offsets for empty var-width vector are valid. That's why I closed this and proposed a fix at ffi module at Rust instead.

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

Yes, but it's the buffer size, not the array size. For an empty array, the buffer size should be 4 (or 8 for a large offsets type), so the buffer can't be null.

That's what I thought at the beginning and the reason I proposed this fix.

But somehow from the context @lidavidm provided:
https://lists.apache.org/thread/w7g1zfqrjxx0bvrct0mt5zwxvdnc9nob

I think he means that it is a valid case (empty offset).

@viirya
Copy link
Member Author

viirya commented Feb 12, 2024

I guess that this might be something we really need to fix (at least for the C data export part for var-size arrays). Reopened it.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 3, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 3, 2024
@viirya
Copy link
Member Author

viirya commented Mar 26, 2024

cc @sunchao for another eyes on this. Thanks.

@lidavidm lidavidm merged commit 5ddef63 into apache:main Apr 2, 2024
18 checks passed
@lidavidm lidavidm removed the awaiting changes Awaiting changes label Apr 2, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Apr 2, 2024
@andygrove
Copy link
Member

Thanks for the review @lidavidm

@viirya
Copy link
Member Author

viirya commented Apr 2, 2024

Thank you @lidavidm @pitrou @vibhatha @andygrove for the review

Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5ddef63.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ze layout through C Data Interface (apache#40043)

### Rationale for this change

We encountered an error when exchanging string array from Java to Rust through Arrow C data interface. At Rust side, it complains that the buffer at position 1 (offset buffer) is null. After tracing down and some debugging, it looks like the issue is Java Arrow `BaseVariableWidthVector` class assigns an empty offset buffer if the array is empty (value count 0).

According to Arrow [spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) for variable size binary layout:

> The offsets buffer contains length + 1 signed integers ...

So for an empty string array, its offset buffer should be a buffer with one element (generally it is `0`).

### What changes are included in this PR?

This patch replaces current empty offset buffer in variable-size layout vector classes when exporting arrays through C Data Interface.

### Are these changes tested?

Added test cases.

### Are there any user-facing changes?

No

* Closes: apache#40038

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants