-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface #40043
GH-40038: [Java] Export non empty offset buffer for variable-size layout through C Data Interface #40043
Changes from 3 commits
5eb34e1
c9c0522
5d0475f
47ddac2
0d004c4
aae9bcd
7f4343c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -355,6 +355,37 @@ public List<ArrowBuf> getFieldBuffers() { | |
return result; | ||
} | ||
|
||
/** | ||
* Get the buffers for exporting this vector through C Data Interface. | ||
* @return the buffers ready for exporting. | ||
*/ | ||
@Override | ||
public List<ArrowBuf> getCDataBuffers() { | ||
// before flight/IPC, we must bring the vector to a consistent state. | ||
// this is because, it is possible that the offset buffers of some trailing values | ||
// are not updated. this may cause some data in the data buffer being lost. | ||
// for details, please see TestValueVector#testUnloadVariableWidthVector. | ||
fillHoles(valueCount); | ||
|
||
List<ArrowBuf> result = new ArrayList<>(3); | ||
setReaderAndWriterIndex(); | ||
result.add(validityBuffer); | ||
|
||
if (offsetBuffer.capacity() == 0) { | ||
// Empty offset buffer is allowed for historical reason. | ||
// To export it through C Data interface, we need to allocate a buffer with one offset. | ||
result.add(allocateOffsetBuffer(OFFSET_WIDTH)); | ||
} else { | ||
result.add(offsetBuffer); | ||
} | ||
|
||
result.add(valueBuffer); | ||
|
||
return result; | ||
} | ||
|
||
|
||
|
||
/** | ||
* Set the reader and writer indexes for the inner buffers. | ||
*/ | ||
|
@@ -476,11 +507,12 @@ private void allocateBytes(final long valueBufferSize, final int valueCount) { | |
} | ||
|
||
/* allocate offset buffer */ | ||
private void allocateOffsetBuffer(final long size) { | ||
private ArrowBuf allocateOffsetBuffer(final long size) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the intent of this change to avoid changing the existing offset buffer, and only change it for the exported version? Then I must ask (1) why? If we're allocating the buffer, we may as well use it for the vector as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For 2, this is only for exporting the vectors through C Data Interface. By its means, it is intended to be released by the importer side. At Java Arrow, I remember ArrayExporter only increases the reference count for the buffers to be exported to prevent them to be released automatically. And the release callback of exported structure will reduce the reference count to release it eventually. But it is triggered at imported side. I think this is what the C Data Interface defines. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For 1, they are allocated for using only at imported side. I think as per discussed with @pitrou above, the idea is these empty buffers exist internally and it is okay. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think Antoine's idea was that you could statically allocate one such buffer to use in such situations, if you're talking about this
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And yes, it seems ArrayExporter will increase the reference count by 1. But since it starts with a reference count of 1, that means these buffers will never be freed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vibhatha I've added some more checks (buffer ref count, allocated memory) into the existing roundtrip tests ( As this issue blocks some important feature in apache/datafusion-comet#95, we really like to move this forward. Thanks a lot. 🙏 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will take a look at this as soon as possible. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @vibhatha, sorry for pinging you. Do you have a chance to look at this recently? One important feature is blocked by the issue and we are looking forward to move this into Java Arrow library. Thank you. 🙏 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @viirya I couldn't take a look. But let me see if I can allocate some time to do this probably earlier next week? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vibhatha It will be great! Thank you so much! |
||
final int curSize = (int) size; | ||
offsetBuffer = allocator.buffer(curSize); | ||
ArrowBuf offsetBuffer = allocator.buffer(curSize); | ||
offsetBuffer.readerIndex(0); | ||
initOffsetBuffer(); | ||
return offsetBuffer; | ||
} | ||
|
||
/* allocate validity buffer */ | ||
|
@@ -805,7 +837,7 @@ private void splitAndTransferOffsetBuffer(int startIndex, int length, BaseVariab | |
(1 + length) * ((long) OFFSET_WIDTH)); | ||
target.offsetBuffer = transferBuffer(slicedOffsetBuffer, target.allocator); | ||
} else { | ||
target.allocateOffsetBuffer((long) (length + 1) * OFFSET_WIDTH); | ||
target.offsetBuffer = target.allocateOffsetBuffer((long) (length + 1) * OFFSET_WIDTH); | ||
for (int i = 0; i < length + 1; i++) { | ||
final int relativeSourceOffset = getStartOffset(startIndex + i) - start; | ||
target.offsetBuffer.setInt((long) i * OFFSET_WIDTH, relativeSourceOffset); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -60,6 +60,17 @@ public interface FieldVector extends ValueVector { | |
*/ | ||
List<ArrowBuf> getFieldBuffers(); | ||
|
||
|
||
/** | ||
* Get the buffers for C Data Interface, (same size as getFieldVectors() since it is their content). | ||
* By default, it returns the same as getFieldBuffers(). | ||
* | ||
* @return the buffers containing the data for this vector (ready for exporting through C Data Interface) | ||
*/ | ||
default List<ArrowBuf> getCDataBuffers() { | ||
return getFieldBuffers(); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought a few approaches to achieve the goal to fix this issue. This is less intrusive one I can think of. I've tried to modify |
||
|
||
/** | ||
* Get the inner vectors. | ||
* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should add a test for this? Especially for the arrays with custom implementation. 🤔