-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1619: [Java] Set lastSet in JsonFileReader #1140
ARROW-1619: [Java] Set lastSet in JsonFileReader #1140
Conversation
switch (vector.getMinorType()) { | ||
case LIST: | ||
// ListVector starts lastSet from index 0 | ||
((ListVector) vector).getMutator().setLastSet(count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was really confusing to me at first, I thought something was wrong until I confirmed with this comment
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java#L184
Is there any reason why ListVector.lastSet
is different than the lastSet
of other VariableWidthVectors? Is it possible to change this to be consistent?
// if children | ||
|
||
// Set lastSet before valueCount to prevent setValueCount from filling empty values | ||
switch (vector.getMinorType()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not crazy about this because case matching makes it hard to maintain for new types of Vectors.
I think the proper way to do this is to use loadFieldBuffers
in this class, since loadFieldBuffers should initialize vectors correctly. I think ArrowFileReader uses it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's not ideal in that sense but loadFieldBuffers
uses nodes that come from FB data in ArrowRecordBatch
s. I think it was designed only for that type of data, so I'm not sure how to apply it here for json.
@icexelloss would you be ok with this for now and I'll put the todo comment back in to look into using |
Yes I think that's fair. |
((ListVector) vector).getMutator().setLastSet(count); | ||
break; | ||
case VARBINARY: | ||
((NullableVarBinaryVector) vector).getMutator().setLastSet(count - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining count
vs count -1
difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a comment on the ListVector case, but I can elaborate
|
||
// Set lastSet before valueCount to prevent setValueCount from filling empty values | ||
switch (vector.getMinorType()) { | ||
case LIST: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the case matching here complete? I think so but can you confirm these are the only vectors that have "lastSet"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I double checked and this is it.
@BryanCutler can you add a test for this issue? |
I think the real test for this is that you can call |
Sorry I am a bit confused. The current |
If you remove the calls to |
|
||
// NullableVarBinaryVector lastSet should be the index of last value | ||
NullableVarBinaryVector binaryVector = (NullableVarBinaryVector) listVector.getChildrenFromFields().get(0); | ||
Assert.assertEquals(binaryVector.getMutator().getLastSet(), numVarBinaryValues - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@icexelloss I added tests for ListVector with NullableVarBinaryVector that also checks lastSet
values. Please take another look when you get a chance, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the tests were already there covering this scenario in TestListVector.java. I remember adding tests for setLastSet().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia I think the issue is TestListVector doesn't test the case that a ListVector is created from json reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1
Thanks @BryanCutler ! |
Thanks @BryanCutler and @icexelloss! merging |
When reading a vector in JsonFileReader, lastSet should be set in VariableWidthVectors after reading inner vectors or else subsequent operations could corrupt the offsets. This also allows to simplify some of the related code. Additionally, ListVector.lastSet should be explicitly initialized to 0, which is it's starting offset. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#1140 from BryanCutler/java-JsonReader-setLast-ARROW-1619 and squashes the following commits: 8f97a3d [Bryan Cutler] added test for VarBinaryVector that checks lastSet after reading 70df0cc [Bryan Cutler] set lastSet in JsonFileReader and initialize lastSet for ListVector
When reading a vector in JsonFileReader, lastSet should be set in VariableWidthVectors after reading inner vectors or else subsequent operations could corrupt the offsets. This also allows to simplify some of the related code. Additionally, ListVector.lastSet should be explicitly initialized to 0, which is it's starting offset.