Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1619: [Java] Set lastSet in JsonFileReader #1140

Conversation

BryanCutler
Copy link
Member

When reading a vector in JsonFileReader, lastSet should be set in VariableWidthVectors after reading inner vectors or else subsequent operations could corrupt the offsets. This also allows to simplify some of the related code. Additionally, ListVector.lastSet should be explicitly initialized to 0, which is it's starting offset.

@BryanCutler
Copy link
Member Author

cc @icexelloss @siddharthteotia

switch (vector.getMinorType()) {
case LIST:
// ListVector starts lastSet from index 0
((ListVector) vector).getMutator().setLastSet(count);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really confusing to me at first, I thought something was wrong until I confirmed with this comment
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java#L184

Is there any reason why ListVector.lastSet is different than the lastSet of other VariableWidthVectors? Is it possible to change this to be consistent?

// if children

// Set lastSet before valueCount to prevent setValueCount from filling empty values
switch (vector.getMinorType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not crazy about this because case matching makes it hard to maintain for new types of Vectors.

I think the proper way to do this is to use loadFieldBuffers in this class, since loadFieldBuffers should initialize vectors correctly. I think ArrowFileReader uses it too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not ideal in that sense but loadFieldBuffers uses nodes that come from FB data in ArrowRecordBatchs. I think it was designed only for that type of data, so I'm not sure how to apply it here for json.

@BryanCutler
Copy link
Member Author

@icexelloss would you be ok with this for now and I'll put the todo comment back in to look into using loadFieldBuffers for future reference?

@icexelloss
Copy link
Contributor

Yes I think that's fair.

((ListVector) vector).getMutator().setLastSet(count);
break;
case VARBINARY:
((NullableVarBinaryVector) vector).getMutator().setLastSet(count - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining count vs count -1 difference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a comment on the ListVector case, but I can elaborate


// Set lastSet before valueCount to prevent setValueCount from filling empty values
switch (vector.getMinorType()) {
case LIST:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the case matching here complete? I think so but can you confirm these are the only vectors that have "lastSet"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I double checked and this is it.

@icexelloss
Copy link
Contributor

@BryanCutler can you add a test for this issue?

@BryanCutler
Copy link
Member Author

can you add a test for this issue?

I think the real test for this is that you can call setValueCount on the vector at the end and it won't fill it with empty values for the VarChar/Binary cases or corrupt the offset vector for ListVector. The existing JSON tests cover this pretty well, but looks like there's nothing for VarBinaryVector so I can add that.

@icexelloss
Copy link
Contributor

I think the real test for this is that you can call setValueCount on the vector at the end and it won't fill it
with empty values for the VarChar/Binary cases or corrupt the offset vector for ListVector. The existing JSON tests cover this pretty well, but looks like there's nothing for VarBinaryVector so I can add that.

Sorry I am a bit confused. The current TestJSONFile doesn't cover this bug, right?

@BryanCutler
Copy link
Member Author

If you remove the calls to setLastSet here, a number of the tests fail that use ListVector and VarCharVector. It's because setValueCount for these will modify the values/offsets according to the what lastSet is.


// NullableVarBinaryVector lastSet should be the index of last value
NullableVarBinaryVector binaryVector = (NullableVarBinaryVector) listVector.getChildrenFromFields().get(0);
Assert.assertEquals(binaryVector.getMutator().getLastSet(), numVarBinaryValues - 1);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@icexelloss I added tests for ListVector with NullableVarBinaryVector that also checks lastSet values. Please take another look when you get a chance, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the tests were already there covering this scenario in TestListVector.java. I remember adding tests for setLastSet().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia I think the issue is TestListVector doesn't test the case that a ListVector is created from json reader.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1

@icexelloss
Copy link
Contributor

Thanks @BryanCutler !

@wesm
Copy link
Member

wesm commented Sep 29, 2017

Thanks @BryanCutler and @icexelloss! merging

@asfgit asfgit closed this in f8cf91d Sep 29, 2017
wesm pushed a commit to wesm/arrow that referenced this pull request Oct 3, 2017
When reading a vector in JsonFileReader, lastSet should be set in VariableWidthVectors after reading inner vectors or else subsequent operations could corrupt the offsets.  This also allows to simplify some of the related code.  Additionally, ListVector.lastSet should be explicitly initialized to 0, which is it's starting offset.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1140 from BryanCutler/java-JsonReader-setLast-ARROW-1619 and squashes the following commits:

8f97a3d [Bryan Cutler] added test for VarBinaryVector that checks lastSet after reading
70df0cc [Bryan Cutler] set lastSet in JsonFileReader and initialize lastSet for ListVector
@BryanCutler BryanCutler deleted the java-JsonReader-setLast-ARROW-1619 branch November 7, 2017 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants