-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-6022: [Java] Support equals API in ValueVector to compare two vectors equal #4933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
383a56b to
6033ac8
Compare
java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
Outdated
Show resolved
Hide resolved
|
CC @BryanCutler who I think reviewed the dictionary.equals. |
Codecov Report
@@ Coverage Diff @@
## master #4933 +/- ##
==========================================
+ Coverage 87.57% 89.68% +2.11%
==========================================
Files 1008 670 -338
Lines 143790 99342 -44448
Branches 1418 0 -1418
==========================================
- Hits 125924 89095 -36829
+ Misses 17504 10247 -7257
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
|
@emkornfield @pravindra Thanks a lot for your comments, I updated this PR, and would you and @BryanCutler mind take a look? |
java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java
Outdated
Show resolved
Hide resolved
|
the change lgtm. @BryanCutler, can you please take a look too - especially the checks for the complex type vectors ? |
|
@BryanCutler Mind taking a look when you are available? thanks |
java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
Outdated
Show resolved
Hide resolved
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StructVector extends from this one. it should compare the validity bits too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
Outdated
Show resolved
Hide resolved
|
dumb question - does it make sense to move the equals/hash code out of the value vectors and use visitors like cpp ? I think that'll give more flexibility to do custom stuff (eg. compare approx for floating point, skip compares for names, ..) |
Thanks for your comments. I am not sure, if it makes sense, I would like to do it as a follow-up in a seperate issue. What do you think on this proposal? @emkornfield |
|
my primary concern is that the current implementation is not very efficient since the "vector equals" (i.e vv->equals()) is computed by iterating over each element in the vector (vv->equals(i, other, i). This means -
|
java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
Outdated
Show resolved
Hide resolved
Thanks for pointing, you are right, do you mean we implement something like this?
|
|
@tianchen92 - yes, I think with the approach you pointed out, we could do the type checking just once per vector and make the vector-compare/range-compare efficient. |
I updated this PR, please help take a look, thanks :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can have fast paths for all values being non-null, but they can be addressed in follow up patches..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for your comments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good in general, thanks for doing this @tianchen92 ! I have a couple questions:
-
There seems to be some vectors that implement
equalsand the same code is also inRangeEqualsVisitor. Is this planned to be merged? -
It's a little confusing to have both visitor and equals APIs for the same thing too. Should we keep the visitor API private?
-
It looks like we check names of child vectors for complex types only, and not the actual FieldVector name. Is that consistent with other languages like C++? I think I saw some discussion on this, but want to make sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a unit test to check if ZeroVector == ZeroVector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to the ExtensionTypeVector class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if index or toIndex is out of range? Can you have a unit test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, there will be some problems, if the index > valueCount it won't throw exception which is not right. I added index check in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we have equals that checks a range of values and accpet a RangeEqualsVisitor? That seems redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equals(int index, ValueVector vector, int toIndex) compares single value and RangeEqualsVisitor compares a range of values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed the discussion when equals was added to compare a single value. It doesn't seem like that would be very useful to me, was there a specific use case to support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we did a refactor for dictionary encoding few weeks ago, hashCode/equals API are used for DictionaryHashTable (https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryHashTable.java#L138) to avoid memory copy introduced by previous implementation.
PR is here #4846.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @BryanCutler - this causes duplication of code/tests. Is there a reason to not use the RangeEquals API (with range = 1) instead ?
I'm fine if you want to do this in a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra Ah, I have already made it reuse RangeEquals API(range ==1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant can we remove the equals() API from ValueVector, and it's callers be modified to instead use the RangeEquals() API ? I think keeping the minimal interface in ValueVector is ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, right, could be done in a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a duplicate of what is already in BaseFixedWidthVector.equals, can you consolidate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, removed to CompareUtility for reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better yet, you don't even need to build the left/rightChildren lists. just loop over children names here and make a visitor new RangeEqualsVisitor(rightVector.getChild(child), leftStart, rightStart, length) and call left.getChild(child).accept(visitor)`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you rename these start -> startByteLeft etc instead of 1/2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also with the instances below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't use 1 and 2 in variable names..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is an extra space here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous comment - I think these could all be done in a single overridden compareValueVector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
Thanks for your valuable comments, some clarifies as follows:
I updated this PR and please help take a look again :) @BryanCutler @pravindra Thanks a lot! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tianchen92 . I think for now, just stick to comparing vector values and leave the field name as a followup. Also, I don't think we should add another class CompareUtility since you could make use of the visitor in equals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not crazy about adding another class to compare vectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making CompareUtility to put common code, what about just making a RangeEqualsVisitor with a range of 1 and calling this from equals? or the other way around to call equals for an index from RangeEqualsVisitor? I slightly prefer the first, but I definitely don't think we should add more classes for this..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggestion, looks reasonable, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed the discussion when equals was added to compare a single value. It doesn't seem like that would be very useful to me, was there a specific use case to support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the VectorEqualsVisitor should also compare Fields, specifically the name. But it shouldn't be done here, but in compareFieldVectors(FieldVector left...). Here, I think you should just call the super method. We can discuss comparing the names as a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed.
Thanks for your quick reply, I updated this PR, please help take another look, thanks! |
| int offsetWidth = BaseRepeatedValueVector.OFFSET_WIDTH; | ||
|
|
||
| if (!isNull) { | ||
| final int startByteLeft = left.getOffsetBuffer().getInt(leftIndex * offsetWidth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (pls ignore if you don't agree): names are a bit confusing - startIndexLeft/endIndexLeft ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, would also be resolved together in follow-ups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened new jiras for the follow up items
https://issues.apache.org/jira/browse/ARROW-6210
https://issues.apache.org/jira/browse/ARROW-6211
Thanks a lot! |
|
thanks @tianchen92 for the change, and for meticulously following up on all the review comments ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tianchen92 this LGTM
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210). This is a follow-up from #4933 The callers should be fixed to use the RangeEquals API instead. Closes #5065 from tianchen92/ARROW-6210 and squashes the following commits: cae440a <tianchen> resolve comments ad2277b <tianchen> resolve comments and port tests fb1510e <tianchen> add ValueMatcher 14b3e67 <tianchen> ARROW-6210: remove equals API from ValueVector Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
…Vector interface Related to [ARROW-6211](https://issues.apache.org/jira/browse/ARROW-6211). This is a follow-up from #4933 public interface VectorVisitor<OUT, IN, EX extends Exception> {..} n ValueVector : public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN, EX> visitor, IN value) throws EX; Closes #5091 from tianchen92/ARROW-6211 and squashes the following commits: 15a5231 <tianchen> fix bce258e <tianchen> add equals in RangeEqualsVisitor dbaa81f <tianchen> ARROW-6211: Remove dependency on RangeEqualsVisitor from ValueVector interface Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
…Vector interface Related to [ARROW-6211](https://issues.apache.org/jira/browse/ARROW-6211). This is a follow-up from apache/arrow#4933 public interface VectorVisitor<OUT, IN, EX extends Exception> {..} n ValueVector : public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN, EX> visitor, IN value) throws EX; Closes #5091 from tianchen92/ARROW-6211 and squashes the following commits: 15a523177 <tianchen> fix bce258efc <tianchen> add equals in RangeEqualsVisitor dbaa81fc8 <tianchen> ARROW-6211: Remove dependency on RangeEqualsVisitor from ValueVector interface Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
…ectors equal Related to [ARROW-6022](https://issues.apache.org/jira/browse/ARROW-6022). In some case, this feature is useful. In ARROW-1184, Dictionary#equals not work due to the lack of this API. Moreover, we already implemented equals(int index, ValueVector target, int targetIndex), so this new added API could reuse it. Closes apache#4933 from tianchen92/ARROW-6022 and squashes the following commits: 7e20f79 <tianchen> remove CompareUtility a5d22fd <tianchen> fix variable names 226a20f <tianchen> refactor 694d9f6 <tianchen> make equals to visitor mode 0dfa943 <tianchen> compare struct child names and add UT c7081c2 <tianchen> check list validity bit b942794 <tianchen> use ArrowType for equal 1d95c9c <tianchen> fix Decimal equals 3c9f066 <tianchen> fix 0026882 <tianchen> use MinorType and check isSet e58c158 <tianchen> refactor Dictionary#equals 10dca2c <tianchen> fix 6bc3f68 <tianchen> ARROW-6022: Support equals API in ValueVector to compare two vectors equal Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210). This is a follow-up from apache#4933 The callers should be fixed to use the RangeEquals API instead. Closes apache#5065 from tianchen92/ARROW-6210 and squashes the following commits: cae440a <tianchen> resolve comments ad2277b <tianchen> resolve comments and port tests fb1510e <tianchen> add ValueMatcher 14b3e67 <tianchen> ARROW-6210: remove equals API from ValueVector Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Related to ARROW-6022.
In some case, this feature is useful.
In ARROW-1184, Dictionary#equals not work due to the lack of this API.
Moreover, we already implemented equals(int index, ValueVector target, int targetIndex), so this new added API could reuse it.