Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-6022.

In some case, this feature is useful.
In ARROW-1184, Dictionary#equals not work due to the lack of this API.
Moreover, we already implemented equals(int index, ValueVector target, int targetIndex), so this new added API could reuse it.

@emkornfield
Copy link
Contributor

CC @BryanCutler who I think reviewed the dictionary.equals.

@codecov-io
Copy link

codecov-io commented Jul 29, 2019

Codecov Report

Merging #4933 into master will increase coverage by 2.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4933      +/-   ##
==========================================
+ Coverage   87.57%   89.68%   +2.11%     
==========================================
  Files        1008      670     -338     
  Lines      143790    99342   -44448     
  Branches     1418        0    -1418     
==========================================
- Hits       125924    89095   -36829     
+ Misses      17504    10247    -7257     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/csv/column-builder.cc 95.29% <0%> (-1.77%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
... and 333 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 229fe33...a4e9c48. Read the comment docs.

@tianchen92
Copy link
Contributor Author

@emkornfield @pravindra Thanks a lot for your comments, I updated this PR, and would you and @BryanCutler mind take a look?

@pravindra
Copy link
Contributor

the change lgtm. @BryanCutler, can you please take a look too - especially the checks for the complex type vectors ?

@tianchen92
Copy link
Contributor Author

@BryanCutler Mind taking a look when you are available? thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructVector extends from this one. it should compare the validity bits too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@pravindra
Copy link
Contributor

dumb question - does it make sense to move the equals/hash code out of the value vectors and use visitors like cpp ? I think that'll give more flexibility to do custom stuff (eg. compare approx for floating point, skip compares for names, ..)

@tianchen92
Copy link
Contributor Author

dumb question - does it make sense to move the equals/hash code out of the value vectors and use visitors like cpp ? I think that'll give more flexibility to do custom stuff (eg. compare approx for floating point, skip compares for names, ..)

Thanks for your comments. I am not sure, if it makes sense, I would like to do it as a follow-up in a seperate issue. What do you think on this proposal? @emkornfield

@pravindra
Copy link
Contributor

my primary concern is that the current implementation is not very efficient since the "vector equals" (i.e vv->equals()) is computed by iterating over each element in the vector (vv->equals(i, other, i). This means -

  • the type/child checks are repeated for each element (for a 4K batch size, the check is done 4K times instead of once)
  • we are not able to do optimisations at the vector level by first comparing the bitmaps (i.e if the bitmaps don't match, the vectors can't be equal).

@tianchen92
Copy link
Contributor Author

my primary concern is that the current implementation is not very efficient since the "vector equals" (i.e vv->equals()) is computed by iterating over each element in the vector (vv->equals(i, other, i). This means -

  • the type/child checks are repeated for each element (for a 4K batch size, the check is done 4K times instead of once)
  • we are not able to do optimisations at the vector level by first comparing the bitmaps (i.e if the bitmaps don't match, the vectors can't be equal).

Thanks for pointing, you are right, do you mean we implement something like this?

public class VectorEqualsVisitor {
protected final ValueVector right;
public VectorEqualsVisitor(ValueVector right) {
this.right = right;
}
private boolean compareBaseFixedWidthVectors(BaseFixedWidthVector left) {}
private boolean compareBaseVariableWidthVectors(BaseVariableWidthVector left) {}
public boolean visit(BaseFixedWidthVector left) {
return compareBaseFixedWidthVectors(left);
}
public boolean visit(BaseVariableWidthVector left) {
return compareBaseVariableWidthVectors(left);
}
......
}

In ValueVector,
equals(VectorEqualsVisitor visitor) {
return visitor.visit(this);
}

@pravindra
Copy link
Contributor

@tianchen92 - yes, I think with the approach you pointed out, we could do the type checking just once per vector and make the vector-compare/range-compare efficient.

@tianchen92
Copy link
Contributor Author

@tianchen92 - yes, I think with the approach you pointed out, we could do the type checking just once per vector and make the vector-compare/range-compare efficient.

I updated this PR, please help take a look, thanks :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can have fast paths for all values being non-null, but they can be addressed in follow up patches..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks for your comments :)

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good in general, thanks for doing this @tianchen92 ! I have a couple questions:

  1. There seems to be some vectors that implement equals and the same code is also in RangeEqualsVisitor. Is this planned to be merged?

  2. It's a little confusing to have both visitor and equals APIs for the same thing too. Should we keep the visitor API private?

  3. It looks like we check names of child vectors for complex types only, and not the actual FieldVector name. Is that consistent with other languages like C++? I think I saw some discussion on this, but want to make sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a unit test to check if ZeroVector == ZeroVector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the ExtensionTypeVector class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if index or toIndex is out of range? Can you have a unit test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, there will be some problems, if the index > valueCount it won't throw exception which is not right. I added index check in this method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we have equals that checks a range of values and accpet a RangeEqualsVisitor? That seems redundant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equals(int index, ValueVector vector, int toIndex) compares single value and RangeEqualsVisitor compares a range of values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have missed the discussion when equals was added to compare a single value. It doesn't seem like that would be very useful to me, was there a specific use case to support this?

Copy link
Contributor Author

@tianchen92 tianchen92 Aug 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we did a refactor for dictionary encoding few weeks ago, hashCode/equals API are used for DictionaryHashTable (https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryHashTable.java#L138) to avoid memory copy introduced by previous implementation.
PR is here #4846.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @BryanCutler - this causes duplication of code/tests. Is there a reason to not use the RangeEquals API (with range = 1) instead ?

I'm fine if you want to do this in a different PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravindra Ah, I have already made it reuse RangeEquals API(range ==1).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant can we remove the equals() API from ValueVector, and it's callers be modified to instead use the RangeEquals() API ? I think keeping the minimal interface in ValueVector is ideal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, right, could be done in a different PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate of what is already in BaseFixedWidthVector.equals, can you consolidate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, removed to CompareUtility for reuse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better yet, you don't even need to build the left/rightChildren lists. just loop over children names here and make a visitor new RangeEqualsVisitor(rightVector.getChild(child), leftStart, rightStart, length) and call left.getChild(child).accept(visitor)`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you rename these start -> startByteLeft etc instead of 1/2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also with the instances below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't use 1 and 2 in variable names..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an extra space here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment - I think these could all be done in a single overridden compareValueVector

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@tianchen92
Copy link
Contributor Author

Looks pretty good in general, thanks for doing this @tianchen92 ! I have a couple questions:

  1. There seems to be some vectors that implement equals and the same code is also in RangeEqualsVisitor. Is this planned to be merged?
  2. It's a little confusing to have both visitor and equals APIs for the same thing too. Should we keep the visitor API private?
  3. It looks like we check names of child vectors for complex types only, and not the actual FieldVector name. Is that consistent with other languages like C++? I think I saw some discussion on this, but want to make sure

Thanks for your valuable comments, some clarifies as follows:

  1. about equals API, we already has equals(index, vector, toIndex) to compare single values (now mainly used for dictionary encoding). And in this PR I first though implement equals(ValueVector vector) to compare two vectors, and according to @pravindra suggestion,use RangeEqualsVisitor to replace equals(ValueVector vector). So no duplicated API here.
  2. You are right, some code in RangeEqualsVisitor and in equals(index, vector, toIndex) are similar, I have extract a new class CompareUtility to make these logic reusable.
  3. FieldVector names should be considered also, in VectorEqualsVisitor it compares Field which contains names and in RangeEqualsVisitor it compares FieldType without names. Do you think this is ok?

I updated this PR and please help take a look again :) @BryanCutler @pravindra Thanks a lot!

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tianchen92 . I think for now, just stick to comparing vector values and leave the field name as a followup. Also, I don't think we should add another class CompareUtility since you could make use of the visitor in equals

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about adding another class to compare vectors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making CompareUtility to put common code, what about just making a RangeEqualsVisitor with a range of 1 and calling this from equals? or the other way around to call equals for an index from RangeEqualsVisitor? I slightly prefer the first, but I definitely don't think we should add more classes for this..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion, looks reasonable, fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have missed the discussion when equals was added to compare a single value. It doesn't seem like that would be very useful to me, was there a specific use case to support this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the VectorEqualsVisitor should also compare Fields, specifically the name. But it shouldn't be done here, but in compareFieldVectors(FieldVector left...). Here, I think you should just call the super method. We can discuss comparing the names as a followup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fixed.

@tianchen92
Copy link
Contributor Author

Thanks @tianchen92 . I think for now, just stick to comparing vector values and leave the field name as a followup. Also, I don't think we should add another class CompareUtility since you could make use of the visitor in equals

Thanks for your quick reply, I updated this PR, please help take another look, thanks!

int offsetWidth = BaseRepeatedValueVector.OFFSET_WIDTH;

if (!isNull) {
final int startByteLeft = left.getOffsetBuffer().getInt(leftIndex * offsetWidth);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (pls ignore if you don't agree): names are a bit confusing - startIndexLeft/endIndexLeft ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, would also be resolved together in follow-ups.

Copy link
Contributor

@pravindra pravindra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tianchen92
Copy link
Contributor Author

I've opened new jiras for the follow up items

https://issues.apache.org/jira/browse/ARROW-6210
https://issues.apache.org/jira/browse/ARROW-6211

Thanks a lot!

@pravindra pravindra closed this in 6c1fccd Aug 12, 2019
@pravindra
Copy link
Contributor

thanks @tianchen92 for the change, and for meticulously following up on all the review comments !

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tianchen92 this LGTM

BryanCutler pushed a commit that referenced this pull request Aug 14, 2019
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210).
This is a follow-up from #4933
The callers should be fixed to use the RangeEquals API instead.

Closes #5065 from tianchen92/ARROW-6210 and squashes the following commits:

cae440a <tianchen> resolve comments
ad2277b <tianchen> resolve comments and port tests
fb1510e <tianchen> add ValueMatcher
14b3e67 <tianchen> ARROW-6210:  remove equals API from ValueVector

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
pravindra pushed a commit that referenced this pull request Aug 19, 2019
…Vector interface

Related to [ARROW-6211](https://issues.apache.org/jira/browse/ARROW-6211).

This is a follow-up from #4933

public interface VectorVisitor<OUT, IN, EX extends Exception> {..}
n ValueVector :

public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN, EX> visitor, IN value) throws EX;

Closes #5091 from tianchen92/ARROW-6211 and squashes the following commits:

15a5231 <tianchen> fix
bce258e <tianchen> add equals in RangeEqualsVisitor
dbaa81f <tianchen> ARROW-6211:  Remove dependency on RangeEqualsVisitor from ValueVector interface

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
kou pushed a commit to apache/arrow-java that referenced this pull request Nov 25, 2024
…Vector interface

Related to [ARROW-6211](https://issues.apache.org/jira/browse/ARROW-6211).

This is a follow-up from apache/arrow#4933

public interface VectorVisitor<OUT, IN, EX extends Exception> {..}
n ValueVector :

public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN, EX> visitor, IN value) throws EX;

Closes #5091 from tianchen92/ARROW-6211 and squashes the following commits:

15a523177 <tianchen> fix
bce258efc <tianchen> add equals in RangeEqualsVisitor
dbaa81fc8 <tianchen> ARROW-6211:  Remove dependency on RangeEqualsVisitor from ValueVector interface

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ectors equal

Related to [ARROW-6022](https://issues.apache.org/jira/browse/ARROW-6022).

In some case, this feature is useful.
In ARROW-1184, Dictionary#equals not work due to the lack of this API.
Moreover, we already implemented equals(int index, ValueVector target, int targetIndex), so this new added API could reuse it.

Closes apache#4933 from tianchen92/ARROW-6022 and squashes the following commits:

7e20f79 <tianchen> remove CompareUtility
a5d22fd <tianchen> fix variable names
226a20f <tianchen> refactor
694d9f6 <tianchen> make equals to visitor mode
0dfa943 <tianchen> compare struct child names and add UT
c7081c2 <tianchen> check list validity bit
b942794 <tianchen> use ArrowType for equal
1d95c9c <tianchen> fix Decimal equals
3c9f066 <tianchen> fix
0026882 <tianchen> use MinorType and check isSet
e58c158 <tianchen> refactor Dictionary#equals
10dca2c <tianchen> fix
6bc3f68 <tianchen> ARROW-6022:  Support equals API in ValueVector to compare two vectors equal

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210).
This is a follow-up from apache#4933
The callers should be fixed to use the RangeEquals API instead.

Closes apache#5065 from tianchen92/ARROW-6210 and squashes the following commits:

cae440a <tianchen> resolve comments
ad2277b <tianchen> resolve comments and port tests
fb1510e <tianchen> add ValueMatcher
14b3e67 <tianchen> ARROW-6210:  remove equals API from ValueVector

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants