-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-6738: [Java] Fix problems with current union comparison logic #5544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5544 +/- ##
==========================================
+ Coverage 88.96% 89.97% +1%
==========================================
Files 991 739 -252
Lines 135132 115054 -20078
Branches 1501 0 -1501
==========================================
- Hits 120225 103520 -16705
+ Misses 14542 11534 -3008
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
java/vector/src/main/java/org/apache/arrow/vector/compare/RangeEqualsVisitor.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/compare/TestRangeEqualsVisitor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it is correct to flip since the scale is different between the two vectors. @pravindra ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general question.
A numeric can be represented as different decimal values, with different scales and precisions.
The question is: is the same numeric value equal when represented in different format?
For BigDecimal in standard java, there are two ways to compare:
- The equals method compares both value and scale.
- The compareTo method comapres the underlying value.
From a performance view point, comparing by the underlying value is more performant for Arrow's memory layout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression is the comparison is actually .1 == 1 and .2 == 2 (i.e not the same as java BigDecimal, I might be misreading it but if you look at BigDecimal.getObject, I think this agrees with my assessment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield You are right.
I am wrong because Arrow is different from BigDecimal. Arrow stores unscaled values.
Sorry for my mistake.
I have updated the code to revert this flip. Please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield After more thoughts, I think UnionVector is a special case: for union vectors, it is possible that two vectors have different fields, but have equal values in a certain range.
So the changes for the RangeEqualsVisitor are like this: for union vectors, we only compare minor types, for other vectors, we compare fields.
The problem for DecimalVector no longer exisits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this only return on inequality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch. Thank you so much.
|
this needs a rebase and I'm not sure all the changes are correct. |
f3c07d1 to
3ebbf4c
Compare
Rebase done. Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by this argument, wouldn't we need to always return true here and validate types in equalities.
i.e. if left is a union and right is an intvector it union[1] happens to be an int vector the comparison should happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the argument only applies when both vectors are union vectors.
An IntVector and a UnionVector can not have equal range, even the values in the range are actually equal.
In C++, it also requires that both vectors are union vectors:
bool CompareUnions(const UnionArray& left) const {
const auto& right = checked_cast<const UnionArray&>(right_);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its been a while since I looked at the C++ code, but are you sure there isn't a check for union equality first in C++? It doesn't seem unreasonable to require the types be identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a check for union type euqlity (and other types of arrays) in C++, and it requres the type to be identical.
To see this, note in the above code snippet that the left vector must be a UnionArray (passed in as the parameter), and the right vector must also be a UnionArray, and this is ensured by
checked_cast<const UnionArray&>(right_);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe I was not answering your question, and there is something confusing here. So let me try to make it clearer:
There are two types of type equality:
shallow equality - which only requires the minor types to be the same
deep equality - which requires the minor type, minor types of child types, and order of child types be equal.
For Java, we check deep equality for other types of vectors, and for union vector, we only check shallow equality.
This is because, for union vector, it is possible that two union vectors may have different child vectors, but their values are equal in a certain range. This is not possible for other types of vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments. I think some points are a little clearer for me.
I think there are two (somewhat) independent problems here:
- Should child vector types being equal be a necessary condition for vector ranges being equal.
- Should we consider the above 3 (or other alternative) semantics proposed by you.
For me, 1 is unacceptable, as it causes practical problems (e.g. encoding a UnionVector). For 2, I do not prefer it, but it is acceptable for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure they are independent. But at this point, if you still feel strongly about 1, please start a discussion on the mailing list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks for your suggestion. I will start a discussion in the ML.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like there was much discussion on this in the mailing list, would you be willing to separate the fixes here, so we can merge the bug fix and maybe try bumping the thread on the mailing list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. I have revised the PR to undo the union type comparison logic.
It seems like a seprate issue, and may involve multiple languages.
Maybe I will do more investigations and start a new discussion to describe the problem in a clearer way.
This discussion have been for a long time. Thanks a lot for your patience.
1a6756d to
c008289
Compare
|
Rebased to resolve conflicts. |
| uInt4Holder.isSet = 1; | ||
|
|
||
| final NullableIntHolder intHolder = new NullableIntHolder(); | ||
| uInt4Holder.value = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this intended to set intHolder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Sorry about my mistake. Resolved now.
|
|
||
| for (String name : leftChildNames) { | ||
| RangeEqualsVisitor visitor = createInnerVisitor(leftVector.getChild(name), rightVector.getChild(name)); | ||
| RangeEqualsVisitor visitor = createInnerVisitor(leftVector.getChild(name), rightVector.getChild(name), null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a comment for what null represents (/param_name=/) here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments added. Thanks for the good suggestion.
| vector1.setValueCount(3); | ||
|
|
||
| vector2.setType(0, Types.MinorType.UINT4); | ||
| vector2.setSafe(0, uInt4Holder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if I'm missing it but these vectors do look identifical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I have revised the test case so the two vectors are different.
There are some problems with the current union comparison logic. For example: 1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different. 2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range. Closes #5544 from liyafan82/fly_0930_share and squashes the following commits: d6ef3d2 <liyafan82> Refine test case c008289 <liyafan82> Resolve test failure after rebasing c515393 <liyafan82> Rule out the change for union type comparison bab7402 <liyafan82> Compare fields for all vectors except union vectors 5b2225e <liyafan82> Fix the bug with decimal vector 4d8b570 <liyafan82> Fix problems with current union comparison logic Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
There are some problems with the current union comparison logic. For example: 1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different. 2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range. Closes apache#5544 from liyafan82/fly_0930_share and squashes the following commits: d6ef3d2 <liyafan82> Refine test case c008289 <liyafan82> Resolve test failure after rebasing c515393 <liyafan82> Rule out the change for union type comparison bab7402 <liyafan82> Compare fields for all vectors except union vectors 5b2225e <liyafan82> Fix the bug with decimal vector 4d8b570 <liyafan82> Fix problems with current union comparison logic Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
There are some problems with the current union comparison logic. For example: