Skip to content

Conversation

@liyafan82
Copy link
Contributor

There are some problems with the current union comparison logic. For example:

  1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different.
  2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range.

@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Sep 30, 2019

Codecov Report

Merging #5544 into master will increase coverage by 1%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5544      +/-   ##
==========================================
+ Coverage   88.96%   89.97%      +1%     
==========================================
  Files         991      739     -252     
  Lines      135132   115054   -20078     
  Branches     1501        0    -1501     
==========================================
- Hits       120225   103520   -16705     
+ Misses      14542    11534    -3008     
+ Partials      365        0     -365
Impacted Files Coverage Δ
python/pyarrow/plasma.py 54.79% <0%> (-1.37%) ⬇️
cpp/src/arrow/util/thread_pool_test.cc 97.66% <0%> (-0.94%) ⬇️
r/src/recordbatch.cpp
go/arrow/math/uint64_amd64.go
r/R/filesystem.R
r/R/list.R
r/src/symbols.cpp
r/src/array_to_vector.cpp
go/arrow/ipc/file_reader.go
js/src/builder/index.ts
... and 246 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 558263f...f27cec6. Read the comment docs.

@pitrou
Copy link
Member

pitrou commented Oct 16, 2019

@jacques-n @emkornfield

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is correct to flip since the scale is different between the two vectors. @pravindra ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general question.
A numeric can be represented as different decimal values, with different scales and precisions.

The question is: is the same numeric value equal when represented in different format?

For BigDecimal in standard java, there are two ways to compare:

  1. The equals method compares both value and scale.
  2. The compareTo method comapres the underlying value.

From a performance view point, comparing by the underlying value is more performant for Arrow's memory layout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression is the comparison is actually .1 == 1 and .2 == 2 (i.e not the same as java BigDecimal, I might be misreading it but if you look at BigDecimal.getObject, I think this agrees with my assessment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield You are right.
I am wrong because Arrow is different from BigDecimal. Arrow stores unscaled values.
Sorry for my mistake.

I have updated the code to revert this flip. Please take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield After more thoughts, I think UnionVector is a special case: for union vectors, it is possible that two vectors have different fields, but have equal values in a certain range.

So the changes for the RangeEqualsVisitor are like this: for union vectors, we only compare minor types, for other vectors, we compare fields.

The problem for DecimalVector no longer exisits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this only return on inequality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch. Thank you so much.

@emkornfield
Copy link
Contributor

this needs a rebase and I'm not sure all the changes are correct.

@liyafan82
Copy link
Contributor Author

this needs a rebase and I'm not sure all the changes are correct.

Rebase done. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by this argument, wouldn't we need to always return true here and validate types in equalities.
i.e. if left is a union and right is an intvector it union[1] happens to be an int vector the comparison should happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the argument only applies when both vectors are union vectors.
An IntVector and a UnionVector can not have equal range, even the values in the range are actually equal.

In C++, it also requires that both vectors are union vectors:

bool CompareUnions(const UnionArray& left) const {
const auto& right = checked_cast<const UnionArray&>(right_);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its been a while since I looked at the C++ code, but are you sure there isn't a check for union equality first in C++? It doesn't seem unreasonable to require the types be identical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a check for union type euqlity (and other types of arrays) in C++, and it requres the type to be identical.

To see this, note in the above code snippet that the left vector must be a UnionArray (passed in as the parameter), and the right vector must also be a UnionArray, and this is ensured by

checked_cast<const UnionArray&>(right_);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe I was not answering your question, and there is something confusing here. So let me try to make it clearer:

There are two types of type equality:
shallow equality - which only requires the minor types to be the same
deep equality - which requires the minor type, minor types of child types, and order of child types be equal.

For Java, we check deep equality for other types of vectors, and for union vector, we only check shallow equality.

This is because, for union vector, it is possible that two union vectors may have different child vectors, but their values are equal in a certain range. This is not possible for other types of vectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. I think some points are a little clearer for me.

I think there are two (somewhat) independent problems here:

  1. Should child vector types being equal be a necessary condition for vector ranges being equal.
  2. Should we consider the above 3 (or other alternative) semantics proposed by you.

For me, 1 is unacceptable, as it causes practical problems (e.g. encoding a UnionVector). For 2, I do not prefer it, but it is acceptable for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure they are independent. But at this point, if you still feel strongly about 1, please start a discussion on the mailing list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Thanks for your suggestion. I will start a discussion in the ML.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like there was much discussion on this in the mailing list, would you be willing to separate the fixes here, so we can merge the bug fix and maybe try bumping the thread on the mailing list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. I have revised the PR to undo the union type comparison logic.
It seems like a seprate issue, and may involve multiple languages.

Maybe I will do more investigations and start a new discussion to describe the problem in a clearer way.

This discussion have been for a long time. Thanks a lot for your patience.

@liyafan82
Copy link
Contributor Author

Rebased to resolve conflicts.

uInt4Holder.isSet = 1;

final NullableIntHolder intHolder = new NullableIntHolder();
uInt4Holder.value = 20;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this intended to set intHolder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Sorry about my mistake. Resolved now.


for (String name : leftChildNames) {
RangeEqualsVisitor visitor = createInnerVisitor(leftVector.getChild(name), rightVector.getChild(name));
RangeEqualsVisitor visitor = createInnerVisitor(leftVector.getChild(name), rightVector.getChild(name), null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment for what null represents (/param_name=/) here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments added. Thanks for the good suggestion.

vector1.setValueCount(3);

vector2.setType(0, Types.MinorType.UINT4);
vector2.setSafe(0, uInt4Holder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry if I'm missing it but these vectors do look identifical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I have revised the test case so the two vectors are different.

kszucs pushed a commit that referenced this pull request Feb 7, 2020
There are some problems with the current union comparison logic. For example:
1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different.
2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range.

Closes #5544 from liyafan82/fly_0930_share and squashes the following commits:

d6ef3d2 <liyafan82>  Refine test case
c008289 <liyafan82>  Resolve test failure after rebasing
c515393 <liyafan82>  Rule out the change for union type comparison
bab7402 <liyafan82>  Compare fields for all vectors except union vectors
5b2225e <liyafan82>  Fix the bug with decimal vector
4d8b570 <liyafan82>  Fix problems with current union comparison logic

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
There are some problems with the current union comparison logic. For example:
1. For type check, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different.
2. We should not compare the number of sub vectors, as it is possible that two union vectors have different numbers of sub vectors, but have equal values in the range.

Closes apache#5544 from liyafan82/fly_0930_share and squashes the following commits:

d6ef3d2 <liyafan82>  Refine test case
c008289 <liyafan82>  Resolve test failure after rebasing
c515393 <liyafan82>  Rule out the change for union type comparison
bab7402 <liyafan82>  Compare fields for all vectors except union vectors
5b2225e <liyafan82>  Fix the bug with decimal vector
4d8b570 <liyafan82>  Fix problems with current union comparison logic

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants