-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-6208: [Java] Correct byte order before comparing in ByteFunctionHelpers #5063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the purpose is to compare memory segment in lexicographic order, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I added a unittest, please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pprudhvi thanks for your effort.
I think we need to address two problems here:
- we need to check the byte order before we decide if we need to reserve bytes.
- we must make sure that comparing long values in endian order is identical to comparing 8 bytes in lexicographic order.
I have a little doubtful about 2). If you have an answer to it, please let me know. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following prints 2^24
ArrowBuf left = allocator.buffer(4);
left.setByte(0, 0);
left.setByte(1, 0);
left.setByte(2, 0);
left.setByte(3, 1);
System.out.println(PlatformDependent.getInt(left.memoryAddress()));
left.close();
this means the bytes are assumed to be in little endian format while reading as int. Directly comparing int won't be the same as comparing in lexicographic order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pprudhvi Maybe I did not make it clear enough. Let me explain the two points in more detail:
-
Some machines are big endian while others are little endian. We should check this before reversing the bytes. If the code is running on a big endian machine, there is no need to reverse bytes.
-
I am not sure if the following two results are always identical when comparing 2 pieces of continuous 8 bytes data:
a) read both data as longs (in big endian order), and compare the long values.
b) read both data as 8 bytes, and compare the 8 bytes in sequence, until the first unmatched bytes are found. If all 8 bytes are matched, we return 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the arrow buffers are always encoded in little-endian, and java is always in big-endian. so, I think we don't need to check endianness again.
Codecov Report
@@ Coverage Diff @@
## master #5063 +/- ##
==========================================
+ Coverage 87.63% 89.74% +2.1%
==========================================
Files 1014 672 -342
Lines 144947 100352 -44595
Branches 1437 0 -1437
==========================================
- Hits 127030 90060 -36970
+ Misses 17555 10292 -7263
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
fc8f61a to
a5b490d
Compare
|
Arrow is little endian by definition. No need to deal with big endian.
…On Mon, Aug 12, 2019, 7:09 AM liyafan82 ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In
java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
<#5063 (comment)>:
> @@ -140,7 +140,7 @@ private static int memcmp(
long leftLong = PlatformDependent.getLong(lPos);
long rightLong = PlatformDependent.getLong(rPos);
if (leftLong != rightLong) {
- return unsignedLongCompare(leftLong, rightLong);
+ return unsignedLongCompare(Long.reverseBytes(leftLong), Long.reverseBytes(rightLong));
@pprudhvi <https://github.com/pprudhvi> Maybe I did not make it clear
enough. Let me explain the two points in more detail:
1.
Some machines are big endian while others are little endian. We should
check this before reversing the bytes. If the code is running on a big
endian machine, there is no need to reverse bytes.
2.
I am not sure if the following two results are always identical when
comparing 2 pieces of continuous 8 bytes data:
a) read both data as longs (in big endian order), and compare the long
values.
b) read both data as 8 bytes, and compare the 8 bytes in sequence,
until the first unmatched bytes are found. If all 8 bytes are matched, we
return 0.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#5063?email_source=notifications&email_token=AABMYNTZZ733N7VOWNFJ34DQEFHIDA5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCBHZIRY#discussion_r312892213>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABMYNQIJRWRW2IYK5ZYEVTQEFHIDANCNFSM4IK6LHHA>
.
|
|
@jacques-n and @pravindra Thanks a lot for your information. I am wondering how is this achieved (Arrow is always in little endian). For example, the call stack for IntVector#getInt is IntVector#getInt -> ArrowBuf#getInt -> PlatformDependent#getInt -> PlatformDependent0#getInt -> UNSAFE#getInt Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent. Please correct me where I am wrong. Thank you very much. |
1eb0d49 to
f923c84
Compare
I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it. I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java This code has been around forever, so it's possible I'm confused - I'll double check and confirm. |
@pravindra Thanks a lot for the effort. |
|
@pprudhvi has added the check for the platform endianness before reversing the bytes. And, tests for this use-case. @liyafan82 can you please give an example for the second issue that you raised ? i.e why would the unsigned-long compare be different from the byte-by-byte compare (after fixing the endianness) ? Regarding the use of PlatformDependant macros, I'll follow up and raise new jiras if required. |
@pravindra Suppose we have two ints: If we compare by int, we have a > b, since a = 255 and b = 127. |
|
@liyafan82 we are doing an unsigned comparison in this functions for both byte-level (compare after 0xff) and int-level (unsignedIntCompare). With the unsigned compare, a > b in either case. |
@pravindra I see. Thanks a lot. |
|
I believe arrow has code (in the allocator?) where we simply throw
unsupported on big endian systems. The Java code doesn't support running on
big endian systems at all. This is unsupported.
…On Tue, Aug 13, 2019, 7:10 AM Pindikura Ravindra ***@***.***> wrote:
Since UNSAFE#getInt's behavior (little endian or big endian) is platform
dependent,
the behavior of IntVector#getInt is also platform dependent.
I think you are right, we probably do all our testing on little endian
machines and so, haven't hit this. Per the arrow spec, the buffers should
always be LE and if PlatformDependant doesn't always guarantee LE, we need
to have wrappers above it.
I see that in a similar impl in guava, they do reverseBytes() based on the
native order for the platform.
https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java
This code has been around forever, so it's possible I'm confused - I'll
double check and confirm.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5063?email_source=notifications&email_token=AABMYNTE7FQ3ZTCD36PG6RDQEKQE5A5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FOSBQ#issuecomment-520808710>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABMYNVEPPVJPRUFLFOFLGTQEKQE5ANCNFSM4IK6LHHA>
.
|
@jacques-n Thanks for the informaton. |
liyafan82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pprudhvi Thanks for fix the problems.
0eeb364 to
86b9b34
Compare
…fHasher According to the discussion in #5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order. Closes #5098 from liyafan82/fly_0816_order and squashes the following commits: 538ce59 <liyafan82> ARROW-6264 There is no need to consider byte order in ArrowBufHasher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
…fHasher According to the discussion in apache/arrow#5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order. Closes #5098 from liyafan82/fly_0816_order and squashes the following commits: 538ce59d7 <liyafan82> ARROW-6264 There is no need to consider byte order in ArrowBufHasher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
…nHelpers Closes apache#5063 from pprudhvi/memcmp and squashes the following commits: ea4a8ee <Prudhvi Porandla> reverse bytes before compare Authored-by: Prudhvi Porandla <prudhvi.porandla@icloud.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
No description provided.