ARROW-6208: [Java] Correct byte order before comparing in ByteFunctionHelpers #5063

pprudhvi · 2019-08-12T06:53:50Z

No description provided.

liyafan82 · 2019-08-12T07:01:13Z

java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java

the purpose is to compare memory segment in lexicographic order, right?

yes. I added a unittest, please take a look.

@pprudhvi thanks for your effort.
I think we need to address two problems here:

we need to check the byte order before we decide if we need to reserve bytes.

we must make sure that comparing long values in endian order is identical to comparing 8 bytes in lexicographic order.

I have a little doubtful about 2). If you have an answer to it, please let me know. Thanks.

The following prints 2^24

ArrowBuf left = allocator.buffer(4); left.setByte(0, 0); left.setByte(1, 0); left.setByte(2, 0); left.setByte(3, 1); System.out.println(PlatformDependent.getInt(left.memoryAddress())); left.close();

this means the bytes are assumed to be in little endian format while reading as int. Directly comparing int won't be the same as comparing in lexicographic order

@pprudhvi Maybe I did not make it clear enough. Let me explain the two points in more detail:

Some machines are big endian while others are little endian. We should check this before reversing the bytes. If the code is running on a big endian machine, there is no need to reverse bytes.

I am not sure if the following two results are always identical when comparing 2 pieces of continuous 8 bytes data:
a) read both data as longs (in big endian order), and compare the long values.
b) read both data as 8 bytes, and compare the 8 bytes in sequence, until the first unmatched bytes are found. If all 8 bytes are matched, we return 0.

the arrow buffers are always encoded in little-endian, and java is always in big-endian. so, I think we don't need to check endianness again.

codecov-io · 2019-08-12T07:33:32Z

Codecov Report

Merging #5063 into master will increase coverage by 2.1%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5063      +/-   ##
==========================================
+ Coverage   87.63%   89.74%    +2.1%     
==========================================
  Files        1014      672     -342     
  Lines      144947   100352   -44595     
  Branches     1437        0    -1437     
==========================================
- Hits       127030    90060   -36970     
+ Misses      17555    10292    -7263     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
cpp/src/arrow/ipc/writer.cc	`92.83% <0%> (-0.3%)`	⬇️
cpp/src/arrow/ipc/reader.cc	`89.14% <0%> (-0.13%)`	⬇️
cpp/src/arrow/python/serialize.cc	`89.43% <0%> (-0.04%)`	⬇️
python/pyarrow/tests/test_array.py	`96.35% <0%> (-0.01%)`	⬇️
cpp/src/arrow/ipc/read-write-test.cc	`99.83% <0%> (-0.01%)`	⬇️
cpp/src/arrow/flight/client.h	`100% <0%> (ø)`	⬆️
cpp/src/arrow/ipc/message.h	`100% <0%> (ø)`	⬆️
cpp/src/arrow/python/serialize.h	`0% <0%> (ø)`	⬆️
cpp/src/arrow/extension_type-test.cc	`97.54% <0%> (ø)`	⬆️
cpp/src/arrow/flight/server.cc	`88.71% <0%> (ø)`	⬆️
... and 344 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cc12ab...ea4a8ee. Read the comment docs.

jacques-n · 2019-08-12T12:32:16Z

Arrow is little endian by definition. No need to deal with big endian.

…

On Mon, Aug 12, 2019, 7:09 AM liyafan82 ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java <#5063 (comment)>: > @@ -140,7 +140,7 @@ private static int memcmp( long leftLong = PlatformDependent.getLong(lPos); long rightLong = PlatformDependent.getLong(rPos); if (leftLong != rightLong) { - return unsignedLongCompare(leftLong, rightLong); + return unsignedLongCompare(Long.reverseBytes(leftLong), Long.reverseBytes(rightLong)); @pprudhvi <https://github.com/pprudhvi> Maybe I did not make it clear enough. Let me explain the two points in more detail: 1. Some machines are big endian while others are little endian. We should check this before reversing the bytes. If the code is running on a big endian machine, there is no need to reverse bytes. 2. I am not sure if the following two results are always identical when comparing 2 pieces of continuous 8 bytes data: a) read both data as longs (in big endian order), and compare the long values. b) read both data as 8 bytes, and compare the 8 bytes in sequence, until the first unmatched bytes are found. If all 8 bytes are matched, we return 0. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#5063?email_source=notifications&email_token=AABMYNTZZ733N7VOWNFJ34DQEFHIDA5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCBHZIRY#discussion_r312892213>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABMYNQIJRWRW2IYK5ZYEVTQEFHIDANCNFSM4IK6LHHA> .

liyafan82 · 2019-08-13T02:28:20Z

@jacques-n and @pravindra Thanks a lot for your information.

I am wondering how is this achieved (Arrow is always in little endian). For example, the call stack for IntVector#getInt is

IntVector#getInt -> ArrowBuf#getInt -> PlatformDependent#getInt -> PlatformDependent0#getInt -> UNSAFE#getInt

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent.

Please correct me where I am wrong. Thank you very much.

pravindra · 2019-08-13T12:10:53Z

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent,
the behavior of IntVector#getInt is also platform dependent.

I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it.

I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java

This code has been around forever, so it's possible I'm confused - I'll double check and confirm.

liyafan82 · 2019-08-14T04:11:07Z

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent,
the behavior of IntVector#getInt is also platform dependent.

I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it.

I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java

This code has been around forever, so it's possible I'm confused - I'll double check and confirm.

@pravindra Thanks a lot for the effort.

pravindra · 2019-08-14T09:06:57Z

@pprudhvi has added the check for the platform endianness before reversing the bytes. And, tests for this use-case.

@liyafan82 can you please give an example for the second issue that you raised ? i.e why would the unsigned-long compare be different from the byte-by-byte compare (after fixing the endianness) ?

Regarding the use of PlatformDependant macros, I'll follow up and raise new jiras if required.

liyafan82 · 2019-08-14T10:34:35Z

@pprudhvi has added the check for the platform endianness before reversing the bytes. And, tests for this use-case.

@liyafan82 can you please give an example for the second issue that you raised ? i.e why would the unsigned-long compare be different from the byte-by-byte compare (after fixing the endianness) ?

Regarding the use of PlatformDependant macros, I'll follow up and raise new jiras if required.

@pravindra
I can think of such an example. Without loss of generality, suppose we compare by int rather than long.

Suppose we have two ints:
int a = 0x000000ff;
int b = 0x0000007f;

If we compare by int, we have a > b, since a = 255 and b = 127.
If we compare them by bytes (in big endian), we have the first 3 bytes equal (all equal to 0).
For the last byte, for a we have 0xff which is -1, and for b we have 0x7f, which is 127.
So we get a < b.

pravindra · 2019-08-14T11:30:31Z

@liyafan82 we are doing an unsigned comparison in this functions for both byte-level (compare after 0xff) and int-level (unsignedIntCompare). With the unsigned compare, a > b in either case.

liyafan82 · 2019-08-14T11:57:13Z

@liyafan82 we are doing an unsigned comparison in this functions for both byte-level (compare after 0xff) and int-level (unsignedIntCompare). With the unsigned compare, a > b in either case.

@pravindra I see. Thanks a lot.
Although we do not have a unsignedByteCompare method, we are actually comparing bytes as unsigned bytes.
So problem 2 is no longer a problem.

jacques-n · 2019-08-14T14:44:38Z

I believe arrow has code (in the allocator?) where we simply throw unsupported on big endian systems. The Java code doesn't support running on big endian systems at all. This is unsupported.

…

On Tue, Aug 13, 2019, 7:10 AM Pindikura Ravindra ***@***.***> wrote: Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent. I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it. I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java This code has been around forever, so it's possible I'm confused - I'll double check and confirm. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5063?email_source=notifications&email_token=AABMYNTE7FQ3ZTCD36PG6RDQEKQE5A5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FOSBQ#issuecomment-520808710>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABMYNVEPPVJPRUFLFOFLGTQEKQE5ANCNFSM4IK6LHHA> .

liyafan82 · 2019-08-15T02:11:30Z

I believe arrow has code (in the allocator?) where we simply throw unsupported on big endian systems. The Java code doesn't support running on big endian systems at all. This is unsupported.
…
On Tue, Aug 13, 2019, 7:10 AM Pindikura Ravindra @.***> wrote: Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent. I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it. I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java This code has been around forever, so it's possible I'm confused - I'll double check and confirm. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5063?email_source=notifications&email_token=AABMYNTE7FQ3ZTCD36PG6RDQEKQE5A5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FOSBQ#issuecomment-520808710>, or mute the thread https://github.com/notifications/unsubscribe-auth/AABMYNVEPPVJPRUFLFOFLGTQEKQE5ANCNFSM4IK6LHHA .

@jacques-n Thanks for the informaton.
The related code is in class UnsafeDirectLittleEndian. If the byte order is big endian, an exception will be thrown.
This class is used in the AllocationManager.

liyafan82

@pprudhvi Thanks for fix the problems.

…fHasher According to the discussion in #5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order. Closes #5098 from liyafan82/fly_0816_order and squashes the following commits: 538ce59 <liyafan82> ARROW-6264 There is no need to consider byte order in ArrowBufHasher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

…fHasher According to the discussion in apache/arrow#5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order. Closes #5098 from liyafan82/fly_0816_order and squashes the following commits: 538ce59d7 <liyafan82> ARROW-6264 There is no need to consider byte order in ArrowBufHasher Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

…nHelpers Closes apache#5063 from pprudhvi/memcmp and squashes the following commits: ea4a8ee <Prudhvi Porandla> reverse bytes before compare Authored-by: Prudhvi Porandla <prudhvi.porandla@icloud.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

liyafan82 reviewed Aug 12, 2019

View reviewed changes

pprudhvi force-pushed the memcmp branch 3 times, most recently from fc8f61a to a5b490d Compare August 12, 2019 10:30

pravindra approved these changes Aug 12, 2019

View reviewed changes

pprudhvi force-pushed the memcmp branch 2 times, most recently from 1eb0d49 to f923c84 Compare August 13, 2019 11:32

pprudhvi force-pushed the memcmp branch from f923c84 to fbf1341 Compare August 14, 2019 06:06

liyafan82 approved these changes Aug 15, 2019

View reviewed changes

pprudhvi force-pushed the memcmp branch 2 times, most recently from 0eeb364 to 86b9b34 Compare August 15, 2019 03:48

reverse bytes before compare

ea4a8ee

pprudhvi force-pushed the memcmp branch from 86b9b34 to ea4a8ee Compare August 15, 2019 07:53

fsaintjacques added the Component: Java label Aug 15, 2019

liyafan82 mentioned this pull request Aug 16, 2019

ARROW-6264: [Java] There is no need to consider byte order in ArrowBufHasher #5098

Closed

pravindra closed this in 3420d30 Aug 16, 2019

pprudhvi deleted the memcmp branch September 5, 2019 07:12

asfimport mentioned this pull request Aug 16, 2019

[Java] Correct byte order before comparing in ByteFunctionHelpers #22597

Closed

asfimport mentioned this pull request Aug 16, 2019

[Java] There is no need to consider byte order in ArrowBufHasher #22648

Closed

ARROW-6208: [Java] Correct byte order before comparing in ByteFunctionHelpers #5063

ARROW-6208: [Java] Correct byte order before comparing in ByteFunctionHelpers #5063

Uh oh!

Conversation

pprudhvi commented Aug 12, 2019

Uh oh!

liyafan82 Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

pprudhvi Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

liyafan82 Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

pprudhvi Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

liyafan82 Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

pravindra Aug 12, 2019

Choose a reason for hiding this comment

Uh oh!

codecov-io commented Aug 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jacques-n commented Aug 12, 2019 via email

Uh oh!

liyafan82 commented Aug 13, 2019

Uh oh!

pravindra commented Aug 13, 2019

Uh oh!

liyafan82 commented Aug 14, 2019

Uh oh!

pravindra commented Aug 14, 2019

Uh oh!

liyafan82 commented Aug 14, 2019

Uh oh!

pravindra commented Aug 14, 2019

Uh oh!

liyafan82 commented Aug 14, 2019

Uh oh!

jacques-n commented Aug 14, 2019 via email

Uh oh!

liyafan82 commented Aug 15, 2019

Uh oh!

liyafan82 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov-io commented Aug 12, 2019 •

edited

Loading