Skip to content

Conversation

@pprudhvi
Copy link
Contributor

No description provided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the purpose is to compare memory segment in lexicographic order, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I added a unittest, please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pprudhvi thanks for your effort.
I think we need to address two problems here:

  1. we need to check the byte order before we decide if we need to reserve bytes.
  2. we must make sure that comparing long values in endian order is identical to comparing 8 bytes in lexicographic order.

I have a little doubtful about 2). If you have an answer to it, please let me know. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following prints 2^24

    ArrowBuf left = allocator.buffer(4);
    left.setByte(0, 0);
    left.setByte(1, 0);
    left.setByte(2, 0);
    left.setByte(3, 1);
    System.out.println(PlatformDependent.getInt(left.memoryAddress()));
    left.close();

this means the bytes are assumed to be in little endian format while reading as int. Directly comparing int won't be the same as comparing in lexicographic order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pprudhvi Maybe I did not make it clear enough. Let me explain the two points in more detail:

  1. Some machines are big endian while others are little endian. We should check this before reversing the bytes. If the code is running on a big endian machine, there is no need to reverse bytes.

  2. I am not sure if the following two results are always identical when comparing 2 pieces of continuous 8 bytes data:
    a) read both data as longs (in big endian order), and compare the long values.
    b) read both data as 8 bytes, and compare the 8 bytes in sequence, until the first unmatched bytes are found. If all 8 bytes are matched, we return 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the arrow buffers are always encoded in little-endian, and java is always in big-endian. so, I think we don't need to check endianness again.

@codecov-io
Copy link

codecov-io commented Aug 12, 2019

Codecov Report

Merging #5063 into master will increase coverage by 2.1%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5063      +/-   ##
==========================================
+ Coverage   87.63%   89.74%    +2.1%     
==========================================
  Files        1014      672     -342     
  Lines      144947   100352   -44595     
  Branches     1437        0    -1437     
==========================================
- Hits       127030    90060   -36970     
+ Misses      17555    10292    -7263     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/ipc/writer.cc 92.83% <0%> (-0.3%) ⬇️
cpp/src/arrow/ipc/reader.cc 89.14% <0%> (-0.13%) ⬇️
cpp/src/arrow/python/serialize.cc 89.43% <0%> (-0.04%) ⬇️
python/pyarrow/tests/test_array.py 96.35% <0%> (-0.01%) ⬇️
cpp/src/arrow/ipc/read-write-test.cc 99.83% <0%> (-0.01%) ⬇️
cpp/src/arrow/flight/client.h 100% <0%> (ø) ⬆️
cpp/src/arrow/ipc/message.h 100% <0%> (ø) ⬆️
cpp/src/arrow/python/serialize.h 0% <0%> (ø) ⬆️
cpp/src/arrow/extension_type-test.cc 97.54% <0%> (ø) ⬆️
cpp/src/arrow/flight/server.cc 88.71% <0%> (ø) ⬆️
... and 344 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cc12ab...ea4a8ee. Read the comment docs.

@pprudhvi pprudhvi force-pushed the memcmp branch 3 times, most recently from fc8f61a to a5b490d Compare August 12, 2019 10:30
@jacques-n
Copy link
Contributor

jacques-n commented Aug 12, 2019 via email

@liyafan82
Copy link
Contributor

@jacques-n and @pravindra Thanks a lot for your information.

I am wondering how is this achieved (Arrow is always in little endian). For example, the call stack for IntVector#getInt is

IntVector#getInt -> ArrowBuf#getInt -> PlatformDependent#getInt -> PlatformDependent0#getInt -> UNSAFE#getInt

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent.

Please correct me where I am wrong. Thank you very much.

@pprudhvi pprudhvi force-pushed the memcmp branch 2 times, most recently from 1eb0d49 to f923c84 Compare August 13, 2019 11:32
@pravindra
Copy link
Contributor

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent,
the behavior of IntVector#getInt is also platform dependent.

I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it.

I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java

This code has been around forever, so it's possible I'm confused - I'll double check and confirm.

@liyafan82
Copy link
Contributor

Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent,
the behavior of IntVector#getInt is also platform dependent.

I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it.

I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java

This code has been around forever, so it's possible I'm confused - I'll double check and confirm.

@pravindra Thanks a lot for the effort.

@pravindra
Copy link
Contributor

@pprudhvi has added the check for the platform endianness before reversing the bytes. And, tests for this use-case.

@liyafan82 can you please give an example for the second issue that you raised ? i.e why would the unsigned-long compare be different from the byte-by-byte compare (after fixing the endianness) ?

Regarding the use of PlatformDependant macros, I'll follow up and raise new jiras if required.

@liyafan82
Copy link
Contributor

@pprudhvi has added the check for the platform endianness before reversing the bytes. And, tests for this use-case.

@liyafan82 can you please give an example for the second issue that you raised ? i.e why would the unsigned-long compare be different from the byte-by-byte compare (after fixing the endianness) ?

Regarding the use of PlatformDependant macros, I'll follow up and raise new jiras if required.

@pravindra
I can think of such an example. Without loss of generality, suppose we compare by int rather than long.

Suppose we have two ints:
int a = 0x000000ff;
int b = 0x0000007f;

If we compare by int, we have a > b, since a = 255 and b = 127.
If we compare them by bytes (in big endian), we have the first 3 bytes equal (all equal to 0).
For the last byte, for a we have 0xff which is -1, and for b we have 0x7f, which is 127.
So we get a < b.

@pravindra
Copy link
Contributor

@liyafan82 we are doing an unsigned comparison in this functions for both byte-level (compare after 0xff) and int-level (unsignedIntCompare). With the unsigned compare, a > b in either case.

@liyafan82
Copy link
Contributor

@liyafan82 we are doing an unsigned comparison in this functions for both byte-level (compare after 0xff) and int-level (unsignedIntCompare). With the unsigned compare, a > b in either case.

@pravindra I see. Thanks a lot.
Although we do not have a unsignedByteCompare method, we are actually comparing bytes as unsigned bytes.
So problem 2 is no longer a problem.

@jacques-n
Copy link
Contributor

jacques-n commented Aug 14, 2019 via email

@liyafan82
Copy link
Contributor

I believe arrow has code (in the allocator?) where we simply throw unsupported on big endian systems. The Java code doesn't support running on big endian systems at all. This is unsupported.

On Tue, Aug 13, 2019, 7:10 AM Pindikura Ravindra @.***> wrote: Since UNSAFE#getInt's behavior (little endian or big endian) is platform dependent, the behavior of IntVector#getInt is also platform dependent. I think you are right, we probably do all our testing on little endian machines and so, haven't hit this. Per the arrow spec, the buffers should always be LE and if PlatformDependant doesn't always guarantee LE, we need to have wrappers above it. I see that in a similar impl in guava, they do reverseBytes() based on the native order for the platform. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/LittleEndianByteArray.java This code has been around forever, so it's possible I'm confused - I'll double check and confirm. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5063?email_source=notifications&email_token=AABMYNTE7FQ3ZTCD36PG6RDQEKQE5A5CNFSM4IK6LHHKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4FOSBQ#issuecomment-520808710>, or mute the thread https://github.com/notifications/unsubscribe-auth/AABMYNVEPPVJPRUFLFOFLGTQEKQE5ANCNFSM4IK6LHHA .

@jacques-n Thanks for the informaton.
The related code is in class UnsafeDirectLittleEndian. If the byte order is big endian, an exception will be thrown.
This class is used in the AllocationManager.

Copy link
Contributor

@liyafan82 liyafan82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pprudhvi Thanks for fix the problems.

@pprudhvi pprudhvi force-pushed the memcmp branch 2 times, most recently from 0eeb364 to 86b9b34 Compare August 15, 2019 03:48
@pravindra pravindra closed this in 3420d30 Aug 16, 2019
pravindra pushed a commit that referenced this pull request Aug 16, 2019
…fHasher

According to the discussion in #5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order.

Closes #5098 from liyafan82/fly_0816_order and squashes the following commits:

538ce59 <liyafan82> ARROW-6264  There is no need to consider byte order in ArrowBufHasher

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
@pprudhvi pprudhvi deleted the memcmp branch September 5, 2019 07:12
kou pushed a commit to apache/arrow-java that referenced this pull request Nov 25, 2024
…fHasher

According to the discussion in apache/arrow#5063 (comment), Arrow has a mechanism to make sure the data is stored in little-endian, so there is no need to check byte order.

Closes #5098 from liyafan82/fly_0816_order and squashes the following commits:

538ce59d7 <liyafan82> ARROW-6264  There is no need to consider byte order in ArrowBufHasher

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…nHelpers

Closes apache#5063 from pprudhvi/memcmp and squashes the following commits:

ea4a8ee <Prudhvi Porandla> reverse bytes before compare

Authored-by: Prudhvi Porandla <prudhvi.porandla@icloud.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants