Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix when key out of Boundary for MemoryAlignedDataMap #880

Merged
merged 2 commits into from
Jun 18, 2024

Conversation

yagagagaga
Copy link
Contributor

@yagagagaga yagagagaga commented Jun 17, 2024

  • MemoryAlignedDataMap: fix bug when key out of Boundary
    before:
    var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
    map.put(-1L << 60, 1);
    map.put(-1L << 59, 2);
    map.get(-1L << 60); // return 2 rather than 1
    map.put(549755813888L, 6); // throw java.lang.IndexOutOfBoundsException: Index -2147483648 out of bounds for length 5
    after:
    var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
    map.put(-1L << 60, 1); // throw java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is -1152921504606846976
    map.put(-1L << 59, 2); // throw java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is -576460752303423488
    map.put(549755813888L, 6); // throw java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is 549755813888

Copy link

Quality Gate Passed Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
1.0% Duplication on New Code

See analysis details on SonarCloud

@bchapuis
Copy link
Member

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.

Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure

To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

@yagagagaga
Copy link
Contributor Author

yagagagaga commented Jun 17, 2024

Thanks for your reply, I had reformat the code and rollback the Memory.java.

@yagagagaga yagagagaga changed the title bugfix and performance improved for MemoryAlignedDataMap and Memory bugfix when key out of Boundary for MemoryAlignedDataMap Jun 17, 2024
@bchapuis
Copy link
Member

Thank you, could you elaborate a bit more on the computation of the upperBoundary? I'm probably missing something, but I would naivly implement it as follow, which does not give exacly the same result as your implementation.

this.upperBoundary = ((long) memory.segmentSize()) * ((long) Integer.MAX_VALUE) / (long) dataType.size();

@yagagagaga
Copy link
Contributor Author

yagagagaga commented Jun 17, 2024

As all we known, Long is a 64-bits length data type with 63-bits represents the numerical value and 1-bit represents positive or negative.
The key in MemoryAlignedDataMap needs to consider the following limitations:

  1. value's data size. If the value occupies 4 bytes, it means that 2-bits (in 63-bits) are needed to represent.
  2. The calculation of segmentIndex depends on the size of segmentSize. If the segmentSize is 1024, it means that 10-bits (in 61-bits) are needed to represent and the remaining 51-bits represent the number of segments. Considering that segmentIndex is an Integer field, it will cause precision loss when the segmentIndex is greater than 32-bits. There are two cases to discuss:
    1. segmentIndex will not loss precision if segmentSize greater than 32-bits.
     value data size = 2-bits                   segmentSize > 32-bits          
         /\                            /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~\
        0111111111111111111111111111111111111111111111111111111111111111
        ^  \__________________________/
        |               v
        |       segmentIndex <= 30-bits
        |  \___________________________________________________________/
        |                            v
       -/+                       key's range
    1. segmentIndex will loss precision if segmentSize less than 32-bits. The range of segmentIndex must be limited in order to be represented as an integer.
                                   segmentIndex = 31-bits    segmentSize = 10-bits         
                               /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~\/~~~~^~~~\
        0111111111111111111111111111111111111111111111111111111111111111
        ^                      \/
        |             value data size = 2-bits  
        |                        \_____________________________________/
        |                                         v
       -/+                                   key's range

If following your implement, some wrong will happened:

// this.upperBoundary = ((long) Integer.MAX_VALUE) / (long) dataType.size() * ((long) memory.segmentSize());
var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(549755813887L, 1); // error will happen, but as the matter of fact, you can calculate segmentIndex and segmentOffset with this number

var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1L << 34)); // if could be, the upperBoundary will numeric overflow

@bchapuis bchapuis merged commit 00ec04d into apache:main Jun 18, 2024
8 checks passed
@bchapuis
Copy link
Member

Thank you for the detailed explanation and the fix. As said, this is typically the kind of contribution we need to improve the robustness of the project. Do not hesitate to tell us more about your use case and reach out if you have questions, it would be a pleasure to collaborate!

@CalvinKirs
Copy link
Member

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.↳

Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure↳

Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version.
https://www.apache.org/legal/resolved.html#optional
https://www.apache.org/legal/resolved.html#prohibited

To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

@CalvinKirs
Copy link
Member

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.↳↳
Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure↳↳

Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version. https://www.apache.org/legal/resolved.html#optional https://www.apache.org/legal/resolved.html#prohibited

To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

I looked into it, and sure enough, there's some discussion on this. https://issues.apache.org/jira/browse/LEGAL-399 :smi]

@bchapuis
Copy link
Member

@CalvinKirs Thanks a lot for this pointer. It's good to know that JMH can be used (I acted as a lumberjack in the previous release to ensure that we are compliant with the license ;)

@yagagagaga yagagagaga deleted the 20240617 branch July 2, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants