bugfix when key out of Boundary for MemoryAlignedDataMap #880

yagagagaga · 2024-06-17T04:55:01Z

MemoryAlignedDataMap: fix a bug that occurs when the key is out of bounds.
Before:

var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(-1L << 60, 1);
map.put(-1L << 59, 2);
map.get(-1L << 60); // returns 2 rather than 1
map.put(549755813888L, 6); // throws java.lang.IndexOutOfBoundsException: Index -2147483648 out of bounds for length 5

After:

var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(-1L << 60, 1); // throws java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is -1152921504606846976
map.put(-1L << 59, 2); // throws java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is -576460752303423488
map.put(549755813888L, 6); // throws java.lang.IndexOutOfBoundsException: Key should between 0 and 549755813887, but your key is 549755813888
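
The "before" results can be reproduced from index arithmetic alone. The sketch below is a minimal illustration, assuming the map derives a byte position as key * valueSize and then splits it into segmentIndex = (int) (position >>> segmentShift) and segmentOffset = (int) (position & segmentMask). The class, method and constant names are hypothetical and are not taken from the PR.

```java
final class IndexArithmeticSketch {

  // Assumed layout: 4-byte int values stored in 1024-byte segments.
  static final int VALUE_SIZE = 4;
  static final int SEGMENT_SHIFT = 10; // log2(1024)
  static final long SEGMENT_MASK = 1023;

  // Segment that would hold the value for the given key (assumed arithmetic).
  static int segmentIndex(long key) {
    long position = key * VALUE_SIZE;
    return (int) (position >>> SEGMENT_SHIFT); // narrowing cast keeps only the low 32 bits
  }

  // Byte offset of the value inside its segment (assumed arithmetic).
  static int segmentOffset(long key) {
    long position = key * VALUE_SIZE;
    return (int) (position & SEGMENT_MASK);
  }

  public static void main(String[] args) {
    // Both negative keys land on segment 0, offset 0.
    System.out.println(segmentIndex(-1L << 60) + " " + segmentOffset(-1L << 60)); // 0 0
    System.out.println(segmentIndex(-1L << 59) + " " + segmentOffset(-1L << 59)); // 0 0
    // Largest key that still yields a non-negative segment index.
    System.out.println(segmentIndex(549755813887L)); // 2147483647
    // One more and the int cast wraps around.
    System.out.println(segmentIndex(549755813888L)); // -2147483648
  }
}
```

Under these assumptions both negative keys collapse to the same slot, which would explain why the second put overwrites the first, and 549755813888 is the first key whose segment index wraps to a negative int.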

sonarqubecloud · 2024-06-17T04:55:47Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
1.0% Duplication on New Code

See analysis details on SonarCloud

bchapuis · 2024-06-17T08:19:42Z

@yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.

Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure

To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

yagagagaga · 2024-06-17T10:10:22Z

Thanks for your reply. I have reformatted the code and rolled back Memory.java.

bchapuis · 2024-06-17T12:13:40Z

Thank you. Could you elaborate a bit more on the computation of the upperBoundary? I'm probably missing something, but I would naively implement it as follows, which does not give exactly the same result as your implementation.

this.upperBoundary = ((long) memory.segmentSize()) * ((long) Integer.MAX_VALUE) / (long) dataType.size();

yagagagaga · 2024-06-17T17:41:47Z

As we all know, a long is a 64-bit data type: 63 bits represent the numerical value and 1 bit represents the sign.
The key in MemoryAlignedDataMap is subject to the following limitations:

The value's data size. If a value occupies 4 bytes, 2 of the 63 bits are needed to represent it.

The calculation of segmentIndex depends on segmentSize. If segmentSize is 1024, 10 of the remaining 61 bits are needed to represent it, and the other 51 bits represent the number of segments. Because segmentIndex is an int field, precision is lost when segmentIndex does not fit in a 32-bit int. There are two cases to discuss:

segmentIndex will not lose precision if segmentSize takes more than 32 bits.

 value data size = 2-bits                   segmentSize > 32-bits          
     /\                            /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~\
    0111111111111111111111111111111111111111111111111111111111111111
    ^  \__________________________/
    |               v
    |       segmentIndex <= 30-bits
    |  \___________________________________________________________/
    |                            v
   -/+                       key's range

segmentIndex will lose precision if segmentSize takes fewer than 32 bits. The range of segmentIndex must then be limited so that it can be represented as an int.

                               segmentIndex = 31-bits    segmentSize = 10-bits         
                           /~~~~~~~~~~~~~~~^~~~~~~~~~~~~~\/~~~~^~~~\
    0111111111111111111111111111111111111111111111111111111111111111
    ^                      \/
    |             value data size = 2-bits  
    |                        \_____________________________________/
    |                                         v
   -/+                                   key's range

With your implementation, the following problems would occur:

// this.upperBoundary = ((long) Integer.MAX_VALUE) / (long) dataType.size() * ((long) memory.segmentSize());
var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1024));
map.put(549755813887L, 1); // would throw, although segmentIndex and segmentOffset can in fact be calculated for this key

var map = new MemoryAlignedDataMap<>(new IntegerDataType(), new OnHeapMemory(1L << 34)); // if such a segment size were allowed, upperBoundary would overflow
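
The two cases above can be condensed into a small calculation. This is a sketch only, assuming that segmentSize and the value size are powers of two and that the byte position is key * valueSize. The helper upperBoundary is hypothetical and is not the code merged in this PR. For comparison, main also evaluates the two simpler formulas quoted in this thread.

```java
final class KeyBoundarySketch {

  // Largest key whose byte position (key * valueSize) stays a non-negative
  // long and whose segment index stays a non-negative int.
  // Assumes both arguments are powers of two.
  static long upperBoundary(long segmentSize, int valueSize) {
    int segmentBits = Long.numberOfTrailingZeros(segmentSize);
    int valueBits = Integer.numberOfTrailingZeros(valueSize);
    // 31 bits of segment index plus the offset bits, capped at the
    // 63 value bits of a long.
    int positionBits = Math.min(63, 31 + segmentBits);
    // When the shift is 63, (1L << 63) - 1 wraps around to Long.MAX_VALUE,
    // which is the intended result.
    return (1L << (positionBits - valueBits)) - 1;
  }

  public static void main(String[] args) {
    long segmentSize = 1024;
    int valueSize = 4;
    System.out.println(upperBoundary(segmentSize, valueSize)); // 549755813887
    // Formula proposed in the review comment:
    System.out.println(segmentSize * Integer.MAX_VALUE / valueSize); // 549755813632
    // Variant quoted in the reply:
    System.out.println(Integer.MAX_VALUE / valueSize * segmentSize); // 549755812864
    // A hypothetical 2^34-byte segment: the cap of 63 bits applies.
    System.out.println(upperBoundary(1L << 34, valueSize)); // 2305843009213693951
  }
}
```

With a 1024-byte segment and 4-byte values, both simpler formulas give a bound below 549755813887, so they would reject keys that can still be addressed.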

bchapuis · 2024-06-18T08:02:07Z

Thank you for the detailed explanation and the fix. As said, this is typically the kind of contribution we need to improve the robustness of the project. Do not hesitate to tell us more about your use case and reach out if you have questions, it would be a pleasure to collaborate!

CalvinKirs · 2024-06-18T09:41:31Z

> @yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.
>
> Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure

Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version.
https://www.apache.org/legal/resolved.html#optional
https://www.apache.org/legal/resolved.html#prohibited

> To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

CalvinKirs · 2024-06-18T09:54:37Z

> > @yagagagaga Thanks a lot, this is a good contribution. This part of the codebase really deserves more checks and tests.
> >
> > Regarding the HashMap, this is what I used initially. However, when dealing with very large datasets (several billion records), the cost of computing the hash code when accessing the HashMap is noticeable compared to the direct memory access of an array. The size of the array can easily be mitigated by increasing the size of segments. Unfortunately, I had to remove the JMH benchmarks due to license compatibility requirements (JMH is released under the GPL). I would keep the array for now and reintroduce some sort of benchmarks if we need to use another data structure
>
> Hi @bchapuis We can definitely revert this change (re-add the benchmarking code module). I believe we were just using its API without copying its source code into our project, and we won't be distributing the JMH-related jars in the binary version. https://www.apache.org/legal/resolved.html#optional https://www.apache.org/legal/resolved.html#prohibited
>
> > To format the code, you need to execute ./mvnw spotless:apply. Let me know if you need further assistance.

I looked into it, and sure enough, there's some discussion on this. https://issues.apache.org/jira/browse/LEGAL-399 :smile:

bchapuis · 2024-06-18T11:37:53Z

@CalvinKirs Thanks a lot for this pointer. It's good to know that JMH can be used (I acted as a lumberjack in the previous release to ensure that we are compliant with the license ;)

bugfix and performance improved for MemoryAlignedDataMap and Memory

6fe7b81

Reformat code and rollback Memory.java

1455e21

yagagagaga changed the title ~~bugfix and performance improved for MemoryAlignedDataMap and Memory~~ bugfix when key out of Boundary for MemoryAlignedDataMap Jun 17, 2024

bchapuis merged commit 00ec04d into apache:main Jun 18, 2024
8 checks passed

yagagagaga deleted the 20240617 branch July 2, 2024 14:48
