
Implement FastGetObjectMap to get better read performance #845

Closed

Conversation

@pyb1993 commented Jul 23, 2021

1. Unroll locateKey into the get method

  • The unrolled version has better performance: I tested both the rolled and unrolled variants (a difference of about 2%; a code sketch follows this list)
    SimpleMapBenchmark.read  0.5f  unroll  2048  1000  thrpt  9  360077.662 ± 7646.572  ops/s
    SimpleMapBenchmark.read  0.5f  unroll  2048  3000  thrpt  9  352933.493 ± 5690.624  ops/s
    SimpleMapBenchmark.read  0.5f  roll    2048  1000  thrpt  9  368682.803 ± 7567.719  ops/s
    SimpleMapBenchmark.read  0.5f  roll    2048  3000  thrpt  9  358348.669 ± 7085.066  ops/s

There are two possible reasons:
  • if the JIT doesn't inline the locateKey method, each call carries a little overhead
  • the get method has to inspect the result of locateKey and then decide whether to return the real value

2. Remove the magic number in place()

  • that algorithm lowers the probability of hash collisions, but it has some cost
  • with a low loadFactor, we don't need to worry about hash collisions

3. Automatically adjust the load factor

  • according to previous benchmarks, different sizes have different optimal loadFactors
  • I guess the reason is that a small map can exploit cache locality when we use a high loadFactor.

Other small optimizations (possibly done automatically by the JIT; no obvious improvement):
-   remove the key-not-null check (cuckooObjectMap doesn't check it either; checking in put is enough)
-   swap the order of the "other is null" check and the "other equals key" check (the second is more likely to be true, and since key cannot be null we can safely use key.equals(other))
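
A minimal sketch of optimizations 1 and 2 combined, assuming a libGDX-style ObjectMap layout (keyTable, valueTable, and mask fields, power-of-two capacity); this is an illustration, not the exact PR code:

```java
public V get (K key) {
	K[] keyTable = this.keyTable;
	// place() with the magic number removed: a plain mask of the hash code.
	for (int i = key.hashCode() & mask;; i = (i + 1) & mask) {
		K other = keyTable[i];
		// Equality check first: it is the more likely outcome, and since key
		// is never null, key.equals(other) is safe even when other is null.
		if (key.equals(other)) return valueTable[i];
		if (other == null) return null; // empty slot: the key is absent
	}
}
```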

Benchmark Results

num = 100

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt        Score        Error  Units
SimpleMapBenchmark.read    fastGet           2048           100  thrpt    9  1767666.150 ±   7451.419  ops/s
SimpleMapBenchmark.read     cuckoo           2048           100  thrpt    9  1458135.074 ± 373999.983  ops/s

num = 1000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048          1000  thrpt    9  370069.484 ± 20877.873  ops/s
SimpleMapBenchmark.read     cuckoo           2048          1000  thrpt    9  363330.568 ± 18756.168  ops/s

num = 3000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048          3000  thrpt    9  355674.722 ± 18533.691  ops/s
SimpleMapBenchmark.read     cuckoo           2048          3000  thrpt    9  333397.797 ± 78791.696  ops/s

num = 10000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048         10000  thrpt    9  319202.343 ± 11023.639  ops/s
SimpleMapBenchmark.read     cuckoo           2048         10000  thrpt    9  316253.575 ± 68273.397  ops/s

@theigl (Collaborator) left a comment

Thanks for your PR @pyb1993! I left some initial comments. Please also run mvn formatter:format to apply project code styles.

I will try to get some feedback from the original authors of ObjectMap as well.

@theigl (Collaborator) commented Jul 23, 2021

@pyb1993: I ran the benchmark on your current branch on one of my Linux boxes with JDK 11:

With a single thread:

mvn -f benchmarks/pom.xml compile exec:java -Dexec.args="-f 3 -wi 10 -i 3 -t 1 -w 2s -r 2s FastGetObjectMapBenchmark.read"
Benchmark                       (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark.read    fastGet           2048           100  thrpt    9  855709.085 ±  5643.041  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          1000  thrpt    9  189912.648 ±  3902.263  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          3000  thrpt    9  184169.213 ±  8908.928  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048         10000  thrpt    9  181783.603 ±  3773.340  ops/s

FastGetObjectMapBenchmark.read     cuckoo           2048           100  thrpt    9  846559.307 ± 31966.850  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          1000  thrpt    9  194285.589 ± 22357.330  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          3000  thrpt    9  184905.701 ± 14447.756  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048         10000  thrpt    9  187945.091 ±  4137.380  ops/s

And with two threads:

mvn -f benchmarks/pom.xml compile exec:java -Dexec.args="-f 3 -wi 10 -i 3 -t 2 -w 2s -r 2s FastGetObjectMapBenchmark.read"
Benchmark                       (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt        Score       Error  Units
FastGetObjectMapBenchmark.read    fastGet           2048           100  thrpt    9  1775902.890 ± 44836.590  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          1000  thrpt    9   362333.710 ±  7255.862  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          3000  thrpt    9   360358.157 ±  4465.184  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048         10000  thrpt    9   340645.007 ±  6892.110  ops/s

FastGetObjectMapBenchmark.read     cuckoo           2048           100  thrpt    9  1703987.634 ± 59457.182  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          1000  thrpt    9   385086.809 ±  6782.309  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          3000  thrpt    9   378497.012 ±  6633.835  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048         10000  thrpt    9   371224.718 ±  4818.188  ops/s

For me, fastGet is still a little slower than cuckoo, but the overall read performance is much better than the default objectMap.

@pyb1993 (Author) commented Jul 24, 2021

> For me, fastGet is still a little slower than cuckoo, but the overall read performance is much better than the default objectMap.

Yes, you are correct. I reran the benchmark many times and found the result is not very stable:
most of the time cuckoo is a little quicker than fastGet, but sometimes fastGet is quicker than cuckoo.
Only for small maps (num = 100) does fastGet consistently outperform cuckoo.

In that case, I'm not sure it is worth replacing cuckoo; maybe keeping it is the better option.

@tommyettinger commented Jul 24, 2021

I looked at the code changes for a minute, and noticed that the FastGetObjectMap code is mostly the same... except that it always uses a lower load factor than ObjectMap or CuckooObjectMap. If you make ObjectMap and CuckooObjectMap also use a 0.5 load factor, as this map should be using at the size of 2048 that the benchmark runs at, you get a very different outcome:

Benchmark                            (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score        Error  Units
FastGetObjectMapBenchmark.read         fastGet           2048           100  thrpt    8  984496.868 ± 138226.490  ops/s
FastGetObjectMapBenchmark.read          cuckoo           2048           100  thrpt    8  964066.798 ±  31559.059  ops/s
FastGetObjectMapBenchmark.read             obj           2048           100  thrpt    8  868244.026 ±  24440.959  ops/s
FastGetObjectMapBenchmark.read            hash           2048           100  thrpt    8  966474.123 ±  27272.680  ops/s
FastGetObjectMapBenchmark.write        fastGet           2048           100  thrpt    8  737184.677 ± 120212.539  ops/s
FastGetObjectMapBenchmark.write         cuckoo           2048           100  thrpt    8  713588.461 ± 109947.526  ops/s
FastGetObjectMapBenchmark.write            obj           2048           100  thrpt    8  706915.647 ± 169966.607  ops/s
FastGetObjectMapBenchmark.write           hash           2048           100  thrpt    8  672497.488 ± 135969.047  ops/s
FastGetObjectMapBenchmark.writeRead    fastGet           2048           100  thrpt    8  223227.096 ±   3652.897  ops/s
FastGetObjectMapBenchmark.writeRead     cuckoo           2048           100  thrpt    8  210817.956 ±  11824.875  ops/s
FastGetObjectMapBenchmark.writeRead        obj           2048           100  thrpt    8  219017.636 ±   6081.790  ops/s
FastGetObjectMapBenchmark.writeRead       hash           2048           100  thrpt    8  194159.176 ±  10909.609  ops/s

This could easily have benchmark quality issues because not very many iterations were used -- and whoops, I'm used to seeing these measured in time taken, not throughput, so read these as ops/s. If it holds up: ObjectMap (obj in this listing) is slowest by over 10% for read, second-slowest for write, and second-fastest in a close bunch for writeRead.

What this seems to suggest is that just lowering the load factor on the existing ObjectMap is enough to make it significantly faster, even relative to CuckooObjectMap. 0.5 is probably a good amount. I'll run more benchmarks with varying load factors to see where the changes start to matter.
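
Lowering the load factor needs no new API; the two-argument constructor already takes it (constructor shape as in the libGDX/Kryo ObjectMap; the key and value types here are only for illustration):

```java
// 2048 initial capacity to match the benchmark; 0.5f instead of the 0.8f default.
ObjectMap<String, Integer> map = new ObjectMap<>(2048, 0.5f);
```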

GitHub doesn't let me attach a bare .java file, but here are my changes to the benchmark, zipped:
FastGetObjectMapBenchmark.zip
Note that I changed a usage of Random to a ThreadLocalRandom, because the seed and statistical quality don't matter, and ThreadLocalRandom should take up much less of the time in the benchmark -- synchronization in Random can really slow it down.
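
The Random-to-ThreadLocalRandom swap looks roughly like this (a sketch; the benchmark's actual field and constant names may differ):

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class RandomSwapSketch {
	static final int NUM_CLASSES = 500; // illustrative

	// Before: one shared Random; every nextInt() atomically updates a single
	// internal seed, so concurrent benchmark threads contend on it.
	static final Random SHARED = new Random(12345);

	static int contendedIndex () {
		return SHARED.nextInt(NUM_CLASSES);
	}

	// After: a per-thread generator with no shared state to fight over.
	static int uncontendedIndex () {
		return ThreadLocalRandom.current().nextInt(NUM_CLASSES);
	}
}
```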

@tommyettinger

So, I ran some more benchmarks... These are all using FastGetObjectMapBenchmark.
fastGet is the customized subclass of ObjectMap being evaluated.
cuckoo is CuckooObjectMap.
obj is ObjectMap with the get() and place() optimizations from FastGetObjectMap applied, because they all make sense here (but not its loadFactor changes, because I want something to be different so I can compare).
hash is java.util.HashMap.

This tests with 500 Classes inserted, instead of 100, and the results are less favorable to CuckooObjectMap with the higher item count. This tests loadFactors of 0.4, 0.5, 0.6, 0.7, and 0.8, but fastGet doesn't take loadFactor into account, so it's a little confusing that it gets different results at all for different loadFactors. There is also a lot of noise in the benchmark; I ran this on Windows 10, and I think there may have been some OS junk competing for CPU time.

Benchmark  (loadFactor)  (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
read                0.4    fastGet           2048           500  thrpt   10  203396.861 ±  8688.657  ops/s
read                0.4     cuckoo           2048           500  thrpt   10  174474.845 ± 11104.342  ops/s
read                0.4        obj           2048           500  thrpt   10  182592.193 ±  4077.766  ops/s
read                0.4       hash           2048           500  thrpt   10  159097.986 ±  9607.009  ops/s
read                0.5    fastGet           2048           500  thrpt   10  193455.437 ± 14741.645  ops/s
read                0.5     cuckoo           2048           500  thrpt   10  161923.298 ± 22251.526  ops/s
read                0.5        obj           2048           500  thrpt   10  177151.514 ±  7222.974  ops/s
read                0.5       hash           2048           500  thrpt   10  163061.511 ±  3982.901  ops/s
read                0.6    fastGet           2048           500  thrpt   10  200107.413 ±  2153.898  ops/s
read                0.6     cuckoo           2048           500  thrpt   10  166540.735 ±  9245.270  ops/s
read                0.6        obj           2048           500  thrpt   10  180055.150 ±  3254.601  ops/s
read                0.6       hash           2048           500  thrpt   10  162189.791 ±  7832.268  ops/s
read                0.7    fastGet           2048           500  thrpt   10  174583.487 ±  4305.354  ops/s
read                0.7     cuckoo           2048           500  thrpt   10  173032.679 ±  3766.070  ops/s
read                0.7        obj           2048           500  thrpt   10  182344.666 ±  6568.991  ops/s
read                0.7       hash           2048           500  thrpt   10  161397.007 ±  2309.251  ops/s
read                0.8    fastGet           2048           500  thrpt   10  198841.597 ± 12824.240  ops/s
read                0.8     cuckoo           2048           500  thrpt   10  169526.261 ± 13254.839  ops/s
read                0.8        obj           2048           500  thrpt   10  152559.914 ± 52911.405  ops/s
read                0.8       hash           2048           500  thrpt   10  155779.667 ± 20023.779  ops/s
write               0.4    fastGet           2048           500  thrpt   10  173894.412 ±  3791.294  ops/s
write               0.4     cuckoo           2048           500  thrpt   10  168937.060 ±  4303.122  ops/s
write               0.4        obj           2048           500  thrpt   10  177942.996 ±  4306.470  ops/s
write               0.4       hash           2048           500  thrpt   10  132062.077 ±  3824.929  ops/s
write               0.5    fastGet           2048           500  thrpt   10  176622.602 ±  4953.666  ops/s
write               0.5     cuckoo           2048           500  thrpt   10  163975.929 ±  3902.864  ops/s
write               0.5        obj           2048           500  thrpt   10  174190.288 ±  5795.178  ops/s
write               0.5       hash           2048           500  thrpt   10  130471.241 ± 15769.696  ops/s
write               0.6    fastGet           2048           500  thrpt   10  176066.884 ±  8087.540  ops/s
write               0.6     cuckoo           2048           500  thrpt   10  162391.330 ±  4616.634  ops/s
write               0.6        obj           2048           500  thrpt   10  177106.562 ±  3092.769  ops/s
write               0.6       hash           2048           500  thrpt   10  133279.551 ±  6136.359  ops/s
write               0.7    fastGet           2048           500  thrpt   10  177620.663 ±  5643.269  ops/s
write               0.7     cuckoo           2048           500  thrpt   10  163657.554 ±  3696.383  ops/s
write               0.7        obj           2048           500  thrpt   10  155664.683 ± 30198.158  ops/s
write               0.7       hash           2048           500  thrpt   10  138005.766 ±  5484.546  ops/s
write               0.8    fastGet           2048           500  thrpt   10  174708.845 ±  3925.234  ops/s
write               0.8     cuckoo           2048           500  thrpt   10  163353.853 ±  2017.528  ops/s
write               0.8        obj           2048           500  thrpt   10  165161.512 ± 16316.204  ops/s
write               0.8       hash           2048           500  thrpt   10  136692.675 ±  3913.704  ops/s
writeRead           0.4    fastGet           2048           500  thrpt   10   43591.521 ±   704.075  ops/s
writeRead           0.4     cuckoo           2048           500  thrpt   10   46173.253 ±   824.982  ops/s
writeRead           0.4        obj           2048           500  thrpt   10   48410.097 ±   687.127  ops/s
writeRead           0.4       hash           2048           500  thrpt   10   45380.194 ±  1189.894  ops/s
writeRead           0.5    fastGet           2048           500  thrpt   10   43689.811 ±   712.184  ops/s
writeRead           0.5     cuckoo           2048           500  thrpt   10   40964.239 ±   693.984  ops/s
writeRead           0.5        obj           2048           500  thrpt   10   42961.852 ±  1806.682  ops/s
writeRead           0.5       hash           2048           500  thrpt   10   43414.371 ±  1383.136  ops/s
writeRead           0.6    fastGet           2048           500  thrpt   10   43878.141 ±   455.079  ops/s
writeRead           0.6     cuckoo           2048           500  thrpt   10   40819.561 ±   441.262  ops/s
writeRead           0.6        obj           2048           500  thrpt   10   42724.069 ±   485.276  ops/s
writeRead           0.6       hash           2048           500  thrpt   10   43415.844 ±   831.169  ops/s
writeRead           0.7    fastGet           2048           500  thrpt   10   43819.019 ±   616.127  ops/s
writeRead           0.7     cuckoo           2048           500  thrpt   10   41281.160 ±   457.417  ops/s
writeRead           0.7        obj           2048           500  thrpt   10   43204.736 ±   595.106  ops/s
writeRead           0.7       hash           2048           500  thrpt   10   43279.243 ±   626.552  ops/s
writeRead           0.8    fastGet           2048           500  thrpt   10   41917.364 ±  4236.458  ops/s
writeRead           0.8     cuckoo           2048           500  thrpt   10   40736.723 ±  1126.091  ops/s
writeRead           0.8        obj           2048           500  thrpt   10   42525.028 ±   765.568  ops/s
writeRead           0.8       hash           2048           500  thrpt   10   42937.855 ±   891.393  ops/s

To sum this up, FastGetObjectMap seems like a good set of changes, but I would be wary of taking away the option to configure loadFactor. That's one of the few tools hash tables have to adjust for higher- or lower-quality hashCode() implementations. Here we're only testing identity hash codes, and those are generally quite well-distributed compared to, say, Vector2 hash codes from libGDX, or maliciously crafted Strings.

I think there's some issue with the benchmark code or the way I'm running it, also; running just obj without running fastGet first makes obj much faster, and that suggests there could be some bimorphic callsite slowdown. It's hard to tell, though if FastGetObjectMap replaced ObjectMap, then that issue couldn't occur. There's also a lot of variance between forked VMs here; obj in particular could slow down by 20% on some forked runs, and then be back at full speed on another fork (other types had this happen as well, but it is more noticeable with obj).

So, yeah, in this case, I agree with all the changes in FastGetObjectMap except the fixed loadFactor -- I think that should still be up to the user, but could default to 0.7f or 0.5f in the constructors that don't specify a loadFactor.

@tommyettinger

I figured out the flaw with the benchmark... on my end. I was mvn package-ing kryo but not mvn install-ing it, so the benchmarks were stuck with an old version. Implementing the unrolled get() and masking place(), but not the load factor change, now makes ObjectMap faster than CuckooObjectMap pretty reliably -- I can run more benchmarks later, but for now:

Benchmark                       (loadFactor)  (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark.read           0.4        obj           2048           500  thrpt   10  275569.483 ±  3758.463  ops/s
FastGetObjectMapBenchmark.read           0.4     cuckoo           2048           500  thrpt   10  272176.565 ±  3442.241  ops/s
FastGetObjectMapBenchmark.read           0.5        obj           2048           500  thrpt   10  267311.200 ± 10068.484  ops/s
FastGetObjectMapBenchmark.read           0.5     cuckoo           2048           500  thrpt   10  265858.911 ±  3855.796  ops/s
FastGetObjectMapBenchmark.read           0.6        obj           2048           500  thrpt   10  275953.601 ±  3599.506  ops/s
FastGetObjectMapBenchmark.read           0.6     cuckoo           2048           500  thrpt   10  251233.291 ± 36603.872  ops/s
FastGetObjectMapBenchmark.read           0.7        obj           2048           500  thrpt   10  272036.503 ±  5120.783  ops/s
FastGetObjectMapBenchmark.read           0.7     cuckoo           2048           500  thrpt   10  262150.137 ±  3546.619  ops/s
FastGetObjectMapBenchmark.read           0.8        obj           2048           500  thrpt   10  269115.094 ± 20723.802  ops/s
FastGetObjectMapBenchmark.read           0.8     cuckoo           2048           500  thrpt   10  263997.198 ±  4647.231  ops/s

This shows obj is faster than cuckoo at every load factor tested, using pyb1993's optimizations. Without those optimizations, it's probably considerably slower, since that should match the above comment's benchmarks.

If you're making changes to ObjectMap, you should also make sure IdentityMap has similar changes. I found in my code that the magic-multiply-then-shift is unnecessary for identityHashCode() results, and it is faster with a mask. It seems like this (for the current benchmarks) shows that the magic-multiply-then-shift is usually slower, though I prefer it when the hashCode() could have a strong bias towards high bits instead of the low bits the mask uses.
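
For reference, the two place() strategies being discussed look like this (the multiplier and shift follow the libGDX-style maps; a sketch, not Kryo's exact code):

```java
// Fibonacci rehash ("magic-multiply-then-shift"): mixes all bits of
// hashCode() into the index by keeping the top bits of the product,
// where shift = 64 - log2(table length).
static int placeFibonacci (Object item, int shift) {
	return (int)(item.hashCode() * 0x9E3779B97F4A7C15L >>> shift);
}

// Plain mask: keeps only the low bits; faster, but it relies on those low
// bits being well distributed (true for identity hash codes).
static int placeMask (Object item, int mask) {
	return item.hashCode() & mask;
}
```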

@pyb1993, is it OK if I credit you in a CONTRIBUTORS file in the repo where I've been working on these kinds of data structures? These are very useful changes, and I'll be incorporating something like them in my code. @NathanSweet is also going in there, because a lot of the changes after my initial wave of code were his work.

@pyb1993 (Author) commented Jul 25, 2021

> This shows obj is faster than cuckoo at every load factor tested, using pyb1993's optimizations. [...]
>
> @pyb1993, is it OK if I credit you in a CONTRIBUTORS file in the repo where I've been working on these kinds of data structures?

As for the CONTRIBUTORS file in your repo, I'd be very happy to be added ^_^
As for the benchmark results, I reran them many times on macOS; sometimes cuckoo is better and sometimes fastGet is better. It is not stable, but cuckoo is more likely to be better when the item count is higher (1000, 3000).

So I'm not sure about the benchmark result.
If you and @theigl can confirm the result, I'm happy to undo the dynamic loadFactor change
(but this fastGet is mainly used for class registration, so maybe hashCode quality is not a problem).

@tommyettinger

It makes sense that the results aren't stable, I think, because this is using Object.hashCode() or just the identity hash code of each object. The identity hash codes should be different each time, so if the references aren't identical between the cuckoo and fastGet benchmarks, the actual number of collisions will vary between runs. It might make more sense to generate Strings, which have a more representative hashCode() implementation for most Java classes, or random Integers, generated by the same seed for each group of objects. If this is only being used with Class keys, then we're stuck with identity hash codes; they tend to be relatively-high-quality random numbers, but have some issues with locking that I don't fully understand. ByteBuddy might be implementing a normal hashCode() for the generated classes, but I don't really know how that library works. I'll benchmark this some more later.
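
A sketch of the seed-stable Integer key generation suggested above (all names are illustrative):

```java
import java.util.Random;

public class SeededKeys {
	// Keys generated from a fixed seed are identical on every run, so the
	// collision pattern -- and therefore the benchmark -- stays stable.
	// Integer.hashCode() is simply the int value itself.
	static Integer[] generate (int count, long seed) {
		Random random = new Random(seed);
		Integer[] keys = new Integer[count];
		for (int i = 0; i < count; i++) keys[i] = random.nextInt();
		return keys;
	}
}
```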

@theigl (Collaborator) commented Jul 26, 2021

@tommyettinger: Thanks a lot for your effort and input on this topic!

Let me summarize briefly what we found out so far:

  1. FastGetObjectMap outperforms ObjectMap by 10-20% when reading

This is mainly due to simplified hashing and a lower load-factor. According to my benchmarks, a load-factor of 0.5 or 0.6 seems ideal, at least for the use-case we are benchmarking. The simplified hashing only makes sense for high quality hashes like identity hashcodes.

  2. CuckooObjectMap outperforms FastGetObjectMap by 5-10% when reading for map sizes > 500 elements

This result is consistent in all my benchmarks. Once we reach a certain number of elements (somewhere between 500 and 1000), the Cuckoo map starts to perform significantly better. Although I'd very much like to delete the Cuckoo map from Kryo, I don't think it is a good idea until we find a replacement that comes closer in performance.

However, there are other parts of Kryo that currently use ObjectMap that might benefit from switching to the new, faster implementation. The IdentityObjectIntMap in MapReferenceResolver for instance. But this has to be benchmarked with a real-world use case before I would consider changing the implementation. For scenarios where reads and writes are balanced, my benchmarks show that FastGetObjectMap is still faster, but only by about 2%.

For reference, these are the benchmark results I got from running @tommyettinger's benchmark on my linux box with a large number of forks and warmups. The error rate for the runs with > 100 elements is quite low, so I'm pretty sure these results are valid:

Benchmark                             (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark3.read          cuckoo           2048           100  thrpt    9  901751.050 ± 69378.088  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048           100  thrpt    9  874029.724 ± 11040.301  ops/s
FastGetObjectMapBenchmark3.read             obj           2048           100  thrpt    9  803881.628 ± 17089.752  ops/s
FastGetObjectMapBenchmark3.read            hash           2048           100  thrpt    9  873987.979 ± 68187.282  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048          1000  thrpt    9  200249.675 ±  3628.701  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048          1000  thrpt    9  178400.099 ± 13617.125  ops/s
FastGetObjectMapBenchmark3.read            hash           2048          1000  thrpt    9  169823.262 ±  3492.215  ops/s
FastGetObjectMapBenchmark3.read             obj           2048          1000  thrpt    9  152193.884 ± 27983.011  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048          3000  thrpt    9  190520.528 ±  1260.122  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048          3000  thrpt    9  180000.873 ± 11179.701  ops/s
FastGetObjectMapBenchmark3.read             obj           2048          3000  thrpt    9  165642.223 ±  2900.557  ops/s
FastGetObjectMapBenchmark3.read            hash           2048          3000  thrpt    9  165305.519 ±  3006.401  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048         10000  thrpt    9  181784.360 ±  4693.853  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048         10000  thrpt    9  176059.147 ±  6336.086  ops/s
FastGetObjectMapBenchmark3.read             obj           2048         10000  thrpt    9  158154.273 ±  2574.914  ops/s
FastGetObjectMapBenchmark3.read            hash           2048         10000  thrpt    9  156412.495 ±  3297.872  ops/s

FastGetObjectMapBenchmark3.write        fastGet           2048           100  thrpt    9  711911.286 ±  7407.523  ops/s
FastGetObjectMapBenchmark3.write         cuckoo           2048           100  thrpt    9  688634.806 ± 17693.374  ops/s
FastGetObjectMapBenchmark3.write            obj           2048           100  thrpt    9  640426.948 ± 13311.081  ops/s
FastGetObjectMapBenchmark3.write           hash           2048           100  thrpt    9  630192.809 ±  4457.366  ops/s

FastGetObjectMapBenchmark3.write         cuckoo           2048          1000  thrpt    9   75099.094 ±  2870.669  ops/s
FastGetObjectMapBenchmark3.write        fastGet           2048          1000  thrpt    9   74290.204 ±  1002.739  ops/s
FastGetObjectMapBenchmark3.write            obj           2048          1000  thrpt    9   72653.609 ±  1966.490  ops/s
FastGetObjectMapBenchmark3.write           hash           2048          1000  thrpt    9   62677.821 ±  5744.457  ops/s

FastGetObjectMapBenchmark3.write        fastGet           2048          3000  thrpt    9   23054.563 ±  1383.498  ops/s
FastGetObjectMapBenchmark3.write            obj           2048          3000  thrpt    9   22640.087 ±  1596.283  ops/s
FastGetObjectMapBenchmark3.write         cuckoo           2048          3000  thrpt    9   21663.940 ±  2328.371  ops/s
FastGetObjectMapBenchmark3.write           hash           2048          3000  thrpt    9   17995.772 ±   796.708  ops/s

FastGetObjectMapBenchmark3.write         cuckoo           2048         10000  thrpt    9    5979.943 ±   406.254  ops/s
FastGetObjectMapBenchmark3.write        fastGet           2048         10000  thrpt    9    5718.929 ±   385.197  ops/s
FastGetObjectMapBenchmark3.write            obj           2048         10000  thrpt    9    5701.094 ±   309.234  ops/s
FastGetObjectMapBenchmark3.write           hash           2048         10000  thrpt    9    4860.473 ±    78.571  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048           100  thrpt    9  264731.769 ± 14282.130  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048           100  thrpt    9  218024.219 ± 18505.642  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048           100  thrpt    9  213267.021 ±  6174.069  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048           100  thrpt    9  208221.008 ±  5742.714  ops/s

FastGetObjectMapBenchmark3.writeRead    fastGet           2048          1000  thrpt    9   25579.959 ±   402.784  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048          1000  thrpt    9   24133.977 ±   620.268  ops/s
FastGetObjectMapBenchmark3.writeRead       hash           2048          1000  thrpt    9   24115.726 ±   151.297  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048          1000  thrpt    9   23627.770 ±   614.566  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048          3000  thrpt    9    6453.622 ±   191.857  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048          3000  thrpt    9    6032.564 ±   131.448  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048          3000  thrpt    9    5887.065 ±   346.304  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048          3000  thrpt    9    5140.471 ±    88.034  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048         10000  thrpt    9    1243.688 ±    27.968  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048         10000  thrpt    9    1213.300 ±    29.741  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048         10000  thrpt    9    1197.335 ±    16.430  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048         10000  thrpt    9    1176.832 ±    16.654  ops/s

@pyb1993 (Author) commented Jul 26, 2021

Hi, following your suggestion, I picked nameIdToClass to benchmark and tried replacing its IntMap.
I use continuous ids as the keys and classes as the values.

See my latest commit for the details.
If this works, I can run more benchmarks on other parts of Kryo (classToNameId, nameToClass, idToRegistration) and create a new PR.

Read: fastGetIntMap is quicker than intMap by 3% to 10%.
If there are many instances of the same class (most common with collections), read will be more important than write.

FastGetObjectMapBenchmark.read         intMap           2048           500  thrpt   10  217789.340 ± 52158.272  ops/s
FastGetObjectMapBenchmark.read         intMap           2048          1000  thrpt   10  185616.635 ± 20610.151  ops/s
FastGetObjectMapBenchmark.read         intMap           2048          3000  thrpt   10  189498.133 ±  3460.429  ops/s
FastGetObjectMapBenchmark.read         intMap           2048         10000  thrpt   10  178081.649 ±  8869.369  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048           500  thrpt   10  224623.596 ± 22657.120  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048          1000  thrpt   10  203991.136 ±  6355.197  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048          3000  thrpt   10  198668.113 ± 11263.387  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048         10000  thrpt   10  195530.189 ± 58851.877  ops/s



WriteRead: fastGetIntMap is quicker by 1%, 4%, 5%, and 6%.
FastGetObjectMapBenchmark.writeRead         intMap           2048           500  thrpt   10  87734.833 ± 5324.936  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048          1000  thrpt   10  43089.341 ± 3225.335  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048          3000  thrpt   10  14206.493 ± 1199.021  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048         10000  thrpt   10   3626.960 ±   40.521  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048           500  thrpt   10  88981.727 ± 9759.120  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048          1000  thrpt   10  45685.725 ±  521.659  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048          3000  thrpt   10  14964.966 ±  490.173  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048         10000  thrpt   10   3787.810 ±  209.264  ops/s

On the other hand, if we use the dynamic loadFactor and add a parameter such as EnableDynamicLoadFactor to the constructor, the overall read/writeRead performance becomes more stable (5% to 10% quicker than intMap). Do you think that is necessary if we only use this trick in nameIdToClass? (whose hashCodes are very stable...)

@tommyettinger

So first: I think two parts of the PR are purely good optimizations, and make no changes to the behavior; these are the places where locateKey() is inlined/unrolled. I suspect that just these optimizations to get() provide a major improvement over the ObjectMap without this PR. In all the benchmarks I have run, these optimizations are present; there's no reason not to have them. Well, unless you rely on users being able to extend ObjectMap and override its locateKey() method for equality; even then, it's just another place to change the equality comparison.

A more debatable change is in place(), making it just mask the hashCode() instead of doing the Fibonacci rehashing (the magic multiply and shift); this should help performance for good hashCode() implementations, but cause more collisions with low-quality hashCode() implementations. Unlike CuckooObjectMap, ObjectMap (and FastGetObjectMap) can handle having more collisions without, at some point, an OutOfMemoryError killing the program. CuckooObjectMap is likely to never encounter enough completely-colliding hash codes to have that crash if it sticks to using Class keys, but it cannot be safely used for third-party-submitted data as keys. If the goal is to replace CuckooObjectMap, you can afford the same assumption that the hash codes will be at least moderately-high-quality, and use the mask optimization to place(). I do use the mask optimization in my benchmarks of both FastGetObjectMap and ObjectMap; I can roll back this change to ObjectMap and benchmark again, but I think it's a reasonable change to make. The Fibonacci rehash can also sometimes increase hash collisions, particularly if all hash codes are Fibonacci numbers or low multiples thereof, so it isn't always the best option.

The load factor changes could help, but they also reduce the ability of future code to easily adapt to changing needs for keys. I think changing the default load factor in the constructor from 0.8f to something lower (0.7 is reasonable, maybe 0.5 would be too) could be good here; it uses more memory by default, but if the capacity isn't filled-up, then it makes minimal difference on performance. I would need to run more benchmarks, with more varied key counts, before I could tell where the load factor starts to matter, and what might be a good balance.

I mentioned I would benchmark random Integers; these are (as expected) much faster than Classes to hash, because getting an identity hash code is a pretty involved process. Either fastGet or obj (with most of the optimizations in fastGet) are at the top of pretty much all benches here, though I only tried 500 for numIntegers (like numClasses before). I use a seed for this, which helps ensure more reliable benchmark results (I think I have it right).

Benchmark  (loadFactor)  (mapType)  (maxCapacity)  (numIntegers)   Mode  Cnt       Score       Error  Units
read                0.4    fastGet           2048            500  thrpt   24  311247.642 ±  8117.935  ops/s
read                0.4     cuckoo           2048            500  thrpt   24  288889.556 ±  2796.552  ops/s
read                0.4        obj           2048            500  thrpt   24  281159.447 ± 20055.693  ops/s
read                0.4       hash           2048            500  thrpt   24  187050.992 ± 11477.826  ops/s
read                0.5    fastGet           2048            500  thrpt   24  305052.970 ±  3072.985  ops/s
read                0.5     cuckoo           2048            500  thrpt   24  295044.878 ±  3335.154  ops/s
read                0.5        obj           2048            500  thrpt   24  310254.856 ±  9237.816  ops/s
read                0.5       hash           2048            500  thrpt   24  208629.023 ±  1622.164  ops/s
read                0.6    fastGet           2048            500  thrpt   24  324609.057 ±  2998.449  ops/s
read                0.6     cuckoo           2048            500  thrpt   24  305444.803 ±  3484.108  ops/s
read                0.6        obj           2048            500  thrpt   24  324293.284 ±  2554.168  ops/s
read                0.6       hash           2048            500  thrpt   24  207541.244 ±  2609.115  ops/s
read                0.7    fastGet           2048            500  thrpt   24  325357.080 ±  1724.044  ops/s
read                0.7     cuckoo           2048            500  thrpt   24  309505.610 ±  2738.307  ops/s
read                0.7        obj           2048            500  thrpt   24  325688.085 ±  2939.122  ops/s
read                0.7       hash           2048            500  thrpt   24  197686.334 ± 10791.997  ops/s
read                0.8    fastGet           2048            500  thrpt   24  309515.955 ± 19954.716  ops/s
read                0.8     cuckoo           2048            500  thrpt   24  305579.325 ±  9399.305  ops/s
read                0.8        obj           2048            500  thrpt   24  320494.343 ±  4875.453  ops/s
read                0.8       hash           2048            500  thrpt   24  208023.992 ±  2880.216  ops/s
write               0.4    fastGet           2048            500  thrpt   24  251423.107 ±  1779.975  ops/s
write               0.4     cuckoo           2048            500  thrpt   24  209027.134 ±  1657.113  ops/s
write               0.4        obj           2048            500  thrpt   24  267438.405 ±  2373.473  ops/s
write               0.4       hash           2048            500  thrpt   24  153771.147 ±  4671.341  ops/s
write               0.5    fastGet           2048            500  thrpt   24  252150.900 ±  1862.121  ops/s
write               0.5     cuckoo           2048            500  thrpt   24  202480.173 ±  2087.405  ops/s
write               0.5        obj           2048            500  thrpt   24  246146.243 ±  5024.671  ops/s
write               0.5       hash           2048            500  thrpt   24  145083.640 ±  1014.747  ops/s
write               0.6    fastGet           2048            500  thrpt   24  251033.614 ±  2065.863  ops/s
write               0.6     cuckoo           2048            500  thrpt   24  201611.388 ±  2114.741  ops/s
write               0.6        obj           2048            500  thrpt   24  250234.362 ±  2371.303  ops/s
write               0.6       hash           2048            500  thrpt   24  145220.120 ±   964.643  ops/s
write               0.7    fastGet           2048            500  thrpt   24  249378.654 ±  2752.061  ops/s
write               0.7     cuckoo           2048            500  thrpt   24  198853.919 ±  3785.977  ops/s
write               0.7        obj           2048            500  thrpt   24  250717.465 ±  2158.502  ops/s
write               0.7       hash           2048            500  thrpt   24  145666.077 ±   943.396  ops/s
write               0.8    fastGet           2048            500  thrpt   24  250936.847 ±  1302.827  ops/s
write               0.8     cuckoo           2048            500  thrpt   24  198121.092 ±  4365.235  ops/s
write               0.8        obj           2048            500  thrpt   24  258688.034 ±  2100.709  ops/s
write               0.8       hash           2048            500  thrpt   24  144406.426 ±  1876.688  ops/s
writeRead           0.4    fastGet           2048            500  thrpt   24   78596.949 ±   541.856  ops/s
writeRead           0.4     cuckoo           2048            500  thrpt   24   70088.231 ±   457.812  ops/s
writeRead           0.4        obj           2048            500  thrpt   24   88078.772 ±   654.802  ops/s
writeRead           0.4       hash           2048            500  thrpt   24   59556.445 ±  2882.402  ops/s
writeRead           0.5    fastGet           2048            500  thrpt   24   78138.961 ±   648.683  ops/s
writeRead           0.5     cuckoo           2048            500  thrpt   24   62374.084 ±   714.386  ops/s
writeRead           0.5        obj           2048            500  thrpt   24   78035.299 ±   528.528  ops/s
writeRead           0.5       hash           2048            500  thrpt   24   60077.107 ±   433.450  ops/s
writeRead           0.6    fastGet           2048            500  thrpt   24   78473.978 ±   373.149  ops/s
writeRead           0.6     cuckoo           2048            500  thrpt   24   61797.286 ±   994.279  ops/s
writeRead           0.6        obj           2048            500  thrpt   24   77929.805 ±   558.747  ops/s
writeRead           0.6       hash           2048            500  thrpt   24   60125.884 ±   538.126  ops/s
writeRead           0.7    fastGet           2048            500  thrpt   24   78678.622 ±   511.142  ops/s
writeRead           0.7     cuckoo           2048            500  thrpt   24   62300.585 ±   905.950  ops/s
writeRead           0.7        obj           2048            500  thrpt   24   78540.848 ±   607.452  ops/s
writeRead           0.7       hash           2048            500  thrpt   24   60073.270 ±   578.024  ops/s
writeRead           0.8    fastGet           2048            500  thrpt   24   78729.001 ±   426.471  ops/s
writeRead           0.8     cuckoo           2048            500  thrpt   24   61330.391 ±  1444.338  ops/s
writeRead           0.8        obj           2048            500  thrpt   24   77886.240 ±   825.636  ops/s
writeRead           0.8       hash           2048            500  thrpt   24   60000.079 ±   460.316  ops/s

It's surprising how slow HashMap is with Integer keys, and CuckooObjectMap often isn't much better on writeRead benches.

Here's my code for the benchmark with Integers:
FastGetObjectMapBenchmark2.zip
And my changes to ObjectMap:
ObjectMap.zip

@theigl (Collaborator) commented Jul 27, 2021

@tommyettinger: Thanks again, and I fully agree with your analysis!

Do you plan on integrating any of the changes into libGDX? If so, we should coordinate the changes, since Nate used to copy the map implementations from libGDX to Kryo in the past, and I'd like to keep them as compatible as possible.

@tommyettinger

@theigl I'd like to test with some of libGDX's more problematic keys, since the change in place() seems very significant for speed on the keys we've tested so far, but is unknown regarding low-quality hashCodes. GridPoint2 is not uncommon as a key type in sparse area maps, and has hit crashes with the cuckoo-hashed object map in some larger maps. Vector2 can often have all of its useful bits in the upper 16 of the 32-bit hashCode, leaving a huge space for hashes to collide.

Although place() can change the rehashing technique, user code can't subclass locateKey() in libGDX (it's package-private), so code outside ObjectMap's package can't change how equality is handled. Being able to do this would be kind of nice if users wanted to use arrays as keys (treating them as immutable and comparing/hashing them by value); this might be reasonable as a substitute for the missing Vector4 class.

The working changes I'm making in jdkgdxds use a protected boolean equate(Object left, Object right) method that defaults to just calling left.equals(right), and can be overridden for things like IdentityMap, since it compares with ==, while the array key case could use Arrays.equals(). This separates the equality logic from locateKey() and allows the internals to be better encapsulated. I suspect there's a tiny speed hit for the default equate() that gets bigger for more complex checks, but these are checks that would need to be done in multiple places with changes coordinated between them -- a prime case for one central method. The default is so small that it should be inlined like a getter.
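
A sketch of the equate() hook described above (shapes assumed from the description; jdkgdxds holds the real implementation):

```java
public class EquateSketch<K, V> {
	// Default key-equality test; small enough that HotSpot should inline it
	// like a getter, as noted above.
	protected boolean equate (Object left, Object right) {
		return left.equals(right);
	}
}

// An identity-keyed variant compares references, as IdentityMap does:
class IdentityEquate<K, V> extends EquateSketch<K, V> {
	@Override
	protected boolean equate (Object left, Object right) {
		return left == right;
	}
}

// An array-keyed variant could instead return
// java.util.Arrays.equals((Object[])left, (Object[])right).
```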

Also, since locateKey() is package-private now, no code could be relying on it outside libGDX (or Kryo) -- this makes the unrolled/inlined locateKey() in get() no problem at all to incorporate. It shouldn't change any behavior for users other than being faster, and the only code changes are in subclasses of ObjectMap or ObjectSet, in the same package in libGDX (or Kryo).

I'm leaving the decision for any changes up to @NathanSweet , since he has the final word in both projects, as far as I know. The get()/containsKey() unroll optimization is 100% safe in libGDX, the place() optimization needs more testing to see how it affects keys from libGDX, and the equate() change is currently only used in jdkgdxds, and isn't necessary for the code to be used as it is now. As for jdkgdxds, it isn't possible to merge with libGDX any time soon, since it needs Java 8. I'm not in favor of changing load factor forcibly without user control, but I am totally on-board with changing the defaults to a lower load factor (maybe not by much).

@pyb1993 (Author) commented Jul 27, 2021

@theigl @tommyettinger
Thanks for your analysis, but I'm a little confused now and have several questions:

  • What I have done is apply some optimizations in a subclass of ObjectMap and try to use it to replace some parts of Kryo.

  • The benchmark from theigl now shows that fastGet is a little slower than cuckoo, so it is not reasonable to replace it.

  • Following theigl's suggestion, I implemented a fastGetIntMap in the same way and tried to replace the IntMap behind nameIdToClass; my benchmark above shows the results. @theigl Could you take a look at that part?

  • @tommyettinger says "there's no reason not to have them" -- do you mean you want to apply this optimization to ObjectMap itself rather than to FastGetObjectMap (in which case the subclass is not necessary)? But hashCode quality is a concern, so you will do more testing on that (correct me if I misunderstand).

To sum up, I'm curious what I should do and what actions you will take for this PR.

@theigl (Collaborator) commented Jul 27, 2021

@pyb1993:

> The benchmark from theigl now shows that fastGet is a little slower than cuckoo, so it is not reasonable to replace it.

Correct. We will not replace it for now.

> Following theigl's suggestion, I implemented a fastGetIntMap in the same way and tried to replace the IntMap behind nameIdToClass; my benchmark above shows the results. @theigl Could you take a look at that part?

I will do some benchmarks in the coming days and get back to you. The IntMap would be a good start to test your changes because it is independent of ObjectMap, always produces good hashcodes, and doesn't have sub-classes.

> @tommyettinger says "there's no reason not to have them" -- do you mean you want to apply this optimization to ObjectMap itself rather than to FastGetObjectMap (in which case the subclass is not necessary)? But hashCode quality is a concern, so you will do more testing on that (correct me if I misunderstand).

I think what @tommyettinger means is that inlining the locateKey method would make sense for all ObjectMaps.
That said, it is currently not possible to simply inline locateKey directly in ObjectMap.get, because it is overridden by IdentityMap to adjust the definition of equality. The same is true for ObjectIntMap and its subclass IdentityObjectIntMap.

The changes to the hashcode have to be evaluated separately for every sub-class and every usage of the map.

@NathanSweet (Member)

> I'm leaving the decision for any changes up to @NathanSweet

It doesn't look like there are controversial changes, so I'm good with whatever you guys come up with for making the maps faster/better! I do wonder if it's worth optimizing using inlining/unrolling in Java-land. Since this is so sensitive to JVM optimizations, maybe we should test with at least Java 16.

@tommyettinger

> Since this is so sensitive to JVM optimizations, maybe we should test with at least Java 16.

I'm pretty sure my tests were with Java 16, unless Maven was somehow forcing another version. That might explain some discrepancies, though everything is quite close. My JAVA_HOME is set to an AdoptOpenJDK 16 JDK.

@pyb1993 (Author) commented Aug 12, 2021

@theigl
Hi, are there any updates on this PR?
You said you would do more benchmarks~

@theigl (Collaborator) commented Aug 22, 2021

@pyb1993: Sorry that this is taking so long. I'm very busy at my job at the moment and can only do minimal work on Kryo. I'll get back to you as soon as I have more time.

@pyb1993 (Author) commented Oct 11, 2021

> Sorry that this is taking so long. I'm very busy at my job at the moment and can only do minimal work on Kryo. I'll get back to you as soon as I have more time.

Are you still busy at your job?

@tommyettinger

So, I should probably follow up here. For a while, I was using several of the optimizations from FastGetObjectMap in various data structures in jdkgdxds (I'm on my phone because my home network is down, so getting a link is a challenge).

I had to roll back some changes to int- and long-keyed maps, because in one case where most of the changed bits were higher-order, an IntMap slowed down to a crawl (though it didn't crash). That case used the x coordinate in the upper 16 bits and y in the lower 16; for a 512x256 grid, the upper bits were the only part that changed, they weren't seen by 16-bit-or-lower hashes, and so each collided 512 times. I went back to Fibonacci hashing, which doesn't have this issue, and performance returned to normal.

Object-keyed maps still use the plain mask of the hashCode(), because most of those can be hashed differently by subclassing the ObjectMap in the few cases that need it. Identity hash codes also don't benefit from Fibonacci hashing. I use an overridable equate() method in my Object-keyed maps and sets so users can compare keys for equality however they want, and can make case-insensitive data structures, among other changes. This separates the equality logic from locateKey() and allows the internals to be better encapsulated. It could be higher or lower performance; equate() is usually very small and so gets optimized well by HotSpot, but there is an extra method call if inlining hasn't happened. The equate() method can also be used in equals(). Otherwise, the inlining has been good for me.
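
The pathological packed-coordinate case described above, made concrete (the 512-bucket table and the shift of 55 are assumptions chosen to match it):

```java
public class PackedKeyCollisions {
	public static void main (String[] args) {
		int x = 300, y = 37;        // a point on a 512x256 grid
		int key = (x << 16) | y;    // x lives entirely in the upper 16 bits

		// A plain 9-bit mask sees only y: all 512 values of x that share this
		// y land in the same bucket, so probes pile up 512 deep.
		int maskedIndex = key & 511;

		// Fibonacci hashing mixes the upper bits into the index (top 9 bits
		// of the 64-bit product), spreading these keys across the table.
		int fibIndex = (int)(key * 0x9E3779B97F4A7C15L >>> 55);

		System.out.println(maskedIndex + " vs " + fibIndex);
	}
}
```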

@theigl (Collaborator) commented Jan 18, 2022

@tommyettinger: Thanks a lot for your feedback! I finally found the time to revisit this topic and the info you shared gives me more confidence in applying these changes.

I will leave hashing in the maps with numeric keys as it is now and replace Fibonacci hashing in the other maps.
I will apply inlining to all maps. I did some more benchmarks and it really makes sense in all implementations even with JDK 17.

@pyb1993: I will create another PR with your changes to hashing and inlining and test the resulting code in our production environment.

theigl added commits that referenced this pull request Jan 18, 2022
@theigl theigl mentioned this pull request Jan 18, 2022
theigl added a commit that referenced this pull request Jan 18, 2022
@tommyettinger

@theigl This is all up to you and depends on your usage scenarios, but I should emphasize that the Fibonacci hashing acts as a form of defense against heavy pileup of collisions when the lower bits of the keys' hashCode() are extremely similar or identical. If you know that a map will not deal with third-party-submitted, potentially-malicious String keys, or keys of some other type that can be engineered to collide completely, then CuckooObjectMap remains a good choice and so does an ObjectMap with masking (without Fibonacci). But, even for int keys, if the lower bits of those keys are more-often-than-not duplicated in other keys, Fibonacci hashing provides a very significant speed improvement by using the upper bits to help reduce collisions.

My point is, if you control the keys at every stage and know their hashCode() implementations will be well-distributed, masking is a valid option, and will likely be slightly faster than using Fibonacci hashing. That case is true for all identity-hashed maps, since the identity hash code is very thoroughly mixed. If you don't control the keys or have to accept keys with poorly distributed hashCode() implementations, like Float or even unusual sets of int keys, then Fibonacci hashing can help considerably, as I described in my previous comment.
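As a concrete illustration of the two placement strategies being weighed here, a hedged sketch assuming a power-of-two table (neither snippet is Kryo's or jdkgdxds's exact code):

```java
// Plain masking keeps the low bits of hashCode(); it is the cheapest option,
// but keys that differ only in their upper bits all land in the same bucket.
static int placeMasked (Object key, int mask) {
	return key.hashCode() & mask;
}

// Fibonacci hashing multiplies by 2^64 divided by the golden ratio and keeps
// the top bits, so every bit of hashCode() influences the bucket index.
// shift = 64 - log2(tableLength), e.g. 53 for a table of length 2048.
static int placeFibonacci (Object key, int shift) {
	return (int)(key.hashCode() * 0x9E3779B97F4A7C15L >>> shift);
}
```

The grid-key anecdote above shows the failure mode: with keys of the form (x << 16) | y, the masked version never sees x on a small table, while the Fibonacci version spreads those keys out.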

I also should mention that I ran some JMH benchmarks on CuckooObjectMap and some variants that are hardened somewhat against mass collision scenarios; all variants had comparable speed to the original, with very slow insertion relative to any other maps I tried, and an insignificant improvement to lookup times (within the margin of error) relative to ObjectMap. I was rather discouraged by this, because I had been hoping to maybe switch some implementations to a variant of CuckooObjectMap, but it didn't substantially outperform the existing code in any metric.

I just hardened one variant some more, and it can now successfully insert 256 keys with the same hashCode() (0 for all of them); I expect it performs a little worse than the existing CuckooObjectMap, and could perform substantially worse, but won't OutOfMemoryError on small input sets: https://github.com/tommyettinger/jdkgdxds/blob/master/src/test/java/com/github/tommyettinger/ds/test/ObjectObjectCuckooMap.java#L408
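For readers unfamiliar with why cuckoo insertion is comparatively slow and why a stash exists at all, a simplified, hypothetical sketch of the kick-out loop (this is not the hardened implementation linked above; the helper names and the HashMap stand-in for the stash are invented):

```java
// Simplified cuckoo insertion: every key has one candidate slot per table.
// Inserting may evict an occupant, which is re-inserted into its slot in the
// other table, possibly cascading. The bounded loop plus a small stash keeps
// lookups O(1), and is also why insertion costs so much more than probing.
class CuckooSketch<K, V> {
	static final int MAX_KICKS = 32; // bound on the eviction chain
	K[] keys1, keys2;
	V[] vals1, vals2;
	int mask; // both tables have power-of-two length
	final java.util.HashMap<K, V> stash = new java.util.HashMap<>();

	int hash1 (K key) {
		return key.hashCode() & mask;
	}

	int hash2 (K key) {
		int h = key.hashCode();
		return ((h ^ (h >>> 16)) * 0x85EBCA6B) & mask; // remix of the same bits
	}

	boolean insert (K key, V value) {
		for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
			int i1 = hash1(key);
			if (keys1[i1] == null) { keys1[i1] = key; vals1[i1] = value; return true; }

			// Evict table 1's occupant and try its slot in table 2.
			K k1 = keys1[i1]; V v1 = vals1[i1];
			keys1[i1] = key; vals1[i1] = value;
			int i2 = hash2(k1);
			if (keys2[i2] == null) { keys2[i2] = k1; vals2[i2] = v1; return true; }

			// Table 2's slot is taken too: evict that occupant and loop with it.
			key = keys2[i2]; value = vals2[i2];
			keys2[i2] = k1; vals2[i2] = v1;
		}
		stash.put(key, value); // a real map would bound this or grow and rehash
		return true;
	}
}
```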

theigl added several commits that referenced this pull request Jan 21-23, 2022
@theigl
Collaborator

theigl commented Jan 24, 2022

@tommyettinger:

My point is, if you control the keys at every stage and know their hashCode() implementations will be well-distributed, masking is a valid option, and will likely be slightly faster than using Fibonacci hashing. That case is true for all identity-hashed maps, since the identity hash code is very thoroughly mixed. If you don't control the keys or have to accept keys with poorly distributed hashCode() implementations, like Float or even unusual sets of int keys, then Fibonacci hashing can help considerably, as I described in my previous comment.

I decided against dropping Fibonacci hashing from all maps and instead only removed it from the two identity maps. Thank you very much for your advice!

I also should mention that I ran some JMH benchmarks on CuckooObjectMap and some variants that are hardened somewhat against mass collision scenarios; all variants had comparable speed to the original, with very slow insertion relative to any other maps I tried, and an insignificant improvement to lookup times (within the margin of error) relative to ObjectMap. I was rather discouraged by this, because I had been hoping to maybe switch some implementations to a variant of CuckooObjectMap, but it didn't substantially outperform the existing code in any metric. I just hardened one variant some more, and it can now successfully insert 256 keys with the same hashCode() (0 for all of them); I expect it performs a little worse than the existing CuckooObjectMap, and could perform substantially worse, but won't OutOfMemoryError on small input sets:

I ran several days' worth of benchmarks against my PR #876 and was finally able to get rid of the CuckooObjectMap in Kryo. The only place we used the cuckoo map was resolving registrations for classes. I replaced it with the improved version of IdentityMap (inlined get and masking instead of Fibonacci hashing). Performance is the same or slightly faster in all my real-world benchmarks.

I pushed a custom Kryo version built from the PR into our production environment and will monitor it for a day or two. So far it looks like a 3-5% performance boost over master.

@tommyettinger

3-5% performance boost is great considering no features were lost as payment. I like the approach where identity maps can rely on identityHashCode() producing good output, since the cases where it doesn't are virtually impossible, and being able to omit the long multiply and shift should help there (especially since getting an identityHashCode() can be slower than some hashCode() implementations). I also like not having to worry about collision buildup in other maps/sets, because those do still use Fibonacci hashing to mix the upper and lower bits.

I'm glad you were able to figure out a solution that avoids using CuckooObjectMap. The insertion speed in the Cuckoo maps was surprisingly poor, to me, and the lookup speed seems very close, as you said, so the main advantage is lessened compared to the maps in this PR. I think part of the problem is that Cuckoo hashing really does seem to depend on an implementation of double hashing (where a key has two independent hashCode() values); C++ maps can provide this, but Java really can't. The stash also only guarantees constant-time operations if it is fixed-size, and the larger the stash, the larger the constant factor. If the stash runs out of room and is fixed-size, that's a major problem. Being able to use an old standard like linear probing is certainly more comforting.
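To make the double-hashing limitation concrete, a hypothetical pair of bucket functions such as a Java cuckoo map might use; both are necessarily functions of the same 32-bit hashCode(), which is why keys engineered to share a hashCode() defeat both tables at once:

```java
// Hypothetical: in Java, the second "independent" hash can only remix the
// same 32 bits, unlike C++, where a genuinely separate hasher can be supplied.
static int bucket1 (Object key, int mask) {
	return key.hashCode() & mask;
}

static int bucket2 (Object key, int mask) {
	int h = key.hashCode();
	h = (h ^ (h >>> 16)) * 0x85EBCA6B; // remix, still a function of hashCode() alone
	return (h ^ (h >>> 13)) & mask;
}
```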

theigl added a commit that referenced this pull request Jan 25, 2022
* #845 Replace Fibonacci hashing with plain masking for identity maps and inline `get` methods

* #845 Adjust JavaDoc

* #845 Add updated `MapBenchmark`
@theigl
Collaborator

theigl commented Jan 25, 2022

@pyb1993 @tommyettinger: Most changes suggested in this PR have been applied in #876 and #877. Many thanks to both of you!

@theigl theigl closed this Jan 25, 2022