implement the fastGetObjectMap to get better read performance #845
Conversation
Thanks for your PR @pyb1993! I left some initial comments. Please also run `mvn formatter:format` to apply the project's code style. I will also try to get some feedback from the original authors of ObjectMap.
@pyb1993: I ran the benchmark on your current branch on one of my Linux boxes with JDK 11. With a single thread:
And with two threads:
For me,
Yes, you are correct. I reran the benchmark many times and found the result is not very stable. In that case, I'm not sure it is worth replacing cuckoo; maybe keeping it is the better option.
I looked at the code changes for a minute, and noticed that the FastGetObjectMap code is mostly the same... except that it always uses a lower load factor than ObjectMap or CuckooObjectMap. If you make ObjectMap and CuckooObjectMap also use a 0.5 load factor, as this one does at the size of 2048 that the benchmark runs at, you get a
This could easily have benchmark quality issues because not very many iterations were used, but if it holds up,
GitHub doesn't let me attach a bare .java file, but here are my changes to the benchmark, zipped:
So, I ran some more benchmarks... These are all using FastGetObjectMapBenchmark. This tests with 500 Classes inserted, instead of 100, and the results are less favorable to
To sum this up, FastGetObjectMap seems like a good set of changes, but I would be wary of taking away the
I figured out the flaw with the benchmark... on my end. I was
This shows obj is faster than cuckoo at every load factor tested, using pyb1993's optimizations. Without those optimizations, it's probably considerably slower, since that should match the above comment's benchmarks.

If you're making changes to ObjectMap, you should also make sure IdentityMap has similar changes. I found in my code that the magic-multiply-then-shift is unnecessary for identityHashCode() results, and it is faster with a mask. It seems like this (for the current benchmarks) shows that the magic-multiply-then-shift is usually slower, though I prefer it when the hashCode() could have a strong bias towards high bits instead of the low bits the mask uses.

@pyb1993, is it OK if I credit you in a
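The mask-versus-multiply trade-off being discussed can be sketched concretely. The snippet below is an illustration only, not Kryo's actual code; the table size (128 buckets), the shift, and the class and method names are all assumptions:

```java
// Illustrative comparison of the two place() strategies discussed above.
// Table size (128 buckets), SHIFT, and all names here are assumptions.
public class Placement {
    static final int SHIFT = 57; // 64 - log2(128)
    static final int MASK = 127;

    // Fibonacci "magic multiply then shift": mixes high bits into the index.
    static int placeFibonacci(Object key) {
        return (int) (key.hashCode() * 0x9E3779B97F4A7C15L >>> SHIFT);
    }

    // Plain mask: cheaper, but only the low bits of hashCode() are used.
    static int placeMask(Object key) {
        return key.hashCode() & MASK;
    }

    public static void main(String[] args) {
        // Hash codes that differ only in high bits all collide under the mask;
        // Integer.hashCode() is the value itself, so these both mask to 0.
        System.out.println(placeMask(1 << 16) == placeMask(2 << 16)); // true
        // Fibonacci hashing spreads the same keys to different buckets.
        System.out.println(placeFibonacci(1 << 16) == placeFibonacci(2 << 16)); // false
    }
}
```

This is why the mask is fine for identity hash codes (already well mixed) but risky for hashCode() implementations whose entropy sits in the high bits.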
As for the contributors in your repo, I would be very happy to be added ^_^ So I'm not sure about the benchmark result.
It makes sense that the results aren't stable, I think, because this is using
@tommyettinger: Thanks a lot for your effort and input on this topic! Let me summarize briefly what we found out so far:
This is mainly due to simplified hashing and a lower load factor. According to my benchmarks, a load factor of 0.5 or 0.6 seems ideal, at least for the use case we are benchmarking. The simplified hashing only makes sense for high-quality hashes like identity hash codes.

This result is consistent in all my benchmarks. Once we reach a certain number of elements (somewhere between 500 and 1000), the Cuckoo map starts to perform significantly better. Although I'd very much like to delete the Cuckoo map from Kryo, I don't think it is a good idea until we find a replacement that comes closer in performance. However, there are other parts of Kryo that currently use

For reference, these are the benchmark results I got from running @tommyettinger's benchmark on my Linux box with a large number of forks and warmups. The error rate for the runs with > 100 elements is quite low, so I'm pretty sure these results are valid:
Hi, following your suggestion, I chose nameIdToClass for some benchmarks and tried to replace the IntMap; see my latest commit for details.
On the other hand, if we use a dynamic loadFactor and add a constructor parameter such as EnableDynamicLoadFactor, the overall performance of writeRead/read is more stable (5%-10% quicker than IntMap). Do you think it is necessary if we only use this trick in
So first: I think two parts of the PR are purely good optimizations and make no changes to the behavior; these are the places where locateKey() is inlined/unrolled. I suspect that just these optimizations to get() provide a major improvement over the ObjectMap without this PR. In all the benchmarks I have run, these optimizations are present; there's no reason not to have them. Well, unless you rely on users being able to extend ObjectMap and override its locateKey() method for equality; even then, it's just another place to change the equality comparison.

A more debatable change is in place(), making it just mask the hashCode() instead of doing the Fibonacci rehashing (the magic multiply and shift); this should help performance for good hashCode() implementations, but cause more collisions with low-quality hashCode() implementations. Unlike CuckooObjectMap, ObjectMap (and FastGetObjectMap) can handle having more collisions without, at some point, an OutOfMemoryError killing the program. CuckooObjectMap is likely to never encounter enough completely-colliding hash codes to have that crash if it sticks to using Class keys, but it cannot be safely used for third-party-submitted data as keys.

If the goal is to replace CuckooObjectMap, you can afford the same assumption that the hash codes will be at least moderately high-quality, and use the mask optimization in place(). I do use the mask optimization in my benchmarks of both FastGetObjectMap and ObjectMap; I can roll back this change to ObjectMap and benchmark again, but I think it's a reasonable change to make. The Fibonacci rehash can also sometimes increase hash collisions, particularly if all hash codes are Fibonacci numbers or low multiples thereof, so it isn't always the best option. The load factor changes could help, but they also reduce the ability of future code to easily adapt to changing needs for keys.
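To make the "inlined/unrolled locateKey()" point concrete, here is a minimal linear-probing sketch. This is an illustration under assumed names, not Kryo's actual source: the "before" version calls locateKey() and branches on its returned index, while the "after" version probes the table directly inside get():

```java
// Illustrative sketch of the get() unrolling discussed above (not Kryo's code).
public class UnrolledGet<K, V> {
    K[] keyTable;
    V[] valueTable;
    int mask;

    @SuppressWarnings("unchecked")
    UnrolledGet(int capacity) { // capacity must be a power of two
        keyTable = (K[]) new Object[capacity];
        valueTable = (V[]) new Object[capacity];
        mask = capacity - 1;
    }

    void put(K key, V value) { // simple linear probing, no resizing (demo only)
        int i = key.hashCode() & mask;
        while (keyTable[i] != null && !keyTable[i].equals(key)) i = (i + 1) & mask;
        keyTable[i] = key;
        valueTable[i] = value;
    }

    // Before: get() calls locateKey(), then branches on the returned index.
    int locateKey(K key) {
        for (int i = key.hashCode() & mask; ; i = (i + 1) & mask) {
            K other = keyTable[i];
            if (other == null) return -(i + 1); // empty slot: not found
            if (other.equals(key)) return i;
        }
    }

    V getViaLocate(K key) {
        int i = locateKey(key);
        return i < 0 ? null : valueTable[i];
    }

    // After: the probe loop is unrolled into get() itself, so the value is
    // returned as soon as the key is found, with no extra call or branch.
    V get(K key) {
        for (int i = key.hashCode() & mask; ; i = (i + 1) & mask) {
            K other = keyTable[i];
            if (other == null) return null; // empty slot: not found
            if (other.equals(key)) return valueTable[i];
        }
    }
}
```

The two paths return identical results; the unrolled version only removes a method call and a second comparison on the returned index, which is exactly why it is behavior-preserving.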
I think changing the default load factor in the constructor from 0.8f to something lower (0.7 is reasonable, maybe 0.5 would be too) could be good here; it uses more memory by default, but if the capacity isn't filled up, it makes minimal difference to performance. I would need to run more benchmarks, with more varied key counts, before I could tell where the load factor starts to matter and what might be a good balance.

I mentioned I would benchmark random Integers; these are (as expected) much faster than Classes to hash, because getting an identity hash code is a pretty involved process. Either fastGet or obj (with most of the optimizations in fastGet) is at the top of pretty much all benches here, though I only tried 500 for numIntegers (like numClasses before). I use a seed for this, which helps ensure more reliable benchmark results (I think I have it right).
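As a rough illustration of the load-factor trade-off (the helper below is an assumption for demonstration, not Kryo's API): with a power-of-two table, the load factor determines how large the backing array must be for a given element count, so a lower factor buys shorter probe sequences at the cost of memory:

```java
// Illustrative sizing helper (assumed, not Kryo's code): a map resizes when
// size exceeds capacity * loadFactor, so the table must hold at least
// ceil(elements / loadFactor) slots, rounded up to a power of two.
public class LoadFactorDemo {
    static int tableSizeFor(int expectedElements, float loadFactor) {
        int needed = (int) Math.ceil(expectedElements / loadFactor);
        int capacity = 1;
        while (capacity < needed) capacity <<= 1; // power-of-two table
        return capacity;
    }

    public static void main(String[] args) {
        // 2048 elements, as in the benchmark discussed in this thread:
        System.out.println(tableSizeFor(2048, 0.8f)); // 4096
        System.out.println(tableSizeFor(2048, 0.5f)); // 4096 (same table!)
        // But at other sizes the lower factor doubles the table:
        System.out.println(tableSizeFor(1500, 0.8f)); // 2048
        System.out.println(tableSizeFor(1500, 0.5f)); // 4096
    }
}
```

Note that because capacities round up to powers of two, lowering the factor from 0.8 to 0.5 only sometimes changes the actual table size, which is one reason the benchmark results depend so much on the exact element count.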
It's surprising how slow HashMap is with Integer keys, and CuckooObjectMap often isn't much better on writeRead benches. Here's my code for the benchmark with Integers:
@tommyettinger: Thanks again, and I fully agree with your analysis! Do you plan on integrating any of the changes into libGDX? If so, we should coordinate the changes, since Nate used to copy the map implementations from libGDX to Kryo in the past, and I'd like to keep them as compatible as possible.
@theigl I'd like to test with some of libGDX's more problematic keys, since the change in

Also, I'm leaving the decision for any changes up to @NathanSweet, since he has the final word in both projects, as far as I know. The get()/containsKey() unroll optimization is 100% safe in libGDX, the place() optimization needs more testing to see how it affects keys from libGDX, and the equate() change is currently only used in jdkgdxds and isn't necessary for the code to be used as it is now. As for jdkgdxds, it isn't possible to merge with libGDX any time soon, since it needs Java 8. I'm not in favor of changing the load factor forcibly without user control, but I am totally on board with changing the defaults to a lower load factor (maybe not by much).
@theigl @tommyettinger
To sum up, I'm curious what I should do next, and what actions you will take for this PR?
Correct. We will not replace it for now.
I will do some benchmarks in the coming days and get back to you. The
I think what tommyettinger means is that the inlining of the

The changes to the hashcode have to be evaluated separately for every sub-class and every usage of the map.
It doesn't look like there are controversial changes, so I'm good with whatever you guys come up with for making the maps faster/better! I do wonder if it's worth optimizing using inlining/unrolling in Java-land. Since this is so sensitive to JVM optimizations, maybe we should test with at least Java 16.
I'm pretty sure my tests were with Java 16, unless Maven was somehow forcing another version. That might explain some discrepancies, though everything is quite close. My JAVA_HOME is set to an AdoptOpenJDK 16 JDK.
@theigl
@pyb1993: Sorry that this takes so long. I'm very busy at my job at the moment and can only do minimal work on Kryo. I'll get back to you as soon as I have more time.
Are you still busy at your job?
So, I should probably follow up here. For a while, I was using several of the optimizations from FastGetObjectMap in various data structures in jdkgdxds (I'm on my phone because my home network is down, so getting a link is a challenge).

I had to roll back some changes to int- and long-keyed maps, because in one case where most of the changed bits were higher-order, an IntMap slowed down to a crawl (though it didn't crash). That case used the x coordinate in the upper 16 bits and y in the lower 16; for a 512x256 grid, the upper bits were the only part that changed, they weren't seen by 16-bit-or-lower hashes, and so each collided 512 times. I went back to Fibonacci hashing, which doesn't have this issue, and performance returned to normal.

Object-keyed maps still use the plain mask of the hashCode(), because most of those can be hashed differently by subclassing the ObjectMap in the few cases that need it. Identity hash codes also don't benefit from Fibonacci hashing.

I use an overridable equate() method in my Object-keyed maps and sets so users can compare keys for equality how they want, and can make case-insensitive data structures, among other changes. This separates the equality logic from locateKey(), and allows internals to be better encapsulated. It could be higher or lower performance; equate() is usually very small and so gets optimized well by HotSpot, but there is an extra method call if inlining hasn't happened. The equate() can also be used in equals(). Otherwise, the inlining has been good for me.
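The overridable equate() idea can be sketched as follows. This is modeled loosely after the description above, with assumed names, not code copied from jdkgdxds or Kryo; a subclass changes equality (and placement) without touching the probe logic:

```java
import java.util.Locale;

// Illustrative sketch of an overridable equate() hook (names assumed).
public class EquateDemo {
    static class SimpleSet {
        String[] table = new String[16]; // fixed size, no resizing (demo only)
        int mask = 15;

        // Equality hook: the only method a subclass overrides to change
        // how keys are compared, keeping the probe loop encapsulated.
        protected boolean equate(String left, String right) {
            return left.equals(right);
        }

        protected int place(String key) {
            return key.hashCode() & mask;
        }

        void add(String key) {
            int i = place(key);
            while (table[i] != null && !equate(key, table[i])) i = (i + 1) & mask;
            table[i] = key;
        }

        boolean contains(String key) {
            for (int i = place(key); ; i = (i + 1) & mask) {
                String other = table[i];
                if (other == null) return false;
                if (equate(key, other)) return true;
            }
        }
    }

    // Case-insensitive variant: equate() and place() both ignore case,
    // while add() and contains() are inherited unchanged.
    static class CaseInsensitiveSet extends SimpleSet {
        @Override protected boolean equate(String left, String right) {
            return left.equalsIgnoreCase(right);
        }
        @Override protected int place(String key) {
            return key.toLowerCase(Locale.ROOT).hashCode() & mask;
        }
    }
}
```

The design choice mirrors the comment above: equality logic lives in one small method that HotSpot can inline, rather than being baked into a locateKey() that users must override wholesale.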
@tommyettinger: Thanks a lot for your feedback! I finally found the time to revisit this topic, and the info you shared gives me more confidence in applying these changes. I will leave hashing in the maps with numeric keys as it is now and replace Fibonacci hashing in the other maps.

@pyb1993: I will create another PR with your changes to hashing and inlining and test the resulting code in our production environment.
@theigl This is all up to you and depends on your usage scenarios, but I should emphasize that Fibonacci hashing acts as a form of defense against heavy pileup of collisions when the lower bits of the keys' hashCode() are extremely similar or identical. If you know that a map will not deal with third-party-submitted, potentially-malicious String keys, or keys of some other type that can be engineered to collide completely, then CuckooObjectMap remains a good choice, and so does an ObjectMap with masking (without Fibonacci). But even for int keys, if the lower bits of those keys are more often than not duplicated in other keys, Fibonacci hashing provides a very significant speed improvement by using the upper bits to help reduce collisions.

My point is: if you control the keys at every stage and know their hashCode() implementations will be well-distributed, masking is a valid option, and will likely be slightly faster than Fibonacci hashing. That case is true for all identity-hashed maps, since the identity hash code is very thoroughly mixed. If you don't control the keys, or have to accept keys with poorly distributed hashCode() implementations, like Float or even unusual sets of int keys, then Fibonacci hashing can help considerably, as I described in my previous comment.

I also should mention that I ran some JMH benchmarks on CuckooObjectMap and some variants that are hardened somewhat against mass-collision scenarios; all variants had comparable speed to the original, with very slow insertion relative to any other maps I tried, and an insignificant improvement to lookup times (within the margin of error) relative to ObjectMap. I was rather discouraged by this, because I had been hoping to maybe switch some implementations to a variant of CuckooObjectMap, but it didn't substantially outperform the existing code in any metric.
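The grid-key pileup described earlier (x in the upper 16 bits, y in the lower 16) can be reproduced with a small experiment. The table size (1024 buckets) and the Fibonacci constant here are assumptions for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Counts distinct buckets for 512 grid keys of the form (x << 16 | y)
// with y fixed at 0, so only the high bits of each key vary.
public class GridKeyCollisions {
    static final int MASK = 1023; // 1024 buckets
    static final int SHIFT = 54;  // 64 - log2(1024)

    static int distinctMaskBuckets() {
        Set<Integer> buckets = new HashSet<>();
        for (int x = 0; x < 512; x++) buckets.add((x << 16) & MASK);
        return buckets.size(); // low bits are always zero: every key collides
    }

    static int distinctFibBuckets() {
        Set<Integer> buckets = new HashSet<>();
        for (int x = 0; x < 512; x++)
            buckets.add((int) ((x << 16) * 0x9E3779B97F4A7C15L >>> SHIFT));
        return buckets.size(); // the multiply mixes high bits into the index
    }

    public static void main(String[] args) {
        System.out.println("mask buckets: " + distinctMaskBuckets()); // 1
        System.out.println("fib buckets:  " + distinctFibBuckets());
    }
}
```

Under plain masking all 512 keys land in a single bucket (a 512-long probe chain for every lookup), while Fibonacci hashing spreads them across a large fraction of the table, which matches the IntMap slowdown and recovery described above.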
I just hardened one variant some more, and it can now successfully insert 256 keys with the same hashCode() (0 for all of them); I expect it performs a little worse than the existing CuckooObjectMap, and could perform substantially worse, but won't OutOfMemoryError on small input sets: https://github.com/tommyettinger/jdkgdxds/blob/master/src/test/java/com/github/tommyettinger/ds/test/ObjectObjectCuckooMap.java#L408
I decided against dropping Fibonacci hashing from all maps and instead only removed it from the two identity maps. Thank you very much for your advice!
I ran several days' worth of benchmarks against my PR #876 and was finally able to get rid of the CuckooObjectMap in Kryo. The only place where we used the CuckooMap was resolving registrations for classes. I replaced it with the improved version of

I pushed a custom Kryo version built from the PR into our production environment and will monitor it for a day or two. So far it looks like a 3-5% performance boost over master.
A 3-5% performance boost is great, considering no features were lost as payment. I like the approach where identity maps can rely on identityHashCode() producing good output, since the cases where it doesn't are virtually impossible, and being able to omit the long multiply and shift should help there (especially since getting an identityHashCode() can be slower than some hashCode() implementations). I also like not having to worry about collision buildup in other maps/sets, because those do still use Fibonacci hashing to mix the upper and lower bits.

I'm glad you were able to figure out a solution that avoids using CuckooObjectMap. The insertion speed in the Cuckoo maps was really surprisingly poor, to me, and the lookup speed seems very close, as you said, so the main advantage is lessened when compared to the maps in this PR. I think part of the problem is that cuckoo hashing really does seem to depend on an implementation of double hashing (where a key has two independent hashCode() values); C++ maps can provide this, but Java really can't. The stash behavior also only guarantees constant-time operations if it is fixed-size, and the larger the stash, the larger the constant factor. If the stash runs out of room and is fixed-size, it's a major problem. Being able to use an old standard like linear probing is certainly more comforting.
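For readers unfamiliar with the structure under discussion, here is a minimal cuckoo-style set illustrating the mechanics mentioned above: two candidate buckets per key (derived from one hashCode(), since Java lacks true double hashing), eviction on collision, and a stash as the escape hatch when an eviction chain loops. This is a toy sketch with assumed names, not Kryo's CuckooObjectMap:

```java
import java.util.ArrayList;
import java.util.List;

// Toy cuckoo-style set (illustrative only): lookups check at most two
// table slots plus a small stash; insertion evicts with a bounded chain.
public class MiniCuckoo {
    String[] table = new String[64];
    List<String> stash = new ArrayList<>(); // overflow for keys that loop

    // Both bucket functions derive from the single hashCode(): this is the
    // weakness noted above, since they are not truly independent hashes.
    int h1(String k) { return k.hashCode() & 63; }
    int h2(String k) { return k.hashCode() * 0x9E3779B9 >>> 26; }

    boolean contains(String k) {
        return k.equals(table[h1(k)]) || k.equals(table[h2(k)])
                || stash.contains(k);
    }

    void add(String k) {
        if (contains(k)) return;
        for (int i = 0; i < 16; i++) {      // bounded eviction chain
            int b1 = h1(k);
            if (table[b1] == null) { table[b1] = k; return; }
            int b2 = h2(k);
            if (table[b2] == null) { table[b2] = k; return; }
            String evicted = table[b1];     // kick out the occupant of b1
            table[b1] = k;
            k = evicted;                    // and find a new home for it
        }
        stash.add(k);                       // give up: the stash grows, and the
                                            // constant-time guarantee is lost
    }
}
```

The last two lines of add() are exactly the trade-off discussed in the comment above: a fixed-size stash keeps lookups constant-time but can overflow, while a growable stash degrades lookups; linear probing avoids the dilemma entirely.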
@pyb1993 @tommyettinger: Most changes suggested in this PR have been applied in #876 and #877. Many thanks to both of you! |
1. Unroll locateKey() into the get() method
There are two possible reasons this helps:
- the JIT doesn't always inline the locateKey method, so there is a little call overhead
- get() needs to compare the result of locateKey() and then decide whether to return the real value
2. Remove the magic number in place()
3. Automatically adjust the load factor
Benchmark Result
num = 100
num = 1000
num = 3000
num = 10000