remove lru partid as hint. #2253
Hmm, how many parts do you have on one host?
About 20, but there are more threads than parts.
The best practice is to use an array of spinlock buckets that fits entirely in the L1 cache. For example, like CuckooHash, use 4K spin bit locks for concurrency control.
Typically, we have 100 parts on one host by default. The hint is there to avoid hash_combine on the tuple; it gives about a 2x performance improvement in my benchmark.
You mean packing all the spinlocks into one contiguous area?
Yes. A spinlock is easy to implement with atomic operations. As long as there are far more lock buckets than threads, lock collisions can be ignored.
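A minimal sketch of the packed spinlock table described above, not code from this PR. `SpinLockTable`, `kNumLocks`, and the method names are hypothetical, and it uses one byte-sized atomic flag per lock rather than the single bits mentioned for CuckooHash, so 4096 locks take about 4 KiB on common platforms:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: SpinLockTable and kNumLocks are illustrative names,
// not identifiers from this PR.
class SpinLockTable {
public:
    // Assumed size: 4096 one-byte flags, about 4 KiB on common platforms.
    static constexpr std::size_t kNumLocks = 4096;
    static_assert((kNumLocks & (kNumLocks - 1)) == 0,
                  "kNumLocks must be a power of two");

    // Spin until the lock selected by `hash` is acquired.
    void lock(std::size_t hash) {
        std::atomic<bool>& flag = slot(hash);
        while (flag.exchange(true, std::memory_order_acquire)) {
            // Wait on a plain load so the line is not hammered with writes.
            while (flag.load(std::memory_order_relaxed)) {
            }
        }
    }

    // Try once; returns true if the lock was acquired.
    bool tryLock(std::size_t hash) {
        return !slot(hash).exchange(true, std::memory_order_acquire);
    }

    void unlock(std::size_t hash) {
        std::atomic<bool>& flag = slot(hash);
        assert(flag.load(std::memory_order_relaxed) &&
               "unlock of a lock that is not held");
        flag.store(false, std::memory_order_release);
    }

private:
    // Map a hash value to one of the kNumLocks flags.
    std::atomic<bool>& slot(std::size_t hash) {
        return locks_[hash & (kNumLocks - 1)];
    }

    // All flags live in one contiguous array, zero-initialized (unlocked).
    std::array<std::atomic<bool>, kNumLocks> locks_{};
};
```

A caller would hash the cache key once, call `lock(h)`, touch the matching cache bucket, then call `unlock(h)`. With thousands of locks and tens of threads, two threads rarely pick the same slot.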
A benchmark of the hash alone may mean little. Memory accesses that miss the L1/L2/L3 CPU caches dominate the cost.
Makes sense. Let me think for a while about how to optimize it.
Accept it now.
Let's optimize it later with more buckets and compact spinlocks to protect the cache line.
I need to correct myself: by default we have 100 parts across the whole cluster.
Please recheck the code style.
c4ff2b7 to 17303c4
Co-authored-by: trippli <trippli@tencent.com> Co-authored-by: dangleptr <37216992+dangleptr@users.noreply.github.com>
Using partId as the hint for the LRU cache is a bad idea, because a machine holds only some of the partitions. With so few distinct hint values, threads contend for the same locks, which is a big performance issue.
PS:
Hashing a small amount of data that already sits in the CPU's L1 cache has little impact on performance, whereas a memory access costs hundreds of CPU cycles.
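To illustrate the contention argument, here is a rough sketch that counts how many buckets are actually used when the bucket is picked from a partId hint versus from a hash of the key. All names (`kNumBuckets`, `bucketByHint`, `bucketByKey`, `distinctBuckets`), the bucket count of 256, and the `vertex_N` key format are assumptions made for the example and are not taken from the cache implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <string>

// Assumed bucket count for the illustration; a power of two so that a
// bit mask can replace the modulo.
constexpr std::size_t kNumBuckets = 256;

// Hypothetical: pick the bucket from a caller-supplied hint such as partId.
inline std::size_t bucketByHint(std::int32_t hint) {
    return static_cast<std::size_t>(hint) & (kNumBuckets - 1);
}

// Hypothetical: pick the bucket from a hash of the key itself.
inline std::size_t bucketByKey(const std::string& key) {
    return std::hash<std::string>{}(key) & (kNumBuckets - 1);
}

// Count how many distinct buckets numKeys keys land in when the keys are
// spread round-robin over numParts partitions (partIds 1..numParts).
inline std::size_t distinctBuckets(std::size_t numKeys,
                                   std::size_t numParts,
                                   bool useHint) {
    assert(numParts > 0);
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < numKeys; ++i) {
        const std::string key = "vertex_" + std::to_string(i);
        const auto partId = static_cast<std::int32_t>(i % numParts) + 1;
        used.insert(useHint ? bucketByHint(partId) : bucketByKey(key));
    }
    return used.size();
}
```

With 20 parts on a host, `distinctBuckets(10000, 20, true)` can never exceed 20 no matter how many buckets exist, so every thread competes for those 20 locks. Hashing the key spreads the same requests over far more of the available buckets.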