
remove lru partid as hint. #2253

Merged · 2 commits merged into vesoft-inc:master on Jul 30, 2020

Conversation

xuguruogu
Collaborator

Using partId as the hint for the LRU cache is a bad idea, since each machine hosts only some of the partitions. This causes lock contention, which is a big performance issue.

PS:
Hashing a small amount of data that fits in the CPU's L1 cache has little impact on performance, since an access to main memory costs hundreds of CPU cycles.

@dangleptr
Contributor

Hmm... how many parts do you have on one host?

@xuguruogu
Collaborator Author

About 20, but there are more threads than parts.

@xuguruogu
Collaborator Author

Best practice is to use buckets of spinlocks that fit entirely in the L1 cache. CuckooHash, for example, uses 4K spin bit-locks for concurrency control.

@dangleptr
Contributor

Typically we have 100 parts on one host by default. The hint here avoids hash_combine over the tuple; it gave about a 2x performance improvement in my benchmark.

@dangleptr
Contributor

You mean compacting all the spinlocks into one contiguous area?

@xuguruogu
Collaborator Author

Yes. A spinlock is easy to implement with atomic operations. As long as there are far more lock buckets than threads, lock collisions are negligible.
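A rough back-of-the-envelope argument for this claim (my numbers, not from the thread): if each of $T$ threads holds at most one of $B$ uniformly chosen bucket locks at a time, the chance that a given acquisition collides with another thread is about

```latex
P(\text{collision}) \approx 1 - \left(1 - \frac{1}{B}\right)^{T-1} \approx \frac{T-1}{B}
```

so with, say, $T = 64$ threads and $B = 4096$ buckets, a given acquisition contends less than 2% of the time.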

@xuguruogu
Collaborator Author

A benchmark of the hash alone may mean little: memory accesses that miss the L1/L2/L3 CPU caches dominate the overall cost.

@dangleptr
Contributor

Makes sense. Let me think for a while about how to optimize it.

@dangleptr dangleptr added the ready-for-testing PR: ready for the CI test label Jul 28, 2020
@dangleptr (Contributor) previously approved these changes Jul 28, 2020, leaving a comment:

Accepting it now.
Let's optimize it afterwards with more buckets and compacted spinlocks to protect the cache line.

@xuguruogu
Collaborator Author

I need to correct myself: we have 100 parts by default across the whole cluster, not per host.

@dangleptr
Contributor

Please recheck the code style.

@dangleptr dangleptr merged commit 7149ba2 into vesoft-inc:master Jul 30, 2020
critical27 pushed a commit to critical27/nebula that referenced this pull request Aug 4, 2020
Co-authored-by: trippli <trippli@tencent.com>
Co-authored-by: dangleptr <37216992+dangleptr@users.noreply.github.com>
@xuguruogu xuguruogu deleted the remove-lru-hint branch August 6, 2020 10:19
tong-hao pushed a commit to tong-hao/nebula that referenced this pull request Jun 1, 2021
Co-authored-by: trippli <trippli@tencent.com>
Co-authored-by: dangleptr <37216992+dangleptr@users.noreply.github.com>