Potential lock contention on TrieCache #10992
cc @Longarithm in case you have thoughts on this
We found a potential issue in `core/store/src/trie/trie_storage.rs` (line 142 at commit 4c0aa98):
The `update_cache` span has very little idle CPU time (3433 ns); if the delay were due to lock contention, it should have much higher idle time.
I instrumented the logic after acquiring the locks and found that when `update_cache` takes a long time, the time is indeed spent on the actual logic of updating the cache, not on acquiring the lock itself.
@Longarithm could we simplify the logic by actually removing an element when we delete it from the cache, instead of putting it into the deletion queue?
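To make the suggestion above concrete, here is a minimal sketch (not the actual `TrieCache` code; the types and names are hypothetical) contrasting a deletion queue, where removed keys linger in the map until a later drain pass, with removing the entry immediately:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical sketch, not the real TrieCache: contrasts queueing a
// deletion for later processing against removing the entry immediately.
struct SketchCache {
    map: HashMap<u64, Vec<u8>>,
    // Keys scheduled for removal; only processed when the queue is drained.
    deletion_queue: VecDeque<u64>,
}

impl SketchCache {
    fn new() -> Self {
        Self { map: HashMap::new(), deletion_queue: VecDeque::new() }
    }

    // Queued variant: cheap now, but stale entries keep occupying the map
    // (and the queue must be walked later).
    fn pop_queued(&mut self, key: u64) {
        self.deletion_queue.push_back(key);
    }

    fn drain_deletions(&mut self) {
        while let Some(key) = self.deletion_queue.pop_front() {
            self.map.remove(&key);
        }
    }

    // Direct variant: the entry is gone as soon as pop returns, so no
    // extra bookkeeping pass is needed.
    fn pop_direct(&mut self, key: u64) {
        self.map.remove(&key);
    }
}

fn main() {
    let mut cache = SketchCache::new();
    cache.map.insert(1, vec![0xAB]);
    cache.pop_queued(1);
    assert!(cache.map.contains_key(&1)); // still present until drained
    cache.drain_deletions();
    assert!(!cache.map.contains_key(&1));

    cache.map.insert(2, vec![0xCD]);
    cache.pop_direct(2);
    assert!(!cache.map.contains_key(&2)); // removed immediately
    println!("ok");
}
```

The trade-off is that the queued variant keeps deletion cheap on the hot path at the cost of deferred work and extra memory, which is presumably what the question is probing.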
The idle time metric is only really relevant for spans that are entered and/or exited multiple times, such as in asynchronous code. In other words, it is not transparently measuring something like CPU time for us and cannot be used as a proxy to determine whether the code within the span was actually idle or not.
@nagisa unfortunately I was not able to get to the bottom of this issue. From what I see, it seems that the cache simply isn't performant enough. Tracing shows that each put/pop takes at least 1 microsecond and sometimes ~10 microseconds, which is quite slow. That may be due to the number of entries in the cache, but could also be due to the tracing instrumentation (at 1 μs latency I suspect tracing adds nontrivial overhead). We need to look into it more. If it is indeed the performance of the cache that is the issue, I think there may be several ways we can resolve it.
tracing adds hundreds of nanoseconds of overhead per event/span, but it's also really taxing to use it to trace such small and frequent operations. Anyhow, yeah, I'm looking into it.
It certainly doesn't look like it could be lock contention; in that case we would see multiple tracing observations each time a lock is held for a long time, or ladder-shaped steps, or something along those lines. In practice we always see just one instance jump out. I think we can definitely afford to pre-allocate whatever size is necessary for the LRU per shard here. Note however that on my node I'm not seeing comparable observations.
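One way to rule out contention on the outer lock, in the spirit of the eventual fix, is to hold the map-level Mutex only long enough to fetch (or create) the per-shard cache handle and do the expensive update outside it. A minimal sketch, with hypothetical names standing in for the actual nearcore types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type ShardId = u64;

// Hypothetical stand-in for a per-shard trie cache.
#[derive(Default)]
struct ShardCache {
    entries: Mutex<HashMap<u64, Vec<u8>>>,
}

// Hypothetical stand-in for the map of per-shard caches.
#[derive(Default)]
struct Caches {
    per_shard: Mutex<HashMap<ShardId, Arc<ShardCache>>>,
}

impl Caches {
    fn update_cache(&self, shard: ShardId, key: u64, value: Vec<u8>) {
        // Hold the outer lock only while looking up / creating the handle.
        let cache = {
            let mut guard = self.per_shard.lock().unwrap();
            Arc::clone(guard.entry(shard).or_default())
        }; // outer guard dropped here

        // The potentially expensive update runs without the outer lock,
        // so callers touching other shards are never blocked by it.
        cache.entries.lock().unwrap().insert(key, value);
    }
}

fn main() {
    let caches = Caches::default();
    caches.update_cache(0, 42, vec![7]);
    let shard = Arc::clone(caches.per_shard.lock().unwrap().get(&0).unwrap());
    assert_eq!(shard.entries.lock().unwrap().get(&42), Some(&vec![7]));
    println!("ok");
}
```

With this shape, a slow cache update can only delay other writers to the *same* shard, which makes any remaining latency attributable to the cache logic itself rather than the shared map lock.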
Early results with a pre-allocated `LruCache`:
…rie caches (#10995)

This achieves two distinct things:

1. With the Mutex-based lock around the hashmap of `TrieCache`s we now only hold the lock strictly for as long as necessary, which *may* be as long as it takes to create a `TrieCache`, but that's a one-time occurrence outside of resharding scenarios as far as I can tell.
2. With the pre-allocation of the `LruCache` we get rid of the occasionally very expensive reallocation of the backing store of this data.

---

I haven't been running this code for very long, but this is what the graphs look like now (lowpass filter of 5 ms for the bottom graph, which shows span durations, so for the new graph this means that most of the stuff is under 5 ms; the top graph shows the "current" size of the shard trie caches).

Old (? I don't actually know what code this node is running):
![image](https://github.com/near/nearcore/assets/679122/6b101adf-f77f-49bc-a188-b11bb784645a)

New:
![image](https://github.com/near/nearcore/assets/679122/5b9a9251-9713-403a-9f50-64108f9c29cf)

Most notably, the occasional blips up above the 200 ms range per call are gone.

Fixes #10992

---

Co-authored-by: Bowen Wang <bowen@near.org>
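The pre-allocation point can be illustrated with a stdlib sketch (using `HashMap` as a stand-in backing store; the capacity constant is illustrative, not nearcore's actual per-shard size): reserving the full capacity up front means inserts never trigger the occasional rehash that shows up as a latency spike on a growable map.

```rust
use std::collections::HashMap;

// Illustrative capacity; nearcore sizes the per-shard LRU from config.
const CAP: usize = 1 << 16;

fn main() {
    // Growable map: capacity doubles as needed, and each rehash touches
    // every existing entry, which surfaces as an occasional slow insert.
    let mut growable: HashMap<u64, u64> = HashMap::new();

    // Pre-allocated map: room for CAP entries is reserved once, so no
    // insert below CAP entries ever has to reallocate the backing store.
    let mut preallocated: HashMap<u64, u64> = HashMap::with_capacity(CAP);
    let initial_capacity = preallocated.capacity();

    for i in 0..CAP as u64 {
        growable.insert(i, i);
        preallocated.insert(i, i);
    }

    // The pre-allocated map never reallocated: its capacity is unchanged.
    assert_eq!(preallocated.capacity(), initial_capacity);
    assert!(initial_capacity >= CAP);
    println!("ok");
}
```

The same reasoning applies to an LRU cache whose backing hashmap would otherwise grow on demand, which matches point 2 of the PR description above.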
After #10973, we still see a long time spent in `ShardTries::update_cache`. There is an example of `update_cache` taking more than 2 s after the node has been running for a while. This seems to point to lock contention on `TrieCache`. With the introduction of aggressive prefetching in #10935, there is potential contention when one thread tries to prefetch data while another thread calls `update_cache`. What is still unclear to me is how this can block for 2 s. It might be related to the `TrieCache` implementation, since it is not a simple LRU cache. This also happens quite rarely.