Potential lock contention on TrieCache #10992

Closed
bowenwang1996 opened this issue Apr 8, 2024 · 9 comments · Fixed by #10995

Comments

@bowenwang1996
Collaborator

After #10973, we still see a long time spent in ShardTries::update_cache. Here is an example of update_cache taking more than 2s after the node has been running for a while.
(screenshot: Screen Shot 2024-04-08 at 3 44 13 PM)

This seems to point to lock contention on TrieCache. With the introduction of aggressive prefetching in #10935, there is potential contention when one thread tries to prefetch data while another thread calls update_cache. What is still unclear to me is how this can block for 2s. It might be related to the TrieCache implementation, since it is not a simple LRU cache. This also happens quite rarely.

@bowenwang1996
Collaborator Author

cc @Longarithm in case you have thoughts on this

@bowenwang1996
Copy link
Collaborator Author

bowenwang1996 commented Apr 8, 2024

We found a potential issue in the eviction loop `while self.total_size > self.total_size_limit || self.cache.len() == self.cache.cap()`. It is possible that a deleted key no longer exists in the cache, so when it is popped from the deletion queue, the size of the cache does not change, and this loop could potentially take a really long time. Another piece of evidence for this hypothesis is that in the trace linked above, the update_cache span has very little idle CPU time (3433 ns); if it were due to lock contention, it should have much higher idle time.
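
For illustration, here is a minimal sketch of how a deletion-queue-based eviction loop can spin without making progress (the struct and field names below are hypothetical simplifications, not the actual TrieCacheInner):

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical, simplified shape of the cache; not the real TrieCache fields.
struct Cache {
    cache: HashMap<u64, Vec<u8>>, // stand-in for the LRU map
    deletions: VecDeque<u64>,     // keys queued for deletion
    total_size: usize,
    total_size_limit: usize,
    capacity: usize,
}

impl Cache {
    fn evict(&mut self) {
        while self.total_size > self.total_size_limit || self.cache.len() == self.capacity {
            match self.deletions.pop_front() {
                Some(key) => {
                    // If `key` is already gone, `remove` returns None: neither
                    // `total_size` nor `len()` changes, so this iteration makes
                    // no progress toward terminating the loop.
                    if let Some(value) = self.cache.remove(&key) {
                        self.total_size -= value.len();
                    }
                }
                // The real code presumably falls back to ordinary LRU eviction
                // once the deletion queue is drained; this sketch simply stops.
                None => break,
            }
        }
    }
}
```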

@bowenwang1996
Collaborator Author

I instrumented the logic after acquiring locks and found that when it takes a long time, the time is indeed spent on the actual logic of updating the cache, not on acquiring the lock itself.

@bowenwang1996
Collaborator Author

@Longarithm could we simplify the logic by actually removing an element when we delete it from the cache instead of putting it into the deletion queue?
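
For clarity, a sketch of what that could look like, continuing the hypothetical `Cache` from the sketch above (not the actual nearcore code):

```rust
impl Cache {
    // Remove the entry eagerly instead of pushing the key onto `deletions`:
    // `total_size` and the map length can then never drift apart, and every
    // iteration of the eviction loop actually shrinks the cache.
    fn delete(&mut self, key: &u64) {
        if let Some(value) = self.cache.remove(key) {
            self.total_size -= value.len();
        }
    }
}
```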

@nagisa
Collaborator

nagisa commented Apr 9, 2024

> Another piece of evidence for this hypothesis is that in the trace linked above, the update_cache span has very little idle CPU time (3433 ns); if it were due to lock contention, it should have much higher idle time.

The idle time metric is only really relevant for spans that are entered and/or exited multiple times, such as in asynchronous code (where each poll is an individual entry/exit). In synchronous code the idle time will usually be quite low (even when waiting on a synchronous lock implementation), as the code usually enters the span immediately after it is created and does so only once before the span is destroyed.

In other words, it is not transparently measuring something like CPU time for us and cannot be used as a proxy to determine whether the code within the span was actually idle or not.
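
As a toy illustration of that point (not nearcore code): a synchronous span that is entered once attributes even time spent blocked on a `Mutex` to busy time, so `time.idle` stays near zero regardless of contention.

```rust
use std::sync::Mutex;
use std::time::Duration;

fn update_cache(lock: &Mutex<Vec<u8>>) {
    // Entered once, exited when `_span` drops at the end of the function, so
    // everything in between, including blocking on `lock()`, counts as busy
    // time. Idle time only accumulates between separate enter/exit pairs,
    // e.g. between polls of an async task.
    let _span = tracing::info_span!("update_cache").entered();
    let _guard = lock.lock().unwrap(); // waiting here still shows up as "busy"
    std::thread::sleep(Duration::from_millis(5)); // the actual work
}

fn main() {
    // FmtSpan::CLOSE prints time.busy / time.idle when each span closes.
    tracing_subscriber::fmt()
        .with_span_events(tracing_subscriber::fmt::format::FmtSpan::CLOSE)
        .init();
    update_cache(&Mutex::new(Vec::new()));
}
```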

@bowenwang1996
Collaborator Author

@nagisa unfortunately I was not able to get to the bottom of this issue. From what I see, it seems that the cache simply isn't performant enough. Tracing shows that each put/pop takes at least 1 microsecond and sometimes ~10 microseconds, which is quite slow. That may be due to the number of entries in the cache, but it could also be due to tracing instrumentation (at 1μs latency I suspect tracing adds nontrivial overhead). We need to look into it more. If it is indeed the performance of the cache that is the issue, I think there are several ways we could resolve it:

  • Optimize the cache operations as much as possible
  • Use heavier machinery such as lock-free LRU caches (it looks like there is a Rust implementation)
  • Do I understand correctly that for Trie correctness we only need to force a cache update on deletion but not necessarily on insertion? It seems we mostly rely on trie nodes being fetched during execution anyway, and I am not sure how much it helps to do a blocking update when writing trie changes. cc @Longarithm

@nagisa
Collaborator

nagisa commented Apr 9, 2024

tracing adds hundreds of nanoseconds of overhead per event/span, but it's also really taxing to use it to trace such small and frequent operations.

(I was also selectively tracing only store::update and store::trie::shard_tries targets to minimize the tracing overhead as much as possible)

Anyhow, yeah, I'm looking into it.
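
For reference, selectively enabling only those two targets could look roughly like this with tracing-subscriber's EnvFilter (a sketch; the target names are taken from the comment above, the exact filter string on the node may differ, and this requires tracing-subscriber's `env-filter` feature):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Only spans/events under these two targets are recorded at TRACE level;
    // everything else stays at the default (INFO), keeping per-op overhead low.
    let filter = EnvFilter::new("info,store::update=trace,store::trie::shard_tries=trace");
    tracing_subscriber::fmt().with_env_filter(filter).init();
    // ... run the workload being traced ...
}
```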

@nagisa
Collaborator

nagisa commented Apr 9, 2024

> Tracing shows that each put/pop takes at least 1 microsecond and sometimes ~10 microseconds, which is quite slow.

As for TrieCacheInner::put speed: when we're looking at ShardTries::update_cache calls that take upwards of 200ms, they still contain only about 1000 ops for the cache to deal with. Even with nothing under that call instrumented (so no tracing overhead) we're looking at 200+µs/op on average.

It certainly doesn't look like it could be lock contention -- in that case we would see multiple tracing observations each time a lock is taken for a long time, or ladder-shaped steps, or something along those lines. In practice we always see just one instance jump out.


One caveat is that the LruCache is "unbounded". Looking at the cache size, it seems to be approaching 24MiB on my node. If the comments are to be believed, with 100 to 500 bytes per cached value, this would come out to somewhere between 50k and 250k entries. Reallocation of this data structure can become a somewhat notable expense. I think this is a plausible cause for the occasional freak observations, given that these observations get progressively rarer and longer.

I think we can definitely afford to pre-allocate whatever size is necessary for the LRU per shard here.
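
As a sketch of the difference, using the `lru` crate (an assumption about which crate backs the cache; in recent versions `LruCache::new` takes a `NonZeroUsize` and sizes the backing map up front):

```rust
use std::num::NonZeroUsize;
use lru::LruCache;

fn main() {
    // Unbounded: the backing map starts small and grows as entries are added,
    // so an insert occasionally pays for reallocating and rehashing the whole
    // table. As the cache grows, those pauses get rarer but longer, matching
    // the "progressively rarer and longer" freak observations.
    let mut unbounded: LruCache<u64, Vec<u8>> = LruCache::unbounded();
    unbounded.put(1, vec![0u8; 100]);

    // Bounded with the capacity known up front: the backing storage is sized
    // at construction, so steady-state puts should not trigger reallocation.
    let cap = NonZeroUsize::new(250_000).unwrap(); // roughly the upper entry estimate above
    let mut bounded: LruCache<u64, Vec<u8>> = LruCache::new(cap);
    bounded.put(1, vec![0u8; 100]);
}
```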


Note however that on my node I'm not seeing observations of ShardTries::update_cache taking more than 1 second at all, even after an entire night of runtime. The longest I'm seeing is 350ms... But this is with the mutex-based lock implementation.

@nagisa
Collaborator

nagisa commented Apr 9, 2024

Early results with a pre-allocated LruCache seem promising. I will keep it running for a little while longer, see what things end up looking like after a few hours, and then submit a PR.

github-merge-queue bot pushed a commit that referenced this issue Apr 9, 2024
…rie caches (#10995)

This achieves two distinct things:

1. With the Mutex-based lock around the hashmap of `TrieCache`s we now only hold the lock strictly for as long as necessary, which *may* be as long as it takes to create a `TrieCache`, but as far as I can tell that's a one-time occurrence outside of resharding scenarios (see the sketch after this list).
2. With the pre-allocation of the `LruCache` we get rid of the
occasionally-very-expensive reallocation of the backing store of this
data.
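
A rough sketch of the first point, with hypothetical stand-in types (the real ShardTries/TrieCache code differs): clone the per-shard `Arc` out of the map while the outer lock is held, then release it before doing the actual cache update.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type ShardUId = u32;              // hypothetical stand-in
struct TrieCache(Mutex<Vec<u8>>); // per-shard cache behind its own lock

fn update_cache(
    caches: &Mutex<HashMap<ShardUId, Arc<TrieCache>>>,
    shard: ShardUId,
    data: Vec<u8>,
) {
    // Hold the outer lock only long enough to look up (or, rarely, create)
    // the per-shard cache, then clone the Arc and drop the guard.
    let cache = {
        let mut map = caches.lock().unwrap();
        map.entry(shard)
            .or_insert_with(|| Arc::new(TrieCache(Mutex::new(Vec::new()))))
            .clone()
    }; // outer MutexGuard dropped here

    // The potentially expensive update now only contends on this shard's lock.
    let mut inner = cache.0.lock().unwrap();
    inner.extend(data);
}
```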

---

I haven't been running this code for very long, but this is what the graphs look like now (low-pass filter of 5ms for the bottom graph, which shows span durations, so for the new graph this means that most of the stuff is under 5ms; the top graph shows the "current" size of the shard trie caches).

Old(?; I don't actually know what code this node is running):


![image](https://github.com/near/nearcore/assets/679122/6b101adf-f77f-49bc-a188-b11bb784645a)

New:


![image](https://github.com/near/nearcore/assets/679122/5b9a9251-9713-403a-9f50-64108f9c29cf)

Most notably, the occasional blips up into and above the 200ms range per call are gone.

Fixes #10992

---------

Co-authored-by: Bowen Wang <bowen@near.org>
VanBarbascu pushed a commit that referenced this issue Apr 17, 2024
…rie caches (#10995)
