Potential lock contention on TrieCache #10992

Closed
bowenwang1996 opened this issue Apr 8, 2024 · 9 comments · Fixed by #10995

Comments

@bowenwang1996
Collaborator

After #10973, we still see a long time spent in ShardTries::update_cache. Here is an example of update_cache taking more than 2s after the node has been running for a while.
(screenshot: Screen Shot 2024-04-08 at 3 44 13 PM)

This seems to point to lock contention on TrieCache. With the introduction of aggressive prefetching in #10935, there is potential contention when one thread tries to prefetch data while another thread calls update_cache. What is still unclear to me is how this can block for 2s. It might be related to the TrieCache implementation, since it is not a simple LRU cache. This also happens quite rarely.

@bowenwang1996
Collaborator Author

cc @Longarithm in case you have thoughts on this

@bowenwang1996
Copy link
Collaborator Author

bowenwang1996 commented Apr 8, 2024

We found a potential issue in the eviction loop `while self.total_size > self.total_size_limit || self.cache.len() == self.cache.cap()`. It is possible that a deleted key no longer exists in the cache, so when it is popped from the deletion queue, the size of the cache does not change, and this loop could potentially take a really long time. Another piece of evidence for this hypothesis is that in the trace linked above, the update_cache span has very little idle CPU time (3433 ns); if it were due to lock contention, it should have much higher idle time.
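
For illustration, here is a minimal sketch of how a deletion-queue-based eviction loop can spin without making progress (the struct and field names below are hypothetical simplifications, not the actual TrieCacheInner):

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical, simplified shape of the cache; not the real TrieCache fields.
struct Cache {
    cache: HashMap<u64, Vec<u8>>, // stand-in for the LRU map
    deletions: VecDeque<u64>,     // keys queued for deletion
    total_size: usize,
    total_size_limit: usize,
    capacity: usize,
}

impl Cache {
    fn evict(&mut self) {
        while self.total_size > self.total_size_limit || self.cache.len() == self.capacity {
            match self.deletions.pop_front() {
                Some(key) => {
                    // If `key` is already gone, `remove` returns None: neither
                    // `total_size` nor `len()` changes, so this iteration makes
                    // no progress toward terminating the loop.
                    if let Some(value) = self.cache.remove(&key) {
                        self.total_size -= value.len();
                    }
                }
                // The real code presumably falls back to ordinary LRU eviction
                // once the deletion queue is drained; this sketch simply stops.
                None => break,
            }
        }
    }
}
```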

@bowenwang1996
Collaborator Author

I instrumented the logic after acquiring locks and found that when it takes a long time, the time is indeed spent on the actual logic of updating the cache, not on acquiring the lock itself.

@bowenwang1996
Collaborator Author

@Longarithm could we simplify the logic by actually removing an element when we delete it from the cache instead of putting it into the deletion queue?
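
For clarity, a sketch of what that could look like, continuing the hypothetical `Cache` from the sketch above (not the actual nearcore code):

```rust
impl Cache {
    // Remove the entry eagerly instead of pushing the key onto `deletions`:
    // `total_size` and the map length can then never drift apart, and every
    // iteration of the eviction loop actually shrinks the cache.
    fn delete(&mut self, key: &u64) {
        if let Some(value) = self.cache.remove(key) {
            self.total_size -= value.len();
        }
    }
}
```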

@nagisa
Collaborator

nagisa commented Apr 9, 2024

> Another piece of evidence for this hypothesis is that in the trace linked above, the update_cache span has very little idle CPU time (3433 ns); if it were due to lock contention, it should have much higher idle time.

The idle time metric is only really relevant for spans that are entered and/or exited multiple times, such as in asynchronous code (where each poll is an individual entry/exit). In synchronous code the idle time will usually be quite low (even when waiting on a synchronous lock implementation), as the code usually enters the span immediately after it is created and does so only once before the span is destroyed.

In other words, it is not transparently measuring something like CPU time for us and cannot be used as a proxy to determine whether the code within the span was actually idle or not.
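
As a toy illustration of that point (not nearcore code): a synchronous span that is entered once attributes even time spent blocked on a `Mutex` to busy time, so `time.idle` stays near zero regardless of contention.

```rust
use std::sync::Mutex;
use std::time::Duration;

fn update_cache(lock: &Mutex<Vec<u8>>) {
    // Entered once, exited when `_span` drops at the end of the function, so
    // everything in between, including blocking on `lock()`, counts as busy
    // time. Idle time only accumulates between separate enter/exit pairs,
    // e.g. between polls of an async task.
    let _span = tracing::info_span!("update_cache").entered();
    let _guard = lock.lock().unwrap(); // waiting here still shows up as "busy"
    std::thread::sleep(Duration::from_millis(5)); // the actual work
}

fn main() {
    // FmtSpan::CLOSE prints time.busy / time.idle when each span closes.
    tracing_subscriber::fmt()
        .with_span_events(tracing_subscriber::fmt::format::FmtSpan::CLOSE)
        .init();
    update_cache(&Mutex::new(Vec::new()));
}
```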

@bowenwang1996
Collaborator Author

@nagisa unfortunately I was not able to get to the bottom of this issue. From what I see, it seems that the cache simply isn't performant enough. Tracing shows that each put/pop takes at least 1 microsecond and sometimes ~10 microseconds, which is quite slow. That may be due to the number of entries in the cache, but it could also be due to tracing instrumentation (at 1μs latency I suspect tracing adds nontrivial overhead). We need to look into it more. If it is indeed the performance of the cache that is the issue, I think there are several ways we could resolve it:

  • Optimize the cache operations as much as possible
  • Use heavier machinery such as lock-free LRU caches (it looks like there is a Rust implementation)
  • Do I understand correctly that for Trie correctness we only need to force a cache update on deletion but not necessarily on insertion? It seems we mostly rely on trie nodes being fetched during execution anyway, and I am not sure how much it helps to do a blocking update when writing trie changes. cc @Longarithm

@nagisa
Collaborator

nagisa commented Apr 9, 2024

tracing adds hundreds of nanoseconds of overhead per event/span, but it's also really taxing to use it to trace such small and frequent operations.

(I was also selectively tracing only store::update and store::trie::shard_tries targets to minimize the tracing overhead as much as possible)

Anyhow, yeah, I'm looking into it.
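
For reference, selectively enabling only those two targets could look roughly like this with tracing-subscriber's EnvFilter (a sketch; the target names are taken from the comment above, the exact filter string on the node may differ, and this requires tracing-subscriber's `env-filter` feature):

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Only spans/events under these two targets are recorded at TRACE level;
    // everything else stays at the default (INFO), keeping per-op overhead low.
    let filter = EnvFilter::new("info,store::update=trace,store::trie::shard_tries=trace");
    tracing_subscriber::fmt().with_env_filter(filter).init();
    // ... run the workload being traced ...
}
```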

@nagisa
Collaborator

nagisa commented Apr 9, 2024

> Tracing shows that each put/pop takes at least 1 microsecond and sometimes ~10 microseconds, which is quite slow.

As for TrieCacheInner::put speed: when we're looking at ShardTries::update_cache calls that take upwards of 200ms, they still contain only about 1000 ops for the cache to deal with. Even with nothing under that call instrumented (so no tracing overhead) we're looking at 200+µs/op on average.

It certainly doesn't look like it could be lock contention -- in that case we would see multiple tracing observations each time a lock is taken for a long time, or ladder-shaped steps, or something along those lines. In practice we always see just one instance jump out.


One caveat is that the LruCache is "unbounded". Looking at the cache size, it seems to be approaching 24MiB on my node. If the comments are to be believed, with 100 to 500 bytes per cached value, this would come out to somewhere between 50k and 250k entries. Reallocation of this data structure can become a somewhat notable expense. I think this is a plausible cause for the occasional freak observations, given that these observations get progressively rarer and longer.

I think we can definitely afford to pre-allocate whatever size is necessary for the LRU per shard here.
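
As a sketch of the difference, using the `lru` crate (an assumption about which crate backs the cache; in recent versions `LruCache::new` takes a `NonZeroUsize` and sizes the backing map up front):

```rust
use std::num::NonZeroUsize;
use lru::LruCache;

fn main() {
    // Unbounded: the backing map starts small and grows as entries are added,
    // so an insert occasionally pays for reallocating and rehashing the whole
    // table. As the cache grows, those pauses get rarer but longer, matching
    // the "progressively rarer and longer" freak observations.
    let mut unbounded: LruCache<u64, Vec<u8>> = LruCache::unbounded();
    unbounded.put(1, vec![0u8; 100]);

    // Bounded with the capacity known up front: the backing storage is sized
    // at construction, so steady-state puts should not trigger reallocation.
    let cap = NonZeroUsize::new(250_000).unwrap(); // roughly the upper entry estimate above
    let mut bounded: LruCache<u64, Vec<u8>> = LruCache::new(cap);
    bounded.put(1, vec![0u8; 100]);
}
```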


Note however that on my node I'm not seeing observations of ShardTries::update_cache taking more than 1 second at all, even after an entire night of runtime. The longest I'm seeing is 350ms... But this is with the mutex-based lock implementation.

@nagisa
Collaborator

nagisa commented Apr 9, 2024

Early results with a pre-allocated LruCache seem promising. I will keep it running for a little while longer, see what things end up looking like after a few hours, and then submit a PR.

github-merge-queue bot pushed a commit that referenced this issue Apr 9, 2024
…rie caches (#10995)

This achieves two distinct things:

1. With the Mutex-based lock around the hashmap of `TrieCache`s we now only hold the lock strictly for as long as necessary, which *may* be as long as it takes to create a `TrieCache`, but as far as I can tell that's a one-time occurrence outside of resharding scenarios (see the sketch after this list).
2. With the pre-allocation of the `LruCache` we get rid of the
occasionally-very-expensive reallocation of the backing store of this
data.
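
A rough sketch of the first point, with hypothetical stand-in types (the real ShardTries/TrieCache code differs): clone the per-shard `Arc` out of the map while the outer lock is held, then release it before doing the actual cache update.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type ShardUId = u32;              // hypothetical stand-in
struct TrieCache(Mutex<Vec<u8>>); // per-shard cache behind its own lock

fn update_cache(
    caches: &Mutex<HashMap<ShardUId, Arc<TrieCache>>>,
    shard: ShardUId,
    data: Vec<u8>,
) {
    // Hold the outer lock only long enough to look up (or, rarely, create)
    // the per-shard cache, then clone the Arc and drop the guard.
    let cache = {
        let mut map = caches.lock().unwrap();
        map.entry(shard)
            .or_insert_with(|| Arc::new(TrieCache(Mutex::new(Vec::new()))))
            .clone()
    }; // outer MutexGuard dropped here

    // The potentially expensive update now only contends on this shard's lock.
    let mut inner = cache.0.lock().unwrap();
    inner.extend(data);
}
```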

---

I haven't been running this code for very long, but this is what the graphs look like now (low-pass filter of 5ms for the bottom graph, which shows span durations, so for the new graph this means that most of the stuff is under 5ms; the top graph shows the "current" size of the shard trie caches).

Old(?; I don't actually know what code this node is running):


![image](https://github.com/near/nearcore/assets/679122/6b101adf-f77f-49bc-a188-b11bb784645a)

New:


![image](https://github.com/near/nearcore/assets/679122/5b9a9251-9713-403a-9f50-64108f9c29cf)

Most notably, the occasional blips up into and above the 200ms range per call are gone.

Fixes #10992

---------

Co-authored-by: Bowen Wang <bowen@near.org>
VanBarbascu pushed a commit that referenced this issue Apr 17, 2024
…rie caches (#10995)
