feat: chainstore: remove locking in index cache #10866
Conversation
LGTM
@@ -59,11 +59,9 @@ func (ci *ChainIndex) GetTipsetByHeight(ctx context.Context, from *types.TipSet,
 		return nil, xerrors.Errorf("failed to round down: %w", err)
 	}
 
-	ci.indexCacheLk.Lock()
do we need to have a transactional / consistent view of indexCache during this loop? Can there be other actors changing indexCache while this loop is running that will create unintended behavior?
fillCache has a comment that we want to be holding indexCacheLk during the call -- https://github.com/filecoin-project/lotus/blob/master/chain/store/index.go#L99
There's a lot going on in these loops. I think we just need to make sure that there's always a consistent view of the data
if we want to merge this, we should remove the comment here:
We also need to make sure this change is safe given the expectations of transactional consistency that GetTipsetByHeight previously provided. We may still need a lock to guarantee a transactional view of the state; that lock may be named indexCacheLk today, but it may want to be renamed to reflect the transaction it protects. Also, I haven't fully read the code yet, so let me know if anything here can be clarified!
@@ -27,8 +27,7 @@ func init() {
 }
 
 type ChainIndex struct {
-	indexCacheLk sync.Mutex
-	indexCache   map[types.TipSetKey]*lbEntry
+	indexCache *lru.ARCCache[types.TipSetKey, *lbEntry]
We initially had an ARCCache here, which was dropped in favor of a simple map with a mutex for the reasons pointed out here: #10423 (review)
I think we can keep the mutex but reduce the size of the critical section it covers.
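For illustration, a minimal Go sketch of what "reduce the critical section" could look like: the mutex is held only around cache reads and writes, not across the whole chain walk. The type, field, and method names here are simplified stand-ins, not the actual Lotus code.

package chainindex

import "sync"

// Sketch only: hypothetical simplified types standing in for the Lotus structures.
type cacheEntry struct {
	targetHeight int64
	target       string // stand-in for a tipset key
}

type chainIndex struct {
	mu    sync.Mutex
	cache map[string]*cacheEntry
}

// lookup holds the lock only for the map read, not for the chain walk around it.
func (ci *chainIndex) lookup(key string) (*cacheEntry, bool) {
	ci.mu.Lock()
	defer ci.mu.Unlock()
	e, ok := ci.cache[key]
	return e, ok
}

// store holds the lock only while inserting a freshly computed entry.
func (ci *chainIndex) store(key string, e *cacheEntry) {
	ci.mu.Lock()
	defer ci.mu.Unlock()
	ci.cache[key] = e
}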
It could be replaced by a simple LRU cache if that is a concern.
So this PR is essentially reducing the critical section of the chain index lock (roughly lines 64 to 93 in 90c8928).
However, with this change, if several processes call GetTipsetByHeight simultaneously, all of them will run the fillCache function (line 100 in 90c8928).
I'm not sure how often such a case happens, but if it does, it makes sense to keep the lock held until the end of fillCache. @magik6k Thoughts? Are the operations performed in fillCache relatively expensive?
We could add a lock per tipset (tsk) and ensure that only one caller runs fillCache at a time by holding that lock; that way subsequent callers should hit the cache. We would possibly see a lot of lock contention there, but there may not be a way to avoid that without changing something else, and that could be a different PR.
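A rough sketch of that per-tipset-lock idea, using the standard library's sync.Map to hold one mutex per key; the names and the string key are illustrative assumptions, not Lotus identifiers.

package chainindex

import "sync"

// fillLocks holds one mutex per tipset key. LoadOrStore guarantees that
// concurrent callers for the same key end up sharing a single mutex instance.
var fillLocks sync.Map // effectively map[string]*sync.Mutex

// withFillLock runs fill while holding the key's mutex, so only one caller
// walks the chain for a given tipset at a time; later callers should then
// find the entry already in the cache.
func withFillLock(key string, fill func()) {
	m, _ := fillLocks.LoadOrStore(key, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()
	fill()
}

One caveat of this shape is that the per-key mutexes are never removed, which is part of what makes the sharded-mutex variant discussed next attractive.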
It would be interesting to know if they all were building a cache for the same tipset chain or not. If it was the same tipset chain, then this behaviour is correct/beneficial, as @shrenujbansal points out.
I think the more correct, but still not too complex, thing to do here is to use a sharded mutex keyed by tipset during fillCache and for access to the cache. A sharded mutex is an array of a predetermined size containing mutexes. When locking by key, the key (the tipset key in this case) is hashed using a fast non-cryptographic hash function, and the mutex at that hash (modulo the number of shards) is locked. hash/maphash uses a runtime seed; this seed should be the same across invocations. maphash.String is preferred due to optimizations, and the seed can be generated using maphash.MakeSeed.
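A small Go sketch of such a sharded mutex, following the maphash suggestion above; the shard count and all names are assumptions for illustration rather than anything from the Lotus codebase.

package chainindex

import (
	"hash/maphash"
	"sync"
)

// shardedMutex is a fixed-size array of mutexes; a key is mapped to a shard by
// hashing it with a fast non-cryptographic hash (hash/maphash). The shard
// count of 32 follows the 16-or-32 suggestion in this thread.
type shardedMutex struct {
	seed   maphash.Seed
	shards [32]sync.Mutex
}

func newShardedMutex() *shardedMutex {
	// The seed is generated once and reused so a given key always maps to the
	// same shard for the lifetime of this shardedMutex.
	return &shardedMutex{seed: maphash.MakeSeed()}
}

// lock hashes the key, locks the matching shard, and returns its unlock func.
func (sm *shardedMutex) lock(key string) (unlock func()) {
	i := maphash.String(sm.seed, key) % uint64(len(sm.shards))
	sm.shards[i].Lock()
	return sm.shards[i].Unlock
}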
@arajasek can you share these goroutine dumps with me?
I will, on Slack.
@Kubuxu I think you are describing something like a "striped mutex": https://pkg.go.dev/github.com/nmvalera/striped-mutex. I haven't used this library before, but it's worth trying before writing our own.
Yeah, that is what I was explaining. I hadn't bothered to look.
For the Infura goroutine dumps, we discovered that there was a request made to get a block at height -76, and it looked like the ChainIndex was making some progress.
The negative epoch issue should be resolved by #10878.
Due to the above, I would recommend using a sharded lock with 16 or 32 shards, with the lock always being released before a new lock is taken. The rough lock fairness guarantees of Go should allow for forward progress. A possible refinement of my idea could be using the sharded lock only during
@Kubuxu Wouldn't a sharded mutex potentially result in the same problem as this? I would think an ad hoc process which regularly populates the cache with the updated epochs might be more suitable in this case. I'm not sure how we'd implement it within the process, though.
What we want to avoid is walking the 20 blocks needlessly. Only one of the threads can take a shard of the mutex for walking back the 20 blocks to build an index entry for the given tipset. Let's say the first tipset not covered by the index is TipsetA. Both threads reach that point concurrently and check the xsync.Map; there is no entry for TipsetA. Now they will attempt to lock shard number
The issue isn't recent epochs but far history.
This PR introduces a sharded mutex within ChainIndex#GetTipsetByHeight. It also replaces a Go map with xsync.Map, which doesn't require locking. The lock is taken when it appears that ChainIndex filling work should be started. After claiming the lock, the status of the cache is rechecked; if the entry is still missing, fillCache is started. Thanks to @snissn and @arajasek for debugging and taking initial stabs at this. Supersedes #10866 and #10885. Signed-off-by: Jakub Sztandera <kubuxu@protocol.ai>
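A hedged sketch of the check-lock-recheck flow that description outlines, reusing the shardedMutex type from the earlier sketch and using the standard library's sync.Map as a stand-in for xsync.Map; the function and variable names are illustrative, not the actual code from the superseding PR.

package chainindex

import "sync"

// getOrFill: the fast path returns a cached entry without taking any lock.
// On a miss it locks the shard for this key, rechecks the cache, and only
// then runs the expensive fill, so concurrent callers for the same tipset do
// the chain walk at most once.
func getOrFill(cache *sync.Map, locks *shardedMutex, key string, fill func() any) any {
	if v, ok := cache.Load(key); ok {
		return v // fast path: entry already cached
	}

	unlock := locks.lock(key)
	defer unlock()

	// Recheck after claiming the shard: another caller may have completed the
	// fill for this key while we were waiting on the lock.
	if v, ok := cache.Load(key); ok {
		return v
	}
	v := fill()
	cache.Store(key, v)
	return v
}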
Thanks for the discussion & work here everyone!
Related Issues
#10788?
Our friends working with Lotus shared 8 goroutine dumps from a node that was stuck syncing. Contention over the index cache's lock was a common theme.
Proposed Changes
Use a thread-safe cache instead of a mutex-protected map here.
Additional Info
Checklist
Before you mark the PR ready for review, please make sure that:
The PR title is in the form of <PR type>: <area>: <change being made>, e.g. fix: mempool: Introduce a cache for valid signatures
PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
area: e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps