
feat: chainstore: remove locking in index cache #10866

Closed
wants to merge 1 commit

Conversation

@arajasek (Contributor)

Related Issues

#10788?

Our friends working with Lotus shared 8 goroutine dumps from a node that was stuck syncing. Contention over the index cache's lock was a common theme.

Proposed Changes

Use a thread-safe cache instead of a mutex-protected map here.

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@arajasek arajasek requested a review from a team as a code owner May 12, 2023 21:32
@fridrik01 (Contributor) left a comment

LGTM

```diff
@@ -59,11 +59,9 @@ func (ci *ChainIndex) GetTipsetByHeight(ctx context.Context, from *types.TipSet,
 		return nil, xerrors.Errorf("failed to round down: %w", err)
 	}
 
-	ci.indexCacheLk.Lock()
```
A contributor commented:

Do we need to have a transactional / consistent view of indexCache during this loop? Can there be other actors changing indexCache while this loop is running that will create unintended behavior?

fillCache has a comment that we want to be holding indexCacheLk during the call -- https://github.com/filecoin-project/lotus/blob/master/chain/store/index.go#L99

There's a lot going on in these loops. I think we just need to make sure that there's always a consistent view of the data.

@snissn (Contributor) commented May 16, 2023

If we want to merge this, we should remove the comment here:

https://github.com/filecoin-project/lotus/pull/10866/files#diff-806866d49e06f388f52056a4cce962c76af2eb7218913aa473da3c15fcd65e18R97

```go
// Caller must hold indexCacheLk
func (ci *ChainIndex) fillCache(ctx context.Context, tsk types.TipSetKey) (*lbEntry, error) {
```

also we need to make sure that this change is safe based on the expectations of transactional consistency that was previously in

GetTipsetByHeight

We may need to keep a lock to ensure a transactional view of the state; that lock may be named indexCacheLk, but it may want to be renamed to indicate the transaction it protects.

Also, I haven't fully read the code yet, so let me know if anything here can be clarified!

```diff
@@ -27,8 +27,7 @@ func init() {
 }
 
 type ChainIndex struct {
-	indexCacheLk sync.Mutex
-	indexCache   map[types.TipSetKey]*lbEntry
+	indexCache *lru.ARCCache[types.TipSetKey, *lbEntry]
```
A contributor commented:

We initially had an ARCCache here, which was dropped in favor of a simple map with a mutex for the reasons pointed out here: #10423 (review)
I think we can keep the mutex but reduce the size of the critical section it covers.

Another contributor replied:

It could be replaced by a simple LRU cache if that is a concern.
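For illustration, a minimal thread-safe LRU along these lines can be built from the standard library alone (container/list plus a mutex). This is a hypothetical sketch, not the lru package referenced in the diff above:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// lruCache is a minimal thread-safe LRU cache sketch. The front of the
// list holds the most recently used entry; the back is evicted first.
type lruCache struct {
	mu    sync.Mutex
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding an *entry
}

type entry struct {
	key string
	val int
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, ll: list.New(), items: make(map[string]*list.Element)}
}

func (c *lruCache) Get(key string) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.ll.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).val, true
}

func (c *lruCache) Put(key string, val int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.ll.MoveToFront(el)
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, val})
	if c.ll.Len() > c.cap {
		// Evict the least recently used entry.
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Put("c", 3) // evicts "a", the least recently used
	_, ok := c.Get("a")
	fmt.Println(ok) // false
}
```

In practice a production choice would more likely be an existing package such as hashicorp/golang-lru (which provides both plain LRU and ARC variants), but the sketch shows that a simple LRU only needs one coarse mutex around short map/list operations.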

@shrenujbansal (Contributor)

So this PR is essentially reducing the critical section of the chain index lock from

```go
cur := rounded.Key()
for {
	lbe, ok := ci.indexCache[cur]
	if !ok {
		fc, err := ci.fillCache(ctx, cur)
		if err != nil {
			return nil, xerrors.Errorf("failed to fill cache: %w", err)
		}
		lbe = fc
	}
	if to == lbe.targetHeight {
		ts, err := ci.loadTipSet(ctx, lbe.target)
		if err != nil {
			return nil, xerrors.Errorf("failed to load tipset: %w", err)
		}
		return ts, nil
	}
	if to > lbe.targetHeight {
		ts, err := ci.loadTipSet(ctx, cur)
		if err != nil {
			return nil, xerrors.Errorf("failed to load tipset: %w", err)
		}
		return ci.walkBack(ctx, ts, to)
	}
	cur = lbe.target
}
```
to the individual cache operations.

However, with this change, in the event that several processes call GetTipsetByHeight simultaneously, all these processes will run the fillCache function

```go
func (ci *ChainIndex) fillCache(ctx context.Context, tsk types.TipSetKey) (*lbEntry, error) {
```
which will be wasteful.
I'm not sure whether such a case happens often, but if it does, it makes sense to keep the lock held until the end of fillCache.

@magik6k Thoughts? Are the operations performed in fillCache relatively expensive?

@snissn (Contributor) commented May 16, 2023

However, with this change, in the event that several processes call GetTipsetByHeight simultaneously, all these processes will run the fillCache function

We could add a lock on each tipset key (tsk) and ensure that only one caller of fillCache runs at a time by holding that lock; that way, subsequent callers should hit the cache. We would possibly see a lot of lock contention there, but there may not be a way to avoid that without changing something else, so that could be a different PR.
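A sketch of this per-tipset-lock idea, assuming a sync.Map of lazily created mutexes. The names fillOnce, cache, and locks are illustrative, not Lotus code:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	cache sync.Map // key -> cached value
	locks sync.Map // key -> *sync.Mutex, created on demand
)

// fillOnce ensures only one caller computes the value for a given key;
// callers that lose the race block on the same per-key mutex and then
// find the entry already cached.
func fillOnce(key string, fill func(string) string) string {
	lkIface, _ := locks.LoadOrStore(key, &sync.Mutex{})
	lk := lkIface.(*sync.Mutex)
	lk.Lock()
	defer lk.Unlock()
	// Re-check under the lock: the winner may have filled the cache.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	v := fill(key)
	cache.Store(key, v)
	return v
}

func main() {
	calls := 0
	var mu sync.Mutex
	fill := func(k string) string {
		mu.Lock()
		calls++
		mu.Unlock()
		return "filled-" + k
	}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); fillOnce("tipsetA", fill) }()
	}
	wg.Wait()
	fmt.Println(calls) // 1: only one goroutine ran fill for the key
}
```

One caveat of this approach is that the lock map grows with the number of distinct keys ever seen; a sharded mutex, as discussed later in this thread, bounds memory by hashing keys into a fixed set of mutexes.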

@Kubuxu (Contributor) commented May 16, 2023

Our friends working with Lotus shared 8 goroutine dumps from a node that was stuck syncing. Contention over the index cache's lock was a common theme.

It would be interesting to know if they all were building a cache for the same tipset chain or not. If it was the same tipset chain then this behaviour is correct/beneficial as @shrenujbansal points out.

@Kubuxu (Contributor) commented May 16, 2023

I think the more correct, but still not too complex, thing to do here is to use a sharded mutex keyed on the tipset during fillCache and for access to the cache.

A sharded mutex is an array of predetermined size containing mutexes. When trying to lock the mutex for a key, the key (tipset key in this case) is hashed using a fast non-cryptographic hash function and the mutex at digest % N_shards is taken.

hash/maphash.Hash uses a runtime seed, which stays the same across hash invocations within a process. maphash.String is preferred due to optimizations. The seed can be generated using maphash.MakeSeed.
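The scheme described above can be sketched as follows; the shardedMutex type and the shard count of 16 are illustrative assumptions, not Lotus code:

```go
package main

import (
	"fmt"
	"hash/maphash"
	"sync"
)

// nShards is a hypothetical shard count; the thread suggests 16 or 32.
const nShards = 16

// shardedMutex maps a key to one of a fixed set of mutexes by hashing
// the key with hash/maphash and indexing with digest % nShards.
type shardedMutex struct {
	seed   maphash.Seed
	shards [nShards]sync.Mutex
}

func newShardedMutex() *shardedMutex {
	// The seed is generated once so the same key always maps to the
	// same shard within this process.
	return &shardedMutex{seed: maphash.MakeSeed()}
}

func (sm *shardedMutex) shardFor(key string) *sync.Mutex {
	// maphash.String is a fast, non-cryptographic, runtime-seeded hash.
	return &sm.shards[maphash.String(sm.seed, key)%nShards]
}

func (sm *shardedMutex) Lock(key string)   { sm.shardFor(key).Lock() }
func (sm *shardedMutex) Unlock(key string) { sm.shardFor(key).Unlock() }

func main() {
	sm := newShardedMutex()
	sm.Lock("tipset-key")
	sm.Unlock("tipset-key")
	// Equal keys always hash to the same shard.
	fmt.Println(sm.shardFor("x") == sm.shardFor("x")) // true
}
```

Distinct keys can collide on a shard, so unrelated callers occasionally contend, but memory stays fixed at nShards mutexes regardless of how many keys are seen.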

@Kubuxu (Contributor) commented May 16, 2023

@arajasek can you share these goroutine dumps with me?

@jennijuju (Member)

@arajasek can you share these goroutine dumps with me?

I will on slack

@snissn (Contributor) commented May 16, 2023

@Kubuxu I think you are describing something like a "striped mutex": https://pkg.go.dev/github.com/nmvalera/striped-mutex. I have not used this library before, but it is worth trying before writing our own.

@Kubuxu (Contributor) commented May 16, 2023

Yeah, that is what I was explaining. I hadn't bothered to look for an existing one.

@Kubuxu (Contributor) commented May 16, 2023

For the Infura goroutine dumps, we discovered that there was a request made to get a block at height -76, and it looked like the ChainIndex was making some progress.

@Kubuxu (Contributor) commented May 17, 2023

The negative epoch issue should be resolved by: #10878
Despite this, a DoS vector still exists when requesting epoch 0 or 1 (there might be a special case for epoch 0).
Striped/Sharded Mutex would help with that but would reduce the performance slightly due to the need to lock and unlock repeatedly.

Due to the above, I would recommend using a sharded lock with 16 or 32 shards, with each lock always being released before a new one is taken. The rough lock-fairness guarantees of Go should allow forward progress.

A possible refinement of my idea could be to use the sharded lock only during fillCache, with xsync.MapOf as the map for the cache. In this case, if the lbEntry for a given tipset is not available, the shard lock should be taken and the value in the map re-checked; if it is still not available, it should be computed, otherwise the cached lbEntry can be returned after releasing the lock.
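A sketch of this refinement, substituting the standard library's sync.Map for xsync.MapOf; the getOrFill name and the shard count are illustrative assumptions, not Lotus code:

```go
package main

import (
	"fmt"
	"hash/maphash"
	"sync"
)

const nShards = 16

var (
	cache  sync.Map // lock-free concurrent map, standing in for xsync.MapOf
	seed   = maphash.MakeSeed()
	shards [nShards]sync.Mutex
)

// getOrFill implements the double-checked pattern described above:
// no lock on the fast path, shard lock plus re-check on the slow path.
func getOrFill(key string, fill func(string) string) string {
	// Fast path: cached entries need no lock at all.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	// Slow path: take the shard lock for this key...
	lk := &shards[maphash.String(seed, key)%nShards]
	lk.Lock()
	defer lk.Unlock()
	// ...and re-check, since another goroutine may have filled the
	// entry while we were waiting for the lock.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	v := fill(key)
	cache.Store(key, v)
	return v
}

func main() {
	calls := 0
	fill := func(k string) string { calls++; return "entry-for-" + k }
	getOrFill("tipsetA", fill)
	getOrFill("tipsetA", fill) // served from cache; fill not called again
	fmt.Println(calls)         // 1
}
```

The re-check after locking is what prevents duplicate fillCache work for the same tipset, while readers of already-cached entries never touch a mutex.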

@shrenujbansal (Contributor)

@Kubuxu Wouldn't a sharded mutex potentially result in the same problem as this?
For instance, you could have 2 threads which try to get tipsets 0 (all the way back) and 1,400,000 (halfway back).
In this case, both threads would end up doing the long walkback, which we'd probably want to avoid.

I would think an ad hoc process which regularly populates the cache with the updated epochs might be more suitable in this case. I'm not sure how we'd implement it within the process, though.

@Kubuxu (Contributor) commented May 18, 2023

For instance, you could have 2 threads which try to get tipsets 0 (all the way back) and 1,400,000 (halfway back).
In this case, both threads would end up doing the long walkback, which we'd probably want to avoid.

What we want to avoid is walking the 20 blocks needlessly. Only one of the threads can take a given shard of the mutex for walking back the 20 blocks to build an index entry for the given tipset.

Let's say the first tipset not covered by the index is TipsetA. Both threads reach that point concurrently and check the xsync.Map: no entry for TipsetA. Now they will attempt to lock shard number hash(TipsetA) % N_shards; one of them will win, and that thread will start walking the 20 blocks to get from TipsetA to TipsetB at height TipsetA.height - 20. It will find TipsetB, put it in the cache, then release the lock and move on to checking the cache for TipsetB. This is when the second thread will acquire the lock, re-check the map, and notice the entry for TipsetA pointing towards TipsetB.
Both threads will check the cache for TipsetB and the process repeats.

@Kubuxu (Contributor) commented May 18, 2023

I would think an adhoc process which regularly populates the cache with the updated epochs might be more suitable in this case. Not sure how we'd implement it within the process though

The issue isn't recent epochs but far history.

Kubuxu pushed a commit that referenced this pull request May 19, 2023
This PR introduces a sharded mutex within ChainIndex#GetTipsetByHeight.
It also replaces a Go map with xsync.Map, which doesn't require locking.

The lock is taken when it appears that ChainIndex filling work should be
started. After claiming the lock, the status of the cache is rechecked,
if the entry is still missing, fillCache is started.

Thanks to @snissn and @arajasek for debugging and taking initial stabs at this.

Supersedes #10866 and #10885

Signed-off-by: Jakub Sztandera <kubuxu@protocol.ai>
@arajasek (Contributor, Author)

Thanks for the discussion & work here everyone!

@arajasek arajasek closed this May 24, 2023
shrenujbansal pushed a commit that referenced this pull request May 31, 2023
jennijuju pushed a commit that referenced this pull request May 31, 2023