
feat: chainstore: remove locking in index cache #10866

Closed
wants to merge 1 commit

Conversation

@arajasek (Contributor)

Related Issues

#10788?

Our friends working with Lotus shared 8 goroutine dumps from a node that was stuck syncing. Contention over the index cache's lock was a common theme.

Proposed Changes

Use a thread-safe cache instead of a mutex-protected map here.

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@arajasek arajasek requested a review from a team as a code owner May 12, 2023 21:32
@fridrik01 (Contributor) left a comment

LGTM

```diff
@@ -59,11 +59,9 @@ func (ci *ChainIndex) GetTipsetByHeight(ctx context.Context, from *types.TipSet,
 		return nil, xerrors.Errorf("failed to round down: %w", err)
 	}
 
-	ci.indexCacheLk.Lock()
```
A contributor commented:

Do we need to have a transactional / consistent view of indexCache during this loop? Can there be other actors changing indexCache while this loop is running that will create unintended behavior?

fillCache has a comment that we want to be holding indexCacheLk during the call -- https://github.com/filecoin-project/lotus/blob/master/chain/store/index.go#L99

There's a lot going on in these loops. I think we just need to make sure that there's always a consistent view of the data.

@snissn (Contributor) commented May 16, 2023

If we want to merge this, we should remove the comment here:

https://github.com/filecoin-project/lotus/pull/10866/files#diff-806866d49e06f388f52056a4cce962c76af2eb7218913aa473da3c15fcd65e18R97

```go
// Caller must hold indexCacheLk
func (ci *ChainIndex) fillCache(ctx context.Context, tsk types.TipSetKey) (*lbEntry, error) {
```

also we need to make sure that this change is safe based on the expectations of transactional consistency that was previously in

GetTipsetByHeight

We may need to keep a lock to ensure a transactional view of the state; that lock may be named indexCacheLk, but it may want to be renamed to indicate the transaction it protects.

Also, I haven't fully read the code yet, so let me know if anything here can be clarified!

```diff
@@ -27,8 +27,7 @@ func init() {
 }
 
 type ChainIndex struct {
-	indexCacheLk sync.Mutex
-	indexCache   map[types.TipSetKey]*lbEntry
+	indexCache *lru.ARCCache[types.TipSetKey, *lbEntry]
```
A contributor commented:

We initially had an ARCCache here, which was dropped in favor of a simple map with a mutex for the reasons pointed out here: #10423 (review)
I think we can keep the mutex but reduce the size of the critical section it covers.

Another contributor replied:

It could be replaced by a simple LRU cache if that is a concern.
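For illustration, a minimal thread-safe LRU along these lines can be built from the standard library alone (container/list plus a mutex). This is a hypothetical sketch, not the lru package referenced in the diff above:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// lruCache is a minimal thread-safe LRU cache sketch. The front of the
// list holds the most recently used entry; the back is evicted first.
type lruCache struct {
	mu    sync.Mutex
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding an *entry
}

type entry struct {
	key string
	val int
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, ll: list.New(), items: make(map[string]*list.Element)}
}

func (c *lruCache) Get(key string) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.ll.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).val, true
}

func (c *lruCache) Put(key string, val int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.ll.MoveToFront(el)
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, val})
	if c.ll.Len() > c.cap {
		// Evict the least recently used entry.
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Put("c", 3) // evicts "a", the least recently used
	_, ok := c.Get("a")
	fmt.Println(ok) // false
}
```

In practice a production choice would more likely be an existing package such as hashicorp/golang-lru (which provides both plain LRU and ARC variants), but the sketch shows that a simple LRU only needs one coarse mutex around short map/list operations.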

@shrenujbansal (Contributor)

So this PR is essentially reducing the critical section of the chain index lock from

```go
cur := rounded.Key()
for {
	lbe, ok := ci.indexCache[cur]
	if !ok {
		fc, err := ci.fillCache(ctx, cur)
		if err != nil {
			return nil, xerrors.Errorf("failed to fill cache: %w", err)
		}
		lbe = fc
	}
	if to == lbe.targetHeight {
		ts, err := ci.loadTipSet(ctx, lbe.target)
		if err != nil {
			return nil, xerrors.Errorf("failed to load tipset: %w", err)
		}
		return ts, nil
	}
	if to > lbe.targetHeight {
		ts, err := ci.loadTipSet(ctx, cur)
		if err != nil {
			return nil, xerrors.Errorf("failed to load tipset: %w", err)
		}
		return ci.walkBack(ctx, ts, to)
	}
	cur = lbe.target
}
```
to the individual cache operations.

However, with this change, in the event that several processes call GetTipsetByHeight simultaneously, all these processes will run the fillCache function

```go
func (ci *ChainIndex) fillCache(ctx context.Context, tsk types.TipSetKey) (*lbEntry, error) {
```
which will be wasteful.
I'm not sure whether such a case happens often, but if it does, it makes sense to keep the lock held until the end of fillCache.

@magik6k Thoughts? Are the operations performed in fillCache relatively expensive?

@snissn (Contributor) commented May 16, 2023

However, with this change, in the event that several processes call GetTipsetByHeight simultaneously, all these processes will run the fillCache function

We could add a lock on each tipset key (tsk) and ensure that only one caller of fillCache runs at a time by holding that lock; that way, subsequent callers should hit the cache. We would possibly see a lot of lock contention there, but there may not be a way to avoid that without changing something else, so that could be a different PR.
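A sketch of this per-tipset-lock idea, assuming a sync.Map of lazily created mutexes. The names fillOnce, cache, and locks are illustrative, not Lotus code:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	cache sync.Map // key -> cached value
	locks sync.Map // key -> *sync.Mutex, created on demand
)

// fillOnce ensures only one caller computes the value for a given key;
// callers that lose the race block on the same per-key mutex and then
// find the entry already cached.
func fillOnce(key string, fill func(string) string) string {
	lkIface, _ := locks.LoadOrStore(key, &sync.Mutex{})
	lk := lkIface.(*sync.Mutex)
	lk.Lock()
	defer lk.Unlock()
	// Re-check under the lock: the winner may have filled the cache.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	v := fill(key)
	cache.Store(key, v)
	return v
}

func main() {
	calls := 0
	var mu sync.Mutex
	fill := func(k string) string {
		mu.Lock()
		calls++
		mu.Unlock()
		return "filled-" + k
	}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); fillOnce("tipsetA", fill) }()
	}
	wg.Wait()
	fmt.Println(calls) // 1: only one goroutine ran fill for the key
}
```

One caveat of this approach is that the lock map grows with the number of distinct keys ever seen; a sharded mutex, as discussed later in this thread, bounds memory by hashing keys into a fixed set of mutexes.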

@Kubuxu (Contributor) commented May 16, 2023

Our friends working with Lotus shared 8 goroutine dumps from a node that was stuck syncing. Contention over the index cache's lock was a common theme.

It would be interesting to know if they all were building a cache for the same tipset chain or not. If it was the same tipset chain then this behaviour is correct/beneficial as @shrenujbansal points out.

@Kubuxu (Contributor) commented May 16, 2023

I think the more correct, but still not too complex, thing to do here is to use a sharded mutex keyed on the tipset during fillCache and for access to the cache.

A sharded mutex is an array of predetermined size containing mutexes. When trying to lock the mutex for a key, the key (tipset key in this case) is hashed using a fast non-cryptographic hash function and the mutex at digest % N_shards is taken.

hash/maphash.Hash uses a runtime seed, which stays the same across hash invocations within a process. maphash.String is preferred due to optimizations. The seed can be generated using maphash.MakeSeed.
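The scheme described above can be sketched as follows; the shardedMutex type and the shard count of 16 are illustrative assumptions, not Lotus code:

```go
package main

import (
	"fmt"
	"hash/maphash"
	"sync"
)

// nShards is a hypothetical shard count; the thread suggests 16 or 32.
const nShards = 16

// shardedMutex maps a key to one of a fixed set of mutexes by hashing
// the key with hash/maphash and indexing with digest % nShards.
type shardedMutex struct {
	seed   maphash.Seed
	shards [nShards]sync.Mutex
}

func newShardedMutex() *shardedMutex {
	// The seed is generated once so the same key always maps to the
	// same shard within this process.
	return &shardedMutex{seed: maphash.MakeSeed()}
}

func (sm *shardedMutex) shardFor(key string) *sync.Mutex {
	// maphash.String is a fast, non-cryptographic, runtime-seeded hash.
	return &sm.shards[maphash.String(sm.seed, key)%nShards]
}

func (sm *shardedMutex) Lock(key string)   { sm.shardFor(key).Lock() }
func (sm *shardedMutex) Unlock(key string) { sm.shardFor(key).Unlock() }

func main() {
	sm := newShardedMutex()
	sm.Lock("tipset-key")
	sm.Unlock("tipset-key")
	// Equal keys always hash to the same shard.
	fmt.Println(sm.shardFor("x") == sm.shardFor("x")) // true
}
```

Distinct keys can collide on a shard, so unrelated callers occasionally contend, but memory stays fixed at nShards mutexes regardless of how many keys are seen.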

@Kubuxu (Contributor) commented May 16, 2023

@arajasek can you share these goroutine dumps with me?

@jennijuju (Member)

@arajasek can you share these goroutine dumps with me?

I will on slack

@snissn (Contributor) commented May 16, 2023

@Kubuxu I think you are describing something like a "striped mutex": https://pkg.go.dev/github.com/nmvalera/striped-mutex. I have not used this library before, but it is worth trying before writing our own.

@Kubuxu (Contributor) commented May 16, 2023

Yeah, that is what I was explaining. I hadn't bothered to look for an existing one.

@Kubuxu (Contributor) commented May 16, 2023

For the Infura goroutine dumps, we discovered that there was a request made to get a block at height -76, and it looked like the ChainIndex was making some progress.

@Kubuxu (Contributor) commented May 17, 2023

The negative epoch issue should be resolved by: #10878
Despite this, a DoS vector still exists when requesting epoch 0 or 1 (there might be a special case for epoch 0).
Striped/Sharded Mutex would help with that but would reduce the performance slightly due to the need to lock and unlock repeatedly.

Due to the above, I would recommend using a sharded lock with 16 or 32 shards, with each lock always being released before a new one is taken. The rough lock-fairness guarantees of Go should allow forward progress.

A possible refinement of my idea could be to use the sharded lock only during fillCache, with xsync.MapOf as the map for the cache. In this case, if the lbEntry for a given tipset is not available, the shard lock should be taken and the value in the map re-checked; if it is still not available, it should be computed, otherwise the cached lbEntry can be returned after releasing the lock.
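A sketch of this refinement, substituting the standard library's sync.Map for xsync.MapOf; the getOrFill name and the shard count are illustrative assumptions, not Lotus code:

```go
package main

import (
	"fmt"
	"hash/maphash"
	"sync"
)

const nShards = 16

var (
	cache  sync.Map // lock-free concurrent map, standing in for xsync.MapOf
	seed   = maphash.MakeSeed()
	shards [nShards]sync.Mutex
)

// getOrFill implements the double-checked pattern described above:
// no lock on the fast path, shard lock plus re-check on the slow path.
func getOrFill(key string, fill func(string) string) string {
	// Fast path: cached entries need no lock at all.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	// Slow path: take the shard lock for this key...
	lk := &shards[maphash.String(seed, key)%nShards]
	lk.Lock()
	defer lk.Unlock()
	// ...and re-check, since another goroutine may have filled the
	// entry while we were waiting for the lock.
	if v, ok := cache.Load(key); ok {
		return v.(string)
	}
	v := fill(key)
	cache.Store(key, v)
	return v
}

func main() {
	calls := 0
	fill := func(k string) string { calls++; return "entry-for-" + k }
	getOrFill("tipsetA", fill)
	getOrFill("tipsetA", fill) // served from cache; fill not called again
	fmt.Println(calls)         // 1
}
```

The re-check after locking is what prevents duplicate fillCache work for the same tipset, while readers of already-cached entries never touch a mutex.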

@shrenujbansal (Contributor)

@Kubuxu Wouldn't a sharded mutex potentially result in the same problem as this?
For instance, you could have 2 threads which try to get tipsets 0 (all the way back) and 1,400,000 (halfway back).
In this case, both threads would end up doing the long walkback, which we'd probably want to avoid.

I would think an ad hoc process which regularly populates the cache with the updated epochs might be more suitable in this case. I'm not sure how we'd implement it within the process, though.

@Kubuxu (Contributor) commented May 18, 2023

For instance, you could have 2 threads which try to get tipsets 0 (all the way back) and 1,400,000 (halfway back).
In this case, both threads would end up doing the long walkback, which we'd probably want to avoid.

What we want to avoid is walking the 20 blocks needlessly. Only one of the threads can take a given shard of the mutex for walking back the 20 blocks to build an index entry for the given tipset.

Let's say the first tipset not covered by the index is TipsetA. Both threads reach that point concurrently and check the xsync.Map: no entry for TipsetA. Now they will attempt to lock shard number hash(TipsetA) % N_shards; one of them will win, and that thread will start walking the 20 blocks to get from TipsetA to TipsetB at height TipsetA.height - 20. It will find TipsetB, put it in the cache, then release the lock and move on to checking the cache for TipsetB. This is when the second thread will acquire the lock, re-check the map, and notice the entry for TipsetA pointing towards TipsetB.
Both threads will check the cache for TipsetB and the process repeats.

@Kubuxu (Contributor) commented May 18, 2023

I would think an adhoc process which regularly populates the cache with the updated epochs might be more suitable in this case. Not sure how we'd implement it within the process though

The issue isn't recent epochs but far history.

Kubuxu pushed a commit that referenced this pull request May 19, 2023
This PR introduces a sharded mutex within ChainIndex#GetTipsetByHeight.
It also replaces a Go map with xsync.Map, which doesn't require locking.

The lock is taken when it appears that ChainIndex filling work should be
started. After claiming the lock, the status of the cache is rechecked,
if the entry is still missing, fillCache is started.

Thanks to @snissn and @arajasek for debugging and taking initial stabs at this.

Supersedes #10866 and #10885

Signed-off-by: Jakub Sztandera <kubuxu@protocol.ai>
@arajasek (Contributor, Author)

Thanks for the discussion & work here everyone!

@arajasek arajasek closed this May 24, 2023
shrenujbansal pushed a commit that referenced this pull request May 31, 2023
jennijuju pushed a commit that referenced this pull request May 31, 2023