Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: nodes keep out of sync when missing gossiped block (issue #284) #540

Merged
merged 15 commits into from
Dec 28, 2023

Conversation

srene
Copy link
Contributor

@srene srene commented Dec 20, 2023

PR Standards

Opening a pull request should be able to meet the following requirements


For Author:

  • Targeted PR against correct branch
  • included the correct type prefix in the PR title
  • Linked to Github issue with discussion and accepted design
  • Targets only one github issue
  • Wrote unit and integration tests
  • All CI checks have passed
  • Added relevant godoc comments

For Reviewer:

  • confirmed the correct type prefix in the PR title
  • Reviewers assigned
  • confirmed all author checklist items have been addressed

After reviewer approval:

  • In case targets main branch, PR should be squashed and merged.
  • In case PR targets a release branch, PR should be rebased.

@srene srene requested a review from a team as a code owner December 20, 2023 11:02
@srene
Copy link
Contributor Author

srene commented Dec 20, 2023

Fixing the following issue #284.
The problem is when nodes are disconnected or join the network with some delay, they are unable to receive missing blocks and keep out of sync till next batch submission.
The solution enables that nodes can receive previously missed blocks when disconnected and keep up with the block height.
The solution relies in the following:

  • Modification to go-libp2p-pubsub to emit gossip of cached messages to any node when the number of nodes of the network is small (below the degree D), otherwise no gossip (periodic broadcast of previously received messages) is received by nodes.
  • Caching of out-of-order received blocks, to apply them later when missing block received.
  • Modification of default gossipsub parameters to enable previous blocks caching

@omritoptix
Copy link
Contributor

@srene tests fail

Copy link

codecov bot commented Dec 20, 2023

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (d2b2cdf) 52.46% compared to head (4b4a6e0) 52.43%.

Files Patch % Lines
block/block.go 31.81% 13 Missing and 2 partials ⚠️
block/manager.go 11.76% 15 Missing ⚠️
block/retriever.go 25.00% 2 Missing and 1 partial ⚠️
config/config.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #540      +/-   ##
==========================================
- Coverage   52.46%   52.43%   -0.04%     
==========================================
  Files         100      100              
  Lines       14717    14764      +47     
==========================================
+ Hits         7722     7742      +20     
- Misses       5886     5911      +25     
- Partials     1109     1111       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@srene
Copy link
Contributor Author

srene commented Dec 20, 2023

@omri build successful with go1.19 after gossipsub downgrade

Copy link
Contributor

@omritoptix omritoptix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO applyBlock should only applyBlock and it's being called from the sequencer or full nodes but adding caching logic there is not relevant for some of the callers (like the sequncer).

You can simply check in the ApplyBlockCallback before even calling the applyBlock if it should be cached (by the store height etc) . If it’s correct apply it. If not add to cache. And than every time I get a block, assuming my cache is not empty I can try and call AttemptApplyCachedBlocks (or some better name)

This function should probably be called in 2 situations:

  1. once we get a new gossiped block
  2. once we get a block from the DA

@@ -50,19 +50,22 @@ func (m *Manager) syncUntilTarget(ctx context.Context, syncTarget uint64) error
return err
}

if settlementBatch.StartHeight != currentHeight+1 {
//code commented out because it may be the case the start height from the da batch is not exactly the next block from the last received
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply delete the entire lines. no need for leaving it with comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if currentHeight != settlementBatch.EndHeight {

//code commented out because it may be the case the current height is higher than da height after applying pending blocks, in case of missing blocks
/*if currentHeight != settlementBatch.EndHeight {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply delete the entire lines. no need for leaving it with comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

block/block.go Outdated
@@ -112,6 +113,30 @@ func (m *Manager) applyBlock(ctx context.Context, block *types.Block, commit *ty
return nil
}

func (m *Manager) checkPrevCachedBlocks(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to create a mutex on the this function as can be called from both DA code and gossiped blocks callback which may lead to a Race condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. mutex added

block/block.go Outdated
@@ -112,6 +113,30 @@ func (m *Manager) applyBlock(ctx context.Context, block *types.Block, commit *ty
return nil
}

func (m *Manager) checkPrevCachedBlocks(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this function is not only checking it's also applying I think the name is confusing. Also when it says "Prev" blocks it can be confusing as it has been previously cached but those blocks can be in the future (i.e not yet applied)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name changed to attemptApplyCachedBlocks

block/manager.go Outdated
m.logger.Debug("Failed to apply block", "err", err)

if block.Header.Height != m.store.Height()+1 {
// We crashed after the commit and before updating the store height.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is actually not accurate as we can cache blocks which just came out of order regardless to any crash.

block/manager.go Outdated

if block.Header.Height != m.store.Height()+1 {
// We crashed after the commit and before updating the store height.
m.prevBlock[block.Header.Height] = &block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably try to apply the cached blocks even if we just applied a block because and not just if the blocks are out of order. for example, if we got previously block at (store height + 2) and put it into cache and now we got storeHeight + 1 (which means we won't go into the "if") we can still apply the cached block and no need to wait for the next out-of-order block in order to try it..

I would simply add at the end of the function a check, if the block we have in cache is == m.storeHeight + 1 then run ApplyCachedBlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. having function call in the if condition may need an extra block to apply all cached blocks. it's been changed

block/manager.go Outdated
m.logger.Debug("Caching block", "block height", block.Header.Height, "store height", m.store.Height())
err := m.checkPrevCachedBlocks(context.Background())
if err != nil {
m.logger.Debug("Error applying previous cached blocks", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not sure if Error is correct here as it’s not necessarily due to error - it’s mostly based on attempting. So maybe “failed” . especially when you log it with a "debug"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log message changed

block/manager.go Outdated
// We crashed after the commit and before updating the store height.
m.prevBlock[block.Header.Height] = &block
m.prevCommit[block.Header.Height] = &commit
m.logger.Debug("Caching block", "block height", block.Header.Height, "store height", m.store.Height())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO makes more send to move the “caching block” message before the actual caching and not after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

block/manager.go Outdated
m.prevBlock[block.Header.Height] = &block
m.prevCommit[block.Header.Height] = &commit
m.logger.Debug("Caching block", "block height", block.Header.Height, "store height", m.store.Height())
err := m.checkPrevCachedBlocks(context.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why context.background() is being passed? seems like it’s not being used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually used when calling applyBlock. there is no context in the manager so context.background() is used by now

@omritoptix omritoptix merged commit 14ae6fd into dymensionxyz:main Dec 28, 2023
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Full nodes can't catch up on sequencer block gossiping once out of sync
2 participants