Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(manager): stop applying blocks after node set unhealthy #1194

Merged
merged 12 commits into from
Nov 8, 2024

Conversation

srene
Copy link
Contributor

@srene srene commented Oct 31, 2024

PR Standards

Opening a pull request should be able to meet the following requirements

--

PR naming convention: https://hackmd.io/@nZpxHZ0CT7O5ngTp0TP9mg/HJP_jrm7A


Close #XXX

<-- Briefly describe the content of this pull request -->

For Author:

  • Targeted PR against correct branch
  • included the correct type prefix in the PR title
  • Linked to Github issue with discussion and accepted design
  • Targets only one github issue
  • Wrote unit and integration tests
  • All CI checks have passed
  • Added relevant godoc comments

For Reviewer:

  • confirmed the correct type prefix in the PR title
  • Reviewers assigned
  • confirmed all author checklist items have been addressed

After reviewer approval:

  • In case targets main branch, PR should be squashed and merged.
  • In case PR targets a release branch, PR should be rebased.

@srene srene requested a review from a team as a code owner October 31, 2024 14:50
@srene srene marked this pull request as draft October 31, 2024 14:50
block/modes.go Fixed Show fixed Hide fixed
block/modes.go Fixed Show fixed Hide fixed
block/modes.go Fixed Show fixed Hide fixed
block/modes.go Fixed Show fixed Hide fixed
@srene srene marked this pull request as ready for review November 2, 2024 16:01
mtsitrin
mtsitrin previously approved these changes Nov 7, 2024
block/modes.go Outdated
@@ -75,3 +69,34 @@ func (m *Manager) runAsProposer(ctx context.Context, eg *errgroup.Group) error {

return nil
}

func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
if m.RunMode == RunModeProposer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to remove to call site

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

block/modes.go Outdated
}

func (m *Manager) unsubscribeFullNodeEvents(ctx context.Context) {
if m.RunMode == RunModeProposer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same. I'd move to callsite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

block/modes.go Outdated
m.logger.Error("Unsubscribe", "clientId", clientId, "error", err)
}
}
unsubscribe("syncLoop")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard-conding the client id can easily be error prone.. I'd suggest change to const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to const

block/sync.go Outdated
@@ -71,18 +68,16 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {

settlementBatch, err := m.SLClient.GetBatchAtHeight(m.State.NextHeight())
if err != nil {
return fmt.Errorf("retrieve batch: %w", err)
m.logger.Error("retrieve SL batch", "err", err)
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we break vs returning error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because if it temporary fails the access to SL, do you want to return error and quit the loop? this way it will try again for the next state update but without failing (the rollapp maybe operative and is just an issue with the hub node).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thing is that I think that if we return an error it means the hub node retires exhausted.
in that case it could be due to:

  1. bad rpc endpoitn (requires node operator inteference)
  2. hub is down

either way I think both should raise node unhleathy and let the operator figure it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to error

block/sync.go Outdated
}
m.logger.Info("Retrieved state update from SL.", "state_index", settlementBatch.StateIndex)

err = m.ApplyBatchFromSL(settlementBatch.Batch)
if err != nil {
m.freezeNode(context.Background(), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we freeze node vs returning error and handling on the call site?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it

block/sync.go Outdated
@@ -92,16 +87,18 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {

err = m.attemptApplyCachedBlocks()
if err != nil {
uevent.MustPublish(context.TODO(), m.Pubsub, &events.DataHealthStatus{Error: err}, events.HealthStatusList)
m.freezeNode(context.Background(), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. imo freeze should be called from callsite if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it

Copy link
Contributor

@danwt danwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously not a blocker, but I think the concept is a bit backwards

A node should be unhealthy if it stops working for whatever reason

So you should turn off the processing, and the unhealthy status should read off of that, not the other way around

@srene srene requested a review from omritoptix November 8, 2024 12:18

func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
// Subscribe to new (or finalized) state updates events.
go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger)

Check notice

Code scanning / CodeQL

Spawning a Go routine Note

Spawning a Go routine may be a possible source of non-determinism
func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
// Subscribe to new (or finalized) state updates events.
go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger)
go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger)

Check notice

Code scanning / CodeQL

Spawning a Go routine Note

Spawning a Go routine may be a possible source of non-determinism
go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger)

// Subscribe to P2P received blocks events (used for P2P syncing).
go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger)

Check notice

Code scanning / CodeQL

Spawning a Go routine Note

Spawning a Go routine may be a possible source of non-determinism

// Subscribe to P2P received blocks events (used for P2P syncing).
go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger)
go uevent.MustSubscribe(ctx, m.Pubsub, p2pBlocksyncLoop, p2p.EventQueryNewBlockSyncBlock, m.OnReceivedBlock, m.logger)

Check notice

Code scanning / CodeQL

Spawning a Go routine Note

Spawning a Go routine may be a possible source of non-determinism
@srene
Copy link
Contributor Author

srene commented Nov 8, 2024

Obviously not a blocker, but I think the concept is a bit backwards

A node should be unhealthy if it stops working for whatever reason

So you should turn off the processing, and the unhealthy status should read off of that, not the other way around

I created an issue to improve the unhealthy status management #1208

@omritoptix omritoptix merged commit 2c15921 into main Nov 8, 2024
5 of 6 checks passed
@omritoptix omritoptix deleted the srene/unhealthy branch November 8, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants