-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(manager): stop applying blocks after node set unhealthy #1194
Conversation
f2afbf3
to
45cdef5
Compare
75721ac
to
5b6466a
Compare
block/modes.go
Outdated
@@ -75,3 +69,34 @@ func (m *Manager) runAsProposer(ctx context.Context, eg *errgroup.Group) error { | |||
|
|||
return nil | |||
} | |||
|
|||
func (m *Manager) subscribeFullNodeEvents(ctx context.Context) { | |||
if m.RunMode == RunModeProposer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably better to remove to call site
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
block/modes.go
Outdated
} | ||
|
||
func (m *Manager) unsubscribeFullNodeEvents(ctx context.Context) { | ||
if m.RunMode == RunModeProposer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same. I'd move to callsite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
block/modes.go
Outdated
m.logger.Error("Unsubscribe", "clientId", clientId, "error", err) | ||
} | ||
} | ||
unsubscribe("syncLoop") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hard-conding the client id can easily be error prone.. I'd suggest change to const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to const
block/sync.go
Outdated
@@ -71,18 +68,16 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error { | |||
|
|||
settlementBatch, err := m.SLClient.GetBatchAtHeight(m.State.NextHeight()) | |||
if err != nil { | |||
return fmt.Errorf("retrieve batch: %w", err) | |||
m.logger.Error("retrieve SL batch", "err", err) | |||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we break vs returning error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because if it temporary fails the access to SL, do you want to return error and quit the loop? this way it will try again for the next state update but without failing (the rollapp maybe operative and is just an issue with the hub node).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thing is that I think that if we return an error it means the hub node retires exhausted.
in that case it could be due to:
- bad rpc endpoitn (requires node operator inteference)
- hub is down
either way I think both should raise node unhleathy and let the operator figure it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to error
block/sync.go
Outdated
} | ||
m.logger.Info("Retrieved state update from SL.", "state_index", settlementBatch.StateIndex) | ||
|
||
err = m.ApplyBatchFromSL(settlementBatch.Batch) | ||
if err != nil { | ||
m.freezeNode(context.Background(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we freeze node vs returning error and handling on the call site?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated it
block/sync.go
Outdated
@@ -92,16 +87,18 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error { | |||
|
|||
err = m.attemptApplyCachedBlocks() | |||
if err != nil { | |||
uevent.MustPublish(context.TODO(), m.Pubsub, &events.DataHealthStatus{Error: err}, events.HealthStatusList) | |||
m.freezeNode(context.Background(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. imo freeze should be called from callsite if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously not a blocker, but I think the concept is a bit backwards
A node should be unhealthy if it stops working for whatever reason
So you should turn off the processing, and the unhealthy status should read off of that, not the other way around
|
||
func (m *Manager) subscribeFullNodeEvents(ctx context.Context) { | ||
// Subscribe to new (or finalized) state updates events. | ||
go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger) |
Check notice
Code scanning / CodeQL
Spawning a Go routine Note
func (m *Manager) subscribeFullNodeEvents(ctx context.Context) { | ||
// Subscribe to new (or finalized) state updates events. | ||
go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger) | ||
go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger) |
Check notice
Code scanning / CodeQL
Spawning a Go routine Note
go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger) | ||
|
||
// Subscribe to P2P received blocks events (used for P2P syncing). | ||
go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger) |
Check notice
Code scanning / CodeQL
Spawning a Go routine Note
|
||
// Subscribe to P2P received blocks events (used for P2P syncing). | ||
go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger) | ||
go uevent.MustSubscribe(ctx, m.Pubsub, p2pBlocksyncLoop, p2p.EventQueryNewBlockSyncBlock, m.OnReceivedBlock, m.logger) |
Check notice
Code scanning / CodeQL
Spawning a Go routine Note
I created an issue to improve the unhealthy status management #1208 |
PR Standards
Opening a pull request should be able to meet the following requirements
--
PR naming convention: https://hackmd.io/@nZpxHZ0CT7O5ngTp0TP9mg/HJP_jrm7A
Close #XXX
<-- Briefly describe the content of this pull request -->
For Author:
godoc
commentsFor Reviewer:
After reviewer approval: