feat(manager): stop applying blocks after node set unhealthy #1194

srene · 2024-10-31T14:50:20Z

PR Standards

A pull request should meet the following requirements when it is opened

--

PR naming convention: https://hackmd.io/@nZpxHZ0CT7O5ngTp0TP9mg/HJP_jrm7A

Close #XXX

<-- Briefly describe the content of this pull request -->

For Author:

Targeted PR against the correct branch
Included the correct type prefix in the PR title
Linked to a GitHub issue with discussion and accepted design
Targets only one GitHub issue
Wrote unit and integration tests
All CI checks have passed
Added relevant godoc comments

For Reviewer:

Confirmed the correct type prefix in the PR title
Reviewers assigned
Confirmed all author checklist items have been addressed

After reviewer approval:

If the PR targets the main branch, it should be squashed and merged.
If the PR targets a release branch, it should be rebased.

omritoptix · 2024-11-08T11:42:55Z

block/modes.go

@@ -75,3 +69,34 @@ func (m *Manager) runAsProposer(ctx context.Context, eg *errgroup.Group) error {

 	return nil
 }
+
+func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
+	if m.RunMode == RunModeProposer {


Probably better to move this check to the call site.

omritoptix · 2024-11-08T11:43:59Z

block/modes.go

+}
+
+func (m *Manager) unsubscribeFullNodeEvents(ctx context.Context) {
+	if m.RunMode == RunModeProposer {


Same. I'd move it to the call site.
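
For reference, a minimal sketch of what moving the run-mode check to the call site might look like. `Manager` here is a simplified stand-in for the real block manager, and `start` and the `subscribed` field are hypothetical names that do not come from the PR.

```go
package main

import "fmt"

// RunMode mirrors the idea of the run mode in the diff; the values are stand-ins.
type RunMode int

const (
	RunModeProposer RunMode = iota
	RunModeFullNode
)

// Manager is a simplified stand-in for the real block manager.
type Manager struct {
	RunMode    RunMode
	subscribed bool // hypothetical flag so the sketch is observable
}

// subscribeFullNodeEvents has no mode check: it assumes the caller already decided.
func (m *Manager) subscribeFullNodeEvents() {
	m.subscribed = true
}

// start is a hypothetical call site that owns the mode decision.
func (m *Manager) start() {
	if m.RunMode == RunModeFullNode {
		m.subscribeFullNodeEvents()
	}
}

func main() {
	full := &Manager{RunMode: RunModeFullNode}
	full.start()
	proposer := &Manager{RunMode: RunModeProposer}
	proposer.start()
	fmt.Println(full.subscribed, proposer.subscribed)
}
```

With the guard at the call site, the helper does one thing and a reader of the caller can see at a glance which modes subscribe.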

omritoptix · 2024-11-08T11:44:43Z

block/modes.go

+			m.logger.Error("Unsubscribe", "clientId", clientId, "error", err)
+		}
+	}
+	unsubscribe("syncLoop")


Hard-coding the client ID is error-prone. I'd suggest changing it to a const.

Moved to const.
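
A small sketch of the const approach, assuming subscribe and unsubscribe should always act on the same set of client IDs. The constant names match the identifiers visible in the later diff hunks; only `"syncLoop"` appears as a literal in the diff, so the other string values are assumptions, and `registry` is a hypothetical stand-in for the pubsub server.

```go
package main

import "fmt"

// Client IDs shared by subscribe and unsubscribe. Only "syncLoop" appears as a
// literal in the diff; the other string values are assumptions.
const (
	syncLoop         = "syncLoop"
	validateLoop     = "validateLoop"
	p2pGossipLoop    = "p2pGossipLoop"
	p2pBlocksyncLoop = "p2pBlocksyncLoop"
)

// fullNodeClients lists every ID once so both directions iterate the same set.
var fullNodeClients = []string{syncLoop, validateLoop, p2pGossipLoop, p2pBlocksyncLoop}

// registry is a hypothetical stand-in for the pubsub server.
type registry map[string]bool

func (r registry) subscribeAll() {
	for _, id := range fullNodeClients {
		r[id] = true
	}
}

// unsubscribeAll returns an error for any ID that was never subscribed,
// which is the failure a mistyped literal would cause.
func (r registry) unsubscribeAll() error {
	for _, id := range fullNodeClients {
		if !r[id] {
			return fmt.Errorf("unsubscribe %q: not subscribed", id)
		}
		delete(r, id)
	}
	return nil
}

func main() {
	r := registry{}
	r.subscribeAll()
	fmt.Println(r.unsubscribeAll(), len(r))
}
```

Because both functions range over the same slice of constants, a typo becomes a compile error instead of a silent mismatch at runtime.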

omritoptix · 2024-11-08T11:47:52Z

block/sync.go

@@ -71,18 +68,16 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {

 				settlementBatch, err := m.SLClient.GetBatchAtHeight(m.State.NextHeight())
 				if err != nil {
-					return fmt.Errorf("retrieve batch: %w", err)
+					m.logger.Error("retrieve SL batch", "err", err)
+					break


Why do we break instead of returning the error?

Because if access to the SL fails temporarily, do you want to return an error and quit the loop? This way it will try again on the next state update without failing (the rollapp may be operational and the issue is just with the hub node).

The thing is, I think that if we get to the point of returning an error, it means the hub node retries are exhausted.
In that case it could be due to:

a bad RPC endpoint (requires node operator intervention)

the hub is down

Either way, I think both should mark the node unhealthy and let the operator figure it out.

Changed to return an error.
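
To illustrate the outcome of this thread, here is a hedged sketch of a sync loop that returns a wrapped error instead of logging it and breaking. `batchGetter`, `syncTo`, `isHubUnavailable` and `errHubUnavailable` are hypothetical names, and the real settlement client's retry behaviour is not modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// errHubUnavailable is a hypothetical sentinel for a failed settlement-layer call.
var errHubUnavailable = errors.New("hub unavailable")

// batchGetter is a simplified stand-in for the settlement client.
type batchGetter func(height uint64) (string, error)

// syncTo fetches batches from next up to target and returns the first error,
// wrapped, instead of logging it and breaking out of the loop.
func syncTo(get batchGetter, next, target uint64) (uint64, error) {
	for next <= target {
		if _, err := get(next); err != nil {
			return next, fmt.Errorf("retrieve SL batch: %w", err)
		}
		next++
	}
	return next, nil
}

// isHubUnavailable lets callers classify the failure without string matching.
func isHubUnavailable(err error) bool {
	return errors.Is(err, errHubUnavailable)
}

func main() {
	failAt3 := func(h uint64) (string, error) {
		if h == 3 {
			return "", errHubUnavailable
		}
		return "batch", nil
	}
	next, err := syncTo(failAt3, 1, 5)
	fmt.Println(next, err, isHubUnavailable(err))
}
```

Wrapping with `%w` keeps the original cause available to the caller, which can then decide whether the node should be marked unhealthy.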

omritoptix · 2024-11-08T11:48:50Z

block/sync.go

 				}
 				m.logger.Info("Retrieved state update from SL.", "state_index", settlementBatch.StateIndex)

 				err = m.ApplyBatchFromSL(settlementBatch.Batch)
 				if err != nil {
+					m.freezeNode(context.Background(), err)


Why do we freeze the node here instead of returning the error and handling it at the call site?

omritoptix · 2024-11-08T11:49:17Z

block/sync.go

@@ -92,16 +87,18 @@ func (m *Manager) SettlementSyncLoop(ctx context.Context) error {

 				err = m.attemptApplyCachedBlocks()
 				if err != nil {
-					uevent.MustPublish(context.TODO(), m.Pubsub, &events.DataHealthStatus{Error: err}, events.HealthStatusList)
+					m.freezeNode(context.Background(), err)


Same here. IMO freeze should be called from the call site if possible.
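
One possible shape for this suggestion, as a sketch only: the sync steps just return errors, and a single caller decides to freeze. `node`, `syncOnce`, `run`, `applyBatch` and `applyCached` are hypothetical names, and `freeze` merely stands in for the role `freezeNode` plays in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a simplified stand-in for the manager; frozenWith is hypothetical.
type node struct {
	frozenWith error
}

// freeze records the error that stopped the node. In the PR this role is
// played by freezeNode; the body here is only illustrative.
func (n *node) freeze(err error) {
	if n.frozenWith == nil {
		n.frozenWith = err
	}
}

// applyBatch and applyCached are hypothetical steps that only report errors.
func applyBatch(fail bool) error {
	if fail {
		return errors.New("apply batch from SL failed")
	}
	return nil
}

func applyCached() error { return nil }

// syncOnce runs the steps in order and returns the first error; it never freezes.
func syncOnce(failBatch bool) error {
	if err := applyBatch(failBatch); err != nil {
		return fmt.Errorf("apply batch: %w", err)
	}
	if err := applyCached(); err != nil {
		return fmt.Errorf("apply cached blocks: %w", err)
	}
	return nil
}

// run is the call site: the single place that decides to freeze.
func (n *node) run(failBatch bool) {
	if err := syncOnce(failBatch); err != nil {
		n.freeze(err)
	}
}

func main() {
	n := &node{}
	n.run(true)
	fmt.Println(n.frozenWith)
}
```

Keeping the freeze decision in one place means a new failure path inside the loop cannot forget to freeze, and the loop itself stays testable without side effects.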

danwt

Obviously not a blocker, but I think the concept is a bit backwards

A node should be unhealthy if it stops working for whatever reason

So you should turn off the processing, and the unhealthy status should read off of that, not the other way around

block/modes.go

+
+func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
+	// Subscribe to new (or finalized) state updates events.
+	go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger)


block/modes.go

+func (m *Manager) subscribeFullNodeEvents(ctx context.Context) {
+	// Subscribe to new (or finalized) state updates events.
+	go uevent.MustSubscribe(ctx, m.Pubsub, syncLoop, settlement.EventQueryNewSettlementBatchAccepted, m.onNewStateUpdate, m.logger)
+	go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger)


block/modes.go

+	go uevent.MustSubscribe(ctx, m.Pubsub, validateLoop, settlement.EventQueryNewSettlementBatchFinalized, m.onNewStateUpdateFinalized, m.logger)
+
+	// Subscribe to P2P received blocks events (used for P2P syncing).
+	go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger)


block/modes.go

+
+	// Subscribe to P2P received blocks events (used for P2P syncing).
+	go uevent.MustSubscribe(ctx, m.Pubsub, p2pGossipLoop, p2p.EventQueryNewGossipedBlock, m.OnReceivedBlock, m.logger)
+	go uevent.MustSubscribe(ctx, m.Pubsub, p2pBlocksyncLoop, p2p.EventQueryNewBlockSyncBlock, m.OnReceivedBlock, m.logger)


srene · 2024-11-08T12:55:48Z

Obviously not a blocker, but I think the concept is a bit backwards

A node should be unhealthy if it stops working for whatever reason

So you should turn off the processing, and the unhealthy status should read off of that, not the other way around

I created an issue to improve the unhealthy status management #1208
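
As a rough illustration of the inversion danwt describes (not the design tracked in #1208, which is not shown in this thread), health can be derived from whether processing has stopped rather than stored as a separate flag. All names below (`processor`, `stop`, `apply`, `healthy`, `errFraud`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errFraud is a hypothetical reason for stopping.
var errFraud = errors.New("fraud detected")

// processor is a hypothetical stand-in for the block-applying part of the node.
type processor struct {
	mu      sync.Mutex
	stopErr error // nil while processing is running
}

// stop turns processing off and remembers why; the first reason wins.
func (p *processor) stop(reason error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.stopErr == nil {
		p.stopErr = reason
	}
}

// apply refuses new blocks once processing has stopped.
func (p *processor) apply(height uint64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.stopErr != nil {
		return fmt.Errorf("block %d rejected: %w", height, p.stopErr)
	}
	return nil
}

// healthy is derived from the processing state rather than stored separately.
func (p *processor) healthy() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.stopErr == nil
}

func main() {
	p := &processor{}
	fmt.Println(p.healthy(), p.apply(1))
	p.stop(errFraud)
	fmt.Println(p.healthy(), p.apply(2))
}
```

Here there is a single source of truth: the node cannot report healthy while refusing blocks, or the reverse.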

srene requested a review from a team as a code owner October 31, 2024 14:50

srene marked this pull request as draft October 31, 2024 14:50

github-actions bot added the dym-internal label Oct 31, 2024

github-advanced-security bot found potential problems Oct 31, 2024

block/modes.go (4 alerts, each marked Fixed)

srene marked this pull request as ready for review November 2, 2024 16:01

mtsitrin previously approved these changes Nov 7, 2024

srene dismissed mtsitrin’s stale review via f2afbf3 November 8, 2024 10:34

srene added 6 commits November 8, 2024 11:36

return err

84c34e7

unsubscribe

998e38c

handle apply block err

19b4787

minor edit

527afb8

fix

5c27a9d

rename func

45cdef5

srene force-pushed the srene/unhealthy branch from f2afbf3 to 45cdef5 Compare November 8, 2024 10:36

fix after merge

5b6466a

srene force-pushed the srene/unhealthy branch from 75721ac to 5b6466a Compare November 8, 2024 11:07

omritoptix reviewed Nov 8, 2024

danwt reviewed Nov 8, 2024

srene added 3 commits November 8, 2024 13:10

comments

b3cd40b

comments

6b14dc2

comments

cc6f891

srene requested a review from omritoptix November 8, 2024 12:18

lint

e9e9e75

github-advanced-security bot found potential problems Nov 8, 2024

returning error when sl is down

ec76958

omritoptix approved these changes Nov 8, 2024

omritoptix merged commit 2c15921 into main Nov 8, 2024
5 of 6 checks passed

omritoptix deleted the srene/unhealthy branch November 8, 2024 13:26

srene added a commit that referenced this pull request Nov 10, 2024

feat(manager): stop applying blocks after node set unhealthy (#1194)

3d0ad33
