latest finalized block metrics #12339
Changes from 42 commits
@@ -0,0 +1,5 @@
+---
+"chainlink": minor
+---
+
+Add the `pool_rpc_node_highest_finalized_block` metric that tracks the highest finalized block seen per RPC. If `FinalityTagEnabled = true`, a positive `NodePool.FinalizedBlockPollInterval` is needed to collect the metric. If the finality tag is not enabled, the metric is populated with a calculated latest finalized block based on the latest head and finality depth.
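As a rough illustration of the fallback path described above, the finality-depth calculation amounts to the following minimal sketch (the helper name and signature are illustrative, not the PR's API):

package main

import "fmt"

// calculatedFinalizedBlock sketches the fallback described above: with the
// finality tag disabled, the latest finalized block is approximated as the
// latest head minus the configured finality depth, clamped at zero.
func calculatedFinalizedBlock(latestHead int64, finalityDepth uint32) int64 {
	finalized := latestHead - int64(finalityDepth)
	if finalized < 0 {
		finalized = 0
	}
	return finalized
}

func main() {
	fmt.Println(calculatedFinalizedBlock(1000, 50)) // 950
	fmt.Println(calculatedFinalizedBlock(10, 50))   // 0, clamped on young chains
}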
@@ -0,0 +1,30 @@
+package mocks
+
+import (
+	"time"
+
+	commonconfig "github.com/smartcontractkit/chainlink/v2/common/config"
+)
+
+type ChainConfig struct {
+	IsFinalityTagEnabled   bool
+	FinalityDepthVal       uint32
+	NoNewHeadsThresholdVal time.Duration
+	ChainTypeVal           commonconfig.ChainType
+}
+
+func (t ChainConfig) ChainType() commonconfig.ChainType {
+	return t.ChainTypeVal
+}
+
+func (t ChainConfig) NodeNoNewHeadsThreshold() time.Duration {
+	return t.NoNewHeadsThresholdVal
+}
+
+func (t ChainConfig) FinalityDepth() uint32 {
+	return t.FinalityDepthVal
+}
+
+func (t ChainConfig) FinalityTagEnabled() bool {
+	return t.IsFinalityTagEnabled
+}
@@ -15,6 +15,7 @@ import (
 	"github.com/smartcontractkit/chainlink-common/pkg/logger"
 	"github.com/smartcontractkit/chainlink-common/pkg/services"
 
+	commonconfig "github.com/smartcontractkit/chainlink/v2/common/config"
 	"github.com/smartcontractkit/chainlink/v2/common/types"
 )
@@ -43,6 +44,14 @@ type NodeConfig interface {
 	SelectionMode() string
 	SyncThreshold() uint32
 	NodeIsSyncingEnabled() bool
+	FinalizedBlockPollInterval() time.Duration
 }
 
+type ChainConfig interface {
+	NodeNoNewHeadsThreshold() time.Duration
+	FinalityDepth() uint32
+	FinalityTagEnabled() bool
+	ChainType() commonconfig.ChainType
+}
+
 //go:generate mockery --quiet --name Node --structname mockNode --filename "mock_node_test.go" --inpackage --case=underscore

Review thread on the new ChainConfig interface:

Comment: Can you help me reason through why this configuration belongs to […]? I'd expect configuration here to be node specific and for chain details to live at a higher level in the abstraction hierarchy, or for […]

Reply: Node is responsible for the health assessment of a single RPC that works only with one chain. Node does not store […]
@@ -73,14 +82,14 @@ type node[
 	RPC NodeClient[CHAIN_ID, HEAD],
 ] struct {
 	services.StateMachine
-	lfcLog              logger.Logger
-	name                string
-	id                  int32
-	chainID             CHAIN_ID
-	nodePoolCfg         NodeConfig
-	noNewHeadsThreshold time.Duration
-	order               int32
-	chainFamily         string
+	lfcLog      logger.Logger
+	name        string
+	id          int32
+	chainID     CHAIN_ID
+	nodePoolCfg NodeConfig
+	chainCfg    ChainConfig
+	order       int32
+	chainFamily string
 
 	ws   url.URL
 	http *url.URL
@@ -90,8 +99,9 @@ type node[
 	stateMu sync.RWMutex // protects state* fields
 	state   nodeState
 	// Each node is tracking the last received head number and total difficulty
-	stateLatestBlockNumber     int64
-	stateLatestTotalDifficulty *big.Int
+	stateLatestBlockNumber          int64
+	stateLatestTotalDifficulty      *big.Int
+	stateLatestFinalizedBlockNumber int64
 
 	// nodeCtx is the node lifetime's context
 	nodeCtx context.Context
@@ -113,7 +123,7 @@ func NewNode[
 	RPC NodeClient[CHAIN_ID, HEAD],
 ](
 	nodeCfg NodeConfig,
-	noNewHeadsThreshold time.Duration,
+	chainCfg ChainConfig,
 	lggr logger.Logger,
 	wsuri url.URL,
 	httpuri *url.URL,
@@ -129,7 +139,7 @@
 	n.id = id
 	n.chainID = chainID
 	n.nodePoolCfg = nodeCfg
-	n.noNewHeadsThreshold = noNewHeadsThreshold
+	n.chainCfg = chainCfg
 	n.ws = wsuri
 	n.order = nodeOrder
 	if httpuri != nil {
@@ -22,6 +22,10 @@ var (
 		Name: "pool_rpc_node_highest_seen_block",
 		Help: "The highest seen block for the given RPC node",
 	}, []string{"chainID", "nodeName"})
+	promPoolRPCNodeHighestFinalizedBlock = promauto.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "pool_rpc_node_highest_finalized_block",
+		Help: "The highest seen finalized block for the given RPC node",
+	}, []string{"chainID", "nodeName"})
 	promPoolRPCNodeNumSeenBlocks = promauto.NewCounterVec(prometheus.CounterOpts{
 		Name: "pool_rpc_node_num_seen_blocks",
 		Help: "The total number of new blocks seen by the given RPC node",
@@ -88,7 +92,7 @@ func (n *node[CHAIN_ID, HEAD, RPC]) aliveLoop() {
 		}
 	}
 
-	noNewHeadsTimeoutThreshold := n.noNewHeadsThreshold
+	noNewHeadsTimeoutThreshold := n.chainCfg.NodeNoNewHeadsThreshold()
 	pollFailureThreshold := n.nodePoolCfg.PollFailureThreshold()
 	pollInterval := n.nodePoolCfg.PollInterval()
@@ -134,6 +138,14 @@ func (n *node[CHAIN_ID, HEAD, RPC]) aliveLoop() {
 		lggr.Debug("Polling disabled")
 	}
 
+	var pollFinalizedHeadCh <-chan time.Time
+	if n.nodePoolCfg.FinalizedBlockPollInterval() > 0 {
+		lggr.Debugw("Finalized block polling enabled")
+		pollT := time.NewTicker(n.nodePoolCfg.FinalizedBlockPollInterval())
+		defer pollT.Stop()
+		pollFinalizedHeadCh = pollT.C
+	}
+
 	_, highestReceivedBlockNumber, _ := n.StateAndLatest()
 	var pollFailures uint32

Review thread on the new ticker:

Comment: Won't the NewTicker() panic if the parameter is 0?

Reply: Yes, NewTicker panics if the parameter is 0; that's why we do not initialize it unless the provided value is > 0.

Comment: Sounds good.

Comment: IMHO, it should be possible to disable the check. Other health checks are optional; I do not see why this one should be an exception.
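For context on why leaving pollFinalizedHeadCh unset is enough to disable the check: a receive from a nil channel blocks forever, so a select case on a nil channel never fires. A minimal standalone sketch of that pattern (names here are illustrative, not the PR's code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// When the poll interval is zero the ticker is never created and the
	// channel stays nil; a receive on a nil channel blocks forever, so the
	// corresponding select case is effectively disabled.
	var pollCh <-chan time.Time // nil: finalized-block polling disabled

	interval := 0 * time.Second
	if interval > 0 {
		t := time.NewTicker(interval) // would panic if called with interval 0
		defer t.Stop()
		pollCh = t.C
	}

	select {
	case <-pollCh: // never fires while pollCh is nil
		fmt.Println("poll tick")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("no poll tick: polling disabled")
	}
}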
@@ -201,6 +213,13 @@ func (n *node[CHAIN_ID, HEAD, RPC]) aliveLoop() {
 				outOfSyncT.Reset(noNewHeadsTimeoutThreshold)
 			}
 			n.setLatestReceived(bh.BlockNumber(), bh.BlockDifficulty())
+			if !n.chainCfg.FinalityTagEnabled() {
+				latestFinalizedBN := max(bh.BlockNumber()-int64(n.chainCfg.FinalityDepth()), 0)
+				if latestFinalizedBN > n.stateLatestFinalizedBlockNumber {
+					promPoolRPCNodeHighestFinalizedBlock.WithLabelValues(n.chainID.String(), n.name).Set(float64(latestFinalizedBN))
+					n.stateLatestFinalizedBlockNumber = latestFinalizedBN
+				}
+			}
 		case err := <-sub.Err():
 			lggr.Errorw("Subscription was terminated", "err", err, "nodeState", n.State())
 			n.declareUnreachable()
@@ -214,13 +233,33 @@ func (n *node[CHAIN_ID, HEAD, RPC]) aliveLoop() {
 					lggr.Criticalf("RPC endpoint detected out of sync; %s %s", msgCannotDisable, msgDegradedState)
 					// We don't necessarily want to wait the full timeout to check again, we should
 					// check regularly and log noisily in this state
-					outOfSyncT.Reset(zombieNodeCheckInterval(n.noNewHeadsThreshold))
+					outOfSyncT.Reset(zombieNodeCheckInterval(noNewHeadsTimeoutThreshold))
 					continue
 				}
 			}
 			n.declareOutOfSync(func(num int64, td *big.Int) bool { return num < highestReceivedBlockNumber })
 			return
+		case <-pollFinalizedHeadCh:
+			ctx, cancel := context.WithTimeout(n.nodeCtx, n.nodePoolCfg.FinalizedBlockPollInterval())
+			latestFinalized, err := n.RPC().LatestFinalizedBlock(ctx)
+			cancel()
+			if err != nil {
+				lggr.Warnw("Failed to fetch latest finalized block", "err", err)
+				continue
+			}
+
+			if !latestFinalized.IsValid() {
+				lggr.Warn("Latest finalized block is not valid")
+				continue
+			}
+
+			latestFinalizedBN := latestFinalized.BlockNumber()
+			if latestFinalizedBN > n.stateLatestFinalizedBlockNumber {
+				promPoolRPCNodeHighestFinalizedBlock.WithLabelValues(n.chainID.String(), n.name).Set(float64(latestFinalizedBN))
+				n.stateLatestFinalizedBlockNumber = latestFinalizedBN
+			}
 		}
 	}
 }
@@ -316,7 +355,7 @@ func (n *node[CHAIN_ID, HEAD, RPC]) outOfSyncLoop(isOutOfSync func(num int64, td
 			return
 		}
 		lggr.Debugw(msgReceivedBlock, "blockNumber", head.BlockNumber(), "blockDifficulty", head.BlockDifficulty(), "nodeState", n.State())
-	case <-time.After(zombieNodeCheckInterval(n.noNewHeadsThreshold)):
+	case <-time.After(zombieNodeCheckInterval(n.chainCfg.NodeNoNewHeadsThreshold())):
 		if n.nLiveNodes != nil {
 			if l, _, _ := n.nLiveNodes(); l < 1 {
 				lggr.Critical("RPC endpoint is still out of sync, but there are no other available nodes. This RPC node will be forcibly moved back into the live pool in a degraded state")
Review thread on the finalized block poll interval:

Comment: Just wondering, do we need different configs for each type of poll? What if we just reuse the PollInterval for all polling?

Comment: On some chains, we may also look for new heads via polling rather than via subscribe. In that case too, we wouldn't want to poll separately for new heads and new finalized heads; we would mostly just make a single batch call to get both. So that's why I am thinking: could we group everything to be polled under the same config and fetch it together in a batch call?

Comment: I have a similar impression; it would be more efficient to batch whenever we can instead of introducing a new ticker that will increase pressure on the RPC. AFAIK we already poll the RPC to verify that it's healthy, so we could probably use that logic for adding the latest finalized block. If we manage to bundle that together into a single batch, then we get finality tracking for free. Maybe reuse the existing `<-pollCh`?
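To make the batching idea concrete, here is a hedged sketch only, not the PR's code or its RPC client API; it assumes a go-ethereum JSON-RPC client, a placeholder endpoint URL, and the standard eth_getBlockByNumber method with the "latest" and "finalized" tags:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

// header is a minimal view of the block fields we care about.
type header struct {
	Number string `json:"number"` // hex-encoded block number
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := rpc.DialContext(ctx, "wss://example-rpc.invalid") // placeholder endpoint
	if err != nil {
		panic(err)
	}
	defer client.Close()

	var latest, finalized header
	// One batch call fetches both the latest head and the latest finalized
	// head, instead of two separate polls.
	batch := []rpc.BatchElem{
		{Method: "eth_getBlockByNumber", Args: []interface{}{"latest", false}, Result: &latest},
		{Method: "eth_getBlockByNumber", Args: []interface{}{"finalized", false}, Result: &finalized},
	}
	if err := client.BatchCallContext(ctx, batch); err != nil {
		panic(err)
	}
	for _, el := range batch {
		if el.Error != nil {
			panic(el.Error)
		}
	}
	fmt.Println("latest:", latest.Number, "finalized:", finalized.Number)
}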
Comment: I'm not in favour of merging all of the polls into a single ticker, as they check different properties of the RPC. Poll checks that the RPC is reachable; this is a super basic check, and we want to do it often and be aggressive with the timeouts: if an RPC needs > 1s to return its version, it's not healthy, while for the finalized block a longer interval seems fine. Regarding the new heads polling, it makes sense to batch poll in that case.