-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new metrics requested by operators #2182
Conversation
Not sure what's going on here with the @mzabaluev any idea? |
cb28a96
to
049eeae
Compare
@romac When was |
That was it, thanks @mzabaluev! |
340d252
to
2d452cd
Compare
…te, with a 1 hour TTL and 30min TTI
553ee14
to
e525576
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some types need documentation, but other than that, looks good to me 🙂
I just now had a chance to play with the new metrics. Awesome work! I wanted to ask about these three values
it seems like they're always |
These are bucket counts, used to compute a histogram but the bucket bounds are wrong. We can just ignore them or adapt them once we have some idea of what the typical latency is in production. https://opentelemetry.io/docs/reference/specification/metrics/datamodel/#histogram |
thanks for the doc! I think we should either:
|
telemetry!({ | ||
let (chain, counterparty, channel_id, port_id) = self.target_info(odata.target); | ||
|
||
ibc_telemetry::global().tx_submitted( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we're counting submitted transactions here, but this can lead to miscounting in two scenarios:
- when there is an account sequence mismatch, we may submit the same transaction twice, so we can run into situations where we end up with
tx_latency_confirmed_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 4
tx_latency_submitted_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 5
and there will be a permanent offset between "submitted" and "confirmed", which is confusing: it can mean that a transaction was submitted but never confirmed, indicating some transactions maybe leaked
- in the runtime, we may split this set of
msgs
across multiple transactions, but we only record a single record in thetx_latency_submitted_count
metric. so the relayer may be submitting 10 separate transactions but only record 1 value here. I'm not sure this is a problem, it depends on users expectations actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this really matter? We are not counting transactions here but just measuring their latency, whether they're submitted twice or eventually rejected. The count here is only useful for computing the mean imho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point!
The question is if relayer operators use the tx_latency_submitted_count
(resp. tx_latency_submitted_count
) as a standalone value -- or do they use it solely for finding the latency mean? I will ask this Q in the next call with operators.
As a side note, while playing with, I think I found a way to fix the problem with a tentative solution here 45e3a94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good, and makes the metric usable both ways so let's roll with that if someone can confirm it works. Won't have time to check myself unfortunately.
…tx_latency_submitted`, `tx_latency_confirmed`, `msg_num` (informalsystems#2182) * Add metric for number of WS reconnections * Reformatting * Add metric for number of IBC events received over WebSocket * Add missing doc comments * Add metric for number of messages submitted to a chain * Add facility for querying an account's balance via the bank module * Code reorganization * Move `query_balance` to `ChainEndpoint` * Add `wallet_balance` metric * Add `query_balance` to the chain handle * Update parameter name * Add wallet worker to monitor wallet balance and populate corresponding metric * Update bank proto * Implement `tx_latency` metric * Use `moka::sync::Cache` to capture in-flight txs in the telemetry state, with a 1 hour TTL and 30min TTI * Use short UUID * Rename `tx_latency` to `tx_latency_confirmed` and add `tx_latency_submitted` * Remove useless clone * Add more labels to `tx_latency_*` metrics * Fix for rebase on master * Update guide with new metrics * Add changelog entry * Do not add tracking id as label to tx_latency metrics to avoid blowing up the Prometheus exporter * Use custom aggregator to properly report wallet balance metric * Update guide metrics * Document `TrackingId` * Rename constructors of `TrackingId` * Turn comments into doc comments * Rename `chain::tx` module to `chain::tracking`
Closes: #2112
Description
wallet_balance
ws_events
ws_reconnect
tx_latency_submitted
tx_latency_confirmed
tx_confirmation = true
.msg_num
Example
PR author checklist:
unclog
.docs/
).Reviewer checklist:
Files changed
in the GitHub PR explorer.