Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new metrics requested by operators #2182

Merged
merged 31 commits into from
May 17, 2022
Merged

Add new metrics requested by operators #2182

merged 31 commits into from
May 17, 2022

Conversation

romac
Copy link
Member

@romac romac commented May 4, 2022

Closes: #2112

Description

Done Metric name Description Meter type
☑️ wallet_balance An indicator with how much balance (coins) there is left in each wallet key that Hermes is using. ValueRecorder
☑️ ws_events Indicates how many IBC events did Hermes receive via the websocket subscription, in total since starting up, per chain. Counter
☑️ ws_reconnect Number of times Hermes had to reconnect to the WebSocket endpoint Counter
☑️ tx_latency_submitted Indicates the latency for all transactions submitted to a chain (i.e., difference between the moment when Hermes received an event until the corresponding transaction(s) were submitted). ValueRecorder
☑️ tx_latency_confirmed Indicates the latency for all transactions confirmed by a chain (i.e., difference between the moment when Hermes received an event until the corresponding transaction(s) were confirmed). Requires tx_confirmation = true. ValueRecorder
☑️ msg_num Indicates how many messages Hermes submitted to a specific chain. Counter

Example

# HELP msg_num How many messages Hermes submitted to the chain, per chain
# TYPE msg_num counter
msg_num{chain="ibc-0"} 168
msg_num{chain="ibc-1"} 156

# HELP tx_latency_confirmed The latency for all transactions submitted to a specific chain, i.e. the difference between the moment when Hermes received a batch of events until the corresponding transaction(s) were confirmed. Milliseconds.
# TYPE tx_latency_confirmed histogram
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.5"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.9"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.99"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="+Inf"} 4
tx_latency_confirmed_sum{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 22466
tx_latency_confirmed_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 4
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.5"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.9"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.99"} 0
tx_latency_confirmed_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="+Inf"} 1
tx_latency_confirmed_sum{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer"} 4256
tx_latency_confirmed_count{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer"} 1
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.5"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.9"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.99"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="+Inf"} 2
tx_latency_confirmed_sum{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 9408
tx_latency_confirmed_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 2
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.5"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.9"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.99"} 0
tx_latency_confirmed_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="+Inf"} 1
tx_latency_confirmed_sum{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer"} 3173
tx_latency_confirmed_count{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer"} 1

# HELP tx_latency_submitted The latency for all transactions submitted to a specific chain, i.e. the difference between the moment when Hermes received a batch of events and when it submitted the corresponding transaction(s). Milliseconds.
# TYPE tx_latency_submitted histogram
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.5"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.9"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.99"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="+Inf"} 5
tx_latency_submitted_sum{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 14428
tx_latency_submitted_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 5
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.5"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.9"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="0.99"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer",le="+Inf"} 1
tx_latency_submitted_sum{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer"} 729
tx_latency_submitted_count{chain="ibc-0",channel="channel-1",counterparty="ibc-1",port="transfer"} 1
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.5"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.9"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="0.99"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer",le="+Inf"} 2
tx_latency_submitted_sum{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 1706
tx_latency_submitted_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 2
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.5"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.9"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="0.99"} 0
tx_latency_submitted_bucket{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer",le="+Inf"} 1
tx_latency_submitted_sum{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer"} 791
tx_latency_submitted_count{chain="ibc-1",channel="channel-1",counterparty="ibc-0",port="transfer"} 1

# HELP wallet_balance The balance in each wallet that Hermes is using, per wallet, denom and chain
# TYPE wallet_balance gauge
wallet_balance{account="cosmos1934akx97773lsjjs9x74dr03uuam29hcc9grp3",chain="ibc-0",denom="stake"} 99999970473
wallet_balance{account="cosmos1hngzqscyg476nd68qggxps8r2aq56lne45ps8n",chain="ibc-1",denom="stake"} 99999978431

# HELP ws_events How many IBC events did Hermes receive via the WebSocket subscription, per chain
# TYPE ws_events counter
ws_events{chain="ibc-0"} 443
ws_events{chain="ibc-1"} 370

PR author checklist:

  • Linked to GitHub issue.
  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@romac
Copy link
Member Author

romac commented May 10, 2022

Not sure what's going on here with the tonic-related errors.

@mzabaluev any idea?

@romac romac force-pushed the romac/2112-telemetry branch from cb28a96 to 049eeae Compare May 11, 2022 08:28
@mzabaluev
Copy link
Contributor

@romac When was proto/src/prost/cosmos.bank.v1beta1.rs generated?
It's best to merge from master, then regenerate the proto files with ./scripts/sync-protobuf.sh, because prost-build has been updated.

@romac
Copy link
Member Author

romac commented May 11, 2022

That was it, thanks @mzabaluev!

@romac romac force-pushed the romac/2112-telemetry branch from 340d252 to 2d452cd Compare May 12, 2022 13:14
@romac romac marked this pull request as ready for review May 16, 2022 14:16
Copy link
Contributor

@seanchen1991 seanchen1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some types need documentation, but other than that, looks good to me 🙂

@romac romac merged commit c5c89dc into master May 17, 2022
@romac romac deleted the romac/2112-telemetry branch May 17, 2022 14:06
@adizere
Copy link
Member

adizere commented May 20, 2022

I just now had a chance to play with the new metrics. Awesome work!

I wanted to ask about these three values

tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.5"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.9"} 0
tx_latency_submitted_bucket{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer",le="0.99"} 0

it seems like they're always 0 and not sure how to interpret them. Any thoughts @seanchen1991 or @romac ?

@romac
Copy link
Member Author

romac commented May 20, 2022

These are bucket counts, used to compute a histogram but the bucket bounds are wrong. We can just ignore them or adapt them once we have some idea of what the typical latency is in production.

https://opentelemetry.io/docs/reference/specification/metrics/datamodel/#histogram

@adizere
Copy link
Member

adizere commented May 20, 2022

thanks for the doc!

I think we should either:

  1. fill in with real numbers

    • I tried with 1s/10s/20s on a local setup and those values seem meaningful for local testnets
    • for production values, this is something we can get help from @mircea-c hopefully. but if we don't before the release I'm not sure we should release it as it is, as they are confusing
  2. put the median, first quartile, third quartile, instead of bucket bounds

    • for bucket bounds, not sure what value do they add even if we have production data -> it would be better to have data is not specific to the performance of machines or block frequency because these vary wildly
    • quartiles don't need to be parametrized, so they will make sense regardless of relayer machine configuration or blockchain parameters
    • edit: this is not currently possible with the prometheus opentelemetry libraries in Rust
  3. or keep only the sum and count (which can lead to computing the average hopefully in Grafana)

telemetry!({
let (chain, counterparty, channel_id, port_id) = self.target_info(odata.target);

ibc_telemetry::global().tx_submitted(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're counting submitted transactions here, but this can lead to miscounting in two scenarios:

  1. when there is an account sequence mismatch, we may submit the same transaction twice, so we can run into situations where we end up with

tx_latency_confirmed_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 4
tx_latency_submitted_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 5

and there will be a permanent offset between "submitted" and "confirmed", which is confusing: it can mean that a transaction was submitted but never confirmed, indicating some transactions maybe leaked

  1. in the runtime, we may split this set of msgs across multiple transactions, but we only record a single record in the tx_latency_submitted_count metric. so the relayer may be submitting 10 separate transactions but only record 1 value here. I'm not sure this is a problem, it depends on users expectations actually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this really matter? We are not counting transactions here but just measuring their latency, whether they're submitted twice or eventually rejected. The count here is only useful for computing the mean imho.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point!

The question is if relayer operators use the tx_latency_submitted_count (resp. tx_latency_submitted_count) as a standalone value -- or do they use it solely for finding the latency mean? I will ask this Q in the next call with operators.

As a side note, while playing with, I think I found a way to fix the problem with a tentative solution here 45e3a94

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good, and makes the metric usable both ways so let's roll with that if someone can confirm it works. Won't have time to check myself unfortunately.

hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
…tx_latency_submitted`, `tx_latency_confirmed`, `msg_num` (informalsystems#2182)

* Add metric for number of WS reconnections

* Reformatting

* Add metric for number of IBC events received over WebSocket

* Add missing doc comments

* Add metric for number of messages submitted to a chain

* Add facility for querying an account's balance via the bank module

* Code reorganization

* Move `query_balance` to `ChainEndpoint`

* Add `wallet_balance` metric

* Add `query_balance` to the chain handle

* Update parameter name

* Add wallet worker to monitor wallet balance and populate corresponding metric

* Update bank proto

* Implement `tx_latency` metric

* Use `moka::sync::Cache` to capture in-flight txs in the telemetry state, with a 1 hour TTL and 30min TTI

* Use short UUID

* Rename `tx_latency` to `tx_latency_confirmed` and add `tx_latency_submitted`

* Remove useless clone

* Add more labels to `tx_latency_*` metrics

* Fix for rebase on master

* Update guide with new metrics

* Add changelog entry

* Do not add tracking id as label to tx_latency metrics to avoid blowing up the Prometheus exporter

* Use custom aggregator to properly report wallet balance metric

* Update guide metrics

* Document `TrackingId`

* Rename constructors of `TrackingId`

* Turn comments into doc comments

* Rename `chain::tx` module to `chain::tracking`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hermes telemetry requirements
5 participants