Report on slow/stalled channel traffic #2175

Closed
@greg-szabo

Description

Summary

User story: I need to better understand whether a channel is being relayed properly. For monitoring and alerting purposes, I need to be able to query the oldest sequence number that is still in the queue for a specific channel and find out how old the packet is (its submission date).

Problem Definition

We need to better monitor whether a channel is being relayed properly. Out-of-band monitoring has the benefit of not relying on the technology that actually does the relaying, but it has the disadvantage that it has to describe the infrastructure and application setup yet again from scratch. For example, the Hermes config details the relationships among networks so well that if I say "channel-0" on the Osmosis network, everyone (including a program) understands exactly what that means (which endpoint represents it, which wallet I can use to manage it, etc.).

Implementing this (and subsequent monitoring-related) feature in Hermes takes advantage of the already existing configuration and library knowledge of endpoints. (Writing curl scripts to poll endpoint health is not fun, especially over gRPC.)

Including this and similar requests makes Hermes "ready with batteries" for production use, including monitoring assets. (Well, Prometheus endpoints or HTTP API calls or somesuch. The operator still needs to gather the data somewhere and present it, say, using Grafana.)

The disadvantage of this kind of feature is that it opens up a topic that is not strictly about IBC as a protocol but more about "IBC as a product used in servers". Personally, I think it shows the maturity of a project, but others might have differing opinions. This request is fairly specific, which might be good (if everyone needs it) or not so good (if it only serves one operator's specific use case).

Proposal

There is a monitoring bot on Discord that essentially does something similar. The goal is to find out if a channel has "stuck" packets: we define "stuck" packets as packets that haven't been relayed for 5 minutes.
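The "stuck" definition above can be sketched as a small predicate. This is only an illustration in Python, not Hermes code: the function name `is_stuck` and the constant `STUCK_THRESHOLD` are hypothetical, and treating an empty queue as "not stuck" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the 5-minute window from the definition above.
STUCK_THRESHOLD = timedelta(minutes=5)


def is_stuck(
    oldest_submitted_at: Optional[datetime],
    now: datetime,
    threshold: timedelta = STUCK_THRESHOLD,
) -> bool:
    """Return True if the oldest pending packet has waited at least `threshold`.

    `oldest_submitted_at` is the submission time of the oldest packet still
    in the queue, or None when the queue is empty (assumed to mean "not stuck").
    """
    if oldest_submitted_at is None:
        return False
    return now - oldest_submitted_at >= threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stuck(now - timedelta(minutes=7), now))   # packet pending for 7 minutes
    print(is_stuck(now - timedelta(seconds=30), now))  # packet pending for 30 seconds
```

The threshold is a parameter because the operator may want a window other than 5 minutes.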

One implementation idea:
One or more Prometheus metrics for each channel configured in Hermes, showing the oldest sequence number still in the queue on that channel as well as the submission date associated with that sequence number. (This needs an extra query to the channel.)

Any monitoring tool could pick this up and alert on it every 5 minutes (or at whatever interval the operator configures).
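To make the metric idea concrete, here is a rough sketch of what the per-channel output could look like in the Prometheus text exposition format. The metric names (`example_backlog_oldest_sequence`, `example_backlog_oldest_timestamp_seconds`), the label names, and the `ChannelBacklog` type are all made up for illustration; none of them are existing Hermes metrics or types.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ChannelBacklog:
    """Hypothetical snapshot of one channel's pending queue."""
    chain: str             # e.g. "osmosis-1"
    channel: str           # e.g. "channel-0"
    oldest_sequence: int   # oldest sequence number still in the queue
    oldest_timestamp: int  # submission time of that packet, Unix seconds


# (metric name, help text, ChannelBacklog attribute); names are illustrative only.
METRICS = (
    ("example_backlog_oldest_sequence",
     "Oldest packet sequence still pending on the channel.",
     "oldest_sequence"),
    ("example_backlog_oldest_timestamp_seconds",
     "Submission time (Unix seconds) of the oldest pending packet.",
     "oldest_timestamp"),
)


def render_metrics(backlogs: Iterable[ChannelBacklog]) -> str:
    """Render two gauges per channel in the Prometheus text format.

    Assumes label values contain no characters that need escaping
    (backslash, double quote, newline).
    """
    backlogs = list(backlogs)
    lines = []
    for name, help_text, attr in METRICS:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for b in backlogs:
            labels = f'chain="{b.chain}",channel="{b.channel}"'
            lines.append(f"{name}{{{labels}}} {getattr(b, attr)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([ChannelBacklog("osmosis-1", "channel-0", 42, 1650000000)]))
```

With gauges like these, an alert on "stuck" packets could be an expression along the lines of `time() - example_backlog_oldest_timestamp_seconds > 300` on the Prometheus side, so the 5-minute window stays under the operator's control.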

Alternatively, if this doesn't fit the Prometheus metrics specs, it could be an HTTP web API call that responds with the data in a JSON object. Somehow, I feel Prometheus should fit here, but we're open to other implementations (even a CLI command, if necessary). The one implementation that doesn't work for us is writing this data into the log file. The data has to be independently queryable, mostly separated from the current operational state of Hermes. (As mentioned in the disadvantages.)
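For the JSON alternative, the response body might look something like the sketch below. The field names, the `backlog_response` function, and the choice to report an empty queue with null values are all assumptions for the sake of discussion, not an existing Hermes API.

```python
import json
from typing import Optional


def backlog_response(
    chain: str,
    channel: str,
    oldest_sequence: Optional[int],
    oldest_timestamp: Optional[int],
    now: int,
    stuck_after_seconds: int = 300,
) -> str:
    """Build a JSON body describing the backlog of one channel.

    Timestamps are Unix seconds. An empty queue is passed as None values
    and reported as not stuck (an assumption of this sketch).
    """
    if oldest_sequence is None or oldest_timestamp is None:
        age_seconds = None
        stuck = False
    else:
        age_seconds = now - oldest_timestamp
        stuck = age_seconds >= stuck_after_seconds
    body = {
        "chain": chain,
        "channel": channel,
        "oldest_sequence": oldest_sequence,
        "oldest_timestamp": oldest_timestamp,
        "age_seconds": age_seconds,
        "stuck": stuck,
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print(backlog_response("osmosis-1", "channel-0", 42, 1650000000, now=1650000400))
```

Returning both the raw timestamp and a computed `stuck` flag lets a simple poller use the flag directly, while a tool with its own thresholds can ignore it and work from the timestamp.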

Acceptance Criteria

  • There is a way to query the "stuck state" of a channel.
  • Alternatively: the backlog in a channel is exposed in telemetry, highlighting the timestamp of the oldest unrelayed packet.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Labels

  • E: osmosis - External: related to Osmosis
  • I: CLI - Internal: related to the relayer's CLI
  • I: logic - Internal: related to the relaying logic
  • I: telemetry - Internal: related to Telemetry & metrics
  • O: usability - Objective: cause to improve the user experience (UX) and ease using the product
