Fast start for chains configured with an allow list #1705

romac · 2021-12-21T14:29:26Z

Closes: #1536

Description

Split the worker spawning code into two parts:
- a) scan the chains for the clients, connections and channels
- b) spawn the appropriate workers based on the resulting scan
When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and query the allowed channels directly, resulting in much fewer queries and a faster start.
Add a --full-scan option to hermes start to opt out of the fast start mechanism and do a full scan.

PR author checklist:

Added changelog entry, using unclog.
Added tests: integration (for Hermes) or unit/mock tests (for modules).
Linked to GitHub issue.
Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

Reviewed Files changed in the GitHub PR explorer.
Manually tested (in case integration/unit/mock tests are absent).

…s and channels

… when some chains are using allow lists

adizere

Did a quick run, only aesthetic/log suggestions so far, otherwise looking good!

Will continue with more in-depth tests.

relayer-cli/src/commands/start.rs

relayer/src/supervisor/scan.rs

seanchen1991 · 2022-01-19T14:36:27Z

Would we want to set fast starting as the default Hermes start up behavior?

adizere · 2022-01-19T15:56:00Z

Two further suggestions:

It is confusing to understand exactly which are the workers that Hermes starts.
- concretely, I have a partially open connection connection-370 and a partially open channel channel-213 on chain vega-testnet; the channel channel-213 is configured with allow
- I set enabled = false for both [mode.connections] and [mode.channels] so we don't complete handshakes
- then we see these log lines, and it's confusing whether any worker is effectively starting up
- 2022-01-19T15:51:26.342253Z DEBUG ThreadId(01) connection connection-370 on chain vega-testnet is: Open, state on dest. chain (agoricdev-7) is: TryOpen
  2022-01-19T15:51:26.342560Z DEBUG ThreadId(01) done spawning workers for connection connection-370 on chain vega-testnet
  2022-01-19T15:51:26.342785Z DEBUG ThreadId(01) channel channel-213 on chain vega-testnet is: INIT; state on dest. chain (agoricdev-7) is: UNINITIALIZED
  2022-01-19T15:51:26.342965Z DEBUG ThreadId(01) done spawning workers for chain vega-testnet and channel channel-213
- I think the behavior of Hermes is correct: not starting any workers to complete the conn & chan handshakes, though these log lines seem to be saying "done spawning workers for connection connection-370 on chain vega-testnet" as if something started. Maybe we can improve somehow, eg. with spans?
I think there is a situation where we're aborting a packet worker, but the corresponding packet_cmd worker lingers on.
- we have a packet waiting to be relayed, but the problem is that the client is expired
- Hermes will launch the corresponding packet and packet_cmd workers as expected and try to handle the packet relaying
- 2022-01-19T15:20:07.005129Z INFO ThreadId(01) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: spawning
  2022-01-19T15:20:07.005687Z INFO ThreadId(01) packet{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: spawning
- then the packet_cmd worker aborts
- 2022-01-19T15:20:24.514649Z DEBUG ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}:relay{odata=-WVC16_h3f ->Destination @7-376539; len=1}: prepending Destination client update at height 7-376540
  2022-01-19T15:20:24.838707Z ERROR ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: aborting due to expired or frozen client
  2022-01-19T15:20:24.838943Z ERROR ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: aborting after encountering fatal error: Worker failed after 1 retries
  2022-01-19T15:20:24.839040Z INFO ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: terminated
- at this point, the packet worker is still running in the background
- 2022-01-19T15:21:03.424959Z INFO ThreadId(21)
  2022-01-19T15:21:03.425219Z INFO ThreadId(21) * Chains: a, agoricdev-7, agorictest-18, vega-testnet
  2022-01-19T15:21:03.425356Z INFO ThreadId(21) * Packet workers:
  2022-01-19T15:21:03.425577Z INFO ThreadId(21) - packet::channel-11/cosmos-transfer:agoricdev-7->vega-testnet (id: 1)
  2022-01-19T15:21:03.425707Z INFO ThreadId(21)

It's unclear how (and if) we should fix (2).

relayer/src/supervisor.rs

relayer/src/supervisor/scan.rs

romac · 2022-01-20T09:29:17Z

Added some metrics for queries, here are some preliminary results:

Result of a full scan

Chain: ibc0
- Client: 07-tendermint-0
  - Connection: connection-0
    - Channel: channel-0
    - Channel: channel-1
- Client: 07-tendermint-1
  - Connection: connection-1
    - Channel: channel-2
    - Channel: channel-3
    - Channel: channel-4
- Client: 07-tendermint-2
  - Connection: connection-2
    - Channel: channel-5
    - Channel: channel-6
    - Channel: channel-7
    - Channel: channel-8
Chain: ibc1
- Client: 07-tendermint-0
  - Connection: connection-0
    - Channel: channel-0
    - Channel: channel-1
- Client: 07-tendermint-1
  - Connection: connection-1
    - Channel: channel-2
    - Channel: channel-3
    - Channel: channel-4
- Client: 07-tendermint-2
  - Connection: connection-2
    - Channel: channel-5
    - Channel: channel-6
    - Channel: channel-7
    - Channel: channel-8

With `filter = false` (no fast start)

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="ibc0",query_type="query_channel"} 9
queries{chain="ibc0",query_type="query_client_connections"} 3
queries{chain="ibc0",query_type="query_clients"} 1
queries{chain="ibc0",query_type="query_connection"} 9
queries{chain="ibc0",query_type="query_connection_channels"} 3
queries{chain="ibc0",query_type="query_latest_height"} 1
queries{chain="ibc0",query_type="query_packet_acknowledgements"} 9
queries{chain="ibc0",query_type="query_packet_commitments"} 18
queries{chain="ibc0",query_type="query_unreceived_acknowledgements"} 9
queries{chain="ibc1",query_type="query_channel"} 9
queries{chain="ibc1",query_type="query_client_connections"} 3
queries{chain="ibc1",query_type="query_clients"} 1
queries{chain="ibc1",query_type="query_connection"} 9
queries{chain="ibc1",query_type="query_connection_channels"} 3
queries{chain="ibc1",query_type="query_latest_height"} 1
queries{chain="ibc1",query_type="query_packet_acknowledgements"} 9
queries{chain="ibc1",query_type="query_packet_commitments"} 18
queries{chain="ibc1",query_type="query_unreceived_acknowledgements"} 9

With `filter = true` and the following filters (fast start)

ibc0:

[chains.packet_filter]
policy = 'allow'
list = [
  ['transfer', 'channel-0'],
]

ibc1:

[chains.packet_filter]
policy = 'allow'
list = [
  ['transfer', 'channel-8'],
]

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="ibc0",query_type="query_channel"} 2
queries{chain="ibc0",query_type="query_client_state"} 1
queries{chain="ibc0",query_type="query_connection"} 3
queries{chain="ibc0",query_type="query_latest_height"} 1
queries{chain="ibc0",query_type="query_packet_acknowledgements"} 1
queries{chain="ibc0",query_type="query_packet_commitments"} 2
queries{chain="ibc0",query_type="query_unreceived_acknowledgements"} 1
queries{chain="ibc1",query_type="query_channel"} 2
queries{chain="ibc1",query_type="query_client_state"} 1
queries{chain="ibc1",query_type="query_connection"} 3
queries{chain="ibc1",query_type="query_latest_height"} 1
queries{chain="ibc1",query_type="query_packet_acknowledgements"} 1
queries{chain="ibc1",query_type="query_packet_commitments"} 2
queries{chain="ibc1",query_type="query_unreceived_acknowledgements"} 1

Note: The results above use mode.clients.enabled = false

romac · 2022-01-20T09:34:15Z

Would we want to set fast starting as the default Hermes start up behavior?

Yes, the relayer will use the fast start if possible, ie. when filter = true and there is an allow list configured for that chain.

adizere · 2022-01-20T12:58:25Z

The metrics are exactly what we need, amazing work there Romain!

On the Vega <> Agoric testnet.

With filter = false:

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="a",query_type="query_channel"} 3
queries{chain="a",query_type="query_client_connections"} 3
queries{chain="a",query_type="query_clients"} 1
queries{chain="a",query_type="query_connection"} 9
queries{chain="a",query_type="query_connection_channels"} 3
queries{chain="a",query_type="query_latest_height"} 1
queries{chain="a",query_type="query_packet_commitments"} 3
queries{chain="a",query_type="query_unreceived_acknowledgements"} 3
queries{chain="agoricdev-7",query_type="query_channel"} 6
queries{chain="agoricdev-7",query_type="query_client_connections"} 3
queries{chain="agoricdev-7",query_type="query_clients"} 1
queries{chain="agoricdev-7",query_type="query_connection"} 12
queries{chain="agoricdev-7",query_type="query_connection_channels"} 4
queries{chain="agoricdev-7",query_type="query_latest_height"} 1
queries{chain="agoricdev-7",query_type="query_packet_acknowledgements"} 1
queries{chain="agoricdev-7",query_type="query_packet_commitments"} 8
queries{chain="agoricdev-7",query_type="query_unreceived_acknowledgements"} 6
queries{chain="agorictest-18",query_type="query_clients"} 1
queries{chain="agorictest-18",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_channel"} 9
queries{chain="vega-testnet",query_type="query_client_connections"} 9
queries{chain="vega-testnet",query_type="query_clients"} 1
queries{chain="vega-testnet",query_type="query_connection"} 15
queries{chain="vega-testnet",query_type="query_connection_channels"} 7
queries{chain="vega-testnet",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_packet_acknowledgements"} 9
queries{chain="vega-testnet",query_type="query_packet_commitments"} 10
queries{chain="vega-testnet",query_type="query_unreceived_acknowledgements"} 1

With filter = true restricting packet relaying on five channels across 3 chains::

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="a",query_type="query_channel"} 1
queries{chain="a",query_type="query_client_state"} 1
queries{chain="a",query_type="query_connection"} 1
queries{chain="a",query_type="query_latest_height"} 1
queries{chain="a",query_type="query_packet_acknowledgements"} 1
queries{chain="a",query_type="query_packet_commitments"} 1
queries{chain="agoricdev-7",query_type="query_channel"} 4
queries{chain="agoricdev-7",query_type="query_client_state"} 1
queries{chain="agoricdev-7",query_type="query_connection"} 7
queries{chain="agoricdev-7",query_type="query_latest_height"} 1
queries{chain="agoricdev-7",query_type="query_packet_acknowledgements"} 1
queries{chain="agoricdev-7",query_type="query_packet_commitments"} 4
queries{chain="agoricdev-7",query_type="query_unreceived_acknowledgements"} 3
queries{chain="agorictest-18",query_type="query_clients"} 1
queries{chain="agorictest-18",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_channel"} 5
queries{chain="vega-testnet",query_type="query_client_state"} 3
queries{chain="vega-testnet",query_type="query_connection"} 7
queries{chain="vega-testnet",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_packet_acknowledgements"} 3
queries{chain="vega-testnet",query_type="query_packet_commitments"} 5
queries{chain="vega-testnet",query_type="query_unreceived_acknowledgements"} 2

Nit: Redundancy in the `channel` & `connection` workers span.

Check this out:

2022-01-20T12:48:33.000222Z  INFO ThreadId(01) channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:48:33.000754Z  INFO ThreadId(01) channel{channel=channel::channel-217/transfer:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:48:47.018574Z  INFO ThreadId(01) spawning
2022-01-20T12:48:47.018986Z  INFO ThreadId(01) cmd: spawning
2022-01-20T12:48:47.019382Z  INFO ThreadId(01) Hermes has started
2022-01-20T12:49:04.632104Z  INFO ThreadId(14) send_tx{id=agoricdev-7}: refresh: retrieved account sequence=28 number=97
2022-01-20T12:49:05.879965Z  INFO ThreadId(14) wait_for_block_commits: waiting for commit of tx hashes(s) DA98AC2FB36A61A6B4184AB06FCE5F3599DA9D6CFE9A17AB7B61D1FD0635337E id=agoricdev-7
2022-01-20T12:49:08.966554Z  INFO ThreadId(17) channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: channel handshake step completed with events: OpenTryChannel(

Notice that the span mentions the word channel 4 times "channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}".

Maybe we can make that less verbose?

The same applies for the connection worker.

2022-01-20T12:46:58.642088Z  INFO ThreadId(01) connection{connection=connection::connection-370:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:47:13.535306Z  INFO ThreadId(01) spawning
2022-01-20T12:47:13.535640Z  INFO ThreadId(01) cmd: spawning
2022-01-20T12:47:13.536045Z  INFO ThreadId(01) Hermes has started
2022-01-20T12:47:29.456739Z  INFO ThreadId(14) send_tx{id=agoricdev-7}: refresh: retrieved account sequence=27 number=97
2022-01-20T12:47:29.882795Z  INFO ThreadId(14) wait_for_block_commits: waiting for commit of tx hashes(s) 19096BD2AB7A47B2CD1BB41C43B3D476C5C7F6DE9348AC2C199432B81EB7ABA6 id=agoricdev-7
2022-01-20T12:47:34.753965Z  INFO ThreadId(17) connection{connection=connection::connection-370:vega-testnet -> agoricdev-7}: connection handshake step completed with events: OpenConfirmConnection(

romac · 2022-01-20T14:50:00Z

How would you suggest we clean them up? Perhaps from

channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning

to

channel_worker{object=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning worker

romac · 2022-01-20T15:11:50Z

Well, channel-213 is the identifier of the channel and the channel::prefix is there to ensure that we can tell that this object is a channel when we print it. By the way I am not sure I get how changing channel to object makes filtering less easy given that we can still filter on the channel:: prefix or the channel_worker span name, can you explain what you mean?

romac · 2022-01-20T15:29:04Z

@adizere Re:

I think there is a situation where we're aborting a packet worker, but the corresponding packet_cmd worker lingers on.

Not sure if this has anything to do with this PR, maybe we can track this in an issue and investigate at a later time?

mzabaluev · 2022-01-20T16:24:48Z

we can still filter on the channel:: prefix

Probably, or maybe not; I'm thinking of journalctl match syntax and other similar backend facilities that don't necessarily allow you to efficiently match by pattern. This may be a small concern, though, just my intuition is not to commingle things that can be easily distinguished.

adizere · 2022-01-20T17:28:24Z

Well, channel-213 is the identifier of the channel and the channel::prefix is there to ensure that we can tell that this object is a channel when we print it. By the way I am not sure I get how changing channel to object makes filtering less easy given that we can still filter on the channel:: prefix or the channel_worker span name, can you explain what you mean?

channel_worker{object=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning worker
object as the logging key provides less information for filters and such.
Is it possible to make the channel IDs etc. less verbose? Two of those channel repetitions seem to be in the ID.

I see. I guess we can ignore this idea then. Let's keep the spans as they are. Redundancy is not always bad.

adizere

Just had a relayer operator report on the outcome of using this PR. The Hermes start time reduced from ~40mins to ~10mins. This is anecdotal, however, since I think the 40min time included misbehavior detection. Then the Osmosis epoch time dropped and it messed-up our debugging session...

Will report again tomorrow.

For completeness, this are the query metrics.

Metrics

# TYPE queries counter
queries{chain="axelar-dojo-1",query_type="query_channel"} 2
queries{chain="axelar-dojo-1",query_type="query_client_state"} 201
queries{chain="axelar-dojo-1",query_type="query_connection"} 3
queries{chain="axelar-dojo-1",query_type="query_consensus_state"} 200
queries{chain="axelar-dojo-1",query_type="query_latest_height"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_commitments"} 2
queries{chain="axelar-dojo-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_channel"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_client_state"} 143
queries{chain="axelar-testnet-lisbon-2",query_type="query_connection"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_consensus_state"} 142
queries{chain="axelar-testnet-lisbon-2",query_type="query_latest_height"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_channel"} 2
queries{chain="bombay-12",query_type="query_client_state"} 1
queries{chain="bombay-12",query_type="query_connection"} 3
queries{chain="bombay-12",query_type="query_latest_height"} 1
queries{chain="bombay-12",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_unreceived_acknowledgements"} 1
queries{chain="columbus-5",query_type="query_channel"} 21
queries{chain="columbus-5",query_type="query_client_state"} 227
queries{chain="columbus-5",query_type="query_connection"} 15
queries{chain="columbus-5",query_type="query_consensus_state"} 209
queries{chain="columbus-5",query_type="query_latest_height"} 1
queries{chain="columbus-5",query_type="query_packet_acknowledgements"} 2
queries{chain="columbus-5",query_type="query_packet_commitments"} 5
queries{chain="columbus-5",query_type="query_status"} 9
queries{chain="columbus-5",query_type="query_unreceived_acknowledgements"} 9
queries{chain="columbus-5",query_type="query_unreceived_packets"} 6
queries{chain="comdex-1",query_type="query_channel"} 2
queries{chain="comdex-1",query_type="query_client_state"} 197
queries{chain="comdex-1",query_type="query_connection"} 3
queries{chain="comdex-1",query_type="query_consensus_state"} 194
queries{chain="comdex-1",query_type="query_latest_height"} 1
queries{chain="comdex-1",query_type="query_packet_acknowledgements"} 1
queries{chain="comdex-1",query_type="query_packet_commitments"} 2
queries{chain="comdex-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="core-1",query_type="query_channel"} 6
queries{chain="core-1",query_type="query_client_state"} 365
queries{chain="core-1",query_type="query_connection"} 6
queries{chain="core-1",query_type="query_consensus_state"} 360
queries{chain="core-1",query_type="query_latest_height"} 1
queries{chain="core-1",query_type="query_packet_acknowledgements"} 2
queries{chain="core-1",query_type="query_packet_commitments"} 4
queries{chain="core-1",query_type="query_status"} 1
queries{chain="core-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="core-1",query_type="query_unreceived_packets"} 1
queries{chain="cosmoshub-4",query_type="query_account"} 1
queries{chain="cosmoshub-4",query_type="query_channel"} 67
queries{chain="cosmoshub-4",query_type="query_client_state"} 602
queries{chain="cosmoshub-4",query_type="query_connection"} 29
queries{chain="cosmoshub-4",query_type="query_consensus_state"} 570
queries{chain="cosmoshub-4",query_type="query_latest_height"} 1
queries{chain="cosmoshub-4",query_type="query_packet_acknowledgements"} 5
queries{chain="cosmoshub-4",query_type="query_packet_commitments"} 8
queries{chain="cosmoshub-4",query_type="query_status"} 28
queries{chain="cosmoshub-4",query_type="query_unreceived_acknowledgements"} 44
queries{chain="cosmoshub-4",query_type="query_unreceived_packets"} 22
queries{chain="emoney-3",query_type="query_channel"} 2
queries{chain="emoney-3",query_type="query_client_state"} 373
queries{chain="emoney-3",query_type="query_connection"} 3
queries{chain="emoney-3",query_type="query_consensus_state"} 372
queries{chain="emoney-3",query_type="query_latest_height"} 1
queries{chain="emoney-3",query_type="query_packet_acknowledgements"} 1
queries{chain="emoney-3",query_type="query_packet_commitments"} 2
queries{chain="emoney-3",query_type="query_unreceived_acknowledgements"} 1
queries{chain="impacthub-3",query_type="query_channel"} 4
queries{chain="impacthub-3",query_type="query_client_state"} 406
queries{chain="impacthub-3",query_type="query_connection"} 6
queries{chain="impacthub-3",query_type="query_consensus_state"} 404
queries{chain="impacthub-3",query_type="query_latest_height"} 1
queries{chain="impacthub-3",query_type="query_packet_acknowledgements"} 2
queries{chain="impacthub-3",query_type="query_packet_commitments"} 4
queries{chain="impacthub-3",query_type="query_unreceived_acknowledgements"} 2
queries{chain="juno-1",query_type="query_channel"} 15
queries{chain="juno-1",query_type="query_client_state"} 374
queries{chain="juno-1",query_type="query_connection"} 14
queries{chain="juno-1",query_type="query_consensus_state"} 347
queries{chain="juno-1",query_type="query_latest_height"} 1
queries{chain="juno-1",query_type="query_packet_acknowledgements"} 4
queries{chain="juno-1",query_type="query_packet_commitments"} 7
queries{chain="juno-1",query_type="query_status"} 6
queries{chain="juno-1",query_type="query_txs"} 1
queries{chain="juno-1",query_type="query_unreceived_acknowledgements"} 5
queries{chain="juno-1",query_type="query_unreceived_packets"} 6
queries{chain="kichain-2",query_type="query_channel"} 10
queries{chain="kichain-2",query_type="query_client_state"} 411
queries{chain="kichain-2",query_type="query_connection"} 9
queries{chain="kichain-2",query_type="query_consensus_state"} 404
queries{chain="kichain-2",query_type="query_latest_height"} 1
queries{chain="kichain-2",query_type="query_packet_acknowledgements"} 3
queries{chain="kichain-2",query_type="query_packet_commitments"} 7
queries{chain="kichain-2",query_type="query_status"} 2
queries{chain="kichain-2",query_type="query_unreceived_acknowledgements"} 5
queries{chain="kichain-2",query_type="query_unreceived_packets"} 1
queries{chain="lum-network-1",query_type="query_channel"} 6
queries{chain="lum-network-1",query_type="query_client_state"} 198
queries{chain="lum-network-1",query_type="query_connection"} 6
queries{chain="lum-network-1",query_type="query_consensus_state"} 191
queries{chain="lum-network-1",query_type="query_latest_height"} 1
queries{chain="lum-network-1",query_type="query_packet_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_packet_commitments"} 4
queries{chain="lum-network-1",query_type="query_status"} 1
queries{chain="lum-network-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_unreceived_packets"} 1
queries{chain="osmosis-1",query_type="query_account"} 1
queries{chain="osmosis-1",query_type="query_blocks"} 2
queries{chain="osmosis-1",query_type="query_channel"} 149
queries{chain="osmosis-1",query_type="query_client_state"} 6430
queries{chain="osmosis-1",query_type="query_connection"} 137
queries{chain="osmosis-1",query_type="query_consensus_state"} 6196
queries{chain="osmosis-1",query_type="query_latest_height"} 1
queries{chain="osmosis-1",query_type="query_packet_acknowledgements"} 19
queries{chain="osmosis-1",query_type="query_packet_commitments"} 41
queries{chain="osmosis-1",query_type="query_status"} 35
queries{chain="osmosis-1",query_type="query_txs"} 12
queries{chain="osmosis-1",query_type="query_unreceived_acknowledgements"} 38
queries{chain="osmosis-1",query_type="query_unreceived_packets"} 26
queries{chain="panacea-3",query_type="query_account"} 1
queries{chain="panacea-3",query_type="query_channel"} 132
queries{chain="panacea-3",query_type="query_client_state"} 206
queries{chain="panacea-3",query_type="query_connection"} 4
queries{chain="panacea-3",query_type="query_consensus_state"} 205
queries{chain="panacea-3",query_type="query_latest_height"} 1
queries{chain="panacea-3",query_type="query_packet_acknowledgements"} 2
queries{chain="panacea-3",query_type="query_packet_commitments"} 4
queries{chain="panacea-3",query_type="query_status"} 4
queries{chain="panacea-3",query_type="query_txs"} 1
queries{chain="panacea-3",query_type="query_unreceived_acknowledgements"} 258
queries{chain="panacea-3",query_type="query_unreceived_packets"} 2
queries{chain="sentinelhub-2",query_type="query_channel"} 2
queries{chain="sentinelhub-2",query_type="query_client_state"} 432
queries{chain="sentinelhub-2",query_type="query_connection"} 3
queries{chain="sentinelhub-2",query_type="query_consensus_state"} 428
queries{chain="sentinelhub-2",query_type="query_latest_height"} 1
queries{chain="sentinelhub-2",query_type="query_packet_acknowledgements"} 1
queries{chain="sentinelhub-2",query_type="query_packet_commitments"} 2
queries{chain="sentinelhub-2",query_type="query_unreceived_acknowledgements"} 1
queries{chain="stargaze-1",query_type="query_channel"} 16
queries{chain="stargaze-1",query_type="query_client_state"} 392
queries{chain="stargaze-1",query_type="query_connection"} 14
queries{chain="stargaze-1",query_type="query_consensus_state"} 372
queries{chain="stargaze-1",query_type="query_latest_height"} 1
queries{chain="stargaze-1",query_type="query_packet_acknowledgements"} 2
queries{chain="stargaze-1",query_type="query_packet_commitments"} 6
queries{chain="stargaze-1",query_type="query_status"} 3
queries{chain="stargaze-1",query_type="query_unreceived_acknowledgements"} 6
queries{chain="stargaze-1",query_type="query_unreceived_packets"} 3
queries{chain="vidulum-1",query_type="query_channel"} 2
queries{chain="vidulum-1",query_type="query_client_state"} 195
queries{chain="vidulum-1",query_type="query_connection"} 3
queries{chain="vidulum-1",query_type="query_consensus_state"} 194
queries{chain="vidulum-1",query_type="query_latest_height"} 1
queries{chain="vidulum-1",query_type="query_packet_acknowledgements"} 1
queries{chain="vidulum-1",query_type="query_packet_commitments"} 2
queries{chain="vidulum-1",query_type="query_unreceived_acknowledgements"} 1
# HELP workers Number of workers per object
# TYPE workers gauge
workers{type="client"} 35
workers{type="packet"} 17

A question

One thing that stands out is this:

queries{chain="osmosis-1",query_type="query_client_state"} 6430

6K queries might explain why Hermes might need 10min to start, but not sure.

The question is: At startup time, assuming ~10 preconfigured channels for osmosis-1, why would Hermes need to query client_state 6K times? We should clarify this.

adizere · 2022-01-20T17:33:08Z

.changelog/v0.10.0/improvements/ibc-relayer/1536-fast-start.md

@@ -0,0 +1,4 @@
+- Reduce the startup time of the relayer by querying the necessary


Do we need this?

adizere · 2022-01-24T14:27:47Z

One valuable detail we found while parsing some logs with Romac is that Hermes scanning phase can be blazing fast (under 10 seconds for a relayer operator that has ~40 allowed channels on 17 chains) but the spawning phase is the one consuming most of the time. In an execution where the start-time was 10 minutes, 6 minutes of all that time went to only 3 spawning tasks, as follows:

almost 2 minutes

  665 Jan 20 17:03:19 Imperator-Relayer hermes[1854166]: 2022-01-20T17:03:19.229792Z DEBUG ThreadId(01) spawned Client worker: client::juno-1->osmosis-1:07-tendermint-1457
   666 Jan 20 17:05:06 Imperator-Relayer hermes[1854166]: 2022-01-20T17:05:06.504609Z DEBUG ThreadId(01) done spawning workers for chain osmosis-1 and channel channel-42

almost 2 minutes

   735 Jan 20 17:06:12 Imperator-Relayer hermes[1854166]: 2022-01-20T17:06:12.744757Z DEBUG ThreadId(01) channel channel-141 on chain cosmoshub-4 is: OPEN; state on dest. chain (osmosis-1) is: OPEN
   736 Jan 20 17:08:00 Imperator-Relayer hermes[1854166]: 2022-01-20T17:08:00.022189Z DEBUG ThreadId(01) refresh{client=07-tendermint-259 src_chain=osmosis-1 dst_chain=cosmoshub-4}: spawning task

2 minutes and 40s

   767 Jan 20 17:08:08 Imperator-Relayer hermes[1854166]: 2022-01-20T17:08:08.141602Z DEBUG ThreadId(01) spawned Client worker: client::osmosis-1->juno-1:07-tendermint-0
   768 Jan 20 17:10:55 Imperator-Relayer hermes[1854166]: 2022-01-20T17:10:55.065636Z DEBUG ThreadId(01) done spawning workers for chain juno-1 and channel channel-0

These are low-hanging fruits we should investigate. If we can eliminate the bottleneck here, the start time will reduce to less than half.

seanchen1991 · 2022-01-24T16:42:59Z

@adizere and I dug into this performance issue. Our strong suspicion is that the has_packets closure and has_acks closure are making many network calls and tanking the startup performance in certain cases.

Adi made the observation that these closures are fetching large vectors of data just to calculate a boolean value. Could we make use of query::pagination to fetch at most one acknowledgement when we're computing these closures?

romac · 2022-01-24T17:29:37Z

Very insightful, thanks! Will look into it.

adizere · 2022-01-25T10:32:37Z

@adizere and I dug into this performance issue. Our strong suspicion is that the has_packets closure and has_acks closure are making many network calls and tanking the startup performance in certain cases.

Adi made the observation that these closures are fetching large vectors of data just to calculate a boolean value. Could we make use of query::pagination to fetch at most one acknowledgement when we're computing these closures?

Building up on this idea, if we choose to optimize the start time, here's a sketch of how we can exploit pagination to reduce pressure/bandwidth on full node:

informalsystems/ibc-rs@5be11ae#diff-5d10b574e97438514650fd6f44664299baf20536aab73554ca2754906d2a72d4R76

I'd suggest the refactoring based on query::pagination can be done separately in a later PR in the next milestone. For releasing v0.11.0 we can go ahead with the current design.

romac · 2022-01-25T13:29:17Z

@adizere Thanks for the sketch! I have an alternative solution here, where we only query a single commitment and check if the resulting vector is empty or not: 83fc0db

Edit: slightly simplified in 5261d65

ancazamfir

wow! amazing work here. In addition to the perf improvement I think the scanning/ spawning code looks so much cleaner and easier to understand. Thanks!!

…1705) - Split the worker spawning code into two parts: - a) scan the chains for the clients, connections and channels - b) spawn the appropriate workers based on the resulting scan - When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and query the allowed channels directly, resulting in much fewer queries and a faster start. - Add a `--full-scan` option to `hermes start` to opt out of the fast start mechanism and do a full scan. --- * Introduce a `ChainScanner` to scan the chains for clients, connections and channels * Use `ChainScanner` for spawning workers * Formatting * Add `--full-scan` option to `start` command to force a full scan even when some chains are using allow lists * Remove debug statements and print scanned chains on startup * Changelog entry * Fix duplicate info message * Quote identifiers in log messages * Better error when port/channel does not exists * Add metrics for queries * Small log improvements * Rename queries metric * Use `chain` key for recording chain identifier in tracing logs * Use more structured logging in chain scanner * Fix changelog entry * Improve logs when no workers were spawned * Improve logs when spawning connection and channel workers * Remove spaces in objects names * Add changelog entry * Revert part of logs changes * Use INFO level for spawning logs * Remove redundant changelog entry

romac added 9 commits December 10, 2021 15:51

Introduce a ChainScanner to scan the chains for clients, connection…

883ee7d

…s and channels

Use ChainScanner for spawning workers

70f5e1f

Merge branch 'master' into romac/spawn-scanner

2a51530

Formatting

4ffffb7

Add --full-scan option to start command to force a full scan even…

8f1a62c

… when some chains are using allow lists

Remove debug statements and print scanned chains on startup

25e57de

Changelog entry

600a0c6

Merge branch 'master' into romac/spawn-scanner

1151a2d

Merge branch 'master' into romac/spawn-scanner

1986051

adizere requested changes Jan 19, 2022

View reviewed changes

relayer-cli/src/commands/start.rs Show resolved Hide resolved

relayer/src/supervisor/scan.rs Show resolved Hide resolved

relayer/src/supervisor/scan.rs Outdated Show resolved Hide resolved

romac added 4 commits January 19, 2022 14:35

Fix duplicate info message

25d5119

Quote identifiers in log messages

59d744f

Merge branch 'master' into romac/spawn-scanner

9404a81

Better error when port/channel does not exists

23924e9

adizere self-requested a review January 19, 2022 14:07

mzabaluev reviewed Jan 19, 2022

View reviewed changes

romac added 3 commits January 20, 2022 09:47

Merge branch 'master' into romac/spawn-scanner

28866ad

Add metrics for queries

56ff223

Small log improvements

3d450d9

Rename queries metric

5ea4f27

romac added 3 commits January 20, 2022 10:44

Use chain key for recording chain identifier in tracing logs

a414523

Use more structured logging in chain scanner

496d735

Fix changelog entry

d763b4d

Merge branch 'master' into romac/spawn-scanner

d7f829e

romac added 2 commits January 20, 2022 16:12

Improve logs when no workers were spawned

8a36fff

Improve logs when spawning connection and channel workers

8f49559

romac force-pushed the romac/spawn-scanner branch from 0f301b5 to 8f49559 Compare January 20, 2022 15:12

romac added 2 commits January 20, 2022 16:20

Remove spaces in objects names

763aaa0

Add changelog entry

111150c

romac marked this pull request as ready for review January 20, 2022 15:27

romac requested a review from ancazamfir as a code owner January 20, 2022 15:27

adizere reviewed Jan 20, 2022

View reviewed changes

adizere mentioned this pull request Jan 24, 2022

Treat Hermes 'empty wallet' error as non-recoverable #1800

Closed

6 tasks

Revert part of logs changes

48616b8

Merge branch 'master' into romac/spawn-scanner

13160a7

romac added 2 commits January 25, 2022 17:12

Use INFO level for spawning logs

c43ce1a

Remove redundant changelog entry

b0eafcf

adizere self-requested a review January 26, 2022 09:11

adizere approved these changes Jan 26, 2022

View reviewed changes

ancazamfir approved these changes Jan 26, 2022

View reviewed changes

romac merged commit edc9a2c into master Jan 26, 2022

romac deleted the romac/spawn-scanner branch January 26, 2022 12:33

seanchen1991 mentioned this pull request Jan 26, 2022

Take advantage of query pagination #1816

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast start for chains configured with an allow list #1705

Fast start for chains configured with an allow list #1705

romac commented Dec 21, 2021 •

edited

Loading

adizere left a comment

seanchen1991 commented Jan 19, 2022

adizere commented Jan 19, 2022 •

edited

Loading

romac commented Jan 20, 2022 •

edited

Loading

romac commented Jan 20, 2022

adizere commented Jan 20, 2022

romac commented Jan 20, 2022 •

edited

Loading

romac commented Jan 20, 2022

romac commented Jan 20, 2022

mzabaluev commented Jan 20, 2022

adizere commented Jan 20, 2022

adizere left a comment •

edited

Loading

adizere Jan 20, 2022

adizere commented Jan 24, 2022

seanchen1991 commented Jan 24, 2022

romac commented Jan 24, 2022

adizere commented Jan 25, 2022

romac commented Jan 25, 2022 •

edited

Loading

ancazamfir left a comment

		@@ -0,0 +1,4 @@
		- Reduce the startup time of the relayer by querying the necessary

Fast start for chains configured with an allow list #1705

Fast start for chains configured with an allow list #1705

Conversation

romac commented Dec 21, 2021 • edited Loading

Description

PR author checklist:

Reviewer checklist:

adizere left a comment

Choose a reason for hiding this comment

seanchen1991 commented Jan 19, 2022

adizere commented Jan 19, 2022 • edited Loading

romac commented Jan 20, 2022 • edited Loading

Result of a full scan

With filter = false (no fast start)

With filter = true and the following filters (fast start)

ibc0:

ibc1:

romac commented Jan 20, 2022

adizere commented Jan 20, 2022

On the Vega <> Agoric testnet.

Nit: Redundancy in the channel & connection workers span.

romac commented Jan 20, 2022 • edited Loading

romac commented Jan 20, 2022

romac commented Jan 20, 2022

mzabaluev commented Jan 20, 2022

adizere commented Jan 20, 2022

adizere left a comment • edited Loading

Choose a reason for hiding this comment

A question

adizere Jan 20, 2022

Choose a reason for hiding this comment

adizere commented Jan 24, 2022

seanchen1991 commented Jan 24, 2022

romac commented Jan 24, 2022

adizere commented Jan 25, 2022

romac commented Jan 25, 2022 • edited Loading

ancazamfir left a comment

Choose a reason for hiding this comment

romac commented Dec 21, 2021 •

edited

Loading

adizere commented Jan 19, 2022 •

edited

Loading

romac commented Jan 20, 2022 •

edited

Loading

With `filter = false` (no fast start)

With `filter = true` and the following filters (fast start)

Nit: Redundancy in the `channel` & `connection` workers span.

romac commented Jan 20, 2022 •

edited

Loading

adizere left a comment •

edited

Loading

romac commented Jan 25, 2022 •

edited

Loading