-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast start for chains configured with an allow list #1705
Conversation
… when some chains are using allow lists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick run, only aesthetic/log suggestions so far, otherwise looking good!
Will continue with more in-depth tests.
Would we want to set fast starting as the default Hermes start up behavior? |
Two further suggestions:
It's unclear how (and if) we should fix (2). |
Added some metrics for queries, here are some preliminary results: Result of a full scan
With
|
Yes, the relayer will use the fast start if possible, ie. when |
The metrics are exactly what we need, amazing work there Romain! On the Vega <> Agoric testnet.With
With
Nit: Redundancy in the
|
How would you suggest we clean them up? Perhaps from
to
|
Well, |
0f301b5
to
8f49559
Compare
@adizere Re:
Not sure if this has anything to do with this PR, maybe we can track this in an issue and investigate at a later time? |
Probably, or maybe not; I'm thinking of |
I see. I guess we can ignore this idea then. Let's keep the spans as they are. Redundancy is not always bad. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a relayer operator report on the outcome of using this PR. The Hermes start time reduced from ~40mins to ~10mins. This is anecdotal, however, since I think the 40min time included misbehavior detection. Then the Osmosis epoch time dropped and it messed-up our debugging session...
Will report again tomorrow.
For completeness, this are the query metrics.
Metrics
# TYPE queries counter
queries{chain="axelar-dojo-1",query_type="query_channel"} 2
queries{chain="axelar-dojo-1",query_type="query_client_state"} 201
queries{chain="axelar-dojo-1",query_type="query_connection"} 3
queries{chain="axelar-dojo-1",query_type="query_consensus_state"} 200
queries{chain="axelar-dojo-1",query_type="query_latest_height"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_commitments"} 2
queries{chain="axelar-dojo-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_channel"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_client_state"} 143
queries{chain="axelar-testnet-lisbon-2",query_type="query_connection"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_consensus_state"} 142
queries{chain="axelar-testnet-lisbon-2",query_type="query_latest_height"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_channel"} 2
queries{chain="bombay-12",query_type="query_client_state"} 1
queries{chain="bombay-12",query_type="query_connection"} 3
queries{chain="bombay-12",query_type="query_latest_height"} 1
queries{chain="bombay-12",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_unreceived_acknowledgements"} 1
queries{chain="columbus-5",query_type="query_channel"} 21
queries{chain="columbus-5",query_type="query_client_state"} 227
queries{chain="columbus-5",query_type="query_connection"} 15
queries{chain="columbus-5",query_type="query_consensus_state"} 209
queries{chain="columbus-5",query_type="query_latest_height"} 1
queries{chain="columbus-5",query_type="query_packet_acknowledgements"} 2
queries{chain="columbus-5",query_type="query_packet_commitments"} 5
queries{chain="columbus-5",query_type="query_status"} 9
queries{chain="columbus-5",query_type="query_unreceived_acknowledgements"} 9
queries{chain="columbus-5",query_type="query_unreceived_packets"} 6
queries{chain="comdex-1",query_type="query_channel"} 2
queries{chain="comdex-1",query_type="query_client_state"} 197
queries{chain="comdex-1",query_type="query_connection"} 3
queries{chain="comdex-1",query_type="query_consensus_state"} 194
queries{chain="comdex-1",query_type="query_latest_height"} 1
queries{chain="comdex-1",query_type="query_packet_acknowledgements"} 1
queries{chain="comdex-1",query_type="query_packet_commitments"} 2
queries{chain="comdex-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="core-1",query_type="query_channel"} 6
queries{chain="core-1",query_type="query_client_state"} 365
queries{chain="core-1",query_type="query_connection"} 6
queries{chain="core-1",query_type="query_consensus_state"} 360
queries{chain="core-1",query_type="query_latest_height"} 1
queries{chain="core-1",query_type="query_packet_acknowledgements"} 2
queries{chain="core-1",query_type="query_packet_commitments"} 4
queries{chain="core-1",query_type="query_status"} 1
queries{chain="core-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="core-1",query_type="query_unreceived_packets"} 1
queries{chain="cosmoshub-4",query_type="query_account"} 1
queries{chain="cosmoshub-4",query_type="query_channel"} 67
queries{chain="cosmoshub-4",query_type="query_client_state"} 602
queries{chain="cosmoshub-4",query_type="query_connection"} 29
queries{chain="cosmoshub-4",query_type="query_consensus_state"} 570
queries{chain="cosmoshub-4",query_type="query_latest_height"} 1
queries{chain="cosmoshub-4",query_type="query_packet_acknowledgements"} 5
queries{chain="cosmoshub-4",query_type="query_packet_commitments"} 8
queries{chain="cosmoshub-4",query_type="query_status"} 28
queries{chain="cosmoshub-4",query_type="query_unreceived_acknowledgements"} 44
queries{chain="cosmoshub-4",query_type="query_unreceived_packets"} 22
queries{chain="emoney-3",query_type="query_channel"} 2
queries{chain="emoney-3",query_type="query_client_state"} 373
queries{chain="emoney-3",query_type="query_connection"} 3
queries{chain="emoney-3",query_type="query_consensus_state"} 372
queries{chain="emoney-3",query_type="query_latest_height"} 1
queries{chain="emoney-3",query_type="query_packet_acknowledgements"} 1
queries{chain="emoney-3",query_type="query_packet_commitments"} 2
queries{chain="emoney-3",query_type="query_unreceived_acknowledgements"} 1
queries{chain="impacthub-3",query_type="query_channel"} 4
queries{chain="impacthub-3",query_type="query_client_state"} 406
queries{chain="impacthub-3",query_type="query_connection"} 6
queries{chain="impacthub-3",query_type="query_consensus_state"} 404
queries{chain="impacthub-3",query_type="query_latest_height"} 1
queries{chain="impacthub-3",query_type="query_packet_acknowledgements"} 2
queries{chain="impacthub-3",query_type="query_packet_commitments"} 4
queries{chain="impacthub-3",query_type="query_unreceived_acknowledgements"} 2
queries{chain="juno-1",query_type="query_channel"} 15
queries{chain="juno-1",query_type="query_client_state"} 374
queries{chain="juno-1",query_type="query_connection"} 14
queries{chain="juno-1",query_type="query_consensus_state"} 347
queries{chain="juno-1",query_type="query_latest_height"} 1
queries{chain="juno-1",query_type="query_packet_acknowledgements"} 4
queries{chain="juno-1",query_type="query_packet_commitments"} 7
queries{chain="juno-1",query_type="query_status"} 6
queries{chain="juno-1",query_type="query_txs"} 1
queries{chain="juno-1",query_type="query_unreceived_acknowledgements"} 5
queries{chain="juno-1",query_type="query_unreceived_packets"} 6
queries{chain="kichain-2",query_type="query_channel"} 10
queries{chain="kichain-2",query_type="query_client_state"} 411
queries{chain="kichain-2",query_type="query_connection"} 9
queries{chain="kichain-2",query_type="query_consensus_state"} 404
queries{chain="kichain-2",query_type="query_latest_height"} 1
queries{chain="kichain-2",query_type="query_packet_acknowledgements"} 3
queries{chain="kichain-2",query_type="query_packet_commitments"} 7
queries{chain="kichain-2",query_type="query_status"} 2
queries{chain="kichain-2",query_type="query_unreceived_acknowledgements"} 5
queries{chain="kichain-2",query_type="query_unreceived_packets"} 1
queries{chain="lum-network-1",query_type="query_channel"} 6
queries{chain="lum-network-1",query_type="query_client_state"} 198
queries{chain="lum-network-1",query_type="query_connection"} 6
queries{chain="lum-network-1",query_type="query_consensus_state"} 191
queries{chain="lum-network-1",query_type="query_latest_height"} 1
queries{chain="lum-network-1",query_type="query_packet_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_packet_commitments"} 4
queries{chain="lum-network-1",query_type="query_status"} 1
queries{chain="lum-network-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_unreceived_packets"} 1
queries{chain="osmosis-1",query_type="query_account"} 1
queries{chain="osmosis-1",query_type="query_blocks"} 2
queries{chain="osmosis-1",query_type="query_channel"} 149
queries{chain="osmosis-1",query_type="query_client_state"} 6430
queries{chain="osmosis-1",query_type="query_connection"} 137
queries{chain="osmosis-1",query_type="query_consensus_state"} 6196
queries{chain="osmosis-1",query_type="query_latest_height"} 1
queries{chain="osmosis-1",query_type="query_packet_acknowledgements"} 19
queries{chain="osmosis-1",query_type="query_packet_commitments"} 41
queries{chain="osmosis-1",query_type="query_status"} 35
queries{chain="osmosis-1",query_type="query_txs"} 12
queries{chain="osmosis-1",query_type="query_unreceived_acknowledgements"} 38
queries{chain="osmosis-1",query_type="query_unreceived_packets"} 26
queries{chain="panacea-3",query_type="query_account"} 1
queries{chain="panacea-3",query_type="query_channel"} 132
queries{chain="panacea-3",query_type="query_client_state"} 206
queries{chain="panacea-3",query_type="query_connection"} 4
queries{chain="panacea-3",query_type="query_consensus_state"} 205
queries{chain="panacea-3",query_type="query_latest_height"} 1
queries{chain="panacea-3",query_type="query_packet_acknowledgements"} 2
queries{chain="panacea-3",query_type="query_packet_commitments"} 4
queries{chain="panacea-3",query_type="query_status"} 4
queries{chain="panacea-3",query_type="query_txs"} 1
queries{chain="panacea-3",query_type="query_unreceived_acknowledgements"} 258
queries{chain="panacea-3",query_type="query_unreceived_packets"} 2
queries{chain="sentinelhub-2",query_type="query_channel"} 2
queries{chain="sentinelhub-2",query_type="query_client_state"} 432
queries{chain="sentinelhub-2",query_type="query_connection"} 3
queries{chain="sentinelhub-2",query_type="query_consensus_state"} 428
queries{chain="sentinelhub-2",query_type="query_latest_height"} 1
queries{chain="sentinelhub-2",query_type="query_packet_acknowledgements"} 1
queries{chain="sentinelhub-2",query_type="query_packet_commitments"} 2
queries{chain="sentinelhub-2",query_type="query_unreceived_acknowledgements"} 1
queries{chain="stargaze-1",query_type="query_channel"} 16
queries{chain="stargaze-1",query_type="query_client_state"} 392
queries{chain="stargaze-1",query_type="query_connection"} 14
queries{chain="stargaze-1",query_type="query_consensus_state"} 372
queries{chain="stargaze-1",query_type="query_latest_height"} 1
queries{chain="stargaze-1",query_type="query_packet_acknowledgements"} 2
queries{chain="stargaze-1",query_type="query_packet_commitments"} 6
queries{chain="stargaze-1",query_type="query_status"} 3
queries{chain="stargaze-1",query_type="query_unreceived_acknowledgements"} 6
queries{chain="stargaze-1",query_type="query_unreceived_packets"} 3
queries{chain="vidulum-1",query_type="query_channel"} 2
queries{chain="vidulum-1",query_type="query_client_state"} 195
queries{chain="vidulum-1",query_type="query_connection"} 3
queries{chain="vidulum-1",query_type="query_consensus_state"} 194
queries{chain="vidulum-1",query_type="query_latest_height"} 1
queries{chain="vidulum-1",query_type="query_packet_acknowledgements"} 1
queries{chain="vidulum-1",query_type="query_packet_commitments"} 2
queries{chain="vidulum-1",query_type="query_unreceived_acknowledgements"} 1
# HELP workers Number of workers per object
# TYPE workers gauge
workers{type="client"} 35
workers{type="packet"} 17
A question
One thing that stands out is this:
queries{chain="osmosis-1",query_type="query_client_state"} 6430
6K queries might explain why Hermes might need 10min to start, but not sure.
The question is: At startup time, assuming ~10 preconfigured channels for osmosis-1
, why would Hermes need to query client_state
6K times? We should clarify this.
@@ -0,0 +1,4 @@ | |||
- Reduce the startup time of the relayer by querying the necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this?
One valuable detail we found while parsing some logs with Romac is that Hermes scanning phase can be blazing fast (under 10 seconds for a relayer operator that has ~40 allowed channels on 17 chains) but the spawning phase is the one consuming most of the time. In an execution where the start-time was 10 minutes, 6 minutes of all that time went to only 3 spawning tasks, as follows:
These are low-hanging fruits we should investigate. If we can eliminate the bottleneck here, the start time will reduce to less than half. |
@adizere and I dug into this performance issue. Our strong suspicion is that the Adi made the observation that these closures are fetching large vectors of data just to calculate a boolean value. Could we make use of |
Very insightful, thanks! Will look into it. |
Building up on this idea, if we choose to optimize the start time, here's a sketch of how we can exploit pagination to reduce pressure/bandwidth on full node: I'd suggest the refactoring based on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow! amazing work here. In addition to the perf improvement I think the scanning/ spawning code looks so much cleaner and easier to understand. Thanks!!
…1705) - Split the worker spawning code into two parts: - a) scan the chains for the clients, connections and channels - b) spawn the appropriate workers based on the resulting scan - When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and query the allowed channels directly, resulting in much fewer queries and a faster start. - Add a `--full-scan` option to `hermes start` to opt out of the fast start mechanism and do a full scan. --- * Introduce a `ChainScanner` to scan the chains for clients, connections and channels * Use `ChainScanner` for spawning workers * Formatting * Add `--full-scan` option to `start` command to force a full scan even when some chains are using allow lists * Remove debug statements and print scanned chains on startup * Changelog entry * Fix duplicate info message * Quote identifiers in log messages * Better error when port/channel does not exists * Add metrics for queries * Small log improvements * Rename queries metric * Use `chain` key for recording chain identifier in tracing logs * Use more structured logging in chain scanner * Fix changelog entry * Improve logs when no workers were spawned * Improve logs when spawning connection and channel workers * Remove spaces in objects names * Add changelog entry * Revert part of logs changes * Use INFO level for spawning logs * Remove redundant changelog entry
…1705) - Split the worker spawning code into two parts: - a) scan the chains for the clients, connections and channels - b) spawn the appropriate workers based on the resulting scan - When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and query the allowed channels directly, resulting in much fewer queries and a faster start. - Add a `--full-scan` option to `hermes start` to opt out of the fast start mechanism and do a full scan. --- * Introduce a `ChainScanner` to scan the chains for clients, connections and channels * Use `ChainScanner` for spawning workers * Formatting * Add `--full-scan` option to `start` command to force a full scan even when some chains are using allow lists * Remove debug statements and print scanned chains on startup * Changelog entry * Fix duplicate info message * Quote identifiers in log messages * Better error when port/channel does not exists * Add metrics for queries * Small log improvements * Rename queries metric * Use `chain` key for recording chain identifier in tracing logs * Use more structured logging in chain scanner * Fix changelog entry * Improve logs when no workers were spawned * Improve logs when spawning connection and channel workers * Remove spaces in objects names * Add changelog entry * Revert part of logs changes * Use INFO level for spawning logs * Remove redundant changelog entry
Closes: #1536
Description
--full-scan
option tohermes start
to opt out of the fast start mechanism and do a full scan.PR author checklist:
unclog
.docs/
).Reviewer checklist:
Files changed
in the GitHub PR explorer.