Validator: failover to secondary node is not working on mainnet #12030

Open
pznamensky opened this issue Sep 2, 2024 · 11 comments
Labels: community, investigation required

Comments

@pznamensky

Describe the bug
Both failover approaches (restarting the secondary with validator_key.json, and hot swap) fail on our validator running v2.1.1.
The secondary node gets stuck trying to download blocks.

To Reproduce
To reproduce the problem, bootstrap a secondary node, wait until it has caught up with all the blocks (curl -s localhost:3030/status | jq -C .sync_info), and then follow one of the two approaches described in https://github.com/near/nearcore/releases/tag/2.0.0 (the first one is sketched below).
On older versions (1.x) we used the 1st approach (copying over validator_key and node_key), but we've also tried the 2nd one, and neither works.
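For context, a rough sketch of the first ("restart") approach as we run it; we run neard in Docker, so the exact stop/start commands below are only illustrative placeholders:

# 1. on the secondary, confirm it is fully synced
curl -s localhost:3030/status | jq .sync_info.syncing    # should print false
# 2. stop neard on the primary (container name is a placeholder)
docker stop neard
# 3. copy the validator key from the primary to the secondary
scp primary:/root/.near/validator_key.json /root/.near/validator_key.json
# 4. start neard again on the secondary so it picks up the key
docker restart neard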
After failing over, the node can't catch up and keeps logging messages like:

2024-08-28T14:39:42.108268Z  INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
2024-08-28T14:39:52.108957Z  INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
2024-08-28T14:40:02.109626Z  INFO stats: #126754210 Downloading blocks 0.00% (131 left; at 126754210) 32 peers ⬇ 8.12 MB/s ⬆ 6.34 MB/s 0.00 bps 0 gas/s CPU: 71%, Mem: 8.54 GB

Please see more logs below.

Expected behavior
We should be able to perform failover.

Version (please complete the following information):

  • nearcore 2.1.1
  • rust version 1.79.0
  • docker 24
  • mainnet

Additional context
Here is the log right after sending the SIGHUP signal to the backup node (see this doc):
neard.log

@andrei-near
Contributor

Hi @pznamensky.
We've just tested failover option 2 with nearcore 2.1.1 internally and it went through as expected.
Could you please:

  1. provide the configuration for primary and secondary nodes
  2. confirm what happens with the primary and secondary node_key. It should NOT be moved from the primary to the secondary; doing so can leave the node unable to sync (other peers see the new IP, but the key is already associated with the primary node's IP address, so they will not allow syncing). See the check sketched below this list.
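For example (a hypothetical check, assuming the default key location), you can confirm the two nodes keep distinct node keys by comparing their public keys:

# run on both the primary and the secondary; the two outputs should differ
jq -r .public_key /root/.near/node_key.json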

@pznamensky
Author

Hi @andrei-near,
Previously, we used to copy node_key over from the primary to the secondary; however, we've also tried without replacing node_key on the secondary, and that didn't work either.

Here is our config.json (with the public_addrs field redacted), which we use with the "restart" approach. In the "hot reload" approach, tracked_shadow_validator was also added.

Also, I'm not sure if it's related, but in this epoch the secondary node crashes after copying validator_key and restarting neard:

2024-09-06T15:24:20.298880Z  INFO neard: version="2.1.1" build="unknown" latest_protocol=70
2024-09-06T15:24:20.299083Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.299093Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.301007Z  WARN genesis: Skipped genesis validation
2024-09-06T15:24:20.301022Z  INFO config: Validating Genesis config and records. This could take a few minutes...
2024-09-06T15:24:20.301227Z  INFO config: All validations have passed!
2024-09-06T15:24:20.302980Z  INFO neard: Reset the config "/root/.near/log_config.json" because the config file doesn't exist. err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
2024-09-06T15:24:20.303069Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.303076Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.303098Z  INFO neard: Hot loading validator key /root/.near/validator_key.json.
2024-09-06T15:24:20.303140Z  INFO near_o11y::reload: Updated the logging layer according to `log_config.json`
2024-09-06T15:24:20.303162Z  INFO db_opener: Opening NodeStorage path="/root/.near/data" cold_path="none"
2024-09-06T15:24:20.303180Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.325975Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.325990Z  INFO db_opener: The database exists. path=/root/.near/data
2024-09-06T15:24:20.325994Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.542850Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.542881Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.563855Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.563874Z  INFO db: Opened a new RocksDB instance. num_instances=1
thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 7CfkJyEdHnJiZ4kEURa6p8HPEBwDj4eZcqMNKetrV9eD in shard s5.v3")))
stack backtrace:
   0:     0x5641c7fea424 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
   1:     0x5641c68a589b - core::fmt::write::hc090a2ffd6b28c4a
   2:     0x5641c7fb608e - std::io::Write::write_fmt::h8898bac6ff039a23
   3:     0x5641c7febae9 - std::sys_common::backtrace::print::ha96650907276675e
   4:     0x5641c7feb409 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
   5:     0x5641c7fec4e5 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
   6:     0x5641c7febe72 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
   7:     0x5641c7febdc9 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
   8:     0x5641c7febdb6 - rust_begin_unwind
   9:     0x5641c6544072 - core::panicking::panic_fmt::hdc63834ffaaefae5
  10:     0x5641c6544485 - core::result::unwrap_failed::h82b551e0ff2b2176
  11:     0x5641c7b54f03 - nearcore::start_with_config_and_synchronization::haab8ba90a9fdf122
  12:     0x5641c673c498 - neard::cli::RunCmd::run::{{closure}}::h24af1b6de28656ca
  13:     0x5641c6709429 - tokio::task::local::LocalSet::run_until::{{closure}}::h60aa1f7ad8e92db2
  14:     0x5641c672db25 - neard::cli::NeardCmd::parse_and_run::h124534cdc5ba48bd
  15:     0x5641c673fd13 - neard::main::h3bab0c83991e2cc8
  16:     0x5641c66f6253 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1a2f36ec6ad85c72
  17:     0x5641c6769eff - main
  18:     0x7f6c2cb6cd90 - <unknown>
  19:     0x7f6c2cb6ce40 - __libc_start_main
  20:     0x5641c66bd905 - _start
  21:                0x0 - <unknown>

@staffik
Contributor

staffik commented Sep 10, 2024

Here is our config.json

Is this the config for the primary node, the failover node, or both?

@pznamensky
Author

pznamensky commented Sep 10, 2024

Is this the config for the primary node, the failover node, or both?

That's for both nodes.

@staffik
Contributor

staffik commented Sep 10, 2024

It has tracked_shards = [] and it does not have tracked_shadow_validator, so the failover node does not track anything.
I think that might be why it crashed (No ChunkExtra ...). A minimal config fragment is sketched below.
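For reference, a minimal fragment of what the failover node's config.json would need for shadow tracking could look like this (the pool id is a placeholder):

{
  "tracked_shards": [],
  "tracked_shadow_validator": "<your_pool_id>"
}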

@pznamensky
Author

Thanks for the update.
Unfortunately, with tracked_shadow_validator set, neard keeps crashing with the same error.

@staffik
Contributor

staffik commented Sep 16, 2024

Could you send the configs for both nodes and share the steps you took?

@pznamensky
Author

pznamensky commented Sep 17, 2024

@staffik sure!

config.json with the public_addrs field redacted.
On the secondary node, "tracked_shadow_validator": "<our_pool_id_here>" was also added (and the node restarted).

Steps to perform a "hot" failover (condensed into a sketch after the list):

  1. ensure that the secondary node is in sync with the network via curl -s localhost:3030/status | jq -C .sync_info
  2. stop neard on the primary server
  3. copy validator_key.json over to the secondary node
  4. send SIGHUP to neard on the secondary node
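As a shell sketch (paths assume the default /root/.near home; the primary hostname is a placeholder):

# on the secondary: confirm it is in sync
curl -s localhost:3030/status | jq -C .sync_info
# stop neard on the primary, then copy the validator key over to the secondary
scp primary:/root/.near/validator_key.json /root/.near/validator_key.json
# tell neard on the secondary to hot-load the key
sudo pkill -HUP neard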

This results in the following messages and no progress in downloading blocks:

INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...
INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...

If we restart the secondary node instead of sending SIGHUP, it crashes as in this message: #12030 (comment)

PS: this happens on 2.2.1 too

@andrei-near
Contributor

@pznamensky, we have repeatedly tested the hot failover scenario with 2.2.0 and 2.2.1 and it works as expected. The failover node must be healthy and run with these params set (a jq sketch follows):
.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true
.tracked_shadow_validator = "$your_pool_name"
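For illustration, these could be applied with jq roughly as follows (the config path is the default and the pool name is a placeholder):

cp /root/.near/config.json /root/.near/config.json.bak
jq '.tracked_shards = []
    | .store.load_mem_tries_for_tracked_shards = true
    | .tracked_shadow_validator = "<your_pool_name>"' \
  /root/.near/config.json.bak > /root/.near/config.json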

Steps we run for hot-swap failover:

  1. Copy the validator key to the new node
  2. Stop the neard service on the old validator
  3. Send SIGHUP to the neard process to pick up the validator key: sudo pkill -HUP neard

I would recommend downloading the freshest snapshot on the failover node and starting it with the above configuration set.

@pznamensky
Author

Hi @andrei-near, thanks for checking!
We used a clean server with a fresh snapshot and exactly the same steps and config options that you've described.
Is there a chance that this issue can only be reproduced for our validator?
Also, what can we do to find the root cause? Try debug mode (if any), dump metrics, etc.?

@andrei-near
Contributor

andrei-near commented Oct 17, 2024

@pznamensky do you gather metrics from your node somewhere?
I would like to check whether the backup node is tracking the active validator's shard. If you open your node metrics at http://$your_backup_node_ip:4040/metrics you should see the following metrics indicating which shard the backup node is tracking (a quick check is sketched after the list):

near_client_tracked_shards{shard_id="0"} 0
near_client_tracked_shards{shard_id="1"} 1
near_client_tracked_shards{shard_id="2"} 0
near_client_tracked_shards{shard_id="3"} 0
near_client_tracked_shards{shard_id="4"} 0
near_client_tracked_shards{shard_id="5"} 0
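For example, a quick check from the backup node itself might look like this (using the metrics port mentioned above; adjust it if your prometheus_addr differs):

curl -s http://localhost:4040/metrics | grep near_client_tracked_shards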

Another option would be to run the backup node with RPC configs:
.tracked_shards = [111]
.store.load_mem_tries_for_tracked_shards = false

and change them to the following only once the validator key has been moved to the node:
.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true

The downside is that this requires a neard restart.
