Validator: failover to secondary node is not working on mainnet #12030

Open
pznamensky opened this issue Sep 2, 2024 · 11 comments
Labels: community, investigation required

Comments

@pznamensky

Describe the bug
Both failover approaches (restarting the secondary with validator_key.json, and hot swap) fail on our validator running v2.1.1.
The secondary node gets stuck trying to download blocks.

To Reproduce
To reproduce the problem, bootstrap a secondary node, wait until it has caught up with all the blocks (curl -s localhost:3030/status | jq -C .sync_info), and then follow one of the two approaches described in https://github.com/near/nearcore/releases/tag/2.0.0 (the first one is sketched below).
On older versions (1.x) we used the 1st approach (copying over validator_key and node_key), but we've also tried the 2nd one, and neither works.
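For context, a rough sketch of the first ("restart") approach as we run it; we run neard in Docker, so the exact stop/start commands below are only illustrative placeholders:

# 1. on the secondary, confirm it is fully synced
curl -s localhost:3030/status | jq .sync_info.syncing    # should print false
# 2. stop neard on the primary (container name is a placeholder)
docker stop neard
# 3. copy the validator key from the primary to the secondary
scp primary:/root/.near/validator_key.json /root/.near/validator_key.json
# 4. start neard again on the secondary so it picks up the key
docker restart neard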
After failing over, the node can't catch up and keeps logging messages like:

2024-08-28T14:39:42.108268Z  INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
2024-08-28T14:39:52.108957Z  INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
2024-08-28T14:40:02.109626Z  INFO stats: #126754210 Downloading blocks 0.00% (131 left; at 126754210) 32 peers ⬇ 8.12 MB/s ⬆ 6.34 MB/s 0.00 bps 0 gas/s CPU: 71%, Mem: 8.54 GB

Please see more logs below.

Expected behavior
We should be able to perform failover.

Version (please complete the following information):

  • nearcore 2.1.1
  • rust version 1.79.0
  • docker 24
  • mainnet

Additional context
Here is the log right after sending the SIGHUP signal to the backup node (see this doc):
neard.log

@andrei-near
Contributor

Hi @pznamensky.
We've just tested failover option 2 with nearcore 2.1.1 internally and it went through as expected.
Could you please:

  1. provide the configuration for primary and secondary nodes
  2. confirm what happens with the primary and secondary node_key. It should NOT be moved from the primary to the secondary; doing so can leave the node unable to sync (other peers see the new IP, but the key is already associated with the primary node's IP address, so they will not allow syncing). See the check sketched below this list.
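For example (a hypothetical check, assuming the default key location), you can confirm the two nodes keep distinct node keys by comparing their public keys:

# run on both the primary and the secondary; the two outputs should differ
jq -r .public_key /root/.near/node_key.json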

@pznamensky
Author

Hi @andrei-near,
Previously, we used to copy node_key over from the primary to the secondary; however, we've also tried without replacing node_key on the secondary, and that didn't work either.

Here is our config.json (with the public_addrs field redacted), which we use with the "restart" approach. In the "hot reload" approach, tracked_shadow_validator was also added.

Also, I'm not sure if it's related, but in this epoch the secondary node crashes after copying validator_key and restarting neard:

2024-09-06T15:24:20.298880Z  INFO neard: version="2.1.1" build="unknown" latest_protocol=70
2024-09-06T15:24:20.299083Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.299093Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.301007Z  WARN genesis: Skipped genesis validation
2024-09-06T15:24:20.301022Z  INFO config: Validating Genesis config and records. This could take a few minutes...
2024-09-06T15:24:20.301227Z  INFO config: All validations have passed!
2024-09-06T15:24:20.302980Z  INFO neard: Reset the config "/root/.near/log_config.json" because the config file doesn't exist. err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
2024-09-06T15:24:20.303069Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.303076Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.303098Z  INFO neard: Hot loading validator key /root/.near/validator_key.json.
2024-09-06T15:24:20.303140Z  INFO near_o11y::reload: Updated the logging layer according to `log_config.json`
2024-09-06T15:24:20.303162Z  INFO db_opener: Opening NodeStorage path="/root/.near/data" cold_path="none"
2024-09-06T15:24:20.303180Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.325975Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.325990Z  INFO db_opener: The database exists. path=/root/.near/data
2024-09-06T15:24:20.325994Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.542850Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.542881Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.563855Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.563874Z  INFO db: Opened a new RocksDB instance. num_instances=1
thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 7CfkJyEdHnJiZ4kEURa6p8HPEBwDj4eZcqMNKetrV9eD in shard s5.v3")))
stack backtrace:
   0:     0x5641c7fea424 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
   1:     0x5641c68a589b - core::fmt::write::hc090a2ffd6b28c4a
   2:     0x5641c7fb608e - std::io::Write::write_fmt::h8898bac6ff039a23
   3:     0x5641c7febae9 - std::sys_common::backtrace::print::ha96650907276675e
   4:     0x5641c7feb409 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
   5:     0x5641c7fec4e5 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
   6:     0x5641c7febe72 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
   7:     0x5641c7febdc9 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
   8:     0x5641c7febdb6 - rust_begin_unwind
   9:     0x5641c6544072 - core::panicking::panic_fmt::hdc63834ffaaefae5
  10:     0x5641c6544485 - core::result::unwrap_failed::h82b551e0ff2b2176
  11:     0x5641c7b54f03 - nearcore::start_with_config_and_synchronization::haab8ba90a9fdf122
  12:     0x5641c673c498 - neard::cli::RunCmd::run::{{closure}}::h24af1b6de28656ca
  13:     0x5641c6709429 - tokio::task::local::LocalSet::run_until::{{closure}}::h60aa1f7ad8e92db2
  14:     0x5641c672db25 - neard::cli::NeardCmd::parse_and_run::h124534cdc5ba48bd
  15:     0x5641c673fd13 - neard::main::h3bab0c83991e2cc8
  16:     0x5641c66f6253 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1a2f36ec6ad85c72
  17:     0x5641c6769eff - main
  18:     0x7f6c2cb6cd90 - <unknown>
  19:     0x7f6c2cb6ce40 - __libc_start_main
  20:     0x5641c66bd905 - _start
  21:                0x0 - <unknown>

@staffik
Contributor

staffik commented Sep 10, 2024

Here is our config.json

Is this the config for the primary node, the failover node, or both?

@pznamensky
Author

pznamensky commented Sep 10, 2024

Is this the config for the primary node, the failover node, or both?

That's for both nodes.

@staffik
Contributor

staffik commented Sep 10, 2024

It has tracked_shards = [] and it does not have tracked_shadow_validator, so the failover node does not track anything.
I think that might be why it crashed (No ChunkExtra ...). A minimal config fragment is sketched below.
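For reference, a minimal fragment of what the failover node's config.json would need for shadow tracking could look like this (the pool id is a placeholder):

{
  "tracked_shards": [],
  "tracked_shadow_validator": "<your_pool_id>"
}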

@pznamensky
Author

Thanks for the update.
Unfortunately, with tracked_shadow_validator set, neard keeps crashing with the same error.

@staffik
Contributor

staffik commented Sep 16, 2024

Could you send the configs for both nodes and share the steps you took?

@pznamensky
Author

pznamensky commented Sep 17, 2024

@staffik sure!

config.json with the public_addrs field redacted.
On the secondary node, "tracked_shadow_validator": "<our_pool_id_here>" was also added (and the node restarted).

Steps to perform a "hot" failover (condensed into a sketch after the list):

  1. ensure that the secondary node is in sync with the network via curl -s localhost:3030/status | jq -C .sync_info
  2. stop neard on the primary server
  3. copy validator_key.json over to the secondary node
  4. send SIGHUP to neard on the secondary node
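As a shell sketch (paths assume the default /root/.near home; the primary hostname is a placeholder):

# on the secondary: confirm it is in sync
curl -s localhost:3030/status | jq -C .sync_info
# stop neard on the primary, then copy the validator key over to the secondary
scp primary:/root/.near/validator_key.json /root/.near/validator_key.json
# tell neard on the secondary to hot-load the key
sudo pkill -HUP neard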

This results in the following messages and no progress in downloading blocks:

INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...
INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...

If we restart the secondary node instead of sending SIGHUP, it crashes as in this message: #12030 (comment)

PS: this happens on 2.2.1 too

@andrei-near
Contributor

@pznamensky, we have repeatedly tested the hot failover scenario with 2.2.0 and 2.2.1 and it works as expected. The failover node must be healthy and run with these params set (a jq sketch follows):
.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true
.tracked_shadow_validator = "$your_pool_name"
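For illustration, these could be applied with jq roughly as follows (the config path is the default and the pool name is a placeholder):

cp /root/.near/config.json /root/.near/config.json.bak
jq '.tracked_shards = []
    | .store.load_mem_tries_for_tracked_shards = true
    | .tracked_shadow_validator = "<your_pool_name>"' \
  /root/.near/config.json.bak > /root/.near/config.json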

Steps we run for hot-swap failover:

  1. Copy the validator key to the new node
  2. Stop the neard service on the old validator
  3. Send SIGHUP to the neard process to pick up the validator key: sudo pkill -HUP neard

I would recommend downloading the freshest snapshot on the failover node and starting it with the above configuration set.

@pznamensky
Author

Hi @andrei-near, thanks for checking!
We used a clean server with a fresh snapshot and exactly the same steps and config options that you've described.
Is there a chance that this issue can only be reproduced for our validator?
Also, what can we do to find the root cause? Try debug mode (if any), dump metrics, etc.?

@andrei-near
Contributor

andrei-near commented Oct 17, 2024

@pznamensky do you gather metrics from your node somewhere?
I would like to check whether the backup node is tracking the active validator's shard. If you open your node metrics at http://$your_backup_node_ip:4040/metrics you should see the following metrics indicating which shard the backup node is tracking (a quick check is sketched after the list):

near_client_tracked_shards{shard_id="0"} 0
near_client_tracked_shards{shard_id="1"} 1
near_client_tracked_shards{shard_id="2"} 0
near_client_tracked_shards{shard_id="3"} 0
near_client_tracked_shards{shard_id="4"} 0
near_client_tracked_shards{shard_id="5"} 0
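For example, a quick check from the backup node itself might look like this (using the metrics port mentioned above; adjust it if your prometheus_addr differs):

curl -s http://localhost:4040/metrics | grep near_client_tracked_shards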

Another option would be to run the backup node with RPC configs:
.tracked_shards = [111]
.store.load_mem_tries_for_tracked_shards = false

and change them to the following only once the validator key has been moved to the node:
.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true

The downside is that this requires a neard restart.
