Description
The PeerNetwork::run method can return an error when PoX invalidations occur. This mostly happens on testnet (where PoX cycles are much more frequent, so nodes do occasionally perform a PoX "reorg"), but it's possible on mainnet too. Usually this isn't a problem: if PeerNetwork::run fails, the loop just keeps running. However, when the node is "event driven" (i.e., the event dispatcher is connected to an event consumer), as in the case of an API node, PeerNetwork::run errors cause a panic. That's because the event dispatcher cannot recover from an arbitrary PeerNetwork error.
The code path that causes issues in testnet followers is get_block_availability:
// what blocks do we have in this range?
let local_blocks = {
    let ic = sortdb.index_conn();
    let tip = SortitionDB::get_canonical_burn_chain_tip(&ic)?;
    ...
    let local_blocks = ic.get_stacks_header_hashes(
        sortition_height_end - sortition_height_start,
        &last_ancestor.consensus_hash,
        header_cache,
    )?;
The bug witnessed is as follows:
1. get_canonical_burn_chain_tip is called and returns the then-canonical sortition tip.
2. The chains coordinator detects a PoX reorg and invalidates the sortition tip.
3. get_stacks_header_hashes is called and returns an Err because the supplied tip is marked pox_valid = 0.
The invalidations are performed in a SQLite transaction, so they are atomic. However, steps 1 and 3 together are not: they are performed on just an index_conn, not inside a SQLite transaction, so an invalidation can land between them. So perhaps the solution is to use a transaction here.
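To make the interleaving concrete, here is a minimal toy model of the race and of the proposed fix. It is only a sketch under stated assumptions: ToySortDb, pox_reorg, snapshot, racy_read and snapshot_read are hypothetical names invented for illustration, the in-memory clone stands in for the consistent view a SQLite read transaction would provide, and none of this is the actual SortitionDB API.

```rust
use std::collections::HashMap;

/// Toy stand-in for the sortition DB (hypothetical; the real one is SQLite-backed).
/// It tracks the canonical tip and a `pox_valid` flag per tip.
#[derive(Clone)]
struct ToySortDb {
    canonical_tip: String,
    pox_valid: HashMap<String, bool>,
}

impl ToySortDb {
    fn new() -> Self {
        let mut pox_valid = HashMap::new();
        pox_valid.insert("tip-a".to_string(), true);
        pox_valid.insert("tip-b".to_string(), true);
        ToySortDb { canonical_tip: "tip-a".to_string(), pox_valid }
    }

    /// Step 1: return the current canonical tip.
    fn get_canonical_tip(&self) -> String {
        self.canonical_tip.clone()
    }

    /// Step 3: fail if the supplied tip is unknown or has been invalidated.
    fn get_header_hashes(&self, tip: &str) -> Result<Vec<String>, String> {
        match self.pox_valid.get(tip).copied() {
            Some(true) => Ok(vec![format!("header-for-{}", tip)]),
            _ => Err(format!("{} is not PoX-valid", tip)),
        }
    }

    /// Step 2: the coordinator invalidates the old tip and switches to another.
    fn pox_reorg(&mut self) {
        let old = self.canonical_tip.clone();
        self.pox_valid.insert(old, false);
        self.canonical_tip = "tip-b".to_string();
    }

    /// Assumption: models opening a read transaction as a frozen, consistent view.
    fn snapshot(&self) -> ToySortDb {
        self.clone()
    }
}

/// Two independent reads against the live DB, with a reorg landing in between.
fn racy_read(db: &mut ToySortDb) -> Result<Vec<String>, String> {
    let tip = db.get_canonical_tip();
    db.pox_reorg();
    db.get_header_hashes(&tip)
}

/// Both reads go through one snapshot, so the reorg cannot split them.
fn snapshot_read(db: &mut ToySortDb) -> Result<Vec<String>, String> {
    let view = db.snapshot();
    let tip = view.get_canonical_tip();
    db.pox_reorg();
    view.get_header_hashes(&tip)
}

fn main() {
    println!("racy:     {:?}", racy_read(&mut ToySortDb::new()));
    println!("snapshot: {:?}", snapshot_read(&mut ToySortDb::new()));
}
```

In this model the racy path returns an Err, while the snapshot path succeeds. Note the trade-off: the snapshot answer is internally consistent but may already be stale by the time it is used, so the caller would still need to tolerate working from a tip that has since been invalidated.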