Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blockchain: fix temp fails causing alt blocks to be permanently invalid #9395

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jeffro256
Copy link
Contributor

Issue appeared in stressnet where nodes would never switch to the valid chain with longest proof of work, but instead log a message like so:

INFO    global  src/cryptonote_core/blockchain.cpp:2125 ###### REORGANIZE on height: 2534772 of 2534772 with cum_difficulty 1072308428953
INFO    global  src/cryptonote_core/blockchain.cpp:2125  alternative blockchain size: 2 with cum_difficulty 1072309197139
ERROR   txpool  src/cryptonote_core/tx_pool.cpp:596     Failed to find tx_meta in txpool
ERROR   verify  src/cryptonote_core/blockchain.cpp:4239 Block with id: <BLKID> has at least one unknown transaction with id: <TXID>
ERROR   blockchain      src/cryptonote_core/blockchain.cpp:1205 Failed to switch to alternative blockchain
ERROR   blockchain      src/cryptonote_core/blockchain.cpp:1217 The block was inserted as invalid while connecting new alternative chain, block_id: <BLKID>

What I believe was happening is this chain of events:
1. Transaction T is added to mempool
2. Block B1, which contains T, is added as an alt block
3. Mempool conditions cause T to be pushed out
4. Block B2, with previous block B1, is added as an alt block
5. Cumulative difficulty of B2 relative to the main chain triggers a reorg attempt
6. Tx validation of B1 begins, but T is not present in the mempool
7. Method handle_block_to_main_chain returns false
8. As a result, method switch_to_alternative_blockchain adds B1 and B2 to m_invalid_blocks list permanently
9. Peer gets blocked, and node never reaches consensus with other nodes because B1 and B2 stay in the invalid blocks list

I could be wrong about the exact trigger, but regardless, handle_block_to_main_chain returns false for many reasons, some of which are non-deterministic (i.e. the node not having a transaction in the mempool). As such, we should not add blocks to the invalid blocks list every time handle_block_to_main_chain returns false. Additionally, handle_block_to_main_chain already adds blocks to the invalid blocks list for transaction input consensus failures (which should be deterministic). The other big type of failure that could occur s a PoW check, but in that case, we shouldn't add those blocks to the invalid blocks list either, since it is 100% free for a peer to create a block that fails the PoW check, and doing so would cause the list to be filled up with garbage quickly.

There is a discussion to be had whether we should have any invalid blocks cache in the first place...

Thanks to @Rucknium for finding this issue, and to everyone running a stressnet node!

Copy link

@iamamyth iamamyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the premise that an invalid block cache doesn't make much sense: If a non-malicious peer sends a block the node deems invalid, validation should fail fast in common scenarios (unsupported transaction versions, missing parent blocks, etc), rather than relying on caching. Caching just tends to obscure flaws in the efficiency of the validation logic (e.g. one might not notice fail-fast conditions being spotted late in the validation because of the caching).

@j-berman
Copy link
Collaborator

Also looking at the other spot add_block_as_invalid is called:

if(!check_tx_inputs(tx, tvc))
{
MERROR_VER("Block with id: " << id << " has at least one transaction (id: " << tx_id << ") with wrong inputs.");
//TODO: why is this done? make sure that keeping invalid blocks makes sense.
add_block_as_invalid(bl, id);

since check_tx_inputs can be false at one point and become true (example), the cache behavior for invalid blocks looks incorrect.

Note: the fourth and final spot add_block_as_invalid is called:

if (tx_index >= m_blocks_txs_check.size() || memcmp(&m_blocks_txs_check[tx_index++], &tx_id, sizeof(tx_id)) != 0)
{
MERROR_VER("Block with id: " << id << " has at least one transaction (id: " << tx_id << ") with wrong inputs.");
//TODO: why is this done? make sure that keeping invalid blocks makes sense.
add_block_as_invalid(bl, id);

is slated for removal in #9135

@jeffro256 thoughts on just removing the cache in this PR?

@j-berman
Copy link
Collaborator

Hmm, actually it would seem the block's prev_id would need to be different for check_tx_inputs to have a different result (higher height = different prev id), which means the block id would be distinct, and the block id with a truthy check_tx_inputs wouldn't be in the invalid cache

Approving this PR as is, though agree the invalid cache generally seems to introduce a wider vector for abuse than benefit

@Rucknium
Copy link

I ran this patch on a few stressnet nodes (i.e. nodes running https://github.com/spackle-xmr/monero ) and left a few nodes unpatched. My patched nodes never got stuck because of this invalid blocks problem, but my unpatched nodes got stuck four times during this test period. The patch seems to do what it's supposed to do.

AFAIK, the problem is likely to occur when transaction propagation is imperfect. An unpatched node may get a fluffy block, not have all the fluffy block's transactions in its transaction pool, then mark the block as permanently invalid because it can't find all of the block's transactions. On stressnet, this has occurred when (1) there are many large consolidation transactions (e.g. many 144in/2out) or (2) when one or more nodes have their maximum tx pool size set above the default and the aggregate size of unconfirmed txs on the network exceeds the default maximum tx pool size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants