[Bug]: binary should exit with code 1 upon fatal or unrecoverable errors #20745

Open
shupcode opened this issue Jun 21, 2024 · 4 comments · May be fixed by #21563

@shupcode

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

@MSalopek - from cosmos/gaia#3138 (comment)

Key observations during the gaia v17 chain halt (https://forum.cosmos.network/t/cosmos-hub-v17-1-chain-halt-post-mortem/13899):

  • Node binary did not terminate and continued to run
  • pex and p2p peers still connected
  • RPC/API endpoints still functioning
  • endpoint /status showing catching_up: false

Many container orchestration tools (like Docker Compose / Kubernetes) and monitoring systems are configured to detect non-zero exit codes as failure indicators, triggering alerts or automated recovery processes.
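
To illustrate that detection path, here is a minimal, hypothetical supervisor sketch in Go (not any real orchestration tool); the binary name `gaiad` and the restart/alert decision are assumptions:

```go
package main

import (
	"fmt"
	"os/exec"
)

// superviseOnce runs the node binary once and reports how it ended.
// Orchestrators such as Docker Compose and Kubernetes apply the same idea:
// a non-zero exit code marks the container as failed and triggers restart
// policies or alerts.
func superviseOnce(binary string, args ...string) {
	cmd := exec.Command(binary, args...)
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			fmt.Println("node failed with exit code", exitErr.ExitCode(), "-> restart/alert")
			return
		}
		fmt.Println("node could not be started:", err)
		return
	}
	fmt.Println("node exited cleanly (code 0)")
}

func main() {
	// Hypothetical invocation; replace with the actual node binary and args.
	superviseOnce("gaiad", "start")
}
```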

Allowing a node to remain active despite irrecoverable errors introduces numerous complex edge cases that need careful handling.

When a node in an error state continues to connect to peers and respond to API calls, it can produce misleading status indicators. For example, it may inaccurately report catching_up: false.

Monitoring latest_block_time is also challenging: a stale value cannot distinguish a legitimate chain halt (e.g., during an upgrade) from a fatal error that has stopped block production.
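
As a rough illustration of that monitoring problem, the sketch below polls CometBFT's /status endpoint and flags a stale latest_block_time; the local RPC address and the 5-minute threshold are assumptions, and staleness alone still cannot tell a planned halt from a fatal error:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusResponse captures the fields of CometBFT's /status RPC that matter here.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockTime time.Time `json:"latest_block_time"`
			CatchingUp      bool      `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

func main() {
	// Assumes the node RPC listens on the default local port.
	resp, err := http.Get("http://localhost:26657/status")
	if err != nil {
		fmt.Println("rpc unreachable:", err)
		return
	}
	defer resp.Body.Close()

	var st statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		fmt.Println("bad status response:", err)
		return
	}

	stale := time.Since(st.Result.SyncInfo.LatestBlockTime)
	fmt.Printf("catching_up=%v latest block %s ago\n", st.Result.SyncInfo.CatchingUp, stale)
	// A stale latest_block_time alone cannot tell a planned halt (e.g. an
	// upgrade height) apart from a node stuck after a fatal error.
	if stale > 5*time.Minute {
		fmt.Println("block production appears stalled; manual inspection needed")
	}
}
```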

Maintaining peer connections during a fatal error serves little purpose and may exacerbate issues.

Conclusion:
There is a crucial need for improved error identification to differentiate between errors requiring immediate node termination (triggering monitoring system alerts) and those that are recoverable. This distinction is essential for maintaining system integrity and reliability.
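
A minimal sketch of the kind of classification the conclusion asks for, assuming a hypothetical FatalError wrapper and runNode function (this is not SDK code):

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// FatalError is a hypothetical marker for errors that cannot be recovered from.
type FatalError struct{ Err error }

func (e *FatalError) Error() string { return "fatal: " + e.Err.Error() }
func (e *FatalError) Unwrap() error { return e.Err }

// runNode stands in for the node's main loop; it returns an error when it stops.
func runNode() error {
	// ... start consensus, p2p, RPC ...
	return &FatalError{Err: errors.New(`UPGRADE "v5.0.0" NEEDED at height 17560000`)}
}

func main() {
	if err := runNode(); err != nil {
		var fatal *FatalError
		if errors.As(err, &fatal) {
			// Unrecoverable: exit non-zero so Docker/Kubernetes restart policies
			// and monitoring systems see the failure.
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Recoverable errors can be logged and handled without killing the process.
		fmt.Fprintln(os.Stderr, "recoverable error:", err)
	}
}
```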

Cosmos SDK Version

0.47

How to reproduce?

No response

@shupcode added the T:Bug label Jun 21, 2024
@tac0turtle
Member

From my understanding, the issue related to the Hub occurred inside a transaction; correct me if I'm wrong. Panics are recovered in the case of transactions because a panic can have many causes and usually just means the tx failed and execution should move on to the next transaction.

We do exit in many cases, but it seems this was caused by an error the state machine could not have known was unrecoverable. I agree that conditions such as an apphash or lastresultblockhash mismatch with the rest of the network should make the node exit; this would need to happen on the Comet side, since we have no knowledge of the mismatch because the hash has not yet reached us in the consensus process.

I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.
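
For context, the per-transaction recovery pattern described here looks roughly like the following minimal sketch (hypothetical runTx helper, not the SDK's actual implementation):

```go
package main

import "fmt"

// runTx executes a single transaction and converts any panic into an error,
// so one bad tx does not take the whole node down.
func runTx(tx func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("tx failed: recovered from panic: %v", r)
		}
	}()
	tx()
	return nil
}

func main() {
	txs := []func(){
		func() { panic("out of gas") }, // failing tx
		func() { fmt.Println("tx 2 executed") },
	}
	for i, tx := range txs {
		if err := runTx(tx); err != nil {
			// The tx is marked as failed and execution moves on to the next one.
			fmt.Printf("tx %d: %v\n", i, err)
			continue
		}
	}
}
```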

@tac0turtle
Member

After talking with Matija, I believe I have a better understanding of the issue. The chain halts, but the process does not exit. This should be a simple fix once we find where it needs to go 😄. We will look into it. Thanks for opening the issue.

@MSalopek
Contributor

MSalopek commented Jul 4, 2024

Adding some logs for posterity.

The issue occurs at the upgrade height, not when a Tx panics. Panics in Txs are handled correctly.

When the upgrade height is reached, a panic is raised stating that an upgrade is needed (expected).
The node continues peering and does not shut down, which can lead to issues (unexpected).

The expected behaviour is for the node to fully shut down when a panic occurs in Begin/EndBlocker.

3:39PM INF indexed block events height=17559999 module=txindex
3:39PM INF Timed out dur=420.808051 height=17560000 module=consensus round=0 step=RoundStepNewHeight
3:39PM INF received proposal module=consensus proposal="Proposal{17560000/0 (37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE:1:000170E3AD91, -1) 9461678612CB @ 2024-06-06T15:39:44.733735052Z}" proposer=F243682DFC3A5AF6BE1934A51777DB33E2CB575C
3:39PM INF received complete proposal block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus
3:39PM INF finalizing commit of block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus num_txs=5 root=A4933AC4DAA9C2331A5B557B3094BAF48A69CABBC0ED3BA11A440D1AD7A6C620
3:39PM ERR UPGRADE "v5.0.0" NEEDED at height: 17560000: { "binaries": { "linux/amd64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-amd64.tar.gz", "linux/arm64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-arm64.tar.gz"}} module=x/upgrade
3:39PM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 1069 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:800 +0x46\npanic({0x47077a0?, 0x75ae2c0?})\n\truntime/panic.go:914 +0x21f\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:893 +0x16a\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0xc0012866c0, 0xc1a0cba000)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:925 +0x24c\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock(...)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/server/cmt_abci.go:44\ngithub.com/cometbft/cometbft/abci/client.(*unsyncLocalClient).FinalizeBlock(0xc057a78488?, {0x5627a30?, 0x77393c0?}, 0xc0?)\n\tgithub.com/cometbft/cometbft@v0.38.5/abci/client/unsync_local_client.go:137 +0x26\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0xc00dae1110, {0x5627a30, 0x77393c0}, 0x1?)\n\tgithub.com/cometbft/cometbft@v0.38.5/proxy/app_conn.go:104 +0x170\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc00bb4b90a, 0x6}}, {0xc00bb4b9a0, 0xe}, 0x1, 0x10bf1bf, {{0xc2f987ada0, ...}, ...}, ...}, ...)\n\tgithub.com/cometbft/cometbft@v0.38.5/state/execution.go:226 +0x837\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1833 +0xb7d\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1744 +0x349\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1679 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc001353180, 0x10bf1c0, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1717 +0xcd0\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150, 0x28})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2371 +0x1be5\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150?, 0xc057a79c08?})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2123 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001353180, {{0x55f3740, 0xc1a08187c8}, {0xc002fa8150, 0x28}})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:928 +0x3ce\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001353180, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:835 +0x3d1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 69\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:397 +0x10c\n"

// node is expected to shut down after printing 'CONSENSUS FAILURE'
// instead, the node continues maintaining network connections 

3:39PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=MConn{176.9.29.248:23856} module=p2p msg="Stopping MConnection service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:49746->176.9.29.248:23856: read: connection reset by peer" module=p2p peer="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}"
3:39PM INF service stop impl="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}" module=p2p msg="Stopping Peer service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM INF service stop impl=MConn{85.190.134.37:26656} module=p2p msg="Stopping MConnection service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:41384->85.190.134.37:26656: read: connection reset by peer" module=p2p peer="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}"
3:39PM INF service stop impl="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}" module=p2p msg="Stopping Peer service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM INF service stop impl=MConn{[2a01:4f9:3a:11e0::5]:26656} module=p2p msg="Stopping MConnection service" peer=58ea81df95728fafae81c011462432a72d18ad25@[2a01:4f9:3a:11e0::5]:26656
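
A minimal sketch of the alternative behaviour being requested: if the recover handler around the consensus routine escalated to a non-zero process exit instead of only logging CONSENSUS FAILURE, supervisors would see the crash. This is illustrative only; consensusRoutine and the exitOnFailure flag are hypothetical and not CometBFT code:

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// consensusRoutine stands in for a consensus receive loop. In the behaviour
// shown in the logs above, a panic here is recovered and logged as CONSENSUS
// FAILURE while the rest of the node (p2p, RPC) keeps running. Escalating to
// os.Exit(1) is one way to make the whole process terminate instead.
func consensusRoutine(exitOnFailure bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "CONSENSUS FAILURE!!! err=%v\nstack=%s\n", r, debug.Stack())
			if exitOnFailure {
				// Non-zero exit code so supervisors and monitoring notice the failure.
				os.Exit(1)
			}
			// Otherwise only this goroutine stops; other services stay up.
		}
	}()
	var app *struct{ height int64 }
	_ = app.height // nil pointer dereference, like the stack trace above
}

func main() {
	consensusRoutine(true)
	fmt.Println("unreachable when exitOnFailure is true")
}
```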

@MSalopek
Contributor

MSalopek commented Aug 7, 2024

> From my understanding, the issue related to the Hub occurred inside a transaction; correct me if I'm wrong. Panics are recovered in the case of transactions because a panic can have many causes and usually just means the tx failed and execution should move on to the next transaction.
>
> We do exit in many cases, but it seems this was caused by an error the state machine could not have known was unrecoverable. I agree that conditions such as an apphash or lastresultblockhash mismatch with the rest of the network should make the node exit; this would need to happen on the Comet side, since we have no knowledge of the mismatch because the hash has not yet reached us in the consensus process.
>
> I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.

Clarifying some of the assumptions above: the panics are not Tx related; they occur in FinalizeBlock operations.

Adding more findings from Gaia testnets:
The same behaviour was observed when using cosmos-sdk v0.50:

  • When a halt height is set in app.toml and the node reaches it, p2p and RPC stay active and the node keeps producing log output (see the sketch below).
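
For illustration, a minimal sketch of terminating the process at a configured halt height (hypothetical haltHeight constant and commitBlock function, not the SDK's actual halt logic); a planned halt could exit 0 while unrecoverable errors exit 1:

```go
package main

import (
	"fmt"
	"os"
)

// haltHeight mirrors the app.toml `halt-height` setting (0 means disabled).
const haltHeight int64 = 17560000

// commitBlock stands in for the commit step of block processing.
func commitBlock(height int64) {
	fmt.Println("committed block", height)
	if haltHeight > 0 && height >= haltHeight {
		// Stop the whole process here rather than only halting consensus,
		// so p2p and RPC do not keep serving a halted node.
		fmt.Fprintln(os.Stderr, "halt-height reached, shutting down")
		os.Exit(0)
	}
}

func main() {
	for h := haltHeight - 2; ; h++ {
		commitBlock(h)
	}
}
```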

@lucaslopezf linked a pull request Sep 5, 2024 that will close this issue