[Bug]: binary should exit with code 1 upon fatal or unrecoverable errors #20745

Open
shupcode opened this issue Jun 21, 2024 · 4 comments · May be fixed by #21563

@shupcode

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

@MSalopek - from cosmos/gaia#3138 (comment)

Key observations during the gaia v17 chain halt (https://forum.cosmos.network/t/cosmos-hub-v17-1-chain-halt-post-mortem/13899):

  • Node binary did not terminate and continued to run
  • pex and p2p peers still connected
  • RPC/API endpoints still functioning
  • endpoint /status showing catching_up: false

Many container orchestration tools (like Docker Compose / Kubernetes) and monitoring systems are configured to detect non-zero exit codes as failure indicators, triggering alerts or automated recovery processes.
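
To illustrate that detection path, here is a minimal, hypothetical supervisor sketch in Go (not any real orchestration tool); the binary name `gaiad` and the restart/alert decision are assumptions:

```go
package main

import (
	"fmt"
	"os/exec"
)

// superviseOnce runs the node binary once and reports how it ended.
// Orchestrators such as Docker Compose and Kubernetes apply the same idea:
// a non-zero exit code marks the container as failed and triggers restart
// policies or alerts.
func superviseOnce(binary string, args ...string) {
	cmd := exec.Command(binary, args...)
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			fmt.Println("node failed with exit code", exitErr.ExitCode(), "-> restart/alert")
			return
		}
		fmt.Println("node could not be started:", err)
		return
	}
	fmt.Println("node exited cleanly (code 0)")
}

func main() {
	// Hypothetical invocation; replace with the actual node binary and args.
	superviseOnce("gaiad", "start")
}
```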

Allowing a node to remain active despite irrecoverable errors introduces numerous complex edge cases that need careful handling.

When a node in an error state continues to connect to peers and respond to API calls, it can produce misleading status indicators. For example, it may inaccurately report catching_up: false.

Monitoring latest_block_time is also challenging: a stale value cannot distinguish a legitimate chain halt (e.g., during an upgrade) from a fatal error that has stopped block production.
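
As a rough illustration of that monitoring problem, the sketch below polls CometBFT's /status endpoint and flags a stale latest_block_time; the local RPC address and the 5-minute threshold are assumptions, and staleness alone still cannot tell a planned halt from a fatal error:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusResponse captures the fields of CometBFT's /status RPC that matter here.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockTime time.Time `json:"latest_block_time"`
			CatchingUp      bool      `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

func main() {
	// Assumes the node RPC listens on the default local port.
	resp, err := http.Get("http://localhost:26657/status")
	if err != nil {
		fmt.Println("rpc unreachable:", err)
		return
	}
	defer resp.Body.Close()

	var st statusResponse
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		fmt.Println("bad status response:", err)
		return
	}

	stale := time.Since(st.Result.SyncInfo.LatestBlockTime)
	fmt.Printf("catching_up=%v latest block %s ago\n", st.Result.SyncInfo.CatchingUp, stale)
	// A stale latest_block_time alone cannot tell a planned halt (e.g. an
	// upgrade height) apart from a node stuck after a fatal error.
	if stale > 5*time.Minute {
		fmt.Println("block production appears stalled; manual inspection needed")
	}
}
```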

Maintaining peer connections during a fatal error serves little purpose and may exacerbate issues.

Conclusion:
There is a crucial need for improved error identification to differentiate between errors requiring immediate node termination (triggering monitoring system alerts) and those that are recoverable. This distinction is essential for maintaining system integrity and reliability.
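
A minimal sketch of the kind of classification the conclusion asks for, assuming a hypothetical FatalError wrapper and runNode function (this is not SDK code):

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// FatalError is a hypothetical marker for errors that cannot be recovered from.
type FatalError struct{ Err error }

func (e *FatalError) Error() string { return "fatal: " + e.Err.Error() }
func (e *FatalError) Unwrap() error { return e.Err }

// runNode stands in for the node's main loop; it returns an error when it stops.
func runNode() error {
	// ... start consensus, p2p, RPC ...
	return &FatalError{Err: errors.New(`UPGRADE "v5.0.0" NEEDED at height 17560000`)}
}

func main() {
	if err := runNode(); err != nil {
		var fatal *FatalError
		if errors.As(err, &fatal) {
			// Unrecoverable: exit non-zero so Docker/Kubernetes restart policies
			// and monitoring systems see the failure.
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Recoverable errors can be logged and handled without killing the process.
		fmt.Fprintln(os.Stderr, "recoverable error:", err)
	}
}
```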

Cosmos SDK Version

0.47

How to reproduce?

No response

@shupcode added the T:Bug label Jun 21, 2024
@tac0turtle
Member

From my understanding, the issue related to the Hub occurred inside a transaction; correct me if I'm wrong. Panics are recovered in the case of transactions because a panic can have many causes and usually just means the tx failed and execution should move on to the next transaction.

We do exit in many cases, but it seems this was caused by an error the state machine could not have known was unrecoverable. I agree that conditions such as an apphash or lastresultblockhash mismatch with the rest of the network should make the node exit; this would need to happen on the Comet side, since we have no knowledge of the mismatch because the hash has not yet reached us in the consensus process.

I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.
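
For context, the per-transaction recovery pattern described here looks roughly like the following minimal sketch (hypothetical runTx helper, not the SDK's actual implementation):

```go
package main

import "fmt"

// runTx executes a single transaction and converts any panic into an error,
// so one bad tx does not take the whole node down.
func runTx(tx func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("tx failed: recovered from panic: %v", r)
		}
	}()
	tx()
	return nil
}

func main() {
	txs := []func(){
		func() { panic("out of gas") }, // failing tx
		func() { fmt.Println("tx 2 executed") },
	}
	for i, tx := range txs {
		if err := runTx(tx); err != nil {
			// The tx is marked as failed and execution moves on to the next one.
			fmt.Printf("tx %d: %v\n", i, err)
			continue
		}
	}
}
```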

@tac0turtle
Member

After talking with Matija, I believe I have a better understanding of the issue. The chain halts, but the process does not exit. This should be a simple fix once we find where it needs to go 😄. We will look into it. Thanks for opening the issue.

@MSalopek
Contributor

MSalopek commented Jul 4, 2024

Adding some logs for posterity.

The issue occurs at the upgrade height, not when a Tx panics. Panics in Txs are handled correctly.

When the upgrade height is reached, a panic is raised stating that an upgrade is needed (expected).
The node continues peering and does not shut down, which can lead to issues (unexpected).

The expected behaviour is for the node to fully shut down when a panic occurs in Begin/EndBlocker.

3:39PM INF indexed block events height=17559999 module=txindex
3:39PM INF Timed out dur=420.808051 height=17560000 module=consensus round=0 step=RoundStepNewHeight
3:39PM INF received proposal module=consensus proposal="Proposal{17560000/0 (37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE:1:000170E3AD91, -1) 9461678612CB @ 2024-06-06T15:39:44.733735052Z}" proposer=F243682DFC3A5AF6BE1934A51777DB33E2CB575C
3:39PM INF received complete proposal block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus
3:39PM INF finalizing commit of block hash=37CF0093FC7D1ED5C5E7D4F42BFA378729A99EEFC1431CF61B6CC9CD81C6D1FE height=17560000 module=consensus num_txs=5 root=A4933AC4DAA9C2331A5B557B3094BAF48A69CABBC0ED3BA11A440D1AD7A6C620
3:39PM ERR UPGRADE "v5.0.0" NEEDED at height: 17560000: { "binaries": { "linux/amd64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-amd64.tar.gz", "linux/arm64":"https://github.com/dydxprotocol/v4-chain/releases/download/protocol%2Fv5.0.0/dydxprotocold-v5.0.0-linux-arm64.tar.gz"}} module=x/upgrade
3:39PM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 1069 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:800 +0x46\npanic({0x47077a0?, 0x75ae2c0?})\n\truntime/panic.go:914 +0x21f\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:893 +0x16a\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0xc0012866c0, 0xc1a0cba000)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/baseapp/abci.go:925 +0x24c\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock(...)\n\tgithub.com/cosmos/cosmos-sdk@v0.50.4/server/cmt_abci.go:44\ngithub.com/cometbft/cometbft/abci/client.(*unsyncLocalClient).FinalizeBlock(0xc057a78488?, {0x5627a30?, 0x77393c0?}, 0xc0?)\n\tgithub.com/cometbft/cometbft@v0.38.5/abci/client/unsync_local_client.go:137 +0x26\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0xc00dae1110, {0x5627a30, 0x77393c0}, 0x1?)\n\tgithub.com/cometbft/cometbft@v0.38.5/proxy/app_conn.go:104 +0x170\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc00bb4b90a, 0x6}}, {0xc00bb4b9a0, 0xe}, 0x1, 0x10bf1bf, {{0xc2f987ada0, ...}, ...}, ...}, ...)\n\tgithub.com/cometbft/cometbft@v0.38.5/state/execution.go:226 +0x837\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1833 +0xb7d\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0xc001353180, 0x10bf1c0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1744 +0x349\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1679 +0x9c\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0xc001353180, 0x10bf1c0, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:1717 +0xcd0\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150, 0x28})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2371 +0x1be5\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc001353180, 0xc2be783ee0, {0xc002fa8150?, 0xc057a79c08?})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:2123 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc001353180, {{0x55f3740, 0xc1a08187c8}, {0xc002fa8150, 0x28}})\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:928 +0x3ce\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc001353180, 0x0)\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:835 +0x3d1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 69\n\tgithub.com/cometbft/cometbft@v0.38.5/consensus/state.go:397 +0x10c\n"

// node is expected to shut down after printing 'CONSENSUS FAILURE'
// instead, the node continues maintaining network connections 

3:39PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/mnt/storage/dydx-live/.dydxprotocol/data/cs.wal/wal
3:39PM INF service stop impl=MConn{176.9.29.248:23856} module=p2p msg="Stopping MConnection service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:49746->176.9.29.248:23856: read: connection reset by peer" module=p2p peer="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}"
3:39PM INF service stop impl="Peer{MConn{176.9.29.248:23856} 84995de5103a49513ab84f1b4f3355bc85626a37 out}" module=p2p msg="Stopping Peer service" peer=84995de5103a49513ab84f1b4f3355bc85626a37@176.9.29.248:23856
3:39PM INF service stop impl=MConn{85.190.134.37:26656} module=p2p msg="Stopping MConnection service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM ERR Stopping peer for error err="read tcp 46.4.72.249:41384->85.190.134.37:26656: read: connection reset by peer" module=p2p peer="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}"
3:39PM INF service stop impl="Peer{MConn{85.190.134.37:26656} 3cd3146398b56a55cfda61eca427f6bb2cf60601 out}" module=p2p msg="Stopping Peer service" peer=3cd3146398b56a55cfda61eca427f6bb2cf60601@85.190.134.37:26656
3:39PM INF service stop impl=MConn{[2a01:4f9:3a:11e0::5]:26656} module=p2p msg="Stopping MConnection service" peer=58ea81df95728fafae81c011462432a72d18ad25@[2a01:4f9:3a:11e0::5]:26656
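
A minimal sketch of the alternative behaviour being requested: if the recover handler around the consensus routine escalated to a non-zero process exit instead of only logging CONSENSUS FAILURE, supervisors would see the crash. This is illustrative only; consensusRoutine and the exitOnFailure flag are hypothetical and not CometBFT code:

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// consensusRoutine stands in for a consensus receive loop. In the behaviour
// shown in the logs above, a panic here is recovered and logged as CONSENSUS
// FAILURE while the rest of the node (p2p, RPC) keeps running. Escalating to
// os.Exit(1) is one way to make the whole process terminate instead.
func consensusRoutine(exitOnFailure bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "CONSENSUS FAILURE!!! err=%v\nstack=%s\n", r, debug.Stack())
			if exitOnFailure {
				// Non-zero exit code so supervisors and monitoring notice the failure.
				os.Exit(1)
			}
			// Otherwise only this goroutine stops; other services stay up.
		}
	}()
	var app *struct{ height int64 }
	_ = app.height // nil pointer dereference, like the stack trace above
}

func main() {
	consensusRoutine(true)
	fmt.Println("unreachable when exitOnFailure is true")
}
```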

@MSalopek
Contributor

MSalopek commented Aug 7, 2024

> From my understanding, the issue related to the Hub occurred inside a transaction; correct me if I'm wrong. Panics are recovered in the case of transactions because a panic can have many causes and usually just means the tx failed and execution should move on to the next transaction.
>
> We do exit in many cases, but it seems this was caused by an error the state machine could not have known was unrecoverable. I agree that conditions such as an apphash or lastresultblockhash mismatch with the rest of the network should make the node exit; this would need to happen on the Comet side, since we have no knowledge of the mismatch because the hash has not yet reached us in the consensus process.
>
> I'd like to understand what went wrong and how to prevent it, but it seems the framework acted as intended.

Clarifying some of the assumptions above: the panics are not Tx related; they occur in FinalizeBlock operations.

Adding more findings from Gaia testnets:
The same behaviour was observed when using cosmos-sdk v0.50:

  • When a halt height is set in app.toml and the node reaches it, p2p and RPC stay active and the node keeps producing log output (see the sketch below).
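
For illustration, a minimal sketch of terminating the process at a configured halt height (hypothetical haltHeight constant and commitBlock function, not the SDK's actual halt logic); a planned halt could exit 0 while unrecoverable errors exit 1:

```go
package main

import (
	"fmt"
	"os"
)

// haltHeight mirrors the app.toml `halt-height` setting (0 means disabled).
const haltHeight int64 = 17560000

// commitBlock stands in for the commit step of block processing.
func commitBlock(height int64) {
	fmt.Println("committed block", height)
	if haltHeight > 0 && height >= haltHeight {
		// Stop the whole process here rather than only halting consensus,
		// so p2p and RPC do not keep serving a halted node.
		fmt.Fprintln(os.Stderr, "halt-height reached, shutting down")
		os.Exit(0)
	}
}

func main() {
	for h := haltHeight - 2; ; h++ {
		commitBlock(h)
	}
}
```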

@lucaslopezf linked a pull request Sep 5, 2024 that will close this issue