Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Signal channel is terminated and empty in v0.9.37 in Kusama validator. #6624

Closed
m-saxemberg opened this issue Jan 24, 2023 · 10 comments
Closed

Comments

@m-saxemberg
Copy link

Using both commands
--chain kusama --validator --db paritydb --in-peers 25 --out-peers 25 --pruning 1000 --wasm-execution
--chain kusama --validator --db rocksdb --in-peers 25 --out-peers 25 --pruning 1000 --wasm-execution

The node synchronizes for a while and then stops with multiple messages saying:
error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))

The error makes the process stop but it can be restarted. It doesn't require a database purge.

Moreover:

Errors like these show up during synchronization:
error=Runtime(RuntimeRequest(Execution { runtime_api_name: "on_chain_votes", source: Application(Execution(Other("Exported method ParachainH ost_on_chain_votes is not found"))) }))
Received msg before first active leaves update. This is not expected - message will be dropped. msg=ActiveDisputes(Sender { complete: false }) 2023-01-24 23:03:30 error=Sender(AskActiveDisputesCanceled) ctx="on_message"

Other non-error messages also appear which may or may not be related which include reindex operations:
reindex 05-18 into 05-19

@sandreim
Copy link
Contributor

Cc @tdimitrov @eskimor

@tdimitrov
Copy link
Contributor

Hey @m-saxemberg, thank you for filing this issue.

The problem is around the first error you've pasted.

The node synchronizes for a while and then stops with multiple messages saying:
error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))

I've tried to recreate the issue locally but it didn't happen. Could you please provide the whole log?

Or even better - can you upload trace logs (run the binary with -lparachain=TRACE?

@clangenb
Copy link

clangenb commented Jan 26, 2023

I observed something similar on v0.9.37, not yet with -lparachain=TRACE, but here is the log: https://gist.github.com/clangenb/0bb0ce81ea3178f598266945ebb0edae

Note: mine is a polkadot validator

@tdimitrov
Copy link
Contributor

Thank you @clangenb. Can you give me a bit more details how do you recreate this?
Is it a fresh node with an empty database and the error occurs during sync? Or after upgrade? I don't understand why I can't simulate it locally.

@clangenb
Copy link

@mosonyi can you provide additional information on that?

@clangenb
Copy link

clangenb commented Jan 27, 2023

Ok, it is hard to narrow down. We only noticed this for the first time after upgrading to v0.9.37, but it did not happen immediately after the upgrade. What we can say though is that this specific machine was having some reliability issues at that time, so it might be due to a corrupted database.

Anyhow, currently it seems that redownloading a db snapshot and continuing from there has resolved the issue.

@eskimor
Copy link
Member

eskimor commented Jan 27, 2023

Ok, closing then. Feel free to re-open if necessary.

@aliXsed
Copy link

aliXsed commented Mar 1, 2023

@tdimitrov @eskimor our parachain client is built on top of 0.9.37. I did a test by upgrading one of the node images and let it go through a major sync. The chain I'm talking about is Nodle on Polkadot. We observed the above issue repeatedly but could help the node to continue by restarting it. Here is our log:

2023-02-27 23:57:17 [Parachain] ⚙️  Syncing  0.0 bps, target=#2156686 (15 peers), best: #1557888 (0xda0e…fa46), finalized #0 (0x97da…7a21), ⬇ 109.3kiB/s ⬆ 0.5kiB/s
2023-02-27 23:57:19 [Relaychain] cannot query the runtime API version: UnknownBlock: State already discarded for 0x97eedfa8db50238ab2e113c59a8045f6dc81d9a44473165063741c2244de5a4e
2023-02-27 23:57:22 [Relaychain] ⚙️  Syncing 341.6 bps, target=#14432105 (40 peers), best: #3148687 (0x9d72…aca4), finalized #3148288 (0x0454…8616), ⬇ 344.0kiB/s ⬆ 187.3kiB/s
2023-02-27 23:57:22 [Parachain] ⚙️  Syncing  0.0 bps, target=#2156686 (15 peers), best: #1557888 (0xda0e…fa46), finalized #0 (0x97da…7a21), ⬇ 62.2kiB/s ⬆ 0.5kiB/s
2023-02-27 23:57:24 [Relaychain] cannot query the runtime API version: UnknownBlock: State already discarded for 0x97eedfa8db50238ab2e113c59a8045f6dc81d9a44473165063741c2244de5a4e
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-02-27 23:57:24 [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-27 23:57:24 [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-22 23:41:52 [Relaychain] Skipping chain: unable to gather info about imported block (0x95fa5e63d56839098dfcf3531f5481396915d7a25850f234890243fa3eacea2a, 5863250): block is from an ancient session
2023-02-22 23:41:52 [Relaychain] error=Runtime(RuntimeRequest(Execution { runtime_api_name: "on_chain_votes", source: Application(Execution(Other("Exported method ParachainHost_on_chain_votes is not found"))) }))
2023-02-22 23:41:52 [Relaychain] Received msg before first active leaves update. This is not expected - message will be dropped. msg=ActiveDisputes(Sender { complete: false })
2023-02-22 23:41:52 [Relaychain] error=Sender(AskActiveDisputesCanceled) ctx="on_message"
2023-02-22 23:41:52 [Relaychain] Failed to get session info. relay_parent=0x7d83a73a65cc7ffcad97251320db60230def1e3aa8aa756f4aae40b94a24cf40

This node would eventually stall when it's trying to finalize the parachain blocks and would not print any logs. Which could be a different issue. I'm going to report the last log lines here anyway just in case it's useful:

2023-02-24 03:01:48.684 TRACE main state-db: [Parachain] Importing 2127997, base=2127742                                                                                                                                                                                                                             
2023-02-24 03:01:48.689 DEBUG main db: [Parachain] Opened blockchain db, fetched best = 0xb774c19b2e7d35c1fac372d7cc4046f0320dd9b97727a7df3cadad109d707e85 (2132093)                                                                                                                                                 
2023-02-24 03:01:48.689 DEBUG main db: [Parachain] Opened blockchain db, fetched final = 0x97da7ede98d7bad4e36b4d734b6055425a3be036da2a332ea5a7037656427a21 (0)                                                                                                                                                      
2023-02-24 03:01:48.689 DEBUG main db: [Parachain] Opened blockchain db, fetched final_state = 0x97da7ede98d7bad4e36b4d734b6055425a3be036da2a332ea5a7037656427a21 (0)                                                                                                                                                
2023-02-24 03:01:48.689 DEBUG main db: [Parachain] block_gap=None                                                                                                                                                                                                                                                    
2023-02-24 03:01:48.703 DEBUG main parachain: [Parachain] Restoring chain level monitor from last finalized block: 0 0x97da…7a21

As a last bit of info, all the other parachain nodes were running a previous image built on 0.9.29.

@eskimor
Copy link
Member

eskimor commented Mar 1, 2023

Thanks @aliXsed ! So the node got stuck (how long?), but a restart fixed it - correct?

@aliXsed
Copy link

aliXsed commented Mar 1, 2023

Thanks @aliXsed ! So the node got stuck (how long?), but a restart fixed it - correct?

@eskimor I have mentioned two issues above with two separate logs. The first one was fixed every time we restarted the node. Then later when the node was about to finalize the parachain blocks the second one happened which was persistent even over restarts and the node would stall there for hours. We just had to terminate the node.
With the same image which as I said is built on 0.9.37, no issues were observed for our test net on Rococo where all the nodes are uniformly upgraded to this image. The difference with our main net is first the main net is on Polkadot. Second, we haven't upgraded all the clients with the same image and some of them running older clients.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants