Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not run unneeded subsystems on collator and its alongside node #3061

Merged
merged 10 commits into from
Jan 29, 2024

Conversation

s0me0ne-unkn0wn
Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn commented Jan 25, 2024

Currently, collators and their alongside nodes spin up a full-scale overseer running a bunch of subsystems that are not needed if the node is not a validator. That was considered to be harmless; however, we've got problems with unused subsystems getting stalled for a reason not currently known, resulting in the overseer exiting and bringing down the whole node.

This PR aims to only run needed subsystems on such nodes, replacing the rest with DummySubsystem.

It also enables collator-optimized availability recovery subsystem implementation.

Partially solves #1730.

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/5031914

@s0me0ne-unkn0wn s0me0ne-unkn0wn marked this pull request as ready for review January 25, 2024 19:24
@s0me0ne-unkn0wn s0me0ne-unkn0wn self-assigned this Jan 25, 2024
@s0me0ne-unkn0wn s0me0ne-unkn0wn added the T0-node This PR/Issue is related to the topic “node”. label Jan 25, 2024
Copy link
Member

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like this! At the cost of some minor code duplication, we get rid of some hacky conditional logic and gain more flexibility in instantiating different type of overseer for collators.

Noticed a couple of things that can be cleaned up further, but other than that LGTM.

polkadot/node/service/src/overseer.rs Outdated Show resolved Hide resolved
polkadot/node/service/src/overseer.rs Outdated Show resolved Hide resolved
@skunert
Copy link
Contributor

skunert commented Jan 26, 2024

Nice! Do you see anything speaking against reusing this also for the minimal collator node? In the end we have the nearly identical overseer here again:

@alexggh alexggh self-requested a review January 26, 2024 09:29
@s0me0ne-unkn0wn
Copy link
Contributor Author

Nice! Do you see anything speaking against reusing this also for the minimal collator node?

Yeah, that's a good point, and that's what I tried first. It seemed to be quite logical but came out as usual. The current "full scale" overseer is implemented through the OverseerBuilder which is generic over RuntimeClient, and over DefaultSubsystemClient<RuntimeClient> as well. They are two different levels of API implementation and work fine like that. But in the minimal node, the overseer is generic over BlockChainRpcClient, which provides both levels. So to generalize that code, we need to seriously rework BlockChainRpcClient (make it implement ProvideRuntimeApi for parachain host API, and thus we'd be able to stick it into DefaultSubsystemClient). After evaluating the amount of work to achieve that, I decided to postpone it for now. Might be a follow-up, though.

@ordian
Copy link
Member

ordian commented Jan 26, 2024

looks like cumulus pov recovery test is failing, @skunert do you know why wouldn't it work with minimal overseer?

@s0me0ne-unkn0wn
Copy link
Contributor Author

@ordian looks like I forgot to include something required 🤔

2024-01-27 00:00:56.387 DEBUG tokio-runtime-worker parachain::availability-recovery: [pretty-bee-8350] Requesting availability chunks for a candidate common_params.candidate_hash=0xd9a47aaa51662f5b56f371f2d1616950d44bd8ec76d45e8e2b31c648ebd04ea8 desired_requests_count=5 error_count=136315 total_received=136315 threshold=5 already_requesting_count=0
2024-01-27 00:00:56.387 TRACE tokio-runtime-worker parachain::availability-recovery: [pretty-bee-8350] Failure requesting chunk candidate_hash=0xd9a47aaa51662f5b56f371f2d1616950d44bd8ec76d45e8e2b31c648ebd04ea8 err=NetworkError(UnknownProtocol) validator_index=ValidatorIndex(7) traceID=289296499718077548945207771063688194384

@s0me0ne-unkn0wn
Copy link
Contributor Author

Fixed it

Copy link
Member

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

@bkchr bkchr requested a review from skunert January 27, 2024 22:08
Copy link
Contributor

@skunert skunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point, and that's what I tried first. It seemed to be quite logical but came out as usual. The current "full scale" overseer is implemented through the OverseerBuilder which is generic over RuntimeClient, and over DefaultSubsystemClient as well. They are two different levels of API implementation and work fine like that.

Hmm I have not taken a deep dive on that code, but on first sight it looks like DefaultSubsystemClient only implements the single trait RuntimeApiSubsystemClient. So why does the overseer building codepath even need to know about it. Can we not modify that code to just take some type that implements RuntimeApiSubsystemClient (and iirc AuxStore and ChainApiBackend). This is satisfied by both BlockChainRpcClient and DefaultSubsystemClient.

But anyway, I don't insist that this needs to be investigated right now. Code looks good 👍

@s0me0ne-unkn0wn
Copy link
Contributor Author

@skunert makes sense! I'll try to follow up on that.

@s0me0ne-unkn0wn s0me0ne-unkn0wn added this pull request to the merge queue Jan 29, 2024
Merged via the queue into master with commit 3e8139e Jan 29, 2024
126 of 127 checks passed
@s0me0ne-unkn0wn s0me0ne-unkn0wn deleted the s0me0ne/1730 branch January 29, 2024 10:33
librelois added a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Feb 19, 2024
github-merge-queue bot pushed a commit that referenced this pull request Feb 28, 2024
resolve #3116

a follow-up on
#3061 (review):

- [x] reuse collator overseer builder for polkadot-node and collator
- [x] run zombienet test (0001-parachains-smoke-test.toml)
- [x] make wasm build errors more user-friendly for an easier problem
detection when using different toolchains in Rust

---------

Co-authored-by: ordian <write@reusable.software>
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
resolve paritytech#3116

a follow-up on
paritytech#3061 (review):

- [x] reuse collator overseer builder for polkadot-node and collator
- [x] run zombienet test (0001-parachains-smoke-test.toml)
- [x] make wasm build errors more user-friendly for an easier problem
detection when using different toolchains in Rust

---------

Co-authored-by: ordian <write@reusable.software>
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants