Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow subsystems to finish work before shutdown: availability recovery and approvals #985

Open
rphmeier opened this issue Nov 28, 2020 · 5 comments
Labels
I4-refactor Code needs refactoring.

Comments

@rphmeier
Copy link
Contributor

rphmeier commented Nov 28, 2020

The approval voting protocol treats validators who broadcast their assignment but not their corresponding approval as "no-shows". In all likelihood, nodes that shut down instantaneously will leave some dangling assignments and thus will accidentally slow down finality. We should alter the handling of the Conclude message in AvailabilityRecovery and ApprovalVoting to finish all in-progress recoveries and approval checks before shutting down the subsystem.

I think this should come with a refactoring of our entire shutdown process for the overseer. The flow should go something like this instead:

  • Overseer broadcasts a signal BeginConclude(response) to each subsystem.
  • It waits T time (~seconds) to receive responses from each subsystem.
  • Once all responses are received or a timeout is reached, broadcast a signal Conclude to each subsystem and then wait for them to shut down.

Subsystems should handle Conclude as an immediate shutdown signal. BeginConclude should instruct the subsystem to prepare to shut down. ApprovalVoting would not broadcast any new assignments after receiving this signal. Backing should not kick off any seconding or validation work after receiving this signal. etc.

@bkchr
Copy link
Member

bkchr commented Nov 28, 2020

I think currently the Conclude signal will not be send when the node is shutting down. Looking at the code it requires manual calling of the code that sends the signal, which is currently not done.

To fix this issue properly, it probably requires some changes to Substrate or at least some better integration into the shutdown process of the node. The node is shut down when some essential task has shutdown or it receives a signal to shutdown. This is directly provided by substrate. When shutting down it will tell the event loop to shutdown, meaning the overseer would need to check for when it is dropped.

@burdges
Copy link

burdges commented Nov 28, 2020

This speeds up finality by like 24 seconds ocasionally, but it's not a priority for testnets or even MVP. Actually testnet might do better seeing the bad behavior. And MVP is whatever you guys think is best.

@rphmeier
Copy link
Contributor Author

In the case of PVF execution, we've also encountered the AmbiguousWorkerDeath error during shutdown: the child processes executing PVFs are killed on ctrl-c, and this can lead to spurious disputes. So two more suggested changes:

  • The candidate-validation subsystem should not begin executing any more PVFs during conclusion (and should inform the caller that the execution request was denied due to shutting down)
  • The disputes subsystems should not initiate or participate in disputes while shutting down, and instead do the work after starting up again.

@eskimor
Copy link
Member

eskimor commented Oct 30, 2022

Ok the problem I am seeing here is two-fold:

  1. What Basti said, this would require substrate changes as substrate is already handling shutdown signals and just stops executing.
  2. Even if we did that, that waiting for a certain amount of time until force quit is inherently racy with what operating systems are doing - they do exactly the same. Even if we had a perfectly clean shutdown procedure in the code, this would not prevent the operating system to kill us before it completes. With child processes it becomes even worse, as the sequence and shutdown strategy is not even well defined. E.g. the operating system might already have killed our child processes, before we managed to complete the "clean shutdown"/deliver the Conclude message to the relevant subsystems.

In any case, the biggest blocker is the handling in substrate actually. I tried sending the Conclude signal on SIGTERM and similar messages, but it did not get through as other shutdown sequences are faster.

For the time being, for the issue of ambiguous worker death: I am leaning towards the retry approach as this makes disputes due to restarts extremely unlikely already - especially if we don't persist the knowledge that we tried once already, because then a restart would mean another two tries (if we are even still caring about the candidate). In addition we can handle the SIGTERM and similar signals in candidate-validation directly and ignore any ambiguous worker death if we received that signal before, maybe even with a little buffering: If we receive ambiguous worker death, we only report it if we don't receive SIGTERM within 100ms or something like that. This is all very hacky though, the simple retry (with delay) approach should probably be good enough for now, until we get to implement the proper shutdown sequence on top.

@rphmeier
Copy link
Contributor Author

rphmeier commented Nov 2, 2022

Is it actually the case that operating systems typically shut down processes within a few seconds of SIGTERM? I'm no expert here, but my experience is that hung processes due to stuck SIGTERM handlers is quite common.

We could and probably should also insert a signal handler into the child processes to communicate back a result that SIGTERM killed the process, and the candidate validation subsystem can handle this accordingly.

Yes, the retry approach is probably necessary in any case, but it seems like we should have both.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I4-refactor Code needs refactoring. and removed I8-refactor labels Aug 25, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
* Fix overflow in reading gas_price

* Saturating conversion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I4-refactor Code needs refactoring.
Projects
None yet
Development

No branches or pull requests

5 participants