pvf: fix the deadlock when any of the channels get at capacity #5297
Conversation
The PVF host is designed to avoid spawning tasks in order to minimize knowledge of outer code. Using `async_std::task::spawn` (or Tokio's counterpart) was deemed unacceptable, and `SpawnNamed` undesirable. Instead, a single task is returned and spawned by the candidate-validation subsystem; the tasks from the sub-components are polled by that root task. However, the way the tasks were bundled was incorrect: there was a giant select polling those tasks. In particular, as soon as one of the arms of that select goes into an await, the sub-tasks stop getting polled. This is a recipe for a deadlock, which indeed happened here. Specifically, the deadlock occurred while sending messages to the execute queue via [`send_execute`](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L601). When the channel to the queue reaches capacity, control flow is suspended until the queue handles those messages. Since this code is reached from [one of the select arms](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L371), the queue is never given control and no further progress can be made. This problem is solved by bundling the tasks one level higher instead, `select`ing over those long-running tasks. We also stop treating a return from those long-running tasks as an error condition, since that can happen during legitimate shutdown.
@vstakhov if you find it correct, please merge
bot merge
```rust
_ = run_host.fuse() => {},
_ = run_prepare_queue.fuse() => {},
_ = run_prepare_pool.fuse() => {},
_ = run_execute_queue.fuse() => {},
_ = run_sweeper.fuse() => {},
```
nit: We should maintain logs here which one terminated.
Which severity you reckon?
error!
I'd say, this should really never happen
The reason why I removed the `never!`s is that those paths can theoretically be hit during normal shutdown.
* master: (62 commits)
  * Bump tracing from 0.1.32 to 0.1.33 (#5299)
  * Add staging runtime api (#5048)
  * CI: rename ambiguous jobs (#5313)
  * Prepare for rust 1.60 (#5282)
  * Bump proc-macro2 from 1.0.36 to 1.0.37 (#5265)
  * Fixes the dead lock when any of the channels get at capacity. (#5297)
  * Bump syn from 1.0.90 to 1.0.91 (#5273)
  * create .github/workflows/pr-custom-review.yml (#5280)
  * [ci] fix pallet benchmark script (#5247) (#5292)
  * bump spec_version to 9190 (#5291)
  * bump version to 0.9.19 (#5290)
  * Session is actually ancient not unknown. (#5286)
  * [ci] point benchmark script to sub-commands (#5247) (#5289)
  * Remove Travis CI (#5287)
  * Improve error handling in approval voting block import (#5283)
  * fix .github/pr-custom-review.yml (#5279)
  * Co #11164: Sub-commands for `benchmark` (#5247)
  * Bump clap from 3.1.6 to 3.1.8 (#5240)
  * Custom Slots Migration for Fund Index Change (#5227)
  * create pr-custom-review.yml (#5253)
  * ...