pvf: fix the deadlock when any of the channels get at capacity #5297
Conversation
The PVF host is designed to avoid spawning tasks in order to minimize knowledge of outer code. Using `async_std::task::spawn` (or Tokio's counterpart) was deemed unacceptable, and `SpawnNamed` undesirable. Instead, a single task is returned and spawned by the candidate-validation subsystem; the tasks from the sub-components are polled by that root task. However, the way the tasks were bundled was incorrect: there was a giant select polling those tasks. In particular, as soon as one of the arms of that select goes into an await, the sub-tasks stop getting polled. This is a recipe for a deadlock, which indeed happened here. Specifically, the deadlock occurred while sending messages to the execute queue via [`send_execute`](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L601). When the channel to the queue reaches capacity, control flow is suspended until the queue handles those messages. Since this code is reached from [one of the select arms](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L371), the queue is never given control and no further progress can be made. This problem is solved by bundling the tasks one level higher instead, `select`ing over those long-running tasks. We also stop treating a return from those long-running tasks as an error condition, since that can happen during legitimate shutdown.
@vstakhov if you find it correct, please merge
bot merge
```rust
_ = run_host.fuse() => {},
_ = run_prepare_queue.fuse() => {},
_ = run_prepare_pool.fuse() => {},
_ = run_execute_queue.fuse() => {},
_ = run_sweeper.fuse() => {},
```
nit: We should maintain logs here which one terminated.
Which severity you reckon?
error!
I'd say, this should really never happen
The reason why I removed the `never!`s is that those paths can theoretically be hit during normal shutdown.
* master: (62 commits)
  * Bump tracing from 0.1.32 to 0.1.33 (#5299)
  * Add staging runtime api (#5048)
  * CI: rename ambiguous jobs (#5313)
  * Prepare for rust 1.60 (#5282)
  * Bump proc-macro2 from 1.0.36 to 1.0.37 (#5265)
  * Fixes the dead lock when any of the channels get at capacity. (#5297)
  * Bump syn from 1.0.90 to 1.0.91 (#5273)
  * create .github/workflows/pr-custom-review.yml (#5280)
  * [ci] fix pallet benchmark script (#5247) (#5292)
  * bump spec_version to 9190 (#5291)
  * bump version to 0.9.19 (#5290)
  * Session is actually ancient not unknown. (#5286)
  * [ci] point benchmark script to sub-commands (#5247) (#5289)
  * Remove Travis CI (#5287)
  * Improve error handling in approval voting block import (#5283)
  * fix .github/pr-custom-review.yml (#5279)
  * Co #11164: Sub-commands for `benchmark` (#5247)
  * Bump clap from 3.1.6 to 3.1.8 (#5240)
  * Custom Slots Migration for Fund Index Change (#5227)
  * create pr-custom-review.yml (#5253)
  * ...