Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

pvf: fix the deadlock when any of the channels get at capacity #5297

Merged
merged 1 commit into from
Apr 9, 2022

Conversation

pepyakin
Copy link
Contributor

@pepyakin pepyakin commented Apr 8, 2022

The PVF host is designed to avoid spawning tasks to minimize knowledge
of outer code. Using async_std::task::spawn (or Tokio's counterpart)
deemed unacceptable, SpawnNamed undesirable. Instead there is only one
task returned that is spawned by the candidate-validation subsystem.
The tasks from the sub-components are polled by that root task.

However, the way the tasks are bundled was incorrect. There was a giant
select that was polling those tasks. Particularly, that implies that as soon as
one of the arms of that select goes into await those sub-tasks stop
getting polled. This is a recipe for a deadlock which indeed happened
here.

Specifically, the deadlock happened during sending messages to the
execute queue by calling
send_execute.
When the channel to the queue reaches the capacity, the control flow is
suspended until the queue handles those messages. Since this code is
essentially reached from one of the select
arms
,
the queue won't be given the control and thus no further progress can be
made.

This problem is solved by bundling the tasks one level higher instead,
by selecting over those long-running tasks.

We also stop treating returning from those long-running tasks as error
conditions, since that can happen during legit shutdown.

The PVF host is designed to avoid spawning tasks to minimize knowledge
of outer code. Using `async_std::task::spawn` (or Tokio's counterpart)
deemed unacceptable, `SpawnNamed` undesirable. Instead there is only one
task returned that is spawned by the candidate-validation subsystem.
The tasks from the sub-components are polled by that root task.

However, the way the tasks are bundled was incorrect. There was a giant
select that was polling those tasks. Particularly, that implies that as soon as
one of the arms of that select goes into await those sub-tasks stop
getting polled. This is a recipe for a deadlock which indeed happened
here.

Specifically, the deadlock happened during sending messages to the
execute queue by calling
[`send_execute`](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L601).
When the channel to the queue reaches the capacity, the control flow is
suspended until the queue handles those messages. Since this code is
essentially reached from [one of the select
arms](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L371),
the queue won't be given the control and thus no further progress can be
made.

This problem is solved by bundling the tasks one level higher instead,
by `selecting` over those long-running tasks.

We also stop treating returning from those long-running tasks as error
conditions, since that can happen during legit shutdown.
@pepyakin pepyakin added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Apr 8, 2022
@pepyakin pepyakin requested a review from vstakhov April 8, 2022 18:06
@rphmeier
Copy link
Contributor

rphmeier commented Apr 9, 2022

@vstakhov if you find it correct, please merge

@vstakhov
Copy link
Contributor

vstakhov commented Apr 9, 2022

bot merge

@paritytech-processbot paritytech-processbot bot merged commit dc2b2ac into master Apr 9, 2022
@paritytech-processbot paritytech-processbot bot deleted the 04-08-pep-fix-pvf-deadlock branch April 9, 2022 10:57
Comment on lines +225 to +229
_ = run_host.fuse() => {},
_ = run_prepare_queue.fuse() => {},
_ = run_prepare_pool.fuse() => {},
_ = run_execute_queue.fuse() => {},
_ = run_sweeper.fuse() => {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We should maintain logs here which one terminated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which severity you reckon?

Copy link
Contributor

@drahnr drahnr Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error! I'd say, this should really never happen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I removed never!s is that those paths can theoretically be hit during normal shutdown.

ordian added a commit that referenced this pull request Apr 13, 2022
* master: (62 commits)
  Bump tracing from 0.1.32 to 0.1.33 (#5299)
  Add staging runtime api (#5048)
  CI: rename ambiguous jobs (#5313)
  Prepare for rust 1.60 (#5282)
  Bump proc-macro2 from 1.0.36 to 1.0.37 (#5265)
  Fixes the dead lock when any of the channels get at capacity. (#5297)
  Bump syn from 1.0.90 to 1.0.91 (#5273)
  create .github/workflows/pr-custom-review.yml (#5280)
  [ci] fix pallet benchmark script (#5247) (#5292)
  bump spec_version to 9190 (#5291)
  bump version to 0.9.19 (#5290)
  Session is actually ancient not unknown. (#5286)
  [ci] point benchmark script to sub-commands (#5247) (#5289)
  Remove Travis CI (#5287)
  Improve error handling in approval voting block import (#5283)
  fix .github/pr-custom-review.yml (#5279)
  Co #11164: Sub-commands for `benchmark` (#5247)
  Bump clap from 3.1.6 to 3.1.8 (#5240)
  Custom Slots Migration for Fund Index Change (#5227)
  create pr-custom-review.yml (#5253)
  ...
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants