Skip to content

src: only block on user blocking worker tasks #58047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

joyeecheung
Copy link
Member

@joyeecheung joyeecheung commented Apr 27, 2025

test: remove deadlock workaround

src: add more debug logs and comments in NodePlatform

src: use priority queue to run worker tasks

According to the documentation, the v8 tasks should be executed
based on priority. Previously we always execute the tasks in
FIFO order, this changes the NodePlatform implementation to
execute the higher priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order
since priority is irrelavent for the timer scheduling part
while the tasks unwrapped by the timer callbacks are still
ordered by priority.

src: only block on user blocking worker tasks

we should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:

  1. If any of the worker tasks post another foreground task and wait
    for it to complete, and that foreground task is posted right after
    we flush the foreground task queue and before the foreground thread
    goes into sleep, we'll never be able to wake up to execute that
    foreground task and in turn the worker task will never complete, and
    we have a deadlock.
  2. Worker tasks can be posted from any thread, not necessarily
    associated with the current isolate, and we can be blocking on a
    worker task that is associated with a completely unrelated isolate
    in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggets. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that criticl user-blocking tasks are not lost.

Refs: #47452
Refs: #54918

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Apr 27, 2025
@joyeecheung
Copy link
Member Author

joyeecheung commented Apr 27, 2025

This seems to fix #54918 - I started two stress test jobs before opening this PR:

And the CI seems to be happy about it: https://ci.nodejs.org/job/node-test-commit/79439/

I am being somewhat lazy here to achieve the "only block on user blocking worker tasks" but only counting them as outstanding tasks. Need another look to make sure I am not throwing other tasks into the queue in the delayed scheduler. Also I am just speculating that worker tasks that would post another foreground task shouldn't be kUserBlocking, or at least I haven't found any so far (the ones that are causing the deadlocks that I can reproduce are just kUserVisible GC tasks). May need to confirm this.

@nodejs-github-bot
Copy link
Collaborator

@lpinca
Copy link
Member

lpinca commented Apr 27, 2025

I confirm that this patch fixes the issue on my machine.

Copy link
Member

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work, Joyee! So glad to see this one finally resolved. 👏🏻

@joyeecheung
Copy link
Member Author

joyeecheung commented Apr 30, 2025

I tested this with #58070 which is also seeing a deadlock in the heapdump tests. The good news is that this is suffice to fix the deadlock by default. The bad news is that --concurrent-maglev-high-priority-threads can enqueue a background task that is kUserBlocking and block on foreground tasks, which can trigger the deadlock again (although, that is an opt-in flag, and likely only intended for testing, but that also means this kind of tasks are not off the table).

I think at the end of the day, the ideal solution should still avoid doing a blocking drain of any worker tasks on the main thread (but this requires some other tinkering to make sure that user blocking tasks block the event loop from exiting to prevent important tasks from being dropped), or re-architect the blocking mechanism to allow foreground task posted by worker tasks to wake the main thread up and run foreground tasks (maybe it's simple enough to implement, maybe it needs extra locks which can also incur other problems, I don't know for sure yet). But this fix might still be good enough for now since user blocking background tasks are rare, or at least it does make things better than how things currently are.

@joyeecheung
Copy link
Member Author

joyeecheung commented May 1, 2025

Alternative fix which seems to also fix the user-blocking background tasks locally when built together with #58070: https://github.com/joyeecheung/node/tree/fix-deadlock-2

Starting a stress test here: https://ci.nodejs.org/job/node-stress-single-test/569
CI with #58070: https://ci.nodejs.org/job/node-test-commit/79519/

@targos
Copy link
Member

targos commented May 2, 2025

It seems to introduce timeouts in other tests.

@joyeecheung joyeecheung added the request-ci Add this label to start a Jenkins CI on a PR. label May 2, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label May 2, 2025
@nodejs-github-bot
Copy link
Collaborator

@joyeecheung
Copy link
Member Author

joyeecheung commented May 2, 2025

Trying to see if it still works after the V8 upgrade. On the other hand it seems I can no longer reproduce the deadlock on macOS using test-file-write-stream4.js. I do notice that the test now run much much faster than before (before the V8 upgrade it takes 52s to finish 1000 runs, now it takes 9s). I'll see if I can reproduce with the heapdump tests locally. EDIT: the heapdump test can reproduce the deadlock locally and has almost the same stacktrace as the one produced by test-file-write-stream4.js

@BridgeAR
Copy link
Member

BridgeAR commented May 3, 2025

@joyeecheung due to the CI constantly failing due to the heapdumps now dying, do you think this could already land in it's current form knowing that it does not yet fully address the problem?

@joyeecheung
Copy link
Member Author

Yeah I think this at least serves as a band-aid to the problem and can avoid a significant subset of the deadlocks for now. I am going to clean up the patch a bit before sending it for review.

According to the documentation, the v8 tasks should be executed
based on priority. Previously we always execute the tasks in
FIFO order, this changes the NodePlatform implementation to
execute the higher priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order
since priority is irrelavent for the timer scheduling part
while the tasks unwrapped by the timer callbacks are still
ordered by priority.
we should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:

1. If any of the worker tasks post another foreground task and wait
   for it to complete, and that foreground task is posted right after
   we flush the foreground task queue and before the foreground thread
   goes into sleep, we'll never be able to wake up to execute that
   foreground task and in turn the worker task will never complete, and
   we have a deadlock.
2. Worker tasks can be posted from any thread, not necessarily
   associated with the current isolate, and we can be blocking on a
   worker task that is associated with a completely unrelated isolate
   in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggets. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that criticl user-blocking tasks are not lost.
@joyeecheung joyeecheung marked this pull request as ready for review May 4, 2025 01:21
@joyeecheung
Copy link
Member Author

joyeecheung commented May 4, 2025

Cleaned up the commits a bit and added more comments. Split into multiple commits - the last one would make the deadlocks less common. I kept the previous ones in case the last one caused more problems than it solves and we need to revert it.

@joyeecheung joyeecheung added the request-ci Add this label to start a Jenkins CI on a PR. label May 4, 2025
@joyeecheung joyeecheung changed the title [WIP] src: only block on user blocking worker tasks src: only block on user blocking worker tasks May 4, 2025
@github-actions github-actions bot added the request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. label May 4, 2025
Copy link

codecov bot commented May 4, 2025

Codecov Report

Attention: Patch coverage is 51.89873% with 76 lines in your changes missing coverage. Please review.

Project coverage is 90.13%. Comparing base (723d7bb) to head (3fff8ae).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
src/node_platform.cc 49.66% 67 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #58047      +/-   ##
==========================================
- Coverage   90.17%   90.13%   -0.04%     
==========================================
  Files         630      630              
  Lines      186473   186581     +108     
  Branches    36613    36625      +12     
==========================================
+ Hits       168160   168184      +24     
- Misses      11128    11196      +68     
- Partials     7185     7201      +16     
Files with missing lines Coverage Δ
src/debug_utils.h 80.00% <ø> (ø)
src/node_platform.h 83.33% <100.00%> (+23.33%) ⬆️
src/node_platform.cc 76.24% <49.66%> (-12.36%) ⬇️

... and 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@nodejs-github-bot
Copy link
Collaborator

@joyeecheung
Copy link
Member Author

CI seems happy. Can I have some review please? @Qard @nodejs/cpp-reviewers

@joyeecheung joyeecheung added commit-queue Add this label to land a pull request using GitHub Actions. commit-queue-rebase Add this label to allow the Commit Queue to land a PR in several commits. and removed request-ci-failed An error occurred while starting CI via request-ci label, and manual interventon is needed. labels May 4, 2025
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label May 4, 2025
@nodejs-github-bot
Copy link
Collaborator

Landed in 6de55f7...5fb879c

nodejs-github-bot pushed a commit that referenced this pull request May 4, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
nodejs-github-bot pushed a commit that referenced this pull request May 4, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
nodejs-github-bot pushed a commit that referenced this pull request May 4, 2025
According to the documentation, the v8 tasks should be executed
based on priority. Previously we always execute the tasks in
FIFO order, this changes the NodePlatform implementation to
execute the higher priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order
since priority is irrelavent for the timer scheduling part
while the tasks unwrapped by the timer callbacks are still
ordered by priority.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
nodejs-github-bot pushed a commit that referenced this pull request May 4, 2025
we should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:

1. If any of the worker tasks post another foreground task and wait
   for it to complete, and that foreground task is posted right after
   we flush the foreground task queue and before the foreground thread
   goes into sleep, we'll never be able to wake up to execute that
   foreground task and in turn the worker task will never complete, and
   we have a deadlock.
2. Worker tasks can be posted from any thread, not necessarily
   associated with the current isolate, and we can be blocking on a
   worker task that is associated with a completely unrelated isolate
   in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggets. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that criticl user-blocking tasks are not lost.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
@joyeecheung
Copy link
Member Author

Hmm realized that the commit queue starts counting from the date the PR is opened, not the date the review is open. Anyway landing it now to make the CI greener - if there are any issues we can follow up on it.

template <class T>
class TaskQueue {
public:
// If the entry type has a priority memeber, order the priority queue by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If the entry type has a priority memeber, order the priority queue by
// If the entry type has a priority member, order the priority queue by

tasks_to_run.pop();
// This runs either the ScheduleTasks that scheduels the timers to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This runs either the ScheduleTasks that scheduels the timers to
// This runs either the ScheduleTasks that schedules the timers to

tasks_to_run.pop();
// This runs either the ScheduleTasks that scheduels the timers to
// pop the tasks back into the worker task runner queue, or the
// or the StopTasks to stop the timers and drop all the pending tasks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// or the StopTasks to stop the timers and drop all the pending tasks.
// StopTasks to stop the timers and drop all the pending tasks.

// We have to use const_cast because std::priority_queue::top() does not
// return a movable item.
std::unique_ptr<Task> task =
std::move(const_cast<std::unique_ptr<Task>&>(tasks_to_run.top()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it save to modify an entry in the priority_queue by moving the content?
I guess it is as long as top() is directly followed by pop() and no push() is inbetween.
Maybe move to a helper as the same comment is here multiple times.

// which should block the main thread until they are completed, as the
// documentation suggets. As a compromise, we currently only block on
// user-blocking tasks to reduce the chance of deadlocks while making sure
// that criticl user-blocking tasks are not lost.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// that criticl user-blocking tasks are not lost.
// that critical user-blocking tasks are not lost.

}

void NodePlatform::PostDelayedTaskOnWorkerThreadImpl(
v8::TaskPriority priority,
std::unique_ptr<v8::Task> task,
double delay_in_seconds,
const v8::SourceLocation& location) {
worker_thread_task_runner_->PostDelayedTask(std::move(task),
delay_in_seconds);
if (debug_log_level_ != PlatformDebugLogLevel::kNone) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all this debug logs could be moved into a single helper function.

@Flarna
Copy link
Member

Flarna commented May 5, 2025

@joyeecheung sorry, seems my review was too late, was interrupted a few times. no major findings anyway.

RafaelGSS pushed a commit that referenced this pull request May 5, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
RafaelGSS pushed a commit that referenced this pull request May 5, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
RafaelGSS pushed a commit that referenced this pull request May 5, 2025
According to the documentation, the v8 tasks should be executed
based on priority. Previously we always execute the tasks in
FIFO order, this changes the NodePlatform implementation to
execute the higher priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order
since priority is irrelavent for the timer scheduling part
while the tasks unwrapped by the timer callbacks are still
ordered by priority.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
RafaelGSS pushed a commit that referenced this pull request May 5, 2025
we should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:

1. If any of the worker tasks post another foreground task and wait
   for it to complete, and that foreground task is posted right after
   we flush the foreground task queue and before the foreground thread
   goes into sleep, we'll never be able to wake up to execute that
   foreground task and in turn the worker task will never complete, and
   we have a deadlock.
2. Worker tasks can be posted from any thread, not necessarily
   associated with the current isolate, and we can be blocking on a
   worker task that is associated with a completely unrelated isolate
   in the event loop. This is suboptimal.

However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggets. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that criticl user-blocking tasks are not lost.

PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
aduh95 pushed a commit that referenced this pull request May 6, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
aduh95 pushed a commit that referenced this pull request May 6, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
aduh95 pushed a commit that referenced this pull request May 6, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
aduh95 pushed a commit that referenced this pull request May 6, 2025
PR-URL: #58047
Refs: #47452
Refs: #54918
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-rebase Add this label to allow the Commit Queue to land a PR in several commits. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants