-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
src: only block on user blocking worker tasks #58047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This seems to fix #54918 - I started two stress test jobs before opening this PR:
And the CI seems to be happy about it: https://ci.nodejs.org/job/node-test-commit/79439/ I am being somewhat lazy here to achieve the "only block on user blocking worker tasks" but only counting them as outstanding tasks. Need another look to make sure I am not throwing other tasks into the queue in the delayed scheduler. Also I am just speculating that worker tasks that would post another foreground task shouldn't be kUserBlocking, or at least I haven't found any so far (the ones that are causing the deadlocks that I can reproduce are just kUserVisible GC tasks). May need to confirm this. |
I confirm that this patch fixes the issue on my machine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work, Joyee! So glad to see this one finally resolved. 👏🏻
I tested this with #58070 which is also seeing a deadlock in the heapdump tests. The good news is that this is suffice to fix the deadlock by default. The bad news is that I think at the end of the day, the ideal solution should still avoid doing a blocking drain of any worker tasks on the main thread (but this requires some other tinkering to make sure that user blocking tasks block the event loop from exiting to prevent important tasks from being dropped), or re-architect the blocking mechanism to allow foreground task posted by worker tasks to wake the main thread up and run foreground tasks (maybe it's simple enough to implement, maybe it needs extra locks which can also incur other problems, I don't know for sure yet). But this fix might still be good enough for now since user blocking background tasks are rare, or at least it does make things better than how things currently are. |
Alternative fix which seems to also fix the user-blocking background tasks locally when built together with #58070: https://github.com/joyeecheung/node/tree/fix-deadlock-2 Starting a stress test here: https://ci.nodejs.org/job/node-stress-single-test/569 |
It seems to introduce timeouts in other tests. |
Trying to see if it still works after the V8 upgrade. On the other hand it seems I can no longer reproduce the deadlock on macOS using |
@joyeecheung due to the CI constantly failing due to the heapdumps now dying, do you think this could already land in it's current form knowing that it does not yet fully address the problem? |
Yeah I think this at least serves as a band-aid to the problem and can avoid a significant subset of the deadlocks for now. I am going to clean up the patch a bit before sending it for review. |
According to the documentation, the v8 tasks should be executed based on priority. Previously we always execute the tasks in FIFO order, this changes the NodePlatform implementation to execute the higher priority tasks first. The tasks used to schedule timers for the delayed tasks are run in FIFO order since priority is irrelavent for the timer scheduling part while the tasks unwrapped by the timer callbacks are still ordered by priority.
we should not be blocking on the worker tasks on the main thread in one go. Doing so leads to two problems: 1. If any of the worker tasks post another foreground task and wait for it to complete, and that foreground task is posted right after we flush the foreground task queue and before the foreground thread goes into sleep, we'll never be able to wake up to execute that foreground task and in turn the worker task will never complete, and we have a deadlock. 2. Worker tasks can be posted from any thread, not necessarily associated with the current isolate, and we can be blocking on a worker task that is associated with a completely unrelated isolate in the event loop. This is suboptimal. However, not blocking on the worker tasks at all can lead to loss of some critical user-blocking worker tasks e.g. wasm async compilation tasks, which should block the main thread until they are completed, as the documentation suggets. As a compromise, we currently only block on user-blocking tasks to reduce the chance of deadlocks while making sure that criticl user-blocking tasks are not lost.
60268bd
to
3fff8ae
Compare
Cleaned up the commits a bit and added more comments. Split into multiple commits - the last one would make the deadlocks less common. I kept the previous ones in case the last one caused more problems than it solves and we need to revert it. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #58047 +/- ##
==========================================
- Coverage 90.17% 90.13% -0.04%
==========================================
Files 630 630
Lines 186473 186581 +108
Branches 36613 36625 +12
==========================================
+ Hits 168160 168184 +24
- Misses 11128 11196 +68
- Partials 7185 7201 +16
🚀 New features to boost your workflow:
|
CI seems happy. Can I have some review please? @Qard @nodejs/cpp-reviewers |
Landed in 6de55f7...5fb879c |
According to the documentation, the v8 tasks should be executed based on priority. Previously we always execute the tasks in FIFO order, this changes the NodePlatform implementation to execute the higher priority tasks first. The tasks used to schedule timers for the delayed tasks are run in FIFO order since priority is irrelavent for the timer scheduling part while the tasks unwrapped by the timer callbacks are still ordered by priority. PR-URL: #58047 Refs: #47452 Refs: #54918 Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
we should not be blocking on the worker tasks on the main thread in one go. Doing so leads to two problems: 1. If any of the worker tasks post another foreground task and wait for it to complete, and that foreground task is posted right after we flush the foreground task queue and before the foreground thread goes into sleep, we'll never be able to wake up to execute that foreground task and in turn the worker task will never complete, and we have a deadlock. 2. Worker tasks can be posted from any thread, not necessarily associated with the current isolate, and we can be blocking on a worker task that is associated with a completely unrelated isolate in the event loop. This is suboptimal. However, not blocking on the worker tasks at all can lead to loss of some critical user-blocking worker tasks e.g. wasm async compilation tasks, which should block the main thread until they are completed, as the documentation suggets. As a compromise, we currently only block on user-blocking tasks to reduce the chance of deadlocks while making sure that criticl user-blocking tasks are not lost. PR-URL: #58047 Refs: #47452 Refs: #54918 Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
Hmm realized that the commit queue starts counting from the date the PR is opened, not the date the review is open. Anyway landing it now to make the CI greener - if there are any issues we can follow up on it. |
template <class T> | ||
class TaskQueue { | ||
public: | ||
// If the entry type has a priority memeber, order the priority queue by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If the entry type has a priority memeber, order the priority queue by | |
// If the entry type has a priority member, order the priority queue by |
tasks_to_run.pop(); | ||
// This runs either the ScheduleTasks that scheduels the timers to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This runs either the ScheduleTasks that scheduels the timers to | |
// This runs either the ScheduleTasks that schedules the timers to |
tasks_to_run.pop(); | ||
// This runs either the ScheduleTasks that scheduels the timers to | ||
// pop the tasks back into the worker task runner queue, or the | ||
// or the StopTasks to stop the timers and drop all the pending tasks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// or the StopTasks to stop the timers and drop all the pending tasks. | |
// StopTasks to stop the timers and drop all the pending tasks. |
// We have to use const_cast because std::priority_queue::top() does not | ||
// return a movable item. | ||
std::unique_ptr<Task> task = | ||
std::move(const_cast<std::unique_ptr<Task>&>(tasks_to_run.top())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it save to modify an entry in the priority_queue
by moving the content?
I guess it is as long as top()
is directly followed by pop()
and no push()
is inbetween.
Maybe move to a helper as the same comment is here multiple times.
// which should block the main thread until they are completed, as the | ||
// documentation suggets. As a compromise, we currently only block on | ||
// user-blocking tasks to reduce the chance of deadlocks while making sure | ||
// that criticl user-blocking tasks are not lost. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// that criticl user-blocking tasks are not lost. | |
// that critical user-blocking tasks are not lost. |
} | ||
|
||
void NodePlatform::PostDelayedTaskOnWorkerThreadImpl( | ||
v8::TaskPriority priority, | ||
std::unique_ptr<v8::Task> task, | ||
double delay_in_seconds, | ||
const v8::SourceLocation& location) { | ||
worker_thread_task_runner_->PostDelayedTask(std::move(task), | ||
delay_in_seconds); | ||
if (debug_log_level_ != PlatformDebugLogLevel::kNone) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all this debug logs could be moved into a single helper function.
@joyeecheung sorry, seems my review was too late, was interrupted a few times. no major findings anyway. |
According to the documentation, the v8 tasks should be executed based on priority. Previously we always execute the tasks in FIFO order, this changes the NodePlatform implementation to execute the higher priority tasks first. The tasks used to schedule timers for the delayed tasks are run in FIFO order since priority is irrelavent for the timer scheduling part while the tasks unwrapped by the timer callbacks are still ordered by priority. PR-URL: #58047 Refs: #47452 Refs: #54918 Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
we should not be blocking on the worker tasks on the main thread in one go. Doing so leads to two problems: 1. If any of the worker tasks post another foreground task and wait for it to complete, and that foreground task is posted right after we flush the foreground task queue and before the foreground thread goes into sleep, we'll never be able to wake up to execute that foreground task and in turn the worker task will never complete, and we have a deadlock. 2. Worker tasks can be posted from any thread, not necessarily associated with the current isolate, and we can be blocking on a worker task that is associated with a completely unrelated isolate in the event loop. This is suboptimal. However, not blocking on the worker tasks at all can lead to loss of some critical user-blocking worker tasks e.g. wasm async compilation tasks, which should block the main thread until they are completed, as the documentation suggets. As a compromise, we currently only block on user-blocking tasks to reduce the chance of deadlocks while making sure that criticl user-blocking tasks are not lost. PR-URL: #58047 Refs: #47452 Refs: #54918 Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
test: remove deadlock workaround
src: add more debug logs and comments in NodePlatform
src: use priority queue to run worker tasks
According to the documentation, the v8 tasks should be executed
based on priority. Previously we always execute the tasks in
FIFO order, this changes the NodePlatform implementation to
execute the higher priority tasks first. The tasks used to
schedule timers for the delayed tasks are run in FIFO order
since priority is irrelavent for the timer scheduling part
while the tasks unwrapped by the timer callbacks are still
ordered by priority.
src: only block on user blocking worker tasks
we should not be blocking on the worker tasks on the
main thread in one go. Doing so leads to two problems:
for it to complete, and that foreground task is posted right after
we flush the foreground task queue and before the foreground thread
goes into sleep, we'll never be able to wake up to execute that
foreground task and in turn the worker task will never complete, and
we have a deadlock.
associated with the current isolate, and we can be blocking on a
worker task that is associated with a completely unrelated isolate
in the event loop. This is suboptimal.
However, not blocking on the worker tasks at all can lead to loss of
some critical user-blocking worker tasks e.g. wasm async compilation
tasks, which should block the main thread until they are completed,
as the documentation suggets. As a compromise, we currently only block
on user-blocking tasks to reduce the chance of deadlocks while making
sure that criticl user-blocking tasks are not lost.
Refs: #47452
Refs: #54918