src: fix platform shutdown deadlock #56827
base: main
Conversation
Fixes #54918

Each worker is signalling its own completion of tasks independently, and so should only be signalling for one corresponding drain; otherwise the count of outstanding tasks goes out of sync and the process will never stop waiting for tasks when it should be exiting. It just needs to be calling Signal rather than Broadcast.

Not sure if there was a reason for it to be a broadcast in the first place, but if so then the outstanding_tasks_ count adjustment needs to factor that in properly.
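For orientation, here is a minimal sketch of the bookkeeping described above, using standard C++ primitives rather than the actual types in node_platform.h (all names are illustrative, and notify_one/notify_all stand in for Signal/Broadcast):

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative sketch only, not the real TaskQueue from src/node_platform.h.
class DrainTracker {
 public:
  // Called when a task is handed off to a worker.
  void TaskScheduled() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++outstanding_tasks_;
  }

  // Called by a worker after it finishes one task.
  void NotifyOfCompletion() {
    std::lock_guard<std::mutex> lock(mutex_);
    --outstanding_tasks_;
    // The change argued for in this PR: one completed task wakes one waiter
    // (Signal / notify_one) rather than all waiters (Broadcast / notify_all).
    tasks_drained_.notify_one();
  }

  // Called by the thread that must wait for all scheduled tasks to finish.
  void BlockingDrain() {
    std::unique_lock<std::mutex> lock(mutex_);
    tasks_drained_.wait(lock, [this] { return outstanding_tasks_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable tasks_drained_;
  int outstanding_tasks_ = 0;
};
```

Note that in this simplified form, where the predicate is re-checked inside wait(), either notification choice is safe; the sketch only shows where the Signal/Broadcast decision sits, not the exact interleaving the real code hits.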
lgtm
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #56827    +/-  ##
=========================================
  Coverage   89.20%    89.20%
=========================================
  Files         663       663
  Lines      192012    192012
  Branches    36929     36933    +4
=========================================
+ Hits       171286    171293    +7
- Misses      13582     13590    +8
+ Partials     7144      7129   -15
This seems to bring new flakiness to
I think it may also have not actually fixed the issue, just made it more rare. I've been doing a very long run locally here. Before this I encountered the issue at 9.9k iterations, with this it eventually popped up at 117k iterations.
Yes, unfortunately it does not seem to fix the issue:
I suspect it's a spurious wakeup that's not handled safely and so the
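For reference, the conventional guard against spurious wakeups is to re-check the waited-on condition in a loop around the wait. A small libuv-based sketch (the state and function names here are illustrative, not code from this PR):

```cpp
#include <uv.h>

// Illustrative state: one mutex, one condition variable, one counter.
static uv_mutex_t lock_;
static uv_cond_t tasks_drained_;
static int outstanding_tasks_ = 0;

void Init() {
  uv_mutex_init(&lock_);
  uv_cond_init(&tasks_drained_);
}

// Fragile: a bare wait assumes one wakeup corresponds to one real state
// change, so a spurious wakeup lets the caller proceed before the work is
// actually done.
void DrainFragile() {
  uv_mutex_lock(&lock_);
  if (outstanding_tasks_ > 0)
    uv_cond_wait(&tasks_drained_, &lock_);
  uv_mutex_unlock(&lock_);
}

// Robust: re-check the predicate after every wakeup, spurious or not.
void DrainRobust() {
  uv_mutex_lock(&lock_);
  while (outstanding_tasks_ > 0)
    uv_cond_wait(&tasks_drained_, &lock_);
  uv_mutex_unlock(&lock_);
}
```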
By looking at the traces, it seems to me that the problem is that one of the tasks in the Platform workers never ends, thus DrainTasks never ends. In the traces from #56827 (comment), the
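To illustrate why a single never-ending (or never-acknowledged) worker task is enough to hang shutdown, here is the rough shape of a DrainTasks-style loop; this is a hypothetical sketch, not the actual NodePlatform::DrainTasks, and wait_for_worker_tasks / flush_foreground_tasks are stand-in names:

```cpp
#include <functional>

// Hypothetical shutdown step: keep draining until the worker-task counter
// reaches zero and no foreground task re-queues more work.
void DrainTasksSketch(const std::function<void()>& wait_for_worker_tasks,
                      const std::function<bool()>& flush_foreground_tasks) {
  do {
    // Blocks until every scheduled worker task is reported complete; if one
    // task never finishes, or finishes without being counted, this call
    // never returns and neither does shutdown.
    wait_for_worker_tasks();
  } while (flush_foreground_tasks());  // foreground work may schedule more
}
```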
According to the V8 team, the issue is in Node.js, see https://issues.chromium.org/issues/374285493.
Thanks for the reference. Yes, I think the solution is not calling
That doesn't seem to be the case though, as all the workers and the dispatcher are stuck in uv_cond_wait at the same time, indicating none of them are working; they're all waiting for each other. I think the issue might be with how
Aren't they waiting on different cond variables, though? Maybe I'm wrong, but the way I see it is:
So,
The main thread and the workers are all in uv_cond_wait at the same time, though. If the V8 worker enters uv_cond_wait while main is already waiting, then it will just continue to wait until main is ready before it can unlock and apply its own operation. Meanwhile, though, the workers are all waiting, as they seem to think there are zero tasks to process. If main was in uv_cond_wait at the time then it must have believed otherwise. For it to have reached

It may be possible for the main thread and GC thread to mutually lock each other, but the workers would still pick up the values from BlockingPop, continue operating, and then signal the drain condvar. It doesn't make sense that they would get stuck in uv_cond_wait too unless the counts went out of sync and the main thread was waiting on a task that had not been delivered.
I mostly agree. Indeed all tasks have been sent and picked up by the workers, but there's only one left to complete as explained above (
But all workers are in uv_cond_wait here. Thread 1 is main, Thread 2 is the

While

It would be the case that after processing everything in the queue the worker threads would then all be in uv_cond_wait due to having nothing to operate on, but in that case
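To make the "different cond variables" point from this exchange concrete, here is a rough sketch of the producer/consumer side that pairs with the drain sketch after the PR description; names are illustrative, not the real TaskQueue:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Workers sleep on tasks_available_ when the queue is empty; the draining
// thread sleeps on a separate tasks_drained_ condition variable, but both
// are guarded by the same mutex.
class WorkQueue {
 public:
  void Push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(std::move(task));
    tasks_available_.notify_one();  // wake one idle worker
  }

  // Worker loop entry point: block until a task is available.
  std::function<void()> BlockingPop() {
    std::unique_lock<std::mutex> lock(mutex_);
    tasks_available_.wait(lock, [this] { return !queue_.empty(); });
    auto task = std::move(queue_.front());
    queue_.pop_front();
    return task;
  }

 private:
  std::mutex mutex_;
  std::condition_variable tasks_available_;
  std::deque<std::function<void()>> queue_;
};
```

In this shape, idle workers parked in BlockingPop and a drainer parked on the drained condition can legitimately coexist only when the queue is empty and the outstanding count is zero; every thread sitting in uv_cond_wait while the drain never returns is what the comment above attributes to the two pieces of bookkeeping disagreeing.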
It seems that #56842 has more test timeouts than usual. Maybe it will be easier for you to debug/reproduce with it?