src: fix platform shutdown deadlock #56827

Open
wants to merge 1 commit into main

Conversation

@Qard (Member) commented Jan 30, 2025

Fixes #54918

Each worker is signalling its own completion of tasks independently and so should only be signalling for one corresponding drain otherwise the count of outstanding tasks goes out of sync and the process will never stop waiting for tasks when it should be exiting.

It just needs to be calling Signal rather than Broadcast.

Not sure if there was a reason for it to be a broadcast in the first place, but if so then the outstanding_tasks_ count adjustment needs to factor that in properly.
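
As an aside for readers following along: the distinction here is between waking a single waiter and waking all waiters of a condition variable. Below is a minimal, self-contained sketch of the drain pattern being described, written against std::condition_variable rather than Node's internal ConditionVariable; DrainableQueue and its members are hypothetical names and this is not the actual node_platform.cc code. Signal() roughly corresponds to notify_one() and Broadcast() to notify_all().

```cpp
// Hypothetical, self-contained illustration (not the actual node_platform.cc code):
// a drain counter protected by a mutex, where each completed task wakes exactly
// one waiter. Node's Signal()/Broadcast() map to notify_one()/notify_all() here.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct DrainableQueue {
  std::mutex lock;
  std::condition_variable tasks_drained;
  int outstanding_tasks = 0;

  void TaskAdded() {
    std::lock_guard<std::mutex> guard(lock);
    ++outstanding_tasks;
  }

  void NotifyOfCompletion() {
    std::lock_guard<std::mutex> guard(lock);
    --outstanding_tasks;
    // The PR's point: each worker completed one task, so wake one waiter
    // (Signal / notify_one) rather than all of them (Broadcast / notify_all).
    tasks_drained.notify_one();
  }

  void BlockingDrain() {
    std::unique_lock<std::mutex> guard(lock);
    // Waiting on a predicate makes the wait robust to extra wakeups.
    tasks_drained.wait(guard, [this] { return outstanding_tasks == 0; });
  }
};

int main() {
  DrainableQueue q;
  for (int i = 0; i < 4; i++) q.TaskAdded();
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; i++)
    workers.emplace_back([&q] { q.NotifyOfCompletion(); });
  q.BlockingDrain();  // returns once all four completions have been counted
  for (auto& t : workers) t.join();
  std::printf("drained\n");
}
```

With the predicate-based wait shown here, either notify call would eventually work; the sketch only illustrates the "one completion, one wakeup" pairing the PR description is arguing for.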

@Qard Qard added confirmed-bug Issues with confirmed bugs. c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. v8 platform Issues and PRs related to Node's v8::Platform implementation. labels Jan 30, 2025
@nodejs-github-bot nodejs-github-bot added the needs-ci PRs that need a full CI run. label Jan 30, 2025
@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Jan 30, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Jan 30, 2025
@mcollina (Member) left a comment

lgtm

codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.20%. Comparing base (304bb9c) to head (d674a68).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #56827   +/-   ##
=======================================
  Coverage   89.20%   89.20%           
=======================================
  Files         663      663           
  Lines      192012   192012           
  Branches    36929    36933    +4     
=======================================
+ Hits       171286   171293    +7     
- Misses      13582    13590    +8     
+ Partials     7144     7129   -15     
Files with missing lines | Coverage Δ
src/node_platform.cc | 87.83% <100.00%> (ø)

... and 23 files with indirect coverage changes

@targos (Member) commented Jan 30, 2025

This seems to bring new flakiness to --loader tests, which involve a worker.

@Qard (Member, Author) commented Jan 30, 2025

I think it may also not have actually fixed the issue, just made it rarer. I've been doing a very long run locally. Before this change I encountered the issue at 9.9k iterations; with it, the issue eventually popped up at 117k iterations.

@lpinca (Member) commented Jan 30, 2025

Yes, unfortunately it does not seem to fix the issue:

$ python3 tools/test.py --repeat=1000 test/parallel/test-net-write-fully-async-hex-string.js
[00:35|%  98|+ 984|-   0]: release test-net-write-fully-async-hex-string
$ lldb -p 8578
(lldb) process attach --pid 8578
Process 8578 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7ff8111b1aaa <+10>: jae    0x7ff8111b1ab4 ; <+20>
    0x7ff8111b1aac <+12>: movq   %rax, %rdi
    0x7ff8111b1aaf <+15>: jmp    0x7ff8111af662 ; cerror_nocancel
    0x7ff8111b1ab4 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8111f07a8 libsystem_pthread.dylib`_pthread_cond_wait + 1193
    frame #2: 0x0000000104ffb893 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001041148bb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007fdaa3204e78, mutex=0x00007fdaa3204e08) at node_mutex.h:175:5 [opt]
    frame #4: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007fdaa3204e78, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:640:20 [opt]
    frame #6: 0x000000010411488b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:214:25 [opt]
    frame #7: 0x000000010411488b node`node::NodePlatform::DrainTasks(this=0x0000600002740000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
    frame #8: 0x0000000103f7ca60 node`node::SpinEventLoopInternal(env=0x00007fdaa5049c00) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x00000001040d9b33 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007fdaa5049c00) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x00000001040d9a99 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x00000001040414e6 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1488:24 [opt]
    frame #12: 0x0000000104041342 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1495:27 [opt]
    frame #13: 0x00007ff810e612cd dyld`start + 1805
(lldb)

@Qard (Member, Author) commented Jan 31, 2025

A bt all seems to point to a deadlock in the communication through WorkerThreadsTaskRunner::DelayedTaskScheduler to the TaskQueue worker threads. I see NodePlatform stuck on DrainTasks() while all the TaskQueue workers are stuck on BlockingPop(). Probably something to do with the timing around the ConditionVariable synchronization and how that interacts with the stopped_ state of the TaskQueue. 🤔

(lldb) bt all
  thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063d664 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706ff8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063d658 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706ff8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063d658 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x0000000143706f88) at node_platform.cc:639:20 [opt]
    frame #6: 0x000000010063d640 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x0000000143706f88) at node_platform.cc:211:25 [opt]
    frame #7: 0x000000010063d640 node`node::NodePlatform::DrainTasks(this=0x0000600002f0c0a0, isolate=<unavailable>) at node_platform.cc:466:33 [opt]
    frame #8: 0x00000001004cca50 node`node::SpinEventLoopInternal(env=0x0000000133810200) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x000000010060a0a8 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x0000000133810200) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x000000010060a014 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x000000010057fc98 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1488:24 [opt]
    frame #12: 0x000000010057fc0c node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1495:27 [opt]
    frame #13: 0x0000000199c94274 dyld`start + 2840
* thread #2
    frame #0: 0x0000000199fdbefc libsystem_kernel.dylib`kevent + 8
  * frame #1: 0x00000001013eb980 node`uv__io_poll(loop=0x000000014401dce8, timeout=-1) at kqueue.c:286:12 [opt]
    frame #2: 0x00000001013d83ec node`uv_run(loop=0x000000014401dce8, mode=UV_RUN_DEFAULT) at core.c:460:5 [opt]
    frame #3: 0x000000010063ddf4 node`node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Run(this=0x000000014401dc00) at node_platform.cc:95:5 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007c90) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007cc0) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007cf0) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x000000010099e35c node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0::operator()(this=<unavailable>) const at collection-barrier.cc:125:18 [opt]
    frame #3: 0x000000010099e330 node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] void v8::internal::LocalHeap::ParkAndExecuteCallback<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=0x000000017216e3a8, callback=(unnamed class) @ 0x0000600001428df0) at local-heap-inl.h:66:5 [opt]
    frame #4: 0x000000010099e308 node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=<unavailable>)::$_0)::'lambda'()::operator()() const at local-heap-inl.h:83:28 [opt]
    frame #5: 0x000000010099e2fc node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(stack=0x0000000145012420, argument=<unavailable>, stack_end=<unavailable>) at stack.h:204:5 [opt]
    frame #6: 0x00000001013a9eb8 node`PushAllRegistersAndIterateStack + 40
    frame #7: 0x000000010099de20 node`v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) [inlined] void v8::internal::LocalHeap::ExecuteWithStackMarker<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(this=0x000000017216e3a8, callback=<unavailable>) at local-heap-inl.h:0 [opt]
    frame #8: 0x000000010099dddc node`v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) [inlined] void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=0x000000017216e3a8, callback=(unnamed class) @ 0x0000600001428e20) at local-heap-inl.h:82:3 [opt]
    frame #9: 0x000000010099dddc node`v8::internal::CollectionBarrier::AwaitCollectionBackground(this=0x0000600002f00140, local_heap=0x000000017216e3a8) at collection-barrier.cc:117:15 [opt]
    frame #10: 0x00000001009f2f70 node`v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [inlined] v8::internal::HeapAllocator::CollectGarbage(this=0x000000017216e3f8, allocation=kOld) at heap-allocator.cc:140:12 [opt]
    frame #11: 0x00000001009f2f58 node`v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(this=0x000000017216e3f8, size=16, allocation=kOld, origin=kRuntime, alignment=kTaggedAligned) at heap-allocator.cc:120:5 [opt]
    frame #12: 0x00000001009f3ac4 node`v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(this=0x000000017216e3f8, size=16, allocation=kOld, origin=kRuntime, alignment=kTaggedAligned) at heap-allocator.cc:148:7 [opt]
    frame #13: 0x00000001009c8170 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::AllocateRaw(this=0x000000017216e3a0, size=<unavailable>, allocation=kOld, alignment=kTaggedAligned) at factory-base.cc:1282:18 [opt]
    frame #14: 0x00000001009c8164 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::AllocateRawWithImmortalMap(this=0x000000017216e3a0, size=<unavailable>, allocation=kOld, map=Tagged<v8::internal::Map> @ x21, alignment=kTaggedAligned) at factory-base.cc:1273:31 [opt]
    frame #15: 0x00000001009c8164 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::NewWithImmortalMap(this=0x000000017216e3a0, map=Tagged<v8::internal::Map> @ x21, allocation=kOld) at factory-base.cc:1262:10 [opt]
    frame #16: 0x00000001009c815c node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCodeWrapper(this=0x000000017216e3a0) at factory-base.cc:160:25 [opt]
    frame #17: 0x00000001009c8154 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(this=0x000000017216e3a0, options=0x000000017216d918) at factory-base.cc:82:39 [opt]
    frame #18: 0x00000001009d4b30 node`v8::internal::TorqueGeneratedFactory<v8::internal::Factory>::NewUncompiledDataWithPreparseData(v8::internal::Handle<v8::internal::String>, int, int, v8::internal::Handle<v8::internal::PreparseData>, v8::internal::AllocationType) [inlined] v8::internal::CombinedWriteBarrier(host=<unavailable>, slot=<unavailable>, value=<unavailable>, mode=<unavailable>) - 18446744069404275919 [opt]
    frame #19: 0x0000000100f23a5c node`v8::internal::maglev::MaglevCodeGenerator::BuildCodeObject(this=<unavailable>, local_isolate=<unavailable>) at maglev-code-generator.cc:1919:8 [opt]
    frame #20: 0x0000000100f21f84 node`v8::internal::maglev::MaglevCodeGenerator::Assemble(this=0x0000000143604700) at maglev-code-generator.cc:1710:9 [opt]
    frame #21: 0x0000000100f4d1c8 node`v8::internal::maglev::MaglevCompiler::Compile(local_isolate=0x000000017216e3a0, compilation_info=0x0000600002c005a0) at maglev-compiler.cc:208:36 [opt]
    frame #22: 0x0000000100fb1774 node`v8::internal::maglev::MaglevCompilationJob::ExecuteJobImpl(this=0x0000600002c00630, stats=<unavailable>, local_isolate=0x000000017216e3a0) at maglev-concurrent-dispatcher.cc:137:8 [opt]
    frame #23: 0x0000000100878c1c node`v8::internal::OptimizedCompilationJob::ExecuteJob(this=0x0000600002c00630, stats=<unavailable>, local_isolate=<unavailable>) at compiler.cc:485:22 [opt]
    frame #24: 0x0000000100fb2d14 node`v8::internal::maglev::MaglevConcurrentDispatcher::JobTask::Run(this=0x0000600001e2c260, delegate=0x000000017216ef08) at maglev-concurrent-dispatcher.cc:283:18 [opt]
    frame #25: 0x000000010181dcb4 node`v8::platform::DefaultJobWorker::Run(this=0x0000600001e2e920) at default-job.h:147:18 [opt]
    frame #26: 0x000000010063baa4 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007d20) at node_platform.cc:43:11 [opt]
    frame #27: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x0000000199fd5db0 libsystem_kernel.dylib`semaphore_wait_trap + 8
    frame #1: 0x00000001013e68dc node`uv_sem_wait(sem=0x000000010561fc7c) at thread.c:492:9 [opt]
    frame #2: 0x00000001006f2930 node`node::inspector::(anonymous namespace)::StartIoThreadMain(unused=<unavailable>) at inspector_agent.cc:90:5 [opt]
    frame #3: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x0000000199fd7a84 libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #9, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #10, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #11, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #12, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136

@Qard (Member, Author) commented Jan 31, 2025

I suspect it's a spurious wakeup that's not handled safely and so the outstanding_tasks_ count goes out of sync.
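
For context on the term: a spurious wakeup is a return from a condition-variable wait without a matching signal, and a wait is only robust if it re-checks its predicate in a loop. A generic illustration of the difference (hypothetical code using std::condition_variable, not the Node internals):

```cpp
// Hypothetical illustration of an unsafe wait versus one that tolerates spurious
// wakeups; `cv`, `m`, and `outstanding_tasks` are placeholders, not Node internals.
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
int outstanding_tasks = 0;

void UnsafeDrain() {
  std::unique_lock<std::mutex> guard(m);
  if (outstanding_tasks > 0)
    cv.wait(guard);  // a spurious wakeup returns here with tasks still pending
}

void SafeDrain() {
  std::unique_lock<std::mutex> guard(m);
  while (outstanding_tasks > 0)
    cv.wait(guard);  // re-checks the predicate after every wakeup
}
```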

@santigimeno (Member) commented:

Looking at the traces, it seems to me that the problem is that one of the tasks in the Platform workers never ends, and thus DrainTasks never ends. In the traces from #56827 (comment), Thread #6 is stuck in the cond_wait in heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl(); maybe a deadlock in V8?

@lpinca (Member) commented Jan 31, 2025

According to the V8 team, the issue is in Node.js, see https://issues.chromium.org/issues/374285493.

@santigimeno (Member) commented:

According to the V8 team, the issue is in Node.js, see https://issues.chromium.org/issues/374285493.

Thanks for the reference. Yes, I think the solution is not calling DrainTasks() from the main thread, as proposed in the first commit from #47452. But that was breaking some other tests, as mentioned in the issue 🤷‍♂️

@Qard (Member, Author) commented Feb 1, 2025

By looking at the traces, it seems to me that the problem is that one of the tasks in the Platform workers never ends thus DrainTasks never ends.

That doesn't seem to be the case though, as all the workers and the dispatcher are stuck in uv_cond_wait at the same time, indicating none of them are working; they're all waiting for each other.

I think the issue might be with how TaskQueue is used all over the place in a mix of synchronized and unsynchronized ways. Some places use BlockingPop, which maintains the outstanding_tasks_ count, while others use Pop or PopAll, which seem not to maintain that count. The routing through DelayedTaskScheduler is a bit complicated and might be doing something odd there, but I wasn't able to dig into that enough in the time I had the other day.
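
To illustrate the kind of asymmetry being described, here is a hypothetical sketch (not the actual TaskQueue implementation; MixedQueue and its members are made-up names) in which one pop path participates in the completion/drain accounting and another does not, so the counter the drain waits on can drift out of sync with reality:

```cpp
// Hypothetical sketch of a queue with two pop paths, only one of which is
// paired with completion accounting; identifiers are placeholders, not Node code.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

template <typename T>
class MixedQueue {
 public:
  void Push(std::unique_ptr<T> task) {
    std::lock_guard<std::mutex> guard(lock_);
    outstanding_++;  // counted when enqueued...
    queue_.push(std::move(task));
    available_.notify_one();
  }

  std::unique_ptr<T> BlockingPop() {
    std::unique_lock<std::mutex> guard(lock_);
    available_.wait(guard, [this] { return !queue_.empty(); });
    auto task = std::move(queue_.front());
    queue_.pop();
    return task;  // ...and decremented later via NotifyOfCompletion()
  }

  std::unique_ptr<T> Pop() {
    std::lock_guard<std::mutex> guard(lock_);
    if (queue_.empty()) return nullptr;
    auto task = std::move(queue_.front());
    queue_.pop();
    return task;  // if no completion notification follows this path,
                  // outstanding_ never reaches zero and a drain hangs
  }

  void NotifyOfCompletion() {
    std::lock_guard<std::mutex> guard(lock_);
    if (--outstanding_ == 0) drained_.notify_all();
  }

  void BlockingDrain() {
    std::unique_lock<std::mutex> guard(lock_);
    drained_.wait(guard, [this] { return outstanding_ == 0; });
  }

 private:
  std::mutex lock_;
  std::condition_variable available_;
  std::condition_variable drained_;
  std::queue<std::unique_ptr<T>> queue_;
  int outstanding_ = 0;
};
```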

@santigimeno (Member) commented Feb 1, 2025

That doesn't seem to be the case though as all the workers and the dispatcher are stuck in uv_cond_wait at the same time, indicating none of them are working, they're all waiting for each other.

Aren't they waiting on different cond variables though? Maybe I'm wrong, but the way I see it is:

  • There are 3 idle workers waiting for new tasks to be pushed; they wait on the tasks_available_ cond variable (tasks_available_.Wait(scoped_lock)).
  • The worker on Thread #6, which is performing a Maglev task and, from my reading of https://issues.chromium.org/issues/374285493#comment6, is waiting on a cond variable for GC to perform something on the main thread.
  • The main thread, which is waiting for the tasks to drain on the tasks_drained_ cond variable, and which is thereby also blocking Thread #6, as it's not allowing the GC to proceed.

So, DrainTasks() never exits because the Maglev task never ends because the main thread never idles because it's waiting on DrainTasks() to end. Hopefully I'm making some sense here. BTW, if the DrainTasks() call is removed as in 1bc4970, the error never happens (the Maglev task is able to end).
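
A compressed, self-contained model of that cycle (purely illustrative; the names are invented and timeouts are used so the example terminates instead of actually deadlocking):

```cpp
// Hypothetical model of the cycle described above: the main thread blocks in a
// drain wait, while the last remaining task can only finish once the main
// thread reacts (modeled here by `main_reached_idle`).
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int outstanding_tasks = 1;       // the Maglev-style task that never finishes
bool main_reached_idle = false;  // would only become true after the drain returns

int main() {
  std::thread worker([] {
    std::unique_lock<std::mutex> lk(m);
    // The last task needs the main thread to become idle (as the GC wait does),
    // but main is stuck in the drain below, so this wait times out.
    if (!cv.wait_for(lk, std::chrono::seconds(1),
                     [] { return main_reached_idle; }))
      std::puts("worker: gave up waiting for the main thread");
  });

  {
    std::unique_lock<std::mutex> lk(m);
    // DrainTasks-style wait: blocks until the task count reaches zero, which
    // never happens because the only remaining task is waiting on us.
    if (!cv.wait_for(lk, std::chrono::seconds(2),
                     [] { return outstanding_tasks == 0; }))
      std::puts("main: drain never completed (deadlock cycle)");
  }
  worker.join();
  return 0;
}
```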

@Qard (Member, Author) commented Feb 2, 2025

The main thread and the workers are all in uv_cond_wait at the same time though. If the V8 worker enters uv_cond_wait while main is already waiting then it will just continue to wait until main is ready before it can unlock and apply its own operation. Meanwhile though, the workers are all waiting as they seem to think there are zero tasks to process. If main was in uv_cond_wait at the time then it must have believed otherwise.

For it to have reached DrainTasks means the V8 tasks have already been allocated and shipped to the worker threads.

It may be possible for the main thread and GC thread to mutually lock each other, but the workers would still pick up the values from BlockingPop, continue operating, and then signal the drain condvar. It doesn't make sense that they would get stuck in uv_cond_wait too unless the counts went out of sync and the main thread was waiting on a task that had not been delivered.

@santigimeno (Member) commented:

The main thread and the workers are all in uv_cond_wait at the same time though. If the V8 worker enters uv_cond_wait while main is already waiting then it will just continue to wait until main is ready before it can unlock and apply its own operation. Meanwhile though, the workers are all waiting as they seem to think there are zero tasks to process. If main was in uv_cond_wait at the time then it must have believed otherwise.

For it to have reached DrainTasks means the V8 tasks have already been allocated and shipped to the worker threads.

I mostly agree. Indeed all tasks have been sent and picked up by the workers, but there's only one left to complete, as explained above (outstanding_tasks_ == 1 at this point). The other workers don't have anything else to process and will remain idle until all tasks are drained, but in this case that'll never happen due to the deadlock. I'm not sure how, with the main thread stuck, new tasks would be posted to the thread pool.

@Qard (Member, Author) commented Feb 3, 2025

But all workers are in uv_cond_wait here. Thread 1 is main; Thread 2 is the DelayedTaskScheduler worker for scheduling tasks into the workers after a delay, which would be handling uv_async_t triggers from its uv_run loop if there were any work to process there; and threads 3 through 5 are the workers, which are all blocked at the same time.

While outstanding_tasks_ being 1 should indicate that a task is being processed, that seems to not actually be true: the main thread halts in DrainTasks, which means it cannot have any tasks that have not yet been pushed to the queue and had their signal sent. At the same time, all worker threads are also in uv_cond_wait, which means none of them have any tasks to operate on. If the main thread gets locked by the GC thread during DrainTasks, it may be prevented from reacting to any drain signals, but that won't stop the workers from receiving the already-sent signal, picking up all the tasks in the queue with BlockingPop, and then calling NotifyOfCompletion for each, which decrements the outstanding_tasks_ counter for each task and attempts to send the drain signal if the count reaches zero. The worker does not wait for a signal between tasks if there are tasks remaining in the queue, so it must process everything in the queue before it returns to waiting.

It would be the case that after processing everything in the queue the worker threads would then all be in uv_cond_wait due to having nothing to operate on, but in that case outstanding_tasks_ would have to be zero, which it is not. It doesn't mechanically make sense for the workers to all be in uv_cond_wait when outstanding_tasks_ is non-zero unless the count got out of sync somehow.
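
Stated as code, the invariant being argued is roughly the following interface-level sketch (hypothetical declarations, not the actual PlatformWorkerThread or TaskQueue code):

```cpp
// Interface-level sketch: a worker pops tasks in a loop and notifies completion
// for each one, so it only returns to waiting once the queue is empty.
#include <memory>

struct Task {
  virtual ~Task() = default;
  virtual void Run() = 0;
};

struct TaskQueueSketch {
  // Blocks until a task is available; assumed to return nullptr once stopped.
  std::unique_ptr<Task> BlockingPop();
  // Decrements the outstanding-task counter and signals the drain waiter at zero.
  void NotifyOfCompletion();
};

void PlatformWorkerLoop(TaskQueueSketch* queue) {
  while (std::unique_ptr<Task> task = queue->BlockingPop()) {
    task->Run();                  // every popped task is run...
    queue->NotifyOfCompletion();  // ...and accounted for before waiting again
  }
  // If every pop is paired with a completion notification, then all workers
  // sitting in BlockingPop should imply the outstanding count is zero; a
  // non-zero count at that point means the accounting went out of sync.
}
```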

@targos (Member) commented Feb 4, 2025

It seems that #56842 has more test timeouts than usual. Maybe it will be easier for you to debug/reproduce with it?

Successfully merging this pull request may close these issues: Deadlock at process shutdown