Description
Bug report
Running pool_in_threads.py in a loop will occasionally lead to deadlock, even with #118745 applied.
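For reference, here is a minimal sketch of the stress pattern (hypothetical; the actual pool_in_threads.py lives in the CPython tree and may differ): each thread creates and tears down its own ThreadPoolExecutor, so worker thread states are constantly created and deleted, which is exactly the path the traces below go through.

```python
# Hypothetical sketch of the pool_in_threads.py stress pattern.
# Many threads each spin up and tear down their own ThreadPoolExecutor,
# churning thread-state creation/deletion concurrently.
import threading
from concurrent.futures import ThreadPoolExecutor

def run_pool(results, idx):
    # Each call creates and destroys a pool, so worker thread states
    # are created and deleted while other threads do the same.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results[idx] = sum(pool.map(lambda x: x * x, range(8)))

def stress(num_threads=8):
    results = [None] * num_threads
    threads = [threading.Thread(target=run_pool, args=(results, i))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Run in a loop, this kind of script occasionally hangs with the two-thread state described below.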
Here's a summary of the state I observed:
Thread 28
Thread 28 holds the GIL and is blocked on the QSBR shared mutex:
```
#6 0x000055989dd7d8a6 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f8ba4ff8c60, timeout=timeout@entry=-1) at Python/parking_lot.c:142
#7 0x000055989dd7d9dc in _PySemaphore_Wait (sema=sema@entry=0x7f8ba4ff8c60, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:213
#8 0x000055989dd7db65 in _PyParkingLot_Park (addr=addr@entry=0x55989e0db1c8 <_PyRuntime+136840>, expected=expected@entry=0x7f8ba4ff8cf7, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f8ba4ff8d00, detach=detach@entry=1) at Python/parking_lot.c:316
#9 0x000055989dd7494d in _PyMutex_LockTimed (m=m@entry=0x55989e0db1c8 <_PyRuntime+136840>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:112
#10 0x000055989dd74a4e in _PyMutex_LockSlow (m=m@entry=0x55989e0db1c8 <_PyRuntime+136840>) at Python/lock.c:53
#11 0x000055989dd9c8e1 in PyMutex_Lock (m=0x55989e0db1c8 <_PyRuntime+136840>) at ./Include/internal/pycore_lock.h:75
#12 _Py_qsbr_unregister (tstate=tstate@entry=0x7f8bec00bad0) at Python/qsbr.c:239
#13 0x000055989dd92142 in tstate_delete_common (tstate=tstate@entry=0x7f8bec00bad0) at Python/pystate.c:1797
#14 0x000055989dd92831 in _PyThreadState_DeleteCurrent (tstate=tstate@entry=0x7f8bec00bad0) at Python/pystate.c:1845
```
It's blocked trying to lock the QSBR shared mutex (line 194 in 9fa206a).
Normally, this would release the GIL while blocking, but we clear the thread state before calling tstate_delete_common (lines 1844 to 1845 in 9fa206a).
We only detach (and release the GIL) if we both have a thread state and it is currently attached (lines 204 to 207 in 9fa206a).
Note that the PyThreadState in this case is actually attached; it's just not visible via _PyThreadState_GET().
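The failure mode can be modeled with ordinary Python locks (a sketch, not CPython's C code; `gil` and `qsbr_mutex` here stand in for the real GIL and the QSBR shared mutex): when the current-thread-state pointer has already been cleared, the detach check fails and the thread parks with the "GIL" still held.

```python
# Sketch of the detach check in the semaphore-wait path: release the
# "GIL" only when the current thread state is present and attached.
# If the thread-state pointer was cleared beforehand, we block on the
# mutex while still holding the "GIL" -- thread 28's exact state.
import threading

gil = threading.Lock()          # stands in for the GIL
qsbr_mutex = threading.Lock()   # stands in for the QSBR shared mutex

def semaphore_wait_model(tstate_attached, blocker, timeout):
    detached = False
    if tstate_attached:         # in CPython: tstate is non-NULL and attached
        gil.release()           # detach: let other threads run
        detached = True
    acquired = blocker.acquire(timeout=timeout)  # park on the mutex
    if acquired:
        blocker.release()
    if detached:
        gil.acquire()           # reattach on wakeup
    return acquired
```

With `tstate_attached=False` and the mutex held elsewhere, the call times out while the "GIL" was never released, which is the deadlock ingredient observed in thread 28.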
Thread 4
Thread 4 holds the shared mutex and is blocked trying to acquire the GIL:
```
#6 take_gil (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:331
#7 0x000055989dd4c93c in _PyEval_AcquireLock (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:585
#8 0x000055989dd92ca3 in _PyThreadState_Attach (tstate=tstate@entry=0x55989eeb6e00) at Python/pystate.c:2071
#9 0x000055989dd4c9bb in PyEval_AcquireThread (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:602
#10 0x000055989dd7d9eb in _PySemaphore_Wait (sema=sema@entry=0x7f8bf3ffe240, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:215
#11 0x000055989dd7db65 in _PyParkingLot_Park (addr=addr@entry=0x55989e0c0108 <_PyRuntime+26056>, expected=expected@entry=0x7f8bf3ffe2c8, size=size@entry=8, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=1) at Python/parking_lot.c:316
#12 0x000055989dd747ce in rwmutex_set_parked_and_wait (rwmutex=rwmutex@entry=0x55989e0c0108 <_PyRuntime+26056>, bits=bits@entry=1) at Python/lock.c:386
#13 0x000055989dd74e38 in _PyRWMutex_RLock (rwmutex=0x55989e0c0108 <_PyRuntime+26056>) at Python/lock.c:404
#14 0x000055989dd9357a in stop_the_world (stw=stw@entry=0x55989e0db190 <_PyRuntime+136784>) at Python/pystate.c:2234
#15 0x000055989dd936d3 in _PyEval_StopTheWorld (interp=interp@entry=0x55989e0d8780 <_PyRuntime+126016>) at Python/pystate.c:2331
#16 0x000055989dd9c761 in _Py_qsbr_reserve (interp=interp@entry=0x55989e0d8780 <_PyRuntime+126016>) at Python/qsbr.c:201
#17 0x000055989dd908d4 in new_threadstate (interp=0x55989e0d8780 <_PyRuntime+126016>, whence=whence@entry=2) at Python/pystate.c:1543
#18 0x000055989dd92769 in _PyThreadState_New (interp=<optimized out>, whence=whence@entry=2) at Python/pystate.c:1624
```