ProcessPoolExecutor shutdown hangs after future cancel was requested #94440

Closed
@yonatanp

Description

Bug report

With a ProcessPoolExecutor, after submitting and quickly cancelling a future, a call to shutdown(wait=True) hangs indefinitely.
This happens on effectively all platforms and all recent Python versions.

Here is a minimal reproduction:

import concurrent.futures

ppe = concurrent.futures.ProcessPoolExecutor(1)
ppe.submit(int).result()
ppe.submit(int).cancel()
ppe.shutdown(wait=True)

The first submission gets the executor going and creates its internal queue_management_thread.
The second submission appears to cause that thread to loop once more, enter a wait state, and never receive a wakeup event.

Introducing a tiny sleep between the second submit and its cancel request makes the issue disappear. From my initial observation, it looks like the way the internal loop of queue_management_worker is structured doesn't handle this edge case well.
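For illustration, here is a minimal sketch of the timing-based workaround described above. The 0.1-second delay is an arbitrary value chosen for this sketch, not a guaranteed fix for the underlying race:

```python
import concurrent.futures
import time

ppe = concurrent.futures.ProcessPoolExecutor(1)
ppe.submit(int).result()  # first submission starts the queue management thread
f = ppe.submit(int)
time.sleep(0.1)           # tiny delay between submit and cancel avoids the hang
f.cancel()                # returns False if the call has already started running
ppe.shutdown(wait=True)   # now returns instead of hanging indefinitely
```

Note that with the delay in place the task has usually already run, so cancel() typically returns False; the point is only that shutdown(wait=True) completes.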

Shutting down with wait=False returns immediately as expected, but the queue_management_thread then dies with an unhandled OSError: handle is closed exception.

Environment

  • Discovered on macOS-12.2.1 with CPython 3.8.5.
  • Reproduced on Ubuntu and Windows (x64) as well, and in CPython versions 3.7 to 3.11.0-beta.3.
  • Reproduced in pypy3.8 as well, but not consistently. Seen for example on Ubuntu with Python 3.8.13 (PyPy 7.3.9).

Additional info

When tested with pytest-timeout under Ubuntu and CPython 3.8.13, these are the tracebacks at the moment of timing out:

_____________________________________ test _____________________________________
    @pytest.mark.timeout(10)
    def test():
        ppe = concurrent.futures.ProcessPoolExecutor(1)
        ppe.submit(int).result()
        ppe.submit(int).cancel()
>       ppe.shutdown(wait=True)
test_reproduce_python_bug.py:14:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/concurrent/futures/process.py:686: in shutdown
    self._queue_management_thread.join()
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py:1011: in join
    self._wait_for_tstate_lock()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Thread(QueueManagerThread, started daemon 140003176535808)>
block = True, timeout = -1
    def _wait_for_tstate_lock(self, block=True, timeout=-1):
        # Issue #18808: wait for the thread state to be gone.
        # At the end of the thread's life, after all knowledge of the thread
        # is removed from C data structures, C code releases our _tstate_lock.
        # This method passes its arguments to _tstate_lock.acquire().
        # If the lock is acquired, the C code is done, and self._stop() is
        # called.  That sets ._is_stopped to True, and ._tstate_lock to None.
        lock = self._tstate_lock
        if lock is None:  # already determined that the C code is done
            assert self._is_stopped
>       elif lock.acquire(block, timeout):
E       Failed: Timeout >10.0s
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py:1027: Failed
----------------------------- Captured stderr call -----------------------------
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~ Stack of QueueFeederThread (140003159754496) ~~~~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/queues.py", line 227, in _feed
    nwait()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
~~~~~~~~~~~~~~~~ Stack of QueueManagerThread (140003176535808) ~~~~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/concurrent/futures/process.py", line 362, in _queue_management_worker
    ready = mp.connection.wait(readers + worker_sentinels)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

Tracebacks in PyPy are similar at the concurrent.futures.process level. Tracebacks on Windows differ in the lower-level frames, but are again similar at the concurrent.futures.process level.
