
Issue with Python 3.11 and dask[distributed] with high number of threads #116969

Closed
@diegorusso

Bug report

Bug description:

I have noticed that the dask benchmark in pyperformance hangs when it is run with Python 3.11 on a machine with a "high" number of cores. I have seen the issue with 191 and 384 cores.

I started investigating the problem and saw that the issue manifests itself only on machines with a high number of cores.
The benchmark that hangs is https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_dask/run_benchmark.py

When the Worker class is instantiated, it sets nthreads to the number of CPUs present on the system (here is the code).

When this number is relatively high, Python 3.11 hangs and all the underlying threads deadlock on the GIL.

To replicate the issue:

  • Make a copy of the dask benchmark file.
  • Set the nthreads of the Worker class to a relatively high number (e.g. 1000):
    async with Worker(scheduler.address, nthreads=1000):
        ...
  • Create/activate a venv with Python 3.11 and install the dependencies:
    pip install dask[distributed]==2022.2.0 pyperf
  • Run a quick stress test:
    while true; do python run_benchmark.py; done

and wait for it to hang. The hang happens at a random time.
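Because the hang shows up at a random iteration, it can help to let a small driver detect it instead of watching the shell loop. The following is a minimal sketch, not part of the original report: `run_until_hang`, its parameters, and the 300-second timeout are hypothetical, and it assumes that a run exceeding the timeout means the process has hung. The hung child is deliberately left running so that gdb or strace can be attached to its PID.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout_s=300.0, max_runs=None):
    """Run `cmd` repeatedly; return the still-running Popen of the first
    run that exceeds `timeout_s`, or None if `max_runs` runs all finished.

    The timeout is only a heuristic for "hung" (an assumption).
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Do not kill the child: keep it alive for inspection.
            return proc
    return None


if __name__ == "__main__":
    hung = run_until_hang([sys.executable, "run_benchmark.py"])
    if hung is not None:
        print(f"run appears hung, attach with: gdb -p {hung.pid}")
```

The timeout has to be comfortably larger than a normal benchmark run, otherwise slow runs are reported as hangs.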

With the process hung, gdb shows the following for one thread (out of hundreds):

 (gdb) thread 4
[Switching to thread 4 (Thread 0x7f5aeffff640 (LWP 402351))]
#0  __futex_abstimed_wait_common64 (private=-1457409528, cancel=true, abstime=0x7f5aefffde20, op=137, expected=0, futex_word=0x5640a959d354 <_PyRuntime+436>) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) py-bt
Traceback (most recent call first):
  Waiting for the GIL
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 788, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1967, in memory_info
    with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/__init__.py", line 1102, in memory_info
    return self._proc.memory_info()
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 495, in wrapper
    return fun(self)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/utils_perf.py", line 188, in _gc_callback
    rss = self._proc.memory_info().rss
  <built-in method _current_frames of module object at remote 0x7f5dc0a32ca0>
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/profile.py", line 270, in _watch
    frame = sys._current_frames()[thread_id]
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
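When attaching gdb is not practical, the standard library's faulthandler module can produce a similar per-thread dump from inside the process. This is a sketch under assumptions: the helper names and the idea of calling them at the top of the copied benchmark file are hypothetical, and whether the dump is as informative as the py-bt output above has not been verified for this hang. faulthandler runs its watchdog outside the interpreter loop, so it is intended to fire even when the Python threads are stuck.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, stream=sys.stderr, repeat=False):
    """Ask faulthandler to dump the Python traceback of every thread to
    `stream` if the process is still running after `timeout_s` seconds.

    `stream` must be a real file (it needs a file descriptor) and must
    stay open until the watchdog fires or is cancelled.
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=repeat, file=stream, exit=False
    )


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. after the benchmark finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

A hypothetical use would be `arm_hang_watchdog(300)` before the benchmark starts and `disarm_hang_watchdog()` once it completes.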

An strace of a thread shows the following, continuously:

...
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=468031783}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=473122144}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=478228035}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=483319687}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=488417438}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=493521779}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=498608771}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=503711922}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=508813993}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=513919325}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=519022166}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
...
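The timestamps in the futex waits above are absolute deadlines, and the spacing between them can be worked out directly. The sketch below copies three of the wait lines from the trace and computes the gaps between consecutive deadlines; the function name is hypothetical. The gaps come out at roughly 5.1 ms, which is close to CPython's default switch interval of 5 ms (`sys.getswitchinterval()`). That would be consistent with a thread that repeatedly times out while waiting for the GIL, though this reading is an interpretation and not something the trace alone proves.

```python
import re
import sys

# Three of the futex wait lines from the strace output above.
STRACE_SAMPLE = """\
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=468031783}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=473122144}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=478228035}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
"""

_DEADLINE = re.compile(r"tv_sec=(\d+), tv_nsec=(\d+)")


def deadline_gaps_ms(strace_text):
    """Return the gaps, in milliseconds, between consecutive futex
    wait deadlines found in `strace_text`."""
    stamps_ns = [
        int(sec) * 1_000_000_000 + int(nsec)
        for sec, nsec in _DEADLINE.findall(strace_text)
    ]
    return [(b - a) / 1e6 for a, b in zip(stamps_ns, stamps_ns[1:])]


if __name__ == "__main__":
    print("gaps between deadlines (ms):", deadline_gaps_ms(STRACE_SAMPLE))
    print("switch interval (ms):", sys.getswitchinterval() * 1000)
```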

I tried upgrading dask[distributed] to the latest version, but I see the same behaviour. I think there is something going on in Python 3.11.
This happens only with Python 3.11: 3.9 and 3.12 work as expected.

I've seen it on x86; aarch64 is still to be tested.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

Labels: type-bug (an unexpected behavior, bug, or error)