Bug report
Bug description:
I have noticed that the dask benchmark in pyperformance hangs when running it with Python 3.11 with a "high" number of cores on the machine. I have seen issues with 191 and 384 cores.
I started investigating the problem and saw that the issue manifests itself on machines with a high number of cores.
The benchmark that hangs is https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_dask/run_benchmark.py
When the Worker class gets instantiated, it sets nthreads to the number of CPUs present on the system (here is the code).
When this number is relatively high, it causes Python 3.11 to hang, with all the underlying threads deadlocked on the GIL.
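To see how many threads such a default would produce on a given machine, the CPU count can be printed with the standard library. This is a minimal sketch: whether the Worker default corresponds to `os.cpu_count()` or to the process affinity mask is an assumption here, so both values are shown.

```python
import os

# Logical CPUs reported by the OS. Assumption: this is what "number of CPUs
# present on the system" refers to. os.cpu_count() may return None.
logical_cpus = os.cpu_count() or 1

# CPUs this process is allowed to run on. sched_getaffinity is only available
# on some platforms (e.g. Linux), so fall back to the logical count elsewhere.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus

print(f"logical CPUs: {logical_cpus}, usable by this process: {usable_cpus}")
```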
To replicate the issue:
- Make a copy of the dask benchmark file
- Set the nthreads of the Worker class to a relatively high number (e.g. 1000).
async with Worker(scheduler.address, nthreads=1000):
...
- Create/activate a venv with Python 3.11 and install the dependencies
pip install dask[distributed]==2022.2.0 pyperf
- Run a quick stress test
while true; do python run_benchmark.py; done
and wait for it to hang. The hang occurs at a random time.
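Because the hang occurs at a random iteration, the shell loop can also be replaced by a small driver that notices when a run stops making progress. The following is a sketch only: the helper name `run_until_hang` and the timeout values are hypothetical, and a run that is merely slow would also be reported as a hang. The child process is deliberately left alive on timeout so that gdb or strace can still be attached to it.

```python
import subprocess
import sys


def run_until_hang(cmd, timeout_s, max_runs):
    """Run cmd repeatedly until one run exceeds timeout_s.

    Returns (run_number, process) for the first run that timed out, with the
    process left running for inspection, or (None, None) if every run
    finished within the timeout.
    """
    for run in range(1, max_runs + 1):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Popen.wait() does not kill the child on timeout, so the
            # suspected hang can still be examined via proc.pid.
            return run, proc
    return None, None
```

For example, `run_until_hang([sys.executable, "run_benchmark.py"], timeout_s=300, max_runs=1000)` returns the iteration that stalled together with the process object, whose `pid` can then be passed to gdb.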
With the process hung, gdb shows the following on one thread (out of hundreds):
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f5aeffff640 (LWP 402351))]
#0 __futex_abstimed_wait_common64 (private=-1457409528, cancel=true, abstime=0x7f5aefffde20, op=137, expected=0, futex_word=0x5640a959d354 <_PyRuntime+436>) at ./nptl/futex-internal.c:57
57 in ./nptl/futex-internal.c
(gdb) py-bt
Traceback (most recent call first):
Waiting for the GIL
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 788, in open_binary
return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1967, in memory_info
with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1714, in wrapper
return fun(self, *args, **kwargs)
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/__init__.py", line 1102, in memory_info
return self._proc.memory_info()
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 495, in wrapper
return fun(self)
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/utils_perf.py", line 188, in _gc_callback
rss = self._proc.memory_info().rss
<built-in method _current_frames of module object at remote 0x7f5dc0a32ca0>
File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/profile.py", line 270, in _watch
frame = sys._current_frames()[thread_id]
File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1002, in _bootstrap
self._bootstrap_inner()
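When attaching gdb is not convenient, a comparable per-thread view can be requested from inside the process with the standard library's `faulthandler` module. The sketch below is an optional diagnostic aid and was not used to produce the trace above; the helper names and the 60-second default are illustrative assumptions. According to the `faulthandler` documentation, the delayed dump is implemented with a watchdog thread, so it is expected (though not verified here for this hang) to keep working while Python-level threads are stuck.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, file=None):
    """Dump the tracebacks of all threads every timeout_s seconds.

    `file` must have a fileno(); sys.stderr is used when it is None.
    """
    if file is None:
        file = sys.stderr
    # repeat=True keeps dumping until cancelled; exit=False leaves the
    # process running so it can still be inspected afterwards.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once the benchmark finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at the top of the copied benchmark file would make a hung run print its thread tracebacks to stderr periodically.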
An strace of one thread shows the following, repeating continuously:
...
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=468031783}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=473122144}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=478228035}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=483319687}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=488417438}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=493521779}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=498608771}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=503711922}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=508813993}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=513919325}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=519022166}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
...
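The spacing of the futex deadlines in this output can be checked with a few lines of Python. The sketch below uses the first five `tv_nsec` values copied from the strace lines above (`tv_sec` is identical in all of them) and compares their spacing with the interpreter's switch interval, which defaults to 5 ms in CPython. That the roughly 5 ms spacing reflects a thread repeatedly timing out while waiting for the GIL is an inference, consistent with the "Waiting for the GIL" line in the gdb output, not something the trace proves by itself.

```python
import sys

# tv_nsec values of consecutive futex wait deadlines, copied from the strace
# output; tv_sec is 6498067 in every sampled call.
deadlines_ns = [468031783, 473122144, 478228035, 483319687, 488417438]

# Gap between consecutive deadlines, in milliseconds.
deltas_ms = [(b - a) / 1e6 for a, b in zip(deadlines_ns, deadlines_ns[1:])]
mean_delta_ms = sum(deltas_ms) / len(deltas_ms)

# Switch interval of the interpreter running this script, in milliseconds.
switch_interval_ms = sys.getswitchinterval() * 1000

print(f"deadline gaps (ms): {[round(d, 3) for d in deltas_ms]}")
print(f"mean gap: {mean_delta_ms:.3f} ms, switch interval: {switch_interval_ms:.3f} ms")
```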
I tried upgrading dask[distributed] to the latest version, but I see the same behavior. I suspect the problem lies in Python 3.11 itself.
This happens only with Python 3.11: 3.9 and 3.12 work as expected.
I've seen it on x86; aarch64 is still to be tested.
CPython versions tested on:
3.11
Operating systems tested on:
Linux