Description
I'm having trouble finding the cause of a rare deadlock in my parallel quicksort. I tried my best, with a lot of help from @chriselrod, to track down the problem, but I still have no idea how to fix it. I am reporting it here since I think there is some chance that there is a bug in Julia (scheduler? compiler?). (In other words, I'm hopelessly lost now and all I can do is "blame" the upstream.)
MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Running the following code eventually causes a deadlock when `JULIA_NUM_THREADS` is appropriate (for me it's 13; for @chriselrod it's 8 but not 7):
```julia
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
```
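For reference, this is roughly how I launch the reproducer (the file name `mwe.jl` is just my setup; adjust `JULIA_NUM_THREADS` to whatever triggers the hang on your machine):

```shell
# Save the snippet above as mwe.jl, then run it with the thread count
# that reproduces the hang (13 for me, 8 for @chriselrod).
export JULIA_NUM_THREADS=13
julia mwe.jl
```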
I've tried `git bisect` with this script and found that #32599 changes the way the problem(s) manifest:
- After "threads: further optimize scheduler based on #32551" #32599 (dc46ddd): running the MWE eventually hangs (once it hung within 18 minutes; another time it kept running for more than 30 minutes before hanging).
- Before "threads: further optimize scheduler based on #32551" #32599 (f5dbc47): running the MWE does not cause the hang.
- In both cases, the MWE can cause a segmentation fault, but it is very rare; i.e., the MWE can run for more than one hour without a problem.
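For completeness, the bisect procedure was roughly the following (a sketch; the actual driver script, which builds julia at each step and runs the MWE under a timeout, is not shown here):

```shell
# Sketch of the bisect run over the commits above.
git bisect start
git bisect bad dc46ddd     # MWE eventually hangs at this commit
git bisect good f5dbc47    # MWE does not hang at this commit
# then, after building julia and testing the MWE at each candidate commit:
#   git bisect good    # if the MWE ran cleanly
#   git bisect bad     # if it hung
```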
With f5dbc47, I observed this segmentation fault:
```
seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....

signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817
```
However, I don't understand how this function, `partition_sizes!`, can cause a segfault.
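For context, the access pattern in that frame is, as far as I can tell from the trace (`setindex!` on a `SubArray` at quicksort.jl:167), just element-wise writes into per-task views of a locally allocated array. A simplified illustration of that kind of access (this is NOT the actual ThreadsX source, just a sketch of the pattern):

```julia
# Simplified illustration (not the real partition_sizes!): each task writes
# only into its own disjoint view, and setindex! on a SubArray is
# bounds-checked by default, which is why a segfault here is so surprising.
xs = rand(0:0.01:1, 1_000_000)
flags = Vector{Bool}(undef, length(xs))
pivot = 0.5
chunks = Iterators.partition(eachindex(xs), cld(length(xs), Threads.nthreads()))
@sync for idx in chunks
    Threads.@spawn begin
        v = view(flags, idx)           # this task's disjoint slice
        for (j, i) in enumerate(idx)
            v[j] = xs[i] < pivot       # plain, bounds-checked setindex!
        end
    end
end
```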
@chriselrod also observed another segfault with 65b8e7e:
```
signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811
```
It is puzzling to me how the code involved in the above stacktrace can cause a segfault.
We (@chriselrod and I) tried to get a better error report by running the MWE with `--check-bounds=yes`. But we couldn't get a hang or segfault even after running the MWE for hours.
I would highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build an ASAN-enabled julia with help from @vchuravy (#35338), hoping that it gives us some information. But I'm not sure how much I can discover by doing this, as I don't know anything about LLVM or C++.