Description
I think it is very likely that PR #32599 introduced a bug that causes deadlocks. I have two MWEs below.
Since these MWEs need to run for a very long time and there is a non-negligible chance that they trigger the segfault fixed by #35387, I back-ported #35387 to the commits just before and after #32599:
pre32599-pr35387 (tkf@247e385): #35387 ([LateGCLowering] Fix skipped Select lifting) merged into the commit just before #32599 (threads: further optimize scheduler based on #32551) was merged.
post32599-pr35387 (tkf@eff0664): #35387 merged into the commit where #32599 was merged.
The MWEs below cause a deadlock in post32599-pr35387 and master but not in pre32599-pr35387 (the MWEs run until the end).
(I used JULIA_NUM_THREADS=13 when running the following MWEs. The specific number of threads needed to trigger this deadlock seems to differ from machine to machine. @chriselrod seems to need only JULIA_NUM_THREADS=8.)
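Since reproduction depends on the thread count, it may help to confirm that the setting actually took effect before starting a long run. The following is a minimal sketch; `thread_report` is a hypothetical helper name, not something from Base or from the MWEs.

```julia
using Base.Threads: nthreads, threadid, @spawn

# Hypothetical helper: report the thread count Julia actually started with,
# next to the environment variable that requested it.
function thread_report()
    requested = get(ENV, "JULIA_NUM_THREADS", "(unset)")
    # Spawn a batch of short tasks and record which threads pick them up.
    # Very short tasks may all land on the same thread; this is only a hint.
    tasks = [@spawn threadid() for _ in 1:(8 * nthreads())]
    used = sort!(unique(fetch.(tasks)))
    return (requested = requested, nthreads = nthreads(), threads_used = used)
end

@show thread_report()
```

If `nthreads` does not match the requested value, the MWEs are not running under the configuration described here.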
MWE 1
This snippet is based on the one in #35387 (comment) by @chriselrod. On my machine, a single @benchmark tproduct_fm($x) didn't cause the deadlock, so I put it in a loop. In the first run, it deadlocked after printing i = 84.
using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra, BenchmarkTools

function tproduct_fm(x)
    @fastmath begin
        X3 = cbrt.(x)
        X9 = cbrt.(X3).^2
    end
    lx = length(x)
    H = Matrix{eltype(x)}(undef, lx, lx)
    @sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads())
        @spawn begin
            for i ∈ x_index_chunk
                x3 = X3[i]
                x9 = X9[i]
                js = firstindex(x):i
                @fastmath for j in js
                    A = x3 + X3[j]
                    B = x3 * X3[j]
                    C = x9 + X9[j]
                    f = A^2 * C
                    g = -(B*A)
                    H[j, i] = f * g
                end
            end
        end
    end
    return Symmetric(H)
end

x = rand(1000);
for i in 1:1000
    @show i
    @benchmark tproduct_fm($x)
end
Example backtrace after Ctrl-C
^C
signal (2): Interrupt
in expression starting at /tmp/tproduct_fm.jl:32
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
...
MWE 2
This is the MWE I've reported in #35341. In the first run, it caused a deadlock after printing seed = 287.
using BenchmarkTools, ThreadsX, Random

for seed in 1:1000
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000)))
end
Example backtrace after Ctrl-C
^CERROR: InterruptException:
Stacktrace:
[1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
[2] wait() at ./task.jl:709
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Array{Any,1}) at ./task.jl:294
[6] macro expansion at ./task.jl:335 [inlined]
[7] maptasks(::ThreadsX.Implementations.var"#97#98"{Float64,Base.Order.ForwardOrdering}, ::Base.Iterators.Zip{Tuple{Base.Iterators.PartitionIterator{SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}},Base.Iterators.PartitionIterator{Array{Int8,1}}}}) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
[8] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
[9] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
[10] _sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
[11] #sort!#86 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
[12] sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
[13] #sort#85 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
[14] sort at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
[15] ##core#1405(::Array{Float64,1}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[16] ##sample#1406(::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[17] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String,
kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
[18] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
[19] #invokelatest#1 at ./essentials.jl:710 [inlined]
[20] invokelatest at ./essentials.jl:709 [inlined]
[21] #run_result#37 at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[22] run_result at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
[23] top-level scope at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
[24] top-level scope at none:1
How to solve this?
The next step for me is to try rr, as it helped a lot to fix one of the bugs I reported in #35341.
Meanwhile, it'd be nice if core devs could try reproducing the deadlock on their machines. I know @Keno tried doing this but I don't know if he tried different values of JULIA_NUM_THREADS. I'll also try to find some relationship between JULIA_NUM_THREADS and the average time/probability of the deadlock (e.g., does it have to be within a "hotspot" or does it just have to be large enough).
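One way to look for such a relationship is an external driver that runs an MWE script in a child process for each thread count and treats a timeout as a proxy for a deadlock. The following is a rough sketch under that assumption, not a tested reproduction harness: `run_with_timeout` and `scan_thread_counts` are hypothetical names, and the script path and timeout values are placeholders. A slow run that exceeds the timeout would be miscounted as a deadlock, so the timeout needs to be well above the normal run time.

```julia
# Hypothetical helper: run `cmd` and kill it if it is still alive after
# `timeout` seconds. Returns :ok, :failed, or :timeout.
function run_with_timeout(cmd::Cmd; timeout::Real, poll::Real = 0.5)
    proc = run(cmd; wait = false)   # child I/O goes to devnull when wait = false
    deadline = time() + timeout
    while process_running(proc) && time() < deadline
        sleep(poll)
    end
    if process_running(proc)
        kill(proc, Base.SIGKILL)    # a hung process may not react to SIGTERM
        wait(proc)
        return :timeout
    end
    return success(proc) ? :ok : :failed
end

# Hypothetical scan: for each thread count, run `script` `trials` times and
# count how many runs hit the timeout. `julia` is the binary to test, so it
# can point at a specific build (assumption: defaults to the running one).
function scan_thread_counts(script::AbstractString, counts;
                            trials::Integer = 5, timeout::Real = 3600,
                            julia::AbstractString =
                                joinpath(Sys.BINDIR, Base.julia_exename()))
    results = Dict{Int,Int}()
    for n in counts
        env = copy(ENV)                      # inherit the parent environment
        env["JULIA_NUM_THREADS"] = string(n)
        cmd = setenv(`$julia $script`, env)
        results[n] = count(_ -> run_with_timeout(cmd; timeout = timeout) == :timeout,
                           1:trials)
    end
    return results
end

# Example call (script name, range, and limits are placeholders):
# scan_thread_counts("tproduct_fm.jl", 4:16; trials = 3, timeout = 1800)
```

Running the child in a separate process keeps the watchdog independent of the scheduler under test, which an in-process timer task would not be.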
Also, it'd be nice if people familiar with threading internals can review #32599 again. I think it's fair to say that this is not the most thoroughly reviewed PR. (Needless to say, this is not a complaint as I understand there are only very few people who can review it and I imagine they are very busy.)
Finally, I think it makes sense to take this bug into account in the release strategy of 1.5. Maybe revert #32599 if it cannot be fixed before the feature freeze? Or, in the 1.5-RC announcement, ask people with multi-threaded code bases to stress-test them with the RC?