PR #32599 introduced deadlocks #35441

Closed
@tkf

I think it is very likely that PR #32599 introduced a bug that causes deadlocks. I have two MWEs below.

Since these MWEs need to run for a very long time and there is a non-negligible chance that they hit the segfault fixed by #35387, I back-ported #35387 to the versions just before and after #32599 (called pre32599-pr35387 and post32599-pr35387 below).

The MWEs below cause a deadlock in post32599-pr35387 and master but not in pre32599-pr35387 (the MWEs run until the end).

(I used JULIA_NUM_THREADS=13 when running the following MWEs. The number of threads needed to trigger this deadlock seems to differ from machine to machine. @chriselrod seems to need only JULIA_NUM_THREADS=8.)
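
Because the deadlock seems to depend on the thread count, it may be worth confirming that a session really started with the intended number of threads before waiting on a long run. The following is a minimal sketch; `check_threads` is a hypothetical helper name, not something from Base or from the MWEs.

```julia
using Base.Threads: nthreads

# Hypothetical helper: report whether this session has the intended
# number of threads. JULIA_NUM_THREADS is read when Julia starts, so
# changing ENV inside a running session has no effect on nthreads().
function check_threads(expected::Integer)
    actual = nthreads()
    if actual != expected
        @warn "Thread count differs from the intended value" expected actual
    end
    return actual == expected
end

@show nthreads()
```

For example, after starting Julia with `JULIA_NUM_THREADS=13 julia`, `check_threads(13)` should return `true`.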

MWE 1

This snippet is based on the one in #35387 (comment) by @chriselrod. On my machine, a single @benchmark tproduct_fm($x) didn't cause the deadlock, so I put it in a loop. In the first run, it caused a deadlock after printing i = 84.

using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra, BenchmarkTools

function tproduct_fm(x)
    @fastmath begin
        X3 = cbrt.(x)
        X9 = cbrt.(X3).^2
    end
    lx = length(x)
    H = Matrix{eltype(x)}(undef, lx, lx)
    @sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads())
        @spawn begin
            for i in x_index_chunk
                x3 = X3[i]
                x9 = X9[i]
                js = firstindex(x):i
                @fastmath for j in js
                    A = x3 + X3[j]
                    B = x3 * X3[j]
                    C = x9 + X9[j]
                    f = A^2 * C
                    g = -(B*A)
                    H[j, i] = f * g
                end
            end
        end
    end
    return Symmetric(H)
end
x = rand(1000);

for i in 1:1000
    @show i
    @benchmark tproduct_fm($x)
end
Example backtrace after Ctrl-C
^C                                                       
signal (2): Interrupt                                   
in expression starting at /tmp/tproduct_fm.jl:32         
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475                     
poptaskref at ./task.jl:702                                  
wait at ./task.jl:709                                                    
task_done_hook at ./task.jl:444                                
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]     
jl_apply_generic at /post32599-pr35387/src/gf.c:2328            
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]          
jl_finish_task at /post32599-pr35387/src/task.c:198   
start_task at /post32599-pr35387/src/task.c:697          
unknown function (ip: (nil))                                      
unknown function (ip: (nil))                             
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]
jl_apply_generic at /post32599-pr35387/src/gf.c:2328
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]
jl_finish_task at /post32599-pr35387/src/task.c:198
start_task at /post32599-pr35387/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475
poptaskref at ./task.jl:702
...

MWE 2

This is the MWE I've reported in #35341. In the first run, it caused a deadlock after printing seed = 287.

using BenchmarkTools, ThreadsX, Random
for seed in 1:1000
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000)))
end
Example backtrace after Ctrl-C
^CERROR: InterruptException:
Stacktrace:
 [1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
 [2] wait() at ./task.jl:709
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Array{Any,1}) at ./task.jl:294
 [6] macro expansion at ./task.jl:335 [inlined]
 [7] maptasks(::ThreadsX.Implementations.var"#97#98"{Float64,Base.Order.ForwardOrdering}, ::Base.Iterators.Zip{Tuple{Base.Iterators.PartitionIterator{SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}},Base.Iterators.PartitionIterator{Array{Int8,1}}}}) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
 [8] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
 [9] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
 [10] _sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
 [11] #sort!#86 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
 [12] sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
 [13] #sort#85 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
 [14] sort at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
 [15] ##core#1405(::Array{Float64,1}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
 [16] ##sample#1406(::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
 [17] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
 [18] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
 [19] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [20] invokelatest at ./essentials.jl:709 [inlined]
 [21] #run_result#37 at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
 [22] run_result at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
 [23] top-level scope at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
 [24] top-level scope at none:1

How to solve this?

The next step for me is to try rr, as it helped a lot in fixing one of the bugs I reported in #35341.

Meanwhile, it'd be nice if core devs could try reproducing the deadlock on their machines. I know @Keno tried doing this, but I don't know if he tried different values of JULIA_NUM_THREADS. I'll also try to find some relationship between JULIA_NUM_THREADS and the average time to (or probability of) the deadlock (e.g., does it have to be within a "hotspot", or does it just have to be large enough?).
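
One way to look for such a relationship is to run an MWE script in a fresh Julia process per thread count and treat "did not exit within a time limit" as a likely deadlock. The sketch below assumes this approach; `run_with_timeout`, `sweep`, the default thread counts, and the default time limit are all hypothetical choices, not something used in the runs reported above.

```julia
# Run `script` in a fresh Julia process with `nthreads` threads.
# `exited` is false when the process had to be killed after `timeout_s`
# seconds, which is only a proxy for a deadlock (the run may just be slow).
function run_with_timeout(script::AbstractString, nthreads::Integer, timeout_s::Real)
    cmd = `$(Base.julia_cmd()) $script`
    t0 = time()
    # The child inherits the environment at spawn time, so the thread
    # count is set only for the duration of the `run` call.
    proc = withenv("JULIA_NUM_THREADS" => string(nthreads)) do
        run(pipeline(cmd; stdout = devnull, stderr = devnull); wait = false)
    end
    status = timedwait(() -> process_exited(proc), Float64(timeout_s); pollint = 0.5)
    exited = status === :ok
    # A hung process may not react to gentler signals, so use SIGKILL.
    exited || kill(proc, Base.SIGKILL)
    # `clean` separates a normal exit from a crash such as a segfault.
    clean = exited && success(proc)
    return (nthreads = nthreads, exited = exited, clean = clean, elapsed = time() - t0)
end

# Try each thread count once; the defaults are illustrative only.
function sweep(script::AbstractString; thread_counts = 2:2:16, timeout_s = 3600.0)
    return [run_with_timeout(script, n, timeout_s) for n in thread_counts]
end
```

Calling `sweep("/tmp/tproduct_fm.jl")` would then return one named tuple per thread count. Since the deadlock is probabilistic, each count would probably need several repetitions before the elapsed times say anything about a "hotspot".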

Also, it'd be nice if people familiar with threading internals can review #32599 again. I think it's fair to say that this is not the most thoroughly reviewed PR. (Needless to say, this is not a complaint as I understand there are only very few people who can review it and I imagine they are very busy.)

Finally, I think it makes sense to take this bug into account in the release strategy for 1.5. Maybe revert #32599 if it cannot be fixed before the feature freeze? Or, in the 1.5-RC announcement, ask people with multi-threaded code bases to stress-test them with the RC?
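
A stress test of this kind could follow the same pattern as the MWEs: run a threaded workload many times and print the iteration number before each run, so that the last printed line shows where a hang occurred. The following is only a sketch; `stress` and `parallel_sum` are hypothetical names, and the workload is a stand-in for whatever multi-threaded code is being tested.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical harness: call `work()` repeatedly. The iteration number is
# printed and flushed first, so it is visible even if `work()` never returns.
function stress(work; iterations::Integer = 1000)
    for i in 1:iterations
        println("iteration = ", i)
        flush(stdout)
        work()
    end
    return iterations
end

# Stand-in workload: sum chunks of `xs` in separate tasks, using roughly
# two chunks per thread as in MWE 1.
function parallel_sum(xs::AbstractVector)
    chunk_len = max(1, length(xs) ÷ (2 * nthreads()))
    tasks = [@spawn sum(chunk) for chunk in Iterators.partition(xs, chunk_len)]
    return sum(fetch.(tasks))
end
```

For example, `stress(() -> parallel_sum(collect(1:1_000_000)))` runs the stand-in workload 1000 times. Whether such a simple workload triggers the deadlock at all is not known; the MWEs above needed many iterations and a specific thread count.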
