Skip to content

Rare deadlock and segmentation fault in a parallel quicksort implementation (with an MWE) #35341

Closed
@tkf

Description

@tkf

I have a trouble with finding a cause of a rare deadlock in my parallel quicksort. I tried my best, with a big help from @chriselrod, to find out the problem but I still have no idea how to fix it.

I am reporting this here since I think there is some chance that there is a bug in julia (scheduler? compiler?). (In other words I'm hopelessly lost now and all I can do is "blame" the upstream.)

MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Then run the following code eventually causes a deadlock when JULIA_NUM_THREADS is appropriate (for me it's 13 for @chriselrod it's 8 but not 7):

using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end

I've tried git bisect with this script and found that #32599 changes the way the problem(s) manifest:

With f5dbc47, I observed this segmentation fault:

seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....

signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817

However, I don't understand how this function partition_sizes! can cause a segfault.

@chriselrod also observed another segfault with 65b8e7e:

signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811

It is puzzling for me that how the code involved in the above stacktrace can cause a segfault:

- at ./int.jl:52
unsafe_length at ./range.jl:517
unsafe_indices at ./abstractarray.jl:99
_indices_sub at ./subarray.jl:409
axes at ./subarray.jl:404
axes1 at ./abstractarray.jl:95
eachindex at ./abstractarray.jl:267
eachindex at ./abstractarray.jl:270
eachindex at ./abstractarray.jl:260
partition_sizes! at ThreadsX.jl/src/quicksort.jl:163

We (me and @chriselrod) tried to get a better error report by running the MWE with --check-bounds=yes. But we couldn't get a hang or segfault even after running the MWE for hours.

I highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build ASAN-enabled julia with help from @vchuravy #35338, hoping that it gives us some information. But I'm not sure how much I can discover by doing this as I don't know anything about LLVM or C++.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions