Description
I'm having trouble finding the cause of a rare deadlock in my parallel quicksort. I tried my best, with a lot of help from @chriselrod, to track down the problem, but I still have no idea how to fix it. I am reporting it here since I think there is some chance that there is a bug in Julia (scheduler? compiler?). (In other words, I'm hopelessly lost now and all I can do is "blame" the upstream.)
MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Running the following code eventually causes a deadlock when `JULIA_NUM_THREADS` is appropriate (for me it's 13; for @chriselrod it's 8 but not 7):
```julia
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
```
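For reference, this is roughly how I launch the reproducer (the file name `mwe.jl` is just my setup; adjust `JULIA_NUM_THREADS` to whatever triggers the hang on your machine):

```shell
# Save the snippet above as mwe.jl, then run it with the thread count
# that reproduces the hang (13 for me, 8 for @chriselrod).
export JULIA_NUM_THREADS=13
julia mwe.jl
```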
I've tried `git bisect` with this script and found that #32599 changes the way the problem(s) manifest:
- After "threads: further optimize scheduler based on #32551" #32599 (dc46ddd): running the MWE eventually hangs (once it hung within 18 minutes; another time it kept running for more than 30 minutes before hanging).
- Before "threads: further optimize scheduler based on #32551" #32599 (f5dbc47): running the MWE does not cause the hang.
- In both cases, the MWE can cause a segmentation fault, but it is very rare; i.e., the MWE can run for more than one hour without a problem.
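For completeness, the bisect procedure was roughly the following (a sketch; the actual driver script, which builds julia at each step and runs the MWE under a timeout, is not shown here):

```shell
# Sketch of the bisect run over the commits above.
git bisect start
git bisect bad dc46ddd     # MWE eventually hangs at this commit
git bisect good f5dbc47    # MWE does not hang at this commit
# then, after building julia and testing the MWE at each candidate commit:
#   git bisect good    # if the MWE ran cleanly
#   git bisect bad     # if it hung
```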
With f5dbc47, I observed this segmentation fault:
```
seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....

signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817
```
However, I don't understand how this function, `partition_sizes!`, can cause a segfault.
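For context, the access pattern in that frame is, as far as I can tell from the trace (`setindex!` on a `SubArray` at quicksort.jl:167), just element-wise writes into per-task views of a locally allocated array. A simplified illustration of that kind of access (this is NOT the actual ThreadsX source, just a sketch of the pattern):

```julia
# Simplified illustration (not the real partition_sizes!): each task writes
# only into its own disjoint view, and setindex! on a SubArray is
# bounds-checked by default, which is why a segfault here is so surprising.
xs = rand(0:0.01:1, 1_000_000)
flags = Vector{Bool}(undef, length(xs))
pivot = 0.5
chunks = Iterators.partition(eachindex(xs), cld(length(xs), Threads.nthreads()))
@sync for idx in chunks
    Threads.@spawn begin
        v = view(flags, idx)           # this task's disjoint slice
        for (j, i) in enumerate(idx)
            v[j] = xs[i] < pivot       # plain, bounds-checked setindex!
        end
    end
end
```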
@chriselrod also observed another segfault with 65b8e7e:
```
signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811
```
It is puzzling to me how the code involved in the above stacktrace can cause a segfault.
We (@chriselrod and I) tried to get a better error report by running the MWE with `--check-bounds=yes`. But we couldn't get a hang or segfault even after running the MWE for hours.
I would highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build an ASAN-enabled julia with help from @vchuravy (#35338), hoping that it gives us some information. But I'm not sure how much I can discover by doing this, as I don't know anything about LLVM or C++.