Segmentation fault with Distributed when --threads is set #54253

Open
@Socob

Description

I’m getting segmentation faults in worker processes when using Distributed with --threads passed to the workers (via exeflags), even though the code never actually uses any of those threads (see the MWE below). This is a serious problem for hybrid distributed- and shared-memory parallelization.

$ julia test.jl
start
      From worker 12:	
      From worker 12:	[58424] signal (11.1): Segmentation fault
      From worker 12:	in expression starting at none:1
      From worker 12:	Allocations: 101999211 (Pool: 93311196; Big: 8688015); GC: 1591
Worker 12 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:947
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:955
 [3] unsafe_read
   @ ./io.jl:774 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:773
 [5] read!
   @ ./io.jl:775 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
ERROR: LoadError: ProcessExitedException(12)
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:219
 [4] macro expansion
   @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:203 [inlined]
 [5] main()
   @ Main ~/test.jl:9
 [6] top-level scope
   @ ~/keeper/Documents/docs/postdocs/work/parity_violation/analytic4PC/run3.jl:30
in expression starting at ~/test.jl:29

Using the commented-out exeflags line instead (without --threads), I don’t get any segmentation faults.

The segfault seems to depend on the number of worker processes: with only a few workers, it is not triggered (or at least not consistently). It also doesn’t appear immediately, but only after a non-deterministic amount of time. The details may be machine-specific, but I’ve reproduced this on several different machines.

I have no explanation for this, since I don’t see how merely setting the number of Julia threads would affect this code.
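
One sanity check that might help narrow this down (a minimal sketch, not part of the original report) is to confirm that each worker really starts with the requested thread count. The worker and thread counts here are arbitrary small values chosen for illustration, not the ones from the MWE, and the variable name `thread_counts` is made up for this example.

```julia
using Distributed

# Assumption: 2 workers with 2 threads each, purely for illustration.
addprocs(2; exeflags=`--startup-file=no --threads=2`)

# Ask every worker how many threads are in its default thread pool.
thread_counts = Dict(w => remotecall_fetch(Threads.nthreads, w) for w in workers())
println(thread_counts)

# Shut the workers down again.
rmprocs(workers())
```

If the reported counts match the `--threads` value, the flag is reaching the workers as intended, which would be consistent with the segfault occurring even though the extra threads sit idle.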


  1. The output of versioninfo():
    Julia Version 1.10.2
    Commit bd47eca2c8a (2024-03-01 10:14 UTC)
    Build Info:
      Official https://julialang.org/ release
    Platform Info:
      OS: Linux (x86_64-linux-gnu)
      CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
    Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
    
  2. How you installed Julia: juliaup
  3. A minimal working example (MWE), also known as a minimum reproducible example:
    using Distributed
    
    function main()
        arr = zeros(1000, 10000)
        arr .= 1.0
        println("start"); flush(stdout)
        @everywhere workers() begin
            # dummy calculation
            arr = $arr
            for i in 1:size(arr, 2)
                sum(
                    sum(1.1 .* @view arr[:, i])
                    for _ in 1:5000
                )
            end
        end
        println("DONE"); flush(stdout)
    end
    
    addprocs(
        15;
        # results in segfault
        exeflags=`--startup-file=no --threads=16`
        # no segfault!
        # exeflags=`--startup-file=no`
    )
    main()

Metadata

    Labels

    multithreading (Base.Threads and related functionality), parallelism (Parallel or distributed computation)
