Description
I’m getting segmentation faults when using Distributed while passing --threads to Julia, even when I’m not actually using any of those threads (see the MWE below). Needless to say, this is a huge problem when doing hybrid distributed- and shared-memory parallelization!
    $ julia test.jl
    start
          From worker 12:
          From worker 12:    [58424] signal (11.1): Segmentation fault
          From worker 12:    in expression starting at none:1
          From worker 12:    Allocations: 101999211 (Pool: 93311196; Big: 8688015); GC: 1591
    Worker 12 terminated.
    Unhandled Task ERROR: EOFError: read end of file
    Stacktrace:
     [1] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
       @ Base ./stream.jl:947
     [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
       @ Base ./stream.jl:955
     [3] unsafe_read
       @ ./io.jl:774 [inlined]
     [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
       @ Base ./io.jl:773
     [5] read!
       @ ./io.jl:775 [inlined]
     [6] deserialize_hdr_raw
       @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
     [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
     [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
     [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
    ERROR: LoadError: ProcessExitedException(12)
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:448
     [2] macro expansion
       @ ./task.jl:480 [inlined]
     [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
       @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:219
     [4] macro expansion
       @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/macros.jl:203 [inlined]
     [5] main()
       @ Main ~/test.jl:9
     [6] top-level scope
       @ ~/keeper/Documents/docs/postdocs/work/parity_violation/analytic4PC/run3.jl:30
    in expression starting at ~/test.jl:29
Using the commented line instead (without --threads), I’m not getting any segmentation faults.
Whether the segfault is triggered does seem to depend on the number of worker processes: with a small number of workers, the issue is not triggered (or at least not consistently). It also doesn’t appear immediately, but only after some non-deterministic time. The details may be machine-specific, but I’ve reproduced this on several different machines.
I don’t have an explanation to offer, since I don’t see how merely setting the number of Julia threads would affect this code.
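One thing that can be ruled out separately is whether the workers really start with the requested thread count. The following is a minimal sketch of such a check, not part of the MWE: `worker_thread_counts` is a hypothetical helper name, and the small worker and thread counts are arbitrary choices to keep the check cheap.

```julia
using Distributed

# Hypothetical helper (not part of the MWE): start `n_workers` workers with
# `--threads=n_threads`, ask each one how many threads it sees, then remove
# the workers again.
function worker_thread_counts(n_workers::Integer, n_threads::Integer)
    pids = addprocs(n_workers; exeflags=`--startup-file=no --threads=$n_threads`)
    try
        # Threads.nthreads() is evaluated on the worker, not on the master.
        return Dict(pid => remotecall_fetch(Threads.nthreads, pid) for pid in pids)
    finally
        rmprocs(pids)
    end
end

counts = worker_thread_counts(2, 2)
for (pid, n) in sort(collect(counts))
    println("worker $pid: $n thread(s)")
end
```

If every worker reports the requested count, the flag is being applied as expected, and the crash is presumably tied to the extra threads existing rather than to the user code using them.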
- The output of versioninfo():
      Julia Version 1.10.2
      Commit bd47eca2c8a (2024-03-01 10:14 UTC)
      Build Info:
        Official https://julialang.org/ release
      Platform Info:
        OS: Linux (x86_64-linux-gnu)
        CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
        WORD_SIZE: 64
        LIBM: libopenlibm
        LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
      Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
- How you installed Julia: juliaup
- A minimal working example (MWE), also known as a minimum reproducible example:
      using Distributed

      function main()
          arr = zeros(1000, 10000)
          arr .= 1.0

          println("start"); flush(stdout)
          @everywhere workers() begin
              # dummy calculation
              arr = $arr
              for i in 1:size(arr, 2)
                  sum( sum(1.1 .* @view arr[:, i]) for _ in 1:5000 )
              end
          end
          println("DONE"); flush(stdout)
      end

      addprocs(
          15;

          # results in segfault
          exeflags=`--startup-file=no --threads=16`

          # no segfault!
          # exeflags=`--startup-file=no`
      )

      main()