Opened on Jan 2, 2019
In this issue, I will give a brief overview of the details in this Discourse topic.
First, the environment is a Slurm cluster with a head node and 18 compute nodes. The root cause, which took me two days to find, was that the head node had v1.0.1 while the compute nodes were still on 0.6. That was essentially the whole problem, but I'll start from the beginning.
It all started when I ran addprocs(SlurmManager(1)) and got a cryptic exception. The worker julia binary that was launched on the compute node crashed with the following exception (written to stdout/stderr):
julia_worker:9009#172.16.x.x
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
After two days of debugging and a million println statements to trace the function calls, I realized the problem was in the process_hdr function, which is called from a try/catch block in message_handler_loop. The process_hdr function verifies that the version of the launched worker matches that of the master process.
    if length(version) < HDR_VERSION_LEN
        println("about to throw an error")
        error("Version read failed. Connection closed by peer.")
    end
If the version check fails, an error is thrown with a meaningful message. If I had seen this error, it would have saved me quite a bit of time. Since this function is called in message_handler_loop, shouldn't the error be propagated to the catch block? I.e.:
    function message_handler_loop(r_stream::IO, w_stream::IO, incoming::Bool)
        try
            version = process_hdr(r_stream, incoming) ## ERROR IS THROWN HERE.
            ...
        catch e
            if wpid < 1
                println(stderr, e, CapturedException(e, catch_backtrace()))
                println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
            elseif !(wpid in map_del_wrkr)
                ...
            end
            ...
        end
    end
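To make the expectation concrete, here is a minimal standalone sketch of the pattern I had in mind. It does not use the real Distributed internals: check_hdr, handle, and the value of HDR_VERSION_LEN are hypothetical stand-ins chosen for illustration. It only shows that an error raised inside the try block reaches the catch block with its original message intact.

```julia
# Hypothetical stand-ins; these are NOT the real Distributed internals.
const HDR_VERSION_LEN = 16  # assumed value, for illustration only

# Mimics the length check in process_hdr.
function check_hdr(version::AbstractString)
    if length(version) < HDR_VERSION_LEN
        error("Version read failed. Connection closed by peer.")
    end
    return version
end

# Mimics the try/catch structure of message_handler_loop.
function handle(version::AbstractString)
    try
        check_hdr(version)
        return "ok"
    catch e
        # Render the caught exception as a string instead of printing to stderr.
        return sprint(showerror, e)
    end
end

msg = handle("0.6")   # too short, so check_hdr throws
println(msg)
```

In this sketch the caught exception is the one thrown by check_hdr, which is what I expected to see from the real code as well.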
In particular, in message_handler_loop:
    println(stderr, e, CapturedException(e, catch_backtrace()))
    println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
The e here should be the error thrown above. However, what gets printed is a cryptic exception about convert failing while trying to put a symbol into an array of tuples. I don't know where that error is coming from. My guess is that the launched worker binary, which is v0.6, crashes when trying to communicate (likely due to a breaking change between 0.6 and 1.0). But even if the worker binary crashes, shouldn't the master process still be able to print the correct error? Instead, since the master process couldn't connect within 60 seconds, it simply terminates the worker and prints Worker x terminated.
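For reference, the same kind of exception can be produced in isolation. The sketch below is not the Distributed code path; it only reproduces the operation named in the stack trace (setindex! of a Symbol into an Array{Tuple,1}), with the variable names being my own.

```julia
# Storing a Symbol in a Vector{Tuple} makes setindex! call
# convert(Tuple, :all_to_all), which has no applicable method.
slots = Vector{Tuple}(undef, 1)

err = try
    slots[1] = :all_to_all
    nothing
catch e
    e
end

println(typeof(err))
```

This suggests the worker was handed a Symbol in a position where it expected a Tuple, which would be consistent with the two versions disagreeing on the message layout, although I have not confirmed that.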
I can't really reproduce this anymore since, as of this morning, the sysadmin has upgraded all nodes to 1.0.3. However, I hope I've provided enough information for someone who knows their way around Distributed to provide some input.
Edit: Maybe #28878 fixes this issue, but unfortunately I won't be able to test it (it would require someone to put 0.6 back on the compute nodes, which our administration won't allow).
Edit 2: A way to reproduce this is to deliberately throw an error from process_hdr without running the if statement. I can do that if more information is needed.