error messages getting lost in async methods in Distributed

In this issue, I give a brief overview of the details in this [Discourse topic](https://discourse.julialang.org/t/there-is-a-bug-in-this-function-and-i-cant-figure-out-what-it-is/19150/6).

The environment is a Slurm cluster with a head node and 18 compute nodes. The root cause, which took me two days to find, was a version mismatch: the head node had Julia v1.0.1 while the compute nodes were still on 0.6. But I'll start from the beginning.

It all started when running `addprocs(SlurmManager(1))` threw a cryptic exception. The worker Julia binary that was launched on the compute node crashed with the following exception (written to stdout/stderr):
```
julia_worker:9009#172.16.x.x
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
```
After two days of debugging and a million `println` statements to trace the calls, I realized the problem was in the `process_hdr` function, which is called from a `try/catch` block in `message_handler_loop`. The `process_hdr` function verifies that the version of the launched worker matches that of the master process.

```
    if length(version) < HDR_VERSION_LEN
        println("about to throw an error")
        error("Version read failed. Connection closed by peer.")
    end
```
If the version check fails, an `error` is thrown with a meaningful message. Had I seen that message, it would have saved me quite a bit of time. Since `process_hdr` is called inside `message_handler_loop`, shouldn't this error propagate to the `catch` branch? That is:

```
function message_handler_loop(r_stream::IO, w_stream::IO, incoming::Bool)
    try
        version = process_hdr(r_stream, incoming) ## ERROR IS THROWN HERE.
        ...
    catch e
        if wpid < 1
            println(stderr, e, CapturedException(e, catch_backtrace()))
            println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
        elseif !(wpid in map_del_wrkr)
            ...
        end
        ...
    end
end
```

In particular, these lines in `message_handler_loop`:
```
println(stderr, e, CapturedException(e, catch_backtrace()))
println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
```

The `e` here should be the error thrown above. Instead, it is a cryptic `MethodError` from `convert`, raised while trying to store a `Symbol` in an array of tuples (the `setindex!` call in the trace). I don't know where that error is coming from. My guess is that the launched worker binary, which is v0.6, crashes when trying to communicate (likely due to a breaking change between 0.6 and 1.0). But even if the worker binary crashes, shouldn't the master process be able to print the correct error? Since the master process couldn't connect within 60 seconds, it simply terminates the worker and prints `Worker x terminated.`
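
For context, here is a minimal sketch of how an error thrown inside an `@async` task can go unreported unless something waits on that task. This is a general illustration, not code from `Distributed`, and I am not claiming it is what happened here; `failing_handler` is a hypothetical stand-in.

```julia
# Hypothetical stand-in for a handler that fails; not code from Distributed.
failing_handler() = error("Version read failed. Connection closed by peer.")

# Fire-and-forget: the task fails, but nothing reports the error by itself.
t = @async failing_handler()

# Waiting on the task surfaces the failure. On Julia 1.3 and later, `wait`
# wraps it in a TaskFailedException; earlier versions rethrow the original.
msg = try
    wait(t)
    "no error"
catch e
    sprint(showerror, e)
end

println(msg)
```

If nothing ever calls `wait` or `fetch` on `t`, the message stays inside the failed task object and is never printed.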

I can't really reproduce this anymore since as of this morning the sysadmin has upgraded all nodes to 1.0.3. However, I hope I've provided enough information for someone who knows their way around `Distributed` to provide some input. 

Edit: Maybe https://github.com/JuliaLang/julia/pull/28878 fixes this issue, but unfortunately I won't be able to test it (it would require someone to put 0.6 back on the compute nodes, which our administration won't allow).

Edit 2: A way to reproduce this is to deliberately throw an error from `process_hdr`, bypassing the `if` check. I can do that if more information is needed.
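
A rough, self-contained sketch of what I would expect the `catch` branch to print in that case, runnable without a cluster. Everything here is a hypothetical stand-in: `process_hdr_sketch`, `handler_loop_sketch`, and `HDR_VERSION_LEN_SKETCH` only mimic the shape of the snippets above and are not the real `Distributed` functions.

```julia
# Hypothetical stand-ins; the real functions live in the Distributed stdlib
# and have different bodies.
const HDR_VERSION_LEN_SKETCH = 16   # assumed header length, illustration only

function process_hdr_sketch(s::IO)
    # Read at most HDR_VERSION_LEN_SKETCH bytes from the stream.
    version = read(s, HDR_VERSION_LEN_SKETCH)
    if length(version) < HDR_VERSION_LEN_SKETCH
        error("Version read failed. Connection closed by peer.")
    end
    return version
end

function handler_loop_sketch(r_stream::IO, err::IO)
    try
        process_hdr_sketch(r_stream)
    catch e
        # Print the exception that actually reached this catch block.
        showerror(err, e)
        println(err)
    end
end

err = IOBuffer()
handler_loop_sketch(IOBuffer("short"), err)   # header shorter than expected
output = String(take!(err))
print(output)
```

In this simplified setting the meaningful message does reach the `catch` branch, which is the behavior I expected from the real `message_handler_loop`.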

Issue #30558, opened by affans on Jan 2, 2019
