
error messages getting lost in async methods in Distributed #30558

Open

Description

In this issue I will give a brief overview of the details from this Discourse topic.

First, the environment: a Slurm cluster with a head node and 18 compute nodes. The root of my two-day problem was a version mismatch: the head node ran Julia v1.0.1 while the compute nodes were still on v0.6. But I'll start from the beginning.

It all started when, on running addprocs(SlurmManager(1)), a cryptic exception was thrown. The worker julia binary launched on the compute node crashed with the following output (written to stdout/stderr):

julia_worker:9009#172.16.x.x
MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549)CapturedException(MethodError(convert, (Tuple, :all_to_all), 0x0000000000005549), Any[(setindex!(::Array{Tuple,1}, ::Symbol, ::Int64) at array.jl:583, 1), ((::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at event.jl:73, 1)])
Process(1) - Unknown remote, closing connection.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

After two days of debugging and a million println statements to trace the call path, I realized the problem was in the process_hdr function, which is called from a try/catch block in message_handler_loop. process_hdr verifies that the version of the launched worker matches that of the master process:

   if length(version) < HDR_VERSION_LEN
        println("about to throw an error")
        error("Version read failed. Connection closed by peer.")
    end

If the version check fails, an error with a meaningful message is thrown. Had I seen this message, it would have saved me quite a bit of time. Since process_hdr is called inside the try block of message_handler_loop, shouldn't this error propagate to the catch clause? I.e.

function message_handler_loop(r_stream::IO, w_stream::IO, incoming::Bool)
    try
        version = process_hdr(r_stream, incoming) ## ERROR IS THROWN HERE.
        ...
    catch e
        if wpid < 1
            println(stderr, e, CapturedException(e, catch_backtrace()))
            println(stderr, "Process($(myid())) - Unknown remote, closing connection.")
        elseif !(wpid in map_del_wrkr)
            ...
        end
        ...
    end
end

In particular, in message_handler_loop:

println(stderr, e, CapturedException(e, catch_backtrace()))
println(stderr, "Process($(myid())) - Unknown remote, closing connection.")

The e here should be the error thrown above. However, it's a cryptic MethodError about convert trying to put a Symbol into an array of tuples, and I don't know where that error is coming from. My guess is that the launched worker binary, which is v0.6, crashes when trying to communicate (likely due to a breaking change between 0.6 and 1.0). But even if the worker binary crashes, shouldn't the master process still print the correct error? Since the master process couldn't connect within 60 seconds, it simply terminates the worker and prints Worker x terminated.
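For reference, the MethodError in the captured output above (convert(Tuple, :all_to_all) inside setindex!) is the class of failure you get when a Symbol is assigned into a Vector{Tuple}. This is an illustration of the error's shape only, not the actual Distributed code path:

```julia
# Illustration only: reproduces the same class of MethodError seen in the
# CapturedException, where a Symbol is stored into a Vector{Tuple}.
v = Vector{Tuple}(undef, 1)

err = try
    v[1] = :all_to_all   # no convert(::Type{Tuple}, ::Symbol) method exists
    nothing
catch e
    e
end

println(err isa MethodError)
```

This suggests the cryptic exception comes from some internal bookkeeping code choking on unexpected data from the 0.6 worker, rather than from process_hdr itself.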

I can't reproduce this anymore, since as of this morning the sysadmin has upgraded all nodes to 1.0.3. However, I hope I've provided enough information for someone who knows their way around Distributed to weigh in.

Edit: Maybe #28878 fixes this issue, but unfortunately I won't be able to test it (it would require someone to put 0.6 back on the compute nodes, which our administration won't allow).

Edit 2: A way to reproduce this is to deliberately throw an error from process_hdr without running the if statement. I can do that if more information is needed.
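For what it's worth, the propagation itself does work in isolation. A minimal sketch (fake_process_hdr and fake_handler_loop are hypothetical stand-ins, not the Distributed internals) shows the ErrorException from the version check reaching the catch clause:

```julia
# Hypothetical stand-in for process_hdr: always fails the version check.
fake_process_hdr() = error("Version read failed. Connection closed by peer.")

function fake_handler_loop()
    try
        fake_process_hdr()
    catch e
        # Mirrors the println(stderr, e, ...) path in message_handler_loop:
        # render the caught exception to a string.
        return sprint(showerror, e)
    end
end

msg = fake_handler_loop()
println(msg)
```

So if the meaningful message is lost in practice, something else must be raising a different exception before (or instead of) this one inside the real try block.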
