
Error: "attempt to send to unknown socket" after rmprocs #53780

Open

Description

I intermittently encounter the error "attempt to send to unknown socket" when put!-ing into a RemoteChannel after calling rmprocs.

Here is the reproducer: https://gist.github.com/JBlaschke/8965e70acf52700605ac9db4af7eaf62 -- it works as follows:

  1. Start 2 processes: addprocs(2)
  2. Set up two RemoteChannels: ch_in and ch_out for inputs and outputs.
  3. Start worker processes that take! from ch_in and put! a result into ch_out.
  4. put! a bunch of data into ch_in
  5. rmprocs(3)
  6. put! a bunch of data into ch_in

You should see the following output:

[2, 3]
      From worker 2:	hi there, I'm running on pid=2
      From worker 3:	hi there, I'm running on pid=3
Taken: 3
Taken: 2
      From worker 2:	hi there, I'm running on pid=2
      From worker 3:	hi there, I'm running on pid=3
Taken: 4
Taken: 5
[2]
┌ Error: Fatal error on process 1
│   exception =
│    attempt to send to unknown socket
│    Stacktrace:
│     [1] error(s::String)
│       @ Base ./error.jl:35
│     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed ~/local/juliaup/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:99
│     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed ~/local/juliaup/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:115
│     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Int64)
│       @ Distributed ~/local/juliaup/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:102
│     [5] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
│       @ Distributed ~/local/juliaup/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:295
└ @ Distributed ~/local/juliaup/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:106
      From worker 2:	hi there, I'm running on pid=2
Taken: 7
      From worker 2:	hi there, I'm running on pid=2
Taken: 8
      From worker 2:	hi there, I'm running on pid=2
Taken: 9

Note that the error does not recur after occurring once.

This bug is intermittent, so some of the timings are undoubtedly tuned to my system. However, I found that waiting between rmprocs(3) and the next put! does not change the behaviour.
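
For readers who cannot open the gist, steps 1 to 6 above can be sketched roughly as follows. This is not the gist's code: the function name work_loop, the channel capacity, the input values, and the x + 1 "result" are all placeholders chosen for illustration, and the sleep before draining ch_out is an assumption made so the sketch does not block if an item never comes back.

```julia
using Distributed

# Step 1: start two workers (pids 2 and 3 in a fresh session).
pids = addprocs(2)
@everywhere using Distributed
println(workers())

# Step 2: input and output channels, both hosted on the master process.
ch_in  = RemoteChannel(() -> Channel{Int}(32))
ch_out = RemoteChannel(() -> Channel{Int}(32))

# Step 3: hypothetical worker loop; the real gist may compute something else.
@everywhere function work_loop(ch_in, ch_out)
    while true
        x = take!(ch_in)
        println("hi there, I'm running on pid=$(myid())")
        put!(ch_out, x + 1)   # placeholder result
    end
end

for p in workers()
    remote_do(work_loop, p, ch_in, ch_out)
end

# Step 4: feed some inputs and collect the results.
for x in 1:4
    put!(ch_in, x)
end
for _ in 1:4
    println("Taken: ", take!(ch_out))
end

# Step 5: remove the last worker (rmprocs(3) in a fresh session).
victim = last(pids)
rmprocs(victim)
println(workers())

# Step 6: feed more inputs. Drain with isready instead of a fixed number of
# take! calls, so the sketch does not hang if a result is never delivered.
for x in 5:8
    put!(ch_in, x)
end
sleep(2)
while isready(ch_out)
    println("Taken: ", take!(ch_out))
end
```

Because the error is intermittent, a single run of this sketch may or may not log it, and the printed values will differ from the output shown above since the worker computation is a placeholder.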

Background

  • versioninfo:
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
  • Julia is installed using Juliaup

Labels: parallelism (Parallel or distributed computation)
