
Restarting Lighthouse sometimes stalls due to in-use sockets #2254

Open
michaelsproul opened this issue Mar 10, 2021 · 11 comments

Comments

@michaelsproul
Member

Description

Some users have reported that Lighthouse cannot be restarted quickly due to TCP ports not being freed immediately after process exit. After a bit of research, it seems that this is a consequence of TCP's design, and that most operating systems wait 30-120 seconds after socket closure in order to avoid delayed packets being sent to a new listener. This thread has a good summary: https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux/3233022#3233022

If we establish that Lighthouse's networking stack is robust against delayed packets, we could opt into receiving them by setting the SO_REUSEADDR flag when binding TCP sockets. Actually doing this could be a bit tricky, because we might have to punch through Tokio & LibP2P's abstractions, but perhaps they already provide configuration options.
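For illustration, here's a minimal sketch of what opting in would look like at the raw-socket level, assuming the socket2 crate and a hypothetical bind_reuseaddr helper (this sits outside Tokio/libp2p's abstractions, so it only shows the flag itself, not how Lighthouse would actually wire it in):

use std::net::{SocketAddr, TcpListener};
use socket2::{Domain, Protocol, Socket, Type};

// Sketch only: bind a TCP listener with SO_REUSEADDR set, so a restarted
// process can rebind the port while the old sockets are still in TIME_WAIT.
// Whether this is safe depends on the rest of the stack tolerating delayed
// packets that were addressed to the previous process.
fn bind_reuseaddr(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let domain = if addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 };
    let socket = Socket::new(domain, Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_address(true)?; // SO_REUSEADDR
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}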

Until then, anyone who experiences issues rebinding sockets can wait out the TIME_WAIT period. You can see sockets in this state using a command like:

ss --numeric -o state time-wait

Version

v1.1.3, likely v1.2.0 as well

@michaelsproul
Member Author

This may also be the source of some CI failures:

Mar 01 06:56:30.744 CRIT Unable to listen on libp2p address      listen_multiaddr: /ip4/0.0.0.0/tcp/42426, error: Other(Custom { kind: Other, error: Other(A(A(Underlying(Os { code: 98, kind: AddrInUse, message: "Address already in use" })))) }), service: libp2p, service: node_2
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "Failed to start network: Error(Libp2p(Msg(\"Libp2p was unable to listen on the given listen address.\")), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })"', testing/simulator/src/eth1_sim.rs:223:41
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

https://github.com/sigp/lighthouse/runs/2001822504?check_suite_focus=true

@pawanjay176
Member

Turns out libp2p has SO_REUSEADDR set for all its TCP sockets by default (https://github.com/libp2p/rust-libp2p/blob/master/transports/tcp/src/lib.rs#L330).

From the socket docs for SO_REUSEADDR

When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address.

We always bind to 0.0.0.0 in libp2p, and I think that's why we sometimes see AddrInUse errors in spite of our sockets having the SO_REUSEADDR option set.

From what I have gathered from the Stack Overflow thread you linked above and https://stackoverflow.com/a/14388707/4914568, we can rebind the socket on 0.0.0.0:port only after the closing socket at least reaches the TIME_WAIT state.

Maybe ensuring that we close all libp2p listener streams (https://docs.rs/libp2p/0.35.1/libp2p/swarm/struct.ExpandedSwarm.html#method.remove_listener) before dropping libp2p would mean we don't leave any hanging connections?
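As a rough sketch of that idea (hypothetical shutdown code, not what Lighthouse does today; remove_listener is the method linked above, and the exact calling convention and return type vary across libp2p versions):

// Hypothetical: keep the ListenerIds handed back by `listen_on` so every
// listener can be removed explicitly before the swarm is dropped, rather
// than relying on process exit to close the underlying sockets.
let listen_addr: Multiaddr = "/ip4/0.0.0.0/tcp/9000".parse().expect("valid multiaddr");
let listener_id = swarm.listen_on(listen_addr).expect("can listen");
// (on some libp2p versions these are associated functions, e.g.
// `Swarm::listen_on(&mut swarm, listen_addr)`)

// ... run the node until shutdown ...

// Close the listener before dropping the swarm.
let _ = swarm.remove_listener(listener_id);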

@remyroy
Contributor

remyroy commented Apr 6, 2021

Properly closing all connections and using SO_REUSEADDR would really be useful for an update process with minimal downtime (stopping services, updating binary, restarting services).

@remyroy
Contributor

remyroy commented Apr 14, 2021

I just upgraded to v1.3.0 and I did not have this issue compared to recent previous versions. Was this recently fixed?

@michaelsproul changed the title from "Consider using SO_REUSEADDR for TCP sockets" to "Restarting Lighthouse sometimes stalls due to in-use sockets" on Apr 14, 2021
@michaelsproul
Member Author

I just upgraded to v1.3.0 and I did not have this issue compared to recent previous versions. Was this recently fixed?

Not as far as I'm aware. As @pawanjay176 pointed out we're already using SO_REUSEADDR, so there must be something else at play (and maybe only in some cases, like if there are certain connections in use when the process exits).

We could maybe try SO_REUSEPORT instead, as it seems to be the more modern flag. Or as Pawan says, see if we can trace the issue to an unclosed libp2p stream.
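For reference, with the socket2 crate the SO_REUSEPORT variant is just a different flag on the same socket (sketch only, with a hypothetical bind_reuseport helper; set_reuse_port is Unix-only and gated behind socket2's "all" feature, and on Linux it additionally lets multiple live sockets share the port, which SO_REUSEADDR does not):

use std::net::{SocketAddr, TcpListener};
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: same shape as the SO_REUSEADDR example above, but setting
// SO_REUSEPORT instead. Unix-only.
#[cfg(unix)]
fn bind_reuseport(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let domain = if addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 };
    let socket = Socket::new(domain, Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_port(true)?; // SO_REUSEPORT
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}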

@AgeManning
Member

We've had a number of version updates since this and I've not witnessed this issue. Let's re-open if it arises again.

@michaelsproul
Member Author

Several users have reported seeing this recently so I'm going to reopen it (related GitHub issue: #3500).

@karalabe

karalabe commented Oct 4, 2023

Still borked

Oct 04 06:42:28.585 CRIT Failed to start beacon node             reason: Failed to start network: Error(Libp2p(Msg("Io(Os { code: 48, kind: AddrInUse, message: \"Address already in use\" })")), State { next_error: None, backtrace: InternalBacktrace { backtrace: Some(   0: backtrace::capture::Backtrace::new_unresolved
   1: error_chain::backtrace::imp::InternalBacktrace::new
   2: <error_chain::State as core::default::Default>::default
   3: <lighthouse_network::types::error::Error as core::convert::From<alloc::string::String>>::from
   4: lighthouse_network::service::Network<AppReqId,TSpec>::new::{{closure}}
   5: beacon_node::ProductionBeaconNode<E>::new::{{closure}}
   6: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
   7: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   8: <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll
   9: tokio::runtime::task::core::Core<T,S>::poll
  10: tokio::runtime::task::harness::Harness<T,S>::poll
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  12: tokio::runtime::context::scoped::Scoped<T>::set
  13: tokio::runtime::context::runtime::enter_runtime
  14: tokio::runtime::scheduler::multi_thread::worker::run
  15: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  16: tokio::runtime::task::harness::Harness<T,S>::poll
  17: tokio::runtime::blocking::pool::Inner::run
  18: std::sys_common::backtrace::__rust_begin_short_backtrace
  19: core::ops::function::FnOnce::call_once{{vtable.shim}}
  20: std::sys::unix::thread::Thread::new::thread_start
  21: __pthread_joiner_wake
) } })

I stopped Lighthouse half a day ago and can't start it back up; it just keeps failing with this.

@michaelsproul
Member Author

@karalabe Half a day seems too long for socket reuse. Are you 100% sure you don't have something else listening on 9000/9001? Since v4.5.0 we also listen on 9001 UDP for QUIC.

If there's something listening, it should show under lsof:

sudo lsof -nP -iTCP -sTCP:LISTEN
sudo lsof -nP -iUDP

@karalabe

karalabe commented Oct 4, 2023

This seems weird. There's a "thing" on macOS where a UDP port sometimes survives the owner process terminating... and there seems to be no way of closing it afterwards.

With netstat, I can indeed see UDP 9000 open:

udp4  309463      0  *.9000                                        *.*

However, lsof does not find port 9000 assigned to any process.

Seems this thing is a recurring issue on macOS and there's no real answer anywhere: https://stackoverflow.com/questions/40512274/release-udp-port-used-by-dead-process-on-os-x

@karalabe

karalabe commented Oct 4, 2023

Hah, dafuq ethereum/go-ethereum#18443
