
Restarting Lighthouse sometimes stalls due to in-use sockets #2254

Open
michaelsproul opened this issue Mar 10, 2021 · 11 comments

Comments

@michaelsproul
Member

Description

Some users have reported that Lighthouse cannot be restarted quickly due to TCP ports not being freed immediately after process exit. After a bit of research, it seems that this is a consequence of TCP's design, and that most operating systems wait 30-120 seconds after socket closure in order to avoid delayed packets being sent to a new listener. This thread has a good summary: https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux/3233022#3233022

If we establish that Lighthouse's networking stack is robust against delayed packets, we could opt into receiving them by setting the SO_REUSEADDR flag when binding TCP sockets. Actually doing this could be a bit tricky, because we might have to punch through Tokio & LibP2P's abstractions, but perhaps they already provide configuration options.
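For illustration, here's a minimal sketch of what opting in would look like at the raw-socket level, assuming the socket2 crate and a hypothetical bind_reuseaddr helper (this sits outside Tokio/libp2p's abstractions, so it only shows the flag itself, not how Lighthouse would actually wire it in):

use std::net::{SocketAddr, TcpListener};
use socket2::{Domain, Protocol, Socket, Type};

// Sketch only: bind a TCP listener with SO_REUSEADDR set, so a restarted
// process can rebind the port while the old sockets are still in TIME_WAIT.
// Whether this is safe depends on the rest of the stack tolerating delayed
// packets that were addressed to the previous process.
fn bind_reuseaddr(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let domain = if addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 };
    let socket = Socket::new(domain, Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_address(true)?; // SO_REUSEADDR
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}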

Until then, anyone who experiences issues rebinding sockets can wait out the TIME_WAIT period. You can see sockets in this state using a command like:

ss --numeric -o state time-wait

Version

v1.1.3, likely v1.2.0 as well

@michaelsproul
Member Author

This may also be the source of some CI failures:

Mar 01 06:56:30.744 CRIT Unable to listen on libp2p address      listen_multiaddr: /ip4/0.0.0.0/tcp/42426, error: Other(Custom { kind: Other, error: Other(A(A(Underlying(Os { code: 98, kind: AddrInUse, message: "Address already in use" })))) }), service: libp2p, service: node_2
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "Failed to start network: Error(Libp2p(Msg(\"Libp2p was unable to listen on the given listen address.\")), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })"', testing/simulator/src/eth1_sim.rs:223:41
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

https://github.com/sigp/lighthouse/runs/2001822504?check_suite_focus=true

@pawanjay176
Member

Turns out libp2p has SO_REUSEADDR set for all its TCP sockets by default (https://github.com/libp2p/rust-libp2p/blob/master/transports/tcp/src/lib.rs#L330).

From the socket docs for SO_REUSEADDR

When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address.

We always bind to 0.0.0.0 in libp2p, and I think that's why we sometimes see AddrInUse errors in spite of our sockets having the SO_REUSEADDR option set.

From what I have gathered from the Stack Overflow thread you linked above and https://stackoverflow.com/a/14388707/4914568, we can rebind the socket on 0.0.0.0:port only after the closing socket at least reaches the TIME_WAIT state.

Maybe ensuring that we close all libp2p listener streams (https://docs.rs/libp2p/0.35.1/libp2p/swarm/struct.ExpandedSwarm.html#method.remove_listener) before dropping libp2p would mean we don't leave any hanging connections?
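As a rough sketch of that idea (hypothetical shutdown code, not what Lighthouse does today; remove_listener is the method linked above, and the exact calling convention and return type vary across libp2p versions):

// Hypothetical: keep the ListenerIds handed back by `listen_on` so every
// listener can be removed explicitly before the swarm is dropped, rather
// than relying on process exit to close the underlying sockets.
let listen_addr: Multiaddr = "/ip4/0.0.0.0/tcp/9000".parse().expect("valid multiaddr");
let listener_id = swarm.listen_on(listen_addr).expect("can listen");
// (on some libp2p versions these are associated functions, e.g.
// `Swarm::listen_on(&mut swarm, listen_addr)`)

// ... run the node until shutdown ...

// Close the listener before dropping the swarm.
let _ = swarm.remove_listener(listener_id);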

@remyroy
Contributor

remyroy commented Apr 6, 2021

Properly closing all connections and using SO_REUSEADDR would really be useful for an update process with minimal downtime (stopping services, updating binary, restarting services).

@remyroy
Contributor

remyroy commented Apr 14, 2021

I just upgraded to v1.3.0 and I did not have this issue compared to recent previous versions. Was this recently fixed?

@michaelsproul changed the title from "Consider using SO_REUSEADDR for TCP sockets" to "Restarting Lighthouse sometimes stalls due to in-use sockets" on Apr 14, 2021
@michaelsproul
Member Author

I just upgraded to v1.3.0 and I did not have this issue compared to recent previous versions. Was this recently fixed?

Not as far as I'm aware. As @pawanjay176 pointed out we're already using SO_REUSEADDR, so there must be something else at play (and maybe only in some cases, like if there are certain connections in use when the process exits).

We could maybe try SO_REUSEPORT instead, as it seems to be the more modern flag. Or as Pawan says, see if we can trace the issue to an unclosed libp2p stream.
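For reference, with the socket2 crate the SO_REUSEPORT variant is just a different flag on the same socket (sketch only, with a hypothetical bind_reuseport helper; set_reuse_port is Unix-only and gated behind socket2's "all" feature, and on Linux it additionally lets multiple live sockets share the port, which SO_REUSEADDR does not):

use std::net::{SocketAddr, TcpListener};
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: same shape as the SO_REUSEADDR example above, but setting
// SO_REUSEPORT instead. Unix-only.
#[cfg(unix)]
fn bind_reuseport(addr: SocketAddr) -> std::io::Result<TcpListener> {
    let domain = if addr.is_ipv4() { Domain::IPV4 } else { Domain::IPV6 };
    let socket = Socket::new(domain, Type::STREAM, Some(Protocol::TCP))?;
    socket.set_reuse_port(true)?; // SO_REUSEPORT
    socket.bind(&addr.into())?;
    socket.listen(1024)?;
    Ok(socket.into())
}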

@AgeManning
Member

We've had a number of version updates since this and I've not witnessed this issue. Let's re-open if it arises again.

@michaelsproul
Member Author

Several users have reported seeing this recently so I'm going to reopen it (related GitHub issue: #3500).

@karalabe

karalabe commented Oct 4, 2023

Still borked

Oct 04 06:42:28.585 CRIT Failed to start beacon node             reason: Failed to start network: Error(Libp2p(Msg("Io(Os { code: 48, kind: AddrInUse, message: \"Address already in use\" })")), State { next_error: None, backtrace: InternalBacktrace { backtrace: Some(   0: backtrace::capture::Backtrace::new_unresolved
   1: error_chain::backtrace::imp::InternalBacktrace::new
   2: <error_chain::State as core::default::Default>::default
   3: <lighthouse_network::types::error::Error as core::convert::From<alloc::string::String>>::from
   4: lighthouse_network::service::Network<AppReqId,TSpec>::new::{{closure}}
   5: beacon_node::ProductionBeaconNode<E>::new::{{closure}}
   6: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
   7: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   8: <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll
   9: tokio::runtime::task::core::Core<T,S>::poll
  10: tokio::runtime::task::harness::Harness<T,S>::poll
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  12: tokio::runtime::context::scoped::Scoped<T>::set
  13: tokio::runtime::context::runtime::enter_runtime
  14: tokio::runtime::scheduler::multi_thread::worker::run
  15: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  16: tokio::runtime::task::harness::Harness<T,S>::poll
  17: tokio::runtime::blocking::pool::Inner::run
  18: std::sys_common::backtrace::__rust_begin_short_backtrace
  19: core::ops::function::FnOnce::call_once{{vtable.shim}}
  20: std::sys::unix::thread::Thread::new::thread_start
  21: __pthread_joiner_wake
) } })

I stopped Lighthouse half a day ago and can't start it back up; it just keeps failing with this.

@michaelsproul
Member Author

@karalabe Half a day seems too long for socket reuse. Are you 100% sure you don't have something else listening on 9000/9001? Since v4.5.0 we also listen on 9001 UDP for QUIC.

If there's something listening, it should show under lsof:

sudo lsof -nP -iTCP -sTCP:LISTEN
sudo lsof -nP -iUDP

@karalabe

karalabe commented Oct 4, 2023

This seems weird. There's a "thing" on macOS where a UDP port sometimes survives the owner process terminating... and there seems to be no way of closing it afterwards.

With netstat, I can indeed see UDP 9000 open:

udp4  309463      0  *.9000                                        *.*

However, lsof does not find port 9000 assigned to any process.

Seems this thing is a recurring issue on macOS and there's no real answer anywhere: https://stackoverflow.com/questions/40512274/release-udp-port-used-by-dead-process-on-os-x

@karalabe

karalabe commented Oct 4, 2023

Hah, dafuq ethereum/go-ethereum#18443
