
rust-v0.53 WebSocket quicksink panic (intermittent) #19

@dhuseby

Description


Repo: libp2p/rust-libp2p
Commit: b7914e407da34c99fb76dcc300b3d44b9af97fac
Transport: ws, Secure: tls, Muxer: yamux

Summary

rust-v0.53 panics in WebSocket quicksink when go-v0.40 dials via ws/tls/yamux. The listener crashes with "SinkImpl::poll_ready called after error" during TLS/yamux session teardown. The dialer successfully completes its handshake and measurement (exit code 0), but the listener crashes, causing the test to fail.

This bug is intermittent. It failed in the first test run (transport-f89ec4b4-162830-14-02-2026, 2026-02-14) but passed in a subsequent run with the same configuration. Reproducibility depends on the timing of yamux session teardown.

Failing tests (1)

  • go-v0.40 x rust-v0.53 (ws, tls, yamux)

Error output

thread 'tokio-runtime-worker' panicked at transports/websocket/src/quicksink.rs:159:30:
SinkImpl::poll_ready called after error.

Stack trace

quicksink::SinkImpl::poll_ready
  -> framed::Connection::poll_ready
  -> AsyncWrite::poll_write_vectored
  -> futures_rustls::Stream::write_io
  -> yamux::frame::io::Io::poll_ready

Root cause analysis

The panic occurs in quicksink::SinkImpl::poll_ready at transports/websocket/src/quicksink.rs:159. The quicksink state machine transitions to an error state when the TLS/yamux layer encounters a connection close, but the async runtime schedules another poll_ready call before the stream is dropped. This is a classic "use after error" bug in async Rust sink implementations.
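The hazard can be reduced to a minimal sketch (all names hypothetical, not the actual quicksink code): a hand-rolled sink state machine that records an error state but assumes no further polls will arrive afterwards. The runtime's second poll_ready then hits the "impossible" arm and panics with the same message seen in the crash.

```rust
// Minimal sketch of the "use after error" hazard (types and names are
// illustrative, not the real quicksink implementation).
#[derive(Debug)]
enum SinkState {
    Open,
    Errored,
}

struct SinkImpl {
    state: SinkState,
}

impl SinkImpl {
    fn poll_ready(&mut self) -> Result<(), &'static str> {
        match self.state {
            SinkState::Open => {
                // Simulate the TLS/yamux layer reporting a closed connection.
                self.state = SinkState::Errored;
                Err("connection closed")
            }
            // v0.53-style behavior: assume no poll arrives after an error,
            // so panic instead of propagating the stored error.
            SinkState::Errored => panic!("SinkImpl::poll_ready called after error"),
        }
    }
}

fn main() {
    let mut sink = SinkImpl {
        state: SinkState::Open,
    };
    // First poll observes the connection error.
    assert!(sink.poll_ready().is_err());
    // A second poll, as scheduled by the async runtime during teardown,
    // panics instead of returning the error again.
    let second = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| sink.poll_ready()));
    assert!(second.is_err());
}
```

Whether the second poll ever happens depends on task scheduling during teardown, which is why the bug is intermittent.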

Why only go-v0.40

The timing window is narrow. go-v0.40 likely has slightly different connection teardown timing compared to other go versions (v0.38-v0.45), making it the only version that triggers the race. Other go versions either close the connection cleanly before the sink polls again, or close it fast enough that the sink task is cancelled first.

Fix recommendations

Immediate fix: In quicksink.rs:159, instead of panicking when poll_ready is called after error, return Poll::Ready(Err(..)) with the cached error. This converts the panic into a graceful error propagation.

Proper fix: The quicksink module should track error state and short-circuit all subsequent poll calls with the stored error, similar to how futures::SinkExt handles post-error states:

fn poll_ready(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
    // Short-circuit: if an error was stored by an earlier poll, return it
    // instead of panicking. (Assumes Self: Unpin; otherwise use pin
    // projection to reach the field.)
    let this = self.get_mut();
    if let Some(err) = this.cached_error.take() {
        // After the error has been delivered once, the sink should move to
        // a terminal state so later polls also fail rather than panic.
        return Poll::Ready(Err(err));
    }
    // ... normal poll logic
}

Upstream status: This was fixed in later rust-libp2p versions (v0.54+), which replaced the custom quicksink with futures::SinkExt. The fix is only relevant for v0.53 compatibility.

Workaround

Since the dialer (go-v0.40) actually completes successfully (exit code 0) and only the listener panics during teardown, the test framework could potentially check dialer success independently. However, the listener panic causes docker-compose --abort-on-container-exit to report failure.
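A possible framework-side mitigation is to read the dialer container's exit code directly rather than letting the listener's panic fail the whole run. This is a sketch only; the service name "dialer" is hypothetical and depends on the test harness's compose file.

```shell
# Sketch: gate pass/fail on the dialer's exit code instead of on
# --abort-on-container-exit propagating the listener panic.
# The "dialer" service name is an assumption about the compose file.
docker-compose up --abort-on-container-exit
dialer_status=$(docker-compose ps -q dialer | xargs docker inspect --format '{{.State.ExitCode}}')
if [ "$dialer_status" -eq 0 ]; then
    echo "dialer handshake and measurement succeeded"
fi
```

This would only mask the symptom for reporting purposes; the listener panic itself still needs the quicksink fix above.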

Notes

This is a race condition in the WebSocket transport sink state machine. After a TLS or yamux error occurs, the quicksink module's poll_ready is called again when it should not be. Only the specific combination of go-v0.40 as dialer triggers this, suggesting timing-dependent behavior in session shutdown.
