Skip to content

websocket/stream: Fix unexpected EOF on Poll::Pending state poisoning #327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 19, 2025

Conversation

lexnv
Copy link
Collaborator

@lexnv lexnv commented Feb 18, 2025

This PR updates the following dependencies:

  • tokio tungstenite
  • URL crate to facilitate the new tungstenite version

The tokio tungstenite update changed the API of webscoket messages, which motivated the following changes:

  • Whenpoll_ready or poll_flush returned Poll::Pending, the websocket state would forever remain poisoned.
    • This has the side effect of returning UnexpectedEof on consecutive calls, which led to the connection being terminated
  • The state machine of the websocket is simplified to just 3 states
  • Writer buffer and pointer are merged under one BytesMut instance
  • poll_read implementation is simplified as well to make the code more readable

lexnv added 12 commits February 17, 2025 20:13
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
…em allocs

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this Feb 18, 2025
Copy link
Collaborator

@dmitry-markin dmitry-markin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch with a poisoned state being left!

@lexnv lexnv merged commit 6e40fbb into master Feb 19, 2025
8 checks passed
@lexnv lexnv deleted the lexnv/websocket-update branch February 19, 2025 15:58
lexnv added a commit that referenced this pull request Feb 20, 2025
## [0.9.1] - 2025-01-19

This release enhances compatibility between litep2p and libp2p by
backporting the latest Yamux updates. Additionally, it includes various
improvements and fixes to boost the stability and performance of the
WebSocket stream and the multistream-select protocol.

### Changed

- yamux: Switch to upstream implementation while keeping the controller
API ([#320](#320))
- req-resp: Replace SubstreamSet with FuturesStream
([#321](#321))
- cargo: Bring up to date multiple dependencies
([#324](#324))
- build(deps): bump hickory-proto from 0.24.1 to 0.24.3
([#323](#323))
- build(deps): bump openssl from 0.10.66 to 0.10.70
([#322](#322))

### Fixed

- websocket/stream: Fix unexpected EOF on `Poll::Pending` state
poisoning ([#327](#327))
- websocket/stream: Avoid memory allocations on flushing
([#325](#325))
- multistream-select: Enforce `io::error` instead of empty protocols
([#318](#318))
- multistream: Do not wait for negotiation in poll_close
([#319](#319))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Feb 20, 2025
This PR updates litep2p to version 0.9.1. The yamux config is entirely
removed to mirror the libp2p yamux upstream version.
While at it, I had to bump indexmap and URL as well. 


## [0.9.1] - 2025-01-19

This release enhances compatibility between litep2p and libp2p by using
the latest Yamux upstream version. Additionally, it includes various
improvements and fixes to boost the stability and performance of the
WebSocket stream and the multistream-select protocol.

### Changed

- yamux: Switch to upstream implementation while keeping the controller
API ([#320](paritytech/litep2p#320))
- req-resp: Replace SubstreamSet with FuturesStream
([#321](paritytech/litep2p#321))
- cargo: Bring up to date multiple dependencies
([#324](paritytech/litep2p#324))
- build(deps): bump hickory-proto from 0.24.1 to 0.24.3
([#323](paritytech/litep2p#323))
- build(deps): bump openssl from 0.10.66 to 0.10.70
([#322](paritytech/litep2p#322))

### Fixed

- websocket/stream: Fix unexpected EOF on `Poll::Pending` state
poisoning ([#327](paritytech/litep2p#327))
- websocket/stream: Avoid memory allocations on flushing
([#325](paritytech/litep2p#325))
- multistream-select: Enforce `io::error` instead of empty protocols
([#318](paritytech/litep2p#318))
- multistream: Do not wait for negotiation in poll_close
([#319](paritytech/litep2p#319))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-actions bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Feb 20, 2025
This PR updates litep2p to version 0.9.1. The yamux config is entirely
removed to mirror the libp2p yamux upstream version.
While at it, I had to bump indexmap and URL as well.

## [0.9.1] - 2025-01-19

This release enhances compatibility between litep2p and libp2p by using
the latest Yamux upstream version. Additionally, it includes various
improvements and fixes to boost the stability and performance of the
WebSocket stream and the multistream-select protocol.

### Changed

- yamux: Switch to upstream implementation while keeping the controller
API ([#320](paritytech/litep2p#320))
- req-resp: Replace SubstreamSet with FuturesStream
([#321](paritytech/litep2p#321))
- cargo: Bring up to date multiple dependencies
([#324](paritytech/litep2p#324))
- build(deps): bump hickory-proto from 0.24.1 to 0.24.3
([#323](paritytech/litep2p#323))
- build(deps): bump openssl from 0.10.66 to 0.10.70
([#322](paritytech/litep2p#322))

### Fixed

- websocket/stream: Fix unexpected EOF on `Poll::Pending` state
poisoning ([#327](paritytech/litep2p#327))
- websocket/stream: Avoid memory allocations on flushing
([#325](paritytech/litep2p#325))
- multistream-select: Enforce `io::error` instead of empty protocols
([#318](paritytech/litep2p#318))
- multistream: Do not wait for negotiation in poll_close
([#319](paritytech/litep2p#319))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
(cherry picked from commit 42e9de7)
teor2345 pushed a commit to autonomys/polkadot-sdk that referenced this pull request Mar 24, 2025
…ech#7640) - conflicts resolved using --strategy-option=theirs

This PR updates litep2p to version 0.9.1. The yamux config is entirely
removed to mirror the libp2p yamux upstream version.
While at it, I had to bump indexmap and URL as well.

This release enhances compatibility between litep2p and libp2p by using
the latest Yamux upstream version. Additionally, it includes various
improvements and fixes to boost the stability and performance of the
WebSocket stream and the multistream-select protocol.

- yamux: Switch to upstream implementation while keeping the controller
API ([paritytech#320](paritytech/litep2p#320))
- req-resp: Replace SubstreamSet with FuturesStream
([paritytech#321](paritytech/litep2p#321))
- cargo: Bring up to date multiple dependencies
([paritytech#324](paritytech/litep2p#324))
- build(deps): bump hickory-proto from 0.24.1 to 0.24.3
([paritytech#323](paritytech/litep2p#323))
- build(deps): bump openssl from 0.10.66 to 0.10.70
([paritytech#322](paritytech/litep2p#322))

- websocket/stream: Fix unexpected EOF on `Poll::Pending` state
poisoning ([paritytech#327](paritytech/litep2p#327))
- websocket/stream: Avoid memory allocations on flushing
([paritytech#325](paritytech/litep2p#325))
- multistream-select: Enforce `io::error` instead of empty protocols
([paritytech#318](paritytech/litep2p#318))
- multistream: Do not wait for negotiation in poll_close
([paritytech#319](paritytech/litep2p#319))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added a commit that referenced this pull request May 22, 2025
This PR greatly improves the WebSocket connection stability by relying
on the interval buffers of tungstenite instead of buffering at a higher
level. The fix passes through the messages to the tungstenite socket
directly.

This is a long-lasting issue (reproducible on all older versions
silently with IO errors) that manifested as a decryption error after the
state fixes:
- #325
- #327

Issue context:
- node is under stress due to handling multiple substreams
- the issue affected only long running WebSocket substreams and
manifested as an IO error from crypto/noise decoding
- tungstenite `WebSocketStream` already has a 128KiB buffer for writing
- litep2p has a **redundant** 8 KiB buffer for writing 
- litep2p buffered internally multiple packets, tunstenite accepted the
batch. I expect this creates a wrongly framed packet that fails to
decode at the crypto/noise level


## Investigation

We have noted several errors that manifested as crypto/nosie decoding
failures on our Kusama validators:
- paritytech/polkadot-sdk#8525

```rust
litep2p::crypto::noise: failed to decrypt message error=Decrypt
```

Upon further investigation, the errors affected only WebSocket
connections. The issue could be reproduced by running a local node in
Kusama with more than 500 peers in and out. As well as running
subp2p-explorer with adjusted protocols:

```yaml
2025-05-15T14:58:08.095961Z ERROR {peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::crypto::noise: failed to decrypt message for bigger buffers error=Decrypt peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu")

2025-05-15T14:58:08.096419Z DEBUG 
{peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::websocket::connection: connection closed with error peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu") error=Decode(Io(Custom { kind: Other, error: "failed to decrypt message bigger buffers: decrypt error 12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu" }))
```



The issue also reproduced on the zombinet PR, which uses litep2p:
- paritytech/polkadot-sdk#8461

```yaml
2025-05-14 09:37:30.805  INFO tokio-runtime-worker sync: Warp sync is complete, continuing with state sync.    

2025-05-14 09:37:33.189 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:33.283 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:34.764 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
	
2025-05-14 09:37:35.656  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (0 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 707.8kiB/s ⬆ 0.5kiB/s    
	
2025-05-14 09:37:40.657  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (3 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 1.0kiB/s ⬆ 1.0kiB/s    
```

## Testing Done

### Performance

Tested the performance with litep2p-perf using the following branch:
-
https://github.com/lexnv/litep2p-perf/compare/lexnv/websocket-tests?expand=1

| Status     | Data Size | Time (s) | Bandwidth (Mbit/s) |
|------------|-----------|----------|-------------------|
| **Before** |           |          |                   |
| Uploaded   | 256.00 MiB| 15.1152  | 135.49            |
| Downloaded | 256.00 MiB| 13.2296  | 154.80            |
| **After**  |           |          |                   |
| Uploaded   | 256.00 MiB| 15.7178  | 130.30            |
| Downloaded | 256.00 MiB| 13.2435  | 154.64            |

From the performance table, we are within 3% of the original buggy
implementation. I would lean towards a normal variation in our results.
Therefore, the performance remains unimpacted.

### Repro Case

Have added a custom user protocol as part of our testing to filter out
these errors.

- The protocol opens 16 outbound substreams on the connection
established event. Therefore, it will handle 16 outbound substreams and
16 inbound substreams
- The outbound substreams will push a configurable number of packets,
each of size 128 bytes, to the remote peer. While the inbound substreams
will read the same number of packets from the remote peer.
 
Before this PR, the TCP was unaffected and the websocket reproduces the
decrypt failure. After this PR, the test passes.


Closes: paritytech/polkadot-sdk#8525

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants