websocket: Fix connection stability on decrypt messages (#393)
This PR greatly improves WebSocket connection stability by relying
on the internal buffers of tungstenite instead of buffering at a higher
level. The fix passes messages through to the tungstenite socket
directly.
This is a long-standing issue (reproducible on all older versions,
where it surfaced silently as IO errors) that manifested as a decryption
error after the state fixes:
- #325
- #327
Issue context:
- node is under stress due to handling multiple substreams
- the issue affected only long running WebSocket substreams and
manifested as an IO error from crypto/noise decoding
- tungstenite `WebSocketStream` already has a 128KiB buffer for writing
- litep2p has a **redundant** 8 KiB buffer for writing
- litep2p internally buffered multiple packets, and tungstenite accepted
the batch. I expect this creates a wrongly framed packet that fails to
decode at the crypto/noise level
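
To illustrate the difference between the two write paths described in the list above, here is a minimal sketch. It is a simplified model, not litep2p or tungstenite code: `RecordingSocket`, `BufferedWriter`, `PassthroughWriter`, and `message_counts` are hypothetical names, and the socket stand-in simply records each send as one message. It only shows how an extra upper-level buffer changes the message boundaries the socket sees; it does not reproduce the decrypt failure itself.

```rust
/// Hypothetical stand-in for a message-oriented socket (NOT the tungstenite API).
/// Every call to `send` is recorded as one message on the wire.
#[derive(Default)]
struct RecordingSocket {
    messages: Vec<Vec<u8>>,
}

impl RecordingSocket {
    fn send(&mut self, payload: &[u8]) {
        self.messages.push(payload.to_vec());
    }
}

/// Models a redundant upper-level buffer: packets are coalesced and handed
/// to the socket only once `capacity` bytes have accumulated.
struct BufferedWriter {
    socket: RecordingSocket,
    buffer: Vec<u8>,
    capacity: usize,
}

impl BufferedWriter {
    fn new(capacity: usize) -> Self {
        Self { socket: RecordingSocket::default(), buffer: Vec::new(), capacity }
    }

    fn write(&mut self, packet: &[u8]) {
        self.buffer.extend_from_slice(packet);
        if self.buffer.len() >= self.capacity {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            // Several packets leave as one batch, so packet boundaries are lost.
            let batch = std::mem::take(&mut self.buffer);
            self.socket.send(&batch);
        }
    }
}

/// Models the passthrough approach: each packet goes straight to the socket,
/// which is then the only layer that buffers.
#[derive(Default)]
struct PassthroughWriter {
    socket: RecordingSocket,
}

impl PassthroughWriter {
    fn write(&mut self, packet: &[u8]) {
        self.socket.send(packet);
    }
}

/// Returns the number of messages seen by the socket as (buffered, passthrough).
fn message_counts(packets: usize, packet_len: usize) -> (usize, usize) {
    let packet = vec![0xABu8; packet_len];
    let mut buffered = BufferedWriter::new(8 * 1024); // 8 KiB, as in the list above
    let mut passthrough = PassthroughWriter::default();
    for _ in 0..packets {
        buffered.write(&packet);
        passthrough.write(&packet);
    }
    buffered.flush();
    (buffered.socket.messages.len(), passthrough.socket.messages.len())
}
```

In this model, calling `message_counts(128, 128)` shows the buffered path collapsing many packets into a few large messages, while the passthrough path preserves one message per packet.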
## Investigation
We have noted several errors that manifested as crypto/noise decoding
failures on our Kusama validators:
- paritytech/polkadot-sdk#8525
```rust
litep2p::crypto::noise: failed to decrypt message error=Decrypt
```
Upon further investigation, the errors affected only WebSocket
connections. The issue could be reproduced by running a local Kusama
node with more than 500 inbound and outbound peers, as well as by running
subp2p-explorer with adjusted protocols:
```yaml
2025-05-15T14:58:08.095961Z ERROR {peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::crypto::noise: failed to decrypt message for bigger buffers error=Decrypt peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu")
2025-05-15T14:58:08.096419Z DEBUG
{peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::websocket::connection: connection closed with error peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu") error=Decode(Io(Custom { kind: Other, error: "failed to decrypt message bigger buffers: decrypt error 12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu" }))
```
The issue also reproduced on the zombienet PR, which uses litep2p:
- paritytech/polkadot-sdk#8461
```yaml
2025-05-14 09:37:30.805 INFO tokio-runtime-worker sync: Warp sync is complete, continuing with state sync.
2025-05-14 09:37:33.189 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:33.283 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:34.764 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:35.656 INFO tokio-runtime-worker substrate: ⚙️ State sync, Downloading state, 22%, 2.21 Mib (0 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 707.8kiB/s ⬆ 0.5kiB/s
2025-05-14 09:37:40.657 INFO tokio-runtime-worker substrate: ⚙️ State sync, Downloading state, 22%, 2.21 Mib (3 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 1.0kiB/s ⬆ 1.0kiB/s
```
## Testing Done
### Performance
Tested the performance with litep2p-perf using the following branch:
- https://github.com/lexnv/litep2p-perf/compare/lexnv/websocket-tests?expand=1
| Status | Data Size | Time (s) | Bandwidth (Mbit/s) |
|------------|-----------|----------|-------------------|
| **Before** | | | |
| Uploaded | 256.00 MiB| 15.1152 | 135.49 |
| Downloaded | 256.00 MiB| 13.2296 | 154.80 |
| **After** | | | |
| Uploaded | 256.00 MiB| 15.7178 | 130.30 |
| Downloaded | 256.00 MiB| 13.2435 | 154.64 |
From the performance table, upload bandwidth is within 4% of the original
buggy implementation and download bandwidth is effectively unchanged. I
attribute the difference to normal run-to-run variation, so performance
is not meaningfully impacted.
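
As a sanity check, the bandwidth column can be recomputed from the data size and time columns. The sketch below assumes the table treats 1 Mbit as 2^20 bits (so 256 MiB is 2048 Mbit), which is the convention that matches the reported figures; the function names are illustrative only.

```rust
/// Bandwidth in Mbit/s for a payload of `size_mib` MiB transferred in `seconds`.
/// Assumption: 1 Mbit = 2^20 bits, which matches the numbers in the table.
fn bandwidth_mbit(size_mib: f64, seconds: f64) -> f64 {
    size_mib * 8.0 / seconds
}

/// Relative change from `before` to `after`, in percent (negative = slower).
fn relative_change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Upload and download change in percent, using the timings from the table.
fn table_changes() -> (f64, f64) {
    let upload = relative_change_percent(
        bandwidth_mbit(256.0, 15.1152),
        bandwidth_mbit(256.0, 15.7178),
    );
    let download = relative_change_percent(
        bandwidth_mbit(256.0, 13.2296),
        bandwidth_mbit(256.0, 13.2435),
    );
    (upload, download)
}
```

`table_changes()` returns the upload and download deltas between the "Before" and "After" rows, which makes it easy to compare them against the expected run-to-run noise.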
### Repro Case
We added a custom user protocol as part of our testing to isolate
these errors.
- The protocol opens 16 outbound substreams on the connection
established event. Therefore, it will handle 16 outbound substreams and
16 inbound substreams
- The outbound substreams push a configurable number of packets,
each 128 bytes in size, to the remote peer, while the inbound substreams
read the same number of packets from the remote peer.
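
The traffic pattern of the test protocol can be sketched in isolation as follows. This is not the actual test code and does not use the litep2p API: in-process channels stand in for substreams, only one direction is modelled, and `run_simulation`, `SUBSTREAMS`, and `PACKET_SIZE` are hypothetical names chosen for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

const SUBSTREAMS: usize = 16; // substreams opened per connection
const PACKET_SIZE: usize = 128; // bytes per packet

/// Simulates one direction of the test: `SUBSTREAMS` writers each push
/// `packets` packets of `PACKET_SIZE` bytes, and a matching reader consumes
/// and validates them. Returns the total number of bytes received.
fn run_simulation(packets: usize) -> usize {
    let mut writers = Vec::new();
    let mut readers = Vec::new();

    for id in 0..SUBSTREAMS {
        // A channel stands in for a single substream (assumption for this sketch).
        let (tx, rx) = mpsc::channel::<Vec<u8>>();

        writers.push(thread::spawn(move || {
            for _ in 0..packets {
                tx.send(vec![id as u8; PACKET_SIZE]).expect("reader is alive");
            }
        }));

        readers.push(thread::spawn(move || {
            let mut received = 0usize;
            for _ in 0..packets {
                let packet = rx.recv().expect("writer is alive");
                // A corrupted or misframed packet would trip these checks.
                assert_eq!(packet.len(), PACKET_SIZE);
                assert!(packet.iter().all(|&byte| byte == id as u8));
                received += packet.len();
            }
            received
        }));
    }

    for writer in writers {
        writer.join().expect("writer panicked");
    }
    readers
        .into_iter()
        .map(|reader| reader.join().expect("reader panicked"))
        .sum()
}
```

For a given packet count, the reader side should receive exactly `SUBSTREAMS * packets * PACKET_SIZE` bytes; any shortfall or assertion failure would indicate lost or misframed data in the model.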
Before this PR, TCP was unaffected while WebSocket reproduced the
decrypt failure. After this PR, the test passes.
Closes: paritytech/polkadot-sdk#8525
---------
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>