websocket: Fix connection stability on decrypt messages #393

Merged
lexnv merged 5 commits into master from lexnv/fix-ws-stability on May 22, 2025

@lexnv (Collaborator) commented May 16, 2025

This PR greatly improves WebSocket connection stability by relying on the internal buffers of tungstenite instead of buffering at a higher level. The fix passes messages through to the tungstenite socket directly.

This is a long-standing issue. On all older versions it was reproducible but surfaced only silently as IO errors; after the state fixes it manifests as a decryption error.

Issue context:

  • node is under stress due to handling multiple substreams
  • the issue affected only long running WebSocket substreams and manifested as an IO error from crypto/noise decoding
  • tungstenite WebSocketStream already has a 128KiB buffer for writing
  • litep2p has a redundant 8 KiB buffer for writing
  • litep2p internally buffered multiple packets and tungstenite accepted the batch. I expect this creates a wrongly framed packet that fails to decode at the crypto/noise level
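
The layering described in this list can be pictured with a toy model. This is only a sketch and not litep2p or tungstenite code: `FrameSink`, `BufferedWriter`, `PassThroughWriter`, and `frame_sizes` are hypothetical names, and the sink simply records one frame per `send` call. It shows how an extra buffering layer changes the frame boundaries the lower layer sees, while a pass-through writer preserves them. Whether merged boundaries are the exact cause of the decrypt failure is the expectation stated above, not something this sketch proves.

```rust
/// Hypothetical stand-in for a message-oriented socket:
/// every `send` call becomes exactly one frame.
#[derive(Default)]
struct FrameSink {
    frames: Vec<Vec<u8>>,
}

impl FrameSink {
    fn send(&mut self, payload: &[u8]) {
        self.frames.push(payload.to_vec());
    }
}

/// Writer that keeps its own buffer on top of the sink (the redundant layer).
struct BufferedWriter {
    sink: FrameSink,
    buffer: Vec<u8>,
    capacity: usize,
}

impl BufferedWriter {
    fn new(capacity: usize) -> Self {
        Self { sink: FrameSink::default(), buffer: Vec::new(), capacity }
    }

    fn write(&mut self, packet: &[u8]) {
        // Flush first if the packet would not fit in the remaining space.
        if self.buffer.len() + packet.len() > self.capacity {
            self.flush();
        }
        self.buffer.extend_from_slice(packet);
    }

    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            // Everything buffered so far leaves as a single frame.
            let batch = std::mem::take(&mut self.buffer);
            self.sink.send(&batch);
        }
    }
}

/// Writer that hands every packet to the sink unchanged.
struct PassThroughWriter {
    sink: FrameSink,
}

impl PassThroughWriter {
    fn new() -> Self {
        Self { sink: FrameSink::default() }
    }

    fn write(&mut self, packet: &[u8]) {
        self.sink.send(packet);
    }
}

/// Returns the frame sizes produced by (buffered, pass-through) writers
/// when both are fed the same packets.
fn frame_sizes(packets: &[Vec<u8>], capacity: usize) -> (Vec<usize>, Vec<usize>) {
    let mut buffered = BufferedWriter::new(capacity);
    let mut direct = PassThroughWriter::new();
    for packet in packets {
        buffered.write(packet);
        direct.write(packet);
    }
    buffered.flush();
    let sizes = |sink: &FrameSink| sink.frames.iter().map(Vec::len).collect::<Vec<usize>>();
    (sizes(&buffered.sink), sizes(&direct.sink))
}
```

With three 128-byte packets and an 8 KiB buffer, the buffered writer in this model emits a single 384-byte frame, while the pass-through writer emits three 128-byte frames.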

Investigation

We have noted several errors that manifested as crypto/noise decoding failures on our Kusama validators:

litep2p::crypto::noise: failed to decrypt message error=Decrypt

Upon further investigation, the errors affected only WebSocket connections. The issue could be reproduced by running a local node on Kusama with more than 500 peers in and out, as well as by running subp2p-explorer with adjusted protocols:

2025-05-15T14:58:08.095961Z ERROR {peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::crypto::noise: failed to decrypt message for bigger buffers error=Decrypt peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu")

2025-05-15T14:58:08.096419Z DEBUG 
{peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::websocket::connection: connection closed with error peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu") error=Decode(Io(Custom { kind: Other, error: "failed to decrypt message bigger buffers: decrypt error 12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu" }))

The issue also reproduced on the zombienet PR, which uses litep2p:

2025-05-14 09:37:30.805  INFO tokio-runtime-worker sync: Warp sync is complete, continuing with state sync.    

2025-05-14 09:37:33.189 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:33.283 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:34.764 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
	
2025-05-14 09:37:35.656  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (0 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 707.8kiB/s ⬆ 0.5kiB/s    
	
2025-05-14 09:37:40.657  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (3 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 1.0kiB/s ⬆ 1.0kiB/s    

Testing Done

Performance

Tested the performance with litep2p-perf using the following branch:

| Status | Data Size | Time (s) | Bandwidth (Mbit/s) |
|---|---|---|---|
| Before | | | |
| Uploaded | 256.00 MiB | 15.1152 | 135.49 |
| Downloaded | 256.00 MiB | 13.2296 | 154.80 |
| After | | | |
| Uploaded | 256.00 MiB | 15.7178 | 130.30 |
| Downloaded | 256.00 MiB | 13.2435 | 154.64 |

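From the performance table, upload bandwidth is about 3.8% lower than with the original buggy implementation, and download bandwidth is essentially unchanged (about 0.1% lower). I would attribute this to normal variation in our results. Therefore, the performance remains effectively unimpacted.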
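
The comparison can be checked by hand from the table. The following is a small sketch; the helper names `relative_change_percent` and `table_bandwidth` are made up for illustration and are not part of litep2p-perf. The bandwidth column in the table matches the data size in MiB times 8 divided by the time in seconds, which is what `table_bandwidth` assumes.

```rust
/// Relative change from `before` to `after`, in percent.
/// A negative result means `after` is lower than `before`.
fn relative_change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Bandwidth as it appears in the table: size in MiB times 8 bits per byte,
/// divided by the elapsed time in seconds.
fn table_bandwidth(size_mib: f64, seconds: f64) -> f64 {
    size_mib * 8.0 / seconds
}
```

For the upload rows, `relative_change_percent(135.49, 130.30)` is roughly -3.83, and for the download rows `relative_change_percent(154.80, 154.64)` is roughly -0.10.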

Repro Case

I have added a custom user protocol as part of our testing to catch these errors.

  • The protocol opens 16 outbound substreams on the connection established event. Each peer will therefore handle 16 outbound substreams and 16 inbound substreams
  • The outbound substreams push a configurable number of packets, each 128 bytes in size, to the remote peer, while the inbound substreams read the same number of packets from the remote peer.

Before this PR, TCP was unaffected while WebSocket reproduced the decrypt failure. After this PR, the test passes.
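
The traffic shape of the repro case can be approximated in isolation. The sketch below is not the test protocol from this PR and does not use litep2p at all: it replaces substreams with in-memory `std::sync::mpsc` channels and models only one direction (16 writers, 16 readers). The function name `run_repro_shape` is hypothetical; the substream count and packet size are taken from the description above.

```rust
use std::sync::mpsc;
use std::thread;

const SUBSTREAMS: usize = 16; // outbound substreams per connection
const PACKET_SIZE: usize = 128; // bytes per packet

/// Pushes `packets` packets on each simulated substream and returns
/// the total number of bytes read on the receiving side.
fn run_repro_shape(packets: usize) -> usize {
    let mut writers = Vec::new();
    let mut readers = Vec::new();

    for id in 0..SUBSTREAMS {
        // One in-memory channel stands in for one substream.
        let (tx, rx) = mpsc::channel::<Vec<u8>>();

        // "Outbound" side: push a fixed number of fixed-size packets.
        writers.push(thread::spawn(move || {
            for _ in 0..packets {
                tx.send(vec![id as u8; PACKET_SIZE]).expect("reader is alive");
            }
        }));

        // "Inbound" side: read the same number of packets back.
        readers.push(thread::spawn(move || {
            let mut bytes = 0;
            for _ in 0..packets {
                let packet = rx.recv().expect("writer sent all packets");
                assert_eq!(packet.len(), PACKET_SIZE);
                bytes += packet.len();
            }
            bytes
        }));
    }

    for writer in writers {
        writer.join().expect("writer thread panicked");
    }
    readers
        .into_iter()
        .map(|reader| reader.join().expect("reader thread panicked"))
        .sum()
}
```

In this model, `run_repro_shape(100)` moves 16 × 100 × 128 = 204800 bytes. The real test would additionally run each packet through the Noise and WebSocket layers, which is where the failure was observed.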

Closes: paritytech/polkadot-sdk#8525

lexnv added 2 commits May 16, 2025 11:41
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this May 16, 2025
@lexnv lexnv added the bug Something isn't working label May 16, 2025
lexnv added 3 commits May 16, 2025 14:32
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv merged commit 276b190 into master May 22, 2025
8 checks passed
@lexnv lexnv deleted the lexnv/fix-ws-stability branch May 22, 2025 15:48
lexnv added a commit that referenced this pull request May 26, 2025
## [0.9.5] - 2025-05-26

This release primarily focuses on strengthening the stability of the
websocket transport. We've resolved an issue where higher-level
buffering was causing the Noise protocol to fail when decoding messages.

We've also significantly improved connectivity between litep2p and
Smoldot (the Substrate-based light client). Empty frames are now handled
correctly, preventing handshake timeouts and ensuring smoother
communication.

Finally, we've carried out several dependency updates to keep the
library current with the latest versions of its underlying components.

### Fixed

- substream/fix: Allow empty payloads with 0-length frame
([#395](#395))
- websocket: Fix connection stability on decrypt messages
([#393](#393))

### Changed

- crypto/noise: Show peerIDs that fail to decode
([#392](#392))
- cargo: Bump yamux to 0.13.5 and tokio to 1.45.0
([#396](#396))
- ci: Enforce and apply clippy rules
([#388](#388))
- build(deps): bump ring from 0.16.20 to 0.17.14
([#389](#389))
- Update hickory-resolver 0.24.2 -> 0.25.2
([#386](#386))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>