Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Banned node for "Same state requested multiple times" #4924

Open
girazoki opened this issue Jul 2, 2024 · 5 comments
Open

Banned node for "Same state requested multiple times" #4924

girazoki opened this issue Jul 2, 2024 · 5 comments

Comments

@girazoki
Copy link
Contributor

girazoki commented Jul 2, 2024

We have seen the "Banned: same state requested multiple times" using warp sync without apparent signs of the node doing anything wrong. The architecture is at follows:

  • Around roughly the same time, 3 different nodes warp sync against the same (and only) full-node.
  • They start syncing the state but at some point the full-node bans all 3 nodes with the same message: Reason: Same state request multiple times. Banned, disconnecting

Full node logs:

INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 2.7kiB/s ⬆ 4.4MiB/s
INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 0.5kiB/s ⬆ 2.7MiB/s
INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 0.9kiB/s ⬆ 4.8MiB/s
 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 2.0kiB/s ⬆ 10.1MiB/s
 INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (0 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 1.1kiB/s ⬆ 6.3MiB/s
 WARN tokio-runtime-worker peerset: [Parachain] Report 12D3KooWA4gQgEZgPka1UW5KWZYcjYMJUrpHgDWaJFJUaUUFknkx: -2147483648 to -2147483648. Reason: Same state request multiple times. Banned, disconnecting.
 WARN tokio-runtime-worker peerset: [Parachain] Report 12D3KooWKc1dwqMibdf5DYXM3B3vDxwpFXr6SKcvtJ8BSCZoehiZ: -2147483648 to -2147483648. Reason: Same state request multiple times. Banned, disconnecting.

We are wondering if the issue could be caused because the full-node cannot serve 3 nodes simultaneously asking for the latest state, which causes these 3 nodes to not receive the indicated bytes and therefore re-ask again for the same byttes, which causes the banning. Could this be any possible explanation?

From what we have seen from the substrate code, the banning happens after receiving the same request 3 times

@bkchr
Copy link
Member

bkchr commented Jul 2, 2024

CC @paritytech/networking

@girazoki
Copy link
Contributor Author

girazoki commented Jul 2, 2024

Possibly related: #531 (comment)

@lexnv
Copy link
Contributor

lexnv commented Jul 5, 2024

Possible root cause for multiple block requests, needs debug patch to test with a local net:

  • T0: When the peer is connected, the peer is added with a state to the self.peers

pub fn add_peer(&mut self, peer_id: PeerId, best_hash: B::Hash, best_number: NumberFor<B>) {

  • T1: The node makes a request to the peer setting the state to Downloading
  • T2: The remote peer takes longer than 20 secs to provide a response back
  • T3: keep-alive disconnects the peer causing us to lose the state about it

/// Notify that a sync peer has disconnected.
pub fn remove_peer(&mut self, peer_id: &PeerId) {
self.blocks.clear_peer_download(peer_id);
if let Some(gap_sync) = &mut self.gap_sync {
gap_sync.blocks.clear_peer_download(peer_id)
}
self.peers.remove(peer_id);

  • T4: We reconnect to the peer, set the state to Available

On one side, we store the peers in a HashMap (strategy/chain_sync logic):

peers: HashMap<PeerId, PeerSync<B>>,

On the other side, we store in a LRU map the previous requests:

seen_requests: LruMap<SeenRequestsKey<B>, SeenRequestsValue>,

@lexnv
Copy link
Contributor

lexnv commented Jul 5, 2024

Managed to reproduce this behavior with the following branch:

2024-07-05 19:15:40 Finished sleeping for block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu!
2024-07-05 19:15:40 Received block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu simulating sleep!
2024-07-05 19:15:40 Peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu disconnected. Removed from internal mappings
2024-07-05 19:15:40 New peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu with best hash 0x1c69…fbf9 (0)
2024-07-05 19:15:40 Inserting peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu with state PeerSync { peer_id: PeerId("12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu"), common_number: 0, best_hash: 0x1c69d379ff8920602d3ccb392e6f8fdd319b47971775f7696b17a69ca06cfbf9, best_number: 0, state: Available } (InChainPruned)


2024-07-05 19:16:00 Finished sleeping for block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu!

2024-07-05 19:16:00 Received block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu simulating sleep!
2024-07-05 19:16:00 Report 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.
2024-07-05 19:16:00 Peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu disconnected. Removed from internal mappings

Will come up with a PR to fix this next week

@girazoki
Copy link
Contributor Author

Thank you @lexnv ! appreciated!

github-merge-queue bot pushed a commit that referenced this issue Aug 5, 2024
…g the same request multiple times (#5029)

This PR avoids submitting the same block or state request multiple times
to the same slow peer.

Previously, we submitted the same request to the same slow peer, which
resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to
submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only
track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
- on the third disconnection the peer is banned, and the peer remains
banned until the peerstore decays its reputation
  
This PR lifts the pressure from overloaded nodes that cannot process
requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still
happen when:
- (a) we backoff and ban the peer 
- (b) the network does not discover other peers -- this may also be a
test net
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond



Aims to improve:
- #4924
- #531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving




This PR builds on top of:
- #4987


### Testing Done
- Added a couple of unit tests where test harness were set in place

- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this issue Aug 28, 2024
…g the same request multiple times (paritytech#5029)

This PR avoids submitting the same block or state request multiple times
to the same slow peer.

Previously, we submitted the same request to the same slow peer, which
resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to
submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only
track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
- on the third disconnection the peer is banned, and the peer remains
banned until the peerstore decays its reputation
  
This PR lifts the pressure from overloaded nodes that cannot process
requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still
happen when:
- (a) we backoff and ban the peer 
- (b) the network does not discover other peers -- this may also be a
test net
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond



Aims to improve:
- paritytech#4924
- paritytech#531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving




This PR builds on top of:
- paritytech#4987


### Testing Done
- Added a couple of unit tests where test harness were set in place

- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants