Description
Search before asking
- I had searched in the issues and found no similar issues.
Version
- Master: Kvrocks 2.10.1
- Replica: Kvrocks 2.14.0
- OS: Linux 6.12.53-69.119.amzn2023.aarch64 (Amazon Linux 2023)
Minimal reproduce step
1. Set up a master-replica configuration with:
   - High write rate (~7K ops/sec)
   - Small WAL retention: rocksdb.max_total_wal_size 1024 (1 GB)
2. Introduce network congestion or slowness on the replica side that causes it to consume replication data more slowly than the master produces it.
3. The TCP send buffer on the master fills up, and the replication feed thread blocks on write() indefinitely.
4. Wait for WAL rotation to prune old WAL files while the feed thread is still blocked.
5. When the connection eventually drops, or the thread unblocks, observe:
   - Master logs: "Fatal error encountered, WAL iterator is discrete, some seq might be lost"
   - Replica attempts psync and fails with "sequence out of range"
   - A full resync is triggered
The issue is that step 3 can last indefinitely (we observed 44 hours) with no timeout, errors, or warnings logged.
What did you expect to see?
- The master should detect when a replica falls too far behind and proactively disconnect it before the WAL is exhausted.
- Socket sends to replicas should have a timeout to prevent indefinite blocking.
- Warning logs should be emitted when replication lag grows significantly.
- When disconnected early (while the sequence is still in the WAL), the replica should be able to psync successfully on reconnect instead of requiring a full resync.
What did you see instead?
The replication feed thread blocked for 44 hours with no logs or errors:
```
I20260127 22:16:21.006304 2857 replication.cc:115] WAL was rotated, would reopen again
[... 44 hours of silence ...]
I20260129 18:36:55.603111 2857 replication.cc:115] WAL was rotated, would reopen again
E20260129 18:36:55.646749 2857 replication.cc:126] Fatal error encountered, WAL iterator is discrete, some seq might be lost, sequence 480156205527 expected, but got 481055967952
W20260129 18:36:55.646785 2857 replication.cc:84] Slave thread was terminated
```
The replica then failed to psync ("sequence out of range") and required a full resync.
Root cause: in FeedSlaveThread::loop(), the call to util::SockSend() (line 225) blocks indefinitely when the TCP send buffer is full; the underlying WriteImpl() has no timeout mechanism. While the thread is blocked, the master keeps writing and WAL files are pruned, so the replica's sequence is no longer available.
Anything Else?
Here is a graph of the situation we observed:
I've drafted a possible solution (with help from Claude Code, since I'm not a C++ developer) that addresses this issue with three components:
1. Socket send timeout: a new SockSendWithTimeout() function using poll() with a configurable timeout (default 30s)
2. Replication lag detection: check the lag at the start of each loop iteration and disconnect if it exceeds a configurable threshold (default 100M sequences)
3. Exponential backoff on reconnection: prevents rapid reconnect loops for persistently slow replicas (1s, 2s, 4s... up to 60s)
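To make the two policy checks concrete, here is a minimal sketch; the function names are illustrative, and the thresholds are the proposed defaults above, not code from the actual draft:

```cpp
#include <algorithm>
#include <cstdint>

// Lag check: disconnect a replica whose acknowledged sequence trails the
// master's latest sequence by more than max_lag (proposed default: 100M).
bool ExceedsMaxLag(uint64_t master_seq, uint64_t replica_seq,
                   uint64_t max_lag = 100'000'000) {
  return master_seq - replica_seq > max_lag;
}

// Reconnect backoff: 1s, 2s, 4s, ... capped at 60s (attempt is 0-based).
int ReconnectBackoffSecs(int attempt) {
  long long delay = 1LL << std::min(attempt, 6);  // clamp the shift to avoid overflow
  return static_cast<int>(std::min(delay, 60LL));
}
```

With the 100M default, the gap from the log above (481055967952 expected vs. 480156205527, about 900M sequences) would have tripped the lag check long before the WAL was pruned.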
New configuration options:
- max-replication-lag: maximum sequence lag before disconnecting a slow consumer
- replication-send-timeout-ms: socket send timeout in milliseconds
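For illustration, the proposed options might look like this in kvrocks.conf; both options and their defaults are part of the proposal, not settings that exist today:

```
# Proposed (not yet existing) options, shown with the suggested defaults
max-replication-lag 100000000
replication-send-timeout-ms 30000
```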
I'm happy to submit a PR with this potential fix. The changes touch:
- src/config/config.h, config.cc (new config options)
- src/common/io_util.h, io_util.cc (SockSendWithTimeout)
- src/cluster/replication.h, replication.cc (lag detection, timeout usage, backoff)
Workaround for affected users: Increase rocksdb.max_total_wal_size significantly (e.g., 16GB) to extend WAL retention and
reduce the likelihood of exhaustion
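Assuming rocksdb.max_total_wal_size is specified in MB (as the 1024 = 1 GB value above suggests), the workaround is a one-line config change:

```
# kvrocks.conf: retain ~16 GB of WAL instead of 1 GB
rocksdb.max_total_wal_size 16384
```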
Are you willing to submit a PR?
- I'm willing to submit a PR!