Description
Search before asking
- I had searched in the issues and found no similar issues.
Version
- Master: Kvrocks 2.10.1
- Replica: Kvrocks 2.14.0
- OS: Linux 6.12.53-69.119.amzn2023.aarch64 (Amazon Linux 2023)
Minimal reproduce step
1. Set up a master-replica configuration with:
   - High write rate (~7K ops/sec)
   - Small WAL retention: rocksdb.max_total_wal_size 1024 (1 GB)
2. Introduce network congestion or slowness on the replica side that causes it to consume replication data more slowly than the master produces it.
3. The TCP send buffer on the master fills up, and the replication feed thread blocks on write() indefinitely.
4. Wait for WAL rotation to prune old WAL files while the feed thread is still blocked.
5. When the connection eventually drops, or the thread unblocks, observe:
   - Master logs: "Fatal error encountered, WAL iterator is discrete, some seq might be lost"
   - Replica attempts psync and fails with "sequence out of range"
   - A full resync is triggered
The issue is that step 3 can last indefinitely (we observed 44 hours) with no timeout, errors, or warnings logged.
What did you expect to see?
- The master should detect when a replica falls too far behind and proactively disconnect it before the WAL is exhausted.
- Socket sends to replicas should have a timeout to prevent indefinite blocking.
- Warning logs should be emitted when replication lag grows significantly.
- When disconnected early (while the sequence is still in the WAL), the replica should be able to psync successfully on reconnect instead of requiring a full resync.
What did you see instead?
The replication feed thread blocked for 44 hours with no logs or errors:
```
I20260127 22:16:21.006304 2857 replication.cc:115] WAL was rotated, would reopen again
[... 44 hours of silence ...]
I20260129 18:36:55.603111 2857 replication.cc:115] WAL was rotated, would reopen again
E20260129 18:36:55.646749 2857 replication.cc:126] Fatal error encountered, WAL iterator is discrete, some seq might be lost, sequence 480156205527 expected, but got 481055967952
W20260129 18:36:55.646785 2857 replication.cc:84] Slave thread was terminated
```
The replica then failed to psync ("sequence out of range") and required a full resync.
Root cause: in FeedSlaveThread::loop(), the call to util::SockSend() (line 225) blocks indefinitely when the TCP send buffer is full; the underlying WriteImpl() has no timeout mechanism. While the thread is blocked, the master keeps writing and WAL files are pruned, so the replica's sequence is no longer available.
Anything Else?
Here is a graph of the situation we observed:
I've drafted a possible solution (with help from Claude Code, since I'm not a C++ developer) that addresses this issue with three components:
1. Socket send timeout: a new SockSendWithTimeout() function using poll() with a configurable timeout (default 30s)
2. Replication lag detection: check the lag at the start of each loop iteration and disconnect if it exceeds a configurable threshold (default 100M sequences)
3. Exponential backoff on reconnection: prevents rapid reconnect loops for persistently slow replicas (1s, 2s, 4s... up to 60s)
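To make the two policy checks concrete, here is a minimal sketch; the function names are illustrative, and the thresholds are the proposed defaults above, not code from the actual draft:

```cpp
#include <algorithm>
#include <cstdint>

// Lag check: disconnect a replica whose acknowledged sequence trails the
// master's latest sequence by more than max_lag (proposed default: 100M).
bool ExceedsMaxLag(uint64_t master_seq, uint64_t replica_seq,
                   uint64_t max_lag = 100'000'000) {
  return master_seq - replica_seq > max_lag;
}

// Reconnect backoff: 1s, 2s, 4s, ... capped at 60s (attempt is 0-based).
int ReconnectBackoffSecs(int attempt) {
  long long delay = 1LL << std::min(attempt, 6);  // clamp the shift to avoid overflow
  return static_cast<int>(std::min(delay, 60LL));
}
```

With the 100M default, the gap from the log above (481055967952 expected vs. 480156205527, about 900M sequences) would have tripped the lag check long before the WAL was pruned.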
New configuration options:
- max-replication-lag: maximum sequence lag before disconnecting a slow consumer
- replication-send-timeout-ms: socket send timeout in milliseconds
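For illustration, the proposed options might look like this in kvrocks.conf; both options and their defaults are part of the proposal, not settings that exist today:

```
# Proposed (not yet existing) options, shown with the suggested defaults
max-replication-lag 100000000
replication-send-timeout-ms 30000
```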
I'm happy to submit a PR with this potential fix. The changes touch:
- src/config/config.h, config.cc (new config options)
- src/common/io_util.h, io_util.cc (SockSendWithTimeout)
- src/cluster/replication.h, replication.cc (lag detection, timeout usage, backoff)
Workaround for affected users: Increase rocksdb.max_total_wal_size significantly (e.g., 16GB) to extend WAL retention and
reduce the likelihood of exhaustion
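Assuming rocksdb.max_total_wal_size is specified in MB (as the 1024 = 1 GB value above suggests), the workaround is a one-line config change:

```
# kvrocks.conf: retain ~16 GB of WAL instead of 1 GB
rocksdb.max_total_wal_size 16384
```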
Are you willing to submit a PR?
- I'm willing to submit a PR!