Skip to content

Comments

Fix chained replica crash when doing dual channel replication#2983

Merged
enjoy-binbin merged 3 commits intovalkey-io:unstablefrom
enjoy-binbin:dual_channel_fix
Dec 29, 2025
Merged

Fix chained replica crash when doing dual channel replication#2983
enjoy-binbin merged 3 commits intovalkey-io:unstablefrom
enjoy-binbin:dual_channel_fix

Conversation

@enjoy-binbin
Copy link
Member

There is a crash in freeReplicationBacklog:

Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, disconnectReplicas is called to disconnect all
replica clients, but since the RDB channel is protected, freeClient does not
actually free the replica client. Later, we encounter an assertion failure in
freeReplicationBacklog.

void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}

Dual channel replication was introduced in #60.

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin
Copy link
Member Author

The crash report:

### Starting server for test Chained replicas does not assert when using dual channel replication in tests/integration/dual-channel-replication.tcl
12392:M 26 Dec 2025 19:02:57.215 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
12392:M 26 Dec 2025 19:02:57.215 * Valkey version=255.255.255, bits=64, commit=3d7f4c82, modified=1, pid=12392, just started
12392:M 26 Dec 2025 19:02:57.215 * Configuration loaded
12392:M 26 Dec 2025 19:02:57.216 * monotonic clock: POSIX clock_gettime
12392:M 26 Dec 2025 19:02:57.217 # Failed to write PID file: Permission denied
                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 255.255.255 (3d7f4c82/1) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in standalone mode
 |####+'     .+#######+.     '+####|    Port: 21111
 |###|   .+###############+.   |###|    PID: 12392                     
 |###|   |#####*'' ''*#####|   |###|                                 
 |###|   |####'  .-.  '####|   |###|                                 
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io      
 |###|   |####.  '-'  .####|   |###|                                 
 |###|   |#####*.   .*#####|   |###|                                 
 |###|   '+#####|   |#####+'   |###|                                 
 |####+.     +##|   |#+'     .+####|                                 
 '#######+   |##|        .+########'                                 
    '+###|   |##|    .+########+'                                    
        '|   |####+########+'                                        
             +#########+'                                            
                '+v+'                                                

12392:M 26 Dec 2025 19:02:57.217 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
12392:M 26 Dec 2025 19:02:57.223 * Module 'lua' loaded from libvalkeylua.so
12392:M 26 Dec 2025 19:02:57.223 * Server initialized
12392:M 26 Dec 2025 19:02:57.223 * Ready to accept connections tcp
12392:M 26 Dec 2025 19:02:57.223 * Ready to accept connections unix
12392:M 26 Dec 2025 19:02:57.284 - Accepted 127.0.0.1:57191
12392:M 26 Dec 2025 19:02:57.284 - Client closed connection id=3 addr=127.0.0.1:57191 laddr=127.0.0.1:21111 fd=14 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=33856 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1
12392:M 26 Dec 2025 19:02:57.302 - Accepted 127.0.0.1:57192
### Starting test Psync established after rdb load - within grace period in tests/integration/dual-channel-replication.tcl
12392:M 26 Dec 2025 19:02:57.503 - Accepted 127.0.0.1:57195
12392:M 26 Dec 2025 19:02:57.503 * Replica 127.0.0.1:21112 asks for synchronization
12392:M 26 Dec 2025 19:02:57.503 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'b40e376f30a78a2f349b444a28d6aa16a9556124', my replication IDs are '61d4bf934b6aad3aa12823527d5fbbaaf2866b60' and '0000000000000000000000000000000000000000')
12392:M 26 Dec 2025 19:02:57.503 * Dual channel replication: Replica 127.0.0.1:21112 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
12392:M 26 Dec 2025 19:02:57.504 - Accepted 127.0.0.1:57196
12392:M 26 Dec 2025 19:02:57.505 * Replica 127.0.0.1:21112 asks for synchronization
12392:M 26 Dec 2025 19:02:57.505 * Replication backlog created, my new replication IDs are '460b2a5549de639e8b2cd4284d6e29fb5ffce744' and '0000000000000000000000000000000000000000'
12392:M 26 Dec 2025 19:02:57.505 * Starting BGSAVE for SYNC with target: replicas sockets using: dual-channel
12392:M 26 Dec 2025 19:02:57.505 * Dual channel replication: Sending to replica 127.0.0.1:21112 RDB end offset 0 and client-id 6
12392:M 26 Dec 2025 19:02:57.506 * Background RDB transfer started by pid 12490 to direct socket to replica
12392:M 26 Dec 2025 19:02:57.506 * Process is about to stop.
12490:C 26 Dec 2025 19:02:57.507 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
12490:C 26 Dec 2025 19:02:57.507 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
12392:M 26 Dec 2025 19:02:58.465 * Process has been continued.
12392:M 26 Dec 2025 19:02:58.465 * Background RDB transfer terminated with success
12392:M 26 Dec 2025 19:02:58.465 * Streamed RDB transfer with replica 127.0.0.1:21112 succeeded (socket). Waiting for REPLCONF ACK from replica to enable streaming
12392:M 26 Dec 2025 19:02:58.465 * RDB transfer completed, rdb only replica (127.0.0.1:21112) should be disconnected asap
12392:M 26 Dec 2025 19:02:58.465 - Dual channel replication: Postpone RDB client id=6 (127.0.0.1:21112) free for 60 seconds
12392:S 26 Dec 2025 19:02:58.654 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
12392:S 26 Dec 2025 19:02:58.654 * Connecting to PRIMARY 127.0.0.1:21113
12392:S 26 Dec 2025 19:02:58.654 * PRIMARY <-> REPLICA sync started
12392:S 26 Dec 2025 19:02:58.654 * REPLICAOF 127.0.0.1:21113 enabled (user request from 'id=4 addr=127.0.0.1:57192 laddr=127.0.0.1:21111 fd=14 name= user=default lib-name= lib-ver=')
12392:S 26 Dec 2025 19:02:58.655 * Non blocking connect for SYNC fired the event.
12392:S 26 Dec 2025 19:02:58.655 * Primary replied to PING, replication can continue...
12392:S 26 Dec 2025 19:02:58.655 * Trying a partial resynchronization (request 460b2a5549de639e8b2cd4284d6e29fb5ffce744:1).
12392:S 26 Dec 2025 19:02:58.655 * Full resync from primary: addef143959ae47e324a2b71198ddadfe68581d5:0
12392:S 26 Dec 2025 19:02:58.656 * Replica main thread creating Bio thread to save RDB to disk
12392:S 26 Dec 2025 19:02:58.657 * Replica bio thread: PRIMARY <-> REPLICA sync: receiving streamed RDB from primary with EOF to disk
12392:S 26 Dec 2025 19:02:58.657 * Replica bio thread: Done downloading RDB
12392:S 26 Dec 2025 19:02:58.669 - Client closed connection id=5 addr=127.0.0.1:57195 laddr=127.0.0.1:21111 fd=15 name= age=1 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=1024 rbp=61 obl=0 oll=0 omem=0 tot-mem=18496 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=388 tot-net-out=88 tot-cmds=7
12392:S 26 Dec 2025 19:02:58.669 - Accepted 127.0.0.1:57204
12392:S 26 Dec 2025 19:02:58.670 - Client closed connection id=8 addr=127.0.0.1:57204 laddr=127.0.0.1:21111 fd=15 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=33856 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=231 tot-net-out=83 tot-cmds=5
12392:S 26 Dec 2025 19:02:59.170 * Replica main thread detected RDB download completion in Bio thread
12392:S 26 Dec 2025 19:02:59.170 * Loading the RDB and finalizing primary-replica sync...
12392:S 26 Dec 2025 19:02:59.178 * Discarding previously cached primary state.


=== VALKEY BUG REPORT START: Cut & paste starting from here ===
12392:S 26 Dec 2025 19:02:59.178 # === ASSERTION FAILED ===
12392:S 26 Dec 2025 19:02:59.178 # ==> replication.c:160 'listLength(server.replicas) == 0' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010281f70c freeReplicationBacklog + 444
1   valkey-server                       0x00000001027d456c serverCron + 6256
2   valkey-server                       0x00000001027c46d0 aeProcessEvents + 688
3   valkey-server                       0x00000001027ee264 main + 19252
4   dyld                                0x00000001898bf154 start + 2476

@codecov
Copy link

codecov bot commented Dec 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.91%. Comparing base (7385586) to head (b5b6b55).
⚠️ Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2983      +/-   ##
============================================
+ Coverage     73.79%   73.91%   +0.12%     
============================================
  Files           125      125              
  Lines         69345    69349       +4     
============================================
+ Hits          51173    51260      +87     
+ Misses        18172    18089      -83     
Files with missing lines Coverage Δ
src/networking.c 88.34% <100.00%> (+<0.01%) ⬆️

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks safe to me.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Copy link
Contributor

@naglera naglera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the fix!

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Dec 28, 2025
@github-actions github-actions bot removed the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Dec 28, 2025
@enjoy-binbin enjoy-binbin merged commit 9c5d004 into valkey-io:unstable Dec 29, 2025
24 checks passed
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 8.0 Dec 29, 2025
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 8.1 Dec 29, 2025
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 9.0 Dec 29, 2025
@enjoy-binbin enjoy-binbin deleted the dual_channel_fix branch December 29, 2025 02:20
@enjoy-binbin enjoy-binbin added bug Something isn't working release-notes This issue should get a line item in the release notes labels Dec 29, 2025
jdheyburn pushed a commit to jdheyburn/valkey that referenced this pull request Jan 8, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri moved this from To be backported to 8.1.6 in Valkey 8.1 Jan 29, 2026
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 29, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Jan 30, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri moved this from To be backported to 8.0.7 in Valkey 8.0 Jan 30, 2026
zuiderkwast pushed a commit to zuiderkwast/placeholderkv that referenced this pull request Jan 30, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@zuiderkwast zuiderkwast moved this from To be backported to 9.0.2 WIP in Valkey 9.0 Jan 30, 2026
zuiderkwast pushed a commit that referenced this pull request Feb 3, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 4, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 18, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request Feb 20, 2026
…-io#2983)

There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in valkey-io#60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
madolson pushed a commit that referenced this pull request Feb 24, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
madolson pushed a commit that referenced this pull request Feb 24, 2026
There is a crash in freeReplicationBacklog:
```
Discarding previously cached primary state.
ASSERTION FAILED
'listLength(server.replicas) == 0' is not true
freeReplicationBacklog
```

The reason is that during dual channel operation, the RDB channel is protected.
In the chained replica case, `disconnectReplicas` is called to disconnect all
replica clients, but since the RDB channel is protected, `freeClient` does not
actually free the replica client. Later, we encounter an assertion failure in
`freeReplicationBacklog`.
```
void replicationAttachToNewPrimary(void) {
    /* Replica starts to apply data from new primary, we must discard the cached
     * primary structure. */
    serverAssert(server.primary == NULL);
    replicationDiscardCachedPrimary();

    /* Cancel any in progress imports (we will now use the primary's) */
    clusterCleanSlotImportsOnFullSync();

    disconnectReplicas();     /* Force our replicas to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained replicas to PSYNC. */
}
```

Dual channel replication was introduced in #60.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working release-notes This issue should get a line item in the release notes

Projects

Status: 8.0.7 (WIP)
Status: 8.1.6 WIP
Status: 9.0.2
Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants