Redis 8.2.1#19
Conversation
…s#14274) Fix redis#14267 This bug was introduced by redis#13495 ### Summary When a replica clears a large database, it periodically calls processEventsWhileBlocked() in the replicationEmptyDbCallback() callback during the key deletion process. If defragmentation is enabled, this means that active defrag can be triggered while the database is being deleted. The defragmentation process may also modify the database at this time, which could lead to crashes when the database is accessed after defragmentation. Code Path: ``` replicationEmptyDbCallback() -> processEventsWhileBlocked() -> whileBlockedCron() -> defragWhileBlocked() ``` ### Solution This PR temporarily disables active defrag before emptying the database, then restores the active defrag setting after the empty is complete. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…fter reload (redis#14276) This bug was introduced by redis#14130 found by @oranagra ### Summary Because `s->cgroup_ref` is created at runtime the first time a consumer group is linked with a message, but it is not released when all references are removed. However, after `debug reload` or restart, if the PEL is empty (meaning no consumer group is referencing any message), `s->cgroup_ref` will not be recreated. As a result, when executing XADD or XTRIM with `ACKED` option and checking whether a message that is being read but has not been ACKed can be deleted, the cgroup_ref being NULL will cause a crash. ### Code Path ``` xaddCommand -> streamTrim -> streamEntryIsReferenced ``` ### Solution Check if `s->cgroup_ref` is NULL in streamEntryIsReferenced().
WalkthroughVersion 8.2.1 point release with bug fixes and performance improvements. Changes include version constants update, release notes documentation, a defragmentation safety guard during master-replica synchronization, a stream entry optimization, and new test cases for defragmentation and stream reloading scenarios. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/replication.c`:
- Around line 1953-1962: The code saves the original defrag flag to
orig_active_defrag but restores it incorrectly by hardcoding
server.active_defrag_enabled = 1; instead of using the saved value; change the
restore to assign server.active_defrag_enabled = orig_active_defrag so the
previous state is preserved (the change is around the block that disables defrag
before calling emptyData(...) and references server.active_defrag_enabled,
orig_active_defrag, and replicationEmptyDbCallback).
In `@tests/unit/memefficiency.tcl`:
- Around line 70-76: The helper proc discard_replies_every has its condition
inverted and therefore discards replies on every iteration; change the modulus
check from "$count % $frequency != 0" to "$count % $frequency == 0" so it only
discards when the count is a multiple of frequency, and update all other
occurrences of the same pattern (the other listed line ranges) to the same
equality check to prevent premature/blocking reads in discard_replies_every.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
00-RELEASENOTESsrc/replication.csrc/t_stream.csrc/version.htests/unit/memefficiency.tcltests/unit/type/stream.tcl
🧰 Additional context used
🧬 Code graph analysis (1)
src/replication.c (1)
src/db.c (1)
emptyData(862-900)
🪛 GitHub Actions: CI
src/t_stream.c
[error] 1-1: Command failed during build: make -C src all. See error about unused variable in replication.c.
src/version.h
[error] 1-1: Command failed during build: make -C src all. See error about unused variable in replication.c.
src/replication.c
[error] 1956-1956: replication.c: In function ‘rdbLoadEmptyDbFunc’: unused variable ‘orig_active_defrag’ [-Werror=unused-variable]
🪛 GitHub Actions: Codecov
tests/unit/type/stream.tcl
[error] 25-25: XADD with ACKED option doesn't crash after DEBUG RELOAD in tests/unit/type/stream.tcl. Expected 3 but got 2 (assert_equal on XLEN).
src/replication.c
[warning] 1956-1956: Unused variable 'orig_active_defrag' in replication.c. The value is assigned but never used.
🪛 GitHub Actions: External Server Tests
src/replication.c
[error] 1956-1956: In function 'rdbLoadEmptyDbFunc': unused variable 'orig_active_defrag' -Werror=unused-variable caused the build to fail. Command failed: make REDIS_CFLAGS=-Werror.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Analyze (cpp)
- GitHub Check: code-coverage
- GitHub Check: code-coverage
🔇 Additional comments (4)
src/t_stream.c (1)
2708-2711: The NULL check prevents crashes after DEBUG RELOAD.The fix correctly guards against NULL pointer dereference when
cgroups_refis NULL (e.g., after DEBUG RELOAD with an empty PEL). Returning 1 (entry is referenced) is the safe/conservative choice — it prevents potential data loss at the cost of possibly not trimming entries that could technically be deleted.This is an acceptable trade-off for a crash fix. If precise trimming behavior is needed post-reload, a follow-up could improve the logic to return 0 when the PEL is definitively empty.
00-RELEASENOTES (1)
22-40: Release notes accurately document the fixes.The new 8.2.1 section correctly captures the bug fixes from this PR:
#14274aligns with the defrag disable during replica flush fix#14276aligns with the XADD/XTRIM crash fix after RDB loadThe format and structure are consistent with existing release notes.
src/version.h (1)
1-3: Version bump is correct.The version constants are correctly updated:
REDIS_VERSION "8.2.1"matches the releaseREDIS_VERSION_NUM 0x00080201correctly encodes version 8.2.1 in the expected formatNote: The pipeline failure about an unused variable in replication.c is unrelated to this file and should be addressed separately.
tests/unit/type/stream.tcl (1)
269-272: The test expectation is correct; the code has a bug at line 2708.At line 270, the code correctly implements DELETE_STRATEGY_ACKED semantics but encounters a blocking conservative NULL check. The test expectation of XLEN=2 is correct.
When entry 1-0 is checked for deletion:
- Entry 1-0 >= min_cgroup_last_id (1-0 not < 1-0), so it's been delivered
- cgroups_ref is NULL because PEL is empty after RELOAD
- The code should recognize: entry >= min_cgroup_last_id + cgroups_ref=NULL → entry is fully acknowledged → deletable
- However, line 2708 unconditionally returns 1 (referenced) when cgroups_ref is NULL, blocking deletion
The test comment states the expected behavior clearly: "ACKED option should work correctly even without cgroup references." This means ACKed entries should be deletable regardless of cgroups_ref state, as long as they're >= min_cgroup_last_id.
The actual problem is not with the test, but with the conservative NULL check at
src/t_stream.c:2708that should instead check whether the entry is actually within the delivered range before falling back to cgroups_ref.Likely an incorrect or invalid review comment.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| /* Temporarily disable active defragmentation during database flush. | ||
| * This prevents defrag from being triggered in replicationEmptyDbCallback() | ||
| * which could modify the database while it's being emptied. */ | ||
| int orig_active_defrag = server.active_defrag_enabled; | ||
| server.active_defrag_enabled = 0; | ||
|
|
||
| emptyData(-1, empty_db_flags, replicationEmptyDbCallback); | ||
|
|
||
| /* Restore the original active defragmentation setting. */ | ||
| server.active_defrag_enabled = 1; |
There was a problem hiding this comment.
Restore the original defrag setting (and fix the -Werror build break).
You save orig_active_defrag but never use it, which fails CI and also forces defrag on even if it was previously disabled. Restore the saved value instead.
🐛 Proposed fix
- int orig_active_defrag = server.active_defrag_enabled;
- server.active_defrag_enabled = 0;
-
- emptyData(-1, empty_db_flags, replicationEmptyDbCallback);
-
- /* Restore the original active defragmentation setting. */
- server.active_defrag_enabled = 1;
+ int orig_active_defrag = server.active_defrag_enabled;
+ server.active_defrag_enabled = 0;
+
+ emptyData(-1, empty_db_flags, replicationEmptyDbCallback);
+
+ /* Restore the original active defragmentation setting. */
+ server.active_defrag_enabled = orig_active_defrag;🧰 Tools
🪛 GitHub Actions: CI
[error] 1956-1956: replication.c: In function ‘rdbLoadEmptyDbFunc’: unused variable ‘orig_active_defrag’ [-Werror=unused-variable]
🪛 GitHub Actions: Codecov
[warning] 1956-1956: Unused variable 'orig_active_defrag' in replication.c. The value is assigned but never used.
🪛 GitHub Actions: External Server Tests
[error] 1956-1956: In function 'rdbLoadEmptyDbFunc': unused variable 'orig_active_defrag' -Werror=unused-variable caused the build to fail. Command failed: make REDIS_CFLAGS=-Werror.
🤖 Prompt for AI Agents
In `@src/replication.c` around lines 1953 - 1962, The code saves the original
defrag flag to orig_active_defrag but restores it incorrectly by hardcoding
server.active_defrag_enabled = 1; instead of using the saved value; change the
restore to assign server.active_defrag_enabled = orig_active_defrag so the
previous state is preserved (the change is around the block that disables defrag
before calling emptyData(...) and references server.active_defrag_enabled,
orig_active_defrag, and replicationEmptyDbCallback).
| proc discard_replies_every {rd count frequency discard_num} { | ||
| if {$count % $frequency != 0} { | ||
| for {set k 0} {$k < $discard_num} {incr k} { | ||
| $rd read ; # Discard replies | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
discard_replies_every discards on every iteration (condition inverted).
$count % $frequency != 0 causes the helper to read discard_num replies on almost every loop (including the first), which will block once replies are exhausted. This affects all new call sites, including the replica flush defrag test.
🐛 Proposed fix
- if {$count % $frequency != 0} {
+ if {$frequency > 0 && ($count % $frequency) == 0} {
for {set k 0} {$k < $discard_num} {incr k} {
$rd read ; # Discard replies
}
}Also applies to: 350-351, 362-363, 372-373, 768-769, 878-879, 888-889, 945-1004
🤖 Prompt for AI Agents
In `@tests/unit/memefficiency.tcl` around lines 70 - 76, The helper proc
discard_replies_every has its condition inverted and therefore discards replies
on every iteration; change the modulus check from "$count % $frequency != 0" to
"$count % $frequency == 0" so it only discards when the count is a multiple of
frequency, and update all other occurrences of the same pattern (the other
listed line ranges) to the same equality check to prevent premature/blocking
reads in discard_replies_every.
Benchmark PR from qodo-benchmark#100
Summary by CodeRabbit
Release 8.2.1
✏️ Tip: You can customize this high-level summary in your review settings.