Add auto-repair options for broken AOF tail on startup by tomerqodo · Pull Request #29 · qodo-benchmark/redis-combined-coderabbit

tomerqodo · 2026-01-29T10:12:30Z

Benchmark PR from agentic-review-benchmarks#7

Summary by CodeRabbit

Release Notes

New Features
- Introduced automatic AOF corruption recovery with two new configuration options: aof-load-broken (enables recovery) and aof-load-broken-max-size (sets corruption size threshold). Server can now automatically recover from mid-file corruption by truncating corrupted segments below the threshold.
Tests
- Added comprehensive integration tests for AOF corruption recovery scenarios and edge cases.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…oken

coderabbitai · 2026-01-29T10:12:48Z

Walkthrough

Redis configuration now supports automatic recovery from corrupted AOF files. Two new configuration options enable mid-file corruption recovery with size limits, allowing server startup to succeed through truncation instead of always aborting on corruption, with warnings logged for transparency.

Changes

Cohort / File(s)	Summary
Configuration Options `redis.conf`, `src/config.c`	Added `aof-load-broken` (boolean, default off) and `aof-load-broken-max-size` (default 4096 bytes) configuration directives to control AOF corruption recovery behavior.
AOF Recovery Logic `src/aof.c`	Integrated automatic recovery path into error handler: when corruption detected, attempts truncation to last valid offset if enabled and size permits; treats AOF_BROKEN_RECOVERED as valid terminal state in file loading paths.
Server Structure `src/server.h`	Extended `redisServer` struct with `aof_load_broken` (flag) and `aof_load_broken_max_size` (capacity limit) fields; added AOF_BROKEN_RECOVERED state macro (value 6).
Integration Tests `tests/integration/aof.tcl`	Added 136 lines of test scenarios validating recovery behavior with corrupted base/incr files, max-size enforcement, log message verification, and edge cases.

Sequence Diagram(s)

sequenceDiagram
    participant Server as Server (Startup)
    participant AOFLoader as AOF Loader
    participant Config as Config
    participant File as AOF File
    
    Server->>Config: Check aof-load-broken & max-size
    Config-->>Server: Return settings
    
    Server->>AOFLoader: Load AOF files
    AOFLoader->>File: Read file content
    File-->>AOFLoader: Data (with corruption)
    
    AOFLoader->>AOFLoader: Detect corruption format error
    
    alt aof-load-broken enabled
        AOFLoader->>AOFLoader: Check corruption size vs max
        alt Corruption within limit
            AOFLoader->>File: Truncate to last valid offset
            File-->>AOFLoader: Truncate success
            AOFLoader->>AOFLoader: Mark as AOF_BROKEN_RECOVERED
            AOFLoader-->>Server: Load complete with warning
        else Corruption exceeds limit
            AOFLoader-->>Server: Fail with error log
        end
    else aof-load-broken disabled
        AOFLoader-->>Server: Fatal error, abort
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A redis server stumbles upon a broken file,
But now it gently heals the wound with style—
Truncate the tail, recover with care,
When data loss is less than despair!
Configuration options guide the way,
Letting Redis fight another day. 🌟

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add auto-repair options for broken AOF tail on startup' accurately and specifically summarizes the main change - introducing new configuration options (aof-load-broken and aof-load-broken-max-size) that enable automatic recovery from corrupted AOF files at startup.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/aof.c (1)
1830-1895: Treat AOF_BROKEN_RECOVERED as a successful terminal state.

In the base-file path, AOF_BROKEN_RECOVERED is still treated as fatal (Line 1841), and stopLoading doesn’t recognize it as success (Line 1894). This skips size bookkeeping and can surface as a failed load despite successful recovery.
🛠️ Suggested fix
-        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
+        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
             goto cleanup;
         }
...
-    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);
+    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);

🤖 Fix all issues with AI agents

In `@src/aof.c`:
- Around line 1722-1752: The code uses sb.st_size in the aof-load-broken
recovery path without ensuring the earlier redis_fstat succeeded; add a local
check of redis_fstat()'s return (or a boolean set when redis_fstat filled sb)
before any use of sb.st_size in the block handling server.aof_load_broken and
valid_up_to, and if fstat failed log a warning (using serverLog) and skip the
truncation/recovery and the size comparisons against
server.aof_load_broken_max_size to avoid UB and misleading decisions; update
references in this branch (uses of sb.st_size, the broken-size comparisons, and
the truncate/lseek logic around server.aof_fd) to only execute when the fstat
check confirms sb is valid.

In `@src/config.c`:
- Line 3258: The config entry created via createOffTConfig for
"aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit
suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and
parsed consistently with other size directives (e.g.,
auto-aof-rewrite-min-size), ensuring the existing default
server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.

In `@tests/integration/aof.tcl`:
- Around line 723-725: The Tcl exec call using a wildcard doesn't expand so incr
AOF files aren't deleted; change the deletion to use Tcl glob expansion (e.g.
call glob on "$aof_dirpath/appendonly.aof.*" and then file delete those entries
or use exec with the expanded list via {*}[glob ...]) in the same block where
exec rm -f $aof_dirpath/appendonly.aof.* currently appears (near the
start_server_aof invocation) so the incremental AOF files are actually removed
and the "base is last file" test condition is satisfied.

🧹 Nitpick comments (1)

src/server.h (1)
15-16: Header guard still uses reserved identifier (optional fix).

The change from __REDIS_H to _REDIS_H is an improvement, but identifiers starting with underscore followed by uppercase are still reserved by the C standard (§7.1.3). Using REDIS_H without the leading underscore would be fully compliant.

This is unlikely to cause practical issues and is common in many codebases.
♻️ Fully compliant header guard (optional)
-#ifndef _REDIS_H
-#define _REDIS_H
+#ifndef REDIS_H
+#define REDIS_H
Note: This would also require changing the corresponding #endif comment if one exists at the end of the file.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49d1914 and 5c59ae6.

📒 Files selected for processing (5)

redis.conf
src/aof.c
src/config.c
src/server.h
tests/integration/aof.tcl

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)

GitHub Check: code-coverage
GitHub Check: test-external-cluster
GitHub Check: test-external-nodebug
GitHub Check: Analyze (cpp)
GitHub Check: test-external-standalone
GitHub Check: test-ubuntu-latest
GitHub Check: build-centos-jemalloc
GitHub Check: test-sanitizer-address
GitHub Check: build-32bit
GitHub Check: build-old-chain-jemalloc
GitHub Check: code-coverage
GitHub Check: test-external-nodebug
GitHub Check: build-32bit
GitHub Check: build-centos-jemalloc
GitHub Check: test-external-standalone
GitHub Check: test-external-cluster
GitHub Check: test-ubuntu-latest
GitHub Check: test-sanitizer-address
GitHub Check: build-old-chain-jemalloc

🔇 Additional comments (4)

src/server.h (2)

341-348: LGTM!

The new AOF_BROKEN_RECOVERED return value follows the existing sequential pattern and provides a distinct status for the auto-repair scenario (broken AOF that was successfully recovered through truncation) versus the existing AOF_TRUNCATED state.

2009-2012: LGTM!

The new configuration fields are well-placed alongside the related aof_load_truncated option. The type choices are appropriate: int for the boolean flag (consistent with Redis practice) and off_t for the file size limit (portable and handles large files correctly).

src/config.c (1)

3092-3094: Good addition of the aof-load-broken toggle to the static config surface.
This wiring is consistent with the existing AOF load options.

redis.conf (1)

1532-1549: Documentation is clear and appropriately warns about data loss risks.
The guidance distinguishes mid-file corruption from truncated tail handling and sets the right expectations.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-01-29T10:17:35Z

src/aof.c

+    /* fmterr may be caused by accidentally machine shutdown, so if the broken tail
+     * is less than a specified size, try to recover it automatically */
+    if (server.aof_load_broken) {
+        if (valid_up_to == -1) {
+            serverLog(LL_WARNING,"Last valid command offset is invalid");
+        } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
+            if (truncate(aof_filepath,valid_up_to) == -1) {
+                serverLog(LL_WARNING,"Error truncating the AOF file: %s",
+                    strerror(errno));
+            } else {
+                /* Make sure the AOF file descriptor points to the end of the
+                 * file after the truncate call. */
+                if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {
+                    serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",
+                        strerror(errno));
+                } else {
+                    serverLog(LL_WARNING,
+                        "AOF loaded anyway because aof-load-broken is enabled and "
+                        "broken size '%lld' is less than aof-load-broken-max-size '%lld'",
+                        (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
+                    ret = AOF_BROKEN_RECOVERED;
+                    goto loaded_ok;
+                }
+            }
+        } else { /* The size of the corrupted portion exceeds the configured limit. */
+            serverLog(LL_WARNING,
+                 "AOF was not loaded because the size of the corrupted portion "
+                 "exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "
+                 "is bigger than aof-load-broken-max-size '%lld'",
+                 (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
+        }


⚠️ Potential issue | 🟠 Major

Guard broken-size calculation with a valid fstat.

sb.st_size is now used in the recovery path but sb may be uninitialized if redis_fstat previously failed. Add a local redis_fstat check before using sb.st_size to avoid UB and misleading size comparisons.

🛠️ Suggested fix

fmterr: /* Format error. */ /* fmterr may be caused by accidentally machine shutdown, so if the broken tail * is less than a specified size, try to recover it automatically */ if (server.aof_load_broken) { + if (redis_fstat(fileno(fp), &sb) == -1) { + serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno)); + ret = AOF_FAILED; + goto cleanup; + } if (valid_up_to == -1) { serverLog(LL_WARNING,"Last valid command offset is invalid"); } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/* fmterr may be caused by accidentally machine shutdown, so if the broken tail

* is less than a specified size, try to recover it automatically */

if (server.aof_load_broken) {

if (valid_up_to == -1) {

serverLog(LL_WARNING,"Last valid command offset is invalid");

} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {

if (truncate(aof_filepath,valid_up_to) == -1) {

serverLog(LL_WARNING,"Error truncating the AOF file: %s",

strerror(errno));

} else {

/* Make sure the AOF file descriptor points to the end of the

* file after the truncate call. */

if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {

serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",

strerror(errno));

} else {

serverLog(LL_WARNING,

"AOF loaded anyway because aof-load-broken is enabled and "

"broken size '%lld' is less than aof-load-broken-max-size '%lld'",

(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));

ret = AOF_BROKEN_RECOVERED;

goto loaded_ok;

}

}

} else { /* The size of the corrupted portion exceeds the configured limit. */

serverLog(LL_WARNING,

"AOF was not loaded because the size of the corrupted portion "

"exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "

"is bigger than aof-load-broken-max-size '%lld'",

(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));

}

/* fmterr may be caused by accidentally machine shutdown, so if the broken tail

* is less than a specified size, try to recover it automatically */

if (server.aof_load_broken) {

if (redis_fstat(fileno(fp), &sb) == -1) {

serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno));

ret = AOF_FAILED;

goto cleanup;

}

if (valid_up_to == -1) {

serverLog(LL_WARNING,"Last valid command offset is invalid");

} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {

if (truncate(aof_filepath,valid_up_to) == -1) {

serverLog(LL_WARNING,"Error truncating the AOF file: %s",

strerror(errno));

} else {

/* Make sure the AOF file descriptor points to the end of the

* file after the truncate call. */

if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {

serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",

strerror(errno));

} else {

serverLog(LL_WARNING,

"AOF loaded anyway because aof-load-broken is enabled and "

"broken size '%lld' is less than aof-load-broken-max-size '%lld'",

(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));

ret = AOF_BROKEN_RECOVERED;

goto loaded_ok;

}

}

} else { /* The size of the corrupted portion exceeds the configured limit. */

serverLog(LL_WARNING,

"AOF was not loaded because the size of the corrupted portion "

"exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "

"is bigger than aof-load-broken-max-size '%lld'",

(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));

}

🤖 Prompt for AI Agents

In `@src/aof.c` around lines 1722 - 1752, The code uses sb.st_size in the aof-load-broken recovery path without ensuring the earlier redis_fstat succeeded; add a local check of redis_fstat()'s return (or a boolean set when redis_fstat filled sb) before any use of sb.st_size in the block handling server.aof_load_broken and valid_up_to, and if fstat failed log a warning (using serverLog) and skip the truncation/recovery and the size comparisons against server.aof_load_broken_max_size to avoid UB and misleading decisions; update references in this branch (uses of sb.st_size, the broken-size comparisons, and the truncate/lseek logic around server.aof_fd) to only execute when the fstat check confirms sb is valid.

coderabbitai · 2026-01-29T10:17:35Z

src/config.c

    createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60*60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */
    createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
    createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
+    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),


⚠️ Potential issue | 🟡 Minor

Allow standard size units for aof-load-broken-max-size.
Using INTEGER_CONFIG rejects values like 4kb even though this is a byte-size setting and the config header advertises unit suffixes. Consider switching to MEMORY_CONFIG for consistency with other size-based directives (e.g., auto-aof-rewrite-min-size).

Proposed change

- createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), + createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),

createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),

🤖 Prompt for AI Agents

In `@src/config.c` at line 3258, The config entry created via createOffTConfig for "aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and parsed consistently with other size directives (e.g., auto-aof-rewrite-min-size), ensuring the existing default server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.

coderabbitai · 2026-01-29T10:17:35Z

tests/integration/aof.tcl

+    # Remove all incr AOF files to make the base file being the last file
+    exec rm -f $aof_dirpath/appendonly.aof.*
+    start_server_aof [list dir $server_path aof-load-broken yes] {


⚠️ Potential issue | 🟠 Major

Wildcard deletion won’t expand in Tcl exec.

exec rm -f $aof_dirpath/appendonly.aof.* won’t expand the glob, so the incr files likely remain and the test won’t create the “base is last file” condition. Use glob + file delete (or exec rm -f {*}[glob ...]).

🛠️ Suggested fix

- exec rm -f $aof_dirpath/appendonly.aof.* + file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Remove all incr AOF files to make the base file being the last file

exec rm -f $aof_dirpath/appendonly.aof.*

start_server_aof [list dir $server_path aof-load-broken yes] {

# Remove all incr AOF files to make the base file being the last file

file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*]

start_server_aof [list dir $server_path aof-load-broken yes] {

🤖 Prompt for AI Agents

In `@tests/integration/aof.tcl` around lines 723 - 725, The Tcl exec call using a wildcard doesn't expand so incr AOF files aren't deleted; change the deletion to use Tcl glob expansion (e.g. call glob on "$aof_dirpath/appendonly.aof.*" and then file delete those entries or use exec with the expanded list via {*}[glob ...]) in the same block where exec rm -f $aof_dirpath/appendonly.aof.* currently appears (near the start_server_aof invocation) so the incremental AOF files are actually removed and the "base is last file" test condition is satisfied.

StavRLevi and others added 22 commits January 25, 2026 12:12

Add auto-repair options for broken AOF tail on startup

76826d9

change aof-load-broken-max-size configuration

f9f91f4

fix of-load-broken-max-size configuration

f7854e1

fix unnecessary changes

9487237

remove unnecessary change

11754c1

Fix format specifier in AOF error message

023f121

Fix format specifier in AOF error message

09e8235

Fix tcl error msg

459faa7

Add incr file support + remove spaces

bcd7eea

add tests for base aof

9751360

fix spaces

37e5bb5

add support to incr aof + add test for an intermediate AOF file is br…

825c783

…oken

fix indentation and add more tests

a202286

add tls

a2703d9

add space + file for test

4bc5794

try to remove tls for tests

293ff89

fix tls

c67106f

add aof-load-broken to redis.conf

9da7ec3

add expansion about aof-load-broken-max-size

008dff1

add expansion about aof-load-broken-max-size

8408c32

add aof-load-broken-max-size to conf file

34df5ba

update pr

5c59ae6

coderabbitai bot reviewed Jan 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add auto-repair options for broken AOF tail on startup#29

Add auto-repair options for broken AOF tail on startup#29
tomerqodo wants to merge 22 commits intocoderabbit_full_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7from
coderabbit_full_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7

tomerqodo commented Jan 29, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 29, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 29, 2026

Uh oh!

coderabbitai bot Jan 29, 2026

Uh oh!

coderabbitai bot Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
	createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),

Conversation

tomerqodo commented Jan 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomerqodo commented Jan 29, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 29, 2026 •

edited

Loading