Skip to content

Add auto-repair options for broken AOF tail on startup#29

Open
tomerqodo wants to merge 22 commits intocoderabbit_full_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7from
coderabbit_full_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7
Open

Add auto-repair options for broken AOF tail on startup#29
tomerqodo wants to merge 22 commits intocoderabbit_full_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7from
coderabbit_full_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7

Conversation

@tomerqodo
Copy link

@tomerqodo tomerqodo commented Jan 29, 2026

Benchmark PR from agentic-review-benchmarks#7

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced automatic AOF corruption recovery with two new configuration options: aof-load-broken (enables recovery) and aof-load-broken-max-size (sets corruption size threshold). Server can now automatically recover from mid-file corruption by truncating corrupted segments below the threshold.
  • Tests

    • Added comprehensive integration tests for AOF corruption recovery scenarios and edge cases.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 29, 2026

Walkthrough

Redis configuration now supports automatic recovery from corrupted AOF files. Two new configuration options enable mid-file corruption recovery with size limits, allowing server startup to succeed through truncation instead of always aborting on corruption, with warnings logged for transparency.

Changes

Cohort / File(s) Summary
Configuration Options
redis.conf, src/config.c
Added aof-load-broken (boolean, default off) and aof-load-broken-max-size (default 4096 bytes) configuration directives to control AOF corruption recovery behavior.
AOF Recovery Logic
src/aof.c
Integrated automatic recovery path into error handler: when corruption detected, attempts truncation to last valid offset if enabled and size permits; treats AOF_BROKEN_RECOVERED as valid terminal state in file loading paths.
Server Structure
src/server.h
Extended redisServer struct with aof_load_broken (flag) and aof_load_broken_max_size (capacity limit) fields; added AOF_BROKEN_RECOVERED state macro (value 6).
Integration Tests
tests/integration/aof.tcl
Added 136 lines of test scenarios validating recovery behavior with corrupted base/incr files, max-size enforcement, log message verification, and edge cases.

Sequence Diagram(s)

sequenceDiagram
    participant Server as Server (Startup)
    participant AOFLoader as AOF Loader
    participant Config as Config
    participant File as AOF File
    
    Server->>Config: Check aof-load-broken & max-size
    Config-->>Server: Return settings
    
    Server->>AOFLoader: Load AOF files
    AOFLoader->>File: Read file content
    File-->>AOFLoader: Data (with corruption)
    
    AOFLoader->>AOFLoader: Detect corruption format error
    
    alt aof-load-broken enabled
        AOFLoader->>AOFLoader: Check corruption size vs max
        alt Corruption within limit
            AOFLoader->>File: Truncate to last valid offset
            File-->>AOFLoader: Truncate success
            AOFLoader->>AOFLoader: Mark as AOF_BROKEN_RECOVERED
            AOFLoader-->>Server: Load complete with warning
        else Corruption exceeds limit
            AOFLoader-->>Server: Fail with error log
        end
    else aof-load-broken disabled
        AOFLoader-->>Server: Fatal error, abort
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A redis server stumbles upon a broken file,
But now it gently heals the wound with style—
Truncate the tail, recover with care,
When data loss is less than despair!
Configuration options guide the way,
Letting Redis fight another day. 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add auto-repair options for broken AOF tail on startup' accurately and specifically summarizes the main change - introducing new configuration options (aof-load-broken and aof-load-broken-max-size) that enable automatic recovery from corrupted AOF files at startup.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aof.c (1)

1830-1895: Treat AOF_BROKEN_RECOVERED as a successful terminal state.

In the base-file path, AOF_BROKEN_RECOVERED is still treated as fatal (Line 1841), and stopLoading doesn’t recognize it as success (Line 1894). This skips size bookkeeping and can surface as a failed load despite successful recovery.

🛠️ Suggested fix
-        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
+        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
             goto cleanup;
         }
...
-    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);
+    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);
🤖 Fix all issues with AI agents
In `@src/aof.c`:
- Around line 1722-1752: The code uses sb.st_size in the aof-load-broken
recovery path without ensuring the earlier redis_fstat succeeded; add a local
check of redis_fstat()'s return (or a boolean set when redis_fstat filled sb)
before any use of sb.st_size in the block handling server.aof_load_broken and
valid_up_to, and if fstat failed log a warning (using serverLog) and skip the
truncation/recovery and the size comparisons against
server.aof_load_broken_max_size to avoid UB and misleading decisions; update
references in this branch (uses of sb.st_size, the broken-size comparisons, and
the truncate/lseek logic around server.aof_fd) to only execute when the fstat
check confirms sb is valid.

In `@src/config.c`:
- Line 3258: The config entry created via createOffTConfig for
"aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit
suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and
parsed consistently with other size directives (e.g.,
auto-aof-rewrite-min-size), ensuring the existing default
server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.

In `@tests/integration/aof.tcl`:
- Around line 723-725: The Tcl exec call using a wildcard doesn't expand so incr
AOF files aren't deleted; change the deletion to use Tcl glob expansion (e.g.
call glob on "$aof_dirpath/appendonly.aof.*" and then file delete those entries
or use exec with the expanded list via {*}[glob ...]) in the same block where
exec rm -f $aof_dirpath/appendonly.aof.* currently appears (near the
start_server_aof invocation) so the incremental AOF files are actually removed
and the "base is last file" test condition is satisfied.
🧹 Nitpick comments (1)
src/server.h (1)

15-16: Header guard still uses reserved identifier (optional fix).

The change from __REDIS_H to _REDIS_H is an improvement, but identifiers starting with underscore followed by uppercase are still reserved by the C standard (§7.1.3). Using REDIS_H without the leading underscore would be fully compliant.

This is unlikely to cause practical issues and is common in many codebases.

♻️ Fully compliant header guard (optional)
-#ifndef _REDIS_H
-#define _REDIS_H
+#ifndef REDIS_H
+#define REDIS_H

Note: This would also require changing the corresponding #endif comment if one exists at the end of the file.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49d1914 and 5c59ae6.

📒 Files selected for processing (5)
  • redis.conf
  • src/aof.c
  • src/config.c
  • src/server.h
  • tests/integration/aof.tcl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: code-coverage
  • GitHub Check: test-external-cluster
  • GitHub Check: test-external-nodebug
  • GitHub Check: Analyze (cpp)
  • GitHub Check: test-external-standalone
  • GitHub Check: test-ubuntu-latest
  • GitHub Check: build-centos-jemalloc
  • GitHub Check: test-sanitizer-address
  • GitHub Check: build-32bit
  • GitHub Check: build-old-chain-jemalloc
  • GitHub Check: code-coverage
  • GitHub Check: test-external-nodebug
  • GitHub Check: build-32bit
  • GitHub Check: build-centos-jemalloc
  • GitHub Check: test-external-standalone
  • GitHub Check: test-external-cluster
  • GitHub Check: test-ubuntu-latest
  • GitHub Check: test-sanitizer-address
  • GitHub Check: build-old-chain-jemalloc
🔇 Additional comments (4)
src/server.h (2)

341-348: LGTM!

The new AOF_BROKEN_RECOVERED return value follows the existing sequential pattern and provides a distinct status for the auto-repair scenario (broken AOF that was successfully recovered through truncation) versus the existing AOF_TRUNCATED state.


2009-2012: LGTM!

The new configuration fields are well-placed alongside the related aof_load_truncated option. The type choices are appropriate: int for the boolean flag (consistent with Redis practice) and off_t for the file size limit (portable and handles large files correctly).

src/config.c (1)

3092-3094: Good addition of the aof-load-broken toggle to the static config surface.
This wiring is consistent with the existing AOF load options.

redis.conf (1)

1532-1549: Documentation is clear and appropriately warns about data loss risks.
The guidance distinguishes mid-file corruption from truncated tail handling and sets the right expectations.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +1722 to +1752
/* fmterr may be caused by accidentally machine shutdown, so if the broken tail
* is less than a specified size, try to recover it automatically */
if (server.aof_load_broken) {
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
if (truncate(aof_filepath,valid_up_to) == -1) {
serverLog(LL_WARNING,"Error truncating the AOF file: %s",
strerror(errno));
} else {
/* Make sure the AOF file descriptor points to the end of the
* file after the truncate call. */
if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {
serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",
strerror(errno));
} else {
serverLog(LL_WARNING,
"AOF loaded anyway because aof-load-broken is enabled and "
"broken size '%lld' is less than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
ret = AOF_BROKEN_RECOVERED;
goto loaded_ok;
}
}
} else { /* The size of the corrupted portion exceeds the configured limit. */
serverLog(LL_WARNING,
"AOF was not loaded because the size of the corrupted portion "
"exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "
"is bigger than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard broken-size calculation with a valid fstat.

sb.st_size is now used in the recovery path but sb may be uninitialized if redis_fstat previously failed. Add a local redis_fstat check before using sb.st_size to avoid UB and misleading size comparisons.

🛠️ Suggested fix
 fmterr: /* Format error. */
     /* fmterr may be caused by accidentally machine shutdown, so if the broken tail
      * is less than a specified size, try to recover it automatically */
     if (server.aof_load_broken) {
+        if (redis_fstat(fileno(fp), &sb) == -1) {
+            serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno));
+            ret = AOF_FAILED;
+            goto cleanup;
+        }
         if (valid_up_to == -1) {
             serverLog(LL_WARNING,"Last valid command offset is invalid");
         } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/* fmterr may be caused by accidentally machine shutdown, so if the broken tail
* is less than a specified size, try to recover it automatically */
if (server.aof_load_broken) {
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
if (truncate(aof_filepath,valid_up_to) == -1) {
serverLog(LL_WARNING,"Error truncating the AOF file: %s",
strerror(errno));
} else {
/* Make sure the AOF file descriptor points to the end of the
* file after the truncate call. */
if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {
serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",
strerror(errno));
} else {
serverLog(LL_WARNING,
"AOF loaded anyway because aof-load-broken is enabled and "
"broken size '%lld' is less than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
ret = AOF_BROKEN_RECOVERED;
goto loaded_ok;
}
}
} else { /* The size of the corrupted portion exceeds the configured limit. */
serverLog(LL_WARNING,
"AOF was not loaded because the size of the corrupted portion "
"exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "
"is bigger than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
}
/* fmterr may be caused by accidentally machine shutdown, so if the broken tail
* is less than a specified size, try to recover it automatically */
if (server.aof_load_broken) {
if (redis_fstat(fileno(fp), &sb) == -1) {
serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno));
ret = AOF_FAILED;
goto cleanup;
}
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
if (truncate(aof_filepath,valid_up_to) == -1) {
serverLog(LL_WARNING,"Error truncating the AOF file: %s",
strerror(errno));
} else {
/* Make sure the AOF file descriptor points to the end of the
* file after the truncate call. */
if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {
serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",
strerror(errno));
} else {
serverLog(LL_WARNING,
"AOF loaded anyway because aof-load-broken is enabled and "
"broken size '%lld' is less than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
ret = AOF_BROKEN_RECOVERED;
goto loaded_ok;
}
}
} else { /* The size of the corrupted portion exceeds the configured limit. */
serverLog(LL_WARNING,
"AOF was not loaded because the size of the corrupted portion "
"exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' "
"is bigger than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
}
🤖 Prompt for AI Agents
In `@src/aof.c` around lines 1722 - 1752, The code uses sb.st_size in the
aof-load-broken recovery path without ensuring the earlier redis_fstat
succeeded; add a local check of redis_fstat()'s return (or a boolean set when
redis_fstat filled sb) before any use of sb.st_size in the block handling
server.aof_load_broken and valid_up_to, and if fstat failed log a warning (using
serverLog) and skip the truncation/recovery and the size comparisons against
server.aof_load_broken_max_size to avoid UB and misleading decisions; update
references in this branch (uses of sb.st_size, the broken-size comparisons, and
the truncate/lseek logic around server.aof_fd) to only execute when the fstat
check confirms sb is valid.

createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60*60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */
createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Allow standard size units for aof-load-broken-max-size.
Using INTEGER_CONFIG rejects values like 4kb even though this is a byte-size setting and the config header advertises unit suffixes. Consider switching to MEMORY_CONFIG for consistency with other size-based directives (e.g., auto-aof-rewrite-min-size).

Proposed change
-    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
+    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),
🤖 Prompt for AI Agents
In `@src/config.c` at line 3258, The config entry created via createOffTConfig for
"aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit
suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and
parsed consistently with other size directives (e.g.,
auto-aof-rewrite-min-size), ensuring the existing default
server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.

Comment on lines +723 to +725
# Remove all incr AOF files to make the base file being the last file
exec rm -f $aof_dirpath/appendonly.aof.*
start_server_aof [list dir $server_path aof-load-broken yes] {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Wildcard deletion won’t expand in Tcl exec.

exec rm -f $aof_dirpath/appendonly.aof.* won’t expand the glob, so the incr files likely remain and the test won’t create the “base is last file” condition. Use glob + file delete (or exec rm -f {*}[glob ...]).

🛠️ Suggested fix
-    exec rm -f $aof_dirpath/appendonly.aof.*
+    file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Remove all incr AOF files to make the base file being the last file
exec rm -f $aof_dirpath/appendonly.aof.*
start_server_aof [list dir $server_path aof-load-broken yes] {
# Remove all incr AOF files to make the base file being the last file
file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*]
start_server_aof [list dir $server_path aof-load-broken yes] {
🤖 Prompt for AI Agents
In `@tests/integration/aof.tcl` around lines 723 - 725, The Tcl exec call using a
wildcard doesn't expand so incr AOF files aren't deleted; change the deletion to
use Tcl glob expansion (e.g. call glob on "$aof_dirpath/appendonly.aof.*" and
then file delete those entries or use exec with the expanded list via {*}[glob
...]) in the same block where exec rm -f $aof_dirpath/appendonly.aof.* currently
appears (near the start_server_aof invocation) so the incremental AOF files are
actually removed and the "base is last file" test condition is satisfied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants