Add auto-repair options for broken AOF tail on startup#29
Conversation
WalkthroughRedis configuration now supports automatic recovery from corrupted AOF files. Two new configuration options enable mid-file corruption recovery with size limits, allowing server startup to succeed through truncation instead of always aborting on corruption, with warnings logged for transparency. Changes
Sequence Diagram(s)sequenceDiagram
participant Server as Server (Startup)
participant AOFLoader as AOF Loader
participant Config as Config
participant File as AOF File
Server->>Config: Check aof-load-broken & max-size
Config-->>Server: Return settings
Server->>AOFLoader: Load AOF files
AOFLoader->>File: Read file content
File-->>AOFLoader: Data (with corruption)
AOFLoader->>AOFLoader: Detect corruption format error
alt aof-load-broken enabled
AOFLoader->>AOFLoader: Check corruption size vs max
alt Corruption within limit
AOFLoader->>File: Truncate to last valid offset
File-->>AOFLoader: Truncate success
AOFLoader->>AOFLoader: Mark as AOF_BROKEN_RECOVERED
AOFLoader-->>Server: Load complete with warning
else Corruption exceeds limit
AOFLoader-->>Server: Fail with error log
end
else aof-load-broken disabled
AOFLoader-->>Server: Fatal error, abort
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aof.c (1)
1830-1895: TreatAOF_BROKEN_RECOVEREDas a successful terminal state.In the base-file path,
AOF_BROKEN_RECOVEREDis still treated as fatal (Line 1841), andstopLoadingdoesn’t recognize it as success (Line 1894). This skips size bookkeeping and can surface as a failed load despite successful recovery.🛠️ Suggested fix
- if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) { + if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) { goto cleanup; } ... - stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED); + stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);
🤖 Fix all issues with AI agents
In `@src/aof.c`:
- Around line 1722-1752: The code uses sb.st_size in the aof-load-broken
recovery path without ensuring the earlier redis_fstat succeeded; add a local
check of redis_fstat()'s return (or a boolean set when redis_fstat filled sb)
before any use of sb.st_size in the block handling server.aof_load_broken and
valid_up_to, and if fstat failed log a warning (using serverLog) and skip the
truncation/recovery and the size comparisons against
server.aof_load_broken_max_size to avoid UB and misleading decisions; update
references in this branch (uses of sb.st_size, the broken-size comparisons, and
the truncate/lseek logic around server.aof_fd) to only execute when the fstat
check confirms sb is valid.
In `@src/config.c`:
- Line 3258: The config entry created via createOffTConfig for
"aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit
suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and
parsed consistently with other size directives (e.g.,
auto-aof-rewrite-min-size), ensuring the existing default
server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.
In `@tests/integration/aof.tcl`:
- Around line 723-725: The Tcl exec call using a wildcard doesn't expand so incr
AOF files aren't deleted; change the deletion to use Tcl glob expansion (e.g.
call glob on "$aof_dirpath/appendonly.aof.*" and then file delete those entries
or use exec with the expanded list via {*}[glob ...]) in the same block where
exec rm -f $aof_dirpath/appendonly.aof.* currently appears (near the
start_server_aof invocation) so the incremental AOF files are actually removed
and the "base is last file" test condition is satisfied.
🧹 Nitpick comments (1)
src/server.h (1)
15-16: Header guard still uses reserved identifier (optional fix).The change from
__REDIS_Hto_REDIS_His an improvement, but identifiers starting with underscore followed by uppercase are still reserved by the C standard (§7.1.3). UsingREDIS_Hwithout the leading underscore would be fully compliant.This is unlikely to cause practical issues and is common in many codebases.
♻️ Fully compliant header guard (optional)
-#ifndef _REDIS_H -#define _REDIS_H +#ifndef REDIS_H +#define REDIS_HNote: This would also require changing the corresponding
#endifcomment if one exists at the end of the file.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
redis.confsrc/aof.csrc/config.csrc/server.htests/integration/aof.tcl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: code-coverage
- GitHub Check: test-external-cluster
- GitHub Check: test-external-nodebug
- GitHub Check: Analyze (cpp)
- GitHub Check: test-external-standalone
- GitHub Check: test-ubuntu-latest
- GitHub Check: build-centos-jemalloc
- GitHub Check: test-sanitizer-address
- GitHub Check: build-32bit
- GitHub Check: build-old-chain-jemalloc
- GitHub Check: code-coverage
- GitHub Check: test-external-nodebug
- GitHub Check: build-32bit
- GitHub Check: build-centos-jemalloc
- GitHub Check: test-external-standalone
- GitHub Check: test-external-cluster
- GitHub Check: test-ubuntu-latest
- GitHub Check: test-sanitizer-address
- GitHub Check: build-old-chain-jemalloc
🔇 Additional comments (4)
src/server.h (2)
341-348: LGTM!The new
AOF_BROKEN_RECOVEREDreturn value follows the existing sequential pattern and provides a distinct status for the auto-repair scenario (broken AOF that was successfully recovered through truncation) versus the existingAOF_TRUNCATEDstate.
2009-2012: LGTM!The new configuration fields are well-placed alongside the related
aof_load_truncatedoption. The type choices are appropriate:intfor the boolean flag (consistent with Redis practice) andoff_tfor the file size limit (portable and handles large files correctly).src/config.c (1)
3092-3094: Good addition of theaof-load-brokentoggle to the static config surface.
This wiring is consistent with the existing AOF load options.redis.conf (1)
1532-1549: Documentation is clear and appropriately warns about data loss risks.
The guidance distinguishes mid-file corruption from truncated tail handling and sets the right expectations.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| /* fmterr may be caused by accidentally machine shutdown, so if the broken tail | ||
| * is less than a specified size, try to recover it automatically */ | ||
| if (server.aof_load_broken) { | ||
| if (valid_up_to == -1) { | ||
| serverLog(LL_WARNING,"Last valid command offset is invalid"); | ||
| } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) { | ||
| if (truncate(aof_filepath,valid_up_to) == -1) { | ||
| serverLog(LL_WARNING,"Error truncating the AOF file: %s", | ||
| strerror(errno)); | ||
| } else { | ||
| /* Make sure the AOF file descriptor points to the end of the | ||
| * file after the truncate call. */ | ||
| if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) { | ||
| serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s", | ||
| strerror(errno)); | ||
| } else { | ||
| serverLog(LL_WARNING, | ||
| "AOF loaded anyway because aof-load-broken is enabled and " | ||
| "broken size '%lld' is less than aof-load-broken-max-size '%lld'", | ||
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | ||
| ret = AOF_BROKEN_RECOVERED; | ||
| goto loaded_ok; | ||
| } | ||
| } | ||
| } else { /* The size of the corrupted portion exceeds the configured limit. */ | ||
| serverLog(LL_WARNING, | ||
| "AOF was not loaded because the size of the corrupted portion " | ||
| "exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' " | ||
| "is bigger than aof-load-broken-max-size '%lld'", | ||
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | ||
| } |
There was a problem hiding this comment.
Guard broken-size calculation with a valid fstat.
sb.st_size is now used in the recovery path but sb may be uninitialized if redis_fstat previously failed. Add a local redis_fstat check before using sb.st_size to avoid UB and misleading size comparisons.
🛠️ Suggested fix
fmterr: /* Format error. */
/* fmterr may be caused by accidentally machine shutdown, so if the broken tail
* is less than a specified size, try to recover it automatically */
if (server.aof_load_broken) {
+ if (redis_fstat(fileno(fp), &sb) == -1) {
+ serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno));
+ ret = AOF_FAILED;
+ goto cleanup;
+ }
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /* fmterr may be caused by accidentally machine shutdown, so if the broken tail | |
| * is less than a specified size, try to recover it automatically */ | |
| if (server.aof_load_broken) { | |
| if (valid_up_to == -1) { | |
| serverLog(LL_WARNING,"Last valid command offset is invalid"); | |
| } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) { | |
| if (truncate(aof_filepath,valid_up_to) == -1) { | |
| serverLog(LL_WARNING,"Error truncating the AOF file: %s", | |
| strerror(errno)); | |
| } else { | |
| /* Make sure the AOF file descriptor points to the end of the | |
| * file after the truncate call. */ | |
| if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) { | |
| serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s", | |
| strerror(errno)); | |
| } else { | |
| serverLog(LL_WARNING, | |
| "AOF loaded anyway because aof-load-broken is enabled and " | |
| "broken size '%lld' is less than aof-load-broken-max-size '%lld'", | |
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | |
| ret = AOF_BROKEN_RECOVERED; | |
| goto loaded_ok; | |
| } | |
| } | |
| } else { /* The size of the corrupted portion exceeds the configured limit. */ | |
| serverLog(LL_WARNING, | |
| "AOF was not loaded because the size of the corrupted portion " | |
| "exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' " | |
| "is bigger than aof-load-broken-max-size '%lld'", | |
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | |
| } | |
| /* fmterr may be caused by accidentally machine shutdown, so if the broken tail | |
| * is less than a specified size, try to recover it automatically */ | |
| if (server.aof_load_broken) { | |
| if (redis_fstat(fileno(fp), &sb) == -1) { | |
| serverLog(LL_WARNING, "Unable to stat AOF file %s: %s", filename, strerror(errno)); | |
| ret = AOF_FAILED; | |
| goto cleanup; | |
| } | |
| if (valid_up_to == -1) { | |
| serverLog(LL_WARNING,"Last valid command offset is invalid"); | |
| } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) { | |
| if (truncate(aof_filepath,valid_up_to) == -1) { | |
| serverLog(LL_WARNING,"Error truncating the AOF file: %s", | |
| strerror(errno)); | |
| } else { | |
| /* Make sure the AOF file descriptor points to the end of the | |
| * file after the truncate call. */ | |
| if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) { | |
| serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s", | |
| strerror(errno)); | |
| } else { | |
| serverLog(LL_WARNING, | |
| "AOF loaded anyway because aof-load-broken is enabled and " | |
| "broken size '%lld' is less than aof-load-broken-max-size '%lld'", | |
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | |
| ret = AOF_BROKEN_RECOVERED; | |
| goto loaded_ok; | |
| } | |
| } | |
| } else { /* The size of the corrupted portion exceeds the configured limit. */ | |
| serverLog(LL_WARNING, | |
| "AOF was not loaded because the size of the corrupted portion " | |
| "exceeds the configured limit. aof-load-broken is enabled and broken size '%lld' " | |
| "is bigger than aof-load-broken-max-size '%lld'", | |
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | |
| } |
🤖 Prompt for AI Agents
In `@src/aof.c` around lines 1722 - 1752, The code uses sb.st_size in the
aof-load-broken recovery path without ensuring the earlier redis_fstat
succeeded; add a local check of redis_fstat()'s return (or a boolean set when
redis_fstat filled sb) before any use of sb.st_size in the block handling
server.aof_load_broken and valid_up_to, and if fstat failed log a warning (using
serverLog) and skip the truncation/recovery and the size comparisons against
server.aof_load_broken_max_size to avoid UB and misleading decisions; update
references in this branch (uses of sb.st_size, the broken-size comparisons, and
the truncate/lseek logic around server.aof_fd) to only execute when the fstat
check confirms sb is valid.
| createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60*60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */ | ||
| createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL), | ||
| createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL), | ||
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
Allow standard size units for aof-load-broken-max-size.
Using INTEGER_CONFIG rejects values like 4kb even though this is a byte-size setting and the config header advertises unit suffixes. Consider switching to MEMORY_CONFIG for consistency with other size-based directives (e.g., auto-aof-rewrite-min-size).
Proposed change
- createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
+ createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), | |
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL), |
🤖 Prompt for AI Agents
In `@src/config.c` at line 3258, The config entry created via createOffTConfig for
"aof-load-broken-max-size" currently uses INTEGER_CONFIG which rejects unit
suffixes; change its type to MEMORY_CONFIG so values like "4kb" are accepted and
parsed consistently with other size directives (e.g.,
auto-aof-rewrite-min-size), ensuring the existing default
server.aof_load_broken_max_size and bounds (0, LONG_MAX) remain unchanged.
| # Remove all incr AOF files to make the base file being the last file | ||
| exec rm -f $aof_dirpath/appendonly.aof.* | ||
| start_server_aof [list dir $server_path aof-load-broken yes] { |
There was a problem hiding this comment.
Wildcard deletion won’t expand in Tcl exec.
exec rm -f $aof_dirpath/appendonly.aof.* won’t expand the glob, so the incr files likely remain and the test won’t create the “base is last file” condition. Use glob + file delete (or exec rm -f {*}[glob ...]).
🛠️ Suggested fix
- exec rm -f $aof_dirpath/appendonly.aof.*
+ file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Remove all incr AOF files to make the base file being the last file | |
| exec rm -f $aof_dirpath/appendonly.aof.* | |
| start_server_aof [list dir $server_path aof-load-broken yes] { | |
| # Remove all incr AOF files to make the base file being the last file | |
| file delete -force {*}[glob -nocomplain $aof_dirpath/appendonly.aof.*] | |
| start_server_aof [list dir $server_path aof-load-broken yes] { |
🤖 Prompt for AI Agents
In `@tests/integration/aof.tcl` around lines 723 - 725, The Tcl exec call using a
wildcard doesn't expand so incr AOF files aren't deleted; change the deletion to
use Tcl glob expansion (e.g. call glob on "$aof_dirpath/appendonly.aof.*" and
then file delete those entries or use exec with the expanded list via {*}[glob
...]) in the same block where exec rm -f $aof_dirpath/appendonly.aof.* currently
appears (near the start_server_aof invocation) so the incremental AOF files are
actually removed and the "base is last file" test condition is satisfied.
Benchmark PR from agentic-review-benchmarks#7
Summary by CodeRabbit
Release Notes
New Features
aof-load-broken(enables recovery) andaof-load-broken-max-size(sets corruption size threshold). Server can now automatically recover from mid-file corruption by truncating corrupted segments below the threshold.Tests
✏️ Tip: You can customize this high-level summary in your review settings.