-
Notifications
You must be signed in to change notification settings - Fork 679
add retry for index header corruption to avoid index headers getting deleted from store-gateway #13402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…eleted from store-gateway
|
this sounds like your object store is not atomic. I'm not sure we should be solving for this. What object store are you using? |
|
Hi @dimitarvdimitrov we are using s3. And still we atleast see this behavior in one of store-gateways each time we do a an uprgade to them. |
|
I am wondering if this is a race condition where we acquire a reader object to read the contents of the s3 bucket from store-gateway while compactor updates the block in s3. Now store-gateway already has reader object which tries to read from an updated block which compactor overwrote. What do you think? |
|
I tried to correlate my theory here in the logs. I am trying to find the occurence of compactor writes and store-gateway reads and Index header corrupting in between. |
|
i don't think s3 has this behaviour, i'm not sure what's causing the errors. i'll leave some ideas on the issue |
What this PR does
Adds retry logic with exponential backoff to
ReadIndex()to handle transient bucket index corruption during concurrent reads/writes.Problem:
During store-gateway restarts, the store-gateway's
ReadIndex()call can receive mixed content from both the old and new versions of the file. This could result in corrupted gzip streams or invalid JSON, causingErrIndexCorrupted. When this happens, the store-gateway proceeds with empty metadata, which leads to incorrect block cleanup where valid blocks (and their index headers) are deleted, requiring hours to rebuild.Solution:
ErrIndexCorruptedErrIndexNotFound) fail immediatelyreadIndexAttempt()for clean separationImpact:
This prevents the critical bug where store-gateways delete index headers during rolling restarts, eliminating multi-hour recovery times for large tenants.
Which issue(s) this PR fixes or relates to
Fixes #10649
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX].about-versioning.mdupdated with experimental features.Note
Adds retry with exponential backoff/jitter to
ReadIndex()forErrIndexCorrupted, with logging and tests, refactoring read logic intoreadIndexAttempt().ReadIndex()up to 5 times onErrIndexCorruptedwith exponential backoff and jitter; respects context cancellation.readIndexAttempt().TestReadIndex_ShouldRetryIfIndexIsCorruptedusing acorruptingBucketto simulate transient corruption and verify retry success.Written by Cursor Bugbot for commit ec31777. This will update automatically on new commits. Configure here.