
Conversation

@sarthaktyagi-505 (Contributor) commented Nov 7, 2025

What this PR does

Adds retry logic with exponential backoff to ReadIndex() to handle transient bucket index corruption during concurrent reads/writes.

Problem:
During store-gateway restarts, ReadIndex() can receive mixed content from the old and new versions of the bucket index file. This can produce a corrupted gzip stream or invalid JSON, causing ErrIndexCorrupted. When this happens, the store-gateway proceeds with empty metadata, which leads to incorrect block cleanup in which valid blocks (and their index headers) are deleted, requiring hours to rebuild.

Solution:

  • Implements retry logic (up to 5 attempts) specifically for ErrIndexCorrupted
  • Uses exponential backoff with jitter (500ms, 1s, 2s, 4s, 8s)
  • Only retries on corruption errors; other errors (e.g., ErrIndexNotFound) fail immediately
  • Respects context cancellation during retry delays
  • Refactored read logic into readIndexAttempt() for clean separation (a sketch of the retry flow follows this list)
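
For illustration, here is a minimal sketch of the retry flow described above, not the PR's actual code. The Index type, the free-function signature, and the hand-rolled backoff are simplifications: the real readIndexAttempt() reads and decodes the index from the bucket, and Mimir has its own backoff utilities.

```go
package bucketindex

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// ErrIndexCorrupted mirrors the sentinel error named in the PR description.
var ErrIndexCorrupted = errors.New("bucket index corrupted")

// Index stands in for the parsed bucket index type.
type Index struct{}

const (
	maxReadIndexAttempts = 5
	initialBackoff       = 500 * time.Millisecond
)

// ReadIndex retries the single-attempt read only on ErrIndexCorrupted,
// doubling the backoff each time (500ms, 1s, 2s, 4s, 8s) with jitter,
// and aborting early if the context is cancelled.
func ReadIndex(ctx context.Context, readIndexAttempt func(context.Context) (*Index, error)) (*Index, error) {
	backoff := initialBackoff
	var lastErr error

	for attempt := 1; attempt <= maxReadIndexAttempts; attempt++ {
		idx, err := readIndexAttempt(ctx)
		if err == nil {
			return idx, nil
		}
		// Other errors (e.g. ErrIndexNotFound) are not transient: fail immediately.
		if !errors.Is(err, ErrIndexCorrupted) {
			return nil, err
		}
		lastErr = err

		if attempt == maxReadIndexAttempts {
			break
		}
		// Wait with jitter before the next attempt, respecting cancellation.
		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		backoff *= 2
	}
	return nil, lastErr
}
```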

Impact:
This prevents the critical bug where store-gateways delete index headers during rolling restarts, eliminating multi-hour recovery times for large tenants.

Which issue(s) this PR fixes or relates to

Fixes #10649

Checklist

  • Tests updated.
  • Documentation added.
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Note

Adds retry with exponential backoff/jitter to ReadIndex() for ErrIndexCorrupted, with logging and tests, refactoring read logic into readIndexAttempt().

  • Bucket Index Read:
    • Retries ReadIndex() up to 5 times on ErrIndexCorrupted with exponential backoff and jitter; respects context cancellation.
    • Adds structured logging for retry attempts and final failure/success.
    • Refactors single read flow into readIndexAttempt().
  • Tests:
    • Adds TestReadIndex_ShouldRetryIfIndexIsCorrupted using a corruptingBucket to simulate transient corruption and verify retry success (a simplified sketch follows).
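
The test idea can be paraphrased as a rough sketch building on the previous snippet. In the actual PR, corruptingBucket wraps an object-store bucket; here a simple function wrapper stands in for it, and the backoff delays are real, so the production test presumably injects a much shorter schedule.

```go
package bucketindex

import (
	"context"
	"testing"
)

// corruptingReader simulates transient corruption: the first `failures`
// attempts fail with ErrIndexCorrupted, then the real payload is returned.
type corruptingReader struct {
	failures int
	payload  *Index
}

func (c *corruptingReader) read(_ context.Context) (*Index, error) {
	if c.failures > 0 {
		c.failures--
		return nil, ErrIndexCorrupted
	}
	return c.payload, nil
}

func TestReadIndex_ShouldRetryIfIndexIsCorrupted(t *testing.T) {
	want := &Index{}
	r := &corruptingReader{failures: 2, payload: want}

	got, err := ReadIndex(context.Background(), r.read)
	if err != nil {
		t.Fatalf("expected the retried read to succeed, got: %v", err)
	}
	if got != want {
		t.Fatal("expected the decoded index to be returned after retries")
	}
}
```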

Written by Cursor Bugbot for commit ec31777. This will update automatically on new commits.

@sarthaktyagi-505 sarthaktyagi-505 requested a review from a team as a code owner November 7, 2025 10:16
@sarthaktyagi-505 sarthaktyagi-505 changed the title from "add retry on index header corruption to avoid index headers getting deleted from store-gateway" to "add retry for index header corruption to avoid index headers getting deleted from store-gateway" Nov 7, 2025
@dimitarvdimitrov (Contributor)

This sounds like your object store is not atomic. I'm not sure we should be solving for this. What object store are you using?

@sarthaktyagi-505 (Contributor, Author) commented Nov 7, 2025

Hi @dimitarvdimitrov, we are using S3, and we still see this behavior in at least one of the store-gateways each time we upgrade them.

@sarthaktyagi-505 (Contributor, Author)

I am wondering if this is a race condition: the store-gateway acquires a reader object to read the contents of the S3 bucket while the compactor updates the block in S3. The store-gateway, already holding the reader object, then tries to read from an updated block that the compactor overwrote. What do you think?

@sarthaktyagi-505 (Contributor, Author)

I tried to corroborate my theory in the logs below: I am looking for occurrences of compactor writes and store-gateway reads with the index corruption happening in between.

```
ts=2025-10-23T14:53:32.549103977Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231212 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:32.6773033Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231212 task=clean_up_users user=continuous-testing msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:33.340786921Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231212 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:53:33.340759764Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231212 task=clean_up_users user=continuous-testing msg="completed blocks cleanup and maintenance" duration=663.463354ms
ts=2025-10-23T14:53:45.905558752Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231225 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:45.956984693Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231225 task=clean_up_users user=default msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:56.654133097Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231225 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:53:56.654086899Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231225 task=clean_up_users user=default msg="completed blocks cleanup and maintenance" duration=10.697107941s
ts=2025-10-23T14:55:59.576588484Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231359 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:55:59.539423965Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231359 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:57:21.559690605Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231441 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:57:21.81997276Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231441 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:58:24.228715924Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231504 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:58:24.291902844Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231504 task=clean_up_users user=admiral msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:58:25.335444455Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231504 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:58:25.335397303Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231504 task=clean_up_users user=admiral msg="completed blocks cleanup and maintenance" duration=1.043503944s
ts=2025-10-23T15:01:57.359079181Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231717 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T15:01:57.319202838Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231717 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:59:17.595252792Z caller=bucket_index_metadata_fetcher.go:84 level=error msg="corrupted bucket index found" user=default err="bucket index corrupted"
ts=2025-10-23T14:55:29.422508252Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=periodic
ts=2025-10-23T14:55:41.203238979Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=periodic
ts=2025-10-23T14:59:11.981672633Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:12.889765727Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:17.531405709Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:40.319443325Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=ring-change
ts=2025-10-23T15:01:10.578276572Z caller=bucket_stores.go:195 level=info msg="synchronizing TSDB blocks for all users"
ts=2025-10-23T15:00:17.322478469Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K86MWFR0N8SSZ8N2F49HNXFG
ts=2025-10-23T15:00:17.29033733Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K860FK1EX1GRXRQDM8ZBV3Y1
ts=2025-10-23T15:00:17.285231156Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K872X6AM742DC5815SF540VC
ts=2025-10-23T15:00:17.280347362Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K86W5W3SR68KZAXDTFP1VXVM
```

@dimitarvdimitrov (Contributor)

I don't think S3 has this behaviour; I'm not sure what's causing the errors. I'll leave some ideas on the issue.
