
Conversation

@sarthaktyagi-505 (Contributor) commented Nov 7, 2025

What this PR does

Adds retry logic with exponential backoff to ReadIndex() to handle transient bucket index corruption during concurrent reads/writes.

Problem:
During store-gateway restarts, ReadIndex() can receive mixed content from the old and new versions of the bucket index file. This can produce a corrupted gzip stream or invalid JSON, causing ErrIndexCorrupted. When this happens, the store-gateway proceeds with empty metadata, which leads to incorrect block cleanup in which valid blocks (and their index headers) are deleted, requiring hours to rebuild.

Solution:

  • Implements retry logic (up to 5 attempts) specifically for ErrIndexCorrupted
  • Uses exponential backoff with jitter (500ms, 1s, 2s, 4s, 8s)
  • Only retries on corruption errors; other errors (e.g., ErrIndexNotFound) fail immediately
  • Respects context cancellation during retry delays
  • Refactored read logic into readIndexAttempt() for clean separation (a sketch of the retry flow follows this list)
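
For illustration, here is a minimal sketch of the retry flow described above, not the PR's actual code. The Index type, the free-function signature, and the hand-rolled backoff are simplifications: the real readIndexAttempt() reads and decodes the index from the bucket, and Mimir has its own backoff utilities.

```go
package bucketindex

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// ErrIndexCorrupted mirrors the sentinel error named in the PR description.
var ErrIndexCorrupted = errors.New("bucket index corrupted")

// Index stands in for the parsed bucket index type.
type Index struct{}

const (
	maxReadIndexAttempts = 5
	initialBackoff       = 500 * time.Millisecond
)

// ReadIndex retries the single-attempt read only on ErrIndexCorrupted,
// doubling the backoff each time (500ms, 1s, 2s, 4s, 8s) with jitter,
// and aborting early if the context is cancelled.
func ReadIndex(ctx context.Context, readIndexAttempt func(context.Context) (*Index, error)) (*Index, error) {
	backoff := initialBackoff
	var lastErr error

	for attempt := 1; attempt <= maxReadIndexAttempts; attempt++ {
		idx, err := readIndexAttempt(ctx)
		if err == nil {
			return idx, nil
		}
		// Other errors (e.g. ErrIndexNotFound) are not transient: fail immediately.
		if !errors.Is(err, ErrIndexCorrupted) {
			return nil, err
		}
		lastErr = err

		if attempt == maxReadIndexAttempts {
			break
		}
		// Wait with jitter before the next attempt, respecting cancellation.
		jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		backoff *= 2
	}
	return nil, lastErr
}
```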

Impact:
This prevents the critical bug where store-gateways delete index headers during rolling restarts, eliminating multi-hour recovery times for large tenants.

Which issue(s) this PR fixes or relates to

Fixes #10649

Checklist

  • Tests updated.
  • Documentation added.
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Note

Adds retry with exponential backoff/jitter to ReadIndex() for ErrIndexCorrupted, with logging and tests, refactoring read logic into readIndexAttempt().

  • Bucket Index Read:
    • Retries ReadIndex() up to 5 times on ErrIndexCorrupted with exponential backoff and jitter; respects context cancellation.
    • Adds structured logging for retry attempts and final failure/success.
    • Refactors single read flow into readIndexAttempt().
  • Tests:
    • Adds TestReadIndex_ShouldRetryIfIndexIsCorrupted using a corruptingBucket to simulate transient corruption and verify retry success (a simplified sketch follows).
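
The test idea can be paraphrased as a rough sketch building on the previous snippet. In the actual PR, corruptingBucket wraps an object-store bucket; here a simple function wrapper stands in for it, and the backoff delays are real, so the production test presumably injects a much shorter schedule.

```go
package bucketindex

import (
	"context"
	"testing"
)

// corruptingReader simulates transient corruption: the first `failures`
// attempts fail with ErrIndexCorrupted, then the real payload is returned.
type corruptingReader struct {
	failures int
	payload  *Index
}

func (c *corruptingReader) read(_ context.Context) (*Index, error) {
	if c.failures > 0 {
		c.failures--
		return nil, ErrIndexCorrupted
	}
	return c.payload, nil
}

func TestReadIndex_ShouldRetryIfIndexIsCorrupted(t *testing.T) {
	want := &Index{}
	r := &corruptingReader{failures: 2, payload: want}

	got, err := ReadIndex(context.Background(), r.read)
	if err != nil {
		t.Fatalf("expected the retried read to succeed, got: %v", err)
	}
	if got != want {
		t.Fatal("expected the decoded index to be returned after retries")
	}
}
```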

Written by Cursor Bugbot for commit ec31777. This will update automatically on new commits.

@sarthaktyagi-505 sarthaktyagi-505 requested a review from a team as a code owner November 7, 2025 10:16
@sarthaktyagi-505 sarthaktyagi-505 changed the title from "add retry on index header corruption to avoid index headers getting deleted from store-gateway" to "add retry for index header corruption to avoid index headers getting deleted from store-gateway" Nov 7, 2025
@dimitarvdimitrov (Contributor)

This sounds like your object store is not atomic. I'm not sure we should be solving for this. What object store are you using?

@sarthaktyagi-505 (Contributor, Author) commented Nov 7, 2025

Hi @dimitarvdimitrov, we are using S3, and we still see this behavior in at least one of the store-gateways each time we upgrade them.

@sarthaktyagi-505 (Contributor, Author)

I am wondering if this is a race condition: the store-gateway acquires a reader object to read the contents of the S3 bucket while the compactor updates the block in S3. The store-gateway, already holding the reader object, then tries to read from an updated block that the compactor overwrote. What do you think?

@sarthaktyagi-505 (Contributor, Author)

I tried to corroborate my theory in the logs below: I am looking for occurrences of compactor writes and store-gateway reads with the index corruption happening in between.

```
ts=2025-10-23T14:53:32.549103977Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231212 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:32.6773033Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231212 task=clean_up_users user=continuous-testing msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:33.340786921Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231212 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:53:33.340759764Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231212 task=clean_up_users user=continuous-testing msg="completed blocks cleanup and maintenance" duration=663.463354ms
ts=2025-10-23T14:53:45.905558752Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231225 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:45.956984693Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231225 task=clean_up_users user=default msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:53:56.654133097Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231225 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:53:56.654086899Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231225 task=clean_up_users user=default msg="completed blocks cleanup and maintenance" duration=10.697107941s
ts=2025-10-23T14:55:59.576588484Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231359 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:55:59.539423965Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231359 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:57:21.559690605Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231441 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:57:21.81997276Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231441 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:58:24.228715924Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231504 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:58:24.291902844Z caller=blocks_cleaner.go:432 level=info component=cleaner run_id=1761231504 task=clean_up_users user=admiral msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:58:25.335444455Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231504 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T14:58:25.335397303Z caller=blocks_cleaner.go:437 level=info component=cleaner run_id=1761231504 task=clean_up_users user=admiral msg="completed blocks cleanup and maintenance" duration=1.043503944s
ts=2025-10-23T15:01:57.359079181Z caller=blocks_cleaner.go:243 level=info component=cleaner run_id=1761231717 task=clean_up_users msg="successfully completed blocks cleanup and maintenance"
ts=2025-10-23T15:01:57.319202838Z caller=blocks_cleaner.go:237 level=info component=cleaner run_id=1761231717 task=clean_up_users msg="started blocks cleanup and maintenance"
ts=2025-10-23T14:59:17.595252792Z caller=bucket_index_metadata_fetcher.go:84 level=error msg="corrupted bucket index found" user=default err="bucket index corrupted"
ts=2025-10-23T14:55:29.422508252Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=periodic
ts=2025-10-23T14:55:41.203238979Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=periodic
ts=2025-10-23T14:59:11.981672633Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:12.889765727Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:17.531405709Z caller=gateway.go:327 level=info msg="synchronizing TSDB blocks for all users" reason=ring-change
ts=2025-10-23T14:59:40.319443325Z caller=gateway.go:333 level=info msg="successfully synchronized TSDB blocks for all users" reason=ring-change
ts=2025-10-23T15:01:10.578276572Z caller=bucket_stores.go:195 level=info msg="synchronizing TSDB blocks for all users"
ts=2025-10-23T15:00:17.322478469Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K86MWFR0N8SSZ8N2F49HNXFG
ts=2025-10-23T15:00:17.29033733Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K860FK1EX1GRXRQDM8ZBV3Y1
ts=2025-10-23T15:00:17.285231156Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K872X6AM742DC5815SF540VC
ts=2025-10-23T15:00:17.280347362Z caller=bucket.go:352 level=info user=default msg="dropped outdated block" block=01K86W5W3SR68KZAXDTFP1VXVM
```

@dimitarvdimitrov (Contributor)

I don't think S3 has this behaviour; I'm not sure what's causing the errors. I'll leave some ideas on the issue.
