Compactor: Add retries to block cleaner #8036
Conversation
```
@@ -410,15 +424,39 @@ func (c *BlocksCleaner) cleanUser(ctx context.Context, userID string, userLogger
	}
	}()

	// Read the bucket index.
	retries := backoff.New(ctx, backoff.Config{
		MaxRetries: 5,
```
5 retries seems like a lot, especially if the system is timing out because it's overloaded; in that case we're just making it worse. Maybe 2-3?
```
	})
	var lastErr error

	for retries.Ongoing() {
```
Would it be better to retry the individual operation that failed, rather than the entire `cleanUser` flow?
---
Thank you for agreeing to have this conversation with me.

Ideally `cleanUser` (and other compactor flows?) would be composed of a series of idempotent operations that can be retried individually until the entire series is finished (kinda like an Airflow job, or whatever). But it's complicated because thanos objstore is a plugin architecture of different blob store providers, and they do not always agree on what is idempotent (see `Delete`). So my thought was to just speed up the retry operation that already happens today, until someone had some good ideas about finer-grained retries. (Basically, choosing a "good" fix rather than a "really good" one.)

However:

- It's true that `Delete` is tricky to retry across the blob store providers.
- But looking again at the situation that motivated this PR, 100% of the failures were idempotent things: Gets and Uploads.
- While working on this PR, I found that injecting failures into arbitrary blob store operations and then retrying the whole `cleanUser` series would lead to eventual success, but the blocks cleaner unit tests' assertions about the state of the bucket at the end of the process were not always upheld. This means that `cleanUser` is suspect under partial failures, and we should do fine-grained retries as much as possible.
- This makes me think I should indeed pursue the "RetryingBucket" idea I had, which injects a few retries into all of the idempotent blob store methods (which I think is all of them except `Delete`). `cleanUser` does do deletes, but we could come back to that later. A rough sketch of that idea follows below.

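For illustration only, here is a minimal sketch of what such a wrapper could look like, assuming the thanos-io/objstore `Bucket` interface and the grafana/dskit `backoff` package; the type name, constructor, retry settings, and the choice to wrap only `Get` and `Exists` are hypothetical, not code from any of the linked PRs:

```go
package retrybucket

import (
	"context"
	"io"

	"github.com/grafana/dskit/backoff"
	"github.com/thanos-io/objstore"
)

// retryingBucket wraps an objstore.Bucket and retries a couple of its
// idempotent read operations. All other methods (including Delete) pass
// through unchanged via the embedded Bucket.
type retryingBucket struct {
	objstore.Bucket
	cfg backoff.Config
}

func NewRetryingBucket(b objstore.Bucket, cfg backoff.Config) objstore.Bucket {
	return &retryingBucket{Bucket: b, cfg: cfg}
}

// Get retries the underlying Get until it succeeds, the context is done, or
// the retry budget is exhausted.
func (r *retryingBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	var (
		rc      io.ReadCloser
		lastErr error
	)
	retries := backoff.New(ctx, r.cfg)
	for retries.Ongoing() {
		rc, lastErr = r.Bucket.Get(ctx, name)
		if lastErr == nil {
			return rc, nil
		}
		// "Not found" is a terminal answer, not a transient failure.
		if r.Bucket.IsObjNotFoundErr(lastErr) {
			return nil, lastErr
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return nil, lastErr
}

// Exists is likewise idempotent and safe to retry.
func (r *retryingBucket) Exists(ctx context.Context, name string) (bool, error) {
	var lastErr error
	retries := backoff.New(ctx, r.cfg)
	for retries.Ongoing() {
		ok, err := r.Bucket.Exists(ctx, name)
		if err == nil {
			return ok, nil
		}
		lastErr = err
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return false, lastErr
}
```

Embedding the wrapped `Bucket` keeps every method that is not explicitly overridden, including `Delete`, on its existing non-retrying path.
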
---
> pursue the "RetryingBucket" idea

I'll workshop that in a separate PR and close this one if it works out.
---
Something like a `RetryingBucket` implies retries are done internally to the bucket, but even if we have to retry bucket operations externally, my comment was just about whether it's better to retry only the operation that failed rather than all of the operations in `cleanUser`. Your point about idempotency makes sense, though.
---
Just to add one more cook: is there a third option, adding a retry mechanism to each of the high-level actions of the compactor cleanup (`ReadIndex`, `UpdateIndex`, `deleteRemainingData`, and `WriteIndex`)?
---
I prototyped both.

- RetryingBucket: {WIP} Make blocks cleaner more resilient by adding a retrying objstore.Bucket. #8052
- higher-level (but not that high) operation retry: Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index #8071

The retryingBucket is kinda neat, but ultimately it is not a great abstraction: you have to think about how much time you're going to give to specific operations, but then store that information at the client layer and pass it into some function that is going to do 4 different operations across 30 different blob store objects. Kind of an impedance mismatch. I think the second one there is the one we should go with.

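As a rough, hypothetical illustration of what retrying a single high-level step could look like (not the actual code from #8071), a small helper that wraps one operation in a bounded backoff loop, again using dskit's `backoff` package; the helper name, the settings, and the `readBucketIndex`/`userBucket` names in the usage comment are all made up:

```go
package cleaner

import (
	"context"
	"time"

	"github.com/grafana/dskit/backoff"
)

// retryStep runs one high-level cleanup step (e.g. reading or rewriting the
// bucket index) until it succeeds, the context is cancelled, or the retry
// budget is exhausted, and returns the last error seen.
func retryStep(ctx context.Context, step func(context.Context) error) error {
	retries := backoff.New(ctx, backoff.Config{
		MinBackoff: 1 * time.Second, // illustrative values
		MaxBackoff: 10 * time.Second,
		MaxRetries: 3,
	})
	var lastErr error
	for retries.Ongoing() {
		if lastErr = step(ctx); lastErr == nil {
			return nil
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return lastErr
}

// Usage (names are placeholders for the real cleanup steps):
//
//	err := retryStep(ctx, func(ctx context.Context) error {
//		var err error
//		idx, err = readBucketIndex(ctx, userBucket)
//		return err
//	})
```

The appeal of this granularity is that the retry budget sits right next to the operation it protects, instead of being configured once at the bucket-client layer, which is the impedance mismatch described above.
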
What this PR does

This adds some (five) retries to the blocks cleaner's `cleanUser` routine, to minimize the chances that a single tenant's cleanup loop fails enough consecutive times to affect queryability.

When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one `cleanUser` pass. Most of these steps involve an objstore invocation (fetching a block index, iterating the paths under a block folder, deleting a marker, ...). In this series of steps, there are currently two avenues for "retries":

We are currently relying on Avenue 2 to eventually recover from past block cleaner failures. But the crux of a recent incident was that every step in `cleanUser` must complete for the updated bucket index to be written. If `cleanUser` fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger-than-usual percentage of objstore calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive `cleanUser` failures leading to a >=1 hour stale bucket index.
As for improving it, this PR "just" wraps `cleanUser` with five retries.
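Consolidated from the fragments in the diff excerpt at the top of this page, the retry wrapper amounts to roughly the following sketch; the `MinBackoff`/`MaxBackoff` values and the `cleanPass` callback are illustrative, and exactly where the loop sits relative to `cleanUser` is simplified here:

```go
package cleaner

import (
	"context"
	"time"

	"github.com/grafana/dskit/backoff"
)

// cleanUserWithRetries retries one whole per-tenant cleanup pass a bounded
// number of times, returning the last error if every attempt fails.
func cleanUserWithRetries(ctx context.Context, cleanPass func(context.Context) error) error {
	retries := backoff.New(ctx, backoff.Config{
		MinBackoff: 10 * time.Second, // illustrative
		MaxBackoff: time.Minute,      // illustrative
		MaxRetries: 5,                // as in the diff above
	})
	var lastErr error
	for retries.Ongoing() {
		if lastErr = cleanPass(ctx); lastErr == nil {
			return nil
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return lastErr
}
```
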
The positives:

The shortcomings:

Other ideas:

The individual obj store APIs and provider backends (GCS, minio, ...) have different ideas and requirements for idempotency. So writing something like a "retrying bucket" layer in the objstore client hierarchy, while appealing, might be a lot more work (or less tractable) than it seems.
see also:

Which issue(s) this PR fixes or relates to

Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.