Compactor: Add retries to block cleaner #8036
Conversation
```
@@ -410,15 +424,39 @@ func (c *BlocksCleaner) cleanUser(ctx context.Context, userID string, userLogger
	}
	}()

	// Read the bucket index.
	retries := backoff.New(ctx, backoff.Config{
		MaxRetries: 5,
```
5 retries seems like a lot, especially if the system is timing out because it's overloaded; in that case we're just making it worse. Maybe 2-3?
```
	})
	var lastErr error

	for retries.Ongoing() {
```
Would it be better to retry the individual operation that failed, rather than the entire `cleanUser` flow?
---
Thank you for agreeing to have this conversation with me.

Ideally `cleanUser` (and other compactor flows?) would be composed of a series of idempotent operations that can be retried individually until the entire series is finished (kinda like an Airflow job, or whatever). But it's complicated because thanos objstore is a plugin architecture of different blob store providers, and they do not always agree on what is idempotent (see `Delete`). So my thought was to just speed up the retry operation that already happens today, until someone had some good ideas about finer-grained retries. (Basically, choosing a "good" fix rather than a "really good" one.)

However:

- It's true that `Delete` is tricky to retry across the blob store providers.
- But looking again at the situation that motivated this PR, 100% of the failures were idempotent things: Gets and Uploads.
- While working on this PR, I found that injecting failures into arbitrary blob store operations and then retrying the whole `cleanUser` series would lead to eventual success, but the blocks cleaner unit tests' assertions about the state of the bucket at the end of the process were not always upheld. This means that `cleanUser` is suspect under partial failures, and we should do fine-grained retries as much as possible.
- This makes me think I should indeed pursue the "RetryingBucket" idea I had, which injects a few retries into all of the idempotent blob store methods (which I think is all of them except `Delete`). `cleanUser` does do deletes, but we could come back to that later. A rough sketch of that idea follows below.

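For illustration only, here is a minimal sketch of what such a wrapper could look like, assuming the thanos-io/objstore `Bucket` interface and the grafana/dskit `backoff` package; the type name, constructor, retry settings, and the choice to wrap only `Get` and `Exists` are hypothetical, not code from any of the linked PRs:

```go
package retrybucket

import (
	"context"
	"io"

	"github.com/grafana/dskit/backoff"
	"github.com/thanos-io/objstore"
)

// retryingBucket wraps an objstore.Bucket and retries a couple of its
// idempotent read operations. All other methods (including Delete) pass
// through unchanged via the embedded Bucket.
type retryingBucket struct {
	objstore.Bucket
	cfg backoff.Config
}

func NewRetryingBucket(b objstore.Bucket, cfg backoff.Config) objstore.Bucket {
	return &retryingBucket{Bucket: b, cfg: cfg}
}

// Get retries the underlying Get until it succeeds, the context is done, or
// the retry budget is exhausted.
func (r *retryingBucket) Get(ctx context.Context, name string) (io.ReadCloser, error) {
	var (
		rc      io.ReadCloser
		lastErr error
	)
	retries := backoff.New(ctx, r.cfg)
	for retries.Ongoing() {
		rc, lastErr = r.Bucket.Get(ctx, name)
		if lastErr == nil {
			return rc, nil
		}
		// "Not found" is a terminal answer, not a transient failure.
		if r.Bucket.IsObjNotFoundErr(lastErr) {
			return nil, lastErr
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return nil, lastErr
}

// Exists is likewise idempotent and safe to retry.
func (r *retryingBucket) Exists(ctx context.Context, name string) (bool, error) {
	var lastErr error
	retries := backoff.New(ctx, r.cfg)
	for retries.Ongoing() {
		ok, err := r.Bucket.Exists(ctx, name)
		if err == nil {
			return ok, nil
		}
		lastErr = err
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return false, lastErr
}
```

Embedding the wrapped `Bucket` keeps every method that is not explicitly overridden, including `Delete`, on its existing non-retrying path.
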
---
> pursue the "RetryingBucket" idea

I'll workshop that in a separate PR and close this one if it works out.
---
Something like a `RetryingBucket` implies retries are done internally to the bucket, but even if we have to retry bucket operations externally, my comment was just about whether it's better to retry only the operation that failed rather than all of the operations in `cleanUser`. Your point about idempotency makes sense, though.
---
Just to add one more cook: is there a third option, adding a retry mechanism to each of the high-level actions of the compactor cleanup (`ReadIndex`, `UpdateIndex`, `deleteRemainingData`, and `WriteIndex`)?
---
I prototyped both.

- RetryingBucket: {WIP} Make blocks cleaner more resilient by adding a retrying objstore.Bucket. #8052
- higher-level (but not that high) operation retry: Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index #8071

The retryingBucket is kinda neat, but ultimately it is not a great abstraction: you have to think about how much time you're going to give to specific operations, but then store that information at the client layer and pass it into some function that is going to do 4 different operations across 30 different blob store objects. Kind of an impedance mismatch. I think the second one there is the one we should go with.

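As a rough, hypothetical illustration of what retrying a single high-level step could look like (not the actual code from #8071), a small helper that wraps one operation in a bounded backoff loop, again using dskit's `backoff` package; the helper name, the settings, and the `readBucketIndex`/`userBucket` names in the usage comment are all made up:

```go
package cleaner

import (
	"context"
	"time"

	"github.com/grafana/dskit/backoff"
)

// retryStep runs one high-level cleanup step (e.g. reading or rewriting the
// bucket index) until it succeeds, the context is cancelled, or the retry
// budget is exhausted, and returns the last error seen.
func retryStep(ctx context.Context, step func(context.Context) error) error {
	retries := backoff.New(ctx, backoff.Config{
		MinBackoff: 1 * time.Second, // illustrative values
		MaxBackoff: 10 * time.Second,
		MaxRetries: 3,
	})
	var lastErr error
	for retries.Ongoing() {
		if lastErr = step(ctx); lastErr == nil {
			return nil
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return lastErr
}

// Usage (names are placeholders for the real cleanup steps):
//
//	err := retryStep(ctx, func(ctx context.Context) error {
//		var err error
//		idx, err = readBucketIndex(ctx, userBucket)
//		return err
//	})
```

The appeal of this granularity is that the retry budget sits right next to the operation it protects, instead of being configured once at the bucket-client layer, which is the impedance mismatch described above.
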
What this PR does

This adds some (five) retries to the blocks cleaner's `cleanUser` routine, to minimize the chances that a single tenant's cleanup loop fails enough consecutive times to affect queryability.

When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one `cleanUser` pass. Most of these steps involve an objstore invocation (fetching a block index, iterating the paths under a block folder, deleting a marker, ...). In this series of steps, there are currently two avenues for "retries":

We are currently relying on Avenue 2 to eventually recover from past block cleaner failures. But the crux of a recent incident was that every step in `cleanUser` must complete for the updated bucket index to be written. If `cleanUser` fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger-than-usual percentage of objstore calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive `cleanUser` failures leading to a >=1 hour stale bucket index.
As for improving it, this PR "just" wraps `cleanUser` with five retries.
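Consolidated from the fragments in the diff excerpt at the top of this page, the retry wrapper amounts to roughly the following sketch; the `MinBackoff`/`MaxBackoff` values and the `cleanPass` callback are illustrative, and exactly where the loop sits relative to `cleanUser` is simplified here:

```go
package cleaner

import (
	"context"
	"time"

	"github.com/grafana/dskit/backoff"
)

// cleanUserWithRetries retries one whole per-tenant cleanup pass a bounded
// number of times, returning the last error if every attempt fails.
func cleanUserWithRetries(ctx context.Context, cleanPass func(context.Context) error) error {
	retries := backoff.New(ctx, backoff.Config{
		MinBackoff: 10 * time.Second, // illustrative
		MaxBackoff: time.Minute,      // illustrative
		MaxRetries: 5,                // as in the diff above
	})
	var lastErr error
	for retries.Ongoing() {
		if lastErr = cleanPass(ctx); lastErr == nil {
			return nil
		}
		retries.Wait()
	}
	if lastErr == nil {
		lastErr = retries.Err()
	}
	return lastErr
}
```
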
The positives:

The shortcomings:

Other ideas:

The individual obj store APIs and provider backends (GCS, minio, ...) have different ideas and requirements for idempotency. So writing something like a "retrying bucket" layer in the objstore client hierarchy, while appealing, might be a lot more work (or less tractable) than it seems.
see also:

Which issue(s) this PR fixes or relates to

Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.