persist: set a timeout on compaction requests #15260
Merged
I'll fill in more details on Monday, but it seems like we're seeing compaction requests occasionally start but never complete. This causes compaction for a shard to jam indefinitely, leading to a domino chain of greater problems, like environmentd crash looping.
Inferring from the data points we do have, but lacking hard evidence, my best guess right now is that some S3 calls got stuck indefinitely. We don't currently set any client timeouts on our calls, so it seems possible that a call could be made to an unreachable destination and never complete. We could add timeouts directly to the S3 client, but smithy-lang/smithy-rs#1740 is upcoming and completely overhauls the timeout structure, and seems much improved over what we could do on today's SDK version.
This PR puts a hard timeout on compaction, so that requests taking longer than 10 minutes are dropped. This should prevent us from getting stuck indefinitely on a long request, at the risk of dropping/spinning on requests that really do need 10 minutes. In practice, the mean compaction time is under a second, but I want to highlight the potential risk here.
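For illustration only, here is a minimal std-only sketch of the "give up after a deadline" idea. It is not the code in this PR: `run_with_timeout` and `CompactionOutcome` are hypothetical names, and it uses a helper thread plus a channel where the real async code path would more likely wrap the compaction future in a timeout. Note that the helper thread is abandoned on timeout, not cancelled.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result type: either the work finished or the deadline passed.
#[derive(Debug, PartialEq)]
enum CompactionOutcome<T> {
    Completed(T),
    TimedOut,
}

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// On timeout the helper thread is NOT cancelled; it is simply no longer
/// waited on, so the caller is free to move on to the next request.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> CompactionOutcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out,
        // so a failed send is expected and ignored.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => CompactionOutcome::Completed(value),
        // Covers both an elapsed deadline and a worker that died
        // without sending; this sketch treats them the same way.
        Err(_) => CompactionOutcome::TimedOut,
    }
}
```

With a limit of `Duration::from_secs(10 * 60)`, a call that never returns would yield `CompactionOutcome::TimedOut` after 10 minutes instead of blocking the caller forever. The trade-off is the one noted above: work that legitimately needs longer than the limit is also dropped.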
Motivation
Tips for reviewer
Checklist
This PR has adequate test coverage / QA involvement has been duly considered.
This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
This PR includes the following user-facing behavior changes: