Description
Is your proposal related to a problem?
I'm working on two Kubernetes jobs which read/write/delete blocks from object storage:
- periodic rewrites of new blocks uploaded to object storage
- running maintenance scripts in case of any corruption that causes the Compactor to halt (which happens quite often :/)
- For the first job, I want to ensure that the Compactor isn't currently running and working on the blocks I want to rewrite at the same time
- For the second job, I want to notify the Compactor that it's fine to carry on
Describe the solution you'd like
It would be great if my tasks could call an endpoint like `://thanos-compactor/suspend` when they start their work.
This would tell the Compactor to immediately drop and forget everything it's doing and go into its halted state.
After everything is done, a call to `://thanos-compactor/resume` would get it running again and out of the halted state, resyncing block information to continue (or rather restart) compaction.
The Compactor should also support storing its suspension status locally, so it stays suspended after being restarted.
These endpoints should be opt-in via a flag (or separate flags?), as this shouldn't be made available without proper monitoring for Compactors that stay halted for too long.
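To make the intended flow concrete, here is a minimal sketch of what one of my jobs could do around its work; the service name `thanos-compactor`, port 10902, HTTP method and response codes are all assumptions, since the endpoints don't exist yet:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// compactorURL is the assumed in-cluster address of the Compactor's HTTP API
// (10902 is the usual Thanos HTTP port; adjust for your setup).
const compactorURL = "http://thanos-compactor:10902"

// callCompactor POSTs to one of the proposed endpoints (/suspend or /resume)
// and fails unless the Compactor acknowledges the request.
func callCompactor(path string) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(compactorURL+path, "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", path, resp.Status)
	}
	return nil
}

func main() {
	// Halt the Compactor before touching blocks it might be working on.
	if err := callCompactor("/suspend"); err != nil {
		panic(err)
	}

	// ... rewrite/delete blocks in object storage here ...

	// Let the Compactor resync and carry on once the maintenance work is done.
	if err := callCompactor("/resume"); err != nil {
		panic(err)
	}
}
```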
Having "Resume" in the UI would be nice as well; not so sure about a "Suspend" button, as it could raise concerns about accidental (or malicious) activations impacting stability and performance. But on the other hand, there is already a --disable-admin-operations
flag.
Describe alternatives you've considered
Putting `thanos compact` in between my scripts
My first idea was to run the Compactor as a job as well, making it easier to perform the operations in sequence.
While this would be possible with a custom image executing some tasks before and after `thanos compact`, it gets more complex when I don't want to execute those tasks on the same schedule.
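Roughly, such a wrapper could look like the sketch below; the `/scripts/pre.sh` and `/scripts/post.sh` paths and the objstore config location are made up for illustration:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command, streams its output, and aborts the job on failure.
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s failed: %v", name, err)
	}
}

func main() {
	// Hypothetical pre task baked into the custom image (e.g. the block rewrites).
	run("/scripts/pre.sh")
	// One compaction pass; without --wait the Compactor exits when it's done.
	run("thanos", "compact", "--objstore.config-file=/etc/thanos/objstore.yml")
	// Hypothetical post task.
	run("/scripts/post.sh")
}
```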
Locking streams or bucket instead of Compactor
`suspend` / `resume` would work in my use case, but I'm only working with one Compactor instance right now, against a simple, non-replicated bucket.
With more than one Compactor instance working on different streams, discovering, suspending, and resuming the correct one might get a bit more involved.
In that case it might be easier to put some lock file into the bucket, which the Compactor could check before every upload.
I haven't thought about it too much, so I'm not sure if this would even work properly in all cases once replication gets involved?
This would also waste more CPU cycles, since the Compactor only learns about the lock at a later point.
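A very rough sketch of what the Compactor-side check could look like; the `compactor.lock` object name and the minimal `Bucket` interface are assumptions for illustration, not existing Thanos code:

```go
// Package bucketlock sketches the lock-file idea described above.
package bucketlock

import (
	"context"
	"errors"
)

// lockObject is a hypothetical object the maintenance jobs would create in the
// bucket while they own the blocks, and delete again when they are done.
const lockObject = "compactor.lock"

// Bucket is a minimal stand-in for an object storage client; only an
// existence check is needed for this sketch.
type Bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
}

// ErrBucketLocked signals that the Compactor should skip (or abort) the upload.
var ErrBucketLocked = errors.New("bucket is locked by a maintenance job")

// CheckLockBeforeUpload is the check the Compactor could run before every
// upload of a compacted block.
func CheckLockBeforeUpload(ctx context.Context, bkt Bucket) error {
	locked, err := bkt.Exists(ctx, lockObject)
	if err != nil {
		return err
	}
	if locked {
		return ErrBucketLocked
	}
	return nil
}
```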
Doing relabeling in Compactor
As described in #4941 (comment), my particular use case of relabeling would be handled even better directly by the Compactor.
Additional context
`/resume` would even be useful without `/suspend`, in cases where it's easier to make a request (or click a button in the UI) than to restart the process, for example when the Compactor isn't running in the cluster.