
"suspend" and "resume" endpoint for Compactor #8141

Open
@outofrange

Description


Is your proposal related to a problem?

I'm working on two Kubernetes jobs which read/write/delete blocks from object storage:

  1. periodic rewrites on new blocks uploaded to object storage
  2. running maintenance scripts in case of any corruption that causes the Compactor to halt (which happens quite often :/)
  • For the first job, I want to ensure that the Compactor isn't currently running and working on the blocks I want to rewrite at the same time
  • For the second job, I want to notify the Compactor that it's fine to carry on

Describe the solution you'd like

It would be great if my tasks could call an endpoint like ://thanos-compactor/suspend when they are starting their work.
This would tell the Compactor to immediately drop & forget everything it's doing and go into its halted state.

After everything is done, a call to ://thanos-compactor/resume would get it running again, leaving the halted state and resyncing block information to continue (or rather restart) compaction.

The Compactor should store its suspension status locally, so it stays suspended after getting restarted.
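
For illustration only, here is a rough sketch of what these endpoints could look like (this is not existing Thanos code; the handler names, state file path, and port are all made up for the sketch):

```go
package main

import (
	"net/http"
	"os"
	"sync/atomic"
)

// Hypothetical path for the persisted suspension state proposed above.
const stateFile = "/var/lib/thanos/compactor.suspended"

var suspended atomic.Bool

func init() {
	// Restore the suspension state across restarts.
	if _, err := os.Stat(stateFile); err == nil {
		suspended.Store(true)
	}
}

// suspendHandler would drop current work and move the Compactor into its halted state.
func suspendHandler(w http.ResponseWriter, _ *http.Request) {
	suspended.Store(true)
	_ = os.WriteFile(stateFile, nil, 0o644) // persist so a restart stays suspended
	// A real implementation would also cancel the running compaction here.
	w.WriteHeader(http.StatusAccepted)
}

// resumeHandler would leave the halted state and trigger a fresh block sync.
func resumeHandler(w http.ResponseWriter, _ *http.Request) {
	suspended.Store(false)
	_ = os.Remove(stateFile)
	// A real implementation would signal the compaction loop to resync and restart.
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/suspend", suspendHandler)
	http.HandleFunc("/resume", resumeHandler)
	_ = http.ListenAndServe(":10902", nil)
}
```

The actual wiring into the compaction loop would of course be more involved; this only shows the shape of the state handling and persistence.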

These endpoints should be opt-in via a flag (or separate ones?), as this shouldn't be made available without proper monitoring for Compactors that stay halted for too long.

Having "Resume" in the UI would be nice as well; not so sure about a "Suspend" button, as it could raise concerns about accidental (or malicious) activations impacting stability and performance. But on the other hand, there is already a --disable-admin-operations flag.

Describe alternatives you've considered

Putting thanos compact in between my scripts

My first idea was to run Compactor as a job as well, making it easier to perform the operations in sequence.
While this would be possible with a custom image executing some tasks before and after thanos compact, it gets more complex when I don't want to execute those tasks on the same schedule.

Locking streams or bucket instead of Compactor

suspend/resume would work in my use case, but I'm only working with one Compactor instance right now, against a simple, non-replicated bucket.

With more than one Compactor instance working on different streams, discovering, suspending, and resuming the correct one might get a bit more involved.
In that case it might be easier to put a lock file into the bucket, which the Compactor could check before every upload.
I haven't thought about it too much, so I'm not sure whether this would even work properly in all cases once replication gets involved.

This would also waste more CPU cycles, since the Compactor would only learn about the lock at a later point.
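
A minimal sketch of the lock-file idea, using a toy bucket interface rather than the real Thanos objstore API, and a made-up lock object name:

```go
package main

import (
	"context"
	"fmt"
)

// Bucket is a minimal stand-in for an object storage client; the real
// Thanos objstore interface is larger and not used here.
type Bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
}

// memBucket is a toy in-memory bucket, just to make the sketch runnable.
type memBucket map[string]struct{}

func (b memBucket) Exists(_ context.Context, name string) (bool, error) {
	_, ok := b[name]
	return ok, nil
}

// lockObject is a hypothetical object name; nothing in Thanos defines it today.
const lockObject = "compactor.lock"

// checkLock is what the Compactor could run before every upload:
// refuse to upload while an external job holds the lock.
func checkLock(ctx context.Context, bkt Bucket) error {
	locked, err := bkt.Exists(ctx, lockObject)
	if err != nil {
		return fmt.Errorf("checking %s: %w", lockObject, err)
	}
	if locked {
		return fmt.Errorf("bucket is locked by %s, skipping upload", lockObject)
	}
	return nil
}

func main() {
	bkt := memBucket{lockObject: {}}
	fmt.Println(checkLock(context.Background(), bkt))
}
```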

Doing relabeling in Compactor

As described in #4941 (comment), my particular use case of relabeling would be handled even better directly by the Compactor.

Additional context

/resume would even be useful without /suspend, in cases where it's easier to make a request (or click a button in the UI) than to restart the process, for example when the Compactor isn't running in the cluster.
