Tips for slow compact on a large bucket with large blocks #4310

Closed
sevagh opened this issue Jun 4, 2021 · 5 comments

@sevagh
Contributor

sevagh commented Jun 4, 2021

Hello,

I use Thanos with a rather large bucket (Ceph object store), 10TB total. I store metrics at raw resolution with 30 days of retention and downsampling disabled.

Here's my systemd daemon for it:

/opt/prometheus2/thanos compact \
        --data-dir=/thanos-compact \
        --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml \
        --http-address=0.0.0.0:10910 \
        --retention.resolution-raw=30d \
        --retention.resolution-5m=1m \
        --retention.resolution-1h=1m \
        --downsampling.disable \
        --delete-delay=0s \
        --wait

The daemon runs in continual (--wait) mode, but recently it has been slow to complete its compaction runs: it doesn't reach the "delete blocks" part of a run until it's too late and the storage bucket is overflowing (related issue: #2605).

I recently ran a compaction without the --wait mode, just to see how long a single run takes, and it's been 9 days so far without any deletions.

I have a locally compiled binary of Thanos which only deletes blocks already marked for deletion, described here: #2605 (comment)

One thing I can do is the following (roughly sketched below the list):

  1. Run compact for several days
  2. Interrupt/exit it early (this should be safe to do, right?)
  3. Run custom compact-deleter binary
  4. Resume compactor
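
A rough sketch of that cycle, assuming the compactor runs as a systemd unit called thanos-compact and that the deleter binary sits next to the stock one (both names are placeholders for my setup; newer Thanos releases may also ship a thanos tools bucket cleanup subcommand that deletes blocks already marked for deletion, which could replace the custom binary):

    # stop the long-running compactor (placeholder unit name)
    systemctl stop thanos-compact

    # delete blocks that previous runs already marked for deletion,
    # using the same bucket config as the compactor (placeholder binary path)
    /opt/prometheus2/thanos-compact-deleter \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml

    # resume normal compaction
    systemctl start thanos-compact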

What I'm looking for are tips or solutions on how I can make this better:

  • Run multiple compactors. I understand the current design has the compactor as a singleton for multiple-access safety (and I'm not a distributed systems engineer, so I don't know the challenges of keeping multiple compactors from overwriting each other's data).
  • Bump up compact concurrency (probably an easy win, since I haven't tried that yet; see the sketch after this list): https://thanos.io/tip/components/compact.md/#cpu
  • There is a label-sharding link for scaling the compactor, but the link is dead: https://thanos.io/tip/sharding.md#compactor
  • There's also the tip for when single blocks are too large for the compactor: "Limit size of blocks to X bytes on compaction." #3068 - I believe this is the case I'm most likely hitting: we run some very fat single Prometheus instances, so there are many, many metrics per block.
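
For the concurrency point, I'm thinking of just adding the flag from the CPU docs to my existing unit, e.g. (the value 4 is only an example):

    /opt/prometheus2/thanos compact \
            --data-dir=/thanos-compact \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml \
            --http-address=0.0.0.0:10910 \
            --retention.resolution-raw=30d \
            --downsampling.disable \
            --compact.concurrency=4 \
            --wait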

Here are the metrics from the compact instance:
[screenshot: compactor metrics from the running instance]

Any tips would be appreciated, thanks!

@wiardvanrij
Member

The correct link is now: https://thanos.io/tip/thanos/sharding.md/#compactor
About scalability: https://thanos.io/tip/components/compact.md/#scalability

There are various features in the works for improving compactor performance; this umbrella issue tracks them: #4233

The first issue you linked is/should be resolved by #3115.

That said, I'm not sure you are really hitting limits that couldn't already be resolved by tweaking your setup. So I'm curious which Thanos version you are using, and whether you could tell me something about its stats (i.e. CPU and memory usage). Did you also put limits on those resources?
Could you also give us some numbers on the amount of series per 2-hour block? This is basically the only stat that matters in this case; the amount of data in the bucket does not give the right details. However, having millions of series per 2h block would definitely indicate you might be hitting some limits.
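
One way to get those numbers, if your Thanos build has the tools subcommands, is thanos tools bucket inspect, which prints per-block stats including the series count (the config path is just an example):

    # prints a table of blocks with their time range, number of series, samples, chunks, etc.
    thanos tools bucket inspect \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml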

If you want to run multiple compactors, you could look into label sharding. As per the docs: "This allows to assign multiple streams to each instance of compactor."
Since you have to configure this correctly, I'm not going to spell out exactly how to do it; I'd prefer you test this on a TST environment first, so you are sure it works for your setup. Compaction is irreversible if done wrong, hence this needs proper testing.

For example, for the store component one could use a relabel config like this (don't use this for the compactor!):

          --selector.relabel-config=
            - action: hashmod
              source_labels: ["__block_id"]
              target_label: shard
              modulus: {{ $shards }}
            - action: keep
              source_labels: ["shard"]
              regex: {{ $index }}   

Yet this should not be used for the compactor, as it is not 'pinned' to a specific stream; it merely splits all data over multiple shards.

So for the compactor you want some form of relabel config that matches streams by regex, e.g.:

            - action: keep
              regex: my-instance
              source_labels: ["cluster"]

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
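
Purely as an untested sketch of how that plugs into the compactor flags, assuming your blocks carry an external label such as cluster, that you run one compactor instance per value, and that your Thanos version has the --selector.relabel-config-file variant of the flag (label name, value, and file paths below are placeholders):

    # /etc/thanos/compact_relabel.yml - keep only the stream with cluster="my-instance"
    - action: keep
      source_labels: ["cluster"]
      regex: my-instance

    # each compactor instance points at its own relabel file
    thanos compact \
            --data-dir=/thanos-compact \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml \
            --selector.relabel-config-file=/etc/thanos/compact_relabel.yml \
            --wait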

@yeya24
Contributor

yeya24 commented Jun 30, 2021

I am considering the same thing and I think you already gave the answer @sevagh 😄.
Right now the only ways to scale the compactor are:

  1. Add more compaction concurrency
  2. Use hash partitioning (sharding) as mentioned by @wiardvanrij: shard the blocks by some labels that group blocks from the same cluster together
  3. Combine the two (rough sketch below)
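
i.e. something roughly like this on each instance (untested; flag values and file names are examples only):

    thanos compact \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml \
            --compact.concurrency=4 \
            --selector.relabel-config-file=/etc/thanos/compact_relabel.yml \
            --wait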

@sevagh
Contributor Author

sevagh commented Jul 9, 2021

Thanks for the replies. @yeya24 I'm reading this PR you recently got merged, and I think it might help me: https://github.com/thanos-io/thanos/pull/4239/files/15acd8c8683c8ecc785ec71e4c16f89738e839b6#diff-59764a4da653d4464eac20465390033ab8abbd8b54688979727065cb389e848d

One of my issues with Ceph + Thanos is that I have 2x Prometheus pollers in a typical HA setup, and I store 2x copies of each TSDB block (slightly different due to natural differences between the two pollers).

It looks like the offline deduplication you added, with the "penalty" mode intended for HA Prometheus, would shrink my Ceph bucket by roughly 50% by combining these 2x HA blocks?

@yeya24
Contributor

yeya24 commented Jul 10, 2021

> Thanks for the replies. @yeya24 I'm reading this PR you recently got merged, and I think it might help me: https://github.com/thanos-io/thanos/pull/4239/files/15acd8c8683c8ecc785ec71e4c16f89738e839b6#diff-59764a4da653d4464eac20465390033ab8abbd8b54688979727065cb389e848d
>
> One of my issues with Ceph + Thanos is that I have 2x Prometheus pollers in a typical HA setup, and I store 2x copies of each TSDB block (slightly different due to natural differences between the two pollers).
>
> It looks like the offline deduplication you added, with the "penalty" mode intended for HA Prometheus, would shrink my Ceph bucket by roughly 50% by combining these 2x HA blocks?

The penalty mode can reduce the data size in your bucket by a bit less than 50%, about 30% ~ 45% I'd guess, since only the chunk data is deduplicated; the indexes stay the same size compared to regular compaction.
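
For reference, enabling it on the compactor looks roughly like this (a sketch only, based on the flags from that PR; "replica" stands for whatever external label distinguishes your two pollers):

    # offline deduplication of HA pairs during compaction
    thanos compact \
            --objstore.config-file=/etc/prometheus2/thanos_bucket_config.yml \
            --compact.enable-vertical-compaction \
            --deduplication.replica-label=replica \
            --deduplication.func=penalty \
            --wait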

@yeya24
Contributor

yeya24 commented Jul 10, 2021

@sevagh Let me move this to a discussion, since it is really a question rather than an issue.

yeya24 closed this as completed Jul 10, 2021
thanos-io locked and limited conversation to collaborators Jul 10, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
