[compactor] Downsampling error when downsampling deduped block #6696

Open
verejoel opened this issue Sep 4, 2023 · 4 comments

Comments

@verejoel
Contributor

verejoel commented Sep 4, 2023

Thanos, Prometheus and Golang version used: 0.32.2

Object Storage Provider: Azure

What happened: Compactor fails to downsample a block with the following error:

ts=2023-09-04T08:56:35.688668329Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size"

What you expected to happen: The block is downsampled from 5m to 60m resolution without error.

How to reproduce it (as minimally and precisely as possible): This is quite complicated. I will try to describe what happened as precisely as possible:

  • run Thanos for > 1 year without deduplication in the compactors -> many individual streams in the bucket with receive_replica external labels and the same tenant_id external label
  • enable compactor deduplication by setting --deduplication.replica-label=receive_replica
  • enable the penalty algorithm (--deduplication.func=penalty), because compaction of already downsampled blocks was failing, presumably because the assumptions of the one-to-one dedup algorithm did not hold
  • leave to run for 2 days -> all seemed to be working
  • downsampling of the first 5m block that was formed from already downsampled blocks fails with the given error

Full logs to relevant components:

Logs

ts=2023-09-04T08:56:33.626230473Z caller=compact.go:1419 level=info msg="start of GC"
ts=2023-09-04T08:56:33.626417909Z caller=compact.go:1442 level=info msg="start of compactions"
ts=2023-09-04T08:56:33.626660602Z caller=compact.go:1478 level=info msg="compaction iterations done"
ts=2023-09-04T08:56:33.626689005Z caller=compact.go:434 level=info msg="start first pass of downsampling"
ts=2023-09-04T08:56:33.729098296Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=102.795866ms duration_ms=102 cached=146 returned=35 partial=0
ts=2023-09-04T08:56:33.829870428Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=100.653163ms duration_ms=100 cached=146 returned=35 partial=0
ts=2023-09-04T08:56:35.016484486Z caller=downsample.go:362 level=info msg="downloaded block" id=01H9FB4YMTT0BNKVRVQKHNRXDF duration=1.185885634s duration_ms=1185
ts=2023-09-04T08:56:35.688252109Z caller=streamed_block_writer.go:178 level=info msg="finalized downsampled block" mint=1692230401392 maxt=1693440000000 ulid=01H9FPGWFVHQE2CXF8G5DHCFNP resolution=3600000
ts=2023-09-04T08:56:35.688668329Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size"
ts=2023-09-04T08:56:35.688691523Z caller=http.go:91 level=info service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size"
ts=2023-09-04T08:56:35.688735665Z caller=http.go:110 level=info service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size"
ts=2023-09-04T08:56:35.688751953Z caller=intrumentation.go:81 level=info msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size"
ts=2023-09-04T08:56:35.688861491Z caller=main.go:161 level=error err="downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF to window 3600000: downsample aggregate block, series: 1899: invalid size\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:445\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:481\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:480\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:508\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:74\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:480\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"
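For anyone triaging similar logs, here is a minimal sketch of pulling the failing block ID, downsampling window, and series number out of the error text, and of converting the mint/maxt values from the "finalized downsampled block" line into UTC timestamps. The regular expression is derived only from the message quoted in this issue, so other Thanos versions may word the error differently. The helper names parse_downsample_error and ms_to_utc are hypothetical and not part of Thanos.

```python
import re
from datetime import datetime, timezone

# Assumption: the error text has the shape seen in this issue's logs.
# The cause stops at a quote or backslash so it also works on raw logfmt values.
ERR_RE = re.compile(
    r"downsample block (?P<block>[0-9A-Z]{26}) to window (?P<window_ms>\d+): "
    r"downsample aggregate block, series: (?P<series>\d+): (?P<cause>[^\"\\]+)"
)


def parse_downsample_error(message):
    """Return the block ULID, window (ms), series number and cause, or None."""
    m = ERR_RE.search(message)
    if m is None:
        return None
    return {
        "block": m.group("block"),
        "window_ms": int(m.group("window_ms")),
        "series": int(m.group("series")),
        "cause": m.group("cause"),
    }


def ms_to_utc(ms):
    """Convert a millisecond Unix timestamp (as in mint/maxt) to an aware UTC datetime."""
    # Integer arithmetic avoids float rounding in the sub-second part.
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return base.replace(microsecond=(ms % 1000) * 1000)


if __name__ == "__main__":
    reason = (
        "error executing compaction: first pass of downsampling failed: "
        "downsampling to 60 min: downsample block 01H9FB4YMTT0BNKVRVQKHNRXDF "
        "to window 3600000: downsample aggregate block, series: 1899: invalid size"
    )
    print(parse_downsample_error(reason))
    # mint and maxt taken from the "finalized downsampled block" log line.
    print(ms_to_utc(1692230401392).isoformat(), ms_to_utc(1693440000000).isoformat())
```

Applied to the logs above, this gives a block time range of roughly 2023-08-17 to 2023-08-31 (UTC), which may help when matching the failing block against the blocks that were merged by deduplication.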

Anything else we need to know:

@GiedriusS
Member

Is this the same as #6380?

@verejoel
Contributor Author

verejoel commented Sep 4, 2023

Yes, it looks to be; vertical compaction was also recently enabled there.

I have some example blocks that fail. What would be the best place to upload them?

@michalschott

Also facing this issue.

@yeya24
Contributor

yeya24 commented Sep 5, 2023

Hey, it would be great to have the block and debug the problem locally. I feel like #6598 might be related.

If you are willing to share one of the problematic blocks, feel free to ping me on Slack.
