Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip blocks with out-of-order chunk during compaction #4469

Merged
merged 1 commit into from
Aug 13, 2021

Conversation

huyan0
Copy link
Contributor

@huyan0 huyan0 commented Jul 22, 2021

Signed-off-by: Yang Hu yhuz@amazon.com

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • add a check for OOO chunks in block index, and mark such block for no compaction, and continue on with the compaction
  • add testutil to generate an index with OOO chunks

Verification

  • verified the exepected behavior with compact_e2e_test.go

Addresses: #3442
cc: @yeya24 @alvinlin123 @roystchiang

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it! Super high-quality code. Thanks for the contribution @huyan0

cmd/thanos/compact.go Outdated Show resolved Hide resolved

if i.OutOfOrderSeries > 0 {
errMsg = append(errMsg, fmt.Sprintf(
func (i HealthStats) OutOfOrderChunksErr() error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, do you think we can just skip compaction for all the CriticalErr; instead of just OutOfOrderChunksErr?

Copy link
Contributor Author

@huyan0 huyan0 Jul 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay on the compact side given we cannot generate more blocks than not compacting...I don't think there's much implication on the retrieval side either but would like to hear what @yeya24 thinks : )
@yeya24 do you have some context on why the project decided to halt compaction in the first place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka do you have some context on the discussion? :)

@huyan0 huyan0 force-pushed the main branch 2 times, most recently from 887afae to 0532597 Compare July 23, 2021 14:20
@huyan0 huyan0 marked this pull request as ready for review July 23, 2021 19:19
@huyan0 huyan0 requested a review from yeya24 July 23, 2021 19:19
@huyan0
Copy link
Contributor Author

huyan0 commented Jul 27, 2021

I have updated the metric counter, will add to changelog if there's no further comments for now @yeya24

yeya24
yeya24 previously approved these changes Jul 28, 2021
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr LGTM. But it changes the default behavior and might be critical so it would be good to hear from other maintainers.

cc @thanos-io/thanos-maintainers

@wiardvanrij
Copy link
Member

The code looks solid and you referenced an issue. If I understand correctly, we now mark it as corrupted and skip it. That still leaves a block corrupted. Are there any follow-up steps that a user could do? Basically I would really love to have some documentation on:

  • How could one find corrupted blocks?
  • What is the impact for the user if that happens?
  • Could one still fix a corrupted block and let it get compacted after the fact?
  • anything else that provides context to users

This would IMO be the cherry on top :)

@bwplotka
Copy link
Member

TODO: Can we check if those blocks are readable by Thanos Store/Cortex Store

@bwplotka
Copy link
Member

TODO: Review & check for consistency with other issues and how we handle them (:

@huyan0
Copy link
Contributor Author

huyan0 commented Jul 30, 2021

@bwplotka regarding your two comments:

  1. the blocks are still readable because the mark that we add to the block is for compact only, and the store API has no requirement on whether the chunks are sorted: link

  2. We handle this error differently than the other errors encountered during compaction; for those we halt the compaction. Other than OOO chunks, the two other specific errors compact check for are these , for which we attempt to repair or compacts if a feature flag is set, so I don't see an overarching consistent approach to adhere to, other than logging the occurrence to provide context for user. This is understandable since these errors are unexpected and specific. I agree a deep dive into the root cause is needed to address this in long term to come up with a fix for the root cause.

@huyan0
Copy link
Contributor Author

huyan0 commented Aug 9, 2021

PR still pending review :) Any updates needed?

cmd/thanos/compact.go Outdated Show resolved Hide resolved
@yeya24
Copy link
Contributor

yeya24 commented Aug 9, 2021

Btw let's update the changelog and we are ready to go I think

@yeya24
Copy link
Contributor

yeya24 commented Aug 11, 2021

@huyan0 Sorry for the late review. I will merge the pr after you resolve the conflict.

@huyan0
Copy link
Contributor Author

huyan0 commented Aug 11, 2021

hello, I have rebased : )

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yeya24
Copy link
Contributor

yeya24 commented Aug 11, 2021

Looks like the E2E is still failing and caused by this pr.

21:58:23 Killing sidecar-remote-and-sidecar
21:58:23 compact-expect-to-halt: panic: inconsistent label cardinality: expected 2 label values but got 1 in []string{"deletion-mark.json"}
21:58:23 compact-expect-to-halt: goroutine 1 [running]:
21:58:23 compact-expect-to-halt: github.com/prometheus/client_golang/prometheus.(*CounterVec).WithLabelValues(0xc00000f9b8, 0xc000887840, 0x1, 0x1, 0x2035af0, 0xc00019f320)
21:58:23 compact-expect-to-halt: /home/runner/go/pkg/mod/github.com/prometheus/client_golang@v1.10.0/prometheus/counter.go:248 +0xff
21:58:23 compact-expect-to-halt: main.newCompactMetrics(0xc0001a0640, 0x9d29229e0000, 0x1ba7500)
21:58:23 compact-expect-to-halt: /home/runner/work/thanos/thanos/cmd/thanos/compact.go:152 +0x5cd
21:58:23 compact-expect-to-halt: main.runCompact(0xc00049d5a8, 0x1ff4040, 0xc000889410, 0x201c490, 0x2e02e80, 0xc0001a0640, 0x1ff8d80, 0xc0008876b0, 0x1, 0x4, ...)
21:58:23 compact-expect-to-halt: /home/runner/work/thanos/thanos/cmd/thanos/compact.go:171 +0xc5
21:58:23 compact-expect-to-halt: main.registerCompact.func1(0xc00049d5a8, 0x1ff4040, 0xc000889410, 0xc0001a0640, 0x201c490, 0x2e02e80, 0xc000083e60, 0x0, 0x4169fb, 0x7f5f11773100)

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be other panics caused by that deletion marker metric. Please check and fix them. Thanks for the patience.

cmd/thanos/compact.go Show resolved Hide resolved
cmd/thanos/compact.go Outdated Show resolved Hide resolved
cmd/thanos/compact.go Outdated Show resolved Hide resolved
pkg/compact/compact.go Outdated Show resolved Hide resolved
@yeya24
Copy link
Contributor

yeya24 commented Aug 12, 2021

Can you do a rebase on latest main rather than including changes from another pr.

CHANGELOG.md Outdated Show resolved Hide resolved
@huyan0 huyan0 force-pushed the main branch 2 times, most recently from b4405a2 to 18efebe Compare August 12, 2021 00:31
pkg/compact/compact.go Outdated Show resolved Hide resolved
pkg/testutil/testutil.go Outdated Show resolved Hide resolved
pkg/testutil/testutil.go Outdated Show resolved Hide resolved
@huyan0 huyan0 force-pushed the main branch 2 times, most recently from 2abf49b to b0fbe2d Compare August 12, 2021 16:36
@huyan0 huyan0 requested a review from yeya24 August 12, 2021 16:58
@yeya24 yeya24 enabled auto-merge (squash) August 12, 2021 17:26
yeya24
yeya24 previously approved these changes Aug 12, 2021
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@huyan0
Copy link
Contributor Author

huyan0 commented Aug 12, 2021

looking... Sorry for the back and forth. Will try to address it

Signed-off-by: Yang Hu <yhuz@amazon.com>
auto-merge was automatically disabled August 12, 2021 17:36

Head branch was pushed to by a user without write access

@huyan0 huyan0 requested a review from yeya24 August 12, 2021 19:28
@yeya24 yeya24 enabled auto-merge (squash) August 13, 2021 03:58
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants