Skip to content

Fix ring tokens sorting regression introduced in #3601 #3815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 12, 2021

Conversation

pracucci
Copy link
Contributor

What this PR does:
I've introduced a regression in #3601 assuming that tokens are always sorted in the ring. Actually, recent versions of Cortex guarantee it but if your Cortex cluster has been created long time ago and tokens have been transferred from a leaving ingester to a joining one over and over they are still unsorted in the ring.

In this PR:

  • Guarantee sorted tokens when a JOINING ingester claim tokems from a LEAVING one
  • Guarantee sorted tokens when an ingester read tokens from file
  • Rollback tokens sorting in Desc.GetTokens() and Desc.getTokensByZone() which was removed in Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601 (required to cover the time range during which ingesters of the new Cortex version are rolling out)
  • No CHANGELOG because I expect this to be cherry picked in release-1.7

My plan is:

  • Merge this PR
  • Cherry pick the fix into release-1.7
  • Open another PR to remove the sorting from Desc.GetTokens() and Desc.getTokensByZone() adding a CHANGELOG entry to require upgrade to Cortex 1.7 before upgrading to future version Cortex 1.8

The performance impact for the sorting re-introduced in Desc.GetTokens() and Desc.getTokensByZone() is pretty low:

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_ShuffleShard_512Tokens-12     399147        427036        +6.99%
BenchmarkRing_Get-12                        781           787           +0.77%

Ring.Get() is not expected to be impacted, the little diff is just noise.

Which issue(s) this PR fixes:
Fixes #3769

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@bboreham
Copy link
Contributor

After this change, are tokens sorted in disk file and in KV store?
That would be a good thing to ensure, because we could then remove sort-on-load after a couple of versions.

Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok

@pracucci
Copy link
Contributor Author

After this change, are tokens sorted in disk file and in KV store?

Yes, they should (unless I'm missing anything).

because we could then remove sort-on-load after a couple of versions

My idea (in the PR description) was to remove it from the next version and require to upgrade to Cortex 1.7 before 1.8. What do you think? It would allow us to immediately remove the sort-on-load from master.

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

(Nit: when posting benchmarks, please include full output from benchstat, including statistical significance, otherwise they are not saying much).

@pracucci
Copy link
Contributor Author

🤞 Let's go and hope (even if hope is never a strategy)

@pracucci pracucci merged commit 56f794d into cortexproject:master Feb 12, 2021
@pracucci pracucci deleted the fix-ring-tokens-sorting branch February 12, 2021 11:51
pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021
khaines pushed a commit that referenced this pull request Feb 13, 2021
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1.7.0-rc0 querier incorrectly interpreting ring from 1.6.0 ingesters
3 participants