Added shuffle sharding support to generate a subring #3090


Merged
merged 10 commits into cortexproject:master from fix-shuffle-sharding on Sep 17, 2020

Conversation

pracucci
Contributor

@pracucci pracucci commented Aug 26, 2020

What this PR does:
The current Subring() implementation was conceived as experimental shuffle sharding support in Cortex. It works fine for building a subring of N nodes and guaranteeing stability and consistency, but it unfortunately suffers from two issues:

  1. It doesn't shuffle instances between tenants
  2. It's not zone aware (the resulting subring may be very imbalanced with regard to zones)

In this PR I propose an alternative ShuffleShard() implementation, inspired by the AWS Route53 infima library, which guarantees stability and consistency while offering node shuffling and zone awareness.
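To make the idea concrete, here is a minimal sketch of the general technique (simplified types and a hypothetical lookup callback, not the actual code in this PR): for each zone we seed a deterministic RNG with a hash of the tenant ID and the zone name, then pick random tokens and resolve them to instances until the requested number of distinct instances has been selected in that zone. The same tenant always gets the same seed (stability and consistency), while different tenants get independently shuffled selections, and the per-zone selection gives zone balance.

// Sketch only; not the implementation in this PR.
package shuffleshard

import (
	"hash/fnv"
	"math/rand"
)

// shuffleShard selects instancesPerZone distinct instances in each zone for the
// given tenant. lookup is a hypothetical callback resolving a token to the
// instance owning it within a zone (standing in for the ring lookup).
func shuffleShard(tenantID string, zones []string, instancesPerZone int, lookup func(zone string, token uint32) string) map[string]struct{} {
	shard := map[string]struct{}{}

	for _, zone := range zones {
		// Deterministic per-tenant, per-zone seed.
		h := fnv.New64a()
		h.Write([]byte(tenantID))
		h.Write([]byte(zone))
		rnd := rand.New(rand.NewSource(int64(h.Sum64())))

		// Bound the attempts so we don't loop forever when a zone has fewer
		// instances than requested.
		for found, attempts := 0, 0; found < instancesPerZone && attempts < 10*instancesPerZone; attempts++ {
			instance := lookup(zone, rnd.Uint32())
			if _, ok := shard[instance]; ok {
				continue // Already selected, try another random token.
			}
			shard[instance] = struct{}{}
			found++
		}
	}

	return shard
}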

Will be done in subsequent PRs:

  • Introduce a cache for each tenant subring
  • Documentation

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@tomwilkie
Contributor

/cc @thorfour (original author of #1947) and @harry671003, who's been poking around in this area.


// We expect the shard size to be divisible by the number of zones, in order to
// have nodes balanced across zones. If it's not, we round up.
numInstancesPerZone := int(math.Ceil(float64(size) / float64(len(r.ringZones))))
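As a worked example of the rounding (numbers are mine, not from the PR): with a requested shard size of 5 and 3 zones,

numInstancesPerZone := int(math.Ceil(float64(5) / float64(3))) // ceil(1.67) = 2

so the resulting subring holds 2 instances per zone, i.e. 6 instances in total, one more than requested.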
Contributor

Just want to check: in the case where we're not using zones, zone = ingesterID. Will this algorithm end up distributing all users to all ingesters in this case?

Contributor Author

in the case where we're not using zones, zone = ingesterID

It was in the initial implementation but was soon fixed in #2404. When not using zones, the zone is an empty string, which works fine here (all ingesters belong to the same zone).

However, I'm wondering if we should add a boolean flag to the ring to explicitly enable zone awareness, disabled by default, instead of just relying on the fact that zones are set (think about a rolling update adding zones to ingesters: you may want to roll out all ingesters first and then enable it).
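Something along these lines (purely hypothetical names, just to make the idea concrete; not part of this PR):

package ring

import "flag"

// Hypothetical, illustrative only: an explicit opt-in for zone-aware sharding,
// so operators can first roll out zone labels to all ingesters and only then
// flip the flag on.
type Config struct {
	ZoneAwarenessEnabled bool `yaml:"zone_awareness_enabled"`
}

func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.BoolVar(&cfg.ZoneAwarenessEnabled, "ring.zone-awareness-enabled", false,
		"True to enable zone-aware shuffle sharding (disabled by default).")
}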

Contributor

My 2 cents is that if it doesn't add too much complexity, it's probably better to favor rollout safety for changing the flag.

Contributor Author

Agree @ranton256. Will do it in a separate PR (since it's unrelated to the work done in this PR).

@thorfour
Contributor

It doesn't shuffle instances between tenants

I guess I don't understand this comment. Can you elaborate on what this problem is?

Contributor

@pstibrany pstibrany left a comment

Great work!

// number of zones and the returned subring will contain the same number of
// instances per zone as far as there are enough registered instances in the ring.
//
// The algorithm used to build the subring is a shuffle sharder based on probabilistic
Contributor

What do you mean by "probabilistic hashing"?

Contributor Author

@pracucci pracucci Aug 27, 2020

Uniqueness is not guaranteed, but it offers a low collision probability. What we're using here is a probabilistic data structure.
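To put a rough number on it (an illustration assuming a 64-bit per-tenant seed, not a statement about the exact hash used here): by the birthday bound, the chance that any two out of n tenants end up with the same seed is roughly n(n-1)/2^65, which for n = 100,000 tenants is about 2.7e-10 — and even a collision only means those two tenants would select the same subring.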

@pracucci
Contributor Author

It doesn't shuffle instances between tenants

I guess I don't understand this comment. Can you elaborate on what this problem is?

If we pick nodes following the "next" tokens in the ring (as Subring() works right now), we pick consecutive nodes instead of shuffling them. Think about 2 tenants: one whose initial hash matches node X in the ring, and a second one whose hash matches node X+1. They will share all nodes minus one in the subring. Shuffle sharding is about shuffling nodes to make this case probabilistically unlikely, which is what this PR does.

@thorfour
Contributor

It doesn't shuffle instances between tenants

I guess I don't understand this comment. Can you elaborate on what this problem is?

If we pick nodes following the "next" tokens in the ring (as Subring() works right now), we pick consecutive nodes instead of shuffling them. Think about 2 tenants: one whose initial hash matches node X in the ring, and a second one whose hash matches node X+1. They will share all nodes minus one in the subring. Shuffle sharding is about shuffling nodes to make this case probabilistically unlikely, which is what this PR does.

So is the value of this for really low tenant counts to make it more likely to evenly distribute load? Because the example of 2 tenants overlapping will always be true at large enough tenant counts.

@pracucci
Contributor Author

So is the value of this for really low tenant counts to make it more likely to evenly distribute load?

No. It works very well with a large number of tenants too (assuming a reasonably large number of nodes too, though orders of magnitude smaller than the number of tenants).

Because the example of 2 tenants overlapping will always be true at large enough tenant counts.

It depends on the cluster size. If you have a cluster with 100 ingesters and you randomly pick 6 nodes for each tenant, you end up with about 1.19 billion possible combinations, and the chances of two tenants having overlapping nodes (with shuffle sharding) are:

  • Probability of 6 equal instances: 0.00000008%
  • Probability of 5 equal instances: 0.00004731%
  • Probability of 4 equal instances: 0.00550018%
  • Probability of 3 equal instances: 0.22489615%
  • Probability of 2 equal instances: 3.83729063%
  • Probability of 1 equal instance: 27.62849251%
  • Probability of 0 equal instances: 68.30377314%

I created this spreadsheet to easily compute it.

For more information, you could take a look at this and the reference implementation linked in the PR description.
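For reference, the numbers above follow a hypergeometric distribution and can be reproduced with a few lines of Go (a standalone sketch, not code from this PR):

// Probability that two tenants, each assigned a random shard of 6 instances
// out of a 100-instance ring, share exactly m instances.
package main

import "fmt"

// binomial computes "n choose k" as a float64.
func binomial(n, k int) float64 {
	c := 1.0
	for i := 0; i < k; i++ {
		c = c * float64(n-i) / float64(i+1)
	}
	return c
}

func main() {
	const instances, shardSize = 100, 6
	total := binomial(instances, shardSize)
	for m := shardSize; m >= 0; m-- {
		p := binomial(shardSize, m) * binomial(instances-shardSize, shardSize-m) / total
		fmt.Printf("Probability of %d equal instances: %.8f%%\n", m, 100*p)
	}
}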

Contributor

@thorfour thorfour left a comment

@pracucci thanks for the explanation! LGTM

@pracucci pracucci force-pushed the fix-shuffle-sharding branch from 0629048 to ea4c258 on August 27, 2020 07:52
# in the per-tenant overrides, a value of 0 disables shuffle sharding for the
# tenant.
# CLI flag: -distribution.ingestion-tenant-shard-size
[ingestion_tenant_shard_size: <int> | default = 0]
Contributor

I take it based on the similar "store_gateway_tenant_shard_size" flag from https://github.com/cortexproject/cortex/pull/3069/files you are planning to name all the similar flags "_tenant_shard_size"?

Contributor

Also, are you planning to take the -experimental prefix off this (or keep it) when you are ready to submit?

Contributor Author

I take it based on the similar "store_gateway_tenant_shard_size" flag from https://github.com/cortexproject/cortex/pull/3069/files you are planning to name all the similar flags "_tenant_shard_size"?

Yes, I do. The reason is that overrides are specified within the same YAML node, so we need a way to clearly differentiate them. What do you think?
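For example, the per-tenant overrides would then read consistently across components (a sketch with hypothetical tenant IDs and values; store_gateway_tenant_shard_size is the flag from #3069):

overrides:
  tenant-a:
    ingestion_tenant_shard_size: 3
  tenant-b:
    ingestion_tenant_shard_size: 6
    store_gateway_tenant_shard_size: 6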

Contributor Author

Also, are you planning to take/keep the -experimental off this when you are ready to submit?

I would plan to remove the experimental flag once ready to submit, unless you have any concerns. Generally speaking, I think originally adding the experimental prefix to CLI flags was a mistake and shouldn't be done anymore. What do you think?

Contributor

I think you mean document as experimental but not name the flag experimental? That sounds fine to me.

Contributor Author

Yes Richard, you're correct (sorry for being unclear). This way, once we're all comfortable marking it stable, it will just be a doc change instead of a config change.

Contributor

@ranton256 ranton256 left a comment

LGTM

@pstibrany
Contributor

There is additional work (caching of user rings) available in https://github.com/pstibrany/cortex/tree/fix-shuffle-sharding branch, not yet tested.

@pracucci pracucci force-pushed the fix-shuffle-sharding branch from 3783b39 to 7131ffc on September 17, 2020 13:08
@pracucci pracucci marked this pull request as ready for review September 17, 2020 13:08
Contributor

@pstibrany pstibrany left a comment

Great job!

@pracucci pracucci merged commit c0d3995 into cortexproject:master Sep 17, 2020
simonswine pushed a commit to grafana/e2e that referenced this pull request Jan 13, 2022
…ortex#3090)

* Added shuffle sharding support to generate a subring

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Solved all TODOs

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplified unlock

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed linter

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added benchmark

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Replaced Subring() with ShuffleShard()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Small improvements

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Shortened CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Make golang doc happy

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed flag name and added integration test

Signed-off-by: Marco Pracucci <marco@pracucci.com>