Added shuffle sharding support to generate a subring #3090
Conversation
/cc @thorfour (original author of #1947) and @harry671003, who's been poking around in this area.
// We expect the shard size to be divisible by the number of zones, in order to
// have nodes balanced across zones. If it's not, we do round up.
numInstancesPerZone := int(math.Ceil(float64(size) / float64(len(r.ringZones))))
Just want to check: in the case where we're not using zones, zone = ingesterID. Will this algorithm end up distributing all users to all ingesters in this case?
> in the case where we're not using zones, zone = ingesterID

That was true in the initial implementation, but it was fixed in #2404. When not using zones, the zone is an empty string, which works fine here (all ingesters belong to the same zone).

However, I'm wondering if we should add a boolean flag to the ring to explicitly enable zone awareness, disabled by default, instead of just relying on the fact that zones are set (think about a rolling update adding zones to ingesters: you may want to roll out all ingesters first and then enable it).
My 2 cents: if it doesn't add too much complexity, it's probably better to favor rollout safety when changing the flag.
Agree @ranton256. Will do it in a separate PR (since it's unrelated to the work done in this PR).
I guess I don't understand this comment. Can you elaborate on what the problem is?
Great work!
// number of zones and the returned subring will contain the same number of
// instances per zone, as long as there are enough registered instances in the ring.
//
// The algorithm used to build the subring is a shuffle sharder based on probabilistic
// hashing.
What do you mean by "probabilistic hashing"?
Uniqueness is not guaranteed, but it offers a low collision probability. What we're using here is a probabilistic data structure.
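The "probabilistic" aspect can be illustrated with a tiny deterministic-shuffle sketch: hash the tenant ID into a PRNG seed and take the first N instances of the resulting permutation. This is only an illustrative toy (`pickShard` and the instance names are made up here, not the actual ring-token based algorithm in this PR), but it shows the key property: the selection looks random across tenants while being stable for any single tenant.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// pickShard is a toy shuffle shard: the tenant ID seeds a PRNG, whose
// permutation of the instance list is stable for that tenant but looks
// random across different tenants.
func pickShard(tenantID string, instances []string, shardSize int) []string {
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	shard := make([]string, 0, shardSize)
	for _, idx := range r.Perm(len(instances))[:shardSize] {
		shard = append(shard, instances[idx])
	}
	return shard
}

func main() {
	instances := []string{"ing-1", "ing-2", "ing-3", "ing-4", "ing-5", "ing-6"}
	fmt.Println(pickShard("tenant-a", instances, 3))
	fmt.Println(pickShard("tenant-a", instances, 3)) // identical: selection is stable
}
```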
If we pick nodes following the "next" tokens in the ring (like
So is the value of this for really low tenant counts, to make it more likely to evenly distribute load? Because the example of 2 tenants overlapping will always be true at large enough tenant counts.
No. It works very well with a large number of tenants too (assuming a reasonably large number of nodes as well, though an order of magnitude smaller than the number of tenants).
It depends on the cluster size. If you have a cluster with 100 ingesters and you randomly pick 6 nodes for each tenant, you end up with 75M possible combinations, and the chances of having overlapping nodes (with shuffle sharding) are:
I created this spreadsheet to easily compute it. For more information, you could take a look at this and the reference implementation linked in the PR description.
@pracucci thanks for the explanation! LGTM
Force-pushed from 0629048 to ea4c258
# in the per-tenant overrides, a value of 0 disables shuffle sharding for the
# tenant.
# CLI flag: -distributor.ingestion-tenant-shard-size
[ingestion_tenant_shard_size: <int> | default = 0]
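For context, a per-tenant override for this limit might look like the following in the runtime overrides file (the tenant names are made up; this assumes the standard Cortex `overrides:` structure for per-tenant limits):

```yaml
overrides:
  tenant-a:
    ingestion_tenant_shard_size: 6   # shuffle-shard tenant-a across 6 ingesters
  tenant-b:
    ingestion_tenant_shard_size: 0   # 0 disables shuffle sharding for tenant-b
```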
I take it based on the similar "store_gateway_tenant_shard_size" flag from https://github.com/cortexproject/cortex/pull/3069/files you are planning to name all the similar flags "_tenant_shard_size"?
Also, are you planning to take/keep the -experimental off this when you are ready to submit?
> I take it based on the similar "store_gateway_tenant_shard_size" flag from https://github.com/cortexproject/cortex/pull/3069/files you are planning to name all the similar flags "_tenant_shard_size"?
Yes, I do. The reason is that overrides are specified within the same YAML node, so we need a way to clearly differentiate them. What do you think?
> Also, are you planning to take/keep the -experimental off this when you are ready to submit?
I would plan to remove the experimental flag once ready to submit, unless you have any concerns. Generally speaking, I think originally adding the experimental prefix to CLI flags was a mistake and shouldn't be done anymore. What do you think?
I think you mean document as experimental but not name the flag experimental? That sounds fine to me.
Yes Richard, you're correct (sorry for being unclear). This way, once we're all comfortable marking it stable, it will just be a doc change instead of a config change.
LGTM
There is additional work (caching of user rings) available in https://github.com/pstibrany/cortex/tree/fix-shuffle-sharding branch, not yet tested.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Force-pushed from 3783b39 to 7131ffc
Great job!
…ortex#3090)
* Added shuffle sharding support to generate a subring
* Solved all TODOs
* Simplified unlock
* Fixed linter
* Added benchmark
* Replaced Subring() with ShuffleShard()
* Small improvements
* Shortened CHANGELOG entry
* Make golang doc happy
* Fixed flag name and added integration test

Signed-off-by: Marco Pracucci <marco@pracucci.com>
What this PR does:
The current `Subring()` implementation was conceived as experimental shuffle sharding support in Cortex. It works fine with regard to building a subring of N nodes and guaranteeing stability and consistency, but unfortunately it suffers from two issues:

In this PR I propose an alternative `ShuffleShard()` implementation, inspired by the AWS Route53 infima library, which guarantees stability and consistency while offering node shuffling and zone awareness.

Will be done in subsequent PRs:
Which issue(s) this PR fixes:
N/A
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`