Skip to content

Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Dec 14, 2020

What this PR does:
We've been reported an high memory utilisation in the store-gateway when shuffle-sharding is enabled and the cluster has a large number of tenants and replicas. This memory utilisation increase is caused by the in-memory subring cache used by the shuffle-sharding, which is used to reduce the CPU utilisation due to the expensive computation of the subring.

However, store-gateway, ruler and compactor use Ring.ShuffleShard() only to check whether a tenant/block/rulegroup belongs to their own shard and this operation is run, for each tenant, only at a regular interval (it doesn't depend on QPS).

Disabling subring cache is a possible solution (proposed in this PR), but then CPU utilisation is expected to increase. For this reason, in this PR I've also worked to optimise Ring.ShuffleShard() paying attention to not increase CPU time for Ring.Get() (problem is that it's easy to optimise ShuffleShard() but hard to do without impacting Get()).

BenchmarkRing_ShuffleShard_512Tokens (512 tokens per instance)

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_ShuffleShard_512Tokens-12     1396063       389304        -72.11%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_ShuffleShard_512Tokens-12     60             59             -1.67%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_ShuffleShard_512Tokens-12     681775        58226         -91.46%

BenchmarkRing_Get

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_Get-12                        853           764           -10.43%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_Get-12                        0              0              +0.00%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_Get-12                        0             0             +0.00%

BenchmarkRing_ShuffleShard (128 tokens per instance)

BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_3-12              100674        32002         -68.21%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_10-12             393978        152904        -61.19%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_30-12             1316209       639339        -51.43%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_3-12              117988        48928         -58.53%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_10-12             525024        183298        -65.09%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_30-12             1221208       558804        -54.24%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_3-12             112511        37656         -66.53%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_10-12            408735        153631        -62.41%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_30-12            1328229       655980        -50.61%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_3-12             121937        49367         -59.51%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_10-12            466278        176984        -62.04%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_30-12            1265773       562446        -55.57%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_3-12            95741         31072         -67.55%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_10-12           376408        143209        -61.95%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_30-12           1328692       616708        -53.59%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_3-12            114967        47096         -59.04%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_10-12           441547        176075        -60.12%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_30-12           1288778       551126        -57.24%

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


// Whether the shuffle-sharding subring cache is disabled. This option is set
// internally and never exposed to the user.
SubringCacheDisabled bool `yaml:"-"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config option is "disabled bool" instead of "enabled bool" so that the default will be "enabled" (because the zero value of a bool is false).

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@pracucci pracucci marked this pull request as draft December 15, 2020 15:05
@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from be49fee to 438fa45 Compare December 15, 2020 16:55
@pracucci pracucci requested a review from pstibrany December 15, 2020 16:56
@@ -34,6 +34,11 @@ func GenerateTokens(numTokens int, takenTokens []uint32) []uint32 {
i++
}

// Ensure returned tokens are sorted.
sort.Slice(tokens, func(i, j int) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ring lifecycler it's already guaranteed all tokens are sorted. I've introduced sorting here as well to have same guarantee in tests too.

@pracucci pracucci changed the title Disabled subring cache in store-gateway, ruler and compactor Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor Dec 15, 2020
@pracucci pracucci marked this pull request as ready for review December 15, 2020 18:49
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far. (I thought it's still a draft :))

@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from c0d0e97 to 9d2de1d Compare December 16, 2020 11:38
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the do-not-cache-subring-if-not-required branch from 3e87a3d to 49e56e8 Compare January 7, 2021 10:11
@pracucci pracucci merged commit e4867d8 into cortexproject:master Jan 8, 2021
@pracucci pracucci deleted the do-not-cache-subring-if-not-required branch January 8, 2021 14:07
pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021
Signed-off-by: Marco Pracucci <marco@pracucci.com>
pracucci added a commit that referenced this pull request Feb 12, 2021
Signed-off-by: Marco Pracucci <marco@pracucci.com>
pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021
khaines pushed a commit that referenced this pull request Feb 13, 2021
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants