Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

pracucci · 2020-12-14T10:42:05Z

What this PR does:
We've been reported an high memory utilisation in the store-gateway when shuffle-sharding is enabled and the cluster has a large number of tenants and replicas. This memory utilisation increase is caused by the in-memory subring cache used by the shuffle-sharding, which is used to reduce the CPU utilisation due to the expensive computation of the subring.

However, store-gateway, ruler and compactor use Ring.ShuffleShard() only to check whether a tenant/block/rulegroup belongs to their own shard and this operation is run, for each tenant, only at a regular interval (it doesn't depend on QPS).

Disabling subring cache is a possible solution (proposed in this PR), but then CPU utilisation is expected to increase. For this reason, in this PR I've also worked to optimise Ring.ShuffleShard() paying attention to not increase CPU time for Ring.Get() (problem is that it's easy to optimise ShuffleShard() but hard to do without impacting Get()).

BenchmarkRing_ShuffleShard_512Tokens (512 tokens per instance)

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_ShuffleShard_512Tokens-12     1396063       389304        -72.11%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_ShuffleShard_512Tokens-12     60             59             -1.67%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_ShuffleShard_512Tokens-12     681775        58226         -91.46%

BenchmarkRing_Get

benchmark                                   old ns/op     new ns/op     delta
BenchmarkRing_Get-12                        853           764           -10.43%

benchmark                                   old allocs     new allocs     delta
BenchmarkRing_Get-12                        0              0              +0.00%

benchmark                                   old bytes     new bytes     delta
BenchmarkRing_Get-12                        0             0             +0.00%

BenchmarkRing_ShuffleShard (128 tokens per instance)

BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_3-12              100674        32002         -68.21%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_10-12             393978        152904        -61.19%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_1,_shard_size_=_30-12             1316209       639339        -51.43%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_3-12              117988        48928         -58.53%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_10-12             525024        183298        -65.09%
BenchmarkRing_ShuffleShard/num_instances_=_50,_num_zones_=_3,_shard_size_=_30-12             1221208       558804        -54.24%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_3-12             112511        37656         -66.53%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_10-12            408735        153631        -62.41%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_1,_shard_size_=_30-12            1328229       655980        -50.61%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_3-12             121937        49367         -59.51%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_10-12            466278        176984        -62.04%
BenchmarkRing_ShuffleShard/num_instances_=_100,_num_zones_=_3,_shard_size_=_30-12            1265773       562446        -55.57%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_3-12            95741         31072         -67.55%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_10-12           376408        143209        -61.95%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_1,_shard_size_=_30-12           1328692       616708        -53.59%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_3-12            114967        47096         -59.04%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_10-12           441547        176075        -60.12%
BenchmarkRing_ShuffleShard/num_instances_=_1000,_num_zones_=_3,_shard_size_=_30-12           1288778       551126        -57.24%

Which issue(s) this PR fixes:
N/A

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pracucci · 2020-12-14T10:44:50Z

pkg/ring/ring.go

+
+	// Whether the shuffle-sharding subring cache is disabled. This option is set
+	// internally and never exposed to the user.
+	SubringCacheDisabled bool `yaml:"-"`


The config option is "disabled bool" instead of "enabled bool" so that the default will be "enabled" (because the zero value of a bool is false).

pstibrany

Makes sense to me

pracucci · 2020-12-15T17:08:08Z

pkg/ring/util.go

@@ -34,6 +34,11 @@ func GenerateTokens(numTokens int, takenTokens []uint32) []uint32 {
 		i++
 	}

+	// Ensure returned tokens are sorted.
+	sort.Slice(tokens, func(i, j int) bool {


In the ring lifecycler it's already guaranteed all tokens are sorted. I've introduced sorting here as well to have same guarantee in tests too.

pstibrany

LGTM ~~so far.~~ (I thought it's still a draft :))

pkg/ring/ring.go

Signed-off-by: Marco Pracucci <marco@pracucci.com>

…ortexproject#3815) Signed-off-by: Marco Pracucci <marco@pracucci.com>

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci requested a review from pstibrany December 14, 2020 10:42

pull-request-size bot added the size/S label Dec 14, 2020

pracucci commented Dec 14, 2020

View reviewed changes

pstibrany approved these changes Dec 14, 2020

View reviewed changes

pull-request-size bot added size/XL and removed size/S labels Dec 15, 2020

pracucci marked this pull request as draft December 15, 2020 15:05

pracucci force-pushed the do-not-cache-subring-if-not-required branch from be49fee to 438fa45 Compare December 15, 2020 16:55

pracucci requested a review from pstibrany December 15, 2020 16:56

pracucci commented Dec 15, 2020

View reviewed changes

pracucci changed the title ~~Disabled subring cache in store-gateway, ruler and compactor~~ Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor Dec 15, 2020

pracucci marked this pull request as ready for review December 15, 2020 18:49

pstibrany approved these changes Dec 16, 2020

View reviewed changes

pkg/ring/ring.go Show resolved Hide resolved

pracucci force-pushed the do-not-cache-subring-if-not-required branch from c0d0e97 to 9d2de1d Compare December 16, 2020 11:38

pracucci added 7 commits January 7, 2021 11:10

Disabled subring cache in store-gateway, ruler and compactor

4d898a7

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Optimise Ring.ShuffleShard() and Ring.Get()

49ce0bf

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Code cleanup

513d260

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Code cleanup

27206db

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Simplified Ring.TokensFor()

0dccace

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Added CHANGELOG entry

856d543

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Added test to ensure Ring.Get() doesn't do any memory allocation

49e56e8

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci force-pushed the do-not-cache-subring-if-not-required branch from 3e87a3d to 49e56e8 Compare January 7, 2021 10:11

pracucci merged commit e4867d8 into cortexproject:master Jan 8, 2021

pracucci deleted the do-not-cache-subring-if-not-required branch January 8, 2021 14:07

pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021

Fix ring tokens sorting regression introduced in cortexproject#3601

7133cc7

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci mentioned this pull request Feb 12, 2021

Fix ring tokens sorting regression introduced in #3601 #3815

Merged

3 tasks

pracucci added a commit that referenced this pull request Feb 12, 2021

Fix ring tokens sorting regression introduced in #3601 (#3815)

56f794d

Signed-off-by: Marco Pracucci <marco@pracucci.com>

pracucci added a commit to pracucci/cortex that referenced this pull request Feb 12, 2021

Fix ring tokens sorting regression introduced in cortexproject#3601 (c…

04f6183

…ortexproject#3815) Signed-off-by: Marco Pracucci <marco@pracucci.com>

khaines pushed a commit that referenced this pull request Feb 13, 2021

Fix ring tokens sorting regression introduced in #3601 (#3815) (#3816)

e108f70

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

Uh oh!

pracucci commented Dec 14, 2020 •

edited

Loading

Uh oh!

pracucci Dec 14, 2020

Uh oh!

pstibrany left a comment

Uh oh!

pracucci Dec 15, 2020

Uh oh!

pstibrany left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

Optimised Ring.ShuffleShard() and disabled subring cache in store-gateway, ruler and compactor #3601

Uh oh!

Conversation

pracucci commented Dec 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

BenchmarkRing_ShuffleShard_512Tokens (512 tokens per instance)

BenchmarkRing_Get

BenchmarkRing_ShuffleShard (128 tokens per instance)

Uh oh!

pracucci Dec 14, 2020

Choose a reason for hiding this comment

Uh oh!

pstibrany left a comment

Choose a reason for hiding this comment

Uh oh!

pracucci Dec 15, 2020

Choose a reason for hiding this comment

Uh oh!

pstibrany left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pracucci commented Dec 14, 2020 •

edited

Loading

pstibrany left a comment •

edited

Loading