limits: distributor user subrings #1947
Conversation
Good job @thorfour! As we discussed privately, I see the value in doing it for the experimental TSDB storage, in a cluster with many tenants. I just left a couple of minor comments.
@codesome While I believe this feature is great, do you also see any optimization we can do in TSDB to mitigate this?
Usually, such constant numbers are not exposed for users to set in the case of Prometheus, and it might be the same case with the stripe size here.
LGTM
Even with the chunk store, I have sometimes wished that users were restricted to subsets of ingesters, generally because 3+ ingesters crashed, so I lost some data from everyone and I would prefer to apologise to a subset of users.
@codesome I definitely don't think the constant should be reduced. But is there a way to have it be user-defined, so that TSDBs aren't as memory hungry but trade performance at upper bounds to do so? It wouldn't even have to be a user option exposed on the Prometheus end, but one exposed only in the TSDB library, and Prometheus could simply use a constant.
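Purely as a hypothetical illustration of that suggestion (none of the names or the default value below are taken from the actual TSDB code), the constant could become a library-level option that Prometheus leaves at its default:

```go
// Hypothetical library-level option: embedders such as Cortex could lower the
// stripe count for their per-tenant TSDBs, while Prometheus keeps using the default.
const defaultStripeSize = 1 << 14 // illustrative value, not the real constant

type Options struct {
	// ...existing options...

	// StripeSize is the number of shards in the in-memory series hashtable.
	// Zero means defaultStripeSize.
	StripeSize int
}

func (o Options) stripeSize() int {
	if o.StripeSize <= 0 {
		return defaultStripeSize
	}
	return o.StripeSize
}
```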
This PR may have implications for PR #1764.
// Subring returns a ring of n ingesters from the given ring.
// Subrings are meant only for ingester lookup and should have their data externalized.
func (r *Ring) Subring(key uint32, n int) (ReadRing, error) {
Would it make sense to use some caching here? We're going to call this on every push request, which looks like a lot of ring traversal. On the other hand, the number of different keys isn't that high (= number of tenants), so perhaps we could use some of the distributor's memory to cache these subrings for some time.
I considered that, but was concerned about cache invalidation when the ring size changes. So I decided to at least punt on that for now.
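A minimal sketch of the kind of cache being discussed, assuming a ring "generation" counter that is bumped on every ring update (that counter and the helper itself are illustrative, not existing Cortex code); it invalidates everything whenever the ring changes, which sidesteps the invalidation concern at the cost of a cold cache:

```go
package ring

import "sync"

// subringCache is a hypothetical per-tenant cache of subrings.
type subringCache struct {
	mtx        sync.Mutex
	generation uint64              // assumed to be bumped by the ring on every update
	entries    map[string]ReadRing // tenant ID -> cached subring
}

func (c *subringCache) get(r *Ring, ringGeneration uint64, tenantID string, key uint32, n int) (ReadRing, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	// Drop every cached subring if the ring has changed since they were built.
	if ringGeneration != c.generation {
		c.generation = ringGeneration
		c.entries = nil
	}
	if sub, ok := c.entries[tenantID]; ok {
		return sub, nil
	}

	sub, err := r.Subring(key, n)
	if err != nil {
		return nil, err
	}
	if c.entries == nil {
		c.entries = map[string]ReadRing{}
	}
	c.entries[tenantID] = sub
	return sub, nil
}
```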
pkg/ring/ring.go
start = r.search(key)
iterations = 0
)
for i := start; len(distinctHosts) < n && iterations < len(r.ringDesc.Tokens); i++ {
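For readers following along, here is a rough, self-contained sketch of the general technique this loop implements, namely walking the ring clockwise from the key's position and collecting the first n distinct hosts; the names and data layout are simplified and are not the actual Cortex implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// token pairs a ring position with the host that owns it.
type token struct {
	pos  uint32
	host string
}

// subHosts walks the ring clockwise from key and returns the first n distinct
// hosts it encounters. tokens must be sorted by pos.
func subHosts(tokens []token, key uint32, n int) []string {
	if n <= 0 || len(tokens) == 0 {
		return nil
	}
	// Find the first token at or after the key, wrapping around the ring.
	start := sort.Search(len(tokens), func(i int) bool { return tokens[i].pos >= key }) % len(tokens)

	seen := map[string]struct{}{}
	hosts := []string{}
	for i := 0; i < len(tokens) && len(hosts) < n; i++ {
		t := tokens[(start+i)%len(tokens)]
		if _, ok := seen[t.host]; ok {
			continue
		}
		seen[t.host] = struct{}{}
		hosts = append(hosts, t.host)
	}
	return hosts
}

func main() {
	tokens := []token{
		{100, "ingester-1"}, {200, "ingester-2"}, {300, "ingester-3"},
		{400, "ingester-1"}, {500, "ingester-4"},
	}
	fmt.Println(subHosts(tokens, 250, 2)) // [ingester-3 ingester-1]
}
```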
It seems to me that this algorithm is too complicated for what we're trying to do here (pick N ingesters based on a key). That said, when I tried to come up with alternatives, they were not any better and would cause too many ingester jumps during scale-up/down.
I think I would have preferred a simpler solution as well, but I didn't want to thrash when the ring size changes, as you've pointed out. Maybe if we couple simple sharding with #1958 it wouldn't be so much of a problem?
Tangentially related, I also had the idea that a querier might be able to detect that the ring size hasn't changed in the last N amount of time, and could in theory stop sending queries to all ingesters and instead send them only to the subring.
@bboreham @gouthamve Have you had a chance to take a look at this PR? From my perspective, it's a safe feature, given that it doesn't change the actual Cortex logic until you enable the user subring.
LGTM
Will note that @pstibrany's comment on the performance of the SubRing function has me concerned, and I think we will need to iterate on that. However, since it is an experimental feature, I think this should be fine.
It would also be great if we could use this for reads. Connecting to each ingester is not ideal as things currently stand.
I've done another round of tests, and LGTM. It's time to merge it!
Signed-off-by: Thor <thansen@digitalocean.com>
Experimental TSDB Change Only
What this PR does: Opening a TSDB per user on each Ingester presents a problem with memory consumption. The series hashtable that TSDB pre-allocates per user is roughly 1.5MB.

This means that at a large user count it eats up a large amount of memory per ingester. More importantly, when the number of ingesters/nodes is scaled out, the per-node memory consumption stays the same, since the same number of TSDBs is allocated on each node.
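As a rough illustration (the tenant count here is assumed, not measured from the cluster above): with 10,000 tenants per ingester, 1.5 MB of pre-allocated hashtable per tenant works out to roughly 15 GB per ingester before a single sample is stored, and adding more ingesters does not reduce it because every ingester still opens a TSDB for every tenant.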
Just as a demonstration, I reduced the stripe size in the TSDB code, which reduced overall memory consumption from ~94 GiB to ~23 GiB.

Reducing the stripe size has performance implications, and doesn't resolve the root problem of the memory footprint not scaling with node count.
This change adds an optional step when distributing metrics that first creates a subring of ingesters based on the user ID, and then uses that subring to hash as normal. This means that for any given user, that user only has an open TSDB on N given ingesters, reducing the overall memory footprint and scaling horizontally with nodes. The setting is part of the util.Limits code to provide a default and, importantly, per-user limits to prevent hot spots being created by extra-heavy users.
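A schematic sketch of that optional step; the import path, the shardByUser helper, and the exact signatures are illustrative rather than the actual Cortex code, but the Subring signature matches the one shown in the review above:

```go
package distributor

import (
	"hash/fnv"

	"github.com/cortexproject/cortex/pkg/ring"
)

// shardByUser is an illustrative stand-in for however the tenant ID is hashed
// to pick the subring's starting point on the ring.
func shardByUser(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()
}

// ringForUser sketches the optional step: with a per-tenant subring size of 0
// the full ring is used unchanged; otherwise the tenant is first narrowed to a
// subring of that many ingesters, and series are hashed onto it as before.
func ringForUser(r *ring.Ring, userID string, subringSize int) (ring.ReadRing, error) {
	if subringSize <= 0 {
		return r, nil
	}
	return r.Subring(shardByUser(userID), subringSize)
}
```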
You can see in the image below the memory usage before (red arrow), with the reduced stripe size, and the memory usage after (green arrow), with a user subring of 4 and a replication factor of 3. So a user is allocated to 4 of the 16 nodes in the cluster.
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]