
limits: distributor user subrings #1947


Merged 2 commits into cortexproject:master on Jan 31, 2020

Conversation

@thorfour (Contributor) commented on Jan 2, 2020

Signed-off-by: Thor <thansen@digitalocean.com>

Experimental TSDB Change Only

What this PR does: Opening a TSDB per user on each Ingester presents a problem with memory consumption. The series hashtable that TSDB pre-allocates per user is roughly 1.5MB, which means that at a large user count it eats up a large amount of memory per Ingester. More importantly, when the number of ingesters/nodes is scaled out, the memory consumption per node stays the same, since the same number of TSDBs is allocated on each node.

Just as a demonstration I reduced the stripe size in the TSDB code.

+++ b/vendor/github.com/prometheus/prometheus/tsdb/head.go
@@ -1465,7 +1465,7 @@ type stripeSeries struct {
 }
 const (
-       stripeSize = 1 << 14
+       stripeSize = 1 << 8
        stripeMask = stripeSize - 1
 )

This reduced overall memory consumption from ~94GiB to ~23GiB.
Reducing the stripe size has performance implications, and it doesn't resolve the root problem that the memory footprint doesn't scale with node count.

This change adds an optional step when distributing metrics: it first creates a subring of ingesters based on the user ID, and then uses that subring to hash as normal. This means that any given user only has an open TSDB on N given Ingesters, reducing the overall memory footprint and scaling horizontally with nodes. The setting is part of the util.Limits code, to provide a default and, importantly, per-user limits to prevent hot spots being created by extra-heavy users.
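To make the two-step lookup concrete, here is a minimal, self-contained sketch of the idea, not the actual Cortex ring code (which walks tokens and handles replication); the ingester names, FNV-1a hashing, and modulo placement are illustrative assumptions:

// Minimal sketch of the subring idea; not the actual Cortex implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

// hash32 hashes an arbitrary string (tenant ID or series key) to a uint32.
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// subring returns n consecutive ingesters starting at the position derived
// from the user's hash, wrapping around the full ring. With n <= 0 (feature
// disabled) or n >= the ring size, the full ring is used, as before this change.
func subring(ingesters []string, userKey uint32, n int) []string {
	if n <= 0 || n >= len(ingesters) {
		return ingesters
	}
	start := int(userKey % uint32(len(ingesters)))
	sub := make([]string, 0, n)
	for i := 0; i < n; i++ {
		sub = append(sub, ingesters[(start+i)%len(ingesters)])
	}
	return sub
}

func main() {
	ring := []string{"ing-1", "ing-2", "ing-3", "ing-4", "ing-5", "ing-6", "ing-7", "ing-8"}

	// The per-tenant subring size would come from the per-user limits config.
	sub := subring(ring, hash32("tenant-a"), 4)

	// Series for tenant-a are distributed only across the 4 ingesters in sub,
	// so only those 4 ever open a TSDB for this tenant.
	seriesKey := hash32(`http_requests_total{job="api"}`)
	fmt.Println("subring:", sub, "series goes to:", sub[seriesKey%uint32(len(sub))])
}

Because the subring is derived from the user ID alone, adding nodes to the cluster spreads tenants across more machines while each individual tenant still only touches N of them.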

The memory-usage graph (not reproduced here) showed usage before (red arrow), with the reduced stripe size, and after (green arrow), with a user subring of 4 and replication 3. So a user is allocated to 4 of the 16 nodes in the cluster.

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci (Contributor) left a comment

Good job @thorfour! As we discussed privately, I see the value in doing it for the experimental TSDB storage, in a cluster with many tenants. I just left a couple of minor comments.

@pracucci (Contributor) commented on Jan 3, 2020

@codesome While I believe this feature is great, do you also see any optimization we can do in TSDB to mitigate this?

@thorfour thorfour force-pushed the feat/tsdb/user-rings branch 2 times, most recently from 19eea95 to f8f6737 Compare January 3, 2020 18:37
@codesome (Contributor) commented on Jan 6, 2020

> @codesome While I believe this feature is great, do you also see any optimization we can do in TSDB to mitigate this?

Usually such constants are not exposed for users to set in Prometheus, and it is likely the same case with stripeSize. As Prometheus is focused on the performance of a single installation on a node, I don't think this constant would get reduced upstream.

@thorfour thorfour force-pushed the feat/tsdb/user-rings branch 5 times, most recently from 48b4833 to 2512db3 Compare January 7, 2020 15:19
@pracucci (Contributor) left a comment

LGTM

@thorfour thorfour force-pushed the feat/tsdb/user-rings branch from 2512db3 to c7377dd Compare January 8, 2020 16:35
@thorfour thorfour force-pushed the feat/tsdb/user-rings branch from f29f9af to a396200 Compare January 8, 2020 17:15
@bboreham (Contributor) commented on Jan 8, 2020

Even with the chunk store, I have sometimes wished that users were restricted to subsets of ingesters, generally because 3+ ingesters crashed so I lost some data from everyone and I would prefer to apologise to a subset of users.

@thorfour (Contributor, Author) commented on Jan 8, 2020

@codesome I definitely don't think the constant should be reduced. But is there a way to make it user-defined, so that TSDBs aren't as memory-hungry, trading some performance at the upper bounds to do so?

It wouldn't even have to be a user option exposed on the Prometheus end, but one exposed only in the TSDB library, and Prometheus could simply use a constant.
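As a rough illustration of what such a library-level knob could look like (purely hypothetical, this is not the current tsdb/head.go API): the stripe count becomes a constructor argument instead of the package-level constant shown in the diff above, with the mask derived from it, so an embedder like Cortex could pick a smaller value while Prometheus itself keeps passing the existing default. The memSeries stub and single-map-per-stripe layout below are simplifications.

// Hypothetical sketch; the real stripeSeries in tsdb/head.go is more involved.
package tsdbsketch

import "sync"

// memSeries is a stand-in for the real in-memory series type.
type memSeries struct {
	ref  uint64
	hash uint64
}

// stripeSeries shards the series hashtable into `size` stripes, each with its
// own lock; a larger size reduces lock contention but pre-allocates more memory.
type stripeSeries struct {
	size   int
	series []map[uint64]*memSeries
	locks  []sync.RWMutex
}

// newStripeSeries takes the stripe count as an argument rather than reading the
// stripeSize constant; size must be a power of two so the mask below works.
func newStripeSeries(size int) *stripeSeries {
	s := &stripeSeries{
		size:   size,
		series: make([]map[uint64]*memSeries, size),
		locks:  make([]sync.RWMutex, size),
	}
	for i := range s.series {
		s.series[i] = map[uint64]*memSeries{}
	}
	return s
}

func (s *stripeSeries) getByHash(hash uint64) *memSeries {
	i := hash & uint64(s.size-1) // mask derived from the configured size
	s.locks[i].RLock()
	defer s.locks[i].RUnlock()
	return s.series[i][hash]
}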

@pstibrany (Contributor) commented

This PR may have implications for PR #1764.


// Subring returns a ring of n ingesters from the given ring
// Subrings are meant only for ingestor lookup and should have their data externalized.
func (r *Ring) Subring(key uint32, n int) (ReadRing, error) {
@pstibrany (Contributor) commented

Would it make sense to use some caching here? We're going to call this on every push request, which looks like a lot of ring traversal. On the other hand, the number of different keys isn't that high (= the number of tenants), so perhaps we could use some of the distributor's memory to cache these subrings for some time.

@thorfour (Contributor, Author) replied

I considered that, but was concerned about cache invalidation when the ring size changes. So I decided to at least punt on that for now.
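For reference, the kind of caching being discussed could look roughly like the sketch below. It is hypothetical and not part of this PR: it keys cached subrings by tenant and throws the whole cache away whenever the ring appears to have changed, here crudely approximated by the ingester count, which is exactly the invalidation problem mentioned above. IngesterCount, shardByUser, and the cache type are assumptions; only Subring's signature is taken from the code being reviewed.

// Hypothetical sketch of per-tenant subring caching; not code from this PR.
package distributor

import (
	"hash/fnv"
	"sync"

	"github.com/cortexproject/cortex/pkg/ring"
)

// shardByUser is assumed to hash the tenant ID to the uint32 key that would
// otherwise be handed straight to Subring on every push.
func shardByUser(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()
}

type subringCache struct {
	mtx      sync.Mutex
	ringSize int                      // crude invalidation signal: ingester count
	subrings map[string]ring.ReadRing // tenant ID -> cached subring
}

func (c *subringCache) get(r *ring.Ring, userID string, n int) (ring.ReadRing, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	// Drop everything if the ring looks different from the last lookup. A real
	// implementation would need a better change signal than the size alone.
	if size := r.IngesterCount(); size != c.ringSize {
		c.subrings = map[string]ring.ReadRing{}
		c.ringSize = size
	}
	if sub, ok := c.subrings[userID]; ok {
		return sub, nil
	}
	sub, err := r.Subring(shardByUser(userID), n)
	if err != nil {
		return nil, err
	}
	c.subrings[userID] = sub
	return sub, nil
}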

pkg/ring/ring.go (outdated)
start = r.search(key)
iterations = 0
)
for i := start; len(distinctHosts) < n && iterations < len(r.ringDesc.Tokens); i++ {
A contributor commented

It seems to me that this algorithm is too complicated for what we're trying to do here (pick N ingesters based on a key). That said, when I tried to come up with alternatives, they were not any better, and would cause too many ingester jumps during scale-up/down.

@thorfour (Contributor, Author) replied on Jan 10, 2020

I think I would have preferred a simpler solution as well, but I didn't want to thrash when ring size changes as you've pointed out. Maybe if we couple simple sharding with #1958 it may not be so much of a problem?

Tangentially related: I also had the idea that a Querier might be able to detect that the ring size hasn't changed in the last N amount of time, and could in theory stop sending queries to all ingesters and instead send them only to the subring.

@pracucci (Contributor) commented

@bboreham @gouthamve Have you had a chance to take a look at this PR? From my perspective, it's a safe feature, given that it doesn't change the actual Cortex logic until you enable the user subring.

@thorfour thorfour force-pushed the feat/tsdb/user-rings branch 3 times, most recently from 6766b0b to ebb3007 Compare January 21, 2020 18:42
@thorfour thorfour force-pushed the feat/tsdb/user-rings branch 4 times, most recently from e3b1760 to 2695a2d Compare January 28, 2020 17:32
@jtlisi (Contributor) left a comment

LGTM

I will note that @pstibrany's comment on the performance of the SubRing function has me concerned, and I think we will need to iterate on that. However, since it is an experimental feature, I think this should be fine.

It would also be great if we could use this for reads. Connecting to each ingester is not ideal as things currently stand.

Signed-off-by: Thor <thansen@digitalocean.com>
@thorfour thorfour force-pushed the feat/tsdb/user-rings branch from 2695a2d to 3b9e821 Compare January 28, 2020 19:12
Signed-off-by: Thor <thansen@digitalocean.com>
@pracucci (Contributor) commented

I've done another round of tests, and LGTM. It's time to merge it!

@pracucci pracucci merged commit aa27aeb into cortexproject:master Jan 31, 2020