limits: distributor user subrings #1947
Conversation
Good job @thorfour! As we discussed privately, I see the value in doing it for the experimental TSDB storage, in a cluster with many tenants. I just left a couple of minor comments.
@codesome While I believe this feature is great, do you also see any optimization we can do in TSDB to mitigate this?
Usually, such constant numbers are not exposed for users to set in the case of Prometheus, and it might be the same case with the stripe size here.
LGTM
Even with the chunk store, I have sometimes wished that users were restricted to subsets of ingesters, generally because 3+ ingesters crashed, so I lost some data from everyone and I would prefer to apologise to a subset of users.
@codesome I definitely don't think the constant should be reduced. But is there a way to have it be user-defined, so that TSDBs aren't as memory hungry but trade performance at upper bounds to do so? It wouldn't even have to be a user option exposed on the Prometheus end, but one exposed only in the TSDB library, and Prometheus could simply use a constant.
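Purely as a hypothetical illustration of that suggestion (none of the names or the default value below are taken from the actual TSDB code), the constant could become a library-level option that Prometheus leaves at its default:

```go
// Hypothetical library-level option: embedders such as Cortex could lower the
// stripe count for their per-tenant TSDBs, while Prometheus keeps using the default.
const defaultStripeSize = 1 << 14 // illustrative value, not the real constant

type Options struct {
	// ...existing options...

	// StripeSize is the number of shards in the in-memory series hashtable.
	// Zero means defaultStripeSize.
	StripeSize int
}

func (o Options) stripeSize() int {
	if o.StripeSize <= 0 {
		return defaultStripeSize
	}
	return o.StripeSize
}
```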
This PR may have implications for PR #1764.
// Subring returns a ring of n ingesters from the given ring.
// Subrings are meant only for ingester lookup and should have their data externalized.
func (r *Ring) Subring(key uint32, n int) (ReadRing, error) {
Would it make sense to use some caching here? We're going to call this on every push request, which looks like a lot of ring traversal. On the other hand, the number of different keys isn't that high (= number of tenants), so perhaps we could use some of the distributor's memory to cache these subrings for some time.
I considered that, but was concerned about cache invalidation when the ring size changes. So I decided to at least punt on that for now.
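A minimal sketch of the kind of cache being discussed, assuming a ring "generation" counter that is bumped on every ring update (that counter and the helper itself are illustrative, not existing Cortex code); it invalidates everything whenever the ring changes, which sidesteps the invalidation concern at the cost of a cold cache:

```go
package ring

import "sync"

// subringCache is a hypothetical per-tenant cache of subrings.
type subringCache struct {
	mtx        sync.Mutex
	generation uint64              // assumed to be bumped by the ring on every update
	entries    map[string]ReadRing // tenant ID -> cached subring
}

func (c *subringCache) get(r *Ring, ringGeneration uint64, tenantID string, key uint32, n int) (ReadRing, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	// Drop every cached subring if the ring has changed since they were built.
	if ringGeneration != c.generation {
		c.generation = ringGeneration
		c.entries = nil
	}
	if sub, ok := c.entries[tenantID]; ok {
		return sub, nil
	}

	sub, err := r.Subring(key, n)
	if err != nil {
		return nil, err
	}
	if c.entries == nil {
		c.entries = map[string]ReadRing{}
	}
	c.entries[tenantID] = sub
	return sub, nil
}
```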
pkg/ring/ring.go
start = r.search(key)
iterations = 0
)
for i := start; len(distinctHosts) < n && iterations < len(r.ringDesc.Tokens); i++ {
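For readers following along, here is a rough, self-contained sketch of the general technique this loop implements, namely walking the ring clockwise from the key's position and collecting the first n distinct hosts; the names and data layout are simplified and are not the actual Cortex implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// token pairs a ring position with the host that owns it.
type token struct {
	pos  uint32
	host string
}

// subHosts walks the ring clockwise from key and returns the first n distinct
// hosts it encounters. tokens must be sorted by pos.
func subHosts(tokens []token, key uint32, n int) []string {
	if n <= 0 || len(tokens) == 0 {
		return nil
	}
	// Find the first token at or after the key, wrapping around the ring.
	start := sort.Search(len(tokens), func(i int) bool { return tokens[i].pos >= key }) % len(tokens)

	seen := map[string]struct{}{}
	hosts := []string{}
	for i := 0; i < len(tokens) && len(hosts) < n; i++ {
		t := tokens[(start+i)%len(tokens)]
		if _, ok := seen[t.host]; ok {
			continue
		}
		seen[t.host] = struct{}{}
		hosts = append(hosts, t.host)
	}
	return hosts
}

func main() {
	tokens := []token{
		{100, "ingester-1"}, {200, "ingester-2"}, {300, "ingester-3"},
		{400, "ingester-1"}, {500, "ingester-4"},
	}
	fmt.Println(subHosts(tokens, 250, 2)) // [ingester-3 ingester-1]
}
```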
It seems to me that this algorithm is too complicated for what we're trying to do here (pick N ingesters based on a key). That said, when I tried to come up with alternatives, they were not any better and would cause too many ingester jumps during scale-up/down.
I think I would have preferred a simpler solution as well, but I didn't want to thrash when the ring size changes, as you've pointed out. Maybe if we couple simple sharding with #1958 it wouldn't be so much of a problem?
Tangentially related, I also had the idea that a querier might be able to detect that the ring size hasn't changed in the last N amount of time, and could in theory stop sending queries to all ingesters and instead send them only to the subring.
@bboreham @gouthamve Have you had a chance to take a look at this PR? From my perspective, it's a safe feature, given that it doesn't change the actual Cortex logic until you enable the user subring.
LGTM
Will note that @pstibrany's comment on the performance of the SubRing function has me concerned, and I think we will need to iterate on that. However, since it is an experimental feature, I think this should be fine.
It would also be great if we could use this for reads. Connecting to each ingester is not ideal as things currently stand.
I've done another round of tests, and LGTM. It's time to merge it!
Signed-off-by: Thor <thansen@digitalocean.com>
Experimental TSDB Change Only
What this PR does: Opening a TSDB per user on each Ingester presents a problem with memory consumption. The series hashtable that TSDB pre-allocates per user is roughly 1.5MB.

This means that at a large user count it eats up a large amount of memory per ingester. More importantly, when the number of ingesters/nodes is scaled out, the per-node memory consumption stays the same, since the same number of TSDBs is allocated on each node.
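As a rough illustration (the tenant count here is assumed, not measured from the cluster above): with 10,000 tenants per ingester, 1.5 MB of pre-allocated hashtable per tenant works out to roughly 15 GB per ingester before a single sample is stored, and adding more ingesters does not reduce it because every ingester still opens a TSDB for every tenant.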
Just as a demonstration, I reduced the stripe size in the TSDB code, which reduced overall memory consumption from ~94 GiB to ~23 GiB.

Reducing the stripe size has performance implications, and doesn't resolve the root problem of the memory footprint not scaling with node count.
This change adds an optional step when distributing metrics that first creates a subring of ingesters based on the user ID, and then uses that subring to hash as normal. This means that for any given user, that user only has an open TSDB on N given ingesters, reducing the overall memory footprint and scaling horizontally with nodes. The setting is part of the util.Limits code to provide a default and, importantly, per-user limits to prevent hot spots being created by extra-heavy users.
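A schematic sketch of that optional step; the import path, the shardByUser helper, and the exact signatures are illustrative rather than the actual Cortex code, but the Subring signature matches the one shown in the review above:

```go
package distributor

import (
	"hash/fnv"

	"github.com/cortexproject/cortex/pkg/ring"
)

// shardByUser is an illustrative stand-in for however the tenant ID is hashed
// to pick the subring's starting point on the ring.
func shardByUser(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()
}

// ringForUser sketches the optional step: with a per-tenant subring size of 0
// the full ring is used unchanged; otherwise the tenant is first narrowed to a
// subring of that many ingesters, and series are hashed onto it as before.
func ringForUser(r *ring.Ring, userID string, subringSize int) (ring.ReadRing, error) {
	if subringSize <= 0 {
		return r, nil
	}
	return r.Subring(shardByUser(userID), subringSize)
}
```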
You can see in the image below the memory usage before (red arrow), with the reduced stripe size, and the memory usage after (green arrow), with a user subring of 4 and a replication factor of 3. So a user is allocated to 4 of the 16 nodes in the cluster.
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]