Configurable active user series #3153
Conversation
Thanks for working on it! I have a few concerns:
- The new metric adds more cognitive load for users trying to understand the differences between chunks and blocks storage. I don't have an answer, but if it could work for the chunks storage too, it would probably be easier to understand.
- I'm a bit scared of the lock contention in ActiveSeries (because it's an exclusive lock). I'm wondering whether, if we switched the entry timestamp to an atomic, a RWMutex could be used (most of the time we would acquire the read lock, because we expect the series to be already in the map); see the sketch below.
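A minimal sketch of that RWMutex-plus-atomic idea, with hypothetical types (fingerprint collisions, purging, and per-user striping are omitted; this is not the actual Cortex code):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// The timestamp lives behind a pointer so atomic operations always hit a
// stable address (this matters once entries are stored in growing slices,
// as discussed later in this thread).
type entry struct {
	nanos *int64
}

type stripe struct {
	mu      sync.RWMutex
	entries map[uint64]entry
}

func (s *stripe) touch(fp uint64, nowNanos int64) {
	// Fast path: the series is usually already tracked, so a shared read
	// lock plus an atomic store is enough.
	s.mu.RLock()
	e, ok := s.entries[fp]
	s.mu.RUnlock()
	if ok {
		atomic.StoreInt64(e.nanos, nowNanos)
		return
	}

	// Slow path: take the exclusive lock only to insert a new entry,
	// re-checking in case another goroutine inserted it first.
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[fp]; ok {
		atomic.StoreInt64(e.nanos, nowNanos)
		return
	}
	s.entries[fp] = entry{nanos: &nowNanos}
}

func main() {
	s := &stripe{entries: map[uint64]entry{}}
	s.touch(42, 1)
	s.touch(42, 2)
}
```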
pkg/ingester/ingester.go
```go
@@ -103,6 +105,8 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.DurationVar(&cfg.MetadataRetainPeriod, "ingester.metadata-retain-period", 10*time.Minute, "Period at which metadata we have not seen will remain in memory before being deleted.")

	f.DurationVar(&cfg.RateUpdatePeriod, "ingester.rate-update-period", 15*time.Second, "Period with which to update the per-user ingestion rates.")
	f.DurationVar(&cfg.ActiveSeriesUpdatePeriod, "ingester.active-series-update-period", 5*time.Minute, "How often to update active series metrics (blocks engine only).")
```
I would default to 1m, to have more accurate tracking (by default).
I agree. Let's do that. (The reason I kept chunks out of this PR is basically to first see how well it works in practice.)
I will write a benchmark to see the throughput of both approaches.
Benchmark showing improvements when using RW lock and atomics to update timestamps.
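The benchmark numbers themselves aren't preserved here, but a minimal sketch of how the two approaches might be compared (hypothetical types, not the PR's actual benchmark) could look like this:

```go
package ingester

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// lockedEntry guards the timestamp with an exclusive mutex.
type lockedEntry struct {
	mu    sync.Mutex
	nanos int64
}

func (e *lockedEntry) update(now int64) {
	e.mu.Lock()
	if now > e.nanos {
		e.nanos = now
	}
	e.mu.Unlock()
}

// atomicEntry advances the timestamp with a CAS loop, no lock at all.
type atomicEntry struct {
	nanos int64
}

func (e *atomicEntry) update(now int64) {
	for {
		old := atomic.LoadInt64(&e.nanos)
		if old >= now || atomic.CompareAndSwapInt64(&e.nanos, old, now) {
			return
		}
	}
}

func BenchmarkMutexUpdate(b *testing.B) {
	e := &lockedEntry{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			e.update(time.Now().UnixNano())
		}
	})
}

func BenchmarkAtomicUpdate(b *testing.B) {
	e := &atomicEntry{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			e.update(time.Now().UnixNano())
		}
	})
}
```

With many goroutines hammering the same hot entries, the CAS version avoids serializing all writers behind a single mutex.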
pkg/ingester/active_series.go
```go
func (s *activeSeriesStripe) updateSeriesTimestamp(now time.Time, series labels.Labels, fp model.Fingerprint, labelsCopy func(labels.Labels) labels.Labels) {
	nowNanos := now.UnixNano()

	e := s.findEntryForSeries(fp, series)
```
I think there's a race condition: we get a reference to an activeSeriesEntry and then update it. In the meanwhile, the purge could remove it, because the read lock is held only during findEntryForSeries(). However, the next remote write push will add it back, right? If so, LGTM, but I would like your opinion too.
Well spotted. Should be fixed now. (There was another problem: since the atomic integer was not a pointer but part of the "entry" struct, which was stored in a slice as a plain struct, it could actually move in memory when the slice grows. That's why the atomic integers are now pointers.)
> There was another problem: since the atomic integer was not a pointer but part of the "entry" struct, which was stored in a slice as a plain struct, it could actually move in memory when the slice grows. That's why the atomic integers are now pointers.

Well spotted!
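To illustrate the slice-growth hazard in a toy example (hypothetical names, not the PR's code): a plain int64 field would be copied to a new backing array on append, while a pointer keeps the atomic's address stable.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type entry struct {
	// With a plain `nanos int64` field, append() below may reallocate the
	// slice and copy every entry; a goroutine that grabbed a reference to
	// the old element would then atomically update a stale copy, and the
	// write would be lost. The pointer keeps the int64 at a fixed address.
	nanos *int64
}

func main() {
	entries := make([]entry, 0, 1)
	entries = append(entries, entry{nanos: new(int64)})

	// Simulate holding on to the timestamp across a slice growth.
	ts := entries[0].nanos
	entries = append(entries, entry{nanos: new(int64)}) // may reallocate

	atomic.StoreInt64(ts, 42)
	fmt.Println(atomic.LoadInt64(entries[0].nanos)) // 42: the update is not lost
}
```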
I'd be okay adding this to the chunks store too. How much overhead will this add?

By my counting, each … (In blocks storage, we try to reuse the labels copy if it was added to the ref cache at the same time. Doing the same in chunks is not easy.)
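For scale, the estimate quoted later in the follow-up PR description works out to roughly 40 MB / 1M active series ≈ 40 bytes of fixed bookkeeping per entry, plus up to another 200 MB ≈ 200 bytes per series whenever the labels copy can't be reused from the ref cache.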
Excellent job. I left a couple of nits. As discussed offline, I would add an `enabled` flag (defaulting to false) so that not every Cortex user pays the cost of this tracking when it's not required (it's usually required by vendors' billing systems); see the sketch below.
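Such a flag could be registered next to the existing ones from the diff above; the field and flag names here are assumptions for illustration, not necessarily what the PR ended up using:

```go
// Inside (cfg *Config) RegisterFlags(f *flag.FlagSet), next to the other
// ingester flags; names are hypothetical.
f.BoolVar(&cfg.ActiveSeriesEnabled, "ingester.active-series-enabled", false, "Enable tracking of active series, exported as a metric (disabled by default; blocks engine only).")
```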
Active series are exported as metric. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Also use atomic to keep oldest timestamp. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
…s is updated when slice in the stripe map is appended. Avoid race when newly added entry is removed by purger due to timestamp being set to 0. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
…s to be in a loop). Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
…nction. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Added …

I've also made sure that if …
Good job, still LGTM! Left final nits.
Is this still true?

Not anymore. Also, there is now a flag to enable this, which is disabled by default.
* Enable active series metrics in the ingester by default

This change calculates and exports the `cortex_ingester_active_series` metric by default. Up to this point, the metric was disabled by default since calculating it consumes some amount of memory. The original PR (#3153) estimated for 1M active series at least 40MB, and up to another 200MB depending on our luck reusing labels from the ref cache.

We (Grafana) have been running with this setting enabled on our ingesters for some time and the resource usage doesn't appear to be significant. This feature appears to add between 1.2% and 1.6% in memory usage when enabled: ~140MB out of a total of ~10GB of memory used per ingester. The ingesters I measured this on:

* Have multiple tenants running production workloads
* Have about 1.3M active series each
* Have about a 10GB working set (as measured by `kubectl top` and exported k8s metrics)

Based on this and the utility of the metric itself, I'd like to enable it by default. Screenshots of the pprof heap output attached.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
What this PR does: This PR makes tracking of active series in the blocks storage configurable. There are three parameters: how often to purge old series from memory and update metrics, how old a series must be to be considered inactive, and one to enable active series tracking (disabled by default). This makes it possible to track active series more precisely, even if they are still in the TSDB head but haven't had any samples appended recently.

In the current state of the PR, this is only wired into the blocks engine, not the chunks engine.

Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`