discuss: stateful series tracking staleness #31016

Closed
Tracked by #30705
sh0rez opened this issue Feb 2, 2024 · 6 comments

Labels
discussion needed Community discussion needed

Comments

@sh0rez
Member

sh0rez commented Feb 2, 2024

Component(s)

deltatocumulative (wip), interval (wip), others?

Describe the issue you're reporting

Stateful components keep state about telemetry signals (like metric streams) in memory.
WIP processors like deltatocumulative and interval need to maintain a (variable-size) set of samples per tracked series.
As series may come and go, tracking them indefinitely leads directly to unbounded memory growth.

Systems like Prometheus solve this using "staleness": series that have not received fresh samples for a given time interval are considered "stale" and subsequently removed from tracking, freeing the memory they occupy.

Given the functional overlap of several stateful metrics processors needing to track streams and expire that tracking, I think there is an opportunity to generalize this behavior, e.g. using a stream-map interface like the following:

// Map associates per-stream state of type T with a stream identity.
type Map[T any] interface {
	Load(Stream) T
	Store(Stream, T)
	Forget(Stream)
}

A staleness implementation may look as follows:

// PriorityQueue orders stream identities by the timestamp of their last write.
type PriorityQueue[K, P comparable] interface {
	Update(id K, prio P)
	Peek() P
	Pop() K
}

type Staleness[T any] struct {
	Max time.Duration // streams not written to for longer than Max are dropped

	Map[T]
	pq PriorityQueue[Stream, time.Time]
}

// Store records the write time alongside the value, so Purge can expire it later.
func (s Staleness[T]) Store(id Stream, v T) {
	s.pq.Update(id, time.Now())
	s.Map.Store(id, v)
}

// Purge forgets all streams whose last write is older than Max.
func (s Staleness[T]) Purge() {
	for {
		ts := s.pq.Peek()
		if time.Since(ts) < s.Max {
			break
		}
		id := s.pq.Pop()
		s.Map.Forget(id)
	}
}
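
To sketch how a processor could wire this up (same package as the declarations above, imports elided as above; Stream, hashmap and lastSeen are made-up stand-ins, not part of the proposal):

// Stream stands in for whatever identity we end up using
// (e.g. a hash of resource + scope + metric + attributes).
type Stream string

// hashmap is the most naive Map[T]: a plain Go map.
type hashmap[T any] map[Stream]T

func (m hashmap[T]) Load(id Stream) T     { return m[id] }
func (m hashmap[T]) Store(id Stream, v T) { m[id] = v }
func (m hashmap[T]) Forget(id Stream)     { delete(m, id) }

// lastSeen is a naive PriorityQueue[Stream, time.Time] that scans linearly
// for the oldest entry. Fine for a sketch, not for production.
type lastSeen map[Stream]time.Time

func (l lastSeen) Update(id Stream, prio time.Time) { l[id] = prio }

func (l lastSeen) Peek() time.Time {
	oldest := time.Now() // an empty queue is never considered stale
	for _, t := range l {
		if t.Before(oldest) {
			oldest = t
		}
	}
	return oldest
}

func (l lastSeen) Pop() Stream {
	var id Stream
	var ts time.Time
	first := true
	for k, t := range l {
		if first || t.Before(ts) {
			id, ts, first = k, t, false
		}
	}
	delete(l, id)
	return id
}

func example() {
	tracked := Staleness[float64]{
		Max: 5 * time.Minute,
		Map: hashmap[float64]{},
		pq:  lastSeen{},
	}

	// On every datapoint: Store refreshes the stream's last-write time.
	tracked.Store("cpu{core=0}", 42)

	// In a processor, Purge would run periodically (e.g. on a ticker);
	// streams untouched for longer than Max get dropped here.
	tracked.Purge()
}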

/cc @RichieSams @djaglowski

@sh0rez sh0rez added the needs triage New item requiring triage label Feb 2, 2024
@sh0rez sh0rez changed the title discuss: stateful staleness discuss: stateful series tracking staleness Feb 2, 2024
@djaglowski
Member

Thanks for opening this @sh0rez.

Would you be willing to propose a corresponding config as well? I assume the user should have control over the staleness timeout. Is there anything else?
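
For example, something as small as this (a made-up sketch, not a proposal; MaxStale / max_stale are placeholder names, time import omitted):

// Hypothetical processor config fragment; names are illustrative only.
type Config struct {
	// MaxStale is how long a series may go without new samples before its
	// tracked state is dropped. Would surface as `max_stale` in the YAML.
	MaxStale time.Duration `mapstructure:"max_stale"`
}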

A related concern I have is regarding how the user can manage cardinality. Should we also have the ability to set a max number of streams, and flush the oldest when we would exceed the max? I'm asking here because these are both directly related to managing the amount of data retention, so we might want to unify these concerns in a single package. I'm curious your thoughts on this.
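
Roughly, I'm imagining something like this on top of the interfaces proposed above (a very rough sketch, names made up):

// Capped wraps a Map and evicts the least recently written stream once a
// maximum number of tracked streams is reached.
type Capped[T any] struct {
	MaxStreams int

	Map[T]
	pq   PriorityQueue[Stream, time.Time]
	seen map[Stream]struct{} // which streams are currently tracked
}

func (c *Capped[T]) Store(id Stream, v T) {
	if _, ok := c.seen[id]; !ok && len(c.seen) >= c.MaxStreams {
		// New stream while at the cap: drop the least recently written one.
		old := c.pq.Pop()
		c.Map.Forget(old)
		delete(c.seen, old)
	}
	c.seen[id] = struct{}{}
	c.pq.Update(id, time.Now())
	c.Map.Store(id, v)
}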

@sh0rez
Member Author

sh0rez commented Feb 2, 2024

I think we can lean on Prometheus experience for this, like this talk: https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf

tldr:

  • fixed interval has drawbacks, e.g. "target overlap", where a target was redeployed but the old one is not yet stale
  • prom has "staleness markers" (special NaNs, see the snippet below this list) it inserts when:
    • the target goes away (no longer in service discovery)
    • a series is no longer in the scrape
    • a scrape fails
  • it also does some trickery to avoid false positives: instead of inserting markers right away when those conditions are met, it waits until after the scrape interval and tries then. as the tsdb is append-only, this only succeeds if the conditions held true during that time.
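
for reference, such a marker is just a NaN with a fixed bit pattern, so it can be told apart from an ordinary NaN sample value (the constant below mirrors prometheus' StaleNaN; snippet is illustrative only, math import elided):

// staleNaN is the bit pattern prometheus uses for staleness markers: it is
// still a NaN, but a specific one, so it can be distinguished from ordinary
// NaN sample values.
const staleNaN uint64 = 0x7ff0000000000002

func isStaleMarker(v float64) bool {
	return math.Float64bits(v) == staleNaN
}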

This of course heavily builds on Prometheus data model assumptions, which are different from OTel.
Applying this to OTel leads me to some questions:

  • can we authoritatively detect a target going away?
  • can we authoritatively detect a series going away if the target remains?
  • can we do that interval trickery? does the single-writer principle help?

most importantly, do we even want that? e.g. a sporadic delta producer might be stale all the time. what are the use cases we need to enable? prom-like monitoring + alerting? low-connectivity IoT?

@crobert-1 crobert-1 added the discussion needed Community discussion needed label Feb 2, 2024
@djaglowski
Member

Thanks for the detailed thoughts on this @sh0rez. At a high level, I like the idea of not reinventing the wheel but I don't have clear answers to your questions so would want to hear other people's thoughts as well. Perhaps some folks with more OTel && Prometheus experience can chime in.

@RichieSams
Contributor

Apologies for the delayed response; I had some family health issues last week.

IMO, a fixed interval gets us 98% of the benefits and is very simple to implement and understand. While it does have the "overlap" issue, I don't think that's a big problem in practice: the "old" counter will no longer be modified, so any useful operations, like rate(), will do the "right" thing, and users won't know the difference. I.e., the rate() of the old series will be zero, while the rate() of the new series will start up.

@RichieSams
Contributor

RichieSams commented Feb 6, 2024

@sh0rez @djaglowski I created a WIP PR implementing the above behaviour: ce07908

djaglowski pushed a commit that referenced this issue Feb 27, 2024
… staleness (#31089)

**Description:**
It's a glorified wrapper over a Map type, which allows values to be
expired based on a pre-supplied interval.

**Link to tracking Issue:**

#31016

**Testing:**
I added some basic tests of the PriorityQueue implementation as well as
the expiry behaviour of Staleness

**Documentation:**

All the new structs are documented
jpkrohling pushed a commit that referenced this issue Mar 5, 2024
**Description:** Removes stale series from tracking (and thus frees
their memory) using staleness logic from
#31089

**Link to tracking Issue:**
#30705,
#31016

**Testing:** `TestExpiry`
**Documentation:** README updated
@crobert-1 crobert-1 removed the needs triage New item requiring triage label Mar 5, 2024
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
@sh0rez
Member Author

sh0rez commented Mar 28, 2024

implementation and re-usable components are merged, closing

@sh0rez sh0rez closed this as completed Mar 28, 2024