
Determine how to report dropped metrics #1655

Open
jsuereth opened this issue Apr 27, 2021 · 3 comments
Labels: area:data-model, area:semantic-conventions, enhancement, spec:metrics

Comments

@jsuereth (Contributor)

In #1618, the question of how to handle "dropped" metrics within OTLP was raised. That PR proposed a naive algorithm for Delta => Cumulative sum conversion, intended to be lightweight in memory consumption, which does NOT allow for out-of-order (or delayed) delta sum points.

This raises the question of how to report "dropped" metrics within OTLP in this scenario. Do we want to provide some known (semantic) convention? Or should this be a first-class attribute (like dropped_attributes_count on spans)?
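For reference, here is a minimal sketch of such a lightweight Delta => Cumulative conversion and where the dropped points arise. This is not the exact #1618 proposal; all type and field names are illustrative:

```go
package convert

import "time"

// deltaPoint is an illustrative stand-in for a delta sum data point.
type deltaPoint struct {
	start, end time.Time
	value      float64
}

// cumulativeState keeps only a running sum and the last end timestamp,
// which is what makes the conversion lightweight in memory.
type cumulativeState struct {
	lastEnd time.Time
	sum     float64
	dropped int64 // how to surface this count is exactly the open question
}

// accumulate folds a delta point into the running cumulative sum. Because
// only the last end timestamp is retained, an out-of-order or delayed
// delta point cannot be merged and must be dropped.
func (s *cumulativeState) accumulate(p deltaPoint) {
	if p.start.Before(s.lastEnd) {
		s.dropped++ // silently lost today; no OTLP field carries this
		return
	}
	s.sum += p.value
	s.lastEnd = p.end
}
```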

For now, tagging this as DataModel + Semantic convention until we've had a chance to discuss the best way to handle this.

@jsuereth (Contributor, Author) commented on Jun 8, 2021

Context (in note form)

  • Tracing allows dropping data when over some limit. This is reflected in the data model: every location that carries attributes can also report a dropped-attributes count. (For a sketch of what an analogue for metrics could look like, see the code after this list.)
  • Metrics API/SDK (currently) does not specify any mechanism to drop/limit the amount of data collected and reported. Instead it focuses on "aggregation" of data.
    • For Histogram, the data model requires reporting the total number of measurements in the aggregation in a field called "count".
    • For Sums (specifically synchronous instrument sums), there is no way to report the number of measurements during a reporting interval that led to a value.
      • For a synchronous Sum, it's possible that no values were recorded during an interval.
      • For an asynchronous Sum, it's possible that an "exception" occurred while fetching the value.
    • Gauges are (currently) async-only, meaning there should be only one Gauge point per reporting interval. It is possible for an "exception" to occur during processing of a gauge.
    • The View / Processor API is still being fleshed out. While a user will be able to drop metrics via this API, it's unclear at this moment what will be visible and how to proceed.
    • The Exemplar portion of the API is still unfinished, but it is the most likely place where dropping data for efficiency will come into play.
  • From a "Collector" perspective, the following are sources of dropped metric data:
    • The Memory Limiter processor can be used to drop incoming metrics when the collector itself has exceeded memory threshold limits.
    • The Attributes Processor can add/remove attributes from metric streams.
    • The Metrics Transform Processor can perform a myriad of add/remove/create operations on metrics from the same source.
  • From an integration standpoint, we have a clear signal from pull-based receivers that metric points may be missing.
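To make the tracing parallel and the Histogram "count" field concrete, here is a hedged sketch of the two reporting shapes mentioned in the issue description. Both the DroppedPointsCount field and the otel.dropped_points attribute are hypothetical; neither exists in OTLP today:

```go
package model

// sumDataPoint loosely mirrors an OTLP Sum data point (illustrative only).
type sumDataPoint struct {
	Value float64
	// Option A, semantic convention: a well-known attribute carries the
	// drop count, e.g. {"otel.dropped_points": "3"} (hypothetical name).
	Attributes map[string]string
}

// metricStream shows Option B: a first-class field, analogous to
// Span.dropped_attributes_count in tracing or the Histogram "count" field.
type metricStream struct {
	Name   string
	Points []sumDataPoint
	// Hypothetical; not part of the current metrics data model.
	DroppedPointsCount uint64
}
```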

Open Questions

  • Should the SDK report "dropped" / "error" metric points when an exceptional scenario occurs while pulling a gauge?
  • Should the SDK report the # of measurements that participate in a Sum?
  • Should the SDK report the # of dropped exemplars?
  • Are there other scenarios (like attribute/event/link limits in Tracing) where we want to report that data is missing in metrics, not captured here?
  • Should the collector report metric drops during processing? (e.g. having the memory limiter report high-fidelity dropped data would defeat its purpose; a coarse-counter sketch follows this list)
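On that last question, one possible reading, sketched under the assumption that the limiter keeps no per-stream state: report drops only as a single coarse self-observability counter, since retaining per-stream detail would consume the very memory the limiter is trying to reclaim. The counter name below is made up for illustration:

```go
package memorylimiter

import "sync/atomic"

// droppedPoints is a coarse process-wide counter. In a real collector this
// would be exposed as a self-observability metric; the name is hypothetical.
var droppedPoints atomic.Int64

// dropBatch records a drop without retaining any per-stream detail, so the
// limiter's own memory footprint stays constant under pressure.
func dropBatch(pointCount int) {
	droppedPoints.Add(int64(pointCount))
}
```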

@lakamsani commented on Jun 8, 2021

Heartbeat metrics like this can be useful. It would be good to avoid reporting transient errors. For example, only report an error if a gauge could not be collected on the last 10 attempts the SDK made (or some gauge-specific upper limit set via, say, metric config).
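A minimal sketch of this idea, assuming a hypothetical per-gauge knob maxConsecutiveFailures (e.g. 10) and log-based reporting as a stand-in for whatever error channel the SDK would actually use:

```go
package gauges

import "log"

// gaugeObserver suppresses transient collection errors and reports only
// after maxConsecutiveFailures failed attempts in a row.
type gaugeObserver struct {
	failures               int
	maxConsecutiveFailures int // hypothetical per-gauge config, e.g. 10
}

// observe runs the collection callback; the bool result says whether a
// value should be exported this interval.
func (g *gaugeObserver) observe(collect func() (float64, error)) (float64, bool) {
	v, err := collect()
	if err != nil {
		g.failures++
		if g.failures >= g.maxConsecutiveFailures {
			// Hypothetical reporting hook; could instead emit an "error" point.
			log.Printf("gauge collection failing persistently: %v", err)
		}
		return 0, false
	}
	g.failures = 0 // reset on success so only consecutive failures count
	return v, true
}
```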

@dpk83 commented on Dec 16, 2022

Copying the message from #2960 as this looks like the right place to discuss it.

[image: screenshot of the message copied from #2960; text not recoverable]
