Metrics: Requirements for safe attribute removal #1297

jmacd · 2020-12-18T00:00:29Z

What are you trying to achieve?

Make it safe to remove labels from OTLP metrics in a way that
preserves their semantics. We have identified two scenarios where it
is not safe to simply strip labels away from metrics data.

When all identifying resource attributes have been removed from a
CUMULATIVE metric, the resulting timeseries has overlapping time
intervals (i.e., their StartTimeUnixNanos and TimeUnixNanos
overlap).
When SumObserver and UpDownSumObserver data points are stripped
of labels that were used to subdivide an observed sum, the resulting
points lose meaning.

An example for both is provided below.

CUMULATIVE points and the loss of unique labels

For the first item, OTLP has not specified how to interpret CUMULATIVE
data in this scenario. For reference, current OTLP Metrics Sum data
point defines AggregationTemporality as follows:

  // DELTA is an AggregationTemporality for a metric aggregator which reports
  // changes since last report time. Successive metrics contain aggregation of
  // values from continuous and non-overlapping intervals.
  //
  // The values for a DELTA metric are based only on the time interval
  // associated with one measurement cycle. There is no dependency on
  // previous measurements like is the case for CUMULATIVE metrics.

and:

  // CUMULATIVE is an AggregationTemporality for a metric aggregator which
  // reports changes since a fixed start time. This means that current values
  // of a CUMULATIVE metric depend on all previous measurements since the
  // start time. Because of this, the sender is required to retain this state
  // in some form. If this state is lost or invalidated, the CUMULATIVE metric
  // values MUST be reset and a new fixed start time following the last
  // reported measurement time sent MUST be used.

Simply translating overlapping CUMULATIVE data into Prometheus results in incorrect interpretation
because timeseries are not meant to have overlapping points (e.g.,
open-telemetry/opentelemetry-collector#2216).
For this reason, we should specify that CUMULATIVE timeseries MUST
always retain at least one uniquely identifying resource attribute.

This point suggests that service.instance.id should be a required
resource attribute.: #1034

We should also specify how consumers are expected to handle the
situation where, because of a mis-configuration, data is presented with some
degree of overlap. The source of this condition is sometimes unsafe label removal but also results from a so-called "zombie" process, the case where multiple writers unintentionally produce
overlapping points.

This issue proposes that Metrics systems SHOULD ensure that
overlapping points are considered duplicates, not valid points. When
overlapping points are stored and queried, the system SHOULD ensure
that only one data point is taken. We might specify that metrics systems
SHOULD apply a heuristic that prefers data points with the youngest
StartTimeUnixNanos under this condition, also that they SHOULD warn
the user about zombies or potentially unsafe label removal.

(UpDown)SumObserver points and subdivided sums

As an example of this scenario, consider observations being made
from a SumObserver callback:

process.cpu.usage{state=idle}
process.cpu.usage{state=user}
process.cpu.usage{state=system}

Now consider that we want to subdivide the process CPU usage by CPU
core number, making many more series:

process.cpu.usage{state=idle,cpu=0}
process.cpu.usage{state=idle,cpu=1}
process.cpu.usage{state=idle,cpu=...}
process.cpu.usage{state=user,cpu=0}
process.cpu.usage{state=user,cpu=1}
process.cpu.usage{state=user,cpu=...}
process.cpu.usage{state=system,cpu=0}
process.cpu.usage{state=system,cpu=1}
process.cpu.usage{state=system,cpu=...}

If these are were expressed as DELTA points (which requires
remembering the last value and performing subtraction), we can
correctly erase the cpu label requires by simple erasure. If these
are expressed as CUMULATIVE points, to erase the cpu label means
performing summation, somehow.

Proposed collector algorithm

I'll propose an algorithm that I think we can use in the
OTel-Collector to correctly erase labels in both of these cases. It
requires introducing a delay over which aggregation takes place for
both DELTA and CUMULATIVE points.

The OTel-Go model SDK already implements a correct label "reducer"
Processor, but the job of the SDK is much simpler than the Collector's
job in this case, for two reasons: (1) the SDK never resets, (2) the
collection is naturally aligned (i.e., all SumObservers have the same
TimeUnixNanos).

Step 1:

Input: DELTA or CUMULATIVE timeseries data with start time S.
Output: the same timeseries with one added label StartTime=S

Step 2: windowing / reducing

over a short window of time, aggregate each metric by resource and label set
for DELTA points, take sum
for CUMULATIVE points, take last value

Step 3: spatial aggregation

Process one interval
Remove any unwanted labels
Remove any unwanted resource attributes
Remove the StartTime label
Sum the resulting timeseries in each window
Output w/ a new start time (of the aggregator)

This mimics the behavior of the OTel-Go SDK by introducing a temporary
StartTime label to keep distinct timeseries separate until they are
summed. I will follow up with a more thorough explanation of this algorithm.

The text was updated successfully, but these errors were encountered:

jmacd · 2022-07-27T16:30:56Z

@hardproblems responding to open-telemetry/opentelemetry-collector-contrib#12300 (comment) and open-telemetry/opentelemetry-collector-contrib#12300 (comment)

This is kind of similar to open-telemetry/opentelemetry-collector-contrib#11870. There seem to be use cases where for one reason or another, the sender's unique ID isn't necessary or desirable as a resource attribute.

I understand the concern. There are definitely these use-cases. The single-writer principle and the definition for overlapping streams were created to prevent problems at the consumer when sender's do not use unique IDs.

Although I speculate that the problem with target_info discussed in collector issue 12300 is really a bug, the problem you're experiencing is more general and if you send overlapping data to Prometheus, you should expect errors and/or incorrect results. The fact that you're only seeing the error for target_info is a clue that target_info is incorrect. If you were to remove the Prometheus instance from any of your metrics without correctly aggregating the results, you will see the same type of problem. It sounds like you would like to erase the instance label, which is done in Prometheus via recording rules.

OpenTelemetry's metric data model discusses the requirements for safe removal but does not exactly spell out what the collector should do when it removes uniquely-identifying attributes. This issue was created, and I once drafted a PR #1649 which was premature at the time. The reason this is a corner case and not a earler problem in the OTel specification, is that it only impacts the collector. When an SDK is required to remove attributes (as it must to implement Views), it has many advantages over the collector. Because the SDK is a single-writer (by definition), it can easily combine (i.e., aggregate) multiple series to "safely" remove attributes--this is easy because its data is perfectly timestamp-aligned.

When the collector sees data from multiple senders and wants to erase identifying attributes, it is required to perform the necessary aggregation. This requires some amount of batching and interpolating and is quite a sophisticated bit of work, though I would argue significantly less work than implementing "full-feature" recording rules.

jack-berg · 2024-07-17T15:55:42Z

@jmacd is this still a priority and something you're willing to work on?

jmacd added the spec:metrics Related to the specification/metrics directory label Dec 18, 2020

andrewhsu added area:api Cross language API specification issue area:sdk Related to the SDK priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Dec 22, 2020

andrewhsu assigned jmacd Dec 22, 2020

jmacd mentioned this issue Dec 24, 2020

Add ability to flush metrics on demand from push controller open-telemetry/opentelemetry-go#1349

Closed

jmacd added area:data-model For issues related to data model and removed area:api Cross language API specification issue area:sdk Related to the SDK labels Feb 4, 2021

jmacd mentioned this issue Feb 9, 2021

Adding resource attributes post-creation (e.g. via auto-discovery) #1298

Open

This was referenced Apr 21, 2021

Clarify how Prometheus uses the OpenMetrics "Created" timestamp open-telemetry/wg-prometheus#46

Closed

Add detail about metric staleness #1648

Closed

Add detail about reaggregation for safe attribute removal #1649

Closed

jmacd mentioned this issue May 15, 2021

Counter, UpDownCounter, and Gauge instruments compared open-telemetry/oteps#156

Closed

jsuereth removed the release:required-for-ga Must be resolved before GA release, or nice to have before GA label May 25, 2021

jmacd mentioned this issue Jun 29, 2021

[new cumulative to delta processor] initial skeleton open-telemetry/opentelemetry-collector-contrib#3752

Merged

pirgeo mentioned this issue Nov 22, 2021

Data Classifications for resources and attributes open-telemetry/oteps#187

Closed

legendecas mentioned this issue May 27, 2022

Ability to remove/flush unused attributes in metric instruments open-telemetry/opentelemetry-js#2997

Open

jmacd mentioned this issue Jul 27, 2022

[exporter/prometheusremotewrite] "target" metric causes out-of-order samples errors on load balanced collector deployment open-telemetry/opentelemetry-collector-contrib#12300

Closed

jmacd changed the title ~~Metrics: Requirements for safe label removal~~ Metrics: Requirements for safe attribute removal Jan 16, 2024

austinlparker added the triage:deciding:tc-inbox Needs attention from the TC in order to move forward label Jun 18, 2024

jack-berg added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics: Requirements for safe attribute removal #1297

Metrics: Requirements for safe attribute removal #1297

jmacd commented Dec 18, 2020 •

edited

Loading

jmacd commented Jul 27, 2022

jack-berg commented Jul 17, 2024

Metrics: Requirements for safe attribute removal #1297

Metrics: Requirements for safe attribute removal #1297

Comments

jmacd commented Dec 18, 2020 • edited Loading

CUMULATIVE points and the loss of unique labels

(UpDown)SumObserver points and subdivided sums

Proposed collector algorithm

jmacd commented Jul 27, 2022

jack-berg commented Jul 17, 2024

jmacd commented Dec 18, 2020 •

edited

Loading