Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing Start Timestamp for Cumulative Metrics in Skipper Metrics #3087

Closed
silentz opened this issue May 30, 2024 · 1 comment · Fixed by #3089
Closed

Missing Start Timestamp for Cumulative Metrics in Skipper Metrics #3087

silentz opened this issue May 30, 2024 · 1 comment · Fixed by #3089

Comments

@silentz
Copy link

silentz commented May 30, 2024

Is your feature request related to a problem? Please describe.
In OpenTelemetry, every metric of cumulative type is required to have a StartTimestamp, which records when a metric is first recorded. However, Prometheus exporter doesn't provide StartTimestamp by default. The absence of StartTimestamp leads to inaccuracies that distort the telemetry data.

Example 01: When new StartTimestamp is less than actual StartTimestamp, it creates "fake tail" in metric values

image

Example 02: When new StartTimestamp is more than actual StartTimestamp, it creates "fake peak" in metric values

image

As you can see from the examples above, incorrect choice of StartTimestamp on OpenTelemetry side always leads to distortions in collected metrics data. And without StartTimestamp on Skipper side there is no chance to correctly guess StartTimestamp on telemetry collector side.

Describe the solution you would like
To remove these side effects in telemetry data, we need StartTimestamp to be set for each cumulative metric by Skipper. The value of StartTimestamp for a particular metric must be equal to the UNIX timestamp (in nanoseconds) of when this metric, with the exact same combination of attributes, was first seen by Skipper.

image

Describe alternatives you've considered (optional)
In our setup we also tested out start timestamp normalization algorithm, introduced by OpenTelemetry in this article. It describes a set of transformations to the cumulative sum metrics that helps handling unknown start timestamp. This approach can be used as a workaround in situations, where getting start timestamp of stream is not possible at all. But like any other hacks, this one has its own disadvantages. During collector restarts metric data will be completely lost:

image

This is a critical problem, since usually one OpenTelemetry collector is used per cluster / installation, so restart will cause loss of data. This can be definetly avoided in case we have StartTimestamp for each cumulative metric on Skipper side.

Additional context (optional)
Prometheus Data Model: https://prometheus.io/docs/concepts/metric_types/
OpenTelemetry Data Model: https://opentelemetry.io/docs/specs/otel/metrics/data-model/

Would you like to work on it?
Yes

@szuecs
Copy link
Member

szuecs commented May 30, 2024

Please no cache, there is no way how to cleanup and basically it's an unbounded data sync

AlexanderYastrebov added a commit that referenced this issue May 30, 2024
Add start label to each counter with the value of counter creation
timestamp as unix nanoseconds.

This enables OpenTelemetry cumulative temporality,
see https://opentelemetry.io/docs/specs/otel/metrics/data-model/#temporality

Example:
```
~$ curl -s localhost:9911/metrics | grep host_count
 # HELP skipper_serve_host_count Total number of requests of serving a host.
 # TYPE skipper_serve_host_count counter
skipper_serve_host_count{code="200",host="bar_test",method="GET",start="1717066533598500794"} 1
skipper_serve_host_count{code="200",host="foo_test",method="GET",start="1717066538031805059"} 2
```

Fixes #3087
AlexanderYastrebov added a commit that referenced this issue May 30, 2024
Add start label to each counter with the value of counter creation
timestamp as unix nanoseconds.

This enables OpenTelemetry cumulative temporality,
see https://opentelemetry.io/docs/specs/otel/metrics/data-model/#temporality

Example:
```
~$ curl -s localhost:9911/metrics | grep host_count
 # HELP skipper_serve_host_count Total number of requests of serving a host.
 # TYPE skipper_serve_host_count counter
skipper_serve_host_count{code="200",host="bar_test",method="GET",start="1717066533598500794"} 1
skipper_serve_host_count{code="200",host="foo_test",method="GET",start="1717066538031805059"} 2
```

Fixes #3087

Signed-off-by: Alexander Yastrebov <alexander.yastrebov@zalando.de>
AlexanderYastrebov added a commit that referenced this issue May 31, 2024
Add start label to each counter with the value of counter creation
timestamp as unix nanoseconds.

This enables OpenTelemetry cumulative temporality,
see https://opentelemetry.io/docs/specs/otel/metrics/data-model/#temporality

Example:
```
~$ curl -s localhost:9911/metrics | grep host_count
 # HELP skipper_serve_host_count Total number of requests of serving a host.
 # TYPE skipper_serve_host_count counter
skipper_serve_host_count{code="200",host="bar_test",method="GET",start="1717066533598500794"} 1
skipper_serve_host_count{code="200",host="foo_test",method="GET",start="1717066538031805059"} 2
```

Fixes #3087

Signed-off-by: Alexander Yastrebov <alexander.yastrebov@zalando.de>
AlexanderYastrebov added a commit that referenced this issue Jun 3, 2024
Add start label to each counter with the value of counter creation
timestamp as unix nanoseconds.

This enables OpenTelemetry cumulative temporality,
see https://opentelemetry.io/docs/specs/otel/metrics/data-model/#temporality

Example:
```
~$ curl -s localhost:9911/metrics | grep host_count
 # HELP skipper_serve_host_count Total number of requests of serving a host.
 # TYPE skipper_serve_host_count counter
skipper_serve_host_count{code="200",host="bar_test",method="GET",start="1717066533598500794"} 1
skipper_serve_host_count{code="200",host="foo_test",method="GET",start="1717066538031805059"} 2
```

Fixes #3087

Signed-off-by: Alexander Yastrebov <alexander.yastrebov@zalando.de>
JanardhanSharma pushed a commit to JanardhanSharma/skipper that referenced this issue Jul 19, 2024
Add start label to each counter with the value of counter creation
timestamp as unix nanoseconds.

This enables OpenTelemetry cumulative temporality,
see https://opentelemetry.io/docs/specs/otel/metrics/data-model/#temporality

Example:
```
~$ curl -s localhost:9911/metrics | grep host_count
 # HELP skipper_serve_host_count Total number of requests of serving a host.
 # TYPE skipper_serve_host_count counter
skipper_serve_host_count{code="200",host="bar_test",method="GET",start="1717066533598500794"} 1
skipper_serve_host_count{code="200",host="foo_test",method="GET",start="1717066538031805059"} 2
```

Fixes zalando#3087

Signed-off-by: Alexander Yastrebov <alexander.yastrebov@zalando.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants