How to create an "up" metric #2923

Open
tomasmota opened this issue Nov 7, 2022 · 14 comments
Labels: enhancement (New feature or request), spec:metrics (Related to the specification/metrics directory)

@tomasmota

Sorry if this is out of place here; I tried asking in Slack but did not get a satisfactory answer.

What are you trying to achieve?
I want to create an up metric for my services.
With bare Prometheus, the pull mechanism makes this easy: if the scrape fails, the service is down. With OTel, since you set a metric expiration (which is nice to have for other metrics), the service might have stopped sending the up metric while it is still reported with a value of 1, so we won't know that the service is down until the metric expires.

Is there some way to achieve this that I'm overlooking? One workaround I considered is an uptime metric, or a current-time metric, that we would expect to keep increasing while the service is up.

@tomasmota tomasmota added the spec:metrics Related to the specification/metrics directory label Nov 7, 2022
@tomasmota
Author

Forgot to mention: this is using the Prometheus exporter.

@dashpole
Contributor

The Prometheus up metric is basically a health-check-as-a-metric. You could achieve almost the same behavior by having something external to the application periodically health-check it and report a metric based on the result. It won't be tied to your other metrics the way the Prometheus up metric is: this metric being healthy doesn't imply that your other metrics were successfully collected, as it does for Prometheus. But it does give you an external health signal. The httpcheck receiver in the collector might be able to fulfill that role.
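
For illustration, a minimal collector config sketch along those lines, assuming the httpcheck receiver from opentelemetry-collector-contrib; the target URL, port, and interval are hypothetical examples, not from this thread:

```yaml
# Sketch: poll a (hypothetical) health endpoint via the httpcheck receiver
# and expose the resulting metrics to Prometheus.
receivers:
  httpcheck:
    targets:
      - endpoint: http://my-service:8080/healthz   # hypothetical target
        method: GET
    collection_interval: 30s

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    metrics:
      receivers: [httpcheck]
      exporters: [prometheus]
```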

Alternatively, you could introduce a metric in the application whose value is the current time, and alert when the value of the metric is older than a threshold.
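
A minimal sketch of that idea, assuming the OpenTelemetry Python SDK; the metric name and export interval are illustrative, not an established convention:

```python
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export every 10s; swap ConsoleMetricExporter for your OTLP exporter.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=10_000
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("heartbeat-demo")


def observe_heartbeat(options: CallbackOptions):
    # Report the current wall-clock time; downstream, alert when this
    # value stops advancing (i.e., is older than a threshold).
    return [Observation(time.time())]


meter.create_observable_gauge(
    "service.heartbeat.timestamp_seconds",  # hypothetical name
    callbacks=[observe_heartbeat],
    unit="s",
    description="Unix time reported by the service while it is alive",
)
```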

@jsuereth
Contributor

If you're using a Prometheus exporter + a pull-based system, then an up metric should be created BY Prometheus, and it makes a lot of sense there.

up metrics aren't really the same in a push-based world. Typically, you'd have a heartbeat metric rather than an up metric, and you need to interact with the two subtly differently. up means the endpoint was alive and responded; with a heartbeat, you expect it every N seconds, and delays/shifts can signal problems, but with a lot of noise.

Generally, I think the notion of "health" is important, but I wouldn't use the exact same solution for pull-based health + push-based health.

I know @jmacd has done a lot of thinking here. Josh, let us know if you think this idea has legs or if we should go a different direction.

@jsuereth jsuereth added the enhancement New feature or request label Nov 11, 2022
@tomasmota
Author

I appreciate the feedback and suggestions. I might very well go with the current-time solution, as there doesn't seem to be any other way to emulate a "liveness" metric in this setup.

As you can see from the diagram, @jsuereth, it is a mix of push and pull; that is why it is hard to give the "up" metric responsibility to Prometheus. Is there a standardized way of implementing a heartbeat like you suggest, or would something like current time or seconds since start-up be as good as any other solution?

```mermaid
graph TD
    A[Service A] -->|send otlp| B(local otel collector)
    T[Service B] -->|send otlp| F(local otel collector)
    B[local otel collector] -->|send otlp| C(gateway otel collector)
    F[local otel collector] -->|send otlp| C(gateway otel collector)
    D(prometheus) -->|scrape| C(gateway otel collector)
```

@jsuereth
Contributor

Totally understand this concern. You want a unified view of "up" in the world of mixed push/pull metrics.

My own thoughts here are that we should have something matching your diagram in how we monitor "up" for services, e.g. a way of monitoring here:

```mermaid
graph TD
    A[Pull-based Up metric] --> B(Derived uptime metric)
    T[Push-based Heartbeat metric] --> B(Derived uptime metric)
```

That is, a "derived" metric would be one that can query/join across other metrics to give a cohesive view.

In any case, I think this topic actually deserves an "expert group" to think through and propose a good working convention for OTel here. It's worth collecting some metrics experts, in addition to figuring out where Prometheus stands on this issue.

cc @jmacd @reyang @gouthamve for some attention.

@reyang
Member

reyang commented Nov 14, 2022

@tomas-mota what does "up" mean? A concrete scenario might help, as I bet everyone has their own understanding of "up".

@tomasmota
Author

My requirement here is simply "I want a metric that I can reliably query in order to know whether or not the service is running". As previously explained, a simple up=1 metric does not work here because of the expiration time in the collector.

@reyang
Member

reyang commented Nov 14, 2022

> My requirement here is simply "I want a metric that I can reliably query in order to know whether or not the service is running". As previously explained, a simple up=1 metric does not work here because of the expiration time in the collector.

I don't fully understand the ask here; maybe you're only considering a single instance (do you have a service with multiple instances)? Anyway, I think one could send the local timestamp as a gauge.
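
On the alerting side, a sketch in PromQL of how such a gauge could be used, assuming it reaches Prometheus under a hypothetical name like service_heartbeat_timestamp_seconds:

```promql
# Fire when the last reported heartbeat is more than 5 minutes old;
# the 300s threshold is an arbitrary example.
time() - service_heartbeat_timestamp_seconds > 300
```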

@tomasmota
Author

Exactly, I was just trying to figure out whether there is an "otel" way of stating that the service is up. Using the timestamp is totally fine for me; I just wonder, like @jsuereth, whether there should be a more out-of-the-box way of doing this, so it is clear for other people.

@jsuereth jsuereth assigned jmacd and unassigned jsuereth Nov 15, 2022
@jsuereth
Contributor

@jmacd is going to take this to the Prometheus WG and attempt to make progress, specifically on how to handle "up" metrics from OTLP into Prometheus.

@jmacd
Contributor

jmacd commented Nov 15, 2022

Related OTEP:
open-telemetry/oteps#185 (@jsuereth)

Related issues:

#1078 (on topic, closed in favor of the OTEP above).

Tangentially-related:
#2711
#1273

@jsuereth
Contributor

Thanks for reminding me of that proposal. I still think the general shape of the proposal would be right if we had the right names attached to it :)

@jmacd
Contributor

jmacd commented Nov 16, 2022

#2825
#2824

@tomasmota
Author

Awesome, thanks for taking this on, @jsuereth! Should I close this issue then?
