How to create an "up" metric #2923
Forgot to mention, this is using the prometheus exporter |
The prometheus up metric is basically a healthcheck-as-a-metric. You could achieve almost the same behavior by having something external to the application periodically health-check the application, and report a metric based on the result. It won't be tied to your other metrics in the same way the prometheus up metric is, meaning this metric being in the healthy state doesn't imply your other metrics have been successfully collected in the way it does for prometheus. But it does give you an external health signal. The httpcheck receiver in the collector might be able to fulfil that role. Alternatively, you could introduce a metric in the application whose value is the current time, and alert when the value of the metric is older than a threshold. |
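The "current time" alternative above can be sketched as follows. This is a minimal, library-free illustration, not an OpenTelemetry API: the names `heartbeat_value` and `is_stale` and the 60-second threshold are assumptions made for the example. In a real service the heartbeat value would be reported through whatever gauge instrument the SDK provides, and the staleness check would live in the alerting rule.

```python
import time


def heartbeat_value(clock=time.time):
    """Value the application reports as a gauge: current Unix time in seconds."""
    return clock()


def is_stale(last_heartbeat, now, max_age_seconds=60.0):
    """Alerting side: True when the last reported heartbeat is older than the threshold."""
    return (now - last_heartbeat) > max_age_seconds


# Hypothetical fixed clock so the example is deterministic.
reported = heartbeat_value(clock=lambda: 1_000.0)

print(is_stale(reported, now=1_030.0))  # False: heartbeat is 30 s old
print(is_stale(reported, now=1_090.0))  # True: heartbeat is 90 s old
```

Because the value keeps changing while the service runs, a cached copy of the last data point is detectable as old, which a constant `up=1` is not.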
If you're using a prometheus exporter + pull-based system, then an
Generally, I think the notion of "health" is important, but I wouldn't use the exact same solution for pull-based health + push-based health. I know @jmacd has done a lot of thinking here. Josh, let us know if you think this idea has legs or if we should go a different direction. |
I appreciate the feedback and suggestions. I might very well go with the current time solution, as there doesn't seem to be any other way to emulate a "liveness" metric in this setup. As you can see from the diagram below, @jsuereth, it is a mix of push and pull, which is why it is hard to give the "up" metric responsibility to prometheus. Is there a standardized way of implementing such a heartbeat like you suggest? Or would using something like current time or seconds since start-up be as good as any other solution?
graph TD
A[Service A] -->|send otlp| B(local otel collector)
T[Service B] -->|send otlp| F(local otel collector)
B[local otel collector] -->|send otlp| C(gateway otel collector)
F[local otel collector] -->|send otlp| C(gateway otel collector)
D(prometheus) -->|scrape| C(gateway otel collector)
|
Totally understand this concern. You want a unified view of "up" in the world of mixed push/pull metrics. My own thoughts here are that we should have something matching your diagram in how we monitor "up" for services, e.g. a way of monitoring like this:
graph TD
A[Pull-based Up metric] --> B(Derived uptime metric)
T[Push-based Heartbeat metric] --> B(Derived uptime metric)
That is, a "derived" metric would be one that can query/join across other metrics to give a cohesive view. In any case, I think this topic actually deserves an "expert group" to think through and propose a good working convention for OTEL here. It's worth collecting some metric experts, in addition to figuring out where Prometheus stands on this issue. cc @jmacd @reyang @gouthamve for some attention. |
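One way to picture such a derived metric is the sketch below. It is only an illustration of the idea, not a proposed convention: the function name `derived_up`, the rule of preferring the pull-based signal when it exists, and the 60-second heartbeat threshold are all assumptions made for the example.

```python
from typing import Optional


def derived_up(
    pull_up: Optional[int],
    heartbeat_ts: Optional[float],
    now: float,
    max_age_seconds: float = 60.0,
) -> int:
    """Combine a pull-based `up` value and a push-based heartbeat into one 0/1 signal.

    pull_up:      1 or 0 from a scrape, or None if the service is not scraped.
    heartbeat_ts: Unix time of the last pushed heartbeat, or None if absent.
    """
    if pull_up is not None:
        # A scrape result exists, so trust it directly.
        return 1 if pull_up == 1 else 0
    if heartbeat_ts is not None:
        # Push-only service: alive if the heartbeat is recent enough.
        return 1 if (now - heartbeat_ts) <= max_age_seconds else 0
    # No signal at all is treated as down.
    return 0


print(derived_up(pull_up=1, heartbeat_ts=None, now=100.0))   # 1
print(derived_up(pull_up=None, heartbeat_ts=90.0, now=100.0))  # 1
print(derived_up(pull_up=None, heartbeat_ts=10.0, now=100.0))  # 0
```

In a real deployment this join would more likely be expressed as a query or recording rule in the metrics backend than as application code.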
@tomas-mota what does "up" mean? A concrete scenario might help as I bet everyone will have their different version of understanding of "up". |
My requirement here is simply "I want a metric that I can reliably query in order to know whether or not the service is running". As previously explained, a simple up=1 metric does not work here because of the expiration time in the collector. |
I don't fully understand the ask here, maybe you only consider one single instance (do you have a service with multiple instances)? Anyways I think one could send the local timestamp as a gauge. |
Exactly, I was just trying to figure out if there would be an "otel" way of stating that the service is up. Using the timestamp is also totally fine for me. I just wonder, like @jsuereth, if there should be a more out-of-the-box way of doing this, so it is clear for other people. |
@jmacd is going to take this to the prometheus WG and attempt to make progress specifically on how to handle "up" metrics from OTLP into prometheus. |
Related OTEP: Related issues: #1078 (on topic, closed in favor of the OTEP above). |
Thanks for reminding me of that proposal. I still think the general shape of the proposal is right, if we had the right names attached to it :) |
Awesome, thanks for taking this on @jsuereth ! Should I close this issue then? |
Sorry if this is out of place here. I tried asking in Slack but did not get a satisfactory answer.
What are you trying to achieve?
I want to create an up metric for my services.
When using bare prometheus, due to the pull mechanism, it is easy to know that if the scrape failed, the service is down. With otel, since you set a metric-expiration (which is nice to have for other metrics), the service might no longer be sending the up metric but it is still reported as being there with a value of 1, so we won't know that the service is down until it expires.
Is there some way to achieve this that I'm overlooking? Something I considered as a workaround is having an uptime metric, or current time metric, that we would expect to keep increasing if the service is up.
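The problem described above can be reproduced with a toy model. The class below is not the collector's implementation; `ExpiringMetricCache`, the metric name `heartbeat_timestamp_seconds`, and the 300-second expiration are hypothetical and only mimic the idea of serving the last received value until it expires.

```python
class ExpiringMetricCache:
    """Toy stand-in for an exporter that serves the last value until it expires."""

    def __init__(self, expiration_seconds):
        self.expiration_seconds = expiration_seconds
        self._points = {}  # metric name -> (value, time it was received)

    def receive(self, name, value, now):
        """Record a pushed data point."""
        self._points[name] = (value, now)

    def scrape(self, now):
        """Return every metric whose last data point has not expired yet."""
        return {
            name: value
            for name, (value, received_at) in self._points.items()
            if now - received_at <= self.expiration_seconds
        }


cache = ExpiringMetricCache(expiration_seconds=300)

# The service pushes once at t=0 and then dies.
cache.receive("up", 1, now=0)
cache.receive("heartbeat_timestamp_seconds", 0, now=0)

at_120 = cache.scrape(now=120)
print(at_120["up"])  # 1: still looks healthy two minutes after the service died
print(120 - at_120["heartbeat_timestamp_seconds"])  # 120: the timestamp shows the data is old
print(cache.scrape(now=301))  # {}: only now does the constant metric disappear
```

The constant `up` value is indistinguishable from a live service until expiration, while the timestamp lets a query notice the gap as soon as it exceeds whatever threshold is chosen.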