healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing #11780
Comments
Pinging @jpkrohling as code owner.
I dug deeper into this, and here are the issues I've found:
I'm implementing health checks in my own code for my exporter, and here is what I plan to try:
I'll implement this, and then comment on this bug about whether it was successful.
Here is the code that works for my use case: exporter.go.
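The linked exporter.go is not reproduced in this thread. Purely as an illustration of the approach described above, a minimal sketch of an exporter-side health endpoint that tracks the last export error might look like the following; the type name, route, and port are hypothetical, not taken from the linked code:

```go
package main

import (
	"net/http"
	"sync"
)

// healthState (hypothetical name) remembers the outcome of the most
// recent export attempt so a health endpoint can report on it.
type healthState struct {
	mu      sync.Mutex
	lastErr error
}

// RecordExport would be called by the exporter after every send attempt.
func (h *healthState) RecordExport(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastErr = err
}

// ServeHTTP returns 500 while the most recent export attempt failed,
// and 200 otherwise.
func (h *healthState) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.lastErr != nil {
		http.Error(w, h.lastErr.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	hs := &healthState{}
	// The exporter's send path would call hs.RecordExport(err).
	http.Handle("/healthz", hs) // route and port are illustrative
	http.ListenAndServe(":13133", nil)
}
```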
Thanks for this investigation and sorry for taking so long to take a look at this.
This is likely because we still haven't made our internal observability ready for wider consumption by other components. But I think this specific characteristic should indeed be documented somewhere. Where do you think you would have found it more naturally?
I think they could/should share the same metric. If it's failing for metrics or traces, I would want to see that as part of the same metric as an admin. Other labels perhaps, but the same metric.
@skyduo would be the best person to comment on the choice of methods here, but this is used to invalidate old data, right? Wouldn't it be more suitable to calculate the failure rate instead? Perhaps by storing the state of the last events, setting the health check based on their success rate. Both parameters (# of last events to store, success rate) can have reasonable defaults and be configurable.
In any case, if just switching from Start to End would provide a good solution for the short term, I'm all for it.
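To make the rolling success-rate idea concrete, here is a hedged sketch (not the extension's actual code): keep the outcomes of the last N export attempts in a ring buffer and report healthy while the success rate stays at or above a configurable threshold. All names and parameters are made up for illustration:

```go
package health

import "sync"

// successRateCheck (hypothetical) keeps the last len(results) export
// outcomes and reports healthy while the success rate is >= minRate.
type successRateCheck struct {
	mu      sync.Mutex
	results []bool // ring buffer of recent outcomes, true = success
	next    int
	filled  bool
	minRate float64
}

func newSuccessRateCheck(size int, minRate float64) *successRateCheck {
	return &successRateCheck{results: make([]bool, size), minRate: minRate}
}

// Record stores the outcome of one export attempt.
func (c *successRateCheck) Record(ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.results[c.next] = ok
	c.next = (c.next + 1) % len(c.results)
	if c.next == 0 {
		c.filled = true
	}
}

// Healthy reports true until enough data exists, then applies the rate.
func (c *successRateCheck) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := c.next
	if c.filled {
		n = len(c.results)
	}
	if n == 0 {
		return true
	}
	succ := 0
	for i := 0; i < n; i++ {
		if c.results[i] {
			succ++
		}
	}
	return float64(succ)/float64(n) >= c.minRate
}
```

With this shape, `newSuccessRateCheck(100, 0.9)` would express defaults of "consider the last 100 events, healthy at a 90% success rate", both of which could be exposed as config options.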
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md would be good. Something like "It only supports monitoring exporter failures, and only exporters which use
Yeah, I agree that would make more sense. I'll leave it up to you and/or
That's true. However,
@open-telemetry/collector-approvers, what do you think about this? Should all signals share a single metric to report failure to send data? We can still have one label per signal, but keeping the same metric would make more sense to me.
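As a hedged sketch of what "one metric, one label per signal" could look like (not the collector's actual instrumentation; the metric name is hypothetical, and this uses the current go.opentelemetry.io/otel metric API rather than whatever the collector used at v0.54.0):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailure bumps a single shared counter, distinguishing
// signals by an attribute instead of by separate per-signal metrics.
func recordSendFailure(ctx context.Context, signal string) error {
	meter := otel.Meter("exporterhelper")
	// "exporter.send_failed" is a hypothetical metric name.
	counter, err := meter.Int64Counter("exporter.send_failed")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("signal", signal), // "traces", "metrics", or "logs"
	))
	return nil
}
```

An admin can then alert on the single counter and break it down by the `signal` attribute when needed.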
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments. Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
The current implementation of
Describe the bug
I'm trying to use the "check_collector_pipeline" feature of the healthcheckextension. However, the health check always returns status code 200: even when my exporter is failing, the health check is still OK.

The "Additional context" section below breaks down the technical cause of this behavior.
Steps to reproduce
I built a simple OTel collector:
Then I build a Docker image out of that agent (not showing the commands here, because they are specific to my setup). Then I run that image, `docker attach` to it, and issue `curl` requests to the health endpoint (shown below).

What did you expect to see?
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).
What did you see instead?
The health check returns status code 500 only for a limited time after startup. After that, it always returns status code 200: https://pastebin.com/uR5X6DAr.
What version did you use?
v0.54.0
What config did you use?
agent-conf.yaml
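The linked config file's contents aren't reproduced here. As a rough sketch of the kind of configuration involved, with `check_collector_pipeline` enabled per the extension's README, it might look like the following (the receiver, the deliberately failing exporter, and all endpoints are placeholders, not taken from the actual agent-conf.yaml):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    check_collector_pipeline:
      enabled: true
      interval: 5m
      exporter_failure_threshold: 5

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlphttp:
    # Deliberately unreachable endpoint so the exporter keeps failing.
    endpoint: http://unreachable.invalid:4318

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```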
Environment
OS: Debian GNU/Linux rodete, kernel version 5.17.11-1rodete2-amd64
Compiler (if manually compiled): go 1.18.3
Additional context
I investigated the problem and found its technical cause. However, I'm not sure how to fix it properly.
I added more logging to the healthcheckextension: healthcheckextension.go, exporter.go.
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.
Notice that the HealthCheckExtension's internal queue `exporterFailureQueue` grows at first (log message `HC queue size: 0`, then 1, 2, ..., 6). Then it becomes empty (size 0) or holds a single element (log message `exiting rotate function, queue size is 1`).

The extension decides whether the exporter is healthy based on the size of this queue. Since the queue holds at most one element the whole time, the health check returns 200 OK the whole time.
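To make the mechanism concrete, here is a simplified sketch of that rotation logic as I understand it from the extension's exporter.go (a paraphrase, not the actual source; the real queue stores OpenCensus view data):

```go
package healthcheck

import (
	"container/list"
	"time"
)

// viewData stands in for the OpenCensus ViewData stored in the real
// exporterFailureQueue; only the fields relevant here are shown.
type viewData struct {
	Start time.Time // in practice: the collector's start time (the bug)
	End   time.Time // when the failure was recorded
}

type checker struct {
	exporterFailureQueue *list.List
	interval             time.Duration // check_collector_pipeline::interval
	failureThreshold     int           // exporter_failure_threshold
}

// rotate drops queue entries whose Start is older than interval.
// Because Start is the process start time, every entry counts as "old"
// once the collector has been up longer than interval, so the queue is
// drained almost immediately after each insert.
func (c *checker) rotate(now time.Time) {
	for e := c.exporterFailureQueue.Front(); e != nil; {
		next := e.Next()
		if now.Sub(e.Value.(*viewData).Start) > c.interval {
			c.exporterFailureQueue.Remove(e)
		}
		e = next
	}
}

// healthy returns true while recent failures stay below the threshold.
func (c *checker) healthy(now time.Time) bool {
	c.rotate(now)
	return c.exporterFailureQueue.Len() < c.failureThreshold
}
```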
The exporter itself was constantly failing (log messages containing `Exporting failed.`), but the queue was not growing. Why? Take, for example, the failure at `2022-06-28T14:35:48.345Z error exporterhelper/queued_retry.go:149 Exporting failed.`. One element was added to the `exporterFailureQueue`, with the following timestamps:

Then the extension compared `Start` with the current time and removed the element from the `exporterFailureQueue` (the current time at that point was `2022-06-28 14:35:48.344996512 +0000`).
Overall, notice that the `Start` of every element of `exporterFailureQueue` is constant: `14:33:18`. It corresponds to the start time of the collector.

However, exporter.go compares `Start` with the current time as if `Start` were the time of the exporter failure. Thus, as long as the collector was started within `health_check::check_collector_pipeline::interval` of the current time, the health check works as expected. Once that `interval` has passed since startup, the health check always returns 200 OK.

Solutions that I can imagine (see the sketch after this list):
- Compare `vd.End` instead of `vd.Start` with the current time here.
- Fix the `Start` time in obsreportconfig.go.
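A hedged sketch of the first option, expressed against the simplified `checker` sketched earlier rather than as a patch to the real file:

```go
// rotate, with the proposed fix: age entries by End (the time the
// failure was recorded) rather than Start (the collector start time).
func (c *checker) rotate(now time.Time) {
	for e := c.exporterFailureQueue.Front(); e != nil; {
		next := e.Next()
		if now.Sub(e.Value.(*viewData).End) > c.interval { // was vd.Start
			c.exporterFailureQueue.Remove(e)
		}
		e = next
	}
}
```

With `End` as the reference point, each failure remains in the queue for `interval` after it occurs, so a constantly failing exporter keeps the queue at or above the threshold and the health endpoint keeps returning 500.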