panic using the load balancing exporter #31410
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Is this only happening with the k8s resolver? Can you try the DNS resolver instead and report back?
@kentquirk, is this something you could take a look at?
Looks related to open-telemetry/opentelemetry-go-contrib#4895.
I don't believe that's the issue here. From the attached logs it looks like the core dependency is at
#31050 potentially resolves this issue. Currently in
@grzn, is this something you started seeing in 0.94.0, or you haven't tried the loadbalancing exporter before?
This isn't new to v0.94.0 |
@crobert-1 I think the problem is a bit different here. The data is being sent to an exporter that was shut down. So it must be some desynchronisation between routing and tracking the list of active exporters.
#31456 should resolve the panic |
Nice! Were you able to reproduce this panic in a UT?
I wasn't, but it became pretty clear to me after looking at the code.
@grzn, if you have a test cluster where you can try the build from the branch, that would be great. I can help you to push the image if needed. It's just one command to build.
@dmitryax I have clusters to test this on, but I need a tagged image.
Maybe you can simulate this in UT by sending the traces to a dummy gRPC server that sleeps?
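A rough sketch of that idea in plain Go, using an in-process stand-in rather than a real gRPC server: a fake exporter sleeps in its consume path while the test shuts it down, which is the window where the reported panic occurs. The types, names, and signatures below are illustrative only, not the actual exporter or test code; without a guard that makes Shutdown wait for in-flight data, a test like this fails.

```go
package lbsketch

import (
	"errors"
	"sync"
	"testing"
	"time"
)

var errAlreadyShutDown = errors.New("consume called after shutdown")

// sleepyExporter stands in for a sub-exporter whose backend is slow to ack.
type sleepyExporter struct {
	mu       sync.Mutex
	shutDown bool
	delay    time.Duration
}

func (s *sleepyExporter) ConsumeTraces() error {
	time.Sleep(s.delay) // simulate a slow backend holding the request
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.shutDown {
		// In the real exporter this is where use-after-shutdown blows up.
		return errAlreadyShutDown
	}
	return nil
}

func (s *sleepyExporter) Shutdown() {
	s.mu.Lock()
	s.shutDown = true
	s.mu.Unlock()
}

// TestShutdownWhileConsuming reproduces the ordering problem: Shutdown runs
// while a consume call is still in flight, so the call observes a torn-down
// exporter and returns an error.
func TestShutdownWhileConsuming(t *testing.T) {
	exp := &sleepyExporter{delay: 100 * time.Millisecond}

	done := make(chan error, 1)
	go func() { done <- exp.ConsumeTraces() }() // data already in flight...

	time.Sleep(10 * time.Millisecond)
	exp.Shutdown() // ...while the resolver removes and shuts down the backend

	if err := <-done; err != nil {
		t.Fatalf("consume hit a shut-down exporter: %v", err)
	}
}
```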
Ok, I've built an amd64 Linux image from the branch and pushed it; I'll try to reproduce it in a test in the meantime.
@dmitryax I need both the arm64 and amd64 images; once you publish them I'll give it a try.
I ended up compiling from your branch; deploying it now.
Fix panic when a sub-exporter is shut down while still handling requests. This change wraps exporters with an additional working group to ensure that exporters are shut down only after they finish processing data. Fixes #31410. It has some small related refactoring changes. I can extract them in separate PRs if needed.
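A minimal sketch of the approach this change describes, in plain Go: each sub-exporter is wrapped so that in-flight consume calls are counted with a sync.WaitGroup, and Shutdown waits for them to drain before shutting the inner exporter down. The interface and type names are illustrative, not the actual collector or loadbalancing exporter APIs.

```go
package lbsketch

import (
	"context"
	"sync"
)

// traceExporter is a stand-in for a sub-exporter created per backend endpoint.
// It is an illustrative interface, not the real collector component API.
type traceExporter interface {
	ConsumeTraces(ctx context.Context, td any) error
	Shutdown(ctx context.Context) error
}

// wrappedExporter tracks in-flight consume calls so that Shutdown does not
// tear the inner exporter down while data is still being processed.
type wrappedExporter struct {
	traceExporter
	inFlight sync.WaitGroup
}

func newWrappedExporter(e traceExporter) *wrappedExporter {
	return &wrappedExporter{traceExporter: e}
}

// ConsumeTraces registers the call before delegating to the inner exporter.
func (w *wrappedExporter) ConsumeTraces(ctx context.Context, td any) error {
	w.inFlight.Add(1)
	defer w.inFlight.Done()
	return w.traceExporter.ConsumeTraces(ctx, td)
}

// Shutdown waits for all outstanding consume calls to finish, then shuts the
// inner exporter down, so the resolver can drop a backend without racing
// requests that already picked this exporter.
func (w *wrappedExporter) Shutdown(ctx context.Context) error {
	w.inFlight.Wait()
	return w.traceExporter.Shutdown(ctx)
}
```

One caveat of this pattern: if the router can still hand out the wrapped exporter after Shutdown has started waiting, Add may race with Wait, so the routing table also has to stop returning the exporter before it is shut down; that ordering concern is likely what the follow-up work referenced later in this thread is about.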
Okay so after restarting the deployment/collector, the daemonset/agent did not panic, but our backend pods show these errors:
the metrics show there are no backends
and the logs show
Going to roll back.
Can you confirm that these IPs are indeed collector instances behind your Kubernetes service named 10.0.47.151? Do you have more pods behind the service? If so, can you share metrics about them as well?
Missed your comment.
I see this is merged, I'll try the main branch again this week and report back.
The problem I reported on last week still happens. Scenario:
when I restart the deployment, some of the daemonset replicas go bad:
In this specific cluster, the deployment replica count is 5 and the daemonset replica count is 20; out of the 20 pods, 1 went bad. So right now the situation in
…elemetry#31456) Fix panic when a sub-exporter is shut down while still handling requests. This change wraps exporters with an additional working group to ensure that exporters are shut down only after they finish processing data. Fixes open-telemetry#31410. It has some small related refactoring changes. I can extract them in separate PRs if needed.
…wn (open-telemetry#31602) This resolves the issues seen in open-telemetry#31410 after merging open-telemetry#31456
Component(s)
exporter/loadbalancing
What happened?
Description
We are running v0.94.0 in a number of k8s clusters and are experiencing panics in the agent setup.
Steps to Reproduce
I don't have exact steps to reproduce, but this panic happens quite often across our clusters.
Expected Result
No panic
Actual Result
Panic :)
Collector version
v0.94.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
My guess is that the k8s resolver doesn't shut down exporters properly?
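To make that guess concrete, here is a heavily simplified sketch of how such a panic can arise with any resolver: the routing side hands out a pointer to a per-endpoint exporter, and a resolver update removes that endpoint and shuts the exporter down before the in-flight consume call has finished. None of this is the actual loadbalancing exporter code; the names are illustrative.

```go
package lbsketch

import (
	"context"
	"sync"
)

// endpointExporter stands in for the per-backend sub-exporter.
type endpointExporter struct {
	closed bool
}

func (e *endpointExporter) ConsumeTraces(ctx context.Context) error {
	if e.closed {
		// Conceptually where the reported panic happens: the exporter's
		// queue and connection have already been torn down.
		panic("consume after shutdown")
	}
	return nil
}

func (e *endpointExporter) Shutdown(ctx context.Context) { e.closed = true }

// ring is a stand-in for the endpoint -> exporter routing table.
type ring struct {
	mu        sync.RWMutex
	exporters map[string]*endpointExporter
}

// exporterFor returns the exporter for a routing key. The returned pointer
// can outlive its map entry, which is the desynchronisation between routing
// and the list of active exporters mentioned in the comments above.
func (r *ring) exporterFor(endpoint string) *endpointExporter {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.exporters[endpoint]
}

// onResolverUpdate removes an endpoint and shuts its exporter down without
// waiting for in-flight consume calls, so a goroutine still holding the
// pointer from exporterFor can hit a shut-down exporter.
func (r *ring) onResolverUpdate(removed string) {
	r.mu.Lock()
	exp := r.exporters[removed]
	delete(r.exporters, removed)
	r.mu.Unlock()
	if exp != nil {
		exp.Shutdown(context.Background())
	}
}
```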