# Reporting fatal error during shutdown process (even from async goroutine) deadlocks #9824
MovieStoreGuy pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue on May 22, 2024:
**Description:**

The kafka receiver's shutdown method cancels the context of a running sub-goroutine. However, a small bug caused a fatal error to be reported during shutdown when this expected condition was hit. The fatal error being reported during shutdown was in turn triggering another bug, open-telemetry/opentelemetry-collector#9824. This fix means that shutdown won't be blocked in expected shutdown conditions, but the `core` bug referenced above means shutdown will still be blocked in unexpected error situations. The fix is taken from a comment made by @Dennis8274 on the issue.

**Link to tracking Issue:** Fixes #30789

**Testing:** Stepped through `TestTracesReceiverStart` in a debugger before the change and saw the fatal status being reported; it was no longer reported after applying the fix. Also manually ran the collector with a kafka receiver: before the fix it was indeed blocked on a normal shutdown, and after the fix it shut down as expected.
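A quick way to surface the blocked-shutdown behavior described in the testing notes is to race `Shutdown()` against a deadline. The sketch below is hypothetical: the helper name and signature are mine, not part of the collector's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownWithTimeout runs a component's Shutdown in a goroutine and
// reports an error if it does not return within d, which is how the
// blocked shutdown described above would show up.
func shutdownWithTimeout(shutdown func(context.Context) error, d time.Duration) error {
	done := make(chan error, 1)
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), d)
		defer cancel()
		done <- shutdown(ctx)
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errors.New("shutdown did not return in time: possible deadlock")
	}
}

func main() {
	// A Shutdown that blocks forever, standing in for the pre-fix receiver.
	blocked := func(ctx context.Context) error { select {} }
	fmt.Println(shutdownWithTimeout(blocked, 100*time.Millisecond))
}
```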
djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue on Oct 22, 2024:
#### Description

Fixes an issue where the Kafka receiver would block on shutdown. There was an earlier fix for this issue [here](#32720), but it was only applied to the traces receiver, not the logs or metrics receivers. The issue is this goroutine in the `Start()` functions for logs and metrics:

```go
go func() {
	if err := c.consumeLoop(ctx, metricsConsumerGroup); err != nil {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

The `consumeLoop()` function returns a `context.Canceled` error when `Shutdown()` is called, which is expected. However, `componentstatus.ReportStatus()` blocks while attempting to report this error. The underlying bug is described [here](open-telemetry/opentelemetry-collector#9824). The previously mentioned PR fixed this for the traces receiver by checking whether the error returned by `consumeLoop()` is `context.Canceled`:

```go
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); !errors.Is(err, context.Canceled) {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

For reference, this is `consumeLoop()` for the traces receiver, with the logs and metrics versions being identical:

```go
func (c *kafkaTracesConsumer) consumeLoop(ctx context.Context, handler sarama.ConsumerGroupHandler) error {
	for {
		// `Consume` should be called inside an infinite loop; when a
		// server-side rebalance happens, the consumer session will need to be
		// recreated to get the new claims
		if err := c.consumerGroup.Consume(ctx, c.topics, handler); err != nil {
			c.settings.Logger.Error("Error from consumer", zap.Error(err))
		}
		// check if context was cancelled, signaling that the consumer should stop
		if ctx.Err() != nil {
			c.settings.Logger.Info("Consumer stopped", zap.Error(ctx.Err()))
			return ctx.Err()
		}
	}
}
```

That fix does work, but the only error `consumeLoop()` can return is a canceled context. When we create the context and cancel function, we use `context.Background()`:

```go
ctx, cancel := context.WithCancel(context.Background())
```

This context is only used by `consumeLoop()`, and the cancel function is only called in `Shutdown()`. Because `consumeLoop()` can only return a `context.Canceled` error, this PR removes this unused code for the logs, metrics, and traces receivers. Instead, `consumeLoop()` still logs the `context.Canceled` error but does not return any error, and the goroutine simply calls `consumeLoop()`.

Additional motivation for removing the call to `componentstatus.ReportStatus()` is that the underlying function it calls, `componentstatus.Report()`, documents that it does not need to be called during `Shutdown()` or `Start()`, as the service already does so for the given component ([comment here](https://github.com/open-telemetry/opentelemetry-collector/blob/main/component/componentstatus/status.go#L21-L25)). Even if there weren't a bug causing this call to block, the component still shouldn't make it, since it would only ever be made during `Shutdown()`.

#### Link to tracking issue

Fixes #30789

#### Testing

Tested in a build of the collector with these changes, scraping logs from a Kafka instance. When the collector is stopped and `Shutdown()` gets called, the receiver did not block and the collector stopped gracefully as expected.
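Based on that description, the post-change shape of the receiver would look roughly like the sketch below. This is an assumed reconstruction of what the commit message describes, not the exact merged diff:

```go
// Sketch (assumed shape, not the exact merged code): consumeLoop no longer
// returns an error. It logs the expected context.Canceled itself, so the
// Start() goroutine has nothing left to report.
func (c *kafkaTracesConsumer) consumeLoop(ctx context.Context, handler sarama.ConsumerGroupHandler) {
	for {
		// `Consume` should be called inside an infinite loop; on a
		// server-side rebalance the session must be recreated.
		if err := c.consumerGroup.Consume(ctx, c.topics, handler); err != nil {
			c.settings.Logger.Error("Error from consumer", zap.Error(err))
		}
		// A canceled context signals that Shutdown() was called.
		if ctx.Err() != nil {
			c.settings.Logger.Info("Consumer stopped", zap.Error(ctx.Err()))
			return
		}
	}
}
```

With that signature, the `Start()` goroutine reduces to `go c.consumeLoop(ctx, handler)`, and no code path remains that calls `componentstatus.ReportStatus()` during shutdown.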
The problem is that during shutdown the otelcol no longer looks for updates on the `asyncErrorChannel`, and any component that tries to write to that channel will deadlock.
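For illustration, here is a self-contained sketch of that failure mode with no Collector dependencies; the channel and error values are made up, but the blocking send is the same mechanism:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// Stand-in for the service's asyncErrorChannel: unbuffered, so every
	// send blocks until the event loop on the other side receives it.
	asyncErrorChannel := make(chan error)

	// While the service is running, its event loop drains the channel.
	go func() {
		fmt.Println("event loop received:", <-asyncErrorChannel)
	}()
	asyncErrorChannel <- errors.New("error during normal operation") // delivered

	// During shutdown the event loop has exited and nothing reads the
	// channel, so a component reporting a fatal error blocks forever.
	select {
	case asyncErrorChannel <- errors.New("fatal error during shutdown"):
		fmt.Println("delivered (would mean no deadlock)")
	case <-time.After(time.Second):
		fmt.Println("send blocked: the deadlock this issue describes")
	}
}
```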