
Kafka receiver stuck while shutting down at v0.93.0 #30789

Closed
james-ryans opened this issue Jan 26, 2024 · 19 comments · Fixed by #32720 or #35767
Labels
bug (Something isn't working), priority:p1 (High), receiver/kafka

Comments

@james-ryans
Contributor

james-ryans commented Jan 26, 2024

Component(s)

receiver/kafka

What happened?

Description

Shutting down the Kafka receiver gets stuck forever while transitioning from StatusStopping to StatusStopped.

I've debugged this for a while, and it appears to happen because consumeLoop returns a context canceled error, which triggers a ReportStatus call with a FatalErrorEvent at receiver/kafkareceiver/kafka_receiver.go, lines 163-165. The sync.Mutex acquired by that ReportStatus call is never released (I don't know why), so the subsequent ReportStatus for StatusStopped blocks forever while trying to acquire the mutex.

I also tried rolling back to v0.92.0, and it works fine there. I traced the issue down to receiver/kafkareceiver/kafka_receiver.go line 164, c.settings.ReportStatus(component.NewFatalErrorEvent(err)), which was changed in PR #30593 (see the sketch below).
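
For reference, the surrounding goroutine looks roughly like this (a sketch based on the line quoted above and the snippets in the fix PRs later in this thread; the exact v0.93.0 code may differ slightly):

```go
// Start() runs consumeLoop in a goroutine; Shutdown() cancels ctx, so
// consumeLoop returns context.Canceled, which is then reported as fatal.
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); err != nil {
		c.settings.ReportStatus(component.NewFatalErrorEvent(err))
	}
}()
```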

Steps to Reproduce

Build a collector with the kafkareceiver factory included, add a kafka entry under receivers in the config (see below), run it, and then shut it down (e.g. with Ctrl-C).

Expected Result

The collector should shut down properly.

Actual Result

The collector gets stuck indefinitely while shutting down, emitting the logs below.

Collector version

v0.93.0

Environment information

No response

OpenTelemetry Collector configuration

service:
  pipelines:
    traces:
      receivers: [kafka]

receivers:
  kafka:
    brokers:
      - localhost:9092
    encoding: otlp_proto # available encodings are otlp_proto, jaeger_proto, jaeger_json, zipkin_proto, zipkin_json, zipkin_thrift
    initial_offset: earliest # consume messages from the beginning

Log output

2024-01-26T08:32:45.266+0700	info	kafkareceiver@v0.93.0/kafka_receiver.go:431	Starting consumer group	{"kind": "receiver", "name": "kafka", "data_type": "traces", "partition": 0}
^C2024-01-26T08:32:53.626+0700	info	otelcol@v0.93.0/collector.go:258	Received signal from OS	{"signal": "interrupt"}
2024-01-26T08:32:53.626+0700	info	service@v0.93.0/service.go:179	Starting shutdown...
2024-01-26T08:32:54.010+0700	info	kafkareceiver@v0.93.0/kafka_receiver.go:181	Consumer stopped	{"kind": "receiver", "name": "kafka", "data_type": "traces", "error": "context canceled"}

Additional context

No response

@james-ryans james-ryans added bug (Something isn't working) and needs triage (New item requiring triage) labels Jan 26, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

@mwear: Do you have any thoughts on why a component may never be able to get the lock when reporting status? It looks like this may be related to the work you've been doing on component status reporting.

Possible related PR: open-telemetry/opentelemetry-collector#8836

@mwear
Member

mwear commented Jan 26, 2024

Based on the research @james-ryans did, this came in after this change: #30610. What I suspect is happening is that writing the fatal error to the asyncErrorChannel in serviceHost is blocking, so that ReportStatus never returns (and never releases its lock). Here is the suspect line: https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/host.go#L73.

I think this is a variation of this existing problem: open-telemetry/opentelemetry-collector#8116, which is also assigned to me. It has been on my todo list. I'll look into it.
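
To make the suspected deadlock concrete, here is a minimal, self-contained illustration (editor's sketch, not collector code; all names are invented, with the channel playing the role of the asyncErrorChannel mentioned above): a reporter that holds a mutex while sending on an unbuffered error channel blocks forever once nothing drains the channel, and every later status report then hangs on the mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// reporter mimics the suspected interaction: status reports share a mutex,
// and fatal errors are forwarded over an unbuffered channel with no reader.
type reporter struct {
	mu       sync.Mutex
	asyncErr chan error
}

func (r *reporter) reportFatal(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.asyncErr <- err // blocks forever if nothing is draining the channel
}

func (r *reporter) reportStopped() {
	r.mu.Lock() // can never be acquired while reportFatal is stuck above
	defer r.mu.Unlock()
	fmt.Println("status: stopped")
}

func main() {
	r := &reporter{asyncErr: make(chan error)} // unbuffered, no reader during shutdown
	go r.reportFatal(errors.New("context canceled"))
	time.Sleep(100 * time.Millisecond) // let the goroutine grab the mutex and block on the send
	fmt.Println("reporting StatusStopped; this hangs...")
	// In this standalone demo Go's runtime reports "all goroutines are asleep - deadlock!";
	// in the collector, other goroutines keep running, so shutdown just hangs silently.
	r.reportStopped()
}
```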

@crobert-1
Member

Thanks @mwear, appreciate your insight here!

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Mar 27, 2024
@crobert-1 crobert-1 removed the Stale label Mar 27, 2024
@atoulme
Contributor

atoulme commented Mar 30, 2024

This is open-telemetry/opentelemetry-collector#9824.

@atoulme atoulme removed the needs triage (New item requiring triage) label Mar 30, 2024
@lahsivjar
Member

> This is open-telemetry/opentelemetry-collector#9824.

If I am not mistaken, this issue should happen for all receivers. Here is an example of a flaky test in collector-contrib caused by the same issue in the opencensus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8742859512/job/23992117763. The test output includes a goroutine dump that points to the same asyncErrorChannel problem described in the linked issue.

@crobert-1
Member

> This is open-telemetry/opentelemetry-collector#9824.

> Here is an example of a flaky test in collector-contrib caused by the same issue in the opencensus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8742859512/job/23992117763. The test output includes a goroutine dump that points to the same asyncErrorChannel problem described in the linked issue.

Adding a reference to the issue for the flaky test: #27295

@crobert-1
Member

+1 freq: #32667

@Dennis8274

Quick fix as follows?
[screenshot of a proposed code change]
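
The screenshot isn't reproduced here, but based on the fix later merged in #32720 (quoted further down in this thread), the suggested change is presumably along these lines (a sketch using the API names quoted earlier in this issue): skip the fatal status report when consumeLoop returns the context.Canceled error that a normal Shutdown() produces.

```go
// Only report a fatal status for unexpected errors; context.Canceled is the
// expected result of Shutdown() canceling the consume loop's context.
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); !errors.Is(err, context.Canceled) {
		c.settings.ReportStatus(component.NewFatalErrorEvent(err))
	}
}()
```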

@crobert-1
Member

Your fix worked @Dennis8274, thanks for the suggestion! I've posted a PR to resolve this issue. 👍

MovieStoreGuy pushed a commit that referenced this issue May 22, 2024
**Description:**
The kafka receiver's shutdown method works by canceling the context of a
running sub-goroutine. However, a small bug caused a fatal error to be
reported during shutdown when this expected condition was hit, and that
fatal error in turn triggered another bug,
open-telemetry/opentelemetry-collector#9824.

This fix means that shutdown won't be blocked in expected shutdown
conditions, but the `core` bug referenced above means shutdown will
still be blocked in unexpected error situations.

This fix is being taken from a comment made by @Dennis8274 on the issue.

**Link to tracking Issue:**
Fixes #30789

**Testing:**
Stepped through `TestTracesReceiverStart` in a debugger before the
change to see the fatal status being reported. It was no longer reported
after applying the fix. Manually tested running the collector with a
kafka receiver and confirmed that before the fix it was indeed blocked
on a normal shutdown, but after the fix it shut down as expected.
@tejas-contentstack

I'm facing a similar issue while shutting down the Kafka receiver. If we try to shut down the collector, it stops the consumer but then starts the consumer again right after. See the logs in the screenshot below.

Collector version: v0.101.0

Here's the Otel config file:

receivers:

  kafka/metrics:
    brokers:  ["0.0.0.0:9092", "localhost:9092", "localhost:9092"]
    group_id: otel-metrics-consumer
    topic: topic-otel-metrics
    header_extraction:
      extract_headers: true
      headers:
        - uid
    metadata:
      full: true
      
exporters:
  exporter:


service:
  telemetry:
    logs:
      output_paths: ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      level: none
  extensions: []
  pipelines: 
    metrics:
      receivers: [kafka/metrics]
      processors: []
      exporters: [exporter]
[screenshot: logs showing the consumer stopping on shutdown and then starting again]

@crobert-1
Member

Hello @tejas-contentstack, thanks for adding frequency! In this case it may be best to open a new issue and reference this one with it, since the error message is slightly different. I agree it looks like it may be a similar problem, but it would be best to investigate fully to make sure we're not missing anything 👍

@tbm48813

Hello, I am seeing the same behavior in version 0.109.0. The issue is that the receiver never shuts down; it only hangs while trying. The relevant log entry is:
{"level":"info","ts":"2024-09-26T11:27:48.191-0400","caller":"kafkareceiver@v0.108.0/kafka_receiver.go:388","msg":"Consumer stopped","kind":"receiver","name":"kafka/kafkastream__logs","data_type":"logs","error":"context canceled"}
The service needs to be manually stopped each time, as it hangs here.

@djaglowski
Member

It looks like this might be a very slightly different problem but also may be an easy fix (See #35438). I'm just going to reopen this issue and resolve it again with the new PR if that works.

@djaglowski djaglowski reopened this Sep 26, 2024
@djaglowski
Member

My quick fix was wishful thinking. Still, given that we have two reports of the receiver not shutting down correctly even after #32720, I think we might as well leave the issue open until we have a robust solution.

@jsirianni
Member

I was seeing this as well. The receiver is working great until it is time to shut down.

@dpaasman00
Contributor

Working on a fix for this!

@dpaasman00
Contributor

Above PR should resolve this issue!

djaglowski pushed a commit that referenced this issue Oct 22, 2024
#### Description
Fixes an issue where the Kafka receiver would block on shutdown.

There was an earlier fix for this issue
[here](#32720).
That fix does solve the issue, but it was only applied to the traces
receiver, not the logs or metrics receivers.

The issue is this go routine in the `Start()` functions for logs and
metrics:
```go
go func() {
	if err := c.consumeLoop(ctx, metricsConsumerGroup); err != nil {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

The `consumeLoop()` function returns a `context.Canceled` error when
`Shutdown()` is called, which is expected. However,
`componentstatus.ReportStatus()` blocks while attempting to report this
error. The reason/bug for this can be found
[here](open-telemetry/opentelemetry-collector#9824).

The previously mentioned PR fixed this for the traces receiver by
checking if the error returned by `consumeLoop()` is `context.Canceled`:
```go
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); !errors.Is(err, context.Canceled) {
		componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
	}
}()
```

Additionally, this is `consumeLoop()` for the traces receiver, with the
logs and metrics versions being identical:
```go
func (c *kafkaTracesConsumer) consumeLoop(ctx context.Context, handler sarama.ConsumerGroupHandler) error {
	for {
		// `Consume` should be called inside an infinite loop, when a
		// server-side rebalance happens, the consumer session will need to be
		// recreated to get the new claims
		if err := c.consumerGroup.Consume(ctx, c.topics, handler); err != nil {
			c.settings.Logger.Error("Error from consumer", zap.Error(err))
		}
		// check if context was cancelled, signaling that the consumer should stop
		if ctx.Err() != nil {
			c.settings.Logger.Info("Consumer stopped", zap.Error(ctx.Err()))
			return ctx.Err()
		}
	}
}
```

This does fix the issue; however, the only error that `consumeLoop()`
can return is a canceled context. When we create the context and
cancel function, we use `context.Background()`:
```go
ctx, cancel := context.WithCancel(context.Background())
```
This context is only used by `consumeLoop()` and the cancel function is
only called in `Shutdown()`.

Because `consumeLoop()` can only ever return a `context.Canceled` error,
this PR removes that unused error handling for the logs, metrics, and
traces receivers. Instead, `consumeLoop()` still logs the
`context.Canceled` error, but it no longer returns an error, and the
goroutine simply calls `consumeLoop()` (see the sketch below).
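
A sketch of the resulting shape, per the description above (illustrative, not a verbatim copy of the merged change; identifiers follow the earlier snippets):

```go
// consumeLoop no longer returns an error, so Start() can launch it with
// `go func() { c.consumeLoop(ctx, consumerGroup) }()` and there is nothing
// left to hand to componentstatus.ReportStatus.
func (c *kafkaTracesConsumer) consumeLoop(ctx context.Context, handler sarama.ConsumerGroupHandler) {
	for {
		// `Consume` should be called inside an infinite loop; on a server-side
		// rebalance the session is recreated to pick up the new claims.
		if err := c.consumerGroup.Consume(ctx, c.topics, handler); err != nil {
			c.settings.Logger.Error("Error from consumer", zap.Error(err))
		}
		// Shutdown() canceled the context: log it and exit without returning an error.
		if ctx.Err() != nil {
			c.settings.Logger.Info("Consumer stopped", zap.Error(ctx.Err()))
			return
		}
	}
}
```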

Additional motivation for removing the call to
`componentstatus.ReportStatus()`: the documentation of the underlying
function, `componentstatus.Report()`, says it does not need to be called
during `Shutdown()` or `Start()` because the service already reports
status for the given component ([comment
here](https://github.com/open-telemetry/opentelemetry-collector/blob/main/component/componentstatus/status.go#L21-L25)).
Even if there weren't a bug causing this call to block, the component
still shouldn't make the call, since it would only ever happen during
`Shutdown()`.

#### Link to tracking issue
Fixes #30789

#### Testing
Tested a build of the collector with these changes, scraping logs from
a Kafka instance. When the collector was stopped and `Shutdown()` was
called, the receiver did not block and the collector stopped gracefully
as expected.