panic using the load balancing exporter #31410
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Is this only happening with the k8s resolver? Can you try the DNS resolver instead and report back?
@kentquirk, is this something you could take a look at?
Looks related to open-telemetry/opentelemetry-go-contrib#4895.
I don't believe that's the issue here. From the attached logs it looks like the core dependency is at
#31050 potentially resolves this issue. Currently in
@grzn, is this something you started seeing in 0.94.0, or you haven't tried the loadbalancing exporter before?
This isn't new to v0.94.0 |
@crobert-1 I think the problem is a bit different here. The data is being sent to an exporter that was shut down. So it must be some desynchronisation between routing and tracking the list of active exporters.
#31456 should resolve the panic |
Nice! Were you able to reproduce this panic in a UT?
I wasn't, but it became pretty clear to me after looking at the code.
@grzn, if you have a test cluster where you can try the build from the branch, that would be great. I can help you to push the image if needed. It's just one command to build.
@dmitryax I have clusters to test this on, but I need a tagged image.
Maybe you can simulate this in UT by sending the traces to a dummy gRPC server that sleeps?
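A rough sketch of that idea in plain Go, using an in-process stand-in rather than a real gRPC server: a fake exporter sleeps in its consume path while the test shuts it down, which is the window where the reported panic occurs. The types, names, and signatures below are illustrative only, not the actual exporter or test code; without a guard that makes Shutdown wait for in-flight data, a test like this fails.

```go
package lbsketch

import (
	"errors"
	"sync"
	"testing"
	"time"
)

var errAlreadyShutDown = errors.New("consume called after shutdown")

// sleepyExporter stands in for a sub-exporter whose backend is slow to ack.
type sleepyExporter struct {
	mu       sync.Mutex
	shutDown bool
	delay    time.Duration
}

func (s *sleepyExporter) ConsumeTraces() error {
	time.Sleep(s.delay) // simulate a slow backend holding the request
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.shutDown {
		// In the real exporter this is where use-after-shutdown blows up.
		return errAlreadyShutDown
	}
	return nil
}

func (s *sleepyExporter) Shutdown() {
	s.mu.Lock()
	s.shutDown = true
	s.mu.Unlock()
}

// TestShutdownWhileConsuming reproduces the ordering problem: Shutdown runs
// while a consume call is still in flight, so the call observes a torn-down
// exporter and returns an error.
func TestShutdownWhileConsuming(t *testing.T) {
	exp := &sleepyExporter{delay: 100 * time.Millisecond}

	done := make(chan error, 1)
	go func() { done <- exp.ConsumeTraces() }() // data already in flight...

	time.Sleep(10 * time.Millisecond)
	exp.Shutdown() // ...while the resolver removes and shuts down the backend

	if err := <-done; err != nil {
		t.Fatalf("consume hit a shut-down exporter: %v", err)
	}
}
```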
Ok, I've built an amd64 Linux image from the branch and pushed it; I'll try to reproduce it in a test in the meantime.
@dmitryax I need both the arm64 and amd64 images; once you publish them I'll give it a try.
I ended up compiling from your branch; deploying it now.
Fix panic when a sub-exporter is shut down while still handling requests. This change wraps exporters with an additional working group to ensure that exporters are shut down only after they finish processing data. Fixes #31410. It has some small related refactoring changes. I can extract them in separate PRs if needed.
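A minimal sketch of the approach this change describes, in plain Go: each sub-exporter is wrapped so that in-flight consume calls are counted with a sync.WaitGroup, and Shutdown waits for them to drain before shutting the inner exporter down. The interface and type names are illustrative, not the actual collector or loadbalancing exporter APIs.

```go
package lbsketch

import (
	"context"
	"sync"
)

// traceExporter is a stand-in for a sub-exporter created per backend endpoint.
// It is an illustrative interface, not the real collector component API.
type traceExporter interface {
	ConsumeTraces(ctx context.Context, td any) error
	Shutdown(ctx context.Context) error
}

// wrappedExporter tracks in-flight consume calls so that Shutdown does not
// tear the inner exporter down while data is still being processed.
type wrappedExporter struct {
	traceExporter
	inFlight sync.WaitGroup
}

func newWrappedExporter(e traceExporter) *wrappedExporter {
	return &wrappedExporter{traceExporter: e}
}

// ConsumeTraces registers the call before delegating to the inner exporter.
func (w *wrappedExporter) ConsumeTraces(ctx context.Context, td any) error {
	w.inFlight.Add(1)
	defer w.inFlight.Done()
	return w.traceExporter.ConsumeTraces(ctx, td)
}

// Shutdown waits for all outstanding consume calls to finish, then shuts the
// inner exporter down, so the resolver can drop a backend without racing
// requests that already picked this exporter.
func (w *wrappedExporter) Shutdown(ctx context.Context) error {
	w.inFlight.Wait()
	return w.traceExporter.Shutdown(ctx)
}
```

One caveat of this pattern: if the router can still hand out the wrapped exporter after Shutdown has started waiting, Add may race with Wait, so the routing table also has to stop returning the exporter before it is shut down; that ordering concern is likely what the follow-up work referenced later in this thread is about.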
Okay so after restarting the deployment/collector, the daemonset/agent did not panic, but our backend pods show these errors:
the metrics show there are no backends
and the logs show
Going to roll back.
Can you confirm that these IPs are indeed collector instances behind your Kubernetes service named 10.0.47.151? Do you have more pods behind the service? If so, can you share metrics about them as well?
Missed your comment.
I see this is merged, I'll try the main branch again this week and report back.
The problem I reported on last week still happens. Scenario:
when I restart the deployment, some of the daemonset replicas go bad:
In this specific cluster, the deployment replica count is 5 and the daemonset replica count is 20; out of the 20 pods, 1 went bad. So right now the situation in
…elemetry#31456) Fix panic when a sub-exporter is shut down while still handling requests. This change wraps exporters with an additional working group to ensure that exporters are shut down only after they finish processing data. Fixes open-telemetry#31410. It has some small related refactoring changes. I can extract them in separate PRs if needed.
…wn (open-telemetry#31602) This resolves the issues seen in open-telemetry#31410 after merging open-telemetry#31456
Component(s)
exporter/loadbalancing
What happened?
Description
We are running v0.94.0 in a number of k8s clusters and are experiencing panics in the agent setup.
Steps to Reproduce
I don't have exact steps to reproduce, but this panic happens quite often across our clusters.
Expected Result
No panic
Actual Result
Panic :)
Collector version
v0.94.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
My guess is that the k8s resolver doesn't shut down exporters properly?
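To make that guess concrete, here is a heavily simplified sketch of how such a panic can arise with any resolver: the routing side hands out a pointer to a per-endpoint exporter, and a resolver update removes that endpoint and shuts the exporter down before the in-flight consume call has finished. None of this is the actual loadbalancing exporter code; the names are illustrative.

```go
package lbsketch

import (
	"context"
	"sync"
)

// endpointExporter stands in for the per-backend sub-exporter.
type endpointExporter struct {
	closed bool
}

func (e *endpointExporter) ConsumeTraces(ctx context.Context) error {
	if e.closed {
		// Conceptually where the reported panic happens: the exporter's
		// queue and connection have already been torn down.
		panic("consume after shutdown")
	}
	return nil
}

func (e *endpointExporter) Shutdown(ctx context.Context) { e.closed = true }

// ring is a stand-in for the endpoint -> exporter routing table.
type ring struct {
	mu        sync.RWMutex
	exporters map[string]*endpointExporter
}

// exporterFor returns the exporter for a routing key. The returned pointer
// can outlive its map entry, which is the desynchronisation between routing
// and the list of active exporters mentioned in the comments above.
func (r *ring) exporterFor(endpoint string) *endpointExporter {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.exporters[endpoint]
}

// onResolverUpdate removes an endpoint and shuts its exporter down without
// waiting for in-flight consume calls, so a goroutine still holding the
// pointer from exporterFor can hit a shut-down exporter.
func (r *ring) onResolverUpdate(removed string) {
	r.mu.Lock()
	exp := r.exporters[removed]
	delete(r.exporters, removed)
	r.mu.Unlock()
	if exp != nil {
		exp.Shutdown(context.Background())
	}
}
```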