
OpenTelemetry Collector memory leak / non-optimized GC #26087

Closed
yehaotian opened this issue Aug 24, 2023 · 9 comments

@yehaotian

Component(s)

exporter/loadbalancing

What happened?

Description

We observe a memory leak in the OpenTelemetry Collector load-balancing layer, which triggers OOM kills and Nomad job restarts.
Memory metrics for the high-load region:

[Screenshot: memory metrics, high-load region]

pprof top-10 results:
--inuse_space:

      flat  flat%   sum%        cum   cum%
 2198.65MB 60.34% 60.34%  2198.65MB 60.34%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
  972.33MB 26.68% 87.02%   972.33MB 26.68%  go.uber.org/zap/zapcore.newCounters
  142.02MB  3.90% 90.92%   142.02MB  3.90%  bufio.NewReaderSize (inline)
   75.63MB  2.08% 92.99%    75.63MB  2.08%  bytes.growSlice
   41.43MB  1.14% 94.13%    51.56MB  1.41%  compress/flate.NewWriter
   38.90MB  1.07% 95.20%    38.90MB  1.07%  go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewBoundedMemoryQueue
   22.09MB  0.61% 95.80%    22.09MB  0.61%  golang.org/x/net/http2/hpack.(*headerFieldTable).addEntry
   20.76MB  0.57% 96.37%    20.76MB  0.57%  golang.org/x/net/http2.(*Framer).startWriteDataPadded
       2MB 0.055% 96.43%  2258.75MB 61.98%  google.golang.org/grpc/internal/transport.newHTTP2Client
    1.50MB 0.041% 96.47%    77.13MB  2.12%  bytes.(*Buffer).grow

--inuse_objects:

      flat  flat%   sum%        cum   cum%
    131074 15.39% 15.39%     163842 19.24%  google.golang.org/grpc/internal/grpcutil.EncodeDuration
     40345  4.74% 20.13%      40345  4.74%  runtime.malg
     38232  4.49% 24.61%      38232  4.49%  google.golang.org/grpc/internal/transport.(*controlBuffer).get
     32774  3.85% 28.46%      32774  3.85%  strconv.formatBits
     32769  3.85% 32.31%      32769  3.85%  fmt.Sprintf
     32769  3.85% 36.16%      32769  3.85%  google.golang.org/grpc/internal/channelz.newIdentifer
     32768  3.85% 40.00%      32768  3.85%  go.opentelemetry.io/otel/metric/noop.MeterProvider.Meter
     26216  3.08% 43.08%      26216  3.08%  context.newCancelCtx
     21846  2.56% 45.65%      21846  2.56%  google.golang.org/grpc/internal/buffer.NewUnbounded
     21846  2.56% 48.21%      21846  2.56%  google.golang.org/grpc/internal/transport.NewServerTransport.func3

Memory metrics for the low-load region:
[Screenshot: memory metrics, low-load region]

So we are not sure whether the Go GC is simply not functioning well here or whether there is an actual memory leak.
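
As a stopgap while investigating, a memory_limiter processor placed first in the processor chain would at least turn hard OOM kills into refused data plus forced GC, and it helps distinguish reclaimable growth from truly retained memory (a forced GC brings reclaimable heap back down). A minimal sketch; the limits below are hypothetical and would need tuning to the Nomad task's memory allocation:

processors:
  memory_limiter:
    check_interval: 1s     # how often heap usage is checked
    limit_mib: 1800        # hard limit (placeholder); a forced GC is triggered above this
    spike_limit_mib: 400   # soft limit = limit_mib - spike_limit_mib; new data is refused above the soft limit

The processor would also need to be listed ahead of filter/span and batch in the traces pipeline.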

Steps to Reproduce

Expected Result

Actual Result

Collector version

0.83.0

Environment information

Environment

nomad + docker

OpenTelemetry Collector configuration

receivers:
  jaeger/withendpoint:
    protocols:
      grpc:
        endpoint: {{env "NOMAD_HOST_ADDR_jaeger_grpc_receiver_port"}}

  otlp/httpendpoint:
    protocols:
      http:
        endpoint: {{env "NOMAD_HOST_ADDR_oltp_http_receiver_port"}}

  otlp/grpcendpoint:
    protocols:
      grpc:
        endpoint: {{env "NOMAD_HOST_ADDR_oltp_grpc_receiver_port"}}

exporters:
  logging:
  
  loadbalancing:
    protocol:
      otlp:
        keepalive:
          time: 30s
          timeout: 20s
        timeout: 30s
        tls:
          insecure: true
        sending_queue:
          queue_size: 1500
    resolver:
      static:
        hostnames:
{{ range service "${otel_collector_backend_service_name}" }}
        - {{.Address}}:{{.Port}}
{{ end }}

processors:
  batch:
    send_batch_size: 5000
    send_batch_max_size: 7000

  filter/span:
    error_mode: ignore
    traces:
      span:
        - 'IsMatch(attributes["path"], ".*[h|H]ealth.*") == true'
        - 'IsMatch(name, ".*\\.[h|H]ealth.*") == true'
        - 'IsMatch(attributes["action"], ".*[h|H]ealth.*") == true'
      spanevent:
        - 'name == "message"'

extensions:
  health_check/withendpoint:
    endpoint: :{{env "NOMAD_HOST_PORT_health_check_extension_port"}}
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5
  pprof/withendpoint:
    endpoint: :{{env "NOMAD_HOST_PORT_pprof_port"}}

service:
  extensions: [ health_check/withendpoint, pprof/withendpoint ]

  telemetry:
    logs:
      level: ${telemetry_log_level}
      encoding: json
    metrics:
      level: detailed
      address: :{{env "NOMAD_HOST_PORT_metrics_port"}}

  pipelines:
    traces:
      receivers: [ jaeger/withendpoint, otlp/httpendpoint, otlp/grpcendpoint ]
      processors: [ filter/span, batch ]
      exporters: [ loadbalancing ]

Log output

No response

Additional context

We are using Nomad's signal change_mode to re-render the template when the backend OTel collector list gets updated, and every time a signal-triggered re-render happens there is a memory jump (a rendered example of the resolver block is sketched after the screenshots below).

[Screenshot: memory metrics, 2023-08-23 5:00 PM]

[Screenshot: memory metrics, 2023-08-23 4:47 PM]
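
For context, after Nomad renders the template, the static resolver block ends up looking roughly like the sketch below (addresses and ports are placeholders), and each signal-triggered re-render replaces this list wholesale:

    resolver:
      static:
        hostnames:
          # hypothetical rendered backend addresses
          - 10.0.0.11:4317
          - 10.0.0.12:4317
          - 10.0.0.13:4317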

yehaotian added the bug and needs triage labels on Aug 24, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

Can you check the metrics emitted by the collector? Perhaps there are more items in the queue during specific events?

@yehaotian
Author

otelcol_exporter_queue_size_gauge is basically 0; are there any other queue metrics I can check?
Also, otelcol_process_runtime_heap_alloc_bytes_gauge reflects the correlated memory usage; are there any other related metrics I can look into to tell what the main contributor is?

@yehaotian
Author

One question was raised internally:
How does the OpenTelemetry Collector reload when the configuration changes? In this case, the backend list is updated from time to time, which triggers a re-render at the load-balancer layer.

@yehaotian
Author

Probably related to open-telemetry/opentelemetry-collector#5966
The OTel collector has not implemented SIGHUP handling yet.

@jpkrohling
Member

The collector does not auto-update its config, although the load-balancing exporter will update the list of its backends if the DNS or Kubernetes resolvers are used.
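
For reference, a sketch of what switching from the static resolver to the DNS resolver could look like; the hostname below is a placeholder for whatever DNS name Consul/Nomad exposes for the backend collectors:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-backend.service.consul   # hypothetical service DNS name
        port: 4317
        interval: 30s   # how often the name is re-resolved

With a DNS-based resolver the backend list is refreshed in-process, so no template re-render or signal to the collector is needed.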

@atoulme
Contributor

atoulme commented Sep 9, 2023

And AFAIK it's not on the roadmap to update config via SIGHUP.

Contributor

github-actions bot commented Nov 8, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Nov 8, 2023
Contributor

github-actions bot commented Jan 7, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 7, 2024