
OpenTelemetry Collector memory leak / non-optimized GC #26087

Closed
yehaotian opened this issue Aug 24, 2023 · 9 comments

@yehaotian

Component(s)

exporter/loadbalancing

What happened?

Description

We observe a memory leak in the OpenTelemetry Collector load-balancing layer, which triggers OOM kills and Nomad job restarts.
Memory metrics for the high-load region:

[Screenshot: memory metrics, high-load region]

pprof top-10 results:
--inuse_space:

      flat  flat%   sum%        cum   cum%
 2198.65MB 60.34% 60.34%  2198.65MB 60.34%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
  972.33MB 26.68% 87.02%   972.33MB 26.68%  go.uber.org/zap/zapcore.newCounters
  142.02MB  3.90% 90.92%   142.02MB  3.90%  bufio.NewReaderSize (inline)
   75.63MB  2.08% 92.99%    75.63MB  2.08%  bytes.growSlice
   41.43MB  1.14% 94.13%    51.56MB  1.41%  compress/flate.NewWriter
   38.90MB  1.07% 95.20%    38.90MB  1.07%  go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewBoundedMemoryQueue
   22.09MB  0.61% 95.80%    22.09MB  0.61%  golang.org/x/net/http2/hpack.(*headerFieldTable).addEntry
   20.76MB  0.57% 96.37%    20.76MB  0.57%  golang.org/x/net/http2.(*Framer).startWriteDataPadded
       2MB 0.055% 96.43%  2258.75MB 61.98%  google.golang.org/grpc/internal/transport.newHTTP2Client
    1.50MB 0.041% 96.47%    77.13MB  2.12%  bytes.(*Buffer).grow

--inuse_objects:

      flat  flat%   sum%        cum   cum%
    131074 15.39% 15.39%     163842 19.24%  google.golang.org/grpc/internal/grpcutil.EncodeDuration
     40345  4.74% 20.13%      40345  4.74%  runtime.malg
     38232  4.49% 24.61%      38232  4.49%  google.golang.org/grpc/internal/transport.(*controlBuffer).get
     32774  3.85% 28.46%      32774  3.85%  strconv.formatBits
     32769  3.85% 32.31%      32769  3.85%  fmt.Sprintf
     32769  3.85% 36.16%      32769  3.85%  google.golang.org/grpc/internal/channelz.newIdentifer
     32768  3.85% 40.00%      32768  3.85%  go.opentelemetry.io/otel/metric/noop.MeterProvider.Meter
     26216  3.08% 43.08%      26216  3.08%  context.newCancelCtx
     21846  2.56% 45.65%      21846  2.56%  google.golang.org/grpc/internal/buffer.NewUnbounded
     21846  2.56% 48.21%      21846  2.56%  google.golang.org/grpc/internal/transport.NewServerTransport.func3

Memory metrics for the low-load region:
[Screenshot: memory metrics, low-load region]

So we are not sure whether the Go GC is simply not functioning well here or whether there is an actual memory leak.
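
As a stopgap while investigating, a memory_limiter processor placed first in the processor chain would at least turn hard OOM kills into refused data plus forced GC, and it helps distinguish reclaimable growth from truly retained memory (a forced GC brings reclaimable heap back down). A minimal sketch; the limits below are hypothetical and would need tuning to the Nomad task's memory allocation:

processors:
  memory_limiter:
    check_interval: 1s     # how often heap usage is checked
    limit_mib: 1800        # hard limit (placeholder); a forced GC is triggered above this
    spike_limit_mib: 400   # soft limit = limit_mib - spike_limit_mib; new data is refused above the soft limit

The processor would also need to be listed ahead of filter/span and batch in the traces pipeline.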

Steps to Reproduce

Expected Result

Actual Result

Collector version

0.83.0

Environment information

Environment

nomad + docker

OpenTelemetry Collector configuration

receivers:
  jaeger/withendpoint:
    protocols:
      grpc:
        endpoint: {{env "NOMAD_HOST_ADDR_jaeger_grpc_receiver_port"}}

  otlp/httpendpoint:
    protocols:
      http:
        endpoint: {{env "NOMAD_HOST_ADDR_oltp_http_receiver_port"}}

  otlp/grpcendpoint:
    protocols:
      grpc:
        endpoint: {{env "NOMAD_HOST_ADDR_oltp_grpc_receiver_port"}}

exporters:
  logging:
  
  loadbalancing:
    protocol:
      otlp:
        keepalive:
          time: 30s
          timeout: 20s
        timeout: 30s
        tls:
          insecure: true
        sending_queue:
          queue_size: 1500
    resolver:
      static:
        hostnames:
{{ range service "${otel_collector_backend_service_name}" }}
        - {{.Address}}:{{.Port}}
{{ end }}

processors:
  batch:
    send_batch_size: 5000
    send_batch_max_size: 7000

  filter/span:
    error_mode: ignore
    traces:
      span:
        - 'IsMatch(attributes["path"], ".*[h|H]ealth.*") == true'
        - 'IsMatch(name, ".*\\.[h|H]ealth.*") == true'
        - 'IsMatch(attributes["action"], ".*[h|H]ealth.*") == true'
      spanevent:
        - 'name == "message"'

extensions:
  health_check/withendpoint:
    endpoint: :{{env "NOMAD_HOST_PORT_health_check_extension_port"}}
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5
  pprof/withendpoint:
    endpoint: :{{env "NOMAD_HOST_PORT_pprof_port"}}

service:
  extensions: [ health_check/withendpoint, pprof/withendpoint ]

  telemetry:
    logs:
      level: ${telemetry_log_level}
      encoding: json
    metrics:
      level: detailed
      address: :{{env "NOMAD_HOST_PORT_metrics_port"}}

  pipelines:
    traces:
      receivers: [ jaeger/withendpoint, otlp/httpendpoint, otlp/grpcendpoint ]
      processors: [ filter/span, batch ]
      exporters: [ loadbalancing ]

Log output

No response

Additional context

We are using Nomad's signal change_mode to re-render the template when the backend OTel collector list gets updated, and every time a signal-triggered re-render happens there is a memory jump (a rendered example of the resolver block is sketched after the screenshots below).

[Screenshot: memory metrics, 2023-08-23 5:00 PM]

[Screenshot: memory metrics, 2023-08-23 4:47 PM]
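
For context, after Nomad renders the template, the static resolver block ends up looking roughly like the sketch below (addresses and ports are placeholders), and each signal-triggered re-render replaces this list wholesale:

    resolver:
      static:
        hostnames:
          # hypothetical rendered backend addresses
          - 10.0.0.11:4317
          - 10.0.0.12:4317
          - 10.0.0.13:4317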

yehaotian added the bug and needs triage labels on Aug 24, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

Can you check the metrics emitted by the collector? Perhaps there are more items in the queue during specific events?

@yehaotian
Author

otelcol_exporter_queue_size_gauge is basically 0; are there any other queue metrics I can check?
Also, otelcol_process_runtime_heap_alloc_bytes_gauge reflects the correlated memory usage; are there any other related metrics I can look into to tell what the main contributor is?

@yehaotian
Author

One question was raised internally:
How does the OpenTelemetry Collector reload when the configuration changes? In this case, the backend list is updated from time to time, which triggers a re-render at the load-balancer layer.

@yehaotian
Author

Probably related to open-telemetry/opentelemetry-collector#5966
The OTel collector has not implemented SIGHUP handling yet.

@jpkrohling
Member

The collector does not auto-update its config, although the load-balancing exporter will update the list of its backends if the DNS or Kubernetes resolvers are used.
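
For reference, a sketch of what switching from the static resolver to the DNS resolver could look like; the hostname below is a placeholder for whatever DNS name Consul/Nomad exposes for the backend collectors:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-backend.service.consul   # hypothetical service DNS name
        port: 4317
        interval: 30s   # how often the name is re-resolved

With a DNS-based resolver the backend list is refreshed in-process, so no template re-render or signal to the collector is needed.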

@atoulme
Contributor

atoulme commented Sep 9, 2023

And AFAIK it's not on the roadmap to update config via SIGHUP.

Contributor

github-actions bot commented Nov 8, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Nov 8, 2023
Contributor

github-actions bot commented Jan 7, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 7, 2024