[Exporter/LoadBalancer] Increased Memory Utilization after bumping from 0.94.0 to 0.99.0 #33435

Open
NickAnge opened this issue Jun 7, 2024 · 9 comments

NickAnge commented Jun 7, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

Hello team.

We recently upgraded our internal collectors from version 0.94.0 to 0.99.0 and observed a rise in memory usage in the collectors of the load balancer deployment, as shown in the image below. The increase persisted even after updating to the latest version, 0.101.0.

Screenshot 2024-06-07 at 19 04 31

We enabled profiling on our collectors (the pprof extension) and examined inuse_memory and inuse_objects. I split the investigation across 3 pods with low, medium, and high memory usage.

Inuse Memory - Top

Low Memory Usage Pod

Screenshot 2024-06-07 at 19 08 07

Medium Memory Usage Pod

Screenshot 2024-06-07 at 19 08 40

High Memory Usage Pod

Screenshot 2024-06-07 at 19 08 48

Inuse Objects - Top

Low Memory Usage Pod

Screenshot 2024-06-07 at 19 10 19

Medium Memory Usage Pod

Screenshot 2024-06-07 at 19 10 02

High Memory Usage Pod

Screenshot 2024-06-07 at 19 10 12

Steps to Reproduce

  1. Deploy the collector as a load balancer using version 0.94.0
  2. Bump the version to 0.101.0

Expected Result

Memory usage should remain stable over time after the version bump.

Actual Result

Memory usage is high after the version bump.

Collector version

0.101.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 20

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 95
    spike_limit_percentage: 15
  k8sattributes:
    passthrough: true

exporters:
  loadbalancing/spans:
    protocol:
      otlp:
        sending_queue:
          enabled: true
          num_consumers: 100
          queue_size: 500
        retry_on_failure:
          enabled: true
          initial_interval: 2s
          max_interval: 2s
          max_elapsed_time: 10s
        tls:
          insecure: true
        timeout: 1
    resolver:
      k8s:
        service: service
  loadbalancing/metrics:
    routing_key: metric
    protocol:
      otlp:
        sending_queue:
          enabled: true
          num_consumers: 50
          queue_size: 500
        retry_on_failure:
          enabled: true
          initial_interval: 2s
          max_interval: 2s
          max_elapsed_time: 10s
        tls:
          insecure: true
        timeout: 1
    resolver:
      k8s:
        service: service

extensions:
  health_check:
  pprof:
    endpoint: :1777

service:
  extensions: [ health_check , pprof]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter ]
      exporters: [ loadbalancing/spans ]
    logs:
      receivers: [ otlp ]
      processors: [ memory_limiter ]
      exporters: [ loadbalancing/spans ]
    metrics:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes ]
      exporters: [ loadbalancing/metrics ]

Log output

No response

Additional context

No response

NickAnge added the bug and needs triage labels Jun 7, 2024
Contributor

github-actions bot commented Jun 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

Thank you for the detailed report, I'll take a look and try to reproduce it. In the meantime, can you try switching to the DNS resolver instead of the k8s resolver? I'm not 100% sure yet it would show a difference, but the DNS resolver is known to consume fewer resources in other situations.

    resolver:
      k8s:
        service: service
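
For reference, a minimal sketch of what the equivalent DNS resolver block might look like. It assumes the backing collectors sit behind a headless Kubernetes Service, so that a DNS lookup returns the individual pod IPs. The hostname below is a placeholder, not a value taken from the reported setup, and `port` and `interval` are optional settings shown with illustrative values.

```yaml
    resolver:
      dns:
        # Placeholder hostname (assumption): replace with the DNS name of the
        # headless Service in front of the backing collectors.
        hostname: service.my-namespace.svc.cluster.local
        # Optional: port on which the backing collectors accept OTLP/gRPC.
        port: 4317
        # Optional: how often the hostname is re-resolved. A shorter interval
        # picks up rollouts sooner at the cost of more DNS lookups.
        interval: 5s
```

Only the `resolver` block changes; the `protocol` section of each `loadbalancing` exporter can stay as it is while comparing memory profiles.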

@NickAnge
Author

Thanks @jpkrohling.
We discussed internally replacing the k8s resolver with the DNS resolver. The conclusion was to stay with the k8s resolver, as it is faster at resolving the endpoints of the backing collectors in case of a rollout or outage.

Let me know if you need more information about the issue, and thanks a lot for taking a look.

@jpkrohling
Member

Can you temporarily replace it, and see if the memory profile is different? If we can isolate this behavior to this resolver specifically, it's easier to find a solution.

@NickAnge
Author

This memory issue occurred only in our production environments (probably because of higher traffic), so I am not sure we can change the resolver there, even temporarily. Did you manage to reproduce it in your setup?

@jpkrohling
Member

I wasn't able to try it out. I might be able to find some time later this week, but next week I'm AFK again. If anyone is interested in this issue, it would help me a lot to have confirmation that this is isolated to the k8s resolver.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.


github-actions bot added the Stale label Aug 19, 2024
jpkrohling removed the Stale label Aug 19, 2024
jpkrohling self-assigned this Aug 19, 2024
jpkrohling removed the needs triage label Aug 19, 2024
@dmedinag

Just pinging the owner of exporter/loadbalancing, @jpkrohling, to keep this issue from going stale.

github-actions bot added the Stale label Nov 12, 2024