
Otel collector created using otel operator not setting hpa memory utilization config correctly #3283

Closed
shine17 opened this issue Sep 12, 2024 · 4 comments · Fixed by #3293
Labels: bug (Something isn't working), needs triage

Comments

shine17 commented Sep 12, 2024

Component(s)

collector

What happened?

Description

The OpenTelemetry Collector created by the OTel Operator does not apply the HPA memory utilization configuration correctly.

Steps to Reproduce

1. Deploy the OTel Operator.
2. Create an OTel Collector deployment object with a minimum of 3 replicas and a maximum of 6 replicas, using the spec fragment below:

replicas: {{ .Values.minReplicaCount }}
resources:
    limits:
      cpu: 100m
      memory: 1024Mi
      # ephemeral-storage: 50Mi
    requests:
      cpu: 100m
      memory: 64Mi
autoscaler:
    minReplicas: {{ .Values.minReplicaCount }}
    maxReplicas: {{ .Values.maxReplicaCount }}
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 2
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60
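
For context, here is a minimal sketch of how a fragment like the one above could sit inside a full OpenTelemetryCollector resource; the mode, receiver, and exporter config below are illustrative assumptions, not taken from this report:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
spec:
  mode: deployment
  replicas: 3
  resources:
    limits:
      cpu: 100m
      memory: 1024Mi
    requests:
      cpu: 100m
      memory: 64Mi
  autoscaler:
    minReplicas: 3
    maxReplicas: 6
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]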

The targetMemoryUtilization is not honored, and the HPA always scales up the collector pods even though the memory utilization is less than 30 percent of the limit for each collector pod.

Pod memory data:

NAME                                      CPU(cores)   MEMORY(bytes)
otel-gateway-collector-7898f79fdd-27l9j   1m           55Mi

HPA data:

NAME                     REFERENCE                             TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
otel-gateway-collector   OpenTelemetryCollector/otel-gateway   112%/65%, 4%/80%   3         6         6          106m
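
For reference, the HPA TARGETS column (112%/65%, 4%/80%) expresses utilization as a percentage of the pods' resource requests rather than their limits, so a rough check against the 64Mi memory request configured above:

55Mi / 64Mi × 100 ≈ 86%    (the pod shown, already above the 65% target)
112% of 64Mi ≈ 72Mi        (average memory usage per pod implied by the HPA reading)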

Expected Result

Scaling should happen only when memory utilization exceeds the targetMemoryUtilization percentage.

Actual Result

Scaling happens anyway, because the memory utilization is calculated incorrectly against targetMemoryUtilization.

Please also add test cases for targetMemoryUtilization to the repo; I could not find any under https://github.com/open-telemetry/opentelemetry-operator/tree/main/tests/e2e-autoscale/autoscale

Kubernetes Version

1.29.7

Operator version

0.108.0

Collector version

0.109.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

Log output

No response

Additional context

No response

shine17 added the bug (Something isn't working) and needs triage labels on Sep 12, 2024
@jaronoff97 (Contributor) commented:

Can you share the generated HPA resource?


shine17 commented Sep 15, 2024

> Can you share the generated HPA resource?

@jaronoff97 this is the HPA configuration in my deployment YAML:

  replicas: {{ .Values.minReplicaCount }}
  autoscaler:
    minReplicas: {{ .Values.minReplicaCount }}
    maxReplicas: {{ .Values.maxReplicaCount }}
    targetCPUUtilization: 80
    targetMemoryUtilization: 65
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 2
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60
  securityContext:
    allowPrivilegeEscalation: false
    privileged: false
    readOnlyRootFilesystem: true
  resources:
    limits:
      cpu: 1000m
      memory: 1024Mi
    requests:
      cpu: 50m
      memory: 64Mi
      

Below is the generated HPA YAML:

kubectl get hpa otel-gateway-collector -n monitoringapps -o yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: otel-gateway-deployment
    meta.helm.sh/release-namespace: monitoringapps
  creationTimestamp: "2024-09-14T06:55:59Z"
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: monitoringapps.otel-gateway
    app.kubernetes.io/managed-by: opentelemetry-operator
    app.kubernetes.io/name: otel-gateway-collector
    app.kubernetes.io/part-of: opentelemetry
    app.kubernetes.io/version: latest
  name: otel-gateway-collector
  namespace: monitoringapps
  ownerReferences:
  - apiVersion: opentelemetry.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: OpenTelemetryCollector
    name: otel-gateway
    uid: 8766a8asfadadadad
  resourceVersion: "549018"
  uid: f06sfaffffffff
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 600
        type: Pods
        value: 1
      selectPolicy: Min
      stabilizationWindowSeconds: 900
    scaleUp:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 2
      - periodSeconds: 60
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 60
  maxReplicas: 6
  metrics:
  - resource:
      name: memory
      target:
        averageUtilization: 65
        type: Utilization
    type: Resource
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    name: otel-gateway
status:
  conditions:
  - lastTransitionTime: "2024-09-14T06:56:14Z"
    message: recommended size matches current size
    reason: ReadyForNewScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2024-09-14T09:35:19Z"
    message: the HPA was able to successfully calculate a replica count from memory
      resource utilization (percentage of request)
    reason: ValidMetricFound
    status: "True"
    type: ScalingActive
  - lastTransitionTime: "2024-09-15T06:39:18Z"
    message: the desired replica count is more than the maximum replica count
    reason: TooManyReplicas
    status: "True"
    type: ScalingLimited
  currentMetrics:
  - resource:
      current:
        averageUtilization: 108
        averageValue: 72924501333m
      name: memory
    type: Resource
  - resource:
      current:
        averageUtilization: 3
        averageValue: 1m
      name: cpu
    type: Resource
  currentReplicas: 6
  desiredReplicas: 6
  lastScaleTime: "2024-09-15T06:39:18Z"
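
For reference, the reported values are consistent with utilization measured against the 64Mi memory request (the status message above calls it a "percentage of request"):

averageValue: 72924501333m  →  72,924,501,333 millibytes ≈ 72,924,501 bytes ≈ 69.5Mi
69.5Mi / 64Mi ≈ 1.08        →  averageUtilization: 108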


shine17 commented Sep 15, 2024

@jaronoff97 Could you add a test case for memory-based autoscaling here, so that the issue can be reproduced?

https://github.com/open-telemetry/opentelemetry-operator/tree/main/tests/e2e-autoscale/autoscale

@jaronoff97 (Contributor) commented:

@shine17 we already have a test case for CPU; I copied it for memory and was unable to reproduce your bug. Is this an issue with the operator? You mentioned deployment.yaml; where is that coming from for you? Is it possible your Helm chart is misconfigured?

#3293

if you are able to reproduce this locally, can you please provide a full working example?
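
One possible way to capture such an example from a running cluster (assuming the namespace shown earlier in this thread, and that the collector pods carry the same app.kubernetes.io/instance label as the generated HPA):

kubectl get opentelemetrycollector otel-gateway -n monitoringapps -o yaml
kubectl get hpa otel-gateway-collector -n monitoringapps -o yaml
kubectl top pods -n monitoringapps -l app.kubernetes.io/instance=monitoringapps.otel-gateway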
