[target-allocator] targets assigned to old pod after HPA scaled down #1028

Closed
moh-osman3 opened this issue Aug 10, 2022 · 1 comment · Fixed by #1237
Labels
area:target-allocator

Comments

@moh-osman3 (Contributor)

Observed an issue while load testing with the HPA created from the collector CRD.

Context:

In my collector spec:

spec:
  mode: {{ .Values.collector.mode }}
  image: {{ .Values.collector.image }}
  minReplicas: 1
  maxReplicas: 20
  targetAllocator:
    enabled: true
    image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:latest
    serviceAccount: {{ .Release.Name }}-collector-targetallocator
    prometheusCR:
      enabled: false

While load testing, the HPA scales the collector StatefulSet up to 12 pods. After lowering the metric workload, the HPA scales back down to a single collector pod in the StatefulSet.

What I expected:

I expect the target allocator to assign targets only to the remaining pod after the scale-down.

What actually happened:

collector-0 has no targets assigned in the TA, while collector-11 holds the target I expected. Collector-11 was terminated and therefore should not have any targets.

$ kubectl get po -n opentelemetry
NAME                                                         READY   STATUS    RESTARTS   AGE
curl-moh                                                     1/1     Running   0          115m
lightstep-collector-collector-0                              1/1     Running   0          57m
lightstep-collector-targetallocator-b6865b5bb-dc4w5          1/1     Running   0          113m
opentelemetry-operator-controller-manager-575cdcbc57-4d24t   2/2     Running   0          11h

[root@curl-moh:/]$ curl http://lightstep-collector-targetallocator:80/jobs/serviceMonitor%2Favalanche%2Favalanche%2F0/targets?collector_id=lightstep-collector-collector-0
[]

[ root@curl-moh:/ ]$ curl http://lightstep-collector-targetallocator:80/jobs/serviceMonitor%2Favalanche%2Favalanche%2F0/targets?collector_id=lightstep-collector-collector-11
[
  {
    "targets": [
      "10.0.7.184:9001"
    ],
    "labels": {...}
  }
]

I wonder if this has to do with the HPA stabilization window. The target allocator reallocates targets whenever the set of targets changes, but the stabilization window means the unneeded pods take several minutes to terminate. During that window the allocator sees the reduction in targets and reassigns them across all collectors that are still running, even though the HPA is about to scale them down. If there is no further change in targets after the scale-down completes, a terminated collector pod is left holding assigned targets.
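For context on the timing: the scale-down delay comes from the autoscaling/v2 behavior settings, which Kubernetes defaults to a 300-second stabilization window. A minimal sketch in Go of the behavior block in question (illustrative only; this setup relies on the default rather than setting it explicitly):

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// defaultScaleDownBehavior builds the autoscaling/v2 behavior block that
// governs the window described above. With the Kubernetes default of 300
// seconds, the surplus collector pods keep running (and remain eligible
// for target assignment) for up to five minutes after load drops.
func defaultScaleDownBehavior() *autoscalingv2.HorizontalPodAutoscalerBehavior {
	window := int32(300) // Kubernetes default for scale down
	return &autoscalingv2.HorizontalPodAutoscalerBehavior{
		ScaleDown: &autoscalingv2.HPAScalingRules{
			StabilizationWindowSeconds: &window,
		},
	}
}

func main() {
	fmt.Printf("scale-down stabilization window: %ds\n",
		*defaultScaleDownBehavior().ScaleDown.StabilizationWindowSeconds)
}

That five-minute gap between "targets drop" and "pods terminate" is exactly the window in which the allocator can assign targets to pods that are about to disappear.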

@pavolloffay added the area:target-allocator label on Aug 10, 2022
@moh-osman3 (Contributor, Author)

After adding some logs, I found more info about why this bug is occurring.

{"level":"info","ts":1665127123.2009046,"logger":"allocator","msg":"The number of collectors to be added is: 0"}
{"level":"info","ts":1665127123.2011936,"logger":"allocator","msg":"The number of collectors to be removed is: 0"}
{"level":"info","ts":1665127123.439969,"logger":"allocator","msg":"The number of collectors to be added is: 0"}
{"level":"info","ts":1665127123.440088,"logger":"allocator","msg":"The number of collectors to be removed is: 0"}
{"level":"info","ts":1665127127.2063422,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127338.1837273,"logger":"allocator","msg":"false","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1838017,"logger":"allocator","msg":"Collector pod watch event stopped no event","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127427.2042353,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127437.2033138,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127767.2002187,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127797.1970944,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128392.1984103,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128397.1981525,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128402.1978114,"logger":"allocator","msg":"targets handled successfully"}

The pod watcher is stopped at some point during scale-down, and after that only SetTargets is ever called; no calls to SetCollectors are made. As a result, the TA works from a stale view of the collector set.

This bug has been difficult to reproduce reliably, which makes testing difficult, but the issue seems to be related to this line.
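To make the suspected failure mode concrete, here is a minimal sketch in Go (not the actual target-allocator source; the namespace, label selector, and callback are assumptions for illustration) of a client-go pod watch that silently stops when the API server closes the watch channel:

package watcher

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchCollectorPods tracks the set of collector pods and reports it via
// fn, analogous to SetCollectors in the TA.
func watchCollectorPods(ctx context.Context, cs kubernetes.Interface, fn func(map[string]bool)) error {
	w, err := cs.CoreV1().Pods("opentelemetry").Watch(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/component=opentelemetry-collector", // assumed selector
	})
	if err != nil {
		return err
	}
	collectors := map[string]bool{}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		switch event.Type {
		case watch.Added:
			collectors[pod.Name] = true
		case watch.Deleted:
			delete(collectors, pod.Name)
		}
		fn(collectors)
	}
	// Failure mode: the API server closes watches periodically. Once
	// ResultChan is closed we fall out of the loop and never re-establish
	// the watch, so fn (i.e. SetCollectors) is never called again; this is
	// consistent with the "Collector pod watch event stopped no event"
	// log line above.
	return fmt.Errorf("collector pod watch closed")
}

If that is what is happening, re-establishing the watch in a retry loop when the channel closes should keep SetCollectors in sync with the live pod set.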
