[target-allocator] targets assigned to old pod after HPA scaled down #1028

Closed
moh-osman3 opened this issue Aug 10, 2022 · 1 comment · Fixed by #1237
Labels
area:target-allocator

Comments

@moh-osman3 (Contributor)

Observed an issue while load testing with the HPA created from the collector CRD.

Context:

In my collector spec:

spec:
  mode: {{ .Values.collector.mode }}
  image: {{ .Values.collector.image }}
  minReplicas: 1
  maxReplicas: 20
  targetAllocator:
    enabled: true
    image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:latest
    serviceAccount: {{ .Release.Name }}-collector-targetallocator
    prometheusCR:
      enabled: false

While load testing, the HPA scales the collector StatefulSet up to 12 pods. After lowering the metric workload, the HPA scales back down to a single collector pod in the StatefulSet.

What I expected:

I expect the target allocator to assign targets only to the remaining pod after the scale-down.

What actually happened:

collector-0 has no targets assigned in the TA, while collector-11 holds the target I expected. Collector-11 was terminated and therefore should not have any targets.

$ kubectl get po -n opentelemetry
NAME                                                         READY   STATUS    RESTARTS   AGE
curl-moh                                                     1/1     Running   0          115m
lightstep-collector-collector-0                              1/1     Running   0          57m
lightstep-collector-targetallocator-b6865b5bb-dc4w5          1/1     Running   0          113m
opentelemetry-operator-controller-manager-575cdcbc57-4d24t   2/2     Running   0          11h

[root@curl-moh:/]$ curl http://lightstep-collector-targetallocator:80/jobs/serviceMonitor%2Favalanche%2Favalanche%2F0/targets?collector_id=lightstep-collector-collector-0
[]

[ root@curl-moh:/ ]$ curl http://lightstep-collector-targetallocator:80/jobs/serviceMonitor%2Favalanche%2Favalanche%2F0/targets?collector_id=lightstep-collector-collector-11
[
  {
    "targets": [
      "10.0.7.184:9001"
    ],
    "labels": {...}
  }
]

I wonder if this has to do with the HPA stabilization window. The target allocator reallocates targets whenever the set of targets changes, but the stabilization window means the unneeded pods take several minutes to terminate. During that window the allocator sees the reduction in targets and reassigns them across all collectors that are still running, even though the HPA is about to scale them down. If there is no further change in targets after the scale-down completes, a terminated collector pod is left holding assigned targets.
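For context on the timing: the scale-down delay comes from the autoscaling/v2 behavior settings, which Kubernetes defaults to a 300-second stabilization window. A minimal sketch in Go of the behavior block in question (illustrative only; this setup relies on the default rather than setting it explicitly):

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// defaultScaleDownBehavior builds the autoscaling/v2 behavior block that
// governs the window described above. With the Kubernetes default of 300
// seconds, the surplus collector pods keep running (and remain eligible
// for target assignment) for up to five minutes after load drops.
func defaultScaleDownBehavior() *autoscalingv2.HorizontalPodAutoscalerBehavior {
	window := int32(300) // Kubernetes default for scale down
	return &autoscalingv2.HorizontalPodAutoscalerBehavior{
		ScaleDown: &autoscalingv2.HPAScalingRules{
			StabilizationWindowSeconds: &window,
		},
	}
}

func main() {
	fmt.Printf("scale-down stabilization window: %ds\n",
		*defaultScaleDownBehavior().ScaleDown.StabilizationWindowSeconds)
}

That five-minute gap between "targets drop" and "pods terminate" is exactly the window in which the allocator can assign targets to pods that are about to disappear.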

@pavolloffay added the area:target-allocator label on Aug 10, 2022
@moh-osman3 (Contributor, Author)

After adding some logs, I found more info about why this bug is occurring.

{"level":"info","ts":1665127123.2009046,"logger":"allocator","msg":"The number of collectors to be added is: 0"}
{"level":"info","ts":1665127123.2011936,"logger":"allocator","msg":"The number of collectors to be removed is: 0"}
{"level":"info","ts":1665127123.439969,"logger":"allocator","msg":"The number of collectors to be added is: 0"}
{"level":"info","ts":1665127123.440088,"logger":"allocator","msg":"The number of collectors to be removed is: 0"}
{"level":"info","ts":1665127127.2063422,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127338.1837273,"logger":"allocator","msg":"false","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1838017,"logger":"allocator","msg":"Collector pod watch event stopped no event","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127427.2042353,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127437.2033138,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127767.2002187,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665127797.1970944,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128392.1984103,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128397.1981525,"logger":"allocator","msg":"targets handled successfully"}
{"level":"info","ts":1665128402.1978114,"logger":"allocator","msg":"targets handled successfully"}

The pod watcher is stopped at some point during scale-down, and after that only SetTargets is ever called; no calls to SetCollectors are made. As a result, the TA works from a stale view of the collector set.

This bug has been difficult to reproduce reliably, which makes testing difficult, but the issue seems to be related to this line.
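To make the suspected failure mode concrete, here is a minimal sketch in Go (not the actual target-allocator source; the namespace, label selector, and callback are assumptions for illustration) of a client-go pod watch that silently stops when the API server closes the watch channel:

package watcher

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchCollectorPods tracks the set of collector pods and reports it via
// fn, analogous to SetCollectors in the TA.
func watchCollectorPods(ctx context.Context, cs kubernetes.Interface, fn func(map[string]bool)) error {
	w, err := cs.CoreV1().Pods("opentelemetry").Watch(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/component=opentelemetry-collector", // assumed selector
	})
	if err != nil {
		return err
	}
	collectors := map[string]bool{}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*v1.Pod)
		if !ok {
			continue
		}
		switch event.Type {
		case watch.Added:
			collectors[pod.Name] = true
		case watch.Deleted:
			delete(collectors, pod.Name)
		}
		fn(collectors)
	}
	// Failure mode: the API server closes watches periodically. Once
	// ResultChan is closed we fall out of the loop and never re-establish
	// the watch, so fn (i.e. SetCollectors) is never called again; this is
	// consistent with the "Collector pod watch event stopped no event"
	// log line above.
	return fmt.Errorf("collector pod watch closed")
}

If that is what is happening, re-establishing the watch in a retry loop when the channel closes should keep SetCollectors in sync with the live pod set.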
