
opentelemetry-operator manager crashes during instrumentation injection attempt #3303

Open
sergeykad opened this issue Sep 24, 2024 · 14 comments
Labels: bug (Something isn't working), needs triage

Comments

@sergeykad

sergeykad commented Sep 24, 2024

Component(s)

auto-instrumentation

What happened?

Description

opentelemetry-operator manager crashes

Steps to Reproduce

  1. Install the opentelemetry-operator on a Kubernetes cluster.
  2. Restart a pod that has the following annotation (a fuller Deployment sketch is shown below):

     annotations:
       instrumentation.opentelemetry.io/inject-java: "true"
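A minimal sketch of where that annotation typically lives, assuming a plain Deployment (all names and the image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-service              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-java-service
  template:
    metadata:
      labels:
        app: my-java-service
      annotations:
        # picked up by the operator's pod mutation webhook
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/my-java-service:latest   # placeholder image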

Expected Result

A sidecar is added to the pod and the service is instrumented with OpenTelemetry.

Actual Result

opentelemetry-operator crashes with the log seen below.

Kubernetes Version

1.25

Operator version

v0.109.0

Collector version

v0.69.0

Environment information

Environment

OS: Rocky Linux 9.3

Log output

{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Ingress"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodDisruptionBudget"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceMonitor"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodMonitor"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"collector-upgrade","message":"no instances to upgrade"}
{"level":"DEBUG","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"INFO","timestamp":"2024-09-24T13:57:36Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}
{"level":"INFO","timestamp":"2024-09-24T13:57:36Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"DEBUG","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"INFO","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"INFO","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"annotation not present in deployment, skipping sidecar injection","namespace":"optimus","name":""}
{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"injecting Java instrumentation into pod","otelinst-namespace":"optimus","otelinst-name":"instrumentation"}

Additional context

There are no additional log messages. The manager just disappears.

sergeykad added the bug (Something isn't working) and needs triage labels on Sep 24, 2024
@jaronoff97
Contributor

jaronoff97 commented Sep 25, 2024

Does the manager pod report any reason for its crash? OOMKilled, maybe? I haven't been able to reproduce this.

@sergeykad
Author

There was no reason at all. It just died and a new pod started.
It looks like something crashed during the instrumentation injection, since that is the last log message and the sidecar was never added.

I performed a similar deployment on Minikube and it works fine there, but it crashes on our production Kubernetes cluster.
If there is an option to enable more detailed logs, or some other test we can run, I can try it.

@jaronoff97
Contributor

You can follow the guide on how to enable debug logs here: https://github.com/open-telemetry/opentelemetry-operator/blob/main/DEBUG.md. Is it possible the operator doesn't have the permission to perform the mutation?
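If you want to rule that out, a quick check is to confirm the mutating webhook is actually registered and look at its rules and failure policy (the grep pattern and object name are assumptions about your install):

kubectl get mutatingwebhookconfigurations | grep -i opentelemetry
kubectl get mutatingwebhookconfiguration <webhook-name> -o yaml   # inspect rules and failurePolicy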

@sergeykad
Author

We already added --zap-log-level=debug, as seen in the attached log. If any additional parameters can help, we will add them.

The operator grants itself the required permissions, so that is probably not the problem. We use the default resources.

The only possible cause of the error I can think of is that the cluster has no direct Internet access, but it can pull Docker images from the configured Docker proxy.

@omerozery

omerozery commented Oct 15, 2024

We upgraded the operator to 0.110.0 and the chart to 0.71.0, and it still crashes with absolutely no details in the log or in the pod's describe output.
It does not get OOMKilled, and the host has plenty of free memory.
We removed the resources limits and left only the requests.

These are the values we use to deploy the Helm chart:

admissionWebhooks:
  autoGenerateCert:
    enabled: true
  certManager:
    enabled: false
manager:
  collectorImage:
    repository: otel/opentelemetry-collector-k8s
  extraArgs:
  - --enable-nginx-instrumentation=true
  - --zap-log-level=debug

  • The operator watches all namespaces.
  • We use Calico as our CNI, with default configuration.
  • Our Kubernetes is installed via kubeadm.
  • We use RBAC.
  • The k8s hosts have no access to the Internet; they get images through a Nexus repository (so if special images from an external, unknown repo are needed, we configure them manually beforehand).
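For completeness, the chart is deployed roughly like this with those values (the repo add step, release name, and namespace are assumptions about our setup):

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system --create-namespace \
  -f values.yaml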

@jaronoff97
Contributor

@omerozery Can you share any logs from the operator? Given you have debug logs enabled you should be seeing something.

@sergeykad
Author

@jaronoff97 it's the same issue I described above. You can see the whole log, up to the point where the service crashes, in the issue description.

@jaronoff97
Contributor

jaronoff97 commented Oct 16, 2024

Sorry, that was unclear from Omer's comment. Without more information or a way to reproduce this, there is unfortunately not much I can do to assist. If you'd like, we can follow up in Slack (CNCF Slack, #otel-operator channel) and go through some more specific Kubernetes debugging steps.

@pavolloffay
Member

Maybe there was a nil pointer exception and the operator pod was restarted.

I would suggest running kubectl logs operator-pod -f to follow the logs and try to reproduce. You will see how it crashes, and then you can paste the logs here.
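Something along these lines should capture both the crash output and the restart reason (the namespace and pod names below are placeholders):

# follow the live logs until the pod dies
kubectl logs -f -n opentelemetry-operator-system deploy/opentelemetry-operator
# after the restart, read the logs of the crashed container
kubectl logs -n opentelemetry-operator-system <operator-pod> --previous
# the last state, exit code, and reason (e.g. OOMKilled) show up here
kubectl describe pod -n opentelemetry-operator-system <operator-pod>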

@sergeykad
Author

> Maybe there was a nil pointer exception and the operator pod was restarted.
>
> I would suggest running kubectl logs operator-pod -f to follow the logs and try to reproduce. You will see how it crashes, and then you can paste the logs here.

That's what we did. See the log above.

@iblancasa
Contributor

Can you share the Instrumentation object? Also, can you try to reproduce using the resources we have in one of our test folders?
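For comparison, a minimal Java Instrumentation resource looks roughly like this; the name and namespace match the log output above, while the endpoint and resource values are assumptions:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: instrumentation
  namespace: optimus
spec:
  exporter:
    endpoint: http://otel-collector:4317   # assumed collector address
  propagators:
    - tracecontext
    - baggage
  java:
    # requests/limits applied to the injected Java agent init container
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi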

@sergeykad
Author

We found the OOM kill in dmesg (kernel messages) on the host, not in the kubelet (Kubernetes level) and not in containerd (container runtime level). Just to be clear, there was plenty of free memory on the k8s host.

We run the operator with the default configuration, so I guess the problem is there.
We have over 200 services running, and while we configured most of them to be instrumented, the issue can be triggered by restarting a single one.
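For anyone else debugging this, the kernel-level kill can be confirmed on the node itself with something like:

# run on the Kubernetes node, not inside a pod
dmesg -T | grep -i -E 'oom|killed process'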

@iblancasa
Contributor

@sergeykad #3303 (comment)

@sergeykad
Author

I do not know how to reproduce it, but adding resources solved the problem for us.
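For reference, "adding resources" in the chart values looks roughly like this (assuming the chart's manager.resources key; the numbers are illustrative, not the exact values we used):

manager:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: "1"
      memory: 512Mi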
