
[processor/k8sattribute] Handle resource deletion on DeletedFinalStateUnknown #27632

Closed
c-kruse opened this issue Oct 11, 2023 · 5 comments

Labels: bug (Something isn't working), priority:p2 (Medium), processor/k8sattributes (k8s Attributes processor)

c-kruse (Contributor) commented Oct 11, 2023

Component(s)

processor/k8sattributes

What happened?

Description

We develop a similar component in the Sumo Logic collector distribution, where this bug was first discovered. The same issue exists here as well, although it is comparatively benign: it leaks a small amount of memory rather than panicking, as was the case in our processor.

Related issue: SumoLogic/sumologic-otel-collector#1277

Relevant doc snippet from pkg.go.dev/k8s.io/client-go:

OnDelete will get the final state of the item if it is known, otherwise it will get an object of type DeletedFinalStateUnknown. This can happen if the watch is closed and misses the delete event and we don't notice the deletion until the subsequent re-list.
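To make the failure mode concrete, here is a minimal sketch (illustrative only, not the processor's actual code) of a delete handler that only type-asserts to *api_v1.Pod. When the informer delivers a cache.DeletedFinalStateUnknown tombstone after a missed delete, the assertion fails, the handler logs and returns, and the cached entry is never evicted:

```go
package main

import (
	"fmt"

	api_v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// handlePodDelete sketches the problematic pattern: it only recognizes
// *api_v1.Pod, so a DeletedFinalStateUnknown tombstone is rejected and the
// corresponding cache entry is never cleaned up.
func handlePodDelete(obj interface{}) {
	pod, ok := obj.(*api_v1.Pod)
	if !ok {
		fmt.Printf("object received was not of type api_v1.Pod: %T\n", obj)
		return // the entry for this pod leaks
	}
	fmt.Println("evicting pod", pod.Namespace+"/"+pod.Name)
}

func main() {
	// Simulate what client-go hands the handler after a re-list that
	// notices a delete it missed while the watch was broken.
	handlePodDelete(cache.DeletedFinalStateUnknown{
		Key: "default/example-pod",
		Obj: &api_v1.Pod{},
	})
}
```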

Steps to Reproduce

Running the collector in a k8s cluster, introduce a network partition between the kube-apiserver and the collector. Wait long enough for the open watch connections to break, then restore connectivity. You should see log entries for this and similar cases: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/k8sattributesprocessor/internal/kube/client.go#L236

Happy to share details on the setup I used to reproduce this on our distro, but it likely won't be worth much more than inspiration for y'all.

Expected Result

I think the resource should be deleted as usual.

Actual Result

An error is logged and the resource is never removed from the processor's cache.

Collector version

v0.86.0, v0.85.0

Environment information

N/A

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

c-kruse added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Oct 11, 2023
github-actions commented

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 added the processor/k8sattributes (k8s Attributes processor) label on Oct 11, 2023
github-actions commented

Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth added the priority:p2 (Medium) label and removed the needs triage (New item requiring triage) label on Oct 11, 2023
TylerHelmuth (Member) commented

@c-kruse thanks for opening this issue. Do you plan to submit a fix?

c-kruse (Contributor, Author) commented Oct 11, 2023

@TylerHelmuth I would be glad to. May be a day or two out.

dmitryax pushed a commit that referenced this issue on Oct 19, 2023 (#27847):

**Description:**

The k8s go client's cache expects OnDelete handlers to handle objects of
type DeletedFinalStateUnknown when the cache's watch mechanism misses a
delete and notices later. This changes the processor to handle such
deletes as if they were normal, rather than logging an error and
dropping the change.

**Link to tracking Issue:**
#27632

**Testing:**

Only what you see in the unit tests. I am open to suggestions, but I
don't see this being a code path we can reasonably cover in the e2e test
suite.

Verified manually on a local kind cluster.
* Stood up two deployments loosely based on e2e testing resources, one
with a collector built from this branch and the other with
docker.io/otel/opentelemetry-collector-contrib:latest.
* Both included an additional container in the collector pod that I used to
fiddle with iptables rules.
* Added rules to reject traffic to/from the kube API server.
* Deleted some namespaces containing deployments generating telemetry.
* Restored connectivity by removing the iptables rules.
* Observed that the collector built from this branch was silent (aside from
the junk the k8s client logs due to the broken connection).
* Observed that the latest
([0.87.0](https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.87.0/images/sha256-77cdd395b828b09cb920c671966f09a87a40611aa6107443146086f2046f4a9a?context=explore))
collector logged a handful of errors for the deleted resources
(api_v1.Pod and apps_v1.ReplicaSet; I probably just didn't wait long
enough for Namespace).

```
2023-10-19T02:18:37.781Z        error   kube/client.go:236      object received was not of type api_v1.Pod      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creat
ionTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen
:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor@v0.87.0/internal/kube/client.go:236
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/client-go@v0.28.2/tools/cache/controller.go:253
...

2023-10-19T02:19:03.970Z        error   kube/client.go:868      object received was not of type apps_v1.ReplicaSet      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor@v0.87.0/internal/kube/client.go:868
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/client-go@v0.28.2/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        k8s.io/client-go@v0.28.2/tools/cache/shared_informer.go:979
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
...
```
**Documentation:** N/A - it is not clear to me whether or not this
should land on the changelog. Its impact on users is marginal.

Signed-off-by: Christian Kruse <ctkruse99@gmail.com>
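The change described in the commit message above amounts to unwrapping the tombstone before the usual type assertion. A minimal sketch of that pattern follows; ignoreDeletedFinalStateUnknown and podDeleted are illustrative names, not necessarily the processor's actual identifiers:

```go
package main

import (
	"fmt"

	api_v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// ignoreDeletedFinalStateUnknown unwraps the tombstone client-go delivers
// when the watch missed the delete event, returning the last known object so
// the caller can treat the delete as if it had been observed normally.
func ignoreDeletedFinalStateUnknown(obj interface{}) interface{} {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		return tombstone.Obj
	}
	return obj
}

// handlePodDelete shows the fixed shape: unwrap first, then assert and evict.
func handlePodDelete(obj interface{}) {
	obj = ignoreDeletedFinalStateUnknown(obj)
	pod, ok := obj.(*api_v1.Pod)
	if !ok {
		fmt.Printf("object received was not of type api_v1.Pod: %T\n", obj)
		return
	}
	podDeleted(pod)
}

// podDeleted stands in for the real cache-eviction logic.
func podDeleted(pod *api_v1.Pod) {
	fmt.Println("deleted pod", pod.Namespace+"/"+pod.Name)
}

func main() {
	// A tombstone now results in a normal eviction instead of an error log.
	handlePodDelete(cache.DeletedFinalStateUnknown{
		Key: "src1/telemetrygen-patched-766d55cbcb-8zktr",
		Obj: &api_v1.Pod{},
	})
}
```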
martin-majlis-s1 pushed a commit to scalyr/opentelemetry-collector-contrib that referenced this issue Oct 20, 2023 (the same change, open-telemetry#27847).
sigilioso pushed a commit to carlossscastro/opentelemetry-collector-contrib that referenced this issue Oct 27, 2023 (the same change, open-telemetry#27847).
dmitryax (Member) commented

Fixed by #27847

jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023 (the same change, open-telemetry#27847).