Do not report problems until kube-apiserver is ready

In https://github.com/kubernetes/node-problem-detector/pull/288, we changed NPD to run custom plugins on startup. I hoped this would let NPD always report an event immediately after the cluster is created, no matter how large `invoke_interval` is.

However, this does not always work because of how it interacts with kube-apiserver. This is what I observed during cluster creation:

1. NPD started and invoked the custom plugin immediately, and then sent an event to kube-apiserver.
1. The event failed to send because kube-apiserver was not running yet. The event library will retry sending the event.
    `Unable to write event: 'Post https://x.x.x.x/api/v1/namespaces/default/events: dial tcp 34.68.6.201:443: connect: connection refused' (may retry after sleeping)`
1. kube-apiserver started.
1. The event was re-sent to kube-apiserver but was rejected this time without further retry because of a permission error:
    `events is forbidden: User "system:node-problem-detector" cannot create resource "events" in API group "" in the namespace "default"' (will not retry!)`
1. https://github.com/kubernetes/kubernetes/blob/c8b45cd25c18e65798dde49fc7011495ea6021d5/cluster/gce/gci/configure-helper.sh#L568 was called to set up the permission.

There is a small window between steps 3 and 5: if the event is rejected during that interval, it will never be resent.
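To make the window concrete, here is a minimal Go sketch of the retry behavior implied by the two log messages above. It is a simplified model, not the event library's actual code: `errConnRefused`, `errForbidden`, `retryable`, and `sendWithRetry` are hypothetical names, and the policy (retry on connection errors, give up on a forbidden response) is an assumption drawn from the "may retry after sleeping" and "will not retry!" suffixes.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the two failures seen in the logs.
var (
	errConnRefused = errors.New("connection refused")
	errForbidden   = errors.New("events is forbidden")
)

// retryable models the assumed policy: only connection errors are retried.
func retryable(err error) bool {
	return errors.Is(err, errConnRefused)
}

// sendWithRetry replays a scripted sequence of server responses, one per
// attempt, and stops at the first success or non-retryable error.
func sendWithRetry(responses []error, maxTries int) (delivered bool, attempts int) {
	for attempts < maxTries && attempts < len(responses) {
		err := responses[attempts]
		attempts++
		if err == nil {
			return true, attempts
		}
		if !retryable(err) {
			return false, attempts
		}
	}
	return false, attempts
}

func main() {
	// apiserver down, then up without the permission, then fully set up.
	timeline := []error{errConnRefused, errForbidden, nil}
	delivered, attempts := sendWithRetry(timeline, 12)
	fmt.Println(delivered, attempts) // prints "false 2"
}
```

In this model the third attempt would have succeeded, but it is never made because the second attempt lands inside the window and is treated as final.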

Changing the event library to always retry on permission errors may or may not make sense. What we can do in NPD, though, is introduce a configurable `initial_delay` for custom plugins. In this case, I could set it to 1m while keeping `invoke_interval` at 6h, so the plugin would first run 1m after NPD starts.

/cc @wangzhen127 @Random-Liu 

Do not report problems until kube-apiserver is ready #295
