Description
In #288, we changed NPD to run custom plugins on startup. I hoped this would let NPD always report an event immediately after a cluster is created, no matter how large invoke_interval is.
However, this does not always work because of how NPD interacts with kube-apiserver. This is what I observed during cluster creation:
1. NPD started, invoked the custom plugin immediately, and sent an event to kube-apiserver.
2. The event could not be sent because kube-apiserver was not running yet. The event library will retry sending the event:
   Unable to write event: 'Post https://x.x.x.x/api/v1/namespaces/default/events: dial tcp 34.68.6.201:443: connect: connection refused' (may retry after sleeping)
3. kube-apiserver started.
4. The event was re-sent to kube-apiserver, but this time it was rejected without further retry because of a permission error:
   events is forbidden: User "system:node-problem-detector" cannot create resource "events" in API group "" in the namespace "default"' (will not retry!)
5. https://github.com/kubernetes/kubernetes/blob/c8b45cd25c18e65798dde49fc7011495ea6021d5/cluster/gce/gci/configure-helper.sh#L568 was called to set up the permission.
There is a small window between (3) and (5): if the event is rejected during that interval, it will never be resent.
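To make the race concrete, here is a minimal Go simulation of the sequence described above. It is only a sketch: `serverState`, `post`, and `sendEvent` are hypothetical names made up for illustration, not the real event library code. The one behavior it takes from the logs above is that a connection error is retried while a permission error is not.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values standing in for the two failures seen in the logs.
var (
	errConnRefused = errors.New("connection refused")
	errForbidden   = errors.New("events is forbidden")
)

// serverState models kube-apiserver at one point of the cluster bring-up.
type serverState int

const (
	notRunning      serverState = iota // before kube-apiserver starts
	runningNoRBAC                      // started, permission not set up yet
	runningWithRBAC                    // started, permission set up
)

// post simulates one attempt to create the event against the given state.
func post(s serverState) error {
	switch s {
	case notRunning:
		return errConnRefused
	case runningNoRBAC:
		return errForbidden
	default:
		return nil
	}
}

// sendEvent makes one attempt per entry in timeline. It retries after a
// connection error and gives up permanently on a permission error.
func sendEvent(timeline []serverState) (delivered bool, attempts int) {
	for _, s := range timeline {
		attempts++
		err := post(s)
		if err == nil {
			return true, attempts
		}
		if errors.Is(err, errForbidden) {
			return false, attempts // dropped: will not retry
		}
		// Connection error: fall through and retry on the next state.
	}
	return false, attempts
}

func main() {
	// The retry lands inside the window: the event is lost.
	d, a := sendEvent([]serverState{notRunning, runningNoRBAC, runningWithRBAC})
	fmt.Printf("delivered=%v attempts=%d\n", d, a) // delivered=false attempts=2

	// The retry lands after the permission is set up: the event is delivered.
	d, a = sendEvent([]serverState{notRunning, notRunning, runningWithRBAC})
	fmt.Printf("delivered=%v attempts=%d\n", d, a) // delivered=true attempts=3
}
```

Whether the event survives depends only on where the retry falls relative to the permission setup, which is why the failure is intermittent.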
Changing the event library to always retry on permission errors may or may not make sense. What we can do in NPD, however, is introduce a configurable initial_delay for custom plugins. In this case, I could set it to 1m while keeping invoke_interval at 6h, so the plugin would first run 1m after NPD starts.