Description
opened on Aug 11, 2022
Context
While investigating PR 32539, I noticed that there might be room for performance improvements in the Kubernetes module of metricbeat.
Each metricbeat instance stores metrics about all the nodes in the Kubernetes cluster, but only stores metrics about pods and containers on the node where that instance is running. This replicates how the previous expiring cache worked, but it is now more evident and can have a detrimental effect in clusters with many nodes, where we might end up wasting a lot of memory on unused metrics from other nodes. This behaviour is due to how the watcher notifies events from Kubernetes, and it wasn't modified by the aforementioned PR.
A possible solution is for each metricbeat instance to filter out events generated by nodes other than the one where it is running. This should simplify the MetricRepo API, since we would no longer need to handle the deletion of nodes, only events from Pods and Containers.
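To make the filtering idea concrete, here is a rough sketch of what I have in mind. It is not the actual watcher code: `podEvent`, `isLocal` and `filterLocal` are hypothetical names made up for illustration, and the only Kubernetes detail assumed is that a scheduled pod carries the name of its node (`spec.nodeName`).

```go
package main

import "fmt"

// podEvent is a hypothetical, simplified stand-in for an event delivered by
// the watcher. It is NOT a type from metricbeat or client-go.
type podEvent struct {
	PodName  string
	NodeName string // node the pod is scheduled on (spec.nodeName in Kubernetes)
}

// isLocal reports whether the event belongs to the node this instance runs on.
// Events without a node name (e.g. pods not scheduled yet) are treated as
// non-local in this sketch.
func isLocal(ev podEvent, localNode string) bool {
	return ev.NodeName != "" && ev.NodeName == localNode
}

// filterLocal keeps only the events generated on localNode, preserving order.
func filterLocal(events []podEvent, localNode string) []podEvent {
	var kept []podEvent
	for _, ev := range events {
		if isLocal(ev, localNode) {
			kept = append(kept, ev)
		}
	}
	return kept
}

func main() {
	events := []podEvent{
		{PodName: "web-1", NodeName: "node-a"},
		{PodName: "web-2", NodeName: "node-b"},
		{PodName: "pending-1", NodeName: ""},
	}
	// Only events from node-a would reach the metrics store.
	for _, ev := range filterLocal(events, "node-a") {
		fmt.Println(ev.PodName)
	}
}
```

With a check like this in front of the store, events from other nodes never reach it, so node deletion would not need separate handling. Whether the filter is better applied in the event handler or pushed down into the watcher itself is something I have not looked into.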
During the same investigation, I noticed that when a Pod is deleted, it first calls the `update` function (to add its metrics again), only to be deleted a few seconds later. I am not sure if this is intended, since the status of the pod is already `Terminating`. I also noticed that the call to `deletePod` is executed twice. This might be because there is more than one watcher, or because the code is shared between multiple metricsets.
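If it turns out that the extra update is not intended, one option could look like the sketch below. It is only an illustration under assumptions: `pod`, `store`, `newStore` and `markTerminating` are hypothetical and do not mirror the real MetricRepo code. The one Kubernetes fact it relies on is that a terminating pod has `metadata.deletionTimestamp` set. The sketch skips updates for such pods and makes the delete tolerant of being called twice.

```go
package main

import (
	"fmt"
	"time"
)

// pod is a hypothetical, minimal view of a Kubernetes pod for this sketch.
type pod struct {
	UID               string
	DeletionTimestamp *time.Time // non-nil once the pod is terminating
}

// store is a hypothetical stand-in for the metrics store.
type store struct {
	metrics map[string]float64
}

func newStore() *store {
	return &store{metrics: make(map[string]float64)}
}

// markTerminating returns a copy of p with a deletion timestamp set.
func markTerminating(p pod) pod {
	now := time.Now()
	p.DeletionTimestamp = &now
	return p
}

// update stores the value unless the pod is already terminating.
// It returns true if the value was stored.
func (s *store) update(p pod, value float64) bool {
	if p.DeletionTimestamp != nil {
		return false // skip: the pod is about to be deleted anyway
	}
	s.metrics[p.UID] = value
	return true
}

// deletePod removes the pod's metrics. Calling it again for the same UID is a
// harmless no-op; the return value says whether anything was removed.
func (s *store) deletePod(uid string) bool {
	_, existed := s.metrics[uid]
	delete(s.metrics, uid)
	return existed
}

func main() {
	s := newStore()
	p := pod{UID: "uid-1"}
	fmt.Println(s.update(p, 1.5))                  // stored
	fmt.Println(s.update(markTerminating(p), 2.0)) // skipped
	fmt.Println(s.deletePod("uid-1"))              // removed
	fmt.Println(s.deletePod("uid-1"))              // second call does nothing
}
```

The boolean returned by `deletePod` could also be logged while debugging, to confirm whether the second call comes from a second watcher or from another metricset.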