Contribute to `Gardener Node Agent` for exposing metrics to gain better visibility of node joining timeouts.

**How to categorize this issue?**

/area monitoring
/kind enhancement
/priority 3

**What would you like to be added**:
With the introduction of [`gardener-node-agent`](https://github.com/gardener/gardener/issues/8023), a controller-runtime based Go implementation of the cloud-config downloader, it might be possible to gain more insight into what happens while a node is processed as it joins, or fails to join, the cluster.
This can help us isolate whether the timeouts happen at the infra layer or whether something goes wrong during node processing within the Kubernetes runtime.

This may require exposing some metrics from the node-agent, or enhancing its logging so that directed queries against its logs can identify node joining issues.
This will make it easier for MCM operators to identify such issues with more determinism than is possible today.

**Why is this needed**:
Currently we often struggle to analyze and identify why a node hasn't joined within the default timeout window of `20mins`.
All we have in the logs is the following:
`Machine shoot--<project-name>--<shoot-name>-<worker-pool>-<zone>-865f7-zggql failed to join the cluster in 20m0s minutes.`
If the issue persists, the current approach to identifying what has gone wrong is to start with the FAQ entry [#my-machine-is-not-joining-the-cluster-why](https://gardener.cloud/docs/other-components/machine-controller-manager/faq/#my-machine-is-not-joining-the-cluster-why). It might also require exploring the infra to check the status of the respective instance and ascertain whether it was created successfully but failed to join the cluster.

This is currently a time-consuming task that requires fair knowledge of MCM internals to ascertain the root cause.


Contribute to `Gardener Node Agent` for exposing metrics to gain better visibility of node joining timeouts. #837

ashwani2k opened on Jul 31, 2023
