Contribute to Gardener Node Agent for exposing metrics to gain better visibility of node joining timeouts. #837

Open

Description

How to categorize this issue?

/area monitoring
/kind enhancement
/priority 3

What would you like to be added:
With the introduction of gardener-node-agent, a controller-runtime-based Go implementation of the cloud-config downloader, it may be possible to get more insight into what happens while a node is processed as it joins the cluster, or fails to join it.
This can help us isolate whether the timeouts originate at the infrastructure layer or whether something goes wrong during node processing within the Kubernetes runtime.

This may require exposing some metrics from the node-agent, or enhancing its logging so that targeted queries against its logs can identify node-joining issues.
This would let MCM operators identify such issues more deterministically than is possible today.

Why is this needed:
Currently it is often hard to analyze and identify why a node has not joined within the default 20-minute timeout window.
All we have in the logs is the following:
Machine shoot--<project-name>--<shoot-name>-<worker-pool>-<zone>-865f7-zggql failed to join the cluster in 20m0s minutes.
If the issue persists, the current approach to identifying what went wrong starts with the FAQ entry #my-machine-is-not-joining-the-cluster-why, and may also require checking the status of the respective instance on the infrastructure to ascertain whether it was created successfully but failed to join the cluster.

This is currently a time-consuming task that requires fair knowledge of MCM internals to ascertain the root cause.

Metadata

Assignees

No one assigned

    Labels

    area/monitoring: Monitoring (including availability monitoring and alerting) related
    kind/enhancement: Enhancement, improvement, extension
    lifecycle/stale: Nobody worked on this for 6 months (will further age)
    priority/3: Priority (lower number equals higher priority)