opened on May 25, 2023
How to categorize this issue?
/area quality robustness
/kind enhancement
/priority 3
What would you like to be added:
A way to temporarily prevent a node from being deleted. For example, when we cordon/drain a node to investigate it, it sometimes gets deleted automatically because it is not healthy. It would be really useful to be able to keep a node alive in order to investigate it and find the root cause of a given problem.
It could be something like an annotation added to the `node` resource (ideally not the `machine` resource, since shoot owners might also find this useful). I also think there should be a second annotation carrying something like a timeout threshold (which can be increased if need be), so that people don't forget a node in that state.
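To make the proposal concrete, here is a minimal sketch of how a controller might evaluate such a pair of annotations. The annotation keys `example.com/preserve` and `example.com/preserve-until`, the function name `is_preserved`, and the ISO 8601 deadline format are all assumptions made up for illustration; nothing here reflects an existing API.

```python
from datetime import datetime, timezone

# Hypothetical annotation keys, invented for this sketch.
PRESERVE_KEY = "example.com/preserve"
PRESERVE_UNTIL_KEY = "example.com/preserve-until"


def is_preserved(annotations, now):
    """Return True if the node should be kept alive at time `now`.

    `annotations` is the node's annotation map (str -> str) and `now`
    is a timezone-aware datetime.
    """
    if annotations.get(PRESERVE_KEY) != "true":
        return False
    raw_deadline = annotations.get(PRESERVE_UNTIL_KEY)
    if raw_deadline is None:
        # No deadline given: refuse to preserve, so a node cannot be
        # forgotten in the preserved state.
        return False
    try:
        deadline = datetime.fromisoformat(raw_deadline)
    except ValueError:
        # Malformed deadline: treat the node as not preserved.
        return False
    if deadline.tzinfo is None:
        # Assume UTC when the deadline carries no offset.
        deadline = deadline.replace(tzinfo=timezone.utc)
    return now < deadline
```

Requiring a deadline is one possible design choice: extending the investigation window then means updating the deadline annotation, and a node whose deadline has passed becomes eligible for deletion again.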
**Update: 2 Aug meeting with Etienne**
Investigation would be needed in the following phases:
- `Pending` (cases where the machine is not joining)
- `Unknown` machine (pods not working, so cordon/drain the node and then inspect)
- `Running` machine (pods not working, but the machine is `Running`, probably because the issue couldn't be tracked through a node condition)

`Terminating` WON'T need any investigation, as the resources are in the deletion phase and could have been partly deleted by the time the machine is marked to be ignored from deletion.
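The phase rule from the meeting notes can be sketched as a simple eligibility check. The phase names come from the notes above; the set name and the function `may_preserve` are hypothetical and only illustrate the intended behaviour.

```python
# Phases in which preserving a machine for investigation makes sense,
# per the meeting notes. Terminating is deliberately excluded because
# its resources may already be partly deleted.
INVESTIGABLE_PHASES = {"Pending", "Unknown", "Running"}


def may_preserve(phase):
    """Return True if a machine in `phase` may be marked for preservation."""
    return phase in INVESTIGABLE_PHASES
```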
Why is this needed:
This would be useful for troubleshooting nodes that suddenly stop working as expected (for RCA purposes).