Configurable machine replacement #10946

Open
Meecr0b opened this issue Jul 26, 2024 · 10 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Meecr0b

Meecr0b commented Jul 26, 2024

What would you like to be added (User Story)?

As an operator, I would like to be able to configure a time after which machines are replaced automatically, for testing and security reasons.

Detailed Description

Problem Statement:

Regularly replacing machines helps test application behavior during rolling updates and ensures machines are refreshed periodically, which is especially important after security incidents.

Proposed Solution:

Implement a rolloutBefore.machineExpiry{Minutes,Hours,Days} parameter within Cluster API (similar to rolloutBefore.certificatesExpiryDays, which is implemented for KCP), allowing users to specify the maximum time a machine may exist before it is automatically replaced.

Benefits:

  • Testing Rolling Updates: Simplifies the process of regularly testing how applications behave during rolling updates.
  • Security and Compliance: Ensures machines are periodically replaced, reducing the risk of lingering vulnerabilities and ensuring machines are clean post-security incidents.
  • Operational Efficiency: Automates a routine maintenance task, reducing manual workload and the potential for human error.

Impact:

  • This feature would be highly valuable for IT operations teams managing Kubernetes clusters, particularly those with strict compliance and security requirements.
  • It enhances cluster maintenance workflows, contributing to overall system reliability and security.

Anything else you would like to add?

Current workarounds:

  • setting spec.rolloutAfter periodically via CronJob for MachineDeployment
  • running clusterctl alpha rollout restart machinedeployment/my-md-0 periodically

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2024
@sbueringer
Member

/triage accepted

/cc @fabriziopandini @chrischdi

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2024
@fabriziopandini
Member

q: is this about replacing nodes (the node at Kubernetes level) or the entire machine where the node is hosted?

@Meecr0b
Author

Meecr0b commented Jul 30, 2024

Hi @fabriziopandini, it's about machines. I'll update the issue.

@fabriziopandini fabriziopandini added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jul 31, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 31, 2024
@fabriziopandini
Member

ACK, thanks for the clarification
We need to think a bit about API modeling, but this is a nice feature to have
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 31, 2024
@fabriziopandini fabriziopandini added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jul 31, 2024
@dineshba

I would like to contribute to this feature. If no one has already started, shall I pick this up? cc: @sbueringer @fabriziopandini @Meecr0b

I can do the initial research and share the API modeling and high-level code changes for a first review.

@sbueringer
Member

@dineshba Feel free to go ahead

@dineshba

/assign dineshba

@dineshba

I would like to share my initial idea and get feedback on the approach

cc: @sbueringer @fabriziopandini (also tagging @ykakarap, as he contributed the MachineDeployment rolloutAfter feature #8216, which is similar to this)

Feature Description

We want to specify a duration in the spec after which machines should be replaced, e.g. MachineExpiry{Minutes,Hours,Days}. This feature should be available both for machines managed by KCP and for machines managed by MachineDeployments/MachineSets.

Related existing features

  • RolloutAfter in MachineDeployment and KCP.
    // RolloutAfter is a field to indicate a rollout should be performed
    // after the specified time even if no changes have been made to the
    // KubeadmControlPlane.
    // Example: In the YAML the time can be specified in the RFC3339 format.
    // To specify the rolloutAfter target as March 9, 2023, at 9 am UTC
    // use "2023-03-09T09:00:00Z".
    // +optional
    RolloutAfter *metav1.Time `json:"rolloutAfter,omitempty"`
  • RolloutBefore.CertificatesExpiryDays in KCP (certificate-based rollout applies only to control plane machines)
type KubeadmControlPlaneSpec struct {
	// RolloutBefore is a field to indicate a rollout should be performed
	// if the specified criteria is met.
	// +optional
	RolloutBefore *RolloutBefore `json:"rolloutBefore,omitempty"`
}

// RolloutBefore describes when a rollout should be performed on the KCP machines.
type RolloutBefore struct {
	// CertificatesExpiryDays indicates a rollout needs to be performed if the
	// certificates of the machine will expire within the specified days.
	// +optional
	CertificatesExpiryDays *int32 `json:"certificatesExpiryDays,omitempty"`
}

API Change Proposal

Option 1: Specify machine expiry under RolloutBefore

  • Add MachineExpiryDays to existing RolloutBefore of KubeadmControlPlaneSpec
  • Add RolloutBefore.MachineExpiryDays for MachineDeploymentSpec

This approach extends the existing API. However, we want to roll out once a machine has reached the specified age, so adding the field under the RolloutBefore struct does not look appropriate. Please share your input.
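
A rough sketch of what Option 1 might look like on the KCP side. The `MachineExpiryDays` field name, its `*int32` type and its JSON tag are assumptions for illustration only, and `parseRolloutBefore` is a hypothetical helper that just shows how the field would be read from a spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RolloutBefore mirrors the existing KCP struct with the proposed field added.
type RolloutBefore struct {
	// CertificatesExpiryDays is the existing field.
	CertificatesExpiryDays *int32 `json:"certificatesExpiryDays,omitempty"`
	// MachineExpiryDays is the proposed field (name, type and tag are assumptions).
	MachineExpiryDays *int32 `json:"machineExpiryDays,omitempty"`
}

// parseRolloutBefore (hypothetical helper) decodes a rolloutBefore JSON object.
func parseRolloutBefore(raw string) (RolloutBefore, error) {
	var rb RolloutBefore
	err := json.Unmarshal([]byte(raw), &rb)
	return rb, err
}

func main() {
	rb, err := parseRolloutBefore(`{"certificatesExpiryDays": 21, "machineExpiryDays": 30}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(*rb.CertificatesExpiryDays, *rb.MachineExpiryDays)
}
```

Because both fields are optional pointers, existing specs that only set certificatesExpiryDays would keep decoding unchanged.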

Option 2: Specify machine expiry under a new struct named Rollout (or a better name)

type KubeadmControlPlaneSpec struct {
	// Rollout indicates different capabilities to
	// roll out the machine when the specified conditions are met.
	Rollout *Rollout `json:"rollout,omitempty"`
}

type Rollout struct {
	// MachineExpiryDays indicates the number of days after which the machine
	// will be rolled out. If current time - creation time > duration,
	// then roll out and replace the expired machine.
	MachineExpiryDays *int `json:"machineexpiry,omitempty"`
}

In this new Rollout struct, we can define MachineExpiryDays. We can add other machine maintenance options to this struct in the future.

Code Changes:

Note: I am new to the cluster-api repo and my logic may not be fully correct. I'll add the complete logic and unit tests, test it manually, and then raise the PR. Please let me know if I am heading in the right direction regarding the API changes and code changes.

@fabriziopandini
Member

On my todo list, but unfortunately not much bandwidth currently 😓

5 participants