
Refactor MCM to leverage controller-runtime #724

Open

Description

How to categorize this issue?

/area quality
/kind enhancement
/priority 2

What would you like to be added:
MCM should be rewritten using the controller-runtime framework; currently we use client-go directly. As part of this we would also remove the in-tree deprecated code #622

Why is this needed:
Some advantages are:

  • fewer lines of code (LOC)
  • automatic metrics registration
  • (more could be added here)

FAQ for controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.12.1/FAQ.md

Brainstorm / Ideas

Small refactoring / changes

  • The node should also be tainted during cordoning before drain, not just marked unschedulable=true (see the sketch below).
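
A minimal sketch of what the combined cordon-and-taint step could look like once MCM uses the controller-runtime client. The taint key used here is illustrative, not an existing MCM constant:

```go
// Sketch only: cordon + taint in one patch, using the controller-runtime client.
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// drainingTaintKey is a hypothetical taint key, not an existing MCM constant.
const drainingTaintKey = "node.machine.sapcloud.io/draining"

func cordonAndTaint(ctx context.Context, c client.Client, node *corev1.Node) error {
	patch := client.MergeFrom(node.DeepCopy())

	// Classic cordon: stop the scheduler from placing new pods here.
	node.Spec.Unschedulable = true

	// Additionally taint the node so that components which ignore
	// .spec.unschedulable also stop scheduling onto it.
	hasTaint := false
	for _, t := range node.Spec.Taints {
		if t.Key == drainingTaintKey {
			hasTaint = true
			break
		}
	}
	if !hasTaint {
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    drainingTaintKey,
			Effect: corev1.TaintEffectNoSchedule,
		})
	}
	return c.Patch(ctx, node, patch)
}
```

Applying both changes in a single patch keeps the step idempotent: re-running it on an already-cordoned, already-tainted node is effectively a no-op.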

TODO

  • Revisit the NotManagedByMCM annotation check: if _, annotationPresent := node.ObjectMeta.Annotations[machineutils.NotManagedByMCM]; annotationPresent
  • isMachineStatusSimilar is confusing, delicate, and easily broken. Replace it with a better implementation. The fragility comes from the Description field mixing arbitrary text with constants such as machineutils.GetVMStatus. Consider alternative approaches such as custom conditions or introducing the operation label as a separate field.
  • The triggerCreationFlow in the MC file pkg/util/provider/machinecontroller/machine.go has logic to check for an Azure-specific stale node. Provider-specific handling should not percolate into the common library code. Re-think this while refactoring.
  • Fix Machine.Status.LastOperation being used in several places as the next operation when Machine.Status.LastOperation.State is MachineStateProcessing. MachineState is also a bad name: it actually means MachineOperationState. Instead of computing and recording the next operation, the reconciler should derive it from current conditions. Consider introducing new conditions and leveraging the gardener flow library for conditional execution.
  • After setting the machine phase to Terminating, if we get stuck in controller.getVMStatus under case codes.Unknown, codes.DeadlineExceeded, codes.Aborted, codes.Unavailable:, we keep repeating getVMStatus with a short retry. Exponential backoff should be preferred here (see the rate-limiter sketch after this list).
  • In controller.getVMStatus, consider failing permanently on Unimplemented or on errors decoding the statuses returned by the Driver, instead of just re-queuing. Check all Driver.GetMachineStatus implementations across providers before doing this.
  • In controller.drainNode, revise the logic for skipping the drain when the read-only filesystem condition is true.
  • controller.deleteVM loops forever when status decoding fails.
  • controller.deleteMachineFinalizers(ctx, machine) need not return a retry period, since the caller ignores it.
  • controller.triggerUpdationFlow is currently unused. Study it and consider removing it.
  • drain.Options.getPodsForDeletion should be simplified.
  • In drain.Options.evictPod we should stop using the deprecated policy/v1beta1 client and move to the GA policy/v1 Eviction API (see the eviction sketch after this list).
  • Providers
  • Re-look at "Provide a way to initialize out-of-tree providers" #513
  • [Edge Case] Creation/Health Timeout not respected #811
  • Make creation and deletion of the resources associated with a single machine robust. One idea is to retry creation of all resources (NIC, disks, VM) for a machine until the machine contract/spec changes; at that point, mark the sub-resources (NIC, disks) that were already created as stale/orphaned, to be removed by the orphan collector. This avoids issuing multiple delete calls for each resource when its creation fails, since deletion of that resource can fail as well. We have seen this on Azure many times.
  • For better tracing, every request made to the Driver must carry a unique request-ID, which is then used as a correlation-ID for one or more provider API calls. For example, deleting a machine on the Azure provider involves several different AZ API calls; to easily diagnose why the Driver call failed, all of those AZ API calls need to be correlated, which a request-ID used as a correlation-ID achieves. This will require changes to the API request/response structs (see the context-propagation sketch after this list).
  • Error handling is not up to the mark. We use util/provider/machinecodes/status, but it currently loses the underlying error, and there is no constructor that accepts both an error and a code. The custom error type needs to be enhanced (see the sketch after this list).
  • Re-look at the need for multiple machine-sets per machine-deployment. Is there a real need for this complication, or could it be drastically simplified?
  • Static distribution of the total min replicas across machine-deployments should not be done at the time of creation or change of worker pools at the shoot level.
  • Create a state machine to handle deterministic transitions between VM states (a minimal transition-table sketch follows this list).
  • Right now the MachineClass is validated repeatedly by the mcm-provider-* driver implementations, even when the MachineClass has not changed. This is sub-optimal. We should ideally add a validating webhook for MachineClass that performs the validation before the object is persisted to etcd; the majority of the validations can live in that webhook. Gardener extensions have validating webhooks today, but they currently do not validate the ProviderConfig. The current view is that we should not link our webhooks to the extension webhooks, as that would tie us directly to g/g. To support use cases where customers choose to run MCM independently of g/g, shoot validation (done in extension webhooks) should not be relied upon; instead, MCM should create its own validating webhooks, invoked in addition to the ones defined in gardener extensions (see the webhook sketch after this list).
  • Provide a generic way to inject faults. This would allow running different scenarios by simulating provider responses.
  • Rolling updates do not time out today. Customers typically have a limited window within which they wish to complete an update. During a rolling update we cordon the old nodes to prevent new workloads from being scheduled onto them; it is also possible that, due to insufficient capacity, CA cannot launch new nodes. Customers are then unable to schedule workloads, because many old nodes have been cordoned off while new nodes cannot come up. We need a way to end the update process after a defined timeout.
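
On the short-retry loop in controller.getVMStatus: once MCM is on controller-runtime, per-item exponential backoff can come from the controller's workqueue rate limiter rather than a fixed requeue interval. A minimal sketch against controller-runtime v0.12 (the delay bounds are illustrative):

```go
// Sketch only: exponential per-item backoff via the controller's rate limiter,
// written against controller-runtime v0.12; the delay bounds are illustrative.
package machinecontroller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

func addMachineController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Machine{}).
		WithOptions(controller.Options{
			// Failed reconciles are retried at 5ms, 10ms, 20ms, ...,
			// capped at 5m, instead of a fixed short retry. Returning an
			// error from Reconcile is enough to trigger this backoff.
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
				5*time.Millisecond, 5*time.Minute),
		}).
		Complete(r)
}
```

With this in place, the Unknown/DeadlineExceeded/Aborted/Unavailable cases can simply return an error and let the queue pace the retries.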
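On drain.Options.evictPod: a minimal sketch of eviction through the GA policy/v1 API, which replaces the deprecated policy/v1beta1 client (removed in Kubernetes 1.25):

```go
// Sketch only: eviction through the GA policy/v1 API.
package drain

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evictPod(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	// An eviction respects PodDisruptionBudgets, unlike a plain delete.
	return cs.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	})
}
```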
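On the request-ID / correlation-ID item: one lightweight interim shape is to mint the ID per Driver call and carry it via the context. The helpers below are illustrative, and per the item above the ID should eventually move into the Driver request/response structs:

```go
// Sketch only: carrying a per-request correlation-ID through the context.
// The key and helpers are illustrative; the issue text suggests the ID
// should ultimately live in the Driver request/response structs.
package driver

import (
	"context"

	"github.com/google/uuid"
)

type requestIDKey struct{}

// WithRequestID returns a context carrying a fresh request-ID; call it once
// at the start of every Driver method (CreateMachine, DeleteMachine, ...).
func WithRequestID(ctx context.Context) context.Context {
	return context.WithValue(ctx, requestIDKey{}, uuid.NewString())
}

// RequestID extracts the correlation-ID so that every provider API call
// (e.g. each AZ call during an Azure machine deletion) can log it.
func RequestID(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}
```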
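On the error-handling item: a sketch of what an enhanced machinecodes/status error could look like, preserving the underlying error and offering the missing code-plus-error constructor. The shape is illustrative; the real status package would need to stay compatible with existing callers:

```go
// Sketch only: an enhanced status error that keeps the underlying error and
// can be constructed from a code plus an error.
package status

import (
	"fmt"

	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
)

type Status struct {
	code  codes.Code
	msg   string
	cause error // the underlying error, which is lost today
}

func (s *Status) Error() string {
	return fmt.Sprintf("machine codes error: code = [%v] message = [%s]", s.code, s.msg)
}

// Unwrap lets errors.Is / errors.As reach the original provider error.
func (s *Status) Unwrap() error { return s.cause }

// FromError is the missing constructor that accepts an error and a code.
func FromError(code codes.Code, err error) *Status {
	return &Status{code: code, msg: err.Error(), cause: err}
}
```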
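On the VM state machine item: one simple shape is an explicit transition table consulted before any state change. The states below are placeholders, not existing MCM phases:

```go
// Sketch only: an explicit transition table the reconciler consults before
// changing VM state; the state names are illustrative placeholders.
package statemachine

import "fmt"

type VMState string

const (
	StateCreating VMState = "Creating"
	StateRunning  VMState = "Running"
	StateDeleting VMState = "Deleting"
	StateDeleted  VMState = "Deleted"
)

// allowed enumerates every legal transition; anything absent is a bug.
var allowed = map[VMState][]VMState{
	StateCreating: {StateRunning, StateDeleting},
	StateRunning:  {StateDeleting},
	StateDeleting: {StateDeleted},
}

// Transition validates a state change instead of mutating state ad hoc
// throughout the reconciler.
func Transition(from, to VMState) (VMState, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal VM state transition %s -> %s", from, to)
}
```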
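On the MachineClass validating webhook item: a sketch of an MCM-owned webhook built on controller-runtime v0.12's CustomValidator interface. The validateProviderConfig hook is hypothetical and would be supplied by each mcm-provider-*:

```go
// Sketch only: a MachineClass validating webhook owned by MCM itself,
// using controller-runtime v0.12's CustomValidator.
package webhook

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

type machineClassValidator struct {
	// validateProviderConfig is hypothetical: provider-specific
	// ProviderConfig validation plugged in by each mcm-provider-*.
	validateProviderConfig func(*v1alpha1.MachineClass) error
}

func (v *machineClassValidator) ValidateCreate(ctx context.Context, obj runtime.Object) error {
	return v.validate(obj)
}

func (v *machineClassValidator) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) error {
	return v.validate(newObj)
}

func (v *machineClassValidator) ValidateDelete(ctx context.Context, obj runtime.Object) error {
	return nil // nothing to validate on delete
}

func (v *machineClassValidator) validate(obj runtime.Object) error {
	mc, ok := obj.(*v1alpha1.MachineClass)
	if !ok {
		return fmt.Errorf("expected a MachineClass but got %T", obj)
	}
	return v.validateProviderConfig(mc)
}

// SetupWithManager registers the webhook with the manager.
func SetupWithManager(mgr ctrl.Manager, v *machineClassValidator) error {
	return ctrl.NewWebhookManagedBy(mgr).
		For(&v1alpha1.MachineClass{}).
		WithValidator(v).
		Complete()
}
```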
