Conversation

@damemi damemi commented Jan 4, 2022

The scheduler's Policy API was removed in Kubernetes 1.23; see kubernetes/kubernetes#105828

As an alternative, we could keep using the default profile (lownodeutilization) here so the scheduler does not break on upgrades, and log a message that the default profile is being used instead. Open to ideas.
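
A minimal sketch of the error-return approach, for context. The local type and function here are hypothetical illustrations; only the .spec.policy.name field (config.Spec.Policy.Name in the diff further down) comes from the actual change:

```go
package main

import "fmt"

// schedulerConfig mirrors the relevant part of the scheduler/cluster spec
// (a local illustrative type, not the operator's actual type).
type schedulerConfig struct {
	PolicyName string // corresponds to .spec.policy.name
}

// validatePolicy returns an error when the removed Policy API is still
// referenced, rather than silently falling back to the default profile.
func validatePolicy(cfg schedulerConfig) error {
	if len(cfg.PolicyName) > 0 {
		return fmt.Errorf("the scheduler Policy API was removed in Kubernetes 1.23; clear .spec.policy.name (currently %q)", cfg.PolicyName)
	}
	return nil
}

func main() {
	if err := validatePolicy(schedulerConfig{PolicyName: "my-policy"}); err != nil {
		fmt.Println(err) // the operator would report this and go Degraded
	}
}
```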

@openshift-ci openshift-ci bot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jan 4, 2022

openshift-ci bot commented Jan 4, 2022

@damemi: This pull request references Bugzilla bug 2033751, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 2033751: Return Error when trying to use Scheduler Policy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 4, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2022

@soltysh soltysh left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2022

damemi commented Jan 4, 2022

/hold
discussing whether this error is the right approach

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 4, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2022
@ingvagabund

/retest

configMap.Data["forceRedeploymentReason"] = operatorSpec.ForceRedeploymentReason
configMap.Data["version"] = version.Get().String()
appliedConfigMap, changed, err := resourceapply.ApplyConfigMap(ctx, configMapsGetter, recorder, configMap)
if changed && len(config.Spec.Policy.Name) > 0 {

With manageKubeSchedulerConfigMap_v311_00_to_latest returning an error when len(config.Spec.Policy.Name) > 0, the kube-scheduler container will keep crash looping, since the configmap with the profile will be missing (assuming the scheduler/cluster object had .spec.policy.name set during cluster provisioning). Did you take this case into account? Would it make more sense to move the len(config.Spec.Policy.Name) > 0 check alongside the config, err := configSchedulerLister.Get("cluster") line at the beginning of the managePod_v311_00_to_latest function and return an error there as well, so the pod does not get created and left crash looping until the policy field is cleared?

This case covers the bootstrapping phase, in which the installation fails when .spec.policy.name is not empty. It is unlikely an admin would update the scheduler/cluster object rather than run the installation again, so the net benefit for a normal installation is quite low. On the other hand, in the HyperShift topology (one cluster hosting many control planes), postponing the creation of the kube-scheduler pod might save the step of debugging why the pod is crash looping (depending on how the control plane is provisioned). A sketch of the suggested placement follows below.
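
For illustration, a sketch of the early check proposed above. The SchedulerLister comes from openshift/client-go, and configSchedulerLister.Get("cluster") and the Policy.Name check come from the comment; the wrapper function, its name, and its signature are assumptions:

```go
package operator

import (
	"fmt"

	configlistersv1 "github.com/openshift/client-go/config/listers/config/v1"
)

// checkPolicyBeforePodSync sketches the placement suggested above: validate
// the scheduler/cluster object at the beginning of the pod-managing sync
// (managePod_v311_00_to_latest) so the kube-scheduler pod is never created
// while the removed Policy API is still configured.
func checkPolicyBeforePodSync(configSchedulerLister configlistersv1.SchedulerLister) error {
	config, err := configSchedulerLister.Get("cluster")
	if err != nil {
		return err
	}
	if len(config.Spec.Policy.Name) > 0 {
		// Returning here keeps the pod from being created and crash
		// looping until .spec.policy.name is cleared; the operator goes
		// Degraded instead.
		return fmt.Errorf("scheduler Policy API was removed in Kubernetes 1.23; .spec.policy.name=%q must be cleared", config.Spec.Policy.Name)
	}
	return nil
}
```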

My comment is more about debugging than about functional/conceptual concerns. The operator will go Degraded when the policy field is set.

@ingvagabund

/hold cancel
/lgtm

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 19, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2022

openshift-ci bot commented Jan 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, ingvagabund, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [damemi,ingvagabund,soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ingvagabund

/retest-required

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

25 similar comments


openshift-ci bot commented Jan 22, 2022

@damemi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d46b8d8 into openshift:master Jan 22, 2022

openshift-ci bot commented Jan 22, 2022

@damemi: Bugzilla bug 2033751 is in an unrecognized state (VERIFIED) and will not be moved to the MODIFIED state.

In response to this:

Bug 2033751: Return Error when trying to use Scheduler Policy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
