clusterapi scale from zero support #4840
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: elmiko. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
i am placing a pre-emptive hold on this to allow time for reviews. @enxebre ptal
Force-pushed from a762e2f to 9450e75
@@ -226,7 +234,64 @@ func (ng *nodegroup) Nodes() ([]cloudprovider.Instance, error) {
 // node by default, using manifest (most likely only kube-proxy).
 // Implementation optional.
 func (ng *nodegroup) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
-	return nil, cloudprovider.ErrNotImplemented
+	if !ng.scalableResource.CanScaleFromZero() {
💥
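To make the shape of the change clearer for readers who only see the first line of the new hunk above, here is a hedged, self-contained sketch of how a template NodeInfo can be assembled from a known capacity for scale-from-zero. It is illustrative only; the helper name and fields are hypothetical and not the exact code added by this PR.

```go
package clusterapisketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// buildTemplateNodeInfo is a hypothetical helper: given a node group name and
// the capacity derived from annotations or the infrastructure template, it
// builds the synthetic Node that the scheduler simulation uses when the group
// currently has zero real nodes.
func buildTemplateNodeInfo(nodeGroupName string, capacity corev1.ResourceList) *schedulerframework.NodeInfo {
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: nodeGroupName + "-template",
		},
		Status: corev1.NodeStatus{
			Capacity:    capacity,
			Allocatable: capacity,
		},
	}

	// The real implementation also seeds the NodeInfo with the pods expected
	// on a fresh node (most likely only kube-proxy), per the comment in the
	// diff above.
	nodeInfo := schedulerframework.NewNodeInfo()
	nodeInfo.SetNode(node)
	return nodeInfo
}
```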
(Resolved review comment on cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go)
updated to remove redundant checks for CPU and memory data
(Resolved review comments on clusterapi_controller.go, clusterapi_unstructured_test.go, clusterapi_utils_test.go, and clusterapi_nodegroup_test.go)
This allows a Machine{Set,Deployment} to scale up/down from 0, providing the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods` are optional.

For autoscaling from zero, the autoscaler should convert the memory value received in the appropriate annotation to bytes, using powers of two consistently with other providers, and fail if the received format is not expected. This gives robust behaviour consistent with cloud provider APIs and provider implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366

Co-authored-by: Enxebre <alberto.garcial@hotmail.com>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Michael McCune <elmiko@redhat.com>
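As an illustration of that conversion (not the exact code in this PR), the memory annotation can be parsed with the standard Kubernetes quantity parser and read back as bytes. A minimal sketch, assuming values in the style shown above; the helper name is hypothetical:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// parseMemoryAnnotation is a hypothetical helper showing how a memory
// annotation value (e.g. "8G" or "8Gi") can be converted to bytes with
// resource.ParseQuantity, failing on unexpected formats.
func parseMemoryAnnotation(value string) (int64, error) {
	q, err := resource.ParseQuantity(value)
	if err != nil {
		return 0, fmt.Errorf("unable to parse memory annotation %q: %v", value, err)
	}
	// Value() returns the quantity rounded up to an integer number of bytes.
	return q.Value(), nil
}

func main() {
	for _, v := range []string{"8G", "8Gi", "8192Mi"} {
		bytes, err := parseMemoryAnnotation(v)
		fmt.Println(v, bytes, err)
	}
}
```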
Force-pushed from 57fffa1 to d3edc45
update
i am doing some manual testing now to make sure nothing broke, but i think this is ready for re-review
manual testing is working as expected against capi 1.0.5 and kubemark 0.3.0
I've given this a review again and it LGTM. Do we want to give others a few days to review and then try to get this merged? Is there anyone specific we want to get to review this before we label it?
sgtm, though we definitely need to add some sort of smoke testing to validate PRs here.
(Resolved review comments on cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go)
thanks a lot @elmiko, this looks great to me other than #4840 (comment)
fyi, i'm leaving this on hold so we don't merge early since it's already approved. i will remove the hold once we work out the last discussion item.
This commit is a combination of several commits. Significant details are preserved below.

* update functions for resource annotations

  This change converts some of the functions that look at annotations for resource usage to indicate their usage in the function name. This helps to make room for allowing the infrastructure reference as an alternate source for the capacity information.

* migrate capacity logic into a single function

  This change moves the logic that collects the instance capacity from the TemplateNodeInfo function into a method of the unstructuredScalableResource named InstanceCapacity. This new function houses the logic that decides between annotations and the infrastructure reference when calculating the capacity for the node.

* add ability to look up infrastructure references

  This change supplements the annotation lookups by adding the logic to read the infrastructure reference if it exists. This is done to determine if the machine template exposes a capacity field in its status. For more information on how this mechanism works, please see the cluster-api enhancement[0].

* add documentation for capi scaling from zero

* improve tests for clusterapi scale from zero

  This change adds functionality to test the dynamic client behavior of getting the infrastructure machine templates.

* update README with information about rbac changes

  This adds more information about the RBAC changes necessary for the scale from zero support to work.

* remove extra check for scaling from zero

  Since the CanScaleFromZero function checks whether both CPU and memory are present, there is no need to check a second time. This also adds some documentation to the CanScaleFromZero function to make it clearer what is happening.

* update unit test for capi scale from zero

  Adds a few more cases and details to the scale from zero unit tests, including ensuring that the int-based annotations do not accept other unit types.

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
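The README changes themselves are not shown in this thread, so here is a hedged sketch of the kind of RBAC addition the scale-from-zero support implies: the autoscaler's service account needs read access to the infrastructure machine templates it now looks up. The group, resource, and role names below are illustrative; consult the PR's README and the provider's CRDs for the exact values.

```yaml
# Illustrative only: grants read access to infrastructure machine templates so
# the autoscaler can discover capacity for scale-from-zero. The exact
# apiGroups and resources depend on the infrastructure provider in use.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-capi-infra-templates
rules:
- apiGroups:
  - infrastructure.cluster.x-k8s.io
  resources:
  - '*'
  verbs:
  - get
  - list
  - watch
```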
Force-pushed from d3edc45 to 1a65fde
update
i'm doing some digging around client-go, have found https://github.com/kubernetes/client-go/tree/master/tools/cache but i think we need to add a little more code to make it work. i was under the false impression that the cache was automatic with client-go. hopefully i'll have another patch soon with the cache added.
i talked with @JoelSpeed a little about client-go internals, i think what i need to add here is the ability to add new informers to the client when we observe new infrastructure machine templates. i'm doing a little more digging to see how we can do that with our current setup.
this change adds logic to create informers for the infrastructure machine templates that are discovered during the scale from zero checks. it also adds tests and a slight change to the controller structure to account for the dynamic informer creation.
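For readers unfamiliar with the client-go mechanics discussed above, here is a minimal, standalone sketch (not the PR's code) of creating a dynamic informer for a resource that is only discovered at runtime, such as an infrastructure machine template. The GroupVersionResource and resync period are example values only.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes an in-cluster config; out-of-cluster setups would build a
	// rest.Config from a kubeconfig instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A shared factory that can hand out informers for arbitrary resources,
	// which is what makes it possible to create informers for machine
	// templates observed only at runtime.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client, 10*time.Minute, "", nil)

	// Example GVR only; the real group/version/resource comes from the
	// Machine{Set,Deployment}'s infrastructure reference.
	gvr := schema.GroupVersionResource{
		Group:    "infrastructure.cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "dockermachinetemplates",
	}
	informer := factory.ForResource(gvr)

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Reads are now served from the informer's cache rather than hitting the
	// API server on every call.
	objects, err := informer.Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d infrastructure machine templates\n", len(objects))
}
```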
ok, i have finally understood the magical incantations for client-go to get the dynamic informers created and running for the infrastructure templates. i think this is good to merge whenever folks are happy with the code quality. /hold cancel
This looks good and I think we've addressed everyone's comments. I know Mike has been testing this and was seeing it working. I think it's good to merge now.
clusterapi scale from zero support
Which component does this PR apply to?
cluster-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds scale from zero capability to the clusterapi provider for cluster-autoscaler.
Which issue(s) this PR fixes:
Fixes #3150
Special notes for your reviewer:
The cluster-api-provider-kubemark has implemented the provider part of the opt-in API for scaling from zero. I recommend using this as a way to test with a live cluster.
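For anyone testing against a provider, the opt-in contract from the cluster-api proposal referenced above amounts to the infrastructure machine template reporting the capacity of the machines it creates in its status. A hedged example of that shape follows; the template kind, version, and values are illustrative, not taken from this PR.

```yaml
# Illustrative shape of the opt-in API from the cluster-api
# autoscaling-from-zero proposal: the infrastructure machine template reports
# the capacity of the machines it creates in its status.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: KubemarkMachineTemplate
metadata:
  name: workload-md-0
status:
  capacity:
    cpu: "2"
    memory: 8Gi
    nvidia.com/gpu: "1"
```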
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Usage docs are included in the README file.