-
Notifications
You must be signed in to change notification settings - Fork 515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update types_gcprovider.go #1044
update types_gcprovider.go #1044
Conversation
8ca7406
to
d9eb7d1
Compare
/lgtm |
machine/v1beta1/types_gcpprovider.go
Outdated
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP. | ||
type GCPAcceleratorConfig struct { | ||
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance | ||
AcceleratorCount int64 `json:"acceleratorCount,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AcceleratorCount int64 `json:"acceleratorCount,omitempty"` | |
// +optional | |
AcceleratorCount int64 `json:"acceleratorCount,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, please add optional tags for consistency with other fields
machine/v1beta1/types_gcpprovider.go
Outdated
AcceleratorCount int64 `json:"acceleratorCount,omitempty"` | ||
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance. | ||
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-a100, nvidia-tesla-p4, nvidia-tesla-t4 | ||
AcceleratorType string `json:"acceleratorType,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AcceleratorType string `json:"acceleratorType,omitempty"` | |
// +optional | |
AcceleratorType string `json:"acceleratorType,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we consider these as optional
? Whenever GCPAcceleratorConfig
is actully specified in the GuestAccelerators
field in needs to have both these values set. The entire GuestAccelerators
field is optional, but if set, then within that field both type and count are required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the field is not optional then I think you should remove omitempty
99b25fa
to
523cd94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest adding some //+kubebuilder:Enum
and //+kubebuilder:default
tags, but it looks like we don't use kubebuilder for the machine API. I also wasn't sure about the use of bool
and pointers, which we avoid in other APIs, but it looks like they're prevalent in the machine API.
machine/v1beta1/types_gcpprovider.go
Outdated
@@ -54,9 +54,20 @@ type GCPMachineProviderSpec struct { | |||
// ProjectID is the project in which the GCP machine provider will create the VM. | |||
// +optional | |||
ProjectID string `json:"projectID,omitempty"` | |||
// GuestAccelerators is a list of GPUs to be attached to the VM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these necessarily GPUs? "Accelerators" is a broad enough term that it could encompass FPGAs, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Google Cloud Platform provides graphics processing units (accelerators) that you can add to VM instances to improve or accelerate performance when working with intensive workloads. " - from GCP docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Google Cloud Platform provides graphics processing units (accelerators) that you can add to VM instances to improve or accelerate performance when working with intensive workloads. " - from GCP docs
If these are explicitly GPUs, it seems worth calling them such in our API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the only thing we can add here currently are GPUs. All documentation about accelerator types links back to GPUs. @SamuelStuchly let's rename this field to GPUs
machine/v1beta1/types_gcpprovider.go
Outdated
@@ -54,9 +54,20 @@ type GCPMachineProviderSpec struct { | |||
// ProjectID is the project in which the GCP machine provider will create the VM. | |||
// +optional | |||
ProjectID string `json:"projectID,omitempty"` | |||
// GuestAccelerators is a list of GPUs to be attached to the VM. | |||
// +optional | |||
GuestAccelerators []*GCPAcceleratorConfig `json:"guestAccelerators,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a slice of pointers? Any reason not to use []GCPAcceleratorConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
We expect our openshift APIs to follow openshift and kubernetes conventions for a consistent look and feel. Can you describe why its important to allow an element to nil? What additional power does it grant the user of this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any reason we need this to be a point, please change this one Sam
machine/v1beta1/types_gcpprovider.go
Outdated
// OnHostMaintenance Determines the behavior when a maintenance event occurs that might cause the instance to reboot (default MIGRATE). | ||
// This is required to be set to TERMINATE if you want to provision machine with attached GPUs. | ||
// +optional | ||
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider defining a type and named constants for the allowed values.
Are "MIGRATE" and "TERMINATE" the only allowed values?
Can you elaborate on what these values mean? I can only guess that "MIGRATE" could mean the cloud platform would do some kind of live migration of the VM, whereas "TERMINATE" could mean the cloud platform would do a regular OS shutdown, forcing the eviction of pods.
// OnHostMaintenance Determines the behavior when a maintenance event occurs that might cause the instance to reboot (default MIGRATE). | |
// This is required to be set to TERMINATE if you want to provision machine with attached GPUs. | |
// +optional | |
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` | |
// OnHostMaintenance determines the behavior when a maintenance event occurs that might cause the instance to reboot. | |
// Allowed values are "MIGRATE" or "TERMINATE". The default value is "MIGRATE". | |
// "MIGRATE" means [...]. "TERMINATE" means [...]. | |
// This is required to be set to "TERMINATE" if the machine has attached accelerators. | |
// +optional | |
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This as well is set as string following google compute api. I see that for example in google's .NET api package they have it set as enum
. However in golang package it is simply a string
. We could possibly change to its own type, however it doesnt seem necessary to me, considering we would be diverging from compute api convntion.
I can add an extra comment concerning MIGRATE
and TERMINATE
meaning.
machine/v1beta1/types_gcpprovider.go
Outdated
// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true). | ||
// Can not be true with preepmtible instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "the system" mean OpenShift, GCP, or something else?
// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true). | |
// Can not be true with preepmtible instances. | |
// AutomaticRestart determines the behavior when an instance crashes or is stopped by the system (default true). | |
// Cannot be true with preemptible instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taken from GCP docs, instance
is referring to the VM running on GCP.
If it seems confusing i can update this with clearer description.
machine/v1beta1/types_gcpprovider.go
Outdated
// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true). | ||
// Can not be true with preepmtible instances. | ||
// +optional | ||
AutomaticRestart *bool `json:"automaticRestart,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's usually better to avoid bool
in API definitions. I see that other types the machine API uses it in many places though, so maybe it's all right here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
We expect our CRDs to follow kubernetes API conventions. While its not unheard of to have *bool
as a type, I'm also interested in how that power is intended to be wielded by a user and how it is better than some kind of restart policy.
RestartPolicy, with possible values of RestartAlways, RemainStopped (or RestartNever for congruency), and "" appears more kube-like and explicitly describes the alternatives available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SamuelStuchly Have you looked at whether the cluster api equivalent has this option? Curious to see how they handle it
Bool pointer would normally be used to allow three states, but I don't think we need that here, we could invert the type to make it just a bool eg DisableAutomaticRestart bool // default false
, but I don't have an issue with moving this to the enum style suggested above either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as i can see they do not have this option yet. Cluster api does not support GPUs on gcp kubernetes-sigs/cluster-api-provider-gcp#289, so i assume they have not had use for this option yet.
machine/v1beta1/types_gcpprovider.go
Outdated
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP. | ||
type GCPAcceleratorConfig struct { | ||
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance | ||
AcceleratorCount int64 `json:"acceleratorCount"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an int64 allows for a lot of accelerator cards! Probably an int32 would suffice.
The "Accelerator" prefix seems redundant (spec.guestAccelerators[*].acceleratorCount
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention
Can you describe the value this provides to a user of the API? What is the logical upper bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upper bound of this will be in the order of two digits, if not 1, we can definitely get away with an int32
Based on the limits described in the docs, the upper bound is 16 GPUs per instance
Machine-api gcp provider uses Google's https://pkg.go.dev/google.golang.org/api/compute/v1. Therefore our structures are modeled after google's compute api structures. Hence the bool pointers, variable names, etc. |
523cd94
to
93ff8d7
Compare
machine/v1beta1/types_gcpprovider.go
Outdated
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance | ||
AcceleratorCount int64 `json:"acceleratorCount"` | ||
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance. | ||
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-a100, nvidia-tesla-p4, nvidia-tesla-t4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need validation tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make these typed constants as I suggested above?
Do we need validation tags when we this API is only used as a raw extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make these into the typed constants, we will have to update this in case gcp changes supported cards.
Also this line was supposed to be more of comment of which GPUs does gcp allow. If we were to make this into constants accepted by AcceleratorType
field, I will remove nvidia-tesla-a100 type since it cant be specified in this field. This type is supported only in machine types with already attached gpus.
machine/v1beta1/types_gcpprovider.go
Outdated
// Preemptible indicates if created instance is preemptible | ||
// +optional | ||
Preemptible bool `json:"preemptible,omitempty"` | ||
// OnHostMaintenance determines the behavior when a maintenance event occurs that might cause the instance to reboot. | ||
// [default] MIGRATE - causes Compute Engine to live migrate an instance when there is a maintenance event. | ||
// TERMINATE - stops an instance instead of migrating it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enumerated values in kube APIs are "Terminate". Also, you're missing the tags for CRD validation on this field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change this to be a typed string and then add constants that match that type with the correct values? That way we don't have to spell them out in our controllers and tests multiple times.
@deads2k is making it a const like this idiomatic for enums?
type GCPHostMaintenanceType string
const (
MigrateHostMaintenanceType GCPHostMaintenanceType = "Migrate"
TerminateHostMaintenanceType GCPHostMaintenanceType = "Terminate"
)
We don't have CRD validation for the GCP provider spec (or any provider spec) due to it being passed in as a raw extension, so we could add the CRD validation tags but they unfortunately have no effect right now
machine/v1beta1/types_gcpprovider.go
Outdated
// This is required to be set to TERMINATE if you want to provision machine with attached GPUs. | ||
// +optional | ||
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` | ||
// AutomaticRestart determines behavior when an instance crashes or is stopped due to maintenance event (default true). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// AutomaticRestart determines behavior when an instance crashes or is stopped due to maintenance event (default true). | |
// AutomaticRestart determines the behavior when an instance crashes or is stopped due to a maintenance event (default true). |
Thanks, this is a little clearer. Is it going to be obvious to people using this API what "a maintenance event" means though, or who or what could initiate a "maintenance event"? Would it be accurate to say "when an instance crashes or the underlying infrastructure provider stops the instance as part of a maintenance event"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it is clear enough. But sure it cant hurt to specify it even better.
@Miciah @deads2k I see where you are coming from. However we have been following provider api conventions across MAPI for a long time. What you are suggesting might not just apply to changes made in this PR. This PR is part of migration of many api configs, that follow same conventions, which have already been merged before. This PR is also blocking other further migration PRs since we need the dependency. Therefore there is argument from our side to merge asap. We of course value your input and want to reach an understanding. |
machine/v1beta1/types_gcpprovider.go
Outdated
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP. | ||
type GCPAcceleratorConfig struct { | ||
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance | ||
AcceleratorCount int32 `json:"acceleratorCount"` | ||
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance. | ||
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4 | ||
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$` | ||
AcceleratorType string `json:"acceleratorType"` | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rename this inline with the rename above or remove the accelerator prefixes as below
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP. | |
type GCPAcceleratorConfig struct { | |
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance | |
AcceleratorCount int32 `json:"acceleratorCount"` | |
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance. | |
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4 | |
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$` | |
AcceleratorType string `json:"acceleratorType"` | |
} | |
// GCPGPUConfig describes type and count of accelerator cards attached to the instance on GCP. | |
type GCPGPUConfig struct { | |
// Count is number of GPUs to be attached to an instance | |
Count int32 `json:"count"` | |
// Type is the type of GPU to be attached to an instance. | |
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4 | |
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$` | |
Type string `json:"type"` | |
} | |
71473e6
to
313a5bb
Compare
machine/v1beta1/types_gcpprovider.go
Outdated
// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs. | ||
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs. | |
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType; | |
// This is required to be set to Terminate if you want to provision machine with attached GPUs. | |
// Otherwise, allowed values are Migrate and Terminate. | |
// +kubebuilder:validation:Enum=Migrate;Terminate; |
// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs. | ||
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType; | ||
// +optional | ||
OnHostMaintenance GCPHostMaintenanceType `json:"onHostMaintenance,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to describe what happens when the value is empty. We often say, " "" means no opinion and the platform chooses a default, which is subject to change over time, currently that default is ".
machine/v1beta1/types_gcpprovider.go
Outdated
// AutomaticRestart determines the behavior when an instance crashes or the underlying infrastructure provider stops the instance as part of a maintenance event (default true). | ||
// Cannot be true with preemptible instances. | ||
// +optional | ||
AutomaticRestart bool `json:"automaticRestart,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this one slipped through the cracks: #1044 (comment)
Perhaps something like RestartPolicy
, with possible values of RestartAlways, RemainStopped (or RestartNever for congruency), and ""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see much value in changing bool
to RestartPolicy
, since it supports only two possible values anyway. It just adds changes to existing code, plus extra need for validation.
I added it as a separate commit, so if you agree we can just remove it.
313a5bb
to
b7ee881
Compare
machine/v1beta1/types_gcpprovider.go
Outdated
// If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always". | ||
// +kubebuilder:validation:Enum=Always;Never; | ||
// +optional | ||
AutomaticRestart corev1.RestartPolicy `json:"automaticRestart,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should change the field to be called RestartPolicy
and define our own enum values. Using the corev1.RestartPolicy
enum does work, except there are 3 possible values there, and currently, we only support 2 possible values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this, however i assumed we do not want to rename the field, since we do not have docs concerning the providerSpec
, it might be confusing for customer if the field does not match the GCP docs.
I can make the change and mention it in the comment that this RestartPolicy
field is reffering to AutomaticRestart
in google docs.
b7ee881
to
a1fab72
Compare
a1fab72
to
0033533
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alexander-demichev, deads2k, JoelSpeed, SamuelStuchly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Gcp providerSpec was updated for the support of GuestAccelerators. These changes were not included in the version which was migrated to this repository. Therefore need for this update.