Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update types_gcprovider.go #1044

Merged

Conversation

SamuelStuchly
Copy link
Contributor

Gcp providerSpec was updated for the support of GuestAccelerators. These changes were not included in the version which was migrated to this repository. Therefore need for this update.

@JoelSpeed
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2021
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP.
type GCPAcceleratorConfig struct {
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
// +optional
AcceleratorCount int64 `json:"acceleratorCount,omitempty"`

Copy link
Contributor

@alexander-demicev alexander-demicev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, please add optional tags for consistency with other fields

AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance.
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-a100, nvidia-tesla-p4, nvidia-tesla-t4
AcceleratorType string `json:"acceleratorType,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
AcceleratorType string `json:"acceleratorType,omitempty"`
// +optional
AcceleratorType string `json:"acceleratorType,omitempty"`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we consider these as optional ? Whenever GCPAcceleratorConfig is actully specified in the GuestAccelerators field in needs to have both these values set. The entire GuestAccelerators field is optional, but if set, then within that field both type and count are required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is not optional then I think you should remove omitempty

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 29, 2021
Copy link
Contributor

@alexander-demicev alexander-demicev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 1, 2021
Copy link
Contributor

@Miciah Miciah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest adding some //+kubebuilder:Enum and //+kubebuilder:default tags, but it looks like we don't use kubebuilder for the machine API. I also wasn't sure about the use of bool and pointers, which we avoid in other APIs, but it looks like they're prevalent in the machine API.

@@ -54,9 +54,20 @@ type GCPMachineProviderSpec struct {
// ProjectID is the project in which the GCP machine provider will create the VM.
// +optional
ProjectID string `json:"projectID,omitempty"`
// GuestAccelerators is a list of GPUs to be attached to the VM.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these necessarily GPUs? "Accelerators" is a broad enough term that it could encompass FPGAs, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Google Cloud Platform provides graphics processing units (accelerators) that you can add to VM instances to improve or accelerate performance when working with intensive workloads. " - from GCP docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Google Cloud Platform provides graphics processing units (accelerators) that you can add to VM instances to improve or accelerate performance when working with intensive workloads. " - from GCP docs

If these are explicitly GPUs, it seems worth calling them such in our API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the only thing we can add here currently are GPUs. All documentation about accelerator types links back to GPUs. @SamuelStuchly let's rename this field to GPUs

@@ -54,9 +54,20 @@ type GCPMachineProviderSpec struct {
// ProjectID is the project in which the GCP machine provider will create the VM.
// +optional
ProjectID string `json:"projectID,omitempty"`
// GuestAccelerators is a list of GPUs to be attached to the VM.
// +optional
GuestAccelerators []*GCPAcceleratorConfig `json:"guestAccelerators,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a slice of pointers? Any reason not to use []GCPAcceleratorConfig?

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention

We expect our openshift APIs to follow openshift and kubernetes conventions for a consistent look and feel. Can you describe why its important to allow an element to nil? What additional power does it grant the user of this API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any reason we need this to be a point, please change this one Sam

Comment on lines 63 to 68
// OnHostMaintenance Determines the behavior when a maintenance event occurs that might cause the instance to reboot (default MIGRATE).
// This is required to be set to TERMINATE if you want to provision machine with attached GPUs.
// +optional
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider defining a type and named constants for the allowed values.

Are "MIGRATE" and "TERMINATE" the only allowed values?

Can you elaborate on what these values mean? I can only guess that "MIGRATE" could mean the cloud platform would do some kind of live migration of the VM, whereas "TERMINATE" could mean the cloud platform would do a regular OS shutdown, forcing the eviction of pods.

Suggested change
// OnHostMaintenance Determines the behavior when a maintenance event occurs that might cause the instance to reboot (default MIGRATE).
// This is required to be set to TERMINATE if you want to provision machine with attached GPUs.
// +optional
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`
// OnHostMaintenance determines the behavior when a maintenance event occurs that might cause the instance to reboot.
// Allowed values are "MIGRATE" or "TERMINATE". The default value is "MIGRATE".
// "MIGRATE" means [...]. "TERMINATE" means [...].
// This is required to be set to "TERMINATE" if the machine has attached accelerators.
// +optional
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This as well is set as string following google compute api. I see that for example in google's .NET api package they have it set as enum. However in golang package it is simply a string. We could possibly change to its own type, however it doesnt seem necessary to me, considering we would be diverging from compute api convntion.
I can add an extra comment concerning MIGRATE and TERMINATE meaning.

Comment on lines 67 to 68
// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true).
// Can not be true with preepmtible instances.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "the system" mean OpenShift, GCP, or something else?

Suggested change
// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true).
// Can not be true with preepmtible instances.
// AutomaticRestart determines the behavior when an instance crashes or is stopped by the system (default true).
// Cannot be true with preemptible instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taken from GCP docs, instance is referring to the VM running on GCP.
If it seems confusing i can update this with clearer description.

// AutomaticRestart Determines the behavior when an instance crashes or is stopped by the system (default true).
// Can not be true with preepmtible instances.
// +optional
AutomaticRestart *bool `json:"automaticRestart,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually better to avoid bool in API definitions. I see that other types the machine API uses it in many places though, so maybe it's all right here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@deads2k deads2k Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention

We expect our CRDs to follow kubernetes API conventions. While its not unheard of to have *bool as a type, I'm also interested in how that power is intended to be wielded by a user and how it is better than some kind of restart policy.

RestartPolicy, with possible values of RestartAlways, RemainStopped (or RestartNever for congruency), and "" appears more kube-like and explicitly describes the alternatives available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SamuelStuchly Have you looked at whether the cluster api equivalent has this option? Curious to see how they handle it

Bool pointer would normally be used to allow three states, but I don't think we need that here, we could invert the type to make it just a bool eg DisableAutomaticRestart bool // default false, but I don't have an issue with moving this to the enum style suggested above either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as i can see they do not have this option yet. Cluster api does not support GPUs on gcp kubernetes-sigs/cluster-api-provider-gcp#289, so i assume they have not had use for this option yet.

// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP.
type GCPAcceleratorConfig struct {
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int64 `json:"acceleratorCount"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an int64 allows for a lot of accelerator cards! Probably an int32 would suffice.

The "Accelerator" prefix seems redundant (spec.guestAccelerators[*].acceleratorCount).

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follows https://pkg.go.dev/google.golang.org/api/compute/v1 convention

Can you describe the value this provides to a user of the API? What is the logical upper bound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upper bound of this will be in the order of two digits, if not 1, we can definitely get away with an int32

Based on the limits described in the docs, the upper bound is 16 GPUs per instance

@SamuelStuchly
Copy link
Contributor Author

SamuelStuchly commented Nov 4, 2021

I was going to suggest adding some //+kubebuilder:Enum and //+kubebuilder:default tags, but it looks like we don't use kubebuilder for the machine API. I also wasn't sure about the use of bool and pointers, which we avoid in other APIs, but it looks like they're prevalent in the machine API.

Machine-api gcp provider uses Google's https://pkg.go.dev/google.golang.org/api/compute/v1. Therefore our structures are modeled after google's compute api structures. Hence the bool pointers, variable names, etc.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2021
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int64 `json:"acceleratorCount"`
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance.
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-a100, nvidia-tesla-p4, nvidia-tesla-t4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need validation tags.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make these typed constants as I suggested above?

Do we need validation tags when we this API is only used as a raw extension?

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make these into the typed constants, we will have to update this in case gcp changes supported cards.
Also this line was supposed to be more of comment of which GPUs does gcp allow. If we were to make this into constants accepted by AcceleratorType field, I will remove nvidia-tesla-a100 type since it cant be specified in this field. This type is supported only in machine types with already attached gpus.

// Preemptible indicates if created instance is preemptible
// +optional
Preemptible bool `json:"preemptible,omitempty"`
// OnHostMaintenance determines the behavior when a maintenance event occurs that might cause the instance to reboot.
// [default] MIGRATE - causes Compute Engine to live migrate an instance when there is a maintenance event.
// TERMINATE - stops an instance instead of migrating it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerated values in kube APIs are "Terminate". Also, you're missing the tags for CRD validation on this field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this to be a typed string and then add constants that match that type with the correct values? That way we don't have to spell them out in our controllers and tests multiple times.

@deads2k is making it a const like this idiomatic for enums?

type GCPHostMaintenanceType string

const (
  MigrateHostMaintenanceType GCPHostMaintenanceType = "Migrate"
  TerminateHostMaintenanceType GCPHostMaintenanceType = "Terminate"
)

We don't have CRD validation for the GCP provider spec (or any provider spec) due to it being passed in as a raw extension, so we could add the CRD validation tags but they unfortunately have no effect right now

// This is required to be set to TERMINATE if you want to provision machine with attached GPUs.
// +optional
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`
// AutomaticRestart determines behavior when an instance crashes or is stopped due to maintenance event (default true).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// AutomaticRestart determines behavior when an instance crashes or is stopped due to maintenance event (default true).
// AutomaticRestart determines the behavior when an instance crashes or is stopped due to a maintenance event (default true).

Thanks, this is a little clearer. Is it going to be obvious to people using this API what "a maintenance event" means though, or who or what could initiate a "maintenance event"? Would it be accurate to say "when an instance crashes or the underlying infrastructure provider stops the instance as part of a maintenance event"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it is clear enough. But sure it cant hurt to specify it even better.

@SamuelStuchly
Copy link
Contributor Author

SamuelStuchly commented Nov 9, 2021

@Miciah @deads2k I see where you are coming from. However we have been following provider api conventions across MAPI for a long time. What you are suggesting might not just apply to changes made in this PR. This PR is part of migration of many api configs, that follow same conventions, which have already been merged before. This PR is also blocking other further migration PRs since we need the dependency. Therefore there is argument from our side to merge asap. We of course value your input and want to reach an understanding.
I have discussed this with our team and asked our technical lead @JoelSpeed to help resolve this.

Comment on lines 159 to 184
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP.
type GCPAcceleratorConfig struct {
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int32 `json:"acceleratorCount"`
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance.
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$`
AcceleratorType string `json:"acceleratorType"`
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rename this inline with the rename above or remove the accelerator prefixes as below

Suggested change
// GCPAcceleratorConfig describes type and count of accelerator cards attached to the instance on GCP.
type GCPAcceleratorConfig struct {
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int32 `json:"acceleratorCount"`
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance.
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$`
AcceleratorType string `json:"acceleratorType"`
}
// GCPGPUConfig describes type and count of accelerator cards attached to the instance on GCP.
type GCPGPUConfig struct {
// Count is number of GPUs to be attached to an instance
Count int32 `json:"count"`
// Type is the type of GPU to be attached to an instance.
// Supported accelerator types are: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-v100, nvidia-tesla-p4, nvidia-tesla-t4
// +kubebuilder:validation:Pattern=`^nvidia-tesla-(k80|p100|v100|p4|t4)$`
Type string `json:"type"`
}

Comment on lines 74 to 75
// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs.
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs.
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType;
// This is required to be set to Terminate if you want to provision machine with attached GPUs.
// Otherwise, allowed values are Migrate and Terminate.
// +kubebuilder:validation:Enum=Migrate;Terminate;

// This is required to be set to TerminateHostMaintenanceType if you want to provision machine with attached GPUs.
// +kubebuilder:validation:Enum=MigrateHostMaintenanceType;TerminateHostMaintenanceType;
// +optional
OnHostMaintenance GCPHostMaintenanceType `json:"onHostMaintenance,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to describe what happens when the value is empty. We often say, " "" means no opinion and the platform chooses a default, which is subject to change over time, currently that default is ".

// AutomaticRestart determines the behavior when an instance crashes or the underlying infrastructure provider stops the instance as part of a maintenance event (default true).
// Cannot be true with preemptible instances.
// +optional
AutomaticRestart bool `json:"automaticRestart,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this one slipped through the cracks: #1044 (comment)

Perhaps something like RestartPolicy, with possible values of RestartAlways, RemainStopped (or RestartNever for congruency), and ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see much value in changing bool to RestartPolicy, since it supports only two possible values anyway. It just adds changes to existing code, plus extra need for validation.

I added it as a separate commit, so if you agree we can just remove it.

// If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".
// +kubebuilder:validation:Enum=Always;Never;
// +optional
AutomaticRestart corev1.RestartPolicy `json:"automaticRestart,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change the field to be called RestartPolicy and define our own enum values. Using the corev1.RestartPolicy enum does work, except there are 3 possible values there, and currently, we only support 2 possible values

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, however i assumed we do not want to rename the field, since we do not have docs concerning the providerSpec, it might be confusing for customer if the field does not match the GCP docs.

I can make the change and mention it in the comment that this RestartPolicy field is reffering to AutomaticRestart in google docs.

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2021
@deads2k
Copy link
Contributor

deads2k commented Nov 19, 2021

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-demichev, deads2k, JoelSpeed, SamuelStuchly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2021
@openshift-merge-robot openshift-merge-robot merged commit 313e51e into openshift:master Nov 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants