Add cmd to configure nodegroups on a running cluster #2246

Merged 41 commits from feature/add-or-remove-ngs into master on Jun 17, 2021 (the diff below reflects changes from 29 of the 41 commits).

Commits
3b28bbd
WIP on nodegroup adder
RobertLucian Jun 7, 2021
6fa2e8f
WIP nodegroup adder cmd
RobertLucian Jun 8, 2021
d4d2bda
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 8, 2021
ec88c2d
Use the simplified aws resource table when showing the costs
RobertLucian Jun 8, 2021
3083606
WIP cluster configure
RobertLucian Jun 9, 2021
ddd5b3e
WIP cluster configure
RobertLucian Jun 10, 2021
9de468a
Bug fixes
RobertLucian Jun 10, 2021
bc3f33a
Further fixes on the cloudformation stacks
RobertLucian Jun 10, 2021
914233b
Address layout on install.sh
RobertLucian Jun 10, 2021
3fcc248
Add priority field to the node group config
RobertLucian Jun 10, 2021
e8b7bf6
Document the priority field in the docs
RobertLucian Jun 10, 2021
19e4be5
Make lint
RobertLucian Jun 10, 2021
cadadfa
Layout change for cluster configure cmd
RobertLucian Jun 11, 2021
20412af
Better reconciliation w/ cloudformation stacks
RobertLucian Jun 11, 2021
a38ec20
Fix number of SGs when cluster already exists
RobertLucian Jun 11, 2021
299fb39
Quota fixes
RobertLucian Jun 11, 2021
e2949c2
Further fixes
RobertLucian Jun 11, 2021
2b445b2
Improve cluster info cmd
RobertLucian Jun 11, 2021
1b599a2
Remove debugging comments
RobertLucian Jun 11, 2021
33c6d36
Nits
RobertLucian Jun 11, 2021
d7d394b
Remove the nodegroups first and then add the others
RobertLucian Jun 11, 2021
df02113
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 11, 2021
2c50eb4
Nits
RobertLucian Jun 11, 2021
d6fd7b8
Separate validate functions
RobertLucian Jun 14, 2021
5f385cc
Simplify get cluster state package
RobertLucian Jun 14, 2021
97e4f30
Address PR comments
RobertLucian Jun 14, 2021
4153321
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 14, 2021
1c42ae0
Add missing error print when stacks couldn't be retrieved
RobertLucian Jun 14, 2021
157fec7
Bolts and fixes
RobertLucian Jun 14, 2021
c562386
Address PR comments
RobertLucian Jun 15, 2021
fb53dda
Print cluster stacks when running cluster info cmd
RobertLucian Jun 15, 2021
eae2bd1
Refactor
RobertLucian Jun 15, 2021
262b4f5
Some refactoring
RobertLucian Jun 15, 2021
b0fc893
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 15, 2021
0068a7d
Fix to the number of required SGs on configure
RobertLucian Jun 15, 2021
3bc9d13
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 16, 2021
91bf390
Address PR comments
RobertLucian Jun 16, 2021
d4b5bb9
Address merge conflicts from master
RobertLucian Jun 16, 2021
b58b7a6
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 17, 2021
4d5e378
Addressing PR comments and a fix
RobertLucian Jun 17, 2021
4d40100
Merge branch 'master' into feature/add-or-remove-ngs
RobertLucian Jun 17, 2021
248 changes: 92 additions & 156 deletions cli/cmd/cluster.go

Large diffs are not rendered by default.

23 changes: 16 additions & 7 deletions cli/cmd/errors.go
@@ -21,6 +21,7 @@ import (
 	"net/url"
 	"strings"
 
+	"github.com/cortexlabs/cortex/cli/types/flags"
 	"github.com/cortexlabs/cortex/pkg/consts"
 	"github.com/cortexlabs/cortex/pkg/lib/errors"
 	s "github.com/cortexlabs/cortex/pkg/lib/strings"
@@ -52,7 +53,7 @@ const (
 	ErrMissingAWSCredentials      = "cli.missing_aws_credentials"
 	ErrCredentialsInClusterConfig = "cli.credentials_in_cluster_config"
 	ErrClusterUp                  = "cli.cluster_up"
-	ErrClusterScale               = "cli.cluster_scale"
+	ErrClusterConfigure           = "cli.cluster_configure"
 	ErrClusterDebug               = "cli.cluster_debug"
 	ErrClusterRefresh             = "cli.cluster_refresh"
 	ErrClusterDown                = "cli.cluster_down"
@@ -61,7 +62,8 @@ const (
 	ErrMaxInstancesLowerThan               = "cli.max_instances_lower_than"
 	ErrMinInstancesGreaterThanMaxInstances = "cli.min_instances_greater_than_max_instances"
 	ErrNodeGroupNotFound                   = "cli.nodegroup_not_found"
-	ErrJSONOutputNotSupportedWithFlag      = "cli.json_output_not_supported_with_flag"
+	ErrMutuallyExclusiveFlags              = "cli.mutually_exclusive_flags"
+	ErrOutputTypeNotSupportedWithFlag      = "cli.output_type_not_supported_with_flag"
 	ErrClusterAccessConfigRequired         = "cli.cluster_access_config_or_prompts_required"
 	ErrShellCompletionNotSupported         = "cli.shell_completion_not_supported"
 	ErrNoTerminalWidth                     = "cli.no_terminal_width"
@@ -162,9 +164,9 @@ func ErrorClusterUp(out string) error {
 	})
 }
 
-func ErrorClusterScale(out string) error {
+func ErrorClusterConfigure(out string) error {
 	return errors.WithStack(&errors.Error{
-		Kind:    ErrClusterScale,
+		Kind:    ErrClusterConfigure,
 		Message: out,
 		NoPrint: true,
 	})
@@ -229,10 +231,17 @@ func ErrorNodeGroupNotFound(scalingNodeGroupName, clusterName, clusterRegion str
 	})
 }
 
-func ErrorJSONOutputNotSupportedWithFlag(flag string) error {
+func ErrorMutuallyExclusiveFlags(flagA, flagB string) error {
 	return errors.WithStack(&errors.Error{
-		Kind:    ErrJSONOutputNotSupportedWithFlag,
-		Message: fmt.Sprintf("flag %s cannot be used when output type is set to json", flag),
+		Kind:    ErrMutuallyExclusiveFlags,
+		Message: fmt.Sprintf("flags %s and %s cannot be used at the same time", flagA, flagB),
	})
 }
+
+func ErrorOutputTypeNotSupportedWithFlag(flag string, outputType flags.OutputType) error {
+	return errors.WithStack(&errors.Error{
+		Kind:    ErrOutputTypeNotSupportedWithFlag,
+		Message: fmt.Sprintf("flag %s cannot be used when output type is set to %s", flag, outputType),
+	})
+}
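As a rough illustration of how these two new constructors are meant to be called — the flag names below are hypothetical, since the real call sites live in cli/cmd/cluster.go, whose large diff is not rendered on this page:

```go
// Hypothetical guard sketched from the error messages above; not code from the PR.
func validateInfoFlags(outputType flags.OutputType, debug bool, printConfig bool) error {
	// two flags that cannot be combined
	if debug && printConfig {
		return ErrorMutuallyExclusiveFlags("--debug", "--print-config")
	}
	// a flag that is incompatible with a particular output type
	if printConfig && outputType == flags.JSONOutputType {
		return ErrorOutputTypeNotSupportedWithFlag("--print-config", outputType)
	}
	return nil
}
```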
83 changes: 78 additions & 5 deletions cli/cmd/lib_cluster_config.go
@@ -26,14 +26,17 @@ import (
 	"github.com/cortexlabs/cortex/pkg/lib/aws"
 	cr "github.com/cortexlabs/cortex/pkg/lib/configreader"
 	"github.com/cortexlabs/cortex/pkg/lib/errors"
+	"github.com/cortexlabs/cortex/pkg/lib/exit"
 	"github.com/cortexlabs/cortex/pkg/lib/files"
 	"github.com/cortexlabs/cortex/pkg/lib/maps"
 	libmath "github.com/cortexlabs/cortex/pkg/lib/math"
 	"github.com/cortexlabs/cortex/pkg/lib/pointer"
 	"github.com/cortexlabs/cortex/pkg/lib/prompt"
+	"github.com/cortexlabs/cortex/pkg/lib/slices"
 	s "github.com/cortexlabs/cortex/pkg/lib/strings"
 	"github.com/cortexlabs/cortex/pkg/lib/table"
 	"github.com/cortexlabs/cortex/pkg/types/clusterconfig"
+	"github.com/cortexlabs/cortex/pkg/types/clusterstate"
 )
 
 var _cachedClusterConfigRegex = regexp.MustCompile(`^cluster_\S+\.yaml$`)
@@ -131,12 +134,9 @@ func getInstallClusterConfig(awsClient *aws.Client, clusterConfigFile string, di
 
 	promptIfNotAdmin(awsClient, disallowPrompt)
 
-	clusterConfig.Telemetry, err = readTelemetryConfig()
-	if err != nil {
-		return nil, err
-	}
+	clusterConfig.Telemetry = isTelemetryEnabled()
 
-	err = clusterConfig.Validate(awsClient)
+	err = clusterConfig.ValidateOnInstall(awsClient)
 	if err != nil {
 		err = errors.Append(err, fmt.Sprintf("\n\ncluster configuration schema can be found at https://docs.cortex.dev/v/%s/", consts.CortexVersionMinor))
 		return nil, errors.Wrap(err, clusterConfigFile)
@@ -147,6 +147,45 @@
 	return clusterConfig, nil
 }
 
+func getConfigureClusterConfig(awsClient *aws.Client, stacks clusterstate.ClusterStacks, cachedClusterConfig clusterconfig.Config, newClusterConfigFile string, disallowPrompt bool) (*clusterconfig.Config, clusterconfig.ConfigureChanges, error) {
+	newUserClusterConfig := &clusterconfig.Config{}
+
+	err := readUserClusterConfigFile(newUserClusterConfig, newClusterConfigFile)
+	if err != nil {
+		return nil, clusterconfig.ConfigureChanges{}, err
+	}
+
+	promptIfNotAdmin(awsClient, disallowPrompt)
+
+	newUserClusterConfig.Telemetry = isTelemetryEnabled()
+	cachedClusterConfig.Telemetry = newUserClusterConfig.Telemetry
+
+	configureChanges, err := newUserClusterConfig.ValidateOnConfigure(awsClient, cachedClusterConfig)
+	if err != nil {
+		err = errors.Append(err, fmt.Sprintf("\n\ncluster configuration schema can be found at https://docs.cortex.dev/v/%s/", consts.CortexVersionMinor))
+		return nil, clusterconfig.ConfigureChanges{}, errors.Wrap(err, newClusterConfigFile)
+	}
+
+	// intersect with the stale eks node groups
+	eksNodeGroupsToRemove := []string{}
+	staleEKSNgs, staleEKSNgAvailabilities := stacks.GetStaleNodeGroupNames(*newUserClusterConfig)
+	for i := range staleEKSNgs {
+		if slices.HasString(configureChanges.NodeGroupsToRemove, staleEKSNgs[i]) {
+			eksNodeGroupsToRemove = append(eksNodeGroupsToRemove, clusterstate.GetStackName(newUserClusterConfig.ClusterName, staleEKSNgAvailabilities[i], staleEKSNgs[i]))
+		}
+	}
+	configureChanges.NodeGroupsToRemove = eksNodeGroupsToRemove
+
+	if !configureChanges.HasChanges() {
+		fmt.Println("no change required")
+		exit.Ok()
+	}
+
+	confirmConfigureClusterConfig(configureChanges, cachedClusterConfig, *newUserClusterConfig, _flagClusterDisallowPrompt)
+
+	return newUserClusterConfig, configureChanges, nil
+}
+
 func confirmInstallClusterConfig(clusterConfig *clusterconfig.Config, awsClient *aws.Client, disallowPrompt bool) {
 	eksPrice := aws.EKSPrices[clusterConfig.Region]
 	operatorInstancePrice := aws.InstanceMetadatas[clusterConfig.Region]["t3.medium"].Price
@@ -264,3 +303,37 @@
 		prompt.YesOrExit("would you like to continue?", "", exitMessage)
 	}
 }
+
+func confirmConfigureClusterConfig(configureChanges clusterconfig.ConfigureChanges, oldCc, newCc clusterconfig.Config, disallowPrompt bool) {
+	fmt.Printf("your %s cluster in region %s will receive the following changes\n\n", newCc.ClusterName, newCc.Region)
+	if len(configureChanges.NodeGroupsToAdd) > 0 {
+		fmt.Printf("○ %d %s (%s) will be added\n", len(configureChanges.NodeGroupsToAdd), s.PluralS("nodegroup", len(configureChanges.NodeGroupsToAdd)), s.StrsAnd(configureChanges.NodeGroupsToAdd))
+	}
+	if len(configureChanges.NodeGroupsToRemove) > 0 {
+		fmt.Printf("○ %d %s (%s) will be removed\n", len(configureChanges.NodeGroupsToRemove), s.PluralS("nodegroup", len(configureChanges.NodeGroupsToRemove)), s.StrsAnd(configureChanges.NodeGroupsToRemove))
+	}
+	if len(configureChanges.NodeGroupsToScale) > 0 {
+		fmt.Printf("○ %d %s will be scaled\n", len(configureChanges.NodeGroupsToScale), s.PluralS("nodegroup", len(configureChanges.NodeGroupsToScale)))
+		for _, ngName := range configureChanges.NodeGroupsToScale {
+			var output string
+			ngOld := oldCc.GetNodeGroupByName(ngName)
+			ngScaled := newCc.GetNodeGroupByName(ngName)
+			if ngOld.MinInstances != ngScaled.MinInstances && ngOld.MaxInstances != ngScaled.MaxInstances {
+				output = fmt.Sprintf("nodegroup %s will update its %s from %d to %d and update its %s from %d to %d", ngName, clusterconfig.MinInstancesKey, ngOld.MinInstances, ngScaled.MinInstances, clusterconfig.MaxInstancesKey, ngOld.MaxInstances, ngScaled.MaxInstances)
+			}
+			if ngOld.MinInstances == ngScaled.MinInstances && ngOld.MaxInstances != ngScaled.MaxInstances {
+				output = fmt.Sprintf("nodegroup %s will update its %s from %d to %d", ngName, clusterconfig.MaxInstancesKey, ngOld.MaxInstances, ngScaled.MaxInstances)
+			}
+			if ngOld.MinInstances != ngScaled.MinInstances && ngOld.MaxInstances == ngScaled.MaxInstances {
+				output = fmt.Sprintf("nodegroup %s will update its %s from %d to %d", ngName, clusterconfig.MinInstancesKey, ngOld.MinInstances, ngScaled.MinInstances)
+			}
+			fmt.Println(s.Indent(fmt.Sprintf("○ %s", output), " "))
+		}
+	}
+	fmt.Println()
+
+	if !disallowPrompt {
+		exitMessage := fmt.Sprintf("cluster configuration can be modified via the cluster config file; see https://docs.cortex.dev/v/%s/ for more information", consts.CortexVersionMinor)
+		prompt.YesOrExit(fmt.Sprintf("your cluster named \"%s\" in %s will be updated according to the configuration above, are you sure you want to continue?", newCc.ClusterName, newCc.Region), "", exitMessage)
+	}
+}
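For orientation (not part of the diff): given the format strings above, a run that adds and scales nodegroups would print a confirmation roughly like the following. The cluster and nodegroup names are invented for illustration, and this assumes MinInstancesKey/MaxInstancesKey render as min_instances/max_instances:

```text
your cortex cluster in region us-east-1 will receive the following changes

○ 1 nodegroup (ng-gpu) will be added
○ 1 nodegroup will be scaled
 ○ nodegroup ng-cpu will update its min_instances from 1 to 2 and update its max_instances from 5 to 10

your cluster named "cortex" in us-east-1 will be updated according to the configuration above, are you sure you want to continue?
```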
2 changes: 2 additions & 0 deletions cli/types/flags/output_type.go
@@ -22,12 +22,14 @@ const (
 	UnknownOutputType OutputType = iota
 	PrettyOutputType
 	JSONOutputType
+	YAMLOutputType
 )
 
 var _outputTypes = []string{
 	"unknown",
 	"pretty",
 	"json",
+	"yaml",
 }
 
 func OutputTypeFromString(s string) OutputType {
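The body of OutputTypeFromString is collapsed in the diff above. A minimal sketch of what such a lookup would look like, assuming it simply scans _outputTypes — this is an assumption for context, not the file's actual body:

```go
// Sketch only — presumed lookup against _outputTypes; the real body is
// collapsed in the rendered diff above.
func OutputTypeFromString(s string) OutputType {
	for i, outputType := range _outputTypes {
		if s == outputType {
			return OutputType(i)
		}
	}
	return UnknownOutputType
}
```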
2 changes: 1 addition & 1 deletion dev/generate_cli_md.sh
@@ -38,7 +38,7 @@ commands=(
   "delete"
   "cluster up"
   "cluster info"
-  "cluster scale"
+  "cluster configure"
   "cluster down"
   "cluster export"
   "env configure"
30 changes: 14 additions & 16 deletions docs/clients/cli.md
@@ -12,7 +12,7 @@ Flags:
   -e, --env string      environment to use
   -f, --force           override the in-progress api update
   -y, --yes             skip prompts
-  -o, --output string   output format: one of pretty|json (default "pretty")
+  -o, --output string   output format: one of pretty|json|yaml (default "pretty")
   -h, --help            help for deploy
 ```
 
@@ -27,7 +27,7 @@ Usage:
 Flags:
   -e, --env string      environment to use
   -w, --watch           re-run the command every 2 seconds
-  -o, --output string   output format: one of pretty|json (default "pretty")
+  -o, --output string   output format: one of pretty|json|yaml (default "pretty")
   -v, --verbose         show additional information (only applies to pretty output format)
   -h, --help            help for get
 ```
@@ -58,7 +58,7 @@ Usage:
 Flags:
   -e, --env string      environment to use
   -f, --force           override the in-progress api update
-  -o, --output string   output format: one of pretty|json (default "pretty")
+  -o, --output string   output format: one of pretty|json|yaml (default "pretty")
   -h, --help            help for refresh
 ```
 
@@ -74,7 +74,7 @@ Flags:
   -e, --env string      environment to use
   -f, --force           delete the api without confirmation
   -c, --keep-cache      keep cached data for the api
-  -o, --output string   output format: one of pretty|json (default "pretty")
+  -o, --output string   output format: one of pretty|json|yaml (default "pretty")
   -h, --help            help for delete
 ```
 
@@ -104,29 +104,27 @@ Flags:
   -c, --config string          path to a cluster configuration file
   -n, --name string            name of the cluster
   -r, --region string          aws region of the cluster
-  -o, --output string          output format: one of pretty|json (default "pretty")
+  -o, --output string          output format: one of pretty|json|yaml (default "pretty")
   -e, --configure-env string   name of environment to configure
   -d, --debug                  save the current cluster state to a file
       --print-config           print the cluster config
   -y, --yes                    skip prompts
   -h, --help                   help for info
 ```
 
-## cluster scale
+## cluster configure
 
 ```text
-update the min/max instances for a nodegroup
+update the cluster's configuration
 
 Usage:
-  cortex cluster scale [flags]
+  cortex cluster configure CLUSTER_CONFIG_FILE [flags]
 
 Flags:
-  -n, --name string           name of the cluster
-  -r, --region string         aws region of the cluster
-      --node-group string     name of the node group to scale
-      --min-instances int     minimum number of instances
-      --max-instances int     maximum number of instances
-  -y, --yes                   skip prompts
-  -h, --help                  help for scale
+  -n, --name string     name of the cluster
+  -r, --region string   aws region of the cluster
+  -y, --yes             skip prompts
+  -h, --help            help for configure
 ```
 
 ## cluster down
@@ -183,7 +181,7 @@ Usage:
   cortex env list [flags]
 
 Flags:
-  -o, --output string   output format: one of pretty|json (default "pretty")
+  -o, --output string   output format: one of pretty|json|yaml (default "pretty")
   -h, --help            help for list
 ```
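Not part of the file above — purely as an illustration of the new workflow, using only the flags shown in the help text (the config file name is made up):

```text
# edit the nodegroups in the cluster config file (add/remove entries,
# change min_instances/max_instances), then apply it:
cortex cluster configure cluster.yaml

# the cluster name and region can also be passed explicitly, skipping prompts:
cortex cluster configure cluster.yaml --name my-cluster --region us-east-1 --yes
```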
16 changes: 11 additions & 5 deletions docs/clusters/instances/multi.md
@@ -4,11 +4,9 @@ Cortex can be configured to provision different instance types to improve worklo
 
 ## Best practices
 
-**Node groups with lower indices have higher priority.**
-
-1. Spot node groups should be listed before on-demand node groups.
-1. CPU node groups should be listed before GPU/Inferentia node groups.
-1. Node groups with small instance types should be listed before node groups with large instance types.
+1. Spot node groups should have a higher priority than on-demand node groups.
+1. CPU node groups should have higher priorities than GPU/Inferentia node groups.
+1. Node groups with small instance types should have higher priorities than node groups with large instance types.
 
 ## Examples
 
@@ -22,6 +20,7 @@ node_groups:
     instance_type: m5.large
     min_instances: 0
     max_instances: 5
+    priority: 100
     spot: true
     spot_config:
       instance_distribution: [m5a.large, m5d.large, m5n.large, m5ad.large, m5dn.large, m4.large, t3.large, t3a.large, t2.large]
@@ -41,6 +40,7 @@ node_groups:
     instance_type: m5.large
     min_instances: 0
     max_instances: 5
+    priority: 100
   - name: gpu
     instance_type: g4dn.xlarge
     min_instances: 0
@@ -61,17 +61,20 @@ node_groups:
     instance_type: m5.large
     min_instances: 0
     max_instances: 5
+    priority: 100
     spot: true
     spot_config:
       instance_distribution: [m5a.large, m5d.large, m5n.large, m5ad.large, m5dn.large, m4.large, t3.large, t3a.large, t2.large]
   - name: cpu-on-demand
     instance_type: m5.large
     min_instances: 0
     max_instances: 5
+    priority: 50
   - name: gpu-spot
     instance_type: g4dn.xlarge
     min_instances: 0
     max_instances: 5
+    priority: 20
     spot: true
   - name: gpu-on-demand
     instance_type: g4dn.xlarge
@@ -89,16 +92,19 @@ node_groups:
     instance_type: t3.medium
     min_instances: 0
     max_instances: 5
+    priority: 100
     spot: true
   - name: cpu-2
     instance_type: m5.2xlarge
     min_instances: 0
     max_instances: 5
+    priority: 70
     spot: true
   - name: cpu-3
     instance_type: m5.8xlarge
     min_instances: 0
     max_instances: 5
+    priority: 30
     spot: true
   - name: cpu-4
     instance_type: m5.24xlarge
3 changes: 2 additions & 1 deletion docs/clusters/management/create.md
@@ -29,12 +29,13 @@ region: us-east-1
 # list of availability zones for your region
 availability_zones: # default: 3 random availability zones in your region, e.g. [us-east-1a, us-east-1b, us-east-1c]
 
-# list of cluster node groups; the smaller index, the higher the priority of the node group
+# list of cluster node groups
 node_groups:
   - name: ng-cpu # name of the node group
     instance_type: m5.large # instance type
     min_instances: 1 # minimum number of instances
     max_instances: 5 # maximum number of instances
+    priority: 1 # priority of the node group; the higher the value, the higher the priority
     instance_volume_size: 50 # disk storage size per instance (GB)
     instance_volume_type: gp3 # instance volume type [gp2 | gp3 | io1 | st1 | sc1]
     # instance_volume_iops: 3000 # instance volume iops (only applicable to io1/gp3)