Add support for multi-instance-type clusters to AWS/GCP providers #1951

Merged: 45 commits from feature/multi-instance-type-clusters into master, Mar 15, 2021

Commits (45)
6fd42d6
WIP on MICT
RobertLucian Mar 9, 2021
c5585a3
Construct install pricing confirmation
RobertLucian Mar 9, 2021
d1d7158
Validate node memory capacity (might have to revert this)
RobertLucian Mar 10, 2021
72d46cd
Function to print cluster info
RobertLucian Mar 10, 2021
b6755d5
Fix memory capacity function
RobertLucian Mar 10, 2021
0c1e982
Modify the EKS generator accordingly
RobertLucian Mar 10, 2021
7e4d582
Changes to the max length of ng names + autoscaler
RobertLucian Mar 10, 2021
13022c6
Fix naming of the ng in EKS generator
RobertLucian Mar 10, 2021
c0bf615
Fix cloud formation discovery
RobertLucian Mar 10, 2021
c457fa4
Fix default values for ng name
RobertLucian Mar 10, 2021
36824f9
Default value for on demand backup is false
RobertLucian Mar 11, 2021
d80d208
Small UX/code style changes to AWS
RobertLucian Mar 11, 2021
cd7f930
Add correct tolerations for APIs
RobertLucian Mar 11, 2021
aecf9a7
MITC for GCP
RobertLucian Mar 11, 2021
204f4ce
Lint
RobertLucian Mar 11, 2021
7e32e18
Remove on-demand backup instances
RobertLucian Mar 11, 2021
f30e1d8
Remove on-demand backup instance p2
RobertLucian Mar 11, 2021
0f94de1
Merge branch 'master' into feature/multi-instance-type-clusters
RobertLucian Mar 11, 2021
0340f18
Fixes for the CA
RobertLucian Mar 11, 2021
797a1e0
Add node affinities to deployments
RobertLucian Mar 11, 2021
3c49b03
Improve scheduling-resulted score for pods
RobertLucian Mar 11, 2021
c2246bd
Address cortex(-gcp) cluster info commands
RobertLucian Mar 12, 2021
7762852
Install nvidia/inf daemonsets regardless
RobertLucian Mar 12, 2021
8fa0b6e
Cluster configure
RobertLucian Mar 12, 2021
4953f15
Fixes with cluster configure command
RobertLucian Mar 12, 2021
39fd1a9
Fully address cluster configure command
RobertLucian Mar 12, 2021
310c24d
Fix cluster info debug command
RobertLucian Mar 12, 2021
be91510
Allow update rollback complete state
RobertLucian Mar 12, 2021
c250008
Update .circleci test cluster config
RobertLucian Mar 12, 2021
84980e3
Improve code readability in refresh.sh
RobertLucian Mar 12, 2021
6764d00
Merge branch 'master' into feature/multi-instance-type-clusters
RobertLucian Mar 12, 2021
afeca89
Fix Inferentia in EKS generator
RobertLucian Mar 12, 2021
8af132b
Min memory update logic
vishalbollu Mar 12, 2021
d5722c8
Address PR comments
RobertLucian Mar 13, 2021
7f84e19
Fix memory capacity allocations
RobertLucian Mar 13, 2021
36399c2
Merge branch 'master' into feature/multi-instance-type-clusters
RobertLucian Mar 13, 2021
5496ff4
Fixes and nits
RobertLucian Mar 13, 2021
5ae7271
Merge branch 'master' into feature/multi-instance-type-clusters
RobertLucian Mar 15, 2021
cedd983
Fix GeneratePreferredNodeAffinities function
RobertLucian Mar 15, 2021
dcab83b
Add test
RobertLucian Mar 15, 2021
cbfa90c
Add node selectors to gpu/inf ds
RobertLucian Mar 15, 2021
c49b3bf
DCGM node selectors + nvidia selector + e2e expectations
RobertLucian Mar 15, 2021
7559600
Apply all tolerations to deployments
RobertLucian Mar 15, 2021
2c9fdfe
Remove unnecessary parameter in Go
RobertLucian Mar 15, 2021
4f30aa9
Merge branch 'master' into feature/multi-instance-type-clusters
RobertLucian Mar 15, 2021
42 changes: 33 additions & 9 deletions .circleci/config.yml
@@ -146,15 +146,27 @@ jobs:
echo 'export AWS_SECRET_ACCESS_KEY=${NIGHTLY_AWS_SECRET_ACCESS_KEY}' >> $BASH_ENV
- run:
name: Generate Cluster Config
# using a variety of node groups to test the multi-instance-type cluster functionality
command: |
cat << EOF > ./cluster.yaml
cluster_name: cortex
provider: aws
cluster_name: cortex
region: us-east-1
instance_type: g4dn.xlarge
min_instances: 1
max_instances: 2
bucket: cortex-dev-nightly
node_groups:
- name: spot
instance_type: t3.medium
min_instances: 0
max_instances: 1
spot: true
- name: cpu
instance_type: c5.xlarge
min_instances: 1
max_instances: 2
- name: gpu
instance_type: g4dn.xlarge
min_instances: 1
max_instances: 2
EOF
- run-e2e-tests:
provider: aws
@@ -174,16 +186,28 @@ jobs:
echo 'export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/google_service_account.json' >> $BASH_ENV
- run:
name: Generate Cluster Config
# using a variety of node pools to test the multi-instance-type cluster functionality
command: |
cat << EOF > ./cluster.yaml
provider: gcp
cluster_name: cortex
project: cortexlabs-dev
zone: us-east1-c
provider: gcp
instance_type: n1-standard-2
accelerator_type: nvidia-tesla-t4
min_instances: 1
max_instances: 2
node_pools:
- name: preemptible
instance_type: n1-standard-2
min_instances: 0
max_instances: 1
preemptible: true
- name: cpu
instance_type: n1-standard-2
min_instances: 1
max_instances: 2
- name: gpu
instance_type: n1-standard-2
accelerator_type: nvidia-tesla-t4
min_instances: 1
max_instances: 2
EOF
- run-e2e-tests:
provider: gcp
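Both CI configs above drop the single top-level instance_type/min_instances/max_instances block in favor of a node_groups (AWS) or node_pools (GCP) list. For orientation, here is a rough Go sketch of the shape of one such entry, inferred from the fields the diffs below reference (ng.Spot, nodePool.Preemptible, nodePool.AcceleratorType, and so on); the struct and field names are illustrative only, and the PR's actual types live in pkg/types/clusterconfig (imported below) and may differ:

// Illustrative sketch of one node_groups / node_pools entry, inferred from
// the CI configs above and the fields referenced in the Go diffs below.
// Not the PR's actual type definition.
type NodeGroup struct {
	Name                    string  `yaml:"name"`
	InstanceType            string  `yaml:"instance_type"`
	MinInstances            int64   `yaml:"min_instances"`
	MaxInstances            int64   `yaml:"max_instances"`
	Spot                    bool    `yaml:"spot"`                      // AWS node groups
	Preemptible             bool    `yaml:"preemptible"`               // GCP node pools
	AcceleratorType         *string `yaml:"accelerator_type"`          // GCP, e.g. nvidia-tesla-t4
	AcceleratorsPerInstance *int64  `yaml:"accelerators_per_instance"` // GCP, paired with AcceleratorType
	InstanceVolumeSize      int64   `yaml:"instance_volume_size"`      // AWS, used by the pricing code below
}

Each group scales independently between its min_instances and max_instances, which is what lets the nightly clusters mix spot/preemptible, CPU, and GPU capacity in a single cluster.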
90 changes: 55 additions & 35 deletions cli/cmd/cluster.go
@@ -302,7 +302,7 @@ var _clusterConfigureCmd = &cobra.Command{
exit.Error(err)
}

accessConfig, err := getClusterAccessConfigWithCache()
accessConfig, err := getNewClusterAccessConfig(clusterConfigFile)
if err != nil {
exit.Error(err)
}
@@ -317,7 +317,7 @@ var _clusterConfigureCmd = &cobra.Command{
exit.Error(err)
}

err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete)
err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete, clusterstate.StatusUpdateComplete, clusterstate.StatusUpdateRollbackComplete)
if err != nil {
exit.Error(err)
}
@@ -527,7 +527,7 @@ var _clusterExportCmd = &cobra.Command{
exit.Error(err)
}

err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete)
err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete, clusterstate.StatusUpdateComplete, clusterstate.StatusUpdateRollbackComplete)
if err != nil {
exit.Error(err)
}
@@ -668,7 +668,7 @@ func printInfoClusterState(awsClient *aws.Client, accessConfig *clusterconfig.Ac
fmt.Println()
}

err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete)
err = clusterstate.AssertClusterStatus(accessConfig.ClusterName, accessConfig.Region, clusterState.Status, clusterstate.StatusCreateComplete, clusterstate.StatusUpdateComplete, clusterstate.StatusUpdateRollbackComplete)
if err != nil {
return err
}
@@ -679,6 +679,12 @@ func printInfoOperatorResponse(clusterConfig clusterconfig.Config, operatorEndpo
func printInfoOperatorResponse(clusterConfig clusterconfig.Config, operatorEndpoint string) error {
fmt.Print("fetching cluster status ...\n\n")

yamlBytes, err := yaml.Marshal(clusterConfig)
if err != nil {
return err
}
yamlString := string(yamlBytes)

operatorConfig := cluster.OperatorConfig{
Telemetry: isTelemetryEnabled(),
ClientID: clientID(),
@@ -688,42 +694,67 @@

infoResponse, err := cluster.Info(operatorConfig)
if err != nil {
fmt.Println(clusterConfig.UserStr())
fmt.Println(yamlString)
return err
}
infoResponse.ClusterConfig.Config = clusterConfig

printInfoClusterConfig(infoResponse)
fmt.Println(console.Bold("metadata:"))
fmt.Println(fmt.Sprintf("aws access key id: %s", infoResponse.MaskedAWSAccessKeyID))
fmt.Println(fmt.Sprintf("%s: %s", clusterconfig.APIVersionUserKey, infoResponse.ClusterConfig.APIVersion))

fmt.Println()
fmt.Println(console.Bold("cluster config:"))
fmt.Print(yamlString)

printInfoPricing(infoResponse, clusterConfig)
printInfoNodes(infoResponse)

return nil
}

func printInfoClusterConfig(infoResponse *schema.InfoResponse) {
var items table.KeyValuePairs
items.Add("aws access key id", infoResponse.MaskedAWSAccessKeyID)
items.AddAll(infoResponse.ClusterConfig.UserTable())
items.Print()
}

func printInfoPricing(infoResponse *schema.InfoResponse, clusterConfig clusterconfig.Config) {
numAPIInstances := len(infoResponse.NodeInfos)

var totalAPIInstancePrice float64
for _, nodeInfo := range infoResponse.NodeInfos {
totalAPIInstancePrice += nodeInfo.Price
}

eksPrice := aws.EKSPrices[clusterConfig.Region]
operatorInstancePrice := aws.InstanceMetadatas[clusterConfig.Region]["t3.medium"].Price
operatorEBSPrice := aws.EBSMetadatas[clusterConfig.Region]["gp2"].PriceGB * 20 / 30 / 24
metricsEBSPrice := aws.EBSMetadatas[clusterConfig.Region]["gp2"].PriceGB * 40 / 30 / 24
nlbPrice := aws.NLBMetadatas[clusterConfig.Region].Price
natUnitPrice := aws.NATMetadatas[clusterConfig.Region].Price
apiEBSPrice := aws.EBSMetadatas[clusterConfig.Region][clusterConfig.InstanceVolumeType.String()].PriceGB * float64(clusterConfig.InstanceVolumeSize) / 30 / 24
if clusterConfig.InstanceVolumeType.String() == "io1" && clusterConfig.InstanceVolumeIOPS != nil {
apiEBSPrice += aws.EBSMetadatas[clusterConfig.Region][clusterConfig.InstanceVolumeType.String()].PriceIOPS * float64(*clusterConfig.InstanceVolumeIOPS) / 30 / 24

headers := []table.Header{
{Title: "aws resource"},
{Title: "cost per hour"},
}

var rows [][]interface{}
rows = append(rows, []interface{}{"1 eks cluster", s.DollarsMaxPrecision(eksPrice)})

var totalNodeGroupsPrice float64
for _, ng := range clusterConfig.NodeGroups {
var ngNamePrefix string
if ng.Spot {
ngNamePrefix = "cx-ws-"
} else {
ngNamePrefix = "cx-wd-"
}
nodesInfo := infoResponse.GetNodesWithNodeGroupName(ngNamePrefix + ng.Name)
numInstances := len(nodesInfo)

ebsPrice := aws.EBSMetadatas[clusterConfig.Region][ng.InstanceVolumeType.String()].PriceGB * float64(ng.InstanceVolumeSize) / 30 / 24
if ng.InstanceVolumeType.String() == "io1" && ng.InstanceVolumeIOPS != nil {
ebsPrice += aws.EBSMetadatas[clusterConfig.Region][ng.InstanceVolumeType.String()].PriceIOPS * float64(*ng.InstanceVolumeIOPS) / 30 / 24
}
totalEBSPrice := ebsPrice * float64(numInstances)

totalInstancePrice := float64(0)
for _, nodeInfo := range nodesInfo {
totalInstancePrice += nodeInfo.Price
}

rows = append(rows, []interface{}{fmt.Sprintf("nodegroup %s: %d (out of %d) %s for your apis", ng.Name, numInstances, ng.MaxInstances, s.PluralS("instance", numInstances)), s.DollarsAndTenthsOfCents(totalInstancePrice) + " total"})
rows = append(rows, []interface{}{fmt.Sprintf("nodegroup %s: %d (out of %d) %dgb ebs %s for your apis", ng.Name, numInstances, ng.MaxInstances, ng.InstanceVolumeSize, s.PluralS("volume", numInstances)), s.DollarsAndTenthsOfCents(totalEBSPrice) + " total"})

totalNodeGroupsPrice += totalEBSPrice + totalInstancePrice
}

var natTotalPrice float64
@@ -732,20 +763,9 @@ func printInfoPricing(infoResponse *schema.InfoResponse, clusterConfig clusterco
} else if clusterConfig.NATGateway == clusterconfig.HighlyAvailableNATGateway {
natTotalPrice = natUnitPrice * float64(len(clusterConfig.AvailabilityZones))
}

totalPrice := eksPrice + totalAPIInstancePrice + apiEBSPrice*float64(numAPIInstances) +
operatorInstancePrice*2 + operatorEBSPrice + metricsEBSPrice + nlbPrice*2 + natTotalPrice
totalPrice := eksPrice + totalNodeGroupsPrice + operatorInstancePrice*2 + operatorEBSPrice + metricsEBSPrice + nlbPrice*2 + natTotalPrice
fmt.Printf(console.Bold("\nyour cluster currently costs %s per hour\n\n"), s.DollarsAndCents(totalPrice))

headers := []table.Header{
{Title: "aws resource"},
{Title: "cost per hour"},
}

var rows [][]interface{}
rows = append(rows, []interface{}{"1 eks cluster", s.DollarsMaxPrecision(eksPrice)})
rows = append(rows, []interface{}{fmt.Sprintf("%d %s for your apis", numAPIInstances, s.PluralS("instance", numAPIInstances)), s.DollarsAndTenthsOfCents(totalAPIInstancePrice) + " total"})
rows = append(rows, []interface{}{fmt.Sprintf("%d %dgb ebs %s for your apis", numAPIInstances, clusterConfig.InstanceVolumeSize, s.PluralS("volume", numAPIInstances)), s.DollarsAndTenthsOfCents(apiEBSPrice*float64(numAPIInstances)) + " total"})
rows = append(rows, []interface{}{"2 t3.medium instances for cortex", s.DollarsMaxPrecision(operatorInstancePrice * 2)})
rows = append(rows, []interface{}{"1 20gb ebs volume for the operator", s.DollarsAndTenthsOfCents(operatorEBSPrice)})
rows = append(rows, []interface{}{"1 40gb ebs volume for prometheus", s.DollarsAndTenthsOfCents(metricsEBSPrice)})
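The reworked pricing logic above iterates over node groups rather than assuming one instance type: for each group it looks up live nodes by the prefixed group name, sums their instance prices, and adds EBS costs converted from AWS's per-GB-month quote to an hourly figure via PriceGB * size / 30 / 24 (plus a PriceIOPS term for io1 volumes). A minimal runnable sketch of that conversion, using an assumed example price rather than a live AWS quote:

package main

import "fmt"

// hourlyEBSPrice mirrors the conversion in printInfoPricing above: EBS is
// quoted per GB-month, so divide by 30 days and then 24 hours to get the
// hourly cost of a single volume.
func hourlyEBSPrice(priceGBMonth float64, volumeSizeGB int64) float64 {
	return priceGBMonth * float64(volumeSizeGB) / 30 / 24
}

func main() {
	// Assumed example: a 50gb gp2 volume at $0.10/GB-month comes to about
	// $0.0069/hour; the real code multiplies this by the number of live
	// instances in the node group for the "ebs volumes" table row.
	fmt.Printf("$%.4f/hour\n", hourlyEBSPrice(0.10, 50))
}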
130 changes: 71 additions & 59 deletions cli/cmd/cluster_gcp.go
@@ -35,6 +35,7 @@ import (
"github.com/cortexlabs/cortex/pkg/lib/telemetry"
"github.com/cortexlabs/cortex/pkg/types"
"github.com/cortexlabs/cortex/pkg/types/clusterconfig"
"github.com/cortexlabs/yaml"
"github.com/spf13/cobra"
containerpb "google.golang.org/genproto/googleapis/container/v1"
)
@@ -373,7 +374,18 @@ func printInfoOperatorResponseGCP(accessConfig *clusterconfig.GCPAccessConfig, o
return err
}

infoResponse.ClusterConfig.UserTable().Print()
yamlBytes, err := yaml.Marshal(infoResponse.ClusterConfig.GCPConfig)
if err != nil {
return err
}
yamlString := string(yamlBytes)

fmt.Println(console.Bold("metadata:"))
fmt.Println(fmt.Sprintf("%s: %s", clusterconfig.APIVersionUserKey, infoResponse.ClusterConfig.APIVersion))

fmt.Println()
fmt.Println(console.Bold("cluster config:"))
fmt.Print(yamlString)

return nil
}
@@ -448,25 +460,9 @@ func updateGCPCLIEnv(envName string, operatorEndpoint string, disallowPrompt boo
func createGKECluster(clusterConfig *clusterconfig.GCPConfig, gcpClient *gcp.Client) error {
fmt.Print("○ creating GKE cluster ")

nodeLabels := map[string]string{"workload": "true"}
var accelerators []*containerpb.AcceleratorConfig

if clusterConfig.AcceleratorType != nil {
accelerators = append(accelerators, &containerpb.AcceleratorConfig{
AcceleratorCount: *clusterConfig.AcceleratorsPerInstance,
AcceleratorType: *clusterConfig.AcceleratorType,
})
nodeLabels["nvidia.com/gpu"] = "present"
}

gkeClusterParent := fmt.Sprintf("projects/%s/locations/%s", clusterConfig.Project, clusterConfig.Zone)
gkeClusterName := fmt.Sprintf("%s/clusters/%s", gkeClusterParent, clusterConfig.ClusterName)

initialNodeCount := int64(1)
if clusterConfig.MinInstances > 0 {
initialNodeCount = clusterConfig.MinInstances
}

gkeClusterConfig := containerpb.Cluster{
Name: clusterConfig.ClusterName,
InitialClusterVersion: "1.18",
@@ -488,52 +484,68 @@
Locations: []string{clusterConfig.Zone},
}

if clusterConfig.Preemptible {
gkeClusterConfig.NodePools = append(gkeClusterConfig.NodePools, &containerpb.NodePool{
Name: "ng-cortex-wk-preemp",
Config: &containerpb.NodeConfig{
MachineType: clusterConfig.InstanceType,
Labels: nodeLabels,
Taints: []*containerpb.NodeTaint{
{
Key: "workload",
Value: "true",
Effect: containerpb.NodeTaint_NO_SCHEDULE,
for _, nodePool := range clusterConfig.NodePools {
nodeLabels := map[string]string{"workload": "true"}
initialNodeCount := int64(1)
if nodePool.MinInstances > 0 {
initialNodeCount = nodePool.MinInstances
}

var accelerators []*containerpb.AcceleratorConfig
if nodePool.AcceleratorType != nil {
accelerators = append(accelerators, &containerpb.AcceleratorConfig{
AcceleratorCount: *nodePool.AcceleratorsPerInstance,
AcceleratorType: *nodePool.AcceleratorType,
})
nodeLabels["nvidia.com/gpu"] = "present"
}

if nodePool.Preemptible {
gkeClusterConfig.NodePools = append(gkeClusterConfig.NodePools, &containerpb.NodePool{
Name: "cx-ws-" + nodePool.Name,
Config: &containerpb.NodeConfig{
MachineType: nodePool.InstanceType,
Labels: nodeLabels,
Taints: []*containerpb.NodeTaint{
{
Key: "workload",
Value: "true",
Effect: containerpb.NodeTaint_NO_SCHEDULE,
},
},
},
Accelerators: accelerators,
OauthScopes: []string{
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
},
ServiceAccount: gcpClient.ClientEmail,
Preemptible: true,
},
InitialNodeCount: int32(initialNodeCount),
})
}
if clusterConfig.OnDemandBackup || !clusterConfig.Preemptible {
gkeClusterConfig.NodePools = append(gkeClusterConfig.NodePools, &containerpb.NodePool{
Name: "ng-cortex-wk-on-dmd",
Config: &containerpb.NodeConfig{
MachineType: clusterConfig.InstanceType,
Labels: nodeLabels,
Taints: []*containerpb.NodeTaint{
{
Key: "workload",
Value: "true",
Effect: containerpb.NodeTaint_NO_SCHEDULE,
Accelerators: accelerators,
OauthScopes: []string{
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
},
ServiceAccount: gcpClient.ClientEmail,
Preemptible: true,
},
Accelerators: accelerators,
OauthScopes: []string{
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
InitialNodeCount: int32(initialNodeCount),
})
} else {
gkeClusterConfig.NodePools = append(gkeClusterConfig.NodePools, &containerpb.NodePool{
Name: "cx-wd-" + nodePool.Name,
Config: &containerpb.NodeConfig{
MachineType: nodePool.InstanceType,
Labels: nodeLabels,
Taints: []*containerpb.NodeTaint{
{
Key: "workload",
Value: "true",
Effect: containerpb.NodeTaint_NO_SCHEDULE,
},
},
Accelerators: accelerators,
OauthScopes: []string{
"https://www.googleapis.com/auth/compute",
"https://www.googleapis.com/auth/devstorage.read_only",
},
ServiceAccount: gcpClient.ClientEmail,
},
ServiceAccount: gcpClient.ClientEmail,
},
InitialNodeCount: int32(initialNodeCount),
})
InitialNodeCount: int32(initialNodeCount),
})
}
}

if clusterConfig.Network != nil {
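Note the naming convention shared across providers: the GKE generator above creates cx-ws- pools for preemptible capacity and cx-wd- pools for on-demand, the same prefixes printInfoPricing uses to find AWS node groups. A tiny helper capturing that convention (illustrative; the PR inlines this logic rather than factoring it out):

// nodeGroupPrefix returns the node group/pool name prefix used by both the
// AWS pricing lookup and the GKE node pool generator in this PR: "cx-ws-"
// for spot/preemptible capacity and "cx-wd-" for on-demand capacity.
// Illustrative sketch; not a function defined in the PR.
func nodeGroupPrefix(spotOrPreemptible bool) string {
	if spotOrPreemptible {
		return "cx-ws-"
	}
	return "cx-wd-"
}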