Check instance quotas during cluster creation (#1537)

deliahu · web-flow · commit e90e8f67ac54 · 2020-11-10T11:45:35.000-05:00
diff --git a/docs/cluster-management/ec2-instances.md b/docs/cluster-management/ec2-instances.md
@@ -6,7 +6,7 @@ There are a variety of instance types to choose from when creating a Cortex clus
 
 This is not a comprehensive guide so please refer to the [AWS's documentation](https://aws.amazon.com/ec2/instance-types/) for more information.
 
-Note: you may have limited (or no) access to certain instance types. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting an instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
+Note: There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instance types have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).
 
 | Instance Type                                           | CPU    | Memory    | GPU Memory               | Starting price per hour* | Notes                             |
 | :---                                                    | :---   | :---      | :---                     | :---                     | :---                              |
diff --git a/docs/cluster-management/spot-instances.md b/docs/cluster-management/spot-instances.md
@@ -37,7 +37,7 @@ Spot instances can be mixed with on-demand instances by configuring `on_demand_b
 
 Even if multiple instance types are specified in your `instance_distribution` and on-demand instances are mixed in, there is still a possibility of running into scale-up issues when attempting to spin up spot instances. Spot instance requests may not be fulfilled for several reasons. Spot instance pricing fluctuates, so the `max_price` may be lower than the current spot price. Another possibility is that the cluster's availability zones have run out of spot instances. `on_demand_backup` can be used to mitigate the impact of unfulfilled spot requests by enabling the cluster to spin up on-demand instances if spot instance requests are not fulfilled within 5 minutes.
 
-There is a spot instance limit associated with your AWS account for each region. You can check your current limit [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:) (set the region in the upper right corner to your desired region, and search for "spot"). Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+There is a spot instance limit associated with your AWS account for each instance family in each region. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instance types have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).
 
 ## Example spot configuration
 
diff --git a/docs/troubleshooting/stuck-updating.md b/docs/troubleshooting/stuck-updating.md
@@ -37,11 +37,7 @@ On the old UI:
 
 ![old ui](https://user-images.githubusercontent.com/808475/78153350-7e9eb480-742a-11ea-9221-1f6559db45fd.png)
 
-The most common reason AWS is unable to provision instances is that you have reached your instance limit:
-
-* **on-demand instances**: You may have limited access to your requested instance type. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting your instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
-
-* **spot instances**: You may have limited access to spot instances in your region. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "spot" in the search box. Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+The most common reason AWS is unable to provision instances is that you have reached your instance limit. There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instance types have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).
 
 If you are using spot instances and don't have `on_demand_backup` set to true, it is also possible that AWS has run out of spot instances for your requested instance type and region. You can enable `on_demand_backup` to allow Cortex to fall back to on-demand instances when spot instances are unavailable, or you can try adding additional alternative instance types in `instance_distribution`. See our [spot documentation](../cluster-management/spot-instances.md).
 
diff --git a/pkg/lib/aws/errors.go b/pkg/lib/aws/errors.go
@@ -33,7 +33,7 @@ const (
 	ErrAuth                         = "aws.auth"
 	ErrBucketInaccessible           = "aws.bucket_inaccessible"
 	ErrBucketNotFound               = "aws.bucket_not_found"
-	ErrInstanceTypeLimitIsZero      = "aws.instance_type_limit_is_zero"
+	ErrInsufficientInstanceQuota    = "aws.insufficient_instance_quota"
 	ErrNoValidSpotPrices            = "aws.no_valid_spot_prices"
 	ErrReadCredentials              = "aws.read_credentials"
 	ErrECRExtractingCredentials     = "aws.ecr_failed_credentials"
@@ -128,10 +128,11 @@ func ErrorBucketNotFound(bucket string) error {
 	})
 }
 
-func ErrorInstanceTypeLimitIsZero(instanceType string, region string) error {
+func ErrorInsufficientInstanceQuota(instanceType string, lifecycle string, region string, requiredInstances int64, vCPUPerInstance int64, vCPUQuota int64, quotaCode string) error {
+	url := fmt.Sprintf("https://%s.console.aws.amazon.com/servicequotas/home?region=%s#!/services/ec2/quotas/%s", region, region, quotaCode)
 	return errors.WithStack(&errors.Error{
-		Kind:    ErrInstanceTypeLimitIsZero,
-		Message: fmt.Sprintf(`you don't have access to %s instances in %s; please request access in the appropriate region (https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances). If you submitted a request and it was recently approved, please allow ~30 minutes for AWS to reflect this change."`, instanceType, region),
+		Kind:    ErrInsufficientInstanceQuota,
+		Message: fmt.Sprintf("your cluster may require up to %d %s %s instances, but your AWS quota for %s %s instances in %s is only %d vCPU (there are %d vCPUs per %s instance); please reduce the maximum number of %s %s instances your cluster may use (e.g. by changing max_instances and/or spot_config if applicable), or request a quota increase to at least %d vCPU here: %s (if your request was recently approved, please allow ~30 minutes for AWS to reflect this change)", requiredInstances, lifecycle, instanceType, lifecycle, instanceType, region, vCPUQuota, vCPUPerInstance, instanceType, lifecycle, instanceType, requiredInstances*vCPUPerInstance, url),
 	})
 }
 
diff --git a/pkg/lib/aws/servicequotas.go b/pkg/lib/aws/servicequotas.go
@@ -31,7 +31,11 @@ var _instanceCategoryRegex = regexp.MustCompile(`[a-zA-Z]+`)
 var _standardInstanceCategories = strset.New("a", "c", "d", "h", "i", "m", "r", "t", "z")
 var _knownInstanceCategories = strset.Union(_standardInstanceCategories, strset.New("p", "g", "inf", "x", "f"))
 
-func (c *Client) VerifyInstanceQuota(instanceType string) error {
+func (c *Client) VerifyInstanceQuota(instanceType string, requiredOnDemandInstances int64, requiredSpotInstances int64) error {
+	if requiredOnDemandInstances == 0 && requiredSpotInstances == 0 {
+		return nil
+	}
+
 	instanceCategory := _instanceCategoryRegex.FindString(instanceType)
 
 	// Allow the instance if we don't recognize the type
@@ -43,7 +47,10 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		instanceCategory = "standard"
 	}
 
-	var cpuLimit *int
+	var onDemandCPUQuota *int64
+	var onDemandQuotaCode string
+	var spotCPUQuota *int64
+	var spotQuotaCode string
 	err := c.ServiceQuotas().ListServiceQuotasPages(
 		&servicequotas.ListServiceQuotasInput{
 			ServiceCode: aws.String("ec2"),
@@ -58,12 +65,20 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 				}
 
 				metricClass, ok := quota.UsageMetric.MetricDimensions["Class"]
-				if !ok || metricClass == nil || !strings.HasSuffix(*metricClass, "/OnDemand") {
+				if !ok || metricClass == nil || !(strings.HasSuffix(*metricClass, "/OnDemand") || strings.HasSuffix(*metricClass, "/Spot")) {
 					continue
 				}
 
+				// quota is specified in number of vCPU permitted per family
 				if strings.ToLower(*metricClass) == instanceCategory+"/ondemand" {
-					cpuLimit = pointer.Int(int(*quota.Value)) // quota is specified in number of vCPU permitted per family
+					onDemandCPUQuota = pointer.Int64(int64(*quota.Value))
+					onDemandQuotaCode = *quota.QuotaCode
+				} else if strings.ToLower(*metricClass) == instanceCategory+"/spot" {
+					spotCPUQuota = pointer.Int64(int64(*quota.Value))
+					spotQuotaCode = *quota.QuotaCode
+				}
+
+				if onDemandCPUQuota != nil && spotCPUQuota != nil {
 					return false
 				}
 			}
@@ -74,8 +89,16 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		return errors.WithStack(err)
 	}
 
-	if cpuLimit != nil && *cpuLimit == 0 {
-		return ErrorInstanceTypeLimitIsZero(instanceType, c.Region)
+	cpuPerInstance := InstanceMetadatas[c.Region][instanceType].CPU
+	requiredOnDemandCPU := requiredOnDemandInstances * cpuPerInstance.Value()
+	requiredSpotCPU := requiredSpotInstances * cpuPerInstance.Value()
+
+	if onDemandCPUQuota != nil && *onDemandCPUQuota < requiredOnDemandCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "on-demand", c.Region, requiredOnDemandInstances, cpuPerInstance.Value(), *onDemandCPUQuota, onDemandQuotaCode)
+	}
+
+	if spotCPUQuota != nil && *spotCPUQuota < requiredSpotCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "spot", c.Region, requiredSpotInstances, cpuPerInstance.Value(), *spotCPUQuota, spotQuotaCode)
 	}
 
 	return nil
diff --git a/pkg/types/clusterconfig/clusterconfig.go b/pkg/types/clusterconfig/clusterconfig.go
@@ -19,6 +19,7 @@ package clusterconfig
 import (
 	"fmt"
 	"io/ioutil"
+	"math"
 	"net"
 	"net/http"
 	"regexp"
@@ -586,7 +587,7 @@ func (cc *Config) Validate(awsClient *aws.Client) error {
 		cc.InstanceVolumeIOPS = pointer.Int64(libmath.MinInt64(cc.InstanceVolumeSize*50, 3000))
 	}
 
-	if err := awsClient.VerifyInstanceQuota(primaryInstanceType); err != nil {
+	if err := awsClient.VerifyInstanceQuota(primaryInstanceType, cc.MaxPossibleOnDemandInstances(), cc.MaxPossibleSpotInstances()); err != nil {
 		// Skip AWS errors, since some regions (e.g. eu-north-1) do not support this API
 		if _, ok := errors.CauseOrSelf(err).(awserr.Error); !ok {
 			return errors.Wrap(err, InstanceTypeKey)
@@ -1081,6 +1082,54 @@ func DefaultAccessConfig() (*AccessConfig, error) {
 	return accessConfig, nil
 }
 
+func (cc *Config) MaxPossibleOnDemandInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || !*cc.Spot || cc.SpotConfig == nil || cc.SpotConfig.OnDemandBackup == nil || *cc.SpotConfig.OnDemandBackup {
+		return *cc.MaxInstances
+	}
+
+	onDemandBaseCap, onDemandPctAboveBaseCap := cc.SpotConfigOnDemandValues()
+
+	return onDemandBaseCap + int64(math.Ceil(float64(onDemandPctAboveBaseCap)/100*float64(*cc.MaxInstances-onDemandBaseCap)))
+}
+
+func (cc *Config) MaxPossibleSpotInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || !*cc.Spot {
+		return 0
+	}
+
+	if cc.SpotConfig == nil {
+		return *cc.MaxInstances
+	}
+
+	onDemandBaseCap, onDemandPctAboveBaseCap := cc.SpotConfigOnDemandValues()
+
+	return *cc.MaxInstances - onDemandBaseCap - int64(math.Floor(float64(onDemandPctAboveBaseCap)/100*float64(*cc.MaxInstances-onDemandBaseCap)))
+}
+
+func (cc *Config) SpotConfigOnDemandValues() (int64, int64) {
+	// default OnDemandBaseCapacity is 0
+	var onDemandBaseCapacity int64
+	if cc.SpotConfig.OnDemandBaseCapacity != nil {
+		onDemandBaseCapacity = *cc.SpotConfig.OnDemandBaseCapacity
+	}
+
+	// default OnDemandPercentageAboveBaseCapacity is 0
+	var onDemandPercentageAboveBaseCapacity int64
+	if cc.SpotConfig.OnDemandPercentageAboveBaseCapacity != nil {
+		onDemandPercentageAboveBaseCapacity = *cc.SpotConfig.OnDemandPercentageAboveBaseCapacity
+	}
+
+	return onDemandBaseCapacity, onDemandPercentageAboveBaseCapacity
+}
+
 func (cc *InternalConfig) UserTable() table.KeyValuePairs {
 	var items table.KeyValuePairs
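
Reviewer note: the split computed by `MaxPossibleOnDemandInstances` and `MaxPossibleSpotInstances` (for the case where spot is enabled and `on_demand_backup` is false) can be checked by hand with a small standalone sketch. The function names `maxOnDemand`, `maxSpot`, and `requiredVCPU` below are hypothetical helpers that only mirror the arithmetic in this diff; they are not part of the Cortex codebase, and the input values are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// maxOnDemand mirrors the on-demand arithmetic from the diff: the base
// capacity plus the on-demand percentage of the remainder, rounded up.
func maxOnDemand(maxInstances, baseCap, pctAboveBase int64) int64 {
	above := float64(pctAboveBase) / 100 * float64(maxInstances-baseCap)
	return baseCap + int64(math.Ceil(above))
}

// maxSpot mirrors the spot arithmetic from the diff: whatever remains after
// the base capacity and the on-demand share (rounded down) are subtracted.
func maxSpot(maxInstances, baseCap, pctAboveBase int64) int64 {
	above := float64(pctAboveBase) / 100 * float64(maxInstances-baseCap)
	return maxInstances - baseCap - int64(math.Floor(above))
}

// requiredVCPU converts an instance count into the vCPU count that is
// compared against the quota, since quotas are expressed in vCPUs.
func requiredVCPU(instances, vCPUPerInstance int64) int64 {
	return instances * vCPUPerInstance
}

func main() {
	// Hypothetical config: max_instances=10, base capacity 2, 25% on-demand
	// above base, and an instance type with 4 vCPUs.
	od := maxOnDemand(10, 2, 25)
	sp := maxSpot(10, 2, 25)
	fmt.Println("on-demand instances:", od, "vCPU needed:", requiredVCPU(od, 4))
	fmt.Println("spot instances:", sp, "vCPU needed:", requiredVCPU(sp, 4))

	// When the percentage does not divide evenly, ceil and floor make both
	// bounds conservative, so their sum can exceed max_instances by one.
	fmt.Println(maxOnDemand(10, 1, 50), maxSpot(10, 1, 50))
}
```

Because the on-demand share is rounded up and the spot remainder is computed from the rounded-down share, each value is an upper bound for its own lifecycle, which seems to be the intent when checking each quota independently.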