Commit e90e8f6

Check instance quotas during cluster creation (#1537)
1 parent ad02ec5 commit e90e8f6

6 files changed, with 87 additions and 18 deletions

docs/cluster-management/ec2-instances.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ There are a variety of instance types to choose from when creating a Cortex clus

 This is not a comprehensive guide so please refer to the [AWS's documentation](https://aws.amazon.com/ec2/instance-types/) for more information.

-Note: you may have limited (or no) access to certain instance types. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting an instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
+Note: There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have a different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 | Instance Type | CPU | Memory | GPU Memory | Starting price per hour* | Notes |
 | :--- | :--- | :--- | :--- | :--- | :--- |

docs/cluster-management/spot-instances.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ Spot instances can be mixed with on-demand instances by configuring `on_demand_b

 Even if multiple instances are specified in your `instance_distribution` on-demand instances are mixed, there is still a possibility of running into scale up issues when attempting to spin up spot instances. Spot instance requests may not be fulfilled for several reasons. Spot instance pricing fluctuates, therefore the `max_price` may be lower than the current spot pricing rate. Another possibility could be that the availability zones of the cluster ran out of spot instances. `on_demand_backup` can be used mitigate the impact of unfulfilled spot requests by enabling the cluster to spin up on-demand instances if spot instance requests are not fulfilled within 5 minutes.

-There is a spot instance limit associated with your AWS account for each region. You can check your current limit [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:) (set the region in the upper right corner to your desired region, and search for "spot"). Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+There is a spot instance limit associated with your AWS account for each instance family in each region. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have a different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 ## Example spot configuration

docs/troubleshooting/stuck-updating.md

Lines changed: 1 addition & 5 deletions
@@ -37,11 +37,7 @@ On the old UI:

 ![old ui](https://user-images.githubusercontent.com/808475/78153350-7e9eb480-742a-11ea-9221-1f6559db45fd.png)

-The most common reason AWS is unable to provision instances is that you have reached your instance limit:
-
-* **on-demand instances**: You may have limited access to your requested instance type. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting your instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
-
-* **spot instances**: You may have limited access to spot instances in your region. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "spot" in the search box. Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+The most common reason AWS is unable to provision instances is that you have reached your instance limit. There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have a different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 If you are using spot instances and don't have `on_demand_backup` set to true, it is also possible that AWS has run out of spot instances for your requested instance type and region. You can enable `on_demand_backup` to allow Cortex to fall back to on-demand instances when spot instances are unavailable, or you can try adding additional alternative instance types in `instance_distribution`. See our [spot documentation](../cluster-management/spot-instances.md).
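Since the docs in this commit state that quotas count vCPUs rather than instances, converting a quota into an instance count is a single integer division. The Go sketch below illustrates that conversion only; `maxInstancesForQuota` is a hypothetical helper that does not exist in Cortex, and the figure of 4 vCPUs per `g4dn.xlarge` is taken from the example in the removed documentation text (4 instances needing a 16 vCPU limit).

```go
package main

import "fmt"

// maxInstancesForQuota is a hypothetical helper (not part of Cortex).
// It returns how many whole instances fit within a vCPU-based quota.
func maxInstancesForQuota(vCPUQuota, vCPUPerInstance int64) int64 {
	if vCPUPerInstance <= 0 {
		return 0 // guard against division by zero for unknown instance types
	}
	return vCPUQuota / vCPUPerInstance // integer division discards the remainder
}

func main() {
	// A 16 vCPU quota with 4 vCPUs per instance allows 4 instances.
	fmt.Println(maxInstancesForQuota(16, 4))
	// A 10 vCPU quota only fits 2 such instances; 2 vCPUs go unused.
	fmt.Println(maxInstancesForQuota(10, 4))
}
```

The remainder is discarded because a partially covered instance cannot be launched.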

pkg/lib/aws/errors.go

Lines changed: 5 additions & 4 deletions
@@ -33,7 +33,7 @@ const (
 	ErrAuth = "aws.auth"
 	ErrBucketInaccessible = "aws.bucket_inaccessible"
 	ErrBucketNotFound = "aws.bucket_not_found"
-	ErrInstanceTypeLimitIsZero = "aws.instance_type_limit_is_zero"
+	ErrInsufficientInstanceQuota = "aws.insufficient_instance_quota"
 	ErrNoValidSpotPrices = "aws.no_valid_spot_prices"
 	ErrReadCredentials = "aws.read_credentials"
 	ErrECRExtractingCredentials = "aws.ecr_failed_credentials"
@@ -128,10 +128,11 @@ func ErrorBucketNotFound(bucket string) error {
 	})
 }

-func ErrorInstanceTypeLimitIsZero(instanceType string, region string) error {
+func ErrorInsufficientInstanceQuota(instanceType string, lifecycle string, region string, requiredInstances int64, vCPUPerInstance int64, vCPUQuota int64, quotaCode string) error {
+	url := fmt.Sprintf("https://%s.console.aws.amazon.com/servicequotas/home?region=%s#!/services/ec2/quotas/%s", region, region, quotaCode)
 	return errors.WithStack(&errors.Error{
-		Kind: ErrInstanceTypeLimitIsZero,
-		Message: fmt.Sprintf(`you don't have access to %s instances in %s; please request access in the appropriate region (https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances). If you submitted a request and it was recently approved, please allow ~30 minutes for AWS to reflect this change."`, instanceType, region),
+		Kind: ErrInsufficientInstanceQuota,
+		Message: fmt.Sprintf("your cluster may require up to %d %s %s instances, but your AWS quota for %s %s instances in %s is only %d vCPU (there are %d vCPUs per %s instance); please reduce the maximum number of %s %s instances your cluster may use (e.g. by changing max_instances and/or spot_config if applicable), or request a quota increase to at least %d vCPU here: %s (if your request was recently approved, please allow ~30 minutes for AWS to reflect this change)", requiredInstances, lifecycle, instanceType, lifecycle, instanceType, region, vCPUQuota, vCPUPerInstance, instanceType, lifecycle, instanceType, requiredInstances*vCPUPerInstance, url),
 	})
 }
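The new error message derives two values from its arguments: a deep link to the specific quota in the Service Quotas console, and the minimum vCPU quota to request. The sketch below isolates those two computations so they can be checked by hand. The function names `quotaURL` and `requiredVCPU` are hypothetical (the commit computes both inline), and `L-EXAMPLE` is a placeholder, not a real AWS quota code.

```go
package main

import "fmt"

// quotaURL reproduces the format string used in ErrorInsufficientInstanceQuota.
// The region appears twice: once as the console subdomain, once as a query parameter.
func quotaURL(region, quotaCode string) string {
	return fmt.Sprintf("https://%s.console.aws.amazon.com/servicequotas/home?region=%s#!/services/ec2/quotas/%s", region, region, quotaCode)
}

// requiredVCPU is the "at least %d vCPU" value in the message:
// the instance count multiplied by the vCPUs of one instance.
func requiredVCPU(requiredInstances, vCPUPerInstance int64) int64 {
	return requiredInstances * vCPUPerInstance
}

func main() {
	// "L-EXAMPLE" is a placeholder quota code used only for illustration.
	fmt.Println(quotaURL("us-west-2", "L-EXAMPLE"))
	// 5 instances with 4 vCPUs each need a quota of at least 20 vCPUs.
	fmt.Println(requiredVCPU(5, 4))
}
```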

pkg/lib/aws/servicequotas.go

Lines changed: 29 additions & 6 deletions
@@ -31,7 +31,11 @@ var _instanceCategoryRegex = regexp.MustCompile(`[a-zA-Z]+`)
 var _standardInstanceCategories = strset.New("a", "c", "d", "h", "i", "m", "r", "t", "z")
 var _knownInstanceCategories = strset.Union(_standardInstanceCategories, strset.New("p", "g", "inf", "x", "f"))

-func (c *Client) VerifyInstanceQuota(instanceType string) error {
+func (c *Client) VerifyInstanceQuota(instanceType string, requiredOnDemandInstances int64, requiredSpotInstances int64) error {
+	if requiredOnDemandInstances == 0 && requiredSpotInstances == 0 {
+		return nil
+	}
+
 	instanceCategory := _instanceCategoryRegex.FindString(instanceType)

 	// Allow the instance if we don't recognize the type
@@ -43,7 +47,10 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		instanceCategory = "standard"
 	}

-	var cpuLimit *int
+	var onDemandCPUQuota *int64
+	var onDemandQuotaCode string
+	var spotCPUQuota *int64
+	var spotQuotaCode string
 	err := c.ServiceQuotas().ListServiceQuotasPages(
 		&servicequotas.ListServiceQuotasInput{
 			ServiceCode: aws.String("ec2"),
@@ -58,12 +65,20 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 				}

 				metricClass, ok := quota.UsageMetric.MetricDimensions["Class"]
-				if !ok || metricClass == nil || !strings.HasSuffix(*metricClass, "/OnDemand") {
+				if !ok || metricClass == nil || !(strings.HasSuffix(*metricClass, "/OnDemand") || strings.HasSuffix(*metricClass, "/Spot")) {
 					continue
 				}

+				// quota is specified in number of vCPU permitted per family
 				if strings.ToLower(*metricClass) == instanceCategory+"/ondemand" {
-					cpuLimit = pointer.Int(int(*quota.Value)) // quota is specified in number of vCPU permitted per family
+					onDemandCPUQuota = pointer.Int64(int64(*quota.Value))
+					onDemandQuotaCode = *quota.QuotaCode
+				} else if strings.ToLower(*metricClass) == instanceCategory+"/spot" {
+					spotCPUQuota = pointer.Int64(int64(*quota.Value))
+					spotQuotaCode = *quota.QuotaCode
+				}
+
+				if onDemandCPUQuota != nil && spotCPUQuota != nil {
 					return false
 				}
 			}
@@ -74,8 +89,16 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		return errors.WithStack(err)
 	}

-	if cpuLimit != nil && *cpuLimit == 0 {
-		return ErrorInstanceTypeLimitIsZero(instanceType, c.Region)
+	cpuPerInstance := InstanceMetadatas[c.Region][instanceType].CPU
+	requiredOnDemandCPU := requiredOnDemandInstances * cpuPerInstance.Value()
+	requiredSpotCPU := requiredSpotInstances * cpuPerInstance.Value()
+
+	if onDemandCPUQuota != nil && *onDemandCPUQuota < requiredOnDemandCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "on-demand", c.Region, requiredOnDemandInstances, cpuPerInstance.Value(), *onDemandCPUQuota, onDemandQuotaCode)
+	}
+
+	if spotCPUQuota != nil && *spotCPUQuota < requiredSpotCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "spot", c.Region, requiredSpotInstances, cpuPerInstance.Value(), *spotCPUQuota, spotQuotaCode)
 	}

 	return nil
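The matching and comparison logic in `VerifyInstanceQuota` can be read separately from the AWS SDK paging code. The sketch below is a simplified, SDK-free approximation under stated assumptions: the `quota` struct, `findQuota`, and `quotaSufficient` are hypothetical names, and the class strings and quota codes in `main` are illustrative values rather than output from a real account. As in the diff, a quota that cannot be found is treated as sufficient, so the check never blocks on missing data.

```go
package main

import (
	"fmt"
	"strings"
)

// quota is a hypothetical stand-in for the fields of a service quota that the check reads.
type quota struct {
	Class string  // usage metric "Class" dimension, e.g. "G/OnDemand" (illustrative)
	Value float64 // quota value, expressed in vCPUs
	Code  string  // quota code, used to build the console link
}

// findQuota looks up the vCPU quota for an instance category and lifecycle
// ("ondemand" or "spot"), comparing case-insensitively as the diff does.
func findQuota(quotas []quota, category, lifecycle string) (int64, string, bool) {
	want := category + "/" + lifecycle
	for _, q := range quotas {
		if strings.ToLower(q.Class) == want {
			return int64(q.Value), q.Code, true
		}
	}
	return 0, "", false
}

// quotaSufficient reports whether the quota covers the requested instances.
// A missing quota counts as sufficient, mirroring the nil checks in the diff.
func quotaSufficient(quotas []quota, category, lifecycle string, instances, vCPUPerInstance int64) bool {
	value, _, found := findQuota(quotas, category, lifecycle)
	if !found {
		return true
	}
	return value >= instances*vCPUPerInstance
}

func main() {
	// Illustrative data only; the codes are placeholders.
	quotas := []quota{
		{Class: "G/OnDemand", Value: 16, Code: "L-EXAMPLE-1"},
		{Class: "G/Spot", Value: 8, Code: "L-EXAMPLE-2"},
	}
	fmt.Println(quotaSufficient(quotas, "g", "ondemand", 4, 4)) // 16 vCPUs needed, 16 available
	fmt.Println(quotaSufficient(quotas, "g", "spot", 4, 4))     // 16 vCPUs needed, 8 available
}
```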

pkg/types/clusterconfig/clusterconfig.go

Lines changed: 50 additions & 1 deletion
@@ -19,6 +19,7 @@ package clusterconfig
 import (
 	"fmt"
 	"io/ioutil"
+	"math"
 	"net"
 	"net/http"
 	"regexp"
@@ -586,7 +587,7 @@ func (cc *Config) Validate(awsClient *aws.Client) error {
 		cc.InstanceVolumeIOPS = pointer.Int64(libmath.MinInt64(cc.InstanceVolumeSize*50, 3000))
 	}

-	if err := awsClient.VerifyInstanceQuota(primaryInstanceType); err != nil {
+	if err := awsClient.VerifyInstanceQuota(primaryInstanceType, cc.MaxPossibleOnDemandInstances(), cc.MaxPossibleSpotInstances()); err != nil {
 		// Skip AWS errors, since some regions (e.g. eu-north-1) do not support this API
 		if _, ok := errors.CauseOrSelf(err).(awserr.Error); !ok {
 			return errors.Wrap(err, InstanceTypeKey)
@@ -1081,6 +1082,54 @@ func DefaultAccessConfig() (*AccessConfig, error) {
 	return accessConfig, nil
 }

+func (cc *Config) MaxPossibleOnDemandInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || *cc.Spot == false || cc.SpotConfig == nil || cc.SpotConfig.OnDemandBackup == nil || *cc.SpotConfig.OnDemandBackup == true {
+		return *cc.MaxInstances
+	}
+
+	onDemandBaseCap, onDemandPctAboveBaseCap := cc.SpotConfigOnDemandValues()
+
+	return onDemandBaseCap + int64(math.Ceil(float64(onDemandPctAboveBaseCap)/100*float64(*cc.MaxInstances-onDemandBaseCap)))
+}
+
+func (cc *Config) MaxPossibleSpotInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || *cc.Spot == false {
+		return 0
+	}
+
+	if cc.SpotConfig == nil {
+		return *cc.MaxInstances
+	}
+
+	onDemandBaseCap, onDemandPctAboveBaseCap := cc.SpotConfigOnDemandValues()
+
+	return *cc.MaxInstances - onDemandBaseCap - int64(math.Floor(float64(onDemandPctAboveBaseCap)/100*float64(*cc.MaxInstances-onDemandBaseCap)))
+}
+
+func (cc *Config) SpotConfigOnDemandValues() (int64, int64) {
+	// default OnDemandBaseCapacity is 0
+	var onDemandBaseCapacity int64 = 0
+	if cc.SpotConfig.OnDemandBaseCapacity != nil {
+		onDemandBaseCapacity = *cc.SpotConfig.OnDemandBaseCapacity
+	}
+
+	// default OnDemandPercentageAboveBaseCapacity is 0
+	var onDemandPercentageAboveBaseCapacity int64 = 0
+	if cc.SpotConfig.OnDemandPercentageAboveBaseCapacity != nil {
+		onDemandPercentageAboveBaseCapacity = *cc.SpotConfig.OnDemandPercentageAboveBaseCapacity
+	}
+
+	return onDemandBaseCapacity, onDemandPercentageAboveBaseCapacity
+}
+
 func (cc *InternalConfig) UserTable() table.KeyValuePairs {
 	var items table.KeyValuePairs
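A worked example may help with the rounding in `MaxPossibleOnDemandInstances` and `MaxPossibleSpotInstances`. The sketch below copies only the final arithmetic of each method into free functions; `maxOnDemand` and `maxSpot` are hypothetical names, and the sketch assumes the path where spot is enabled, `spot_config` is set, and `on_demand_backup` is explicitly false (the other branches return early in the diff). The on-demand count rounds up and the spot count rounds down the on-demand share, so each is an upper bound for its own lifecycle and the two can add up to more than `max_instances`.

```go
package main

import (
	"fmt"
	"math"
)

// maxOnDemand mirrors the last return statement of MaxPossibleOnDemandInstances:
// the base capacity plus the on-demand percentage of the remainder, rounded up.
func maxOnDemand(maxInstances, baseCap, pctAboveBase int64) int64 {
	return baseCap + int64(math.Ceil(float64(pctAboveBase)/100*float64(maxInstances-baseCap)))
}

// maxSpot mirrors the last return statement of MaxPossibleSpotInstances:
// everything left after the base capacity and the rounded-down on-demand share.
func maxSpot(maxInstances, baseCap, pctAboveBase int64) int64 {
	return maxInstances - baseCap - int64(math.Floor(float64(pctAboveBase)/100*float64(maxInstances-baseCap)))
}

func main() {
	// max_instances=10, base capacity=3, 25% on-demand above base:
	// 25% of the remaining 7 is 1.75, so on-demand = 3 + 2 = 5 and spot = 10 - 3 - 1 = 6.
	fmt.Println(maxOnDemand(10, 3, 25), maxSpot(10, 3, 25))
	// max_instances=10, base capacity=2, 50% on-demand above base:
	// 50% of the remaining 8 is exactly 4, so on-demand = 6 and spot = 4.
	fmt.Println(maxOnDemand(10, 2, 50), maxSpot(10, 2, 50))
}
```

In the first case the two bounds sum to 11 rather than 10, which is consistent with each quota being checked against its own worst case.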
