Check instance quotas during cluster creation #1537

Merged
merged 9 commits into from
Nov 10, 2020
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/cluster-management/ec2-instances.md
@@ -6,7 +6,7 @@ There are a variety of instance types to choose from when creating a Cortex clus

 This is not a comprehensive guide, so please refer to [AWS's documentation](https://aws.amazon.com/ec2/instance-types/) for more information.

-Note: you may have limited (or no) access to certain instance types. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting an instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
+Note: There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 | Instance Type | CPU | Memory | GPU Memory | Starting price per hour* | Notes |
 | :--- | :--- | :--- | :--- | :--- | :--- |
2 changes: 1 addition & 1 deletion docs/cluster-management/spot-instances.md
@@ -37,7 +37,7 @@ Spot instances can be mixed with on-demand instances by configuring `on_demand_b

 Even if multiple instance types are specified in your `instance_distribution` and on-demand instances are mixed in, there is still a possibility of running into scale-up issues when attempting to spin up spot instances. Spot instance requests may not be fulfilled for several reasons. Spot instance pricing fluctuates, so the `max_price` may be lower than the current spot price. Another possibility is that the availability zones of the cluster have run out of spot instances. `on_demand_backup` can be used to mitigate the impact of unfulfilled spot requests by enabling the cluster to spin up on-demand instances if spot instance requests are not fulfilled within 5 minutes.

-There is a spot instance limit associated with your AWS account for each region. You can check your current limit [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:) (set the region in the upper right corner to your desired region, and search for "spot"). Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+There is a spot instance limit associated with your AWS account for each instance family in each region. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 ## Example spot configuration

6 changes: 1 addition & 5 deletions docs/troubleshooting/stuck-updating.md
@@ -37,11 +37,7 @@ On the old UI:

 ![old ui](https://user-images.githubusercontent.com/808475/78153350-7e9eb480-742a-11ea-9221-1f6559db45fd.png)

-The most common reason AWS is unable to provision instances is that you have reached your instance limit:
-
-* **on-demand instances**: You may have limited access to your requested instance type. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "on-demand" in the search box. You can request a limit by selecting your instance family and clicking "Request limit increase" in the upper right. Note that the limits are vCPU-based no matter the instance type (e.g. to run 4 `g4dn.xlarge` instances, you will need a 16 vCPU limit for G instances).
-
-* **spot instances**: You may have limited access to spot instances in your region. To check your limits, click [here](https://console.aws.amazon.com/ec2/v2/home?#Limits:), set your region in the upper right, and type "spot" in the search box. Note that the listed spot instance limit may misrepresent the actual number of spot instances you can allocate. Your actual spot instance limit depends on the instance type you have requested. In general, you can run a higher number of smaller instance types, or fewer large instance types. For example, even if the limit shows `20`, if you are requesting large instances like `p2.xlarge`, the actual limit may be lower due to the way AWS calculates this limit. If you are not getting the number of spot instances that you are expecting for your instance type, you can request a limit increase [here](https://console.aws.amazon.com/support/home#/case/create?issueType=service-limit-increase&limitType=service-code-ec2-spot-instances).
+The most common reason AWS is unable to provision instances is that you have reached your instance limit. There is an instance limit associated with your AWS account for each instance family in each region, for on-demand and for spot instances. You can check your current limit and request an increase [here](https://console.aws.amazon.com/servicequotas/home?#!/services/ec2/quotas) (set the region in the upper right corner to your desired region, type "on-demand" or "spot" in the search bar, and click on the quota that matches your instance type). Note that the quota values indicate the number of vCPUs available, not the number of instances; different instances have different numbers of vCPUs, which can be seen [here](https://aws.amazon.com/ec2/instance-types/).

 If you are using spot instances and don't have `on_demand_backup` set to true, it is also possible that AWS has run out of spot instances for your requested instance type and region. You can enable `on_demand_backup` to allow Cortex to fall back to on-demand instances when spot instances are unavailable, or you can try adding additional alternative instance types in `instance_distribution`. See our [spot documentation](../cluster-management/spot-instances.md).
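
The vCPU-based quota arithmetic described in the docs changes above can be sketched in Go as follows. This is only an illustration: `requiredVCPUs` and `maxInstancesForQuota` are hypothetical helper names that do not appear in the PR, and the figure of 4 vCPUs per `g4dn.xlarge` comes from the example in the removed docs text (4 instances need a 16 vCPU limit).

```go
package main

import "fmt"

// requiredVCPUs returns the vCPU quota needed to run the given number of
// instances when each instance has vCPUsPerInstance vCPUs.
// (Hypothetical helper, not part of the PR.)
func requiredVCPUs(instances, vCPUsPerInstance int64) int64 {
	return instances * vCPUsPerInstance
}

// maxInstancesForQuota returns how many whole instances fit within a vCPU quota.
// (Hypothetical helper, not part of the PR.)
func maxInstancesForQuota(vCPUQuota, vCPUsPerInstance int64) int64 {
	if vCPUsPerInstance <= 0 {
		return 0
	}
	return vCPUQuota / vCPUsPerInstance
}

func main() {
	// 4 g4dn.xlarge instances at 4 vCPUs each need a 16 vCPU quota for G instances.
	fmt.Println(requiredVCPUs(4, 4)) // 16

	// Conversely, a 16 vCPU quota allows at most 4 such instances.
	fmt.Println(maxInstancesForQuota(16, 4)) // 4
}
```

The same multiplication is what makes a quota value such as `20` misleading when read as an instance count: it only becomes an instance count after dividing by the vCPUs of the chosen instance type.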

9 changes: 5 additions & 4 deletions pkg/lib/aws/errors.go
@@ -33,7 +33,7 @@ const (
 	ErrAuth = "aws.auth"
 	ErrBucketInaccessible = "aws.bucket_inaccessible"
 	ErrBucketNotFound = "aws.bucket_not_found"
-	ErrInstanceTypeLimitIsZero = "aws.instance_type_limit_is_zero"
+	ErrInsufficientInstanceQuota = "aws.insufficient_instance_quota"
 	ErrNoValidSpotPrices = "aws.no_valid_spot_prices"
 	ErrReadCredentials = "aws.read_credentials"
 	ErrECRExtractingCredentials = "aws.ecr_failed_credentials"
@@ -128,10 +128,11 @@ func ErrorBucketNotFound(bucket string) error {
 	})
 }

-func ErrorInstanceTypeLimitIsZero(instanceType string, region string) error {
+func ErrorInsufficientInstanceQuota(instanceType string, lifecycle string, region string, requiredInstances int64, vCPUPerInstance int64, vCPUQuota int64, quotaCode string) error {
+	url := fmt.Sprintf("https://%s.console.aws.amazon.com/servicequotas/home?region=%s#!/services/ec2/quotas/%s", region, region, quotaCode)
 	return errors.WithStack(&errors.Error{
-		Kind: ErrInstanceTypeLimitIsZero,
-		Message: fmt.Sprintf(`you don't have access to %s instances in %s; please request access in the appropriate region (https://console.aws.amazon.com/support/cases#/create?issueType=service-limit-increase&limitType=ec2-instances). If you submitted a request and it was recently approved, please allow ~30 minutes for AWS to reflect this change."`, instanceType, region),
+		Kind: ErrInsufficientInstanceQuota,
+		Message: fmt.Sprintf("your cluster may require up to %d %s %s instances, but your AWS quota for %s %s instances in %s is only %d vCPU (there are %d vCPUs per %s instance); please reduce the maximum number of %s %s instances your cluster may use (e.g. by changing max_instances and/or spot_config if applicable), or request a quota increase to at least %d vCPU here: %s (if your request was recently approved, please allow ~30 minutes for AWS to reflect this change)", requiredInstances, lifecycle, instanceType, lifecycle, instanceType, region, vCPUQuota, vCPUPerInstance, instanceType, lifecycle, instanceType, requiredInstances*vCPUPerInstance, url),
 	})
 }
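
The new error message links directly to the relevant quota page. A minimal standalone sketch of that URL construction is shown below, reusing the format string from the hunk above. The function name `quotaURL` and the quota code `L-EXAMPLE` are placeholders for illustration only; in the PR the code is read from the Service Quotas API response.

```go
package main

import "fmt"

// quotaURL builds a region-specific Service Quotas console link for an EC2
// quota, using the same format string as ErrorInsufficientInstanceQuota.
// (quotaURL itself is a hypothetical name, not part of the PR.)
func quotaURL(region, quotaCode string) string {
	return fmt.Sprintf("https://%s.console.aws.amazon.com/servicequotas/home?region=%s#!/services/ec2/quotas/%s", region, region, quotaCode)
}

func main() {
	// "L-EXAMPLE" is a made-up placeholder quota code.
	fmt.Println(quotaURL("us-west-2", "L-EXAMPLE"))
}
```

Including the region in both the hostname and the query string means the link opens the quota for the cluster's region rather than the console's last-used region.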

35 changes: 29 additions & 6 deletions pkg/lib/aws/servicequotas.go
@@ -31,7 +31,11 @@ var _instancePrefixRegex = regexp.MustCompile(`[a-zA-Z]+`)
 var _standardInstancePrefixes = strset.New("a", "c", "d", "h", "i", "m", "r", "t", "z")
 var _knownInstancePrefixes = strset.Union(_standardInstancePrefixes, strset.New("p", "g", "inf", "x", "f"))

-func (c *Client) VerifyInstanceQuota(instanceType string) error {
+func (c *Client) VerifyInstanceQuota(instanceType string, requiredOnDemandInstances int64, requiredSpotInstances int64) error {
+	if requiredOnDemandInstances == 0 && requiredSpotInstances == 0 {
+		return nil
+	}
+
 	instancePrefix := _instancePrefixRegex.FindString(instanceType)

 	// Allow the instance if we don't recognize the type
@@ -43,7 +47,10 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		instancePrefix = "standard"
 	}

-	var cpuLimit *int
+	var onDemandCPUQuota *int64
+	var onDemandQuotaCode string
+	var spotCPUQuota *int64
+	var spotQuotaCode string
 	err := c.ServiceQuotas().ListServiceQuotasPages(
 		&servicequotas.ListServiceQuotasInput{
 			ServiceCode: aws.String("ec2"),
@@ -58,12 +65,20 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 				}

 				metricClass, ok := quota.UsageMetric.MetricDimensions["Class"]
-				if !ok || metricClass == nil || !strings.HasSuffix(*metricClass, "/OnDemand") {
+				if !ok || metricClass == nil || !(strings.HasSuffix(*metricClass, "/OnDemand") || strings.HasSuffix(*metricClass, "/Spot")) {
 					continue
 				}

+				// quota is specified in number of vCPU permitted per family
 				if strings.ToLower(*metricClass) == instancePrefix+"/ondemand" {
-					cpuLimit = pointer.Int(int(*quota.Value)) // quota is specified in number of vCPU permitted per family
+					onDemandCPUQuota = pointer.Int64(int64(*quota.Value))
+					onDemandQuotaCode = *quota.QuotaCode
+				} else if strings.ToLower(*metricClass) == instancePrefix+"/spot" {
+					spotCPUQuota = pointer.Int64(int64(*quota.Value))
+					spotQuotaCode = *quota.QuotaCode
+				}
+
+				if onDemandCPUQuota != nil && spotCPUQuota != nil {
 					return false
 				}
 			}
@@ -74,8 +89,16 @@ func (c *Client) VerifyInstanceQuota(instanceType string) error {
 		return errors.WithStack(err)
 	}

-	if cpuLimit != nil && *cpuLimit == 0 {
-		return ErrorInstanceTypeLimitIsZero(instanceType, c.Region)
+	cpuPerInstance := InstanceMetadatas[c.Region][instanceType].CPU
+	requiredOnDemandCPU := requiredOnDemandInstances * cpuPerInstance.Value()
+	requiredSpotCPU := requiredSpotInstances * cpuPerInstance.Value()
+
+	if onDemandCPUQuota != nil && *onDemandCPUQuota < requiredOnDemandCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "on-demand", c.Region, requiredOnDemandInstances, cpuPerInstance.Value(), *onDemandCPUQuota, onDemandQuotaCode)
+	}
+
+	if spotCPUQuota != nil && *spotCPUQuota < requiredSpotCPU {
+		return ErrorInsufficientInstanceQuota(instanceType, "spot", c.Region, requiredSpotInstances, cpuPerInstance.Value(), *spotCPUQuota, spotQuotaCode)
 	}

 	return nil
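
How an instance type is mapped to a quota class in `VerifyInstanceQuota` can be illustrated with a small self-contained sketch. It mirrors the prefix regex and prefix sets from the hunks above, but uses plain maps instead of `strset`, and `quotaClassKey` and `matchesClass` are hypothetical names. The example class strings such as `G/OnDemand` are an assumption inferred from the suffix checks and lowercase comparison in the diff.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as _instancePrefixRegex in the PR: the first run of letters.
var prefixRegex = regexp.MustCompile(`[a-zA-Z]+`)

// Plain-map stand-ins for the strset values used in the PR.
var standardPrefixes = map[string]bool{"a": true, "c": true, "d": true, "h": true, "i": true, "m": true, "r": true, "t": true, "z": true}
var otherKnownPrefixes = map[string]bool{"p": true, "g": true, "inf": true, "x": true, "f": true}

// quotaClassKey returns the lowercase class key (e.g. "standard/spot") to
// compare against the quota's "Class" dimension. ok is false when the
// instance family is not recognized, in which case the check is skipped.
func quotaClassKey(instanceType, lifecycle string) (key string, ok bool) {
	prefix := prefixRegex.FindString(instanceType)
	switch {
	case standardPrefixes[prefix]:
		prefix = "standard" // standard families share one quota
	case otherKnownPrefixes[prefix]:
		// family-specific quota; keep the prefix as-is
	default:
		return "", false
	}
	return prefix + "/" + lifecycle, true
}

// matchesClass compares a class dimension value with a key case-insensitively,
// as the PR does with strings.ToLower.
func matchesClass(metricClass, key string) bool {
	return strings.ToLower(metricClass) == key
}

func main() {
	key, _ := quotaClassKey("g4dn.xlarge", "ondemand")
	fmt.Println(key, matchesClass("G/OnDemand", key))

	key, _ = quotaClassKey("m5.large", "spot")
	fmt.Println(key, matchesClass("Standard/Spot", key))
}
```

Unrecognized families are allowed through rather than rejected, which matches the "Allow the instance if we don't recognize the type" comment in the diff.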
55 changes: 54 additions & 1 deletion pkg/types/clusterconfig/clusterconfig.go
@@ -19,6 +19,7 @@ package clusterconfig
 import (
 	"fmt"
 	"io/ioutil"
+	"math"
 	"net"
 	"net/http"
 	"regexp"
@@ -586,7 +587,7 @@ func (cc *Config) Validate(awsClient *aws.Client) error {
 		cc.InstanceVolumeIOPS = pointer.Int64(libmath.MinInt64(cc.InstanceVolumeSize*50, 3000))
 	}

-	if err := awsClient.VerifyInstanceQuota(primaryInstanceType); err != nil {
+	if err := awsClient.VerifyInstanceQuota(primaryInstanceType, cc.MaxPossibleOnDemandInstances(), cc.MaxPossibleSpotInstances()); err != nil {
 		// Skip AWS errors, since some regions (e.g. eu-north-1) do not support this API
 		if _, ok := errors.CauseOrSelf(err).(awserr.Error); !ok {
 			return errors.Wrap(err, InstanceTypeKey)
@@ -1078,6 +1079,58 @@ func DefaultAccessConfig() (*AccessConfig, error) {
 	return accessConfig, nil
 }

+func (cc *Config) MaxPossibleOnDemandInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || *cc.Spot == false || cc.SpotConfig == nil || cc.SpotConfig.OnDemandBackup == nil || *cc.SpotConfig.OnDemandBackup == true {
+		return *cc.MaxInstances
+	}
+
+	var count int64
+
+	// default OnDemandBaseCapacity is 0
+	if cc.SpotConfig.OnDemandBaseCapacity != nil {
+		count += *cc.SpotConfig.OnDemandBaseCapacity
+	}
+
+	// default OnDemandPercentageAboveBaseCapacity is 0
+	if cc.SpotConfig.OnDemandPercentageAboveBaseCapacity != nil {
+		count += int64(math.Ceil(float64(*cc.SpotConfig.OnDemandPercentageAboveBaseCapacity) / 100 * float64(*cc.MaxInstances-count)))
+	}
+
+	return libmath.MinInt64(count, *cc.MaxInstances) // take min just to be safe
+}
+
+func (cc *Config) MaxPossibleSpotInstances() int64 {
+	if cc.MaxInstances == nil {
+		return 0 // unexpected
+	}
+
+	if cc.Spot == nil || *cc.Spot == false {
+		return 0
+	}
+
+	count := *cc.MaxInstances
+
+	if cc.SpotConfig == nil {
+		return count
+	}
+
+	// default OnDemandBaseCapacity is 0
+	if cc.SpotConfig.OnDemandBaseCapacity != nil {
+		count -= *cc.SpotConfig.OnDemandBaseCapacity
+	}
+
+	// default OnDemandPercentageAboveBaseCapacity is 0
+	if cc.SpotConfig.OnDemandPercentageAboveBaseCapacity != nil {
+		count -= int64(math.Floor(float64(*cc.SpotConfig.OnDemandPercentageAboveBaseCapacity) / 100 * float64(count)))
+	}
+
+	return libmath.MaxInt64(count, 0) // take max just to be safe
+}
+
 func (cc *InternalConfig) UserTable() table.KeyValuePairs {
 	var items table.KeyValuePairs
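
The arithmetic in `MaxPossibleOnDemandInstances` and `MaxPossibleSpotInstances` above may be easier to follow with concrete numbers. The sketch below is a simplified stand-in, not the PR's code: `spotSettings`, `maxOnDemand`, and `maxSpot` are hypothetical names, and the struct uses plain values where the real `Config` uses pointer fields with nil checks.

```go
package main

import (
	"fmt"
	"math"
)

// spotSettings is a hypothetical, flattened stand-in for the cluster config
// fields read by the two helpers in the PR.
type spotSettings struct {
	MaxInstances     int64
	Spot             bool
	OnDemandBackup   bool
	BaseCapacity     int64 // on-demand base capacity
	PercentAboveBase int64 // on-demand percentage above base capacity
}

// maxOnDemand mirrors MaxPossibleOnDemandInstances: without spot, or with
// on-demand backup enabled, every instance may end up being on-demand.
func maxOnDemand(s spotSettings) int64 {
	if !s.Spot || s.OnDemandBackup {
		return s.MaxInstances
	}
	count := s.BaseCapacity
	// Round up so the on-demand requirement is never underestimated.
	count += int64(math.Ceil(float64(s.PercentAboveBase) / 100 * float64(s.MaxInstances-count)))
	if count > s.MaxInstances {
		return s.MaxInstances
	}
	return count
}

// maxSpot mirrors MaxPossibleSpotInstances: start from the maximum and
// subtract the instances that are guaranteed to be on-demand.
func maxSpot(s spotSettings) int64 {
	if !s.Spot {
		return 0
	}
	count := s.MaxInstances - s.BaseCapacity
	// Round the on-demand share down so the spot requirement is never underestimated.
	count -= int64(math.Floor(float64(s.PercentAboveBase) / 100 * float64(count)))
	if count < 0 {
		return 0
	}
	return count
}

func main() {
	s := spotSettings{MaxInstances: 10, Spot: true, BaseCapacity: 2, PercentAboveBase: 50}
	fmt.Println(maxOnDemand(s)) // 6: 2 base + ceil(50% of the remaining 8)
	fmt.Println(maxSpot(s))     // 4: 8 above base - floor(50% of 8)
}
```

Because one helper rounds up and the other rounds the subtracted share down, the two results can add up to more than `MaxInstances` (for example 4 and 7 with 10 instances, a base of 1, and 33%). Each value is an upper bound for its own lifecycle, which is what a quota check needs.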
