[autoscaler] AWS Autoscaler Improvements #8420

This issue consolidates and tracks an initial set of proposed improvements to the AWS Autoscaler. All design docs and pull requests that support these proposals will be linked here as they become available. Any relevant conversations or concerns related to these proposals should also be tracked here.

These proposals arose from using Ray internally at Amazon, and focus primarily on foundational improvements to AWS Autoscaler security, testability, usability, and integration with existing AWS cloud configuration capabilities.

In particular, to make fully automated AWS Ray cluster deployments predictable and reliable at scale, we would like to ensure that the AWS Autoscaler is idempotent and deterministic during cluster config application, and guard this behavior against regression.
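
As a rough illustration of the property described above (not the autoscaler's actual code), idempotent and deterministic config application means that bootstrapping an already-bootstrapped config changes nothing, and that the result does not depend on input ordering or on the caller's state. The `bootstrap_config` function, the `DEFAULTS` values, and the config keys below are hypothetical stand-ins chosen for this sketch:

```python
import copy

# Hypothetical defaults; a real bootstrap would resolve these against AWS.
DEFAULTS = {"SubnetIds": ["subnet-default"], "SecurityGroupIds": ["sg-default"]}


def bootstrap_config(config):
    """Hypothetical stand-in for config bootstrapping.

    Fills in missing head/worker settings and returns a new dict,
    leaving the caller's config untouched.
    """
    result = copy.deepcopy(config)
    for node_key in ("head_node", "worker_nodes"):
        node = result.setdefault(node_key, {})
        for key, default in DEFAULTS.items():
            # Sort so the output does not depend on input ordering.
            node[key] = sorted(node.get(key, default))
    return result


def check_idempotent(config):
    """Return True if bootstrapping twice equals bootstrapping once."""
    once = bootstrap_config(config)
    twice = bootstrap_config(once)
    return once == twice
```

A regression test could call `check_idempotent` on a handful of representative cluster configs and fail if any of them returns `False`.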

To achieve these goals, we would like to work with the open source community to deliver the following priority-ordered improvements, where each improvement is typically expected to be fulfilled by a single pull request:

1. **Honor Separate Head and Worker Node Subnet IDs**: Fix an existing issue that allows users to specify different head/worker subnets but then ignores the worker subnet. This also introduces foundational changes to allow separate EC2 security groups to exist for head and worker nodes, together with a basic stubbed unit test suite for the AWS Autoscaler. Subsequent improvements will build on top of this foundation.

2. **AWS Config Bootstrap Plugin Support**: To unblock a broad range of one-off user requirements, allow users to specify custom callbacks to be invoked during config bootstrapping. Minimally, users should be able to provide their own pre-bootstrap and post-bootstrap Python scripts to be invoked before/after config bootstrapping. Maximally, users should be able to override either a subset or all of the default config bootstrapping workflow.

3. **EC2 Security Group Inbound Rule Whitelists**: Support common enterprise network security compliance use cases to ensure that inbound AWS EC2 instance connections are restricted to a known set of internal subnets, security groups, and/or prefix list IDs. These changes will also support idempotent, deterministic whitelist re-application to enable automatic correction of cluster configuration drift. We will consider the trade-offs related to adding this feature as either an example config bootstrap plugin, or as part of the default config bootstrapping workflow.

4. **Disable Autoscaler Config Cache**: Ensure that cluster config is applied deterministically, regardless of the state of the host applying configuration. We would like to minimally provide the option to disable the local config cache, and maximally remove the local config cache altogether.

5. **EC2 Security Group Tagging**: To better facilitate both automated and manual cluster security group discovery, security groups will have tags applied to track the cluster(s) that they are currently servicing. Minimally, these tags will have cluster drift detection and idempotent correction applied during config bootstrapping. Maximally, these tags will have ongoing, eventually consistent cluster drift detection and idempotent correction applied.

6. **CloudFormation Template Support**: To support a broader range of expected AWS account config preconditions (e.g. create an SQS queue for a Ray cluster to poll, grant the cluster permissions to read/delete messages from it, etc.), we would like to add the ability to deploy user-specified CloudFormation stacks prior to bootstrapping a Ray cluster. We will consider the trade-offs related to adding this feature as either an example config bootstrap plugin, or as part of the default config bootstrapping workflow.

7. **Add More Strongly Typed Head and Worker Node Config**: To resolve ambiguity about which boto3 models Ray can successfully honor during cluster creation, we would like to introduce stronger typing to quickly catch unsupported "head_node" and "worker_nodes" configuration options instead of ignoring them or launching misconfigured clusters.

8. **EC2 Launch Template Support**: To further simplify head and worker node configuration options, and to reduce the need to write common cluster post-configuration steps as setup commands, we would like to investigate tighter integration of EC2 launch templates with head and worker node config.

9. **EC2 Security Group Outbound Rule Whitelists**: These changes build on top of the inbound rule whitelists introduced in [3], and serve similar enterprise network security compliance use cases. We will consider the trade-offs related to adding this feature as either an example config bootstrap plugin, or as part of the default config bootstrapping workflow.

10. **AWS Autoscaler Test Suite Coverage and Maintainability Improvements**: Extend the AWS Autoscaler test suite to ensure that we have sufficient code coverage across all pre-existing AWS Autoscaler features not touched by the above changes. We will also ensure that we have created sufficient tests to both vet, and protect, idempotent and deterministic autoscaling cluster config application. We will refactor this test suite as required to improve long-term maintainability, and any additional issues discovered while improving test coverage will be appended to this issue for tracking.
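
To make the idea of idempotent whitelist re-application from [3] more concrete, here is a minimal sketch of one possible approach: compute the difference between the rules currently on a security group and the desired whitelist, then apply only that difference. The function names and the tuple shape used for rules are assumptions made for illustration; they are not part of Ray or boto3.

```python
def plan_inbound_rules(current, desired):
    """Return (to_authorize, to_revoke) as sorted lists.

    Rules are modelled as hashable tuples of
    (protocol, from_port, to_port, source), where source is a CIDR block,
    security group ID, or prefix list ID. This shape is an assumption
    made for illustration only.
    """
    current, desired = set(current), set(desired)
    # Sorting keeps the plan deterministic regardless of set ordering.
    return sorted(desired - current), sorted(current - desired)


def apply_plan(current, to_authorize, to_revoke):
    """Simulate applying a plan to the current rule set."""
    return (set(current) | set(to_authorize)) - set(to_revoke)
```

Once the current rules match the whitelist, planning again yields two empty lists, so repeated application is a no-op and drift is corrected on the next run. In a real implementation the two lists would presumably map onto the EC2 `authorize_security_group_ingress` and `revoke_security_group_ingress` calls.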
