[AWS Autoscaler] Spread across availability zones #2177

Closed
AdamGleave opened this issue Jun 1, 2018 · 0 comments

Comments

@AdamGleave
Contributor

Feature enhancement: allow specifying multiple availability zones to launch worker nodes into. For spot instances, this would reduce the risk of all of your workers being terminated at once, and could also enable greater peak capacity. This is particularly valuable in regions such as us-east-1, which has six availability zones.

It's not clear what the best way to do this is. Right now, we specify SubnetId (filled in by aws/config.py:_configure_subnet), which pins workers to a particular availability zone. Spot fleet requests let you specify multiple SubnetIds, but RunInstances (which we currently use) does not. A reasonable policy might be to launch workers round-robin between availability zones. (This has the disadvantage of not favoring zones with lower prices, but now that Amazon makes spot prices vary only gradually over time, this doesn't seem like a big loss.)

Note that placing nodes in different availability zones increases latency between them, so there are probably better allocation strategies than round-robin.
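
For illustration, the round-robin policy suggested above might look roughly like the sketch below. This is not the autoscaler's actual code: `round_robin_launch`, `launch_one`, and `fake_launch` are hypothetical names, and the subnet IDs are made up. It assumes one subnet per availability zone and a callback that launches a single instance into a given subnet (in practice, something wrapping a RunInstances call with SubnetId set).

```python
import itertools
from typing import Callable, Dict, List


def round_robin_launch(
    subnet_ids: List[str],
    count: int,
    launch_one: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Launch `count` workers, cycling through `subnet_ids` in order.

    `launch_one` is a hypothetical callback that starts one instance in the
    given subnet and returns its instance ID.
    """
    if not subnet_ids:
        raise ValueError("at least one subnet is required")
    # Track which instances ended up in which subnet.
    placements: Dict[str, List[str]] = {s: [] for s in subnet_ids}
    subnets = itertools.cycle(subnet_ids)
    for _ in range(count):
        subnet = next(subnets)
        placements[subnet].append(launch_one(subnet))
    return placements


# Stand-in launcher so the sketch runs without AWS credentials.
_counter = itertools.count()


def fake_launch(subnet_id: str) -> str:
    return f"i-{next(_counter):04d}@{subnet_id}"


placements = round_robin_launch(
    ["subnet-a", "subnet-b", "subnet-c"], 7, fake_launch
)
```

With seven workers and three subnets, the first subnet receives three workers and the other two receive two each. The sketch restarts the cycle on every call, so a long-running autoscaler would probably need to remember where it left off, or pick the least-populated zone instead.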

AdamGleave added a commit to AdamGleave/ray that referenced this issue Jun 14, 2018
AdamGleave added a commit to AdamGleave/ray that referenced this issue Jun 14, 2018
ericl pushed a commit that referenced this issue Jun 20, 2018
* AWS: support multiple availability zones (fix #2177)

* Bugfix: [] rather than ()

* Test config

* Test config tweaks

* Remove test config

* Formatting fixes

* Update YAML config
royf added a commit to royf/ray that referenced this issue Jun 22, 2018
* 'master' of https://github.com/ray-project/ray: (157 commits)
  Fix build failure while using make -j1. Issue 2257 (ray-project#2279)
  Cast locator with index type (ray-project#2274)
  fixing zero length partitions (ray-project#2237)
  Make actor handles work in Python mode. (ray-project#2283)
  [xray] Add error table and push error messages to driver through node manager. (ray-project#2256)
  addressing comments (ray-project#2210)
  Re-enable some actor tests. (ray-project#2276)
  Experimental: enable automatic GCS flushing with configurable policy. (ray-project#2266)
  [xray] Sets good object manager defaults. (ray-project#2255)
  [tune] Update Trainable doc to expose interface (ray-project#2272)
  [rllib] Add a simple REST policy server and client example (ray-project#2232)
  [asv] Pushing to s3 (ray-project#2246)
  [rllib] Remove need to pass around registry (ray-project#2250)
  Support multiple availability zones in AWS (fix ray-project#2177) (ray-project#2254)
  [rllib] Add squash_to_range model option (ray-project#2239)
  Mitigate randomly building failure: adding gen_local_scheduler_fbs to raylet lib. (ray-project#2271)
  [rllib] Refactor Multi-GPU for PPO (ray-project#1646)
  [rllib] Envs for vectorized execution, async execution, and policy serving (ray-project#2170)
  [Dataframe] Change pandas and ray.dataframe imports (ray-project#1942)
  [Java] Replace binary rewrite with Remote Lambda Cache (SerdeLambda) (ray-project#2245)
  ...