[AWS Autoscaler] Spread across availability zones #2177
Comments
AdamGleave added a commit to AdamGleave/ray that referenced this issue on Jun 14, 2018
AdamGleave added a commit to AdamGleave/ray that referenced this issue on Jun 14, 2018
ericl pushed a commit that referenced this issue on Jun 20, 2018
royf added a commit to royf/ray that referenced this issue on Jun 22, 2018
* 'master' of https://github.com/ray-project/ray: (157 commits) Fix build failure while using make -j1. Issue 2257 (ray-project#2279) Cast locator with index type (ray-project#2274) fixing zero length partitions (ray-project#2237) Make actor handles work in Python mode. (ray-project#2283) [xray] Add error table and push error messages to driver through node manager. (ray-project#2256) addressing comments (ray-project#2210) Re-enable some actor tests. (ray-project#2276) Experimental: enable automatic GCS flushing with configurable policy. (ray-project#2266) [xray] Sets good object manager defaults. (ray-project#2255) [tune] Update Trainable doc to expose interface (ray-project#2272) [rllib] Add a simple REST policy server and client example (ray-project#2232) [asv] Pushing to s3 (ray-project#2246) [rllib] Remove need to pass around registry (ray-project#2250) Support multiple availability zones in AWS (fix ray-project#2177) (ray-project#2254) [rllib] Add squash_to_range model option (ray-project#2239) Mitigate randomly building failure: adding gen_local_scheduler_fbs to raylet lib. (ray-project#2271) [rllib] Refactor Multi-GPU for PPO (ray-project#1646) [rllib] Envs for vectorized execution, async execution, and policy serving (ray-project#2170) [Dataframe] Change pandas and ray.dataframe imports (ray-project#1942) [Java] Replace binary rewrite with Remote Lambda Cache (SerdeLambda) (ray-project#2245) ...
Feature enhancement: allow specifying multiple availability zones to launch worker nodes into. For spot instances, this would reduce the risk of all of your workers being terminated at once, and could also enable greater peak capacity. This is particularly valuable in regions such as us-east-1 that have six availability zones.
It's not clear what the best way to do this is. Right now, we specify SubnetId (filled in by aws/config.py:_configure_subnet), which pins all nodes to a particular availability zone. Spot fleet requests let you specify multiple SubnetIds, but RunInstances (which we currently use) does not. A reasonable policy might be to launch workers round-robin across availability zones. (This has the disadvantage of not favoring zones with lower prices, but now that Amazon makes spot prices vary only gradually over time, this doesn't seem like a big loss.)

Note that having nodes in different availability zones has a downside in terms of increased latency, so there are probably better allocation strategies than round-robin.
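To make the round-robin idea concrete, here is a minimal sketch in plain Python. It is not the autoscaler's actual code: the function names round_robin_subnets and build_launch_requests, the subnet IDs, and the instance type are all hypothetical, and the sketch assumes one subnet per availability zone and one launch request per worker, each carrying a single SubnetId.

```python
from itertools import cycle, islice
from typing import Dict, List


def round_robin_subnets(subnet_ids: List[str], num_workers: int) -> List[str]:
    """Return one subnet ID per worker, cycling through subnet_ids in order."""
    if not subnet_ids:
        raise ValueError("need at least one subnet ID")
    return list(islice(cycle(subnet_ids), num_workers))


def build_launch_requests(
    base_config: Dict, subnet_ids: List[str], num_workers: int
) -> List[Dict]:
    """Build one launch config per worker, each pinned to a single subnet.

    Each dict is meant to be passed to a separate launch call, since a
    single request can only name one SubnetId.
    """
    return [
        {**base_config, "SubnetId": subnet_id, "MinCount": 1, "MaxCount": 1}
        for subnet_id in round_robin_subnets(subnet_ids, num_workers)
    ]


# Hypothetical subnet IDs, assumed to be one per availability zone.
subnets = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
requests = build_launch_requests({"InstanceType": "m5.large"}, subnets, 5)
print([r["SubnetId"] for r in requests])
# → ['subnet-aaa', 'subnet-bbb', 'subnet-ccc', 'subnet-aaa', 'subnet-bbb']
```

Issuing one request per worker keeps the policy trivial to swap out: a price-aware or latency-aware strategy would only need to replace round_robin_subnets.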