Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster #2192
Open
Description
System information
- OS Platform and Distribution: Linux Ubuntu 16.04 (Deep Learning AMI)
- Ray installed from (source or binary): pip
- Ray version: 0.3.1
- Python version: 3.6.3
When using an autoscaling cluster on AWS (set up as in the Autoscaling GPU Cluster example), calling a remote function that requires GPU resources hangs indefinitely, and no worker instances are ever started.
Relevant bits of mycluster.yaml:
```yaml
min_workers: 0
max_workers: 10
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2b

head_node:
    InstanceType: m5.large
    ImageId: ami-3b6bce43

worker_nodes:
    InstanceType: p2.16xlarge
    ImageId: ami-3b6bce43  # Amazon Deep Learning AMI (Ubuntu)
    InstanceMarketOptions:
        MarketType: spot
        SpotOptions:
            MaxPrice: 10.00
```
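With `min_workers: 0`, no p2 instance exists at the moment the GPU tasks are submitted, so they can only be scheduled after the autoscaler launches a worker. One way to confirm that only the head node (which advertises no GPUs) is registered with the cluster is to dump the client table; this is a minimal diagnostic sketch, assuming `ray.global_state.client_table()` is available in this Ray version:

```python
import ray

ray.init(redis_address=ray.services.get_node_ip_address() + ":6379")

# Print the nodes the global state store knows about. With min_workers: 0
# this should show only the m5.large head node, which has no GPUs, so the
# GPU tasks cannot be scheduled until the autoscaler brings up a p2 worker.
print(ray.global_state.client_table())
```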
Code on the driver:
```python
import ray

ray.init(redis_address=ray.services.get_node_ip_address() + ":6379")


@ray.remote(num_gpus=1)
def gpu_test(x):
    return x ** 2


print(ray.get([gpu_test.remote(i) for i in range(10)]))
```
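Not part of the original repro, but to observe the hang without blocking the driver forever, the bare `ray.get` can be replaced by `ray.wait` with a timeout; a minimal sketch, assuming the `ray.wait` timeout is given in milliseconds in this version:

```python
# Submit the GPU tasks, then poll with a timeout instead of blocking
# indefinitely in ray.get. If the autoscaler never starts a GPU worker,
# `ready` stays empty and `remaining` keeps all ten object IDs.
object_ids = [gpu_test.remote(i) for i in range(10)]
ready, remaining = ray.wait(object_ids, num_returns=len(object_ids),
                            timeout=60 * 1000)  # timeout in milliseconds
print("ready:", len(ready), "still pending:", len(remaining))
```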
Remote functions declared with `num_cpus=1` run fine (although in a small test like this, the head node simply executes everything itself).
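For comparison, a CPU-only version of the same test returns immediately and shows that every task runs on the head node; a minimal sketch using only the APIs already shown above (the `cpu_test` function name is illustrative):

```python
@ray.remote(num_cpus=1)
def cpu_test(x):
    # Return the IP of the node that actually executed the task, to show
    # that everything runs on the head node in this small test.
    return x ** 2, ray.services.get_node_ip_address()

results = ray.get([cpu_test.remote(i) for i in range(10)])
for value, ip in results:
    print(value, "computed on", ip)
```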