
Remote calls hang when requesting GPU resources from an autoscaling EC2 cluster #2192

Open
@npyoung

Description

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04 (Deep Learning AMI)
  • Ray installed from (source or binary): pip
  • Ray version: 0.3.1
  • Python version: 3.6.3

When using an autoscaling cluster set up on AWS (see the Autoscaling GPU Cluster documentation), calling a remote function that requires GPU resources hangs indefinitely, and no worker instances are ever started.

Relevant bits of mycluster.yaml:

min_workers: 0
max_workers: 10
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2b

head_node:
  InstanceType: m5.large
  ImageId: ami-3b6bce43

worker_nodes:
  InstanceType: p2.16xlarge
  ImageId: ami-3b6bce43  # Amazon Deep Learning AMI (Ubuntu)
  InstanceMarketOptions:
    MarketType: spot
    SpotOptions:
      MaxPrice: 10.00
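Not part of the original report, but since the worker_nodes indentation is easy to get wrong, a quick sanity check is to confirm the file parses into the structure the autoscaler expects. A minimal sketch, assuming the file is saved as mycluster.yaml and PyYAML is installed:

import yaml

# Load the cluster config and print the sections the autoscaler cares about.
with open("mycluster.yaml") as f:
    config = yaml.safe_load(f)

print(config["worker_nodes"])        # expect InstanceType p2.16xlarge plus the spot options
print(config["max_workers"])         # expect 10
print(config["provider"]["region"])  # expect us-west-2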

Code on the driver:

import ray

# Connect to the cluster's existing Redis server on the head node.
ray.init(redis_address=ray.services.get_node_ip_address() + ":6379")

@ray.remote(num_gpus=1)
def gpu_test(x):
    return x**2

# This call hangs indefinitely; the autoscaler never launches a GPU worker.
print(ray.get([gpu_test.remote(i) for i in range(10)]))
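As a diagnostic variant (not in the original report), replacing the blocking ray.get with ray.wait makes the hang observable without blocking the driver forever: the pending list simply never shrinks because no GPU ever becomes available. A sketch reusing gpu_test and the ray.init call above, assuming the 0.3.x ray.wait signature where the timeout is given in milliseconds:

object_ids = [gpu_test.remote(i) for i in range(10)]
remaining = object_ids
while remaining:
    # Wait up to 30 s for any task to finish (timeout is in ms on Ray 0.3.x).
    ready, remaining = ray.wait(remaining, num_returns=1, timeout=30 * 1000)
    print("{} finished, {} still pending".format(
        len(object_ids) - len(remaining), len(remaining)))

print(ray.get(object_ids))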

Remote functions that only request num_cpus=1 run fine (although in a test this small, the head node simply executes everything itself, so the autoscaler never needs to launch a worker).
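For comparison, this is the CPU-only version of the same test, which completes immediately because the head node's own CPUs satisfy the request (cpu_test is just an illustrative name, not from the original report):

@ray.remote(num_cpus=1)
def cpu_test(x):
    return x**2

# Returns right away: the m5.large head node handles the num_cpus=1 tasks itself.
print(ray.get([cpu_test.remote(i) for i in range(10)]))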

Labels

P3 (moderate in impact or severity)
