Using max_calls=1 with high memory usage functions causes OOM issues #7073

Closed
@matter-funds

Description

What is the problem?

Invoking ray.remote with max_calls=1 and functions which require large amounts of memory causes workers to crash with OOM issues. Using max_calls=None causes no issues.
Ray throws the following error:

2020-02-06 16:40:08,621  ERROR worker.py:1003 -- Possible unhandled error from worker: ray::IDLE (pid=4978, ip=172.31.38.182)
  File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
  File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-38-182 is used (3.56 / 3.62 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
4977    0.33GiB ray::__main__.foo()
4125    0.13GiB ray::IDLE
4062    0.13GiB ray::IDLE
4335    0.13GiB ray::IDLE
4315    0.13GiB ray::IDLE
4229    0.13GiB ray::IDLE
4206    0.13GiB ray::IDLE
4175    0.13GiB ray::IDLE
4280    0.13GiB ray::IDLE
4436    0.13GiB ray::IDLE

In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
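
For reference, setting the limits that the message mentions looks roughly like this when starting Ray on a single machine (the byte values are arbitrary placeholders, not tuned recommendations; on a cluster they would presumably need to go on each node's ray start command rather than in the driver):

import ray

# Rough sketch of the limits named in the error above, with placeholder values.
ray.init(
    object_store_memory=500 * 1024 * 1024,  # shared-memory object store cap, in bytes
    redis_max_memory=200 * 1024 * 1024,     # Redis cap, in bytes
)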

It seems that all those ray::IDLE processes are hanging around eating up memory. The worker machine sometimes recovers, but sometimes freezes due to lack of memory and eventually crashes.
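
To see which of those processes are actually holding memory while the repro runs, a quick psutil loop on the worker node should do it (psutil is already a ray dependency, so nothing extra to install):

import psutil

# List lingering ray::IDLE workers and their resident memory on this node.
for proc in psutil.process_iter(attrs=["pid", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ray::IDLE" in cmdline and proc.info["memory_info"] is not None:
        rss_gib = proc.info["memory_info"].rss / 1024 ** 3
        print(proc.info["pid"], "%.2f GiB" % rss_gib, cmdline)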

Ray version and other system information (Python version, TensorFlow version, OS):
ray version 0.8.1
Python 3.6
Running ray cluster on AWS. Head node is r5.large (with 16G RAM), worker nodes are c5.large (with 4G RAM).
Ubuntu 18.04 LTS on head, worker and driver nodes.

Reproduction (REQUIRED)

Reproducing the issue involves setting up a ray cluster. The relevant settings from my .yaml are:

cluster_name: ray_test
min_workers: 2
max_workers: 2
head_node:
    InstanceType: r5.large
    ImageId: ami-0ac3869e129c60af6 # Ubuntu 18.04 LTS
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
worker_nodes:
    InstanceType: c5.large
    ImageId: ami-0ac3869e129c60af6 # Ubuntu 18.04 LTS

The python script looks as follows:
ray_test.py

import ray
import time

ray.init('auto')

def foo():
  # This function does nothing, but takes up around 400 MB of RAM
  a = [1] * 30000000

jobs = []
for _ in range(4000):
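  # max_calls=1 tells Ray to tear down the worker after a single task and start a fresh process for the next one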
  jobs.append(ray.remote(max_calls=1)(foo).remote())

objids = jobs
while True:
  objids_ready, objids_pending = ray.wait(objids, len(objids), 0)
  time.sleep(0.5)
  if len(objids_pending) == 0:
    break

The steps to repro are as follows:

  1. Launch the ray cluster with ray up autoscaler/ray_test.yaml
  2. Launch local ray node with 0 resources with ray start --address=$INTERNAL_HEAD_IP:6379 --redis-password='5241590000000000' --num-cpus=0 --num-gpus=0. I don't think this is mandatory, but I like to have my driver not on the head.
  3. Run python ray_test.py.

At this point, ray will start complaining, first with:

{CPU: 2.000000}, {memory: 2.197266 GiB}, {object_store_memory: 0.634766 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

And eventually with the message I've pasted at the top of this post, the one talking about ray::IDLE.

Interestingly, if I remove the max_calls=1 argument in ray_test.py, I can no longer reproduce the issue.
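
In other words, the only difference between the failing and the working run is the submission line in ray_test.py:

# with max_calls=1 (workers eventually hit the OOM above):
jobs.append(ray.remote(max_calls=1)(foo).remote())

# without it (workers are reused and the run completes):
jobs.append(ray.remote(foo).remote())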

I speculate the issue has to do with how ray spawns a new process for each function run, but there's no way for me to be sure.

Running with memory=500*1024*1024 delays the crash, but doesn't prevent it.

Labels

bug: Something that is supposed to be working; but isn't
needs-repro-script: Issue needs a runnable script to be reproduced
stale: The issue is stale. It will be closed within 7 days unless there is further conversation
