Description
What is the problem?
Invoking `ray.remote` with `max_calls=1` on functions that require large amounts of memory causes workers to crash with OOM errors. Using `max_calls=None` causes no issues.
Ray throws the following error:
2020-02-06 16:40:08,621 ERROR worker.py:1003 -- Possible unhandled error from worker: ray::IDLE (pid=4978, ip=172.31.38.182)
File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-38-182 is used (3.56 / 3.62 GB). The top 10 memory consumers are:
PID MEM COMMAND
4977 0.33GiB ray::__main__.foo()
4125 0.13GiB ray::IDLE
4062 0.13GiB ray::IDLE
4335 0.13GiB ray::IDLE
4315 0.13GiB ray::IDLE
4229 0.13GiB ray::IDLE
4206 0.13GiB ray::IDLE
4175 0.13GiB ray::IDLE
4280 0.13GiB ray::IDLE
4436 0.13GiB ray::IDLE
In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
It seems that all those `ray::IDLE` processes are hanging around eating up memory. The worker machine sometimes recovers, but sometimes freezes due to lack of memory and eventually crashes.
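For reference, the limits mentioned at the end of that log are passed when Ray is started on a node; a rough sketch with placeholder values (not something I have tuned for this cluster, and they only take effect on the call that actually starts Ray, not when connecting to an existing cluster):

```python
import ray

# Rough sketch only: cap the shared-memory object store and Redis on one node.
# Parameter names are the ones named in the log message; the ~1 GB values are
# placeholders, not recommendations.
ray.init(
    object_store_memory=10**9,
    redis_max_memory=10**9,
)
```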
Ray version and other system information (Python version, TensorFlow version, OS):
ray version 0.8.1
Python 3.6
Running a ray cluster on AWS. The head node is an r5.large (16 GB RAM); worker nodes are c5.large (4 GB RAM).
Ubuntu 18.04 LTS on head, worker and driver nodes.
Reproduction (REQUIRED)
Reproducing the issue involves setting up a ray cluster. The relevant settings from my .yaml are:
cluster_name: ray_test
min_workers: 2
max_workers: 2
head_node:
    InstanceType: r5.large
    ImageId: ami-0ac3869e129c60af6  # Ubuntu 18.04 LTS
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
worker_nodes:
    InstanceType: c5.large
    ImageId: ami-0ac3869e129c60af6  # Ubuntu 18.04 LTS
The Python script looks as follows:
ray_test.py
import ray
import time

ray.init('auto')

def foo():
    # This function does nothing, but takes up around 400M of RAM
    a = [1] * 30000000

jobs = []
for _ in range(4000):
    jobs.append(ray.remote(max_calls=1)(foo).remote())

objids = jobs
while True:
    objids_ready, objids_pending = ray.wait(objids, len(objids), 0)
    time.sleep(0.5)
    if 0 == len(objids_pending):
        break
The steps to repro are as follows:
- Launch the ray cluster with `ray up autoscaler/ray_test.yaml`.
- Launch a local ray node with 0 resources with `ray start --address=$INTERNAL_HEAD_IP:6379 --redis-password='5241590000000000' --num-cpus=0 --num-gpus=0`. I don't think this is mandatory, but I like to have my driver not on the head node.
- Run `python ray_test.py`.
At this point, ray will start complaining, first with:
{CPU: 2.000000}, {memory: 2.197266 GiB}, {object_store_memory: 0.634766 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
And eventually with the message I've pasted at the top of this post, the one talking about `ray::IDLE`.
Interestingly, if I remove the `max_calls=1` argument in `ray_test.py`, I can no longer reproduce the issue.
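Concretely, the non-crashing variant only changes the line that submits the task (a sketch; as I understand it, without max_calls the worker process is reused across tasks instead of exiting after every call):

```python
# Same script as above, but without max_calls=1: the worker is reused.
jobs.append(ray.remote(foo).remote())
```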
I speculate the issue has to do with how ray spawns a new process for each function run, but there's no way for me to be sure.
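One way to check is to count the leftover workers on a node while the script runs; a hypothetical diagnostic (uses psutil, which is not part of the repro above; Ray sets the worker's process title, which is why ray::IDLE shows up in the memory monitor output):

```python
import psutil

# Count worker processes whose title is currently "ray::IDLE".
idle = [p for p in psutil.process_iter(attrs=["cmdline"])
        if p.info["cmdline"] and "ray::IDLE" in " ".join(p.info["cmdline"])]
print(f"{len(idle)} ray::IDLE workers currently alive")
```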
Running with `memory=500*1024*1024` delays the crash, but doesn't prevent it.
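For completeness, the memory reservation is passed through the same decorator call (sketch):

```python
# Reserving ~500 MiB per task limits how many copies of foo the scheduler packs
# onto a 4 GB worker; in my runs this delays the OOM but does not prevent it.
jobs.append(ray.remote(max_calls=1, memory=500 * 1024 * 1024)(foo).remote())
```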