Description
What is the problem?
Invoking `ray.remote` with `max_calls=1` on functions that require large amounts of memory causes workers to crash with OOM errors. Using `max_calls=None` causes no issues.
Ray throws the following error:
2020-02-06 16:40:08,621 ERROR worker.py:1003 -- Possible unhandled error from worker: ray::IDLE (pid=4978, ip=172.31.38.182)
File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-38-182 is used (3.56 / 3.62 GB). The top 10 memory consumers are:
PID MEM COMMAND
4977 0.33GiB ray::__main__.foo()
4125 0.13GiB ray::IDLE
4062 0.13GiB ray::IDLE
4335 0.13GiB ray::IDLE
4315 0.13GiB ray::IDLE
4229 0.13GiB ray::IDLE
4206 0.13GiB ray::IDLE
4175 0.13GiB ray::IDLE
4280 0.13GiB ray::IDLE
4436 0.13GiB ray::IDLE
In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
It seems that all those `ray::IDLE` processes are hanging around eating up memory. The worker machine sometimes recovers, but sometimes freezes due to lack of memory and eventually crashes.
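For reference, the limits mentioned at the end of that log are passed when Ray is started on a node; a rough sketch with placeholder values (not something I have tuned for this cluster, and they only take effect on the call that actually starts Ray, not when connecting to an existing cluster):

```python
import ray

# Rough sketch only: cap the shared-memory object store and Redis on one node.
# Parameter names are the ones named in the log message; the ~1 GB values are
# placeholders, not recommendations.
ray.init(
    object_store_memory=10**9,
    redis_max_memory=10**9,
)
```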
Ray version and other system information (Python version, TensorFlow version, OS):
ray version 0.8.1
Python 3.6
Running a ray cluster on AWS. The head node is an r5.large (16 GB RAM); worker nodes are c5.large (4 GB RAM).
Ubuntu 18.04 LTS on head, worker and driver nodes.
Reproduction (REQUIRED)
Reproducing the issue involves setting up a ray cluster. The relevant settings from my .yaml are:
cluster_name: ray_test
min_workers: 2
max_workers: 2
head_node:
    InstanceType: r5.large
    ImageId: ami-0ac3869e129c60af6  # Ubuntu 18.04 LTS
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
worker_nodes:
    InstanceType: c5.large
    ImageId: ami-0ac3869e129c60af6  # Ubuntu 18.04 LTS
The Python script looks as follows:
ray_test.py
import ray
import time

ray.init('auto')

def foo():
    # This function does nothing, but takes up around 400M of RAM
    a = [1] * 30000000

jobs = []
for _ in range(4000):
    jobs.append(ray.remote(max_calls=1)(foo).remote())

objids = jobs
while True:
    objids_ready, objids_pending = ray.wait(objids, len(objids), 0)
    time.sleep(0.5)
    if 0 == len(objids_pending):
        break
The steps to repro are as follows:
- Launch the ray cluster with `ray up autoscaler/ray_test.yaml`.
- Launch a local ray node with 0 resources with `ray start --address=$INTERNAL_HEAD_IP:6379 --redis-password='5241590000000000' --num-cpus=0 --num-gpus=0`. I don't think this is mandatory, but I like to have my driver not on the head node.
- Run `python ray_test.py`.
At this point, ray will start complaining, first with:
{CPU: 2.000000}, {memory: 2.197266 GiB}, {object_store_memory: 0.634766 GiB}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
And eventually with the message I've pasted at the top of this post, the one talking about `ray::IDLE`.
Interestingly, if I remove the `max_calls=1` argument in `ray_test.py`, I can no longer reproduce the issue.
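Concretely, the non-crashing variant only changes the line that submits the task (a sketch; as I understand it, without max_calls the worker process is reused across tasks instead of exiting after every call):

```python
# Same script as above, but without max_calls=1: the worker is reused.
jobs.append(ray.remote(foo).remote())
```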
I speculate the issue has to do with how ray spawns a new process for each function run, but there's no way for me to be sure.
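One way to check is to count the leftover workers on a node while the script runs; a hypothetical diagnostic (uses psutil, which is not part of the repro above; Ray sets the worker's process title, which is why ray::IDLE shows up in the memory monitor output):

```python
import psutil

# Count worker processes whose title is currently "ray::IDLE".
idle = [p for p in psutil.process_iter(attrs=["cmdline"])
        if p.info["cmdline"] and "ray::IDLE" in " ".join(p.info["cmdline"])]
print(f"{len(idle)} ray::IDLE workers currently alive")
```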
Running with `memory=500*1024*1024` delays the crash, but doesn't prevent it.
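For completeness, the memory reservation is passed through the same decorator call (sketch):

```python
# Reserving ~500 MiB per task limits how many copies of foo the scheduler packs
# onto a 4 GB worker; in my runs this delays the OOM but does not prevent it.
jobs.append(ray.remote(max_calls=1, memory=500 * 1024 * 1024)(foo).remote())
```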