
[ray] Handle memory pressure more gracefully #6458

Open
petrock99 opened this issue Dec 12, 2019 · 3 comments
Labels: enhancement (Request for new feature and/or capability), P3 (Issue moderate in impact or severity)

Comments


What is the problem?

I'm using Ray 0.7.6 + Python 3.7.3 with 45 Linux machines on my university network as a cluster. All students in my department have access to these machines and use them frequently. If one of these students does something to consume more than 95% of the available memory on any of the nodes (e.g. opens a 13GB file), Ray throws up its hands and quits. This is quite frustrating, especially if I'm 4 hours into an 8 hour run. It would be nice if Ray handled memory pressure a little more gracefully, especially when that memory pressure is being caused by another user I have no control over. A couple of ideas:

  • Take the user of the processes consuming large amounts of memory into account before tearing down. Why punish me for another user's actions?
  • If memory pressure does occur on a node, instead of tearing down, stop sending it actors to play with until memory goes back down to acceptable levels. If memory never gets back to 'normal' on that node, so be it; there are other worker machines available to take up the slack (rough sketch of what I mean below).
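Something like this is roughly what I have in mind for the second idea, approximated from the user side. Just a sketch, not existing Ray behavior; the psutil check and the 80% threshold are placeholders of my own:

import time

import psutil
import ray


@ray.remote
class BackpressureActor:
    # Sketch only: the actor checks its own node's memory and backs off
    # instead of letting the memory monitor tear the whole job down.
    def do_work(self, payload):
        while psutil.virtual_memory().percent > 80.0:  # arbitrary threshold
            time.sleep(10)
        # ... real work would go here ...
        return len(payload)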

Reproduction

  • Run a long-running Python script with a bunch of actors on a cluster of machines (sketch below).
  • Log into one of the worker machines in your cluster as another user.
  • As the other user, do something to put the worker machine into memory pressure (e.g. use stress-ng).
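Roughly, the driver side of that looks like the following; the actor count and sleep are placeholders, and address="auto" assumes the cluster was already started with ray start (older Ray versions may want redis_address instead):

import time

import ray

ray.init(address="auto")  # connect to the already-running cluster


@ray.remote
class Worker:
    def work(self):
        time.sleep(1)  # stand-in for real work
        return "ok"


actors = [Worker.remote() for _ in range(16)]
while True:
    ray.get([a.work.remote() for a in actors])

# Then, on one worker node, as another user, create memory pressure with
# something like: stress-ng --vm 2 --vm-bytes 90% --timeout 60m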

Expected:
Ray process should continue without issues, albeit a little slower.
Actual:
Ray process tears itself down and quits.

petrock99 added the enhancement (Request for new feature and/or capability) label Dec 12, 2019
petrock99 commented Dec 12, 2019

Hit this issue again today. Someone decided to open a 30GB file on one of the worker nodes and Ray killed itself. (raytracer is not affiliated with Ray, of course.)

28779 29.92GiB ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm

Here is the output from ray:

Traceback (most recent call last):
File "path/to/my/python/script", line 471, in
nnet.train(train_loader, train_dataset.max_len(), n_epochs, learning_rate)
File "path/to/my/python/script", line 257, in train
running_state_dict,n_batches, progress, bar, verbose)
File "path/to/my/python/script", line 185, in wait_for_training_data
training_data = ray.get(training_id)
File "path/to/my/home/.local/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray_RemoteBatchNetHelper:train() (pid=28718, host=jaguar)
File "path/to/my/home/.local/lib/python3.7/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node jaguar is used (31.24 / 31.27 GB). The top 10 memory consumers are:

PID MEM COMMAND
28779 29.92GiB ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm
28718 0.06GiB ray_RemoteBatchNetHelper:train()
28719 0.06GiB ray_worker
21854 0.04GiB ristretto /s/bach/c/under/jhgrins/cs410/P5_test/scene1.gif
11835 0.02GiB /usr/libexec/sssd/sssd_kcm --uid 0 --gid 0 --logger=files
13314 0.01GiB /opt/google/chrome-beta/chrome --type=renderer --disable-webrtc-apm-in-audio-service --field-trial-h
14493 0.01GiB path/to/my/home/.local/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_
14494 0.01GiB /usr/local/anaconda-2019.07/bin/python -u path/to/my/home/.local/lib/python3.7/site-package
12924 0.01GiB /opt/google/chrome-beta/chrome
12967 0.0GiB /opt/google/chrome-beta/chrome --type=utility --field-trial-handle=5488568897404811202,1067882724045

In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray, and the max Redis size with redis_max_memory. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
2019-12-12 16:03:37,066 INFO node_provider.py:41 -- ClusterState: Loaded cluster state: [list of nodes]
2019-12-12 16:03:37,067 INFO commands.py:110 -- teardown_cluster: Shutting down 13 nodes...
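For what it's worth, the limits mentioned in that message can be set when Ray is started; something like the following, where the values are just placeholders, not recommendations (on a multi-node cluster the equivalent flags would go on the ray start command for each node):

import ray

# Cap the object store and Redis explicitly instead of letting Ray assume
# all system memory is available to workers (placeholder values).
ray.init(
    object_store_memory=2 * 1024**3,  # ~2 GiB
    redis_max_memory=1 * 1024**3,     # ~1 GiB
)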

ericl commented Dec 14, 2019

One thing that can help here: if you know for sure your workload doesn't use a lot of memory, you can set RAY_DEBUG_DISABLE_MEMORY_MONITOR=1, which disables memory checking entirely. However, note that this can lead to very confusing error messages if you do run into real memory contention, since Ray cannot always return good error messages when it genuinely runs out of memory.
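Concretely, that means making sure the variable is set in the environment the Ray processes see; a sketch for a driver script (on a multi-node cluster the variable may also need to be exported on each node before ray start):

import os

# Disable Ray's memory monitor for processes launched from this environment.
# Use with caution: a real OOM will then surface as a much less helpful error.
os.environ["RAY_DEBUG_DISABLE_MEMORY_MONITOR"] = "1"

import ray

ray.init()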

virtualluke commented
Setting RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 is a little like turning up the radio in the car when you have some engine noise you want to forget about.

I think this would be a very nice feature to make ray more robust. Long way of me adding a "thumbs up" to the issue.

ericl added the P3 (Issue moderate in impact or severity) label Mar 19, 2020