[ray] Handle memory pressure more gracefully #6458
Comments
Hit this issue again today. Someone decided to open a 30GB file on one of the worker nodes and Ray killed itself. (raytracer is not affiliated with Ray, of course.) Here is the output from Ray:

    Traceback (most recent call last):
    ...
    PID    MEM       COMMAND
    28779  29.92GiB  ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm
    ...
    In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the ...
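The message is truncated above, but it presumably refers to configuring the object store; one way to cap its size in Ray of this era is the object_store_memory argument to ray.init (in bytes). A minimal sketch, with an illustrative value:

```python
import ray

# Hedged sketch: cap the shared-memory object store at ~2 GiB.
# The 2 GiB figure is chosen for illustration only.
ray.init(object_store_memory=2 * 1024**3)
```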
One thing that can help here is, if you know for sure your workload doesn't use a lot of memory, to set RAY_DEBUG_DISABLE_MEMORY_MONITOR=1.
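A minimal sketch of how that might look, assuming the memory monitor reads this variable at startup so it must be in the environment before Ray's processes launch:

```python
import os

# Assumption: set before Ray is initialized (or exported on each node
# before `ray start` in a multi-node cluster) so the monitor sees it.
os.environ["RAY_DEBUG_DISABLE_MEMORY_MONITOR"] = "1"

import ray

ray.init()
```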
Setting RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 is a little like turning up the radio in the car when you have some engine noise you want to forget about. I think this would be a very nice feature to make Ray more robust. Long way of me adding a "thumbs up" to the issue.
What is the problem?
I'm using Ray 0.7.6 + Python 3.7.3 with 45 Linux machines on my university network as a cluster. All students in my department have access to these machines and use them frequently. If one of these students does something to consume more than 95% of the available memory on any of the nodes (e.g. opens a 13GB file), Ray will throw up its hands and quit. This is quite frustrating, especially if I'm 4 hours into an 8 hour run. It would be nice if Ray handled memory pressure a little more gracefully, especially when that memory pressure is being caused by another user that I have no control over. A couple of ideas:
Reproduction
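A minimal sketch of the scenario described above (hypothetical script: a Ray job is running while another process on the same node allocates most of its RAM):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def work(i):
    # Stand-in task; the actual workload doesn't matter.
    return sum(range(10**7))

futures = [work.remote(i) for i in range(100)]

# Meanwhile, another user's process (outside Ray's control) consumes
# most of the node's memory, e.g. by loading a ~13GB file into RAM.
# Simulated here with a single large allocation:
hog = np.ones(13 * 1024**3 // 8)  # ~13 GiB of float64s

print(ray.get(futures))
```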
Expected:
Ray process should continue without issues, albeit a little slower.
Actual:
Ray process tears itself down and quits.