Memory-aware task scheduling to avoid OOMs under memory pressure #20495
Description
Overview
Currently, the Ray scheduler only schedules based on CPUs by default for tasks (e.g., num_cpus=1). The user can also request memory (e.g., memory=1e9), however in most applications it is quite difficult to predict the heap memory usage of a task. In practice, this means that Ray users often see OOMs due to memory over-subscription, and resort to hacks like increasing the number of CPUs allocated to tasks.
Ideally, Ray would manage this automatically: when tasks consume too much heap memory, the scheduler should pushback on the scheduling of new tasks and preempt eligible tasks to reduce memory pressure.
Proposed design
Allow Ray to preempt and kill tasks that are using too much heap memory. We can do this by scanning the memory usage of tasks e.g., every 100ms, and preempting tasks if we are nearing a memory limit threshold (e.g., 80%).
Furthermore, the scheduler can stop scheduling new tasks should we near the threshold.
Compatibility: Preempting certain kinds of tasks can be unexpected, and breaks backwards compatibility. This can be an "opt-in" feature initially for tasks. E.g., "@ray.remote(memory="auto")` in order to preserve backwards compatibility. Libraries like multiprocessing and Datasets can enable this by default for their map tasks. In the future, we can try to enable it by default for tasks that are safe to preempt (e.g., those that are not launching child tasks, and have retries enabled).
Acceptance criteria: As a user, I can run Ray tasks that use large amounts of memory without needing to tune/tweak Ray resource settings to avoid OOM crashes.
Metadata
Assignees
Type
Projects
Status
In Progress