Skip to content

Memory-aware task scheduling to avoid OOMs under memory pressure #20495

Open
@ericl

Description

Overview

Currently, the Ray scheduler only schedules based on CPUs by default for tasks (e.g., num_cpus=1). The user can also request memory (e.g., memory=1e9), however in most applications it is quite difficult to predict the heap memory usage of a task. In practice, this means that Ray users often see OOMs due to memory over-subscription, and resort to hacks like increasing the number of CPUs allocated to tasks.

Ideally, Ray would manage this automatically: when tasks consume too much heap memory, the scheduler should pushback on the scheduling of new tasks and preempt eligible tasks to reduce memory pressure.

Proposed design

Allow Ray to preempt and kill tasks that are using too much heap memory. We can do this by scanning the memory usage of tasks e.g., every 100ms, and preempting tasks if we are nearing a memory limit threshold (e.g., 80%).
Furthermore, the scheduler can stop scheduling new tasks should we near the threshold.

Compatibility: Preempting certain kinds of tasks can be unexpected, and breaks backwards compatibility. This can be an "opt-in" feature initially for tasks. E.g., "@ray.remote(memory="auto")` in order to preserve backwards compatibility. Libraries like multiprocessing and Datasets can enable this by default for their map tasks. In the future, we can try to enable it by default for tasks that are safe to preempt (e.g., those that are not launching child tasks, and have retries enabled).

Acceptance criteria: As a user, I can run Ray tasks that use large amounts of memory without needing to tune/tweak Ray resource settings to avoid OOM crashes.

Metadata

Assignees

No one assigned

    Labels

    P2Important issue, but not time-criticalenhancementRequest for new feature and/or capabilitysize:large

    Type

    No type

    Projects

    • Status

      In Progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions