[core] actors are not retried when OOM killed. Switch to system exit (ray-project#28943)

Right now, restartable actors do not restart when they are OOM killed, because GCS does not retry an actor whose failure reason is INTENDED_USER_EXIT or USER_ERROR.

This PR changes the OOM killer's worker exit type from USER_ERROR to SYSTEM_ERROR so that GCS can retry the actor.
Related issue number

ray-project#28920
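
The retry rule described above can be modeled in a few lines of Python. This is only a sketch for illustration: the enum mirrors the three exit types named in this commit, and `should_restart_actor`, `NON_RETRIABLE`, and `restarts_left` are hypothetical names, not GCS code.

```python
from enum import Enum, auto


class WorkerExitType(Enum):
    """Simplified stand-in for the exit types named in the commit message."""
    INTENDED_USER_EXIT = auto()
    USER_ERROR = auto()
    SYSTEM_ERROR = auto()


# Exit types for which, per the commit message, GCS does not retry the actor.
NON_RETRIABLE = {WorkerExitType.INTENDED_USER_EXIT, WorkerExitType.USER_ERROR}


def should_restart_actor(exit_type: WorkerExitType, restarts_left: int) -> bool:
    """Hypothetical helper: restart only on a retriable exit with budget left."""
    return exit_type not in NON_RETRIABLE and restarts_left > 0


# Before this change the OOM killer reported USER_ERROR, so no restart happened.
print(should_restart_actor(WorkerExitType.USER_ERROR, restarts_left=1))
# After this change it reports SYSTEM_ERROR, so the actor is eligible for restart.
print(should_restart_actor(WorkerExitType.SYSTEM_ERROR, restarts_left=1))
```

Under this model, switching the reported exit type is enough to make an OOM-killed actor restartable without touching the retry logic itself.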

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
clarng authored Oct 5, 2022
1 parent 419dd8d commit 33e5d1e
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion python/ray/tests/test_memory_pressure.py
@@ -158,7 +158,7 @@ def test_restartable_actor_killed_by_memory_monitor_with_actor_error(
 timeout=10,
 retry_interval_ms=100,
 tag="MemoryManager.ActorEviction.Total",
-value=1.0, # TODO(clarng): This should be 2. Look at why restart doesn't work
+value=2.0,
 )


2 changes: 1 addition & 1 deletion src/ray/common/ray_config_def.h
@@ -96,7 +96,7 @@ RAY_CONFIG(uint64_t, task_failure_entry_ttl_ms, 15 * 60 * 1000)
 /// the retry counter of the task or actor is only used when it fails in other ways
 /// that is not related to running out of memory. Note infinite retry (-1) is not
 /// supported.
-RAY_CONFIG(uint64_t, task_oom_retries, 3)
+RAY_CONFIG(uint64_t, task_oom_retries, 15)

 /// If the raylet fails to get agent info, we will retry after this interval.
 RAY_CONFIG(uint64_t, raylet_get_agent_info_interval_ms, 1)
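
The config comment in the hunk above describes two separate budgets: OOM failures draw on `task_oom_retries`, while the regular retry counter is used only for other failures. A toy model of that bookkeeping, assuming nothing about Ray's actual implementation (the `RetryBudget` class and its field names are hypothetical), might look like this:

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Hypothetical per-task bookkeeping; names are illustrative only."""
    retries_left: int            # regular retry counter (non-OOM failures)
    oom_retries_left: int = 15   # mirrors the new task_oom_retries default

    def consume(self, failed_due_to_oom: bool) -> bool:
        """Spend one retry from the matching budget; True if a retry is allowed."""
        if failed_due_to_oom:
            if self.oom_retries_left <= 0:
                return False
            self.oom_retries_left -= 1
            return True
        if self.retries_left <= 0:
            return False
        self.retries_left -= 1
        return True


budget = RetryBudget(retries_left=1)
budget.consume(failed_due_to_oom=True)   # spends an OOM retry only
print(budget.retries_left, budget.oom_retries_left)
```

Both budgets are finite here, consistent with the note that infinite retry (-1) is not supported for OOM retries.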
2 changes: 1 addition & 1 deletion src/ray/raylet/node_manager.cc
@@ -2985,7 +2985,7 @@ MemoryUsageRefreshCallback NodeManager::CreateMemoryUsageRefreshCallback() {
 /// since we print the process memory in the message. Destroy should be called
 /// as soon as possible to free up memory.
 DestroyWorker(high_memory_eviction_target_,
-rpc::WorkerExitType::USER_ERROR,
+rpc::WorkerExitType::SYSTEM_ERROR,
 worker_exit_message,
 true /* force */);
