[Core] Can't determine cause of actor task retry from logs #49287
Description
What happened + What you expected to happen
I ran a Ray Data pipeline. Ray retried some actor tasks, and I tried to find the cause in the worker logs, but the logs didn't contain the root cause of the exception (the OSError in the repro below):
[2024-12-16 09:57:49,336 I 3869 1159262] core_worker.cc:594: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.data._internal.execution.operators.actor_pool_map_operator, class_name=_MapWorker, function_name=submit, function_hash=}, task_id=11596eb123ba278c979404eb7d775de13c4eb97c01000000, task_name=MapBatches(Embed), job_id=01000000, num_args=6, num_returns=1, max_retries=-1, depth=1, attempt_number=2, actor_task_spec={actor_id=979404eb7d775de13c4eb97c01000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=5, retry_exceptions=1}
[2024-12-16 09:57:50,340 W 3869 1159262] task_manager.cc:1107: Task attempt 11596eb123ba278c979404eb7d775de13c4eb97c01000000 failed with error TASK_EXECUTION_EXCEPTION Fail immediately? 0, status OK, error info error_message: "User exception:\n call\n yield from self._batch_fn(input, ctx)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 364, in transform_fn\n res = fn(batch)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 268, in fn\n _handle_debugger_exception(e)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 292, in _handle_debugger_exception\n raise UserCodeException() from e\nray.exceptions.UserCodeException"
error_type: TASK_EXECUTION_EXCEPTION
This occurs because the traceback is long and Ray truncates it to 500 characters, which cuts off the original OSError:
Lines 1076 to 1081 in 4dff29a
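To illustrate the effect, here is a minimal pure-Python sketch. It assumes the truncation keeps the last 500 characters, which is what the log above suggests because the message starts mid-traceback. `truncate_tail`, `MAX_LEN`, and the fake frames are hypothetical and are not Ray's actual code.

```python
MAX_LEN = 500  # limit quoted in this issue; the constant name is hypothetical


def truncate_tail(message: str, max_len: int = MAX_LEN) -> str:
    """Keep only the last `max_len` characters of `message` (assumed behavior)."""
    return message if len(message) <= max_len else message[-max_len:]


# Fake traceback text: the root cause appears first, followed by many
# wrapper frames, and the wrapping exception's name comes last.
root_cause = "OSError: Simulated error\n"
wrapper_frames = "".join(
    f'  File "wrapper_{i}.py", line {i}, in fn\n' for i in range(40)
)
full = root_cause + wrapper_frames + "ray.exceptions.UserCodeException"

logged = truncate_tail(full)
print("OSError" in full)    # the full text contains the root cause
print("OSError" in logged)  # the truncated text does not
```

Under that assumption, anything that precedes the final 500 characters, including the original exception type and message, never reaches the log.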
The Ray Data traceback is long because Ray Data wraps the original error in a UserCodeException:
ray/python/ray/data/_internal/planner/plan_udf_map_op.py
Lines 284 to 292 in 4dff29a
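The wrapping pattern can be sketched without Ray using standard Python exception chaining. The `UserCodeException` class below is a local stand-in for `ray.exceptions.UserCodeException`, and `udf` and `call_udf` are hypothetical names. The sketch only shows where the root cause lands in the formatted traceback, not how Ray Data is implemented.

```python
import traceback


class UserCodeException(Exception):
    """Local stand-in for ray.exceptions.UserCodeException."""


def udf(batch):
    # Plays the role of the user's failing __call__.
    raise OSError("Simulated error")


def call_udf(batch):
    try:
        return udf(batch)
    except Exception as e:
        # Same shape as the `raise UserCodeException() from e` in the log.
        raise UserCodeException() from e


try:
    call_udf({"id": [0]})
except UserCodeException as exc:
    cause = exc.__cause__  # the original OSError is still attached
    formatted = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )

print(formatted)
```

In the printed output, the OSError and its frames come first, followed by "The above exception was the direct cause of the following exception:" and then the wrapper's frames. The root cause therefore sits at the start of the text, while the end only names UserCodeException.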
Versions / Dependencies
Reproduction script
import ray

ray.data.DataContext.get_current().actor_task_retry_on_errors = [OSError]


class Embed:
    def __init__(self):
        self._num_attempts = 0

    def __call__(self, batch):
        if self._num_attempts < 3:
            self._num_attempts += 1
            raise OSError("Simulated error")
        return batch


ray.data.range(1).map_batches(Embed, concurrency=1).materialize()
Issue Severity
Medium: It is a significant difficulty but I can work around it.