
[Core] Can't determine cause of actor task retry from logs #49287

Open
@bveeramani

Description

What happened + What you expected to happen

I ran a Ray Data pipeline. Ray retried some actor tasks, and I tried to discover the cause by looking at the worker logs, but the logs didn't contain the root cause of the exception (OSError in the repro below):

[2024-12-16 09:57:49,336 I 3869 1159262] core_worker.cc:594: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.data._internal.execution.operators.actor_pool_map_operator, class_name=_MapWorker, function_name=submit, function_hash=}, task_id=11596eb123ba278c979404eb7d775de13c4eb97c01000000, task_name=MapBatches(Embed), job_id=01000000, num_args=6, num_returns=1, max_retries=-1, depth=1, attempt_number=2, actor_task_spec={actor_id=979404eb7d775de13c4eb97c01000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=5, retry_exceptions=1}
[2024-12-16 09:57:50,340 W 3869 1159262] task_manager.cc:1107: Task attempt 11596eb123ba278c979404eb7d775de13c4eb97c01000000 failed with error TASK_EXECUTION_EXCEPTION Fail immediately? 0, status OK, error info error_message: "User exception:\n call\n yield from self._batch_fn(input, ctx)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 364, in transform_fn\n res = fn(batch)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 268, in fn\n _handle_debugger_exception(e)\n File "/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py", line 292, in _handle_debugger_exception\n raise UserCodeException() from e\nray.exceptions.UserCodeException"
error_type: TASK_EXECUTION_EXCEPTION

This happens because the traceback is long and Ray keeps only the last 500 characters (MAX_APPLICATION_ERROR_LEN) of the error message, dropping the beginning of the traceback where the chained root cause appears:

ray/python/ray/_raylet.pyx, lines 1076 to 1081 at 4dff29a:

# Pass the failure object back to the CoreWorker.
# We also cap the size of the error message to the last
# MAX_APPLICATION_ERROR_LEN characters of the error message.
if application_error != NULL:
    application_error[0] = str(failure_object)[
        -ray_constants.MAX_APPLICATION_ERROR_LEN:]
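
To see why keeping only the tail loses the root cause, here is a minimal sketch (not Ray code; MAX_APPLICATION_ERROR_LEN is assumed to be 500, as above). Python prints the chained cause first and the wrapper's frames after it, so the last 500 characters of a sufficiently long traceback contain only the wrapper:

import traceback

MAX_APPLICATION_ERROR_LEN = 500  # assumed value of ray_constants.MAX_APPLICATION_ERROR_LEN


class UserCodeException(Exception):
    """Stand-in for ray.exceptions.UserCodeException."""


def wrap(depth):
    # Add stack frames so the wrapper's portion of the traceback alone
    # exceeds 500 characters, similar to Ray Data's internal call chain.
    if depth:
        wrap(depth - 1)
        return
    try:
        raise OSError("Simulated error")  # the real root cause
    except OSError as e:
        raise UserCodeException() from e


try:
    wrap(20)
except UserCodeException:
    full = traceback.format_exc()

truncated = full[-MAX_APPLICATION_ERROR_LEN:]
print("OSError" in full)       # True: the cause is printed near the top
print("OSError" in truncated)  # False: the tail contains only the wrapper's frames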

The Ray Data traceback is long because we wrap the original error:

def _handle_debugger_exception(e: Exception):
    """If the Ray Debugger is enabled, keep the full stack trace unmodified
    so that the debugger can stop at the initial unhandled exception.
    Otherwise, clear the stack trace to omit noisy internal code path."""
    ctx = ray.data.DataContext.get_current()
    if _is_ray_debugger_post_mortem_enabled() or ctx.raise_original_map_exception:
        raise e
    else:
        raise UserCodeException() from e
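
A possible workaround, based on the check above (a sketch only; behavior may vary by version): set raise_original_map_exception on the DataContext so Ray Data re-raises the original exception instead of wrapping it, keeping the root cause at the end of the message where tail truncation preserves it:

import ray

# Based on _handle_debugger_exception above: with this flag set, Ray Data
# re-raises the user's original exception instead of wrapping it in
# UserCodeException, so the root cause stays at the tail of the traceback
# and survives the 500-character truncation.
ray.data.DataContext.get_current().raise_original_map_exception = True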

Versions / Dependencies

644b594

Reproduction script

import ray

ray.data.DataContext.get_current().actor_task_retry_on_errors = [OSError]


class Embed:
    def __init__(self):
        self._num_attempts = 0

    def __call__(self, batch):
        if self._num_attempts < 3:
            self._num_attempts += 1
            raise OSError("Simulated error")

        return batch


ray.data.range(1).map_batches(Embed, concurrency=1).materialize()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P1 (Issue that should be fixed within a few weeks), bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-observability
