[Core] Can't determine cause of actor task retry from logs

### What happened + What you expected to happen

I ran a Ray Data pipeline. Ray retried some actor tasks, and I tried to discover the cause by looking at the worker logs, but the logs didn't contain the root cause of the exception (`OSError` in the repro below):

> [2024-12-16 09:57:49,336 I 3869 1159262] core_worker.cc:594: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.data._internal.execution.operators.actor_pool_map_operator, class_name=_MapWorker, function_name=submit, function_hash=}, task_id=11596eb123ba278c979404eb7d775de13c4eb97c01000000, task_name=MapBatches(Embed), job_id=01000000, num_args=6, num_returns=1, max_retries=-1, depth=1, attempt_number=2, actor_task_spec={actor_id=979404eb7d775de13c4eb97c01000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=5, retry_exceptions=1}
[2024-12-16 09:57:50,340 W 3869 1159262] task_manager.cc:1107: Task attempt 11596eb123ba278c979404eb7d775de13c4eb97c01000000 failed with error TASK_EXECUTION_EXCEPTION Fail immediately? 0, status OK, error info error_message: "User exception:\n __call__\n    yield from self._batch_fn(input, ctx)\n  File \"/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py\", line 364, in transform_fn\n    res = fn(batch)\n  File \"/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py\", line 268, in fn\n    _handle_debugger_exception(e)\n  File \"/Users/balaji/ray/python/ray/data/_internal/planner/plan_udf_map_op.py\", line 292, in _handle_debugger_exception\n    raise UserCodeException() from e\nray.exceptions.UserCodeException"
error_type: TASK_EXECUTION_EXCEPTION

This happens because the traceback is long and Ray keeps only the last 500 characters of the error message, so the root cause at the top is cut off: https://github.com/ray-project/ray/blob/4dff29a6a1f444e1297c519c7b2a49a30ef0dc1c/python/ray/_raylet.pyx#L1076-L1081
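
To illustrate why capping to the tail loses the root cause, here is a minimal standalone sketch (no Ray required). `WrapperError` and `cap_tail` are hypothetical stand-ins for `UserCodeException` and the slicing in `_raylet.pyx`; they are not Ray APIs. Python prints the original exception first in a chained traceback, so any limit shorter than the text that follows it drops the `OSError` line:

```python
import traceback


class WrapperError(Exception):
    """Hypothetical stand-in for ray.exceptions.UserCodeException."""


def cap_tail(message: str, max_len: int) -> str:
    # Keep only the last `max_len` characters, mirroring the slice
    # `str(failure_object)[-MAX_APPLICATION_ERROR_LEN:]`.
    return message[-max_len:]


def chained_traceback() -> str:
    # Raise an OSError, wrap it with `raise ... from`, and format the result.
    try:
        try:
            raise OSError("Simulated error")
        except OSError as e:
            raise WrapperError() from e
    except WrapperError as e:
        return "".join(traceback.format_exception(type(e), e, e.__traceback__))


full = chained_traceback()
root_line = "OSError: Simulated error"

# Number of trailing characters needed to still include the root-cause line.
needed = len(full) - full.index(root_line)

# Any cap smaller than `needed` keeps the wrapper but loses the root cause.
capped = cap_tail(full, needed - 1)
print(root_line in full, root_line in capped)
```

In the real pipeline the wrapper's traceback alone exceeds the limit, which matches the log above: only the `UserCodeException` frames survive.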

The Ray Data traceback is long because we wrap the original error in a `UserCodeException`: https://github.com/ray-project/ray/blob/4dff29a6a1f444e1297c519c7b2a49a30ef0dc1c/python/ray/data/_internal/planner/plan_udf_map_op.py#L284-L292
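
Because the wrapper is raised with `raise ... from e`, the original error stays reachable through `__cause__`. The sketch below shows one way a compact summary could be built from that chain so the root cause fits in a short message. `WrapperError`, `root_cause`, and `describe` are hypothetical names for illustration only, not part of Ray:

```python
class WrapperError(Exception):
    """Hypothetical stand-in for ray.exceptions.UserCodeException."""


def root_cause(exc: BaseException) -> BaseException:
    # Follow explicit `raise ... from` links back to the first exception.
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def describe(exc: BaseException) -> str:
    # One-line summary naming both the wrapper and the original error.
    root = root_cause(exc)
    return f"{type(exc).__name__} caused by {type(root).__name__}: {root}"


try:
    try:
        raise OSError("Simulated error")
    except OSError as e:
        raise WrapperError() from e
except WrapperError as wrapped:
    summary = describe(wrapped)

print(summary)
```

This is only an illustration of where the information lives; whether Ray should log such a summary or change how it truncates is a separate design question.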

### Versions / Dependencies

644b5946a82d7cc26982942ec7125e5f97ba34f9

### Reproduction script

```python
import ray

ray.data.DataContext.get_current().actor_task_retry_on_errors = [OSError]


class Embed:
    def __init__(self):
        self._num_attempts = 0

    def __call__(self, batch):
        if self._num_attempts < 3:
            self._num_attempts += 1
            raise OSError("Simulated error")

        return batch


ray.data.range(1).map_batches(Embed, concurrency=1).materialize()
```



### Issue Severity

Medium: It is a significant difficulty but I can work around it.

	# Pass the failure object back to the CoreWorker.
	# We also cap the size of the error message to the last
	# MAX_APPLICATION_ERROR_LEN characters of the error message.
	if application_error != NULL:
	    application_error[0] = str(failure_object)[
	        -ray_constants.MAX_APPLICATION_ERROR_LEN:]

	def _handle_debugger_exception(e: Exception):
	    """If the Ray Debugger is enabled, keep the full stack trace unmodified
	    so that the debugger can stop at the initial unhandled exception.
	    Otherwise, clear the stack trace to omit noisy internal code path."""
	    ctx = ray.data.DataContext.get_current()
	    if _is_ray_debugger_post_mortem_enabled() or ctx.raise_original_map_exception:
	        raise e
	    else:
	        raise UserCodeException() from e

[Core] Can't determine cause of actor task retry from logs #49287
