[Feature] ClusterExecutor: Store file errors in output files #918

Merged
jan-janssen merged 1 commit into main from clusterexecutorfileerrors on Feb 14, 2026

Conversation

jan-janssen (Member) commented Feb 14, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling and reporting during parallel task execution by ensuring error details are properly captured and logged on the root process.
    • Enhanced error handling flow in task file execution to catch and report exceptions more reliably.

coderabbitai bot (Contributor) commented Feb 14, 2026

📝 Walkthrough

Two backend execution files are modified to relocate error handling logic into try/except blocks. In cache_parallel.py, apply_dict loading and execution shifts into the try block with error writes only on rank 0 and an added MPI.Barrier. In file/backend.py, apply_dict and time_start initialization moves into the try block, creating a potential NameError in error handling.

Changes

MPI Execution Error Handling: src/executorlib/backend/cache_parallel.py
  Relocates apply_dict loading, broadcasting, execution, and result gathering into the try/except block, with rank-0-only error write-back and MPI.Barrier synchronization.
File Backend Task Execution: src/executorlib/task_scheduler/file/backend.py
  Moves apply_dict and time_start initialization into the try block; introduces a potential NameError if an exception occurs before apply_dict is assigned.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 With whiskers twitched and code reviewed with care,
We caught the errors floating in the air!
Try blocks embrace our execution's grace,
But watch that NameError in the race—
MPI barriers keep the ranks aligned,
A safer path through logic intertwined! 🌙

🚥 Pre-merge checks: 4 passed
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately describes the main feature (storing file errors in output files for ClusterExecutor), which aligns with the error-handling changes in cache_parallel.py and backend.py.
  • Docstring Coverage ✅ Passed: docstring coverage is 100.00%, above the required 80.00% threshold.
  • Merge Conflict Detection ✅ Passed: no merge conflicts detected when merging into main.



coderabbitai bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/executorlib/task_scheduler/file/backend.py (1)

71-90: ⚠️ Potential issue | 🔴 Critical

NameError if backend_load_file raises: both apply_dict and time_start are unbound.

If backend_load_file (line 72) throws, apply_dict is never assigned, so line 83 (apply_dict=apply_dict) raises a NameError inside the except handler. Similarly, time_start (line 73) is never set, so line 89 (time.time() - time_start) also raises a NameError.

Note how cache_parallel.py avoids this by initializing both before the try block (lines 40–41). Apply the same pattern here.

🐛 Proposed fix — initialize before the try block
 def backend_execute_task_in_file(file_name: str) -> None:
     ...
+    apply_dict = {}
+    time_start = time.time()
     try:
         apply_dict = backend_load_file(file_name=file_name)
-        time_start = time.time()
         result = {
             "result": apply_dict["fn"].__call__(
                 *apply_dict["args"], **apply_dict["kwargs"]
             )
         }
     except Exception as error:
         result = {"error": error}
         backend_write_error_file(
             error=error,
             apply_dict=apply_dict,
         )
 
     backend_write_file(
         file_name=file_name,
         output=result,
         runtime=time.time() - time_start,
     )
src/executorlib/backend/cache_parallel.py (1)

42-68: ⚠️ Potential issue | 🔴 Critical

Deadlock: if rank 0 throws before bcast, all other ranks hang forever.

If backend_load_file (line 44) raises an exception on rank 0, execution jumps to the except block. Meanwhile every other rank is blocked in MPI.COMM_WORLD.bcast (line 45) waiting for rank 0, which never arrives. The Barrier at line 68 does not help because non-zero ranks never reach it.

The same asymmetry applies to gather (line 48): if rank 0 throws during __call__ but other ranks complete successfully, they block on gather while rank 0 is in the except block.

Before this PR, load + bcast lived outside the try block, so any failure would propagate as an unhandled exception and MPI would abort all processes — no deadlock. Moving them inside the try block silently catches the error on rank 0 only.

One approach: broadcast a success/failure flag after the __call__ so all ranks agree on the error path. For the load, keep it (and the bcast) before the try block so a load failure still crashes cleanly.

💡 Sketch: keep load+bcast outside try, protect only the execution
     time_start = time.time()
-    apply_dict = {}
+    apply_dict = None
+    if mpi_rank_zero:
+        apply_dict = backend_load_file(file_name=file_name)
+    apply_dict = MPI.COMM_WORLD.bcast(apply_dict, root=0)
     try:
-        if mpi_rank_zero:
-            apply_dict = backend_load_file(file_name=file_name)
-        apply_dict = MPI.COMM_WORLD.bcast(apply_dict, root=0)
         output = apply_dict["fn"].__call__(*apply_dict["args"], **apply_dict["kwargs"])
         result = (
             MPI.COMM_WORLD.gather(output, root=0) if mpi_size_larger_one else output
         )
     except Exception as error:
+        # All ranks hit the same exception, so gather a sentinel to avoid deadlock
+        if mpi_size_larger_one:
+            MPI.COMM_WORLD.gather(None, root=0)
         if mpi_rank_zero:
             backend_write_file(
                 file_name=file_name,
                 output={"error": error},
                 runtime=time.time() - time_start,
             )
             backend_write_error_file(
                 error=error,
                 apply_dict=apply_dict,
             )
     else:
         if mpi_rank_zero:
             backend_write_file(
                 file_name=file_name,
                 output={"result": result},
                 runtime=time.time() - time_start,
             )
     MPI.COMM_WORLD.Barrier()

Note: this sketch still assumes all ranks throw/succeed together. If asymmetric failures are possible, a coordinated error flag (allreduce on a boolean) is needed before branching.
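
The coordinated-flag idea can be illustrated without an MPI runtime. Each rank computes a local success flag; an allreduce with logical AND (in mpi4py, MPI.COMM_WORLD.allreduce(local_ok, op=MPI.LAND)) gives every rank the same global verdict, so all ranks take the same branch and collectives are either entered by everyone or skipped by everyone. The single-process sketch below simulates the ranks with a list; allreduce_land and run_ranks are illustrative names, not executorlib code:

```python
def allreduce_land(local_flags):
    # Stand-in for MPI.COMM_WORLD.allreduce(flag, op=MPI.LAND):
    # every rank receives the logical AND of all local flags.
    global_ok = all(local_flags)
    return [global_ok] * len(local_flags)


def run_ranks(task_results):
    """task_results[rank] is ('ok', value) or ('fail', exception)."""
    local_ok = [status == "ok" for status, _ in task_results]
    global_ok = allreduce_land(local_ok)
    branches = []
    for ok in global_ok:
        # Because all ranks agree on the verdict, a collective such as
        # gather is entered by all ranks or by none -- no deadlock.
        branches.append("gather" if ok else "error_path")
    return branches
```

With this coordination, one failing rank pushes every rank onto the error path, which is exactly the asymmetric-failure case the note above says the simpler sketch does not cover.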

codecov bot commented Feb 14, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.78%. Comparing base (3fd4fb3) to head (1540283).
⚠️ Report is 1 commit behind head on main.

Files with missing lines:
  • src/executorlib/backend/cache_parallel.py: patch 0.00%, 4 lines missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #918   +/-   ##
=======================================
  Coverage   93.78%   93.78%           
=======================================
  Files          38       38           
  Lines        1947     1947           
=======================================
  Hits         1826     1826           
  Misses        121      121           

☔ View full report in Codecov by Sentry.
@jan-janssen jan-janssen merged commit 6f3cc53 into main Feb 14, 2026
86 of 91 checks passed
@jan-janssen jan-janssen deleted the clusterexecutorfileerrors branch February 14, 2026 19:37