[Feature] ClusterExecutor: Store file errors in output files #918

jan-janssen merged 1 commit into `main`
Conversation
📝 Walkthrough

Two backend execution files are modified to relocate error-handling logic into try/except blocks.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/executorlib/task_scheduler/file/backend.py (1)
71-90: ⚠️ Potential issue | 🔴 Critical

`NameError` if `backend_load_file` raises: both `apply_dict` and `time_start` are unbound.

If `backend_load_file` (line 72) throws, `apply_dict` is never assigned, so line 83 (`apply_dict=apply_dict`) raises a `NameError` inside the except handler. Similarly, `time_start` (line 73) is never set, so line 89 (`time.time() - time_start`) also raises a `NameError`.

Note how `cache_parallel.py` avoids this by initializing both before the try block (lines 40–41). Apply the same pattern here.

🐛 Proposed fix — initialize before the try block:
```diff
 def backend_execute_task_in_file(file_name: str) -> None:
     ...
+    apply_dict = {}
+    time_start = time.time()
     try:
         apply_dict = backend_load_file(file_name=file_name)
-        time_start = time.time()
         result = {
             "result": apply_dict["fn"].__call__(
                 *apply_dict["args"], **apply_dict["kwargs"]
             )
         }
     except Exception as error:
         result = {"error": error}
         backend_write_error_file(
             error=error,
             apply_dict=apply_dict,
         )
     backend_write_file(
         file_name=file_name,
         output=result,
         runtime=time.time() - time_start,
     )
```

src/executorlib/backend/cache_parallel.py (1)
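The unbound-name failure described for `backend.py` above can be reproduced in a few lines. In this sketch, `load` is a hypothetical stand-in for `backend_load_file`; the helpers `run_buggy` and `run_fixed` are illustrative, not part of the codebase:

```python
# Minimal reproduction of the unbound-name failure: a name first bound
# inside `try` is referenced in `except` after the binding statement raised.
# `load` is a hypothetical stand-in for backend_load_file.


def load(fail: bool) -> dict:
    if fail:
        raise OSError("cannot read input file")
    return {"fn": lambda x: x + 1, "args": (1,), "kwargs": {}}


def run_buggy(fail: bool) -> dict:
    try:
        apply_dict = load(fail)  # if this raises, apply_dict is never bound
        result = {"result": apply_dict["fn"](*apply_dict["args"], **apply_dict["kwargs"])}
    except Exception as error:
        # referencing apply_dict here raises NameError when load() failed
        result = {"error": error, "apply_dict": apply_dict}
    return result


def run_fixed(fail: bool) -> dict:
    apply_dict: dict = {}  # bound before the try block, so always defined
    try:
        apply_dict = load(fail)
        result = {"result": apply_dict["fn"](*apply_dict["args"], **apply_dict["kwargs"])}
    except Exception as error:
        result = {"error": error, "apply_dict": apply_dict}
    return result
```

Calling `run_buggy(True)` escalates the original `OSError` into a `NameError` raised from inside the handler, while `run_fixed(True)` reports the real error.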
42-68: ⚠️ Potential issue | 🔴 Critical

Deadlock: if rank 0 throws before `bcast`, all other ranks hang forever.

If `backend_load_file` (line 44) raises an exception on rank 0, execution jumps to the `except` block. Meanwhile every other rank is blocked in `MPI.COMM_WORLD.bcast` (line 45) waiting for rank 0, which never arrives. The `Barrier` at line 68 does not help because non-zero ranks never reach it.

The same asymmetry applies to `gather` (line 48): if rank 0 throws during `__call__` but other ranks complete successfully, they block on `gather` while rank 0 is in the `except` block.

Before this PR, load + bcast lived outside the try block, so any failure would propagate as an unhandled exception and MPI would abort all processes — no deadlock. Moving them inside the try block silently catches the error on rank 0 only.
One approach: broadcast a success/failure flag after the `__call__` so all ranks agree on the error path. For the load, keep it (and the bcast) before the try block so a load failure still crashes cleanly.

💡 Sketch: keep load+bcast outside try, protect only the execution:
```diff
     time_start = time.time()
-    apply_dict = {}
+    apply_dict = None
+    if mpi_rank_zero:
+        apply_dict = backend_load_file(file_name=file_name)
+    apply_dict = MPI.COMM_WORLD.bcast(apply_dict, root=0)
     try:
-        if mpi_rank_zero:
-            apply_dict = backend_load_file(file_name=file_name)
-        apply_dict = MPI.COMM_WORLD.bcast(apply_dict, root=0)
         output = apply_dict["fn"].__call__(*apply_dict["args"], **apply_dict["kwargs"])
         result = (
             MPI.COMM_WORLD.gather(output, root=0) if mpi_size_larger_one else output
         )
     except Exception as error:
+        # All ranks hit the same exception, so gather a sentinel to avoid deadlock
+        if mpi_size_larger_one:
+            MPI.COMM_WORLD.gather(None, root=0)
         if mpi_rank_zero:
             backend_write_file(
                 file_name=file_name,
                 output={"error": error},
                 runtime=time.time() - time_start,
             )
             backend_write_error_file(
                 error=error,
                 apply_dict=apply_dict,
             )
     else:
         if mpi_rank_zero:
             backend_write_file(
                 file_name=file_name,
                 output={"result": result},
                 runtime=time.time() - time_start,
             )
     MPI.COMM_WORLD.Barrier()
```

Note: this sketch still assumes all ranks throw/succeed together. If asymmetric failures are possible, a coordinated error flag (`allreduce` on a boolean) is needed before branching.
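The coordinated-flag idea can be demonstrated without MPI. The sketch below uses threads as stand-in ranks and a toy `allreduce(AND)` so that every rank branches identically even when only one of them fails; all names here (`FlagReduce`, `run`, `N_RANKS`) are hypothetical and only emulate the pattern:

```python
# Pure-Python emulation of the coordinated error flag: every "rank" reduces
# a local success flag before branching, so all ranks take the same path and
# none is left waiting in a collective that its peers never enter.

import threading

N_RANKS = 4


class FlagReduce:
    """Toy allreduce(AND): collect one boolean per rank, return the AND to all."""

    def __init__(self, n: int) -> None:
        self._n = n
        self._flags = []
        self._cond = threading.Condition()
        self._result = None

    def allreduce_and(self, flag: bool) -> bool:
        with self._cond:
            self._flags.append(flag)
            if len(self._flags) == self._n:
                self._result = all(self._flags)
                self._cond.notify_all()
            else:
                self._cond.wait_for(lambda: self._result is not None, timeout=5)
            return bool(self._result)


def run(failing_rank):
    """Run N_RANKS emulated ranks; return the branch each rank took."""
    reduce_op = FlagReduce(N_RANKS)
    outcomes = [""] * N_RANKS

    def rank(i: int) -> None:
        try:
            if i == failing_rank:
                raise RuntimeError("task failed on this rank")
            ok = True
        except Exception:
            ok = False
        # every rank participates in the reduction, then branches identically
        outcomes[i] = "result" if reduce_op.allreduce_and(ok) else "error"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(N_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

With no failing rank every rank writes a result; with one failing rank all ranks agree on the error branch, which is exactly the agreement the real MPI version would obtain from `allreduce` on a boolean.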
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #918   +/-   ##
=======================================
  Coverage   93.78%   93.78%
=======================================
  Files          38       38
  Lines        1947     1947
=======================================
  Hits         1826     1826
  Misses        121      121
```

☔ View full report in Codecov by Sentry.