Skip to content

Conversation

@fmeum
Copy link
Collaborator

@fmeum fmeum commented Mar 5, 2025

Background

As of #25396, action rewinding (controlled by --rewind_lost_inputs) and build rewinding (controlled by --experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if --jobs=1, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:

  • When a lost input is detected, the progress of actions running concurrently isn't lost.
  • Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
  • Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
  • Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

Changes

This PR adds Bazel support for --rewind_lost_inputs with arbitrary --jobs values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new RewoundActionSynchronizer interface to SkyframeActionExecutor that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).

The synchronization scheme relies on a single ReadWriteLock that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in RemoteRewoundActionSynchronizer for details as well as a proof that this scheme is free of deadlocks.


Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):

bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled

TODO:

Fixes #26657

@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 5 times, most recently from 12473dc to be8fae5 Compare March 10, 2025 08:30
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 7 times, most recently from 2ce580a to c11ea62 Compare March 14, 2025 14:41
@fmeum fmeum changed the title Fix action rewinding races with Bazel's filesystems Add support for --rewind_lost_inputs to Bazel Mar 14, 2025
@fmeum fmeum changed the title Add support for --rewind_lost_inputs to Bazel Add Bazel support for --rewind_lost_inputs Mar 14, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 14, 2025

@justinhorvitz @coeuvre This is the first (and hopefully final) PR I have planned that is specific to action rewinding (not necessary for build rewinding) - it should be all that remains to get Bazel to support action rewinding. If time permits, it would be great if you could already take a first look and let me know whether it could have a chance to be accepted. If so, I would then work out the todos.

@fmeum fmeum requested review from coeuvre and justinhorvitz March 14, 2025 18:10
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch from 4b2906c to dc02bd8 Compare March 18, 2025 17:40
@justinhorvitz
Copy link
Contributor

I'm very impressed that you figured out how to get the synchronization right. That alone is quite a feat. But I'm thinking about how we can avoid the complexity altogether:

  1. Is there a way to translate a failure to read an input that was deleted into a lost input exception?
  2. Why does rewinding work for blaze without this complexity? It's because our action filesystem has no dependency on reading from the output base. When it needs to open an input stream, it does so by reading the blob from remote storage.

Do either of those thoughts resonate with you?

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 20, 2025

Before I worked on this PR, I actually tried to make the output handling of Bazel's spawn strategies atomic. This runs into a few Bazel-specific complexities:

  • Bazel supports BwoB with a remote or disk cache and local execution, which means that it's not only the remote strategy that can lose outputs, it's all the strategies. This includes standalone and unsandboxed worker execution, which execute arbitrary binaries directly on the real exec root. These Spawns may read truncated inputs, non-atomically modify their outputs, outright fail if their outputs exist before they run, etc.
  • On Windows, you can't delete a file that is currently open for reading, which makes surgically removing outputs while another action is consuming them even more challenging.

The explicit synchronization scheme also has an advantage compared to the Blaze approach in that it prevents "input tearing", that is, an action consuming outputs from multiple different (re-)executions of another action. This makes the effects of flakiness in builds even worse and Bazel builds already tend to be more flaky on average (simply because most companies don't have a "Bazel/Blaze team" and hermetic C++ toolchains are more difficult to procure).

If we wanted to avoid extra complexity, we could limit action rewinding to builds that only use sandboxed execution strategies. That would still require somewhat subtle work to make all these strategies atomic in how they handle their input, while not supporting the default Javac strategy (unsandboxed multiplex worker). It would also mean that we can't enable --rewind_lost_inputs by default, which is unfortunate for a flag that has a quite substantial but also pretty hidden usability impact.

I'm happy to provide more context on Bazel use cases and challenges and am also very much open to other approaches - this is just the best I could manage so far after weighing these pros and cons.

@justinhorvitz
Copy link
Contributor

If we wanted to avoid extra complexity, we could limit action rewinding to builds that only use sandboxed execution strategies

We're actively looking to enable rewinding for internal builds that use mixed execution strategies (remote, persistent worker, local). @ericfelly is thinking about some sort of local storage for inputs that is separate from the output tree to avoid the deletion race concerns.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 26, 2025

We're actively looking to enable rewinding for internal builds that use mixed execution strategies (remote, persistent worker, local). @ericfelly is thinking about some sort of local storage for inputs that is separate from the output tree to avoid the deletion race concerns.

This sounds very interesting. Can you share more details about the use case and/or approach?

More specifically, are you planning to support:

  • local executions that discover lost inputs (so that they trigger rewinding of the actions that produce these inputs, but the rewound actions are still all assumed to be remote)
  • local executions whose outputs are lost (so that they may be rewound)

@justinhorvitz
Copy link
Contributor

Outputs of local actions should never be lost, so we're only planning a solution for the former.

I'm not going to dive into reviewing this PR any further right now unless some bazel stakeholders decide that this is the direction they want to go. Just for reference, my team's priorities are to support google-internal use cases, and I try my best to review PRs on a best-effort basis. In this case, I would want someone more bazel-oriented to make a decision on the direction.

@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 3 times, most recently from 6cfe0cc to 7f78996 Compare May 20, 2025 16:13
*/
final class RemoteRewoundActionSynchronizer implements RewoundActionSynchronizer {
// A single coarse lock is used to synchronize rewound actions (writers) and both rewound and
// non-rewound actions (readers) as long as no rewound action has attempted to prepare for its
Copy link
Collaborator Author

@fmeum fmeum May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coeuvre Regarding the optimization potential we discussed privately: If we can be sure that a certain action writes its outputs atomically and doesn't need them to be deleted (say, if we know the action runs with remote execution, where we can arrange for this), it would not need to acquire the write lock. If all actions are of this type (as they would be at Google), the fine-grained lock would never be inflated.

@meisterT meisterT requested a review from ericfelly October 16, 2025 07:15
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 2 times, most recently from acc911a to 90740d1 Compare December 14, 2025 11:13
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 2 times, most recently from 68e7210 to 00cb629 Compare December 14, 2025 14:11
@fmeum
Copy link
Collaborator Author

fmeum commented Dec 14, 2025

@gemini-code-assist /review

@gemini-code-assist
Copy link

Thank you for tagging me, @fmeum. I've been following the discussion regarding the complexities of action rewinding, especially the synchronization challenges and the considerations for mixed execution strategies. I understand that the goal is to ensure robust support for --rewind_lost_inputs.

I can now perform a comprehensive review of the current state of this pull request, taking into account the points raised in your conversation with @justinhorvitz. I will provide detailed feedback based on the correctness, efficiency, maintainability, and security criteria, along with any relevant style guide adherence. Please let me know if you'd like me to proceed with the review.

@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch from 00cb629 to 29c2d1d Compare December 14, 2025 14:15
@fmeum
Copy link
Collaborator Author

fmeum commented Dec 14, 2025

@gemini-code-assist Yes, please do

@gemini-code-assist
Copy link

Thank you, @fmeum. I will now proceed with a comprehensive review of the pull request, taking into account the previous discussions and the specified review criteria. I will provide detailed feedback on correctness, efficiency, maintainability, and security, as well as any relevant style guide adherence.

@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 4 times, most recently from 42d0640 to a79d87f Compare December 15, 2025 10:30
@fmeum
Copy link
Collaborator Author

fmeum commented Dec 15, 2025

I rebased onto master and resolved the todo about synchronization with CompletionFunction. This also now includes #28108 to ensure testing on Windows.

@fmeum fmeum marked this pull request as ready for review December 15, 2025 10:31
@fmeum fmeum requested a review from a team as a code owner December 15, 2025 10:31
@github-actions github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Dec 15, 2025
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch 3 times, most recently from 767cb95 to 1b2f221 Compare December 29, 2025 16:57
fmeum added 2 commits January 12, 2026 20:02
This enables future work on making action rewinding work in Bazel with `--jobs` values larger than 1, which is much more challenging for standalone execution due to actions directly operating on the exec root.

# Conflicts:
#	src/test/java/com/google/devtools/build/lib/skyframe/rewinding/BUILD
#	src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTest.java

# Conflicts:
#	src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTest.java
@fmeum fmeum force-pushed the action-rewinding-simple-concurrency branch from 1b2f221 to 4c72fd1 Compare January 12, 2026 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting-review PR is awaiting review from an assigned reviewer team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

changing remote_cache should should invalidate remote files in action cache

2 participants