Skip to content

Conversation

@fmeum
Copy link
Collaborator

@fmeum fmeum commented Mar 4, 2025

This is achieved by implementing an ImportantOutputHandler for Bazel and moving the downloading logic that previously lived in CompletionFunction into it. This slightly differs from Blaze since Bazel accounts for runfiles within CompletionFunction's call to the ImportantOutputHandler#processOutputsAndGetLostArtifacts function.

Changes made in #25396 to support build rewinding as a fallback of action rewinding are extracted into ActionRewindStrategy to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but --rewind_lost_inputs is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel.

This change requires a fix to BulkTransferException#add, which previously resulted in infinite recursion when passed a BulkTransferException.

@fmeum fmeum force-pushed the build-rewinding-top-level-outputs branch 4 times, most recently from 6d74866 to fd731fd Compare March 4, 2025 15:49
@fmeum fmeum changed the title Build rewinding top level outputs Retry build when digests of top-level outputs are missing Mar 4, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 4, 2025

Stacked on #25396

@fmeum fmeum force-pushed the build-rewinding-top-level-outputs branch 3 times, most recently from b8cefea to 8df7e66 Compare March 7, 2025 18:18
@fmeum fmeum marked this pull request as ready for review March 7, 2025 18:31
@fmeum fmeum requested a review from a team as a code owner March 7, 2025 18:31
@fmeum fmeum requested review from coeuvre and justinhorvitz March 7, 2025 18:32
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Mar 7, 2025
@fmeum fmeum removed the request for review from a team March 7, 2025 18:32
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 7, 2025

This can be reviewed independently of #25358.


ImmutableMap<String, ActionInput> knownLostOutputs = ImmutableMap.of();
try {
ensureToplevelArtifacts(env, importantArtifacts, inputMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the whole ensureTopLevelArtifacts functionality be offloaded to an ImportantOutputHandler? It looks an awful lot like what the blaze ImportantOutputHandler does, and it would clean up the interface awkwardness introduced by passing in known lost inputs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, this cleaned up the change quite a bit. I do still need to pass in a new argument (a way to get the generating action for an artifact).

InputMetadataProvider metadataProvider,
ImmutableMap<String, ActionInput> knownLostOutputs) {
return new LostArtifacts(
knownLostOutputs, new ActionInputDepOwnerMap(knownLostOutputs.values()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the owner map matter at all for full build rewinding or only action rewinding? It looks like you're just creating an empty one and not populating any owners.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter for build rewinding, but as I only noticed after I added more tests, it does for action rewinding for tree artifacts. I am now tracking owners, PTAL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on some refactoring that will hopefully simplify this ownership tracking. I sent the first change for review today: https://bazel-review.googlesource.com/c/bazel/+/274110. The goal is to eventually end up at a place where the ownership tracking is optional (ActionRewindStrategy fills it in if it's not present).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I did wonder why ownership needed to be supplied manually. The CL description helped me understand this better.

@fmeum fmeum force-pushed the build-rewinding-top-level-outputs branch 2 times, most recently from 941690f to d9d761d Compare March 9, 2025 07:43
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 9, 2025

@justinhorvitz I added new tests that cover all the kinds of artifacts that Bazel might download via CompletionFunction. Top-level outputs work well, including tree artifacts, but lost runfiles result in this failure in the new remoteCacheEvictBlobs_whenRunfilesRequested_succeedsWithActionRewinding integration test case:

Caused by: java.lang.IllegalStateException: Lost input not a dep of the failed action and can't be associated with such a dep. lostInput: File:[[<execution_root>]bazel-out/darwin_arm64-fastbuild/bin]a/foo.out/file-inside, owners: [File:[[<execution_root>]bazel-out/darwin_arm64-fastbuild/bin]a/foo.out], failedKey: TargetCompletionKey{topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, actionLookupKey=ConfiguredTargetKey{label=//a:bin, config=BuildConfigurationKey[5ef29e1379ec5c6e584ffec760372c7a27e8ee210a5a61ff33d4bd48eeff889b]}, willTest=false}
	at com.google.devtools.build.lib.skyframe.rewinding.ActionRewindStrategy.getLostInputOwningDirectDeps(ActionRewindStrategy.java:528)
	at com.google.devtools.build.lib.skyframe.rewinding.ActionRewindStrategy.prepareRewindPlan(ActionRewindStrategy.java:258)
	at com.google.devtools.build.lib.skyframe.rewinding.ActionRewindStrategy.prepareRewindPlanForLostTopLevelOutputs(ActionRewindStrategy.java:137)
	at com.google.devtools.build.lib.skyframe.CompletionFunction.informImportantOutputHandler(CompletionFunction.java:474)
	at com.google.devtools.build.lib.skyframe.CompletionFunction.compute(CompletionFunction.java:351)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:471)
	... 7 more

Bazel doesn't contain any callsite of processRunfilesAndGetLostArtifacts, so I am not sure whether I need to go through an entirely different flow here or just add to the failedKeyDeps. Do you have a suggestion for me?

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 9, 2025

I think that this is not a fundamental problem: I can get the test to pass by passing builtArtifacts into informImportantOutputHandler, not just importantArtifacts - the latter doesn't contain the runfiles tree artifacts. It's rightfully not an important artifact for BES purposes, but with Bazel, it would probably need to be considered one for action rewinding purposes.

@justinhorvitz I pushed a commit that adds runfile tree artifacts to failedDepKeys if present. What do you think of that approach?

@fmeum fmeum force-pushed the build-rewinding-top-level-outputs branch 2 times, most recently from 203ee76 to e329b38 Compare March 9, 2025 21:01
@fmeum fmeum requested a review from justinhorvitz March 9, 2025 21:26
@justinhorvitz
Copy link
Contributor

I think that this is not a fundamental problem: I can get the test to pass by passing builtArtifacts into informImportantOutputHandler, not just importantArtifacts - the latter doesn't contain the runfiles tree artifacts. It's rightfully not an important artifact for BES purposes, but with Bazel, it would probably need to be considered one for action rewinding purposes.

@justinhorvitz I pushed a commit that adds runfile tree artifacts to failedDepKeys if present. What do you think of that approach?

The reason that blaze does not process runfiles like this in CompletionFunction and instead uses the top-level-aspect approach I mentioned over chat is because for certain builds, that aspect's action also builds a manifest of where the runfiles can be found in remote storage. We need to validate cached storage addresses prior to building that manifest because the addresses of the rewound action's outputs may be different than what was cached.

I think it would be best to make the bazel case work similarly if possible. Do you think it's feasible to pass --aspects=StageRunfilesAspect when runfiles of top-level targets are needed? If so, I'll open-source that aspect and its action.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 10, 2025

Bazel has flags such as --remote_download_regex that can request the materialization of individual files from a runfiles tree (or any runfiles tree), but it's impossible to tell from a given regex whether it could match top-level runfiles. Would the aspect approach be able to support this use case? Would Bazel users need to manually pass that flag (and even with a bazel run)?

@justinhorvitz
Copy link
Contributor

The aspect-based approach is intended to validate leases and materialize runfiles of top-level targets, not anything else. I would require a flag, unless you make it the default.

I think you could also get away with having CompletionFunction pass in the runfiles to processOutputsAndGetLostArtifacts, and we'll just have the google handler ignore them and continue to rely on the runfiles-specific method.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 17, 2025

@justinhorvitz I resolved the merge conflict, PTAL.

var byDigest = byDigestBuilder.build();
return new LostArtifacts(
byDigest,
ActionRewindStrategy.calculateLostInputOwners(byDigest.values(), metadataProvider));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making this visible, we can make LostArtifacts take an Optional<LostInputOwners>, and pass that into ActionRewindStrategy. Then the lost inputs and outputs cases would become more similar.

@justinhorvitz
Copy link
Contributor

The rewinding part is looking pretty good. Just left a couple comments. I did not review the remote execution parts that I am not familiar with.

Copy link
Collaborator Author

@fmeum fmeum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinhorvitz I hope I addressed your comments, please take another look.
@coeuvre This is ready to be reviewed from the remote execution perspective.

for (ActionInput lostOutput : lostOutputs.byDigest().values()) {
builtArtifacts.remove(lostOutput);
builtArtifacts.removeAll(lostOutputs.owners().getOwners(lostOutput));
lostOutputs
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinhorvitz Is it okay that Bazel skips this part because it doesn't populate owners? I would think that the answer is no, which would probably require moving this computation into ActionRewindStrategy.prepareRewindPlanForLostTopLevelOutputs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit with this change for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal from builtArtifacts is relevant for errorful cases where builtArtifacts is passed to postFailedEvent. I believe that the _partiallyBuiltTarget_ rewinding test cases are supposed to cover this. The owners would be covered by the tree artifact one.

@fmeum fmeum requested a review from justinhorvitz March 19, 2025 22:24
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 20, 2025

@bazel-io fork 8.2.0

*/
public ImmutableMap<String, ActionInput> getLostInputs(
Function<String, ActionInput> actionInputResolver) {
public LostArtifacts getLostArtifacts(InputMetadataProvider metadataProvider) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Function<String, ActionInput> is not enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still enough and I just switched back. InputMetadataProvider is now always a transitive dep of BulkTransferException, which is what I initially wanted to avoid - but LostArtifacts brings that with it.

@fmeum fmeum requested a review from coeuvre March 26, 2025 14:21
Copy link
Member

@coeuvre coeuvre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will import this.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 26, 2025

@bazel-io fork 8.2.0

@coeuvre
Copy link
Member

coeuvre commented Apr 2, 2025

I have made some changes on top of this PR during internal reviews, most notably:

  1. I undid the change to interface ImportantOutputHandler#processOutputsAndGetLostArtifacts. Instead, let the new implementation find its own way to get the generating action of an artifact.
  2. I moved the new ImportantOutputHandler implementation into its own target so it can be shared with other modules internally.
  3. Instead of passing builtArtifacts and mutate it inside ActionRewindStrategy#prepareRewindPlanForLostTopLevelOutputs, mutate it directly from the call site.

The review is still undergoing but should be merged soon.

@copybara-service copybara-service bot closed this in 28270f1 Apr 3, 2025
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Apr 3, 2025
@fmeum fmeum deleted the build-rewinding-top-level-outputs branch April 14, 2025 12:59
fmeum added a commit to fmeum/bazel that referenced this pull request Apr 25, 2025
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function.

Changes made in bazelbuild#25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel.

This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`.

Closes bazelbuild#25448.

PiperOrigin-RevId: 743552575
Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
copybara-service bot pushed a commit that referenced this pull request Apr 30, 2025
… it is also a top-level runfile of the same target.

The bug was introduced by #25448 (28270f1) which added top-level output rewinding for bazel. Blaze does not process or rewind runfiles from `CompletionFunction`, so in such a case, there can be a discrepancy between the metadata in `inputMap` (reflects the unrewound `RunfilesArtifactValue`) and `importantInputMap` (reflects the rewound `FileArtifactValue`). The culprit change switches to `inputMap`, resulting in the bug.

As a quick fix, pass two metadata providers to `ImportantOutputHandler` - one containing only important metadata (used by blaze's `ImportantOutputHandler`), and one containing the full metadata including artifacts from hidden output groups (used by bazel's `RemoteImportantOutputHandler`). I think this can be cleaned up more, but I want to fix the bug first.

PiperOrigin-RevId: 753201303
Change-Id: I46fc6fd6384541ec002cbd7c74ba83ac9fa45a09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants