-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Retry build when digests of top-level outputs are missing #25448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6d74866 to
fd731fd
Compare
|
Stacked on #25396 |
b8cefea to
8df7e66
Compare
|
This can be reviewed independently of #25358. |
|
|
||
| ImmutableMap<String, ActionInput> knownLostOutputs = ImmutableMap.of(); | ||
| try { | ||
| ensureToplevelArtifacts(env, importantArtifacts, inputMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the whole ensureTopLevelArtifacts functionality be offloaded to an ImportantOutputHandler? It looks an awful lot like what the blaze ImportantOutputHandler does, and it would clean up the interface awkwardness introduced by passing in known lost inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, this cleaned up the change quite a bit. I do still need to pass in a new argument (a way to get the generating action for an artifact).
| InputMetadataProvider metadataProvider, | ||
| ImmutableMap<String, ActionInput> knownLostOutputs) { | ||
| return new LostArtifacts( | ||
| knownLostOutputs, new ActionInputDepOwnerMap(knownLostOutputs.values())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the owner map matter at all for full build rewinding or only action rewinding? It looks like you're just creating an empty one and not populating any owners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter for build rewinding, but as I only noticed after I added more tests, it does for action rewinding for tree artifacts. I am now tracking owners, PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on some refactoring that will hopefully simplify this ownership tracking. I sent the first change for review today: https://bazel-review.googlesource.com/c/bazel/+/274110. The goal is to eventually end up at a place where the ownership tracking is optional (ActionRewindStrategy fills it in if it's not present).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I did wonder why ownership needed to be supplied manually. The CL description helped me understand this better.
941690f to
d9d761d
Compare
|
@justinhorvitz I added new tests that cover all the kinds of artifacts that Bazel might download via Bazel doesn't contain any callsite of |
|
I think that this is not a fundamental problem: I can get the test to pass by passing @justinhorvitz I pushed a commit that adds runfile tree artifacts to |
203ee76 to
e329b38
Compare
The reason that blaze does not process runfiles like this in I think it would be best to make the bazel case work similarly if possible. Do you think it's feasible to pass |
|
Bazel has flags such as |
|
The aspect-based approach is intended to validate leases and materialize runfiles of top-level targets, not anything else. I would require a flag, unless you make it the default. I think you could also get away with having |
|
@justinhorvitz I resolved the merge conflict, PTAL. |
| var byDigest = byDigestBuilder.build(); | ||
| return new LostArtifacts( | ||
| byDigest, | ||
| ActionRewindStrategy.calculateLostInputOwners(byDigest.values(), metadataProvider)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making this visible, we can make LostArtifacts take an Optional<LostInputOwners>, and pass that into ActionRewindStrategy. Then the lost inputs and outputs cases would become more similar.
src/main/java/com/google/devtools/build/lib/skyframe/CompletionFunction.java
Outdated
Show resolved
Hide resolved
|
The rewinding part is looking pretty good. Just left a couple comments. I did not review the remote execution parts that I am not familiar with. |
fmeum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinhorvitz I hope I addressed your comments, please take another look.
@coeuvre This is ready to be reviewed from the remote execution perspective.
| for (ActionInput lostOutput : lostOutputs.byDigest().values()) { | ||
| builtArtifacts.remove(lostOutput); | ||
| builtArtifacts.removeAll(lostOutputs.owners().getOwners(lostOutput)); | ||
| lostOutputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinhorvitz Is it okay that Bazel skips this part because it doesn't populate owners? I would think that the answer is no, which would probably require moving this computation into ActionRewindStrategy.prepareRewindPlanForLostTopLevelOutputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a commit with this change for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal from builtArtifacts is relevant for errorful cases where builtArtifacts is passed to postFailedEvent. I believe that the _partiallyBuiltTarget_ rewinding test cases are supposed to cover this. The owners would be covered by the tree artifact one.
|
@bazel-io fork 8.2.0 |
src/main/java/com/google/devtools/build/lib/remote/common/BulkTransferException.java
Outdated
Show resolved
Hide resolved
| */ | ||
| public ImmutableMap<String, ActionInput> getLostInputs( | ||
| Function<String, ActionInput> actionInputResolver) { | ||
| public LostArtifacts getLostArtifacts(InputMetadataProvider metadataProvider) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Function<String, ActionInput> is not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still enough and I just switched back. InputMetadataProvider is now always a transitive dep of BulkTransferException, which is what I initially wanted to avoid - but LostArtifacts brings that with it.
coeuvre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will import this.
|
@bazel-io fork 8.2.0 |
|
I have made some changes on top of this PR during internal reviews, most notably:
The review is still undergoing but should be merged soon. |
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function. Changes made in bazelbuild#25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel. This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`. Closes bazelbuild#25448. PiperOrigin-RevId: 743552575 Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
… it is also a top-level runfile of the same target. The bug was introduced by #25448 (28270f1) which added top-level output rewinding for bazel. Blaze does not process or rewind runfiles from `CompletionFunction`, so in such a case, there can be a discrepancy between the metadata in `inputMap` (reflects the unrewound `RunfilesArtifactValue`) and `importantInputMap` (reflects the rewound `FileArtifactValue`). The culprit change switches to `inputMap`, resulting in the bug. As a quick fix, pass two metadata providers to `ImportantOutputHandler` - one containing only important metadata (used by blaze's `ImportantOutputHandler`), and one containing the full metadata including artifacts from hidden output groups (used by bazel's `RemoteImportantOutputHandler`). I think this can be cleaned up more, but I want to fix the bug first. PiperOrigin-RevId: 753201303 Change-Id: I46fc6fd6384541ec002cbd7c74ba83ac9fa45a09
This is achieved by implementing an
ImportantOutputHandlerfor Bazel and moving the downloading logic that previously lived inCompletionFunctioninto it. This slightly differs from Blaze since Bazel accounts for runfiles withinCompletionFunction's call to theImportantOutputHandler#processOutputsAndGetLostArtifactsfunction.Changes made in #25396 to support build rewinding as a fallback of action rewinding are extracted into
ActionRewindStrategyto ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but--rewind_lost_inputsis not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel.This change requires a fix to
BulkTransferException#add, which previously resulted in infinite recursion when passed aBulkTransferException.