-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Refactor build rewinding to a fallback strategy for action rewinding #25396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Important The terms of service for this installation has not been accepted. Please ask the Organization owners to visit the Gemini Code Assist Admin Console to sign it. |
8139820 to
fa9dfa3
Compare
|
CI failures look unrelated, but persistent |
fa9dfa3 to
c0d54dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I will wait for @justinhorvitz's opinion on the changes related to action rewinding before I import the PR.
c0d54dd to
4764750
Compare
# Conflicts: # src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java
4764750 to
41d95ba
Compare
| "--rewind_lost_inputs", | ||
| "--features=cc_include_scanning", | ||
| "--experimental_remote_include_extraction_size_threshold=0", | ||
| "--experimental_remote_cache_eviction_retries=0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have an effect only after #25396? I don't know how stacked PRs work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's relevant already with this PR, which is the first in the stack. Without it, the lostInputWithRewindingDisabled test doesn't see the expected exit code. Since build rewinding and action rewinding are somewhat competing strategies for handling lost inputs now, I found it safer to disable build rewinding for the entire test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. We're no longer guarding the full build retry on a specific exit code which is only possible in bazel. So now --experimental_remote_cache_eviction_retries is load bearing for blaze, when there's a lost input with --norewind_lost_inputs. How can we ensure that --experimental_remote_cache_eviction_retries continues to be a no-op for blaze?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how these Bazel/Blaze differences are usually handled. If setting the flag to 0 in the global .blazerc is not sufficient, maybe the flag default could be changed with copybara?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use blazerc if necessary, I was just hoping there was a way to continue keeping the flag completely irrelevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the most natural solution and the one that's been used in the past when something like this came up. The only alternative I see would be to add a bit to the lost input exception types that all Bazel call sites set differently, but that feels worse to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mailed a blazerc change to set this flag to 0. Let's make sure that lands safely before importing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no, another issue. We need to modify invocation policy to force the flag to 0 instead of ignoring it. Otherwise the blazerc value will be ignored. I'll update here when that lands. If you don't hear from me within a week, please ping.
src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTestsHelper.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/remote/common/LostInputsEvent.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/remote/util/DigestUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java
Outdated
Show resolved
Hide resolved
|
@coeuvre This is now ready to be merged. |
|
Importing now! |
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default). This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`, building on the refactoring performed in #25396. Closes #25358. PiperOrigin-RevId: 736064349 Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default). This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`, building on the refactoring performed in bazelbuild#25396. Closes bazelbuild#25358. PiperOrigin-RevId: 736064349 Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
|
@bazel-io fork 8.2.0 |
…winding By throwing `LostInputExecExceptions` in the right locations, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel. This should be a pure refactoring with one exception: if the download of the outputs of a remotely executed action (with caching disabled) fails due cache eviction, the build is no longer retried after the remote execution has already been retried. An RE backend that repeatedly fails to keep the outputs of a freshly excuted action in the cache long enough for them to be downloaded is faulty and additional full build retries are pointless after retries of the action execution have failed repeatedly. This change in behavior is required to get existing tests to pass. Closes bazelbuild#25396. PiperOrigin-RevId: 734550819 Change-Id: Idabee2483beb05721c413c27ab610b8f260412ec (cherry picked from commit e8526d4)
…winding (#25545) By throwing `LostInputExecExceptions` in the right locations, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel. This should be a pure refactoring with one exception: if the download of the outputs of a remotely executed action (with caching disabled) fails due cache eviction, the build is no longer retried after the remote execution has already been retried. An RE backend that repeatedly fails to keep the outputs of a freshly excuted action in the cache long enough for them to be downloaded is faulty and additional full build retries are pointless after retries of the action execution have failed repeatedly. This change in behavior is required to get existing tests to pass. Closes #25396. PiperOrigin-RevId: 734550819 Change-Id: Idabee2483beb05721c413c27ab610b8f260412ec (cherry picked from commit e8526d4) Closes #25544
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default). This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`, building on the refactoring performed in bazelbuild#25396. Closes bazelbuild#25358. PiperOrigin-RevId: 736064349 Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
…g digest (#25538) Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default). This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`, building on the refactoring performed in #25396. Closes #25358. PiperOrigin-RevId: 736064349 Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8 Commit 85d9cc9 Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function. Changes made in #25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel. This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`. Closes #25448. PiperOrigin-RevId: 743552575 Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function. Changes made in bazelbuild#25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel. This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`. Closes bazelbuild#25448. PiperOrigin-RevId: 743552575 Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
By throwing
LostInputExecExceptionsin the right locations, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel.This should be a pure refactoring with one exception: if the download of the outputs of a remotely executed action (with caching disabled) fails due cache eviction, the build is no longer retried after the remote execution has already been retried. An RE backend that repeatedly fails to keep the outputs of a freshly excuted action in the cache long enough for them to be downloaded is faulty and additional full build retries are pointless after retries of the action execution have failed repeatedly. This change in behavior is required to get existing tests to pass.