RBE actions not properly canceled when BES upload times out #19496
Comments
Running with
You also mention that RBE action retries might be a factor. How did you determine that an action was retried?
Regarding cancellation: my understanding is that, upon a build failure (which an abrupt exit qualifies as), Bazel cancels any pending gRPC requests to the remote execution server. But it's up to the remote side whether to actually abort the execution (aborting is not necessarily the best option; letting the action finish could still be advantageous to populate the remote cache for a subsequent build). So it would also help to know the expected behavior for Buildbarn in this situation.
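To make the distinction above concrete, here is a minimal sketch in Python with asyncio. It is a toy model, not Bazel or Buildbarn code: `remote_action`, `demo`, and the `abort_on_client_cancel` flag are hypothetical names invented for illustration. It only shows that cancelling the client's pending call and aborting the server-side work are two separate decisions.

```python
import asyncio


async def remote_action(state: dict, duration_s: float) -> None:
    """Stands in for the server-side execution of one action (hypothetical)."""
    await asyncio.sleep(duration_s)
    state["finished"] = True


async def demo(abort_on_client_cancel: bool) -> dict:
    state = {"finished": False}
    server_task = asyncio.create_task(remote_action(state, 0.05))

    async def client_call() -> None:
        try:
            # shield() keeps a cancelled client call from cancelling the
            # server-side task automatically.
            await asyncio.shield(server_task)
        except asyncio.CancelledError:
            # The remote side observes the cancellation and applies its own
            # policy: abort the work, or let it finish (e.g. to fill a cache).
            if abort_on_client_cancel:
                server_task.cancel()
            raise

    client_task = asyncio.create_task(client_call())
    await asyncio.sleep(0.01)   # the build fails while the action is running
    client_task.cancel()        # the client cancels its pending request
    await asyncio.gather(client_task, server_task, return_exceptions=True)
    return state


if __name__ == "__main__":
    print(asyncio.run(demo(abort_on_client_cancel=False)))
    print(asyncio.run(demo(abort_on_client_cancel=True)))
```

With the flag off, the action runs to completion even though the client has gone away; with it on, the action is aborted as soon as the cancellation is seen.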
Thanks! I'll give this a shot and report back tomorrow
By looking at RPC traces - I can see that a particular action was started via
From what I understand, Buildbarn will continue executing actions as long as a client that is executing it has an open session. Buildbarn will deduplicate identical actions, but that is not a factor here, as:
If no clients care about the action, Buildbarn will time out the action in a relatively short time (currently 60s for us; the action will otherwise continue to run for 1+ hours).
Additionally, when I Ctrl-C an RBE build, typically I can see RPCs error with a
We have previously run builds in an environment where, on build timeout, the VM is abruptly killed; this would cause actions to continue running on Buildbarn because the TCP sessions weren't properly closed (I might be using the wrong words here). I may be able to fix this with TCP keepalive configuration on Buildbarn, but it seems that maybe Bazel is not shutting down connections as expected?
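The behavior described above can be sketched as a toy model. This is not Buildbarn's implementation; `ToyScheduler`, `Operation`, and the method names are hypothetical, and the 60-second value is just the setting mentioned above. The point it illustrates: the abandon timer only starts once the scheduler notices that the last waiting client has detached, so a client that vanishes without its connection being closed keeps the action alive.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Operation:
    """One remote action plus the clients currently waiting on it."""
    name: str
    waiters: Set[str] = field(default_factory=set)
    abandoned_at: Optional[float] = None  # when the last waiter detached
    cancelled: bool = False


class ToyScheduler:
    """Hypothetical scheduler that cancels actions nobody is waiting on."""

    def __init__(self, abandon_timeout_s: float = 60.0) -> None:
        self.abandon_timeout_s = abandon_timeout_s
        self.operations: Dict[str, Operation] = {}

    def attach(self, op_name: str, client: str) -> None:
        op = self.operations.setdefault(op_name, Operation(op_name))
        op.waiters.add(client)
        op.abandoned_at = None  # someone cares again, so stop the timer

    def detach(self, op_name: str, client: str, now: float) -> None:
        # Only reached if the scheduler notices the client is gone,
        # e.g. because the connection was closed or a keepalive failed.
        op = self.operations[op_name]
        op.waiters.discard(client)
        if not op.waiters:
            op.abandoned_at = now

    def tick(self, now: float) -> List[str]:
        """Cancel operations abandoned for at least the timeout."""
        newly_cancelled = []
        for op in self.operations.values():
            if op.cancelled or op.abandoned_at is None:
                continue
            if now - op.abandoned_at >= self.abandon_timeout_s:
                op.cancelled = True
                newly_cancelled.append(op.name)
        return newly_cancelled


if __name__ == "__main__":
    sched = ToyScheduler(abandon_timeout_s=60.0)
    sched.attach("action-1", "bazel-client")
    print(sched.tick(now=3600.0))   # client never detached: nothing cancelled
    sched.detach("action-1", "bazel-client", now=3600.0)
    print(sched.tick(now=3660.0))   # cancelled 60s after the detach
```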
Here's the full stacktrace:
The portion
It's calling FindMissingBlobs on the RBE endpoint because it needs to upload BEP-referenced artifacts to the CAS.
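For readers unfamiliar with that call, the following is a rough sketch of the check-then-upload pattern it belongs to, using an in-memory stand-in for the CAS. `ToyCas`, `digest_of`, and `upload_referenced_artifacts` are hypothetical names for illustration, and SHA-256 is an assumed digest function; none of this is Bazel's actual uploader code.

```python
import hashlib
from typing import Dict, List, Tuple

# A digest identifies a blob by content hash and size in bytes.
Digest = Tuple[str, int]


def digest_of(data: bytes) -> Digest:
    return (hashlib.sha256(data).hexdigest(), len(data))


class ToyCas:
    """In-memory stand-in for a content-addressable store (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: Dict[Digest, bytes] = {}

    def find_missing_blobs(self, digests: List[Digest]) -> List[Digest]:
        # Return only the digests the store does not hold yet.
        return [d for d in digests if d not in self._blobs]

    def upload(self, data: bytes) -> Digest:
        d = digest_of(data)
        self._blobs[d] = data
        return d


def upload_referenced_artifacts(cas: ToyCas, artifacts: List[bytes]) -> List[Digest]:
    """Ask which blobs are missing, then upload only those."""
    by_digest = {digest_of(a): a for a in artifacts}
    missing = cas.find_missing_blobs(list(by_digest))
    for d in missing:
        cas.upload(by_digest[d])
    return missing


if __name__ == "__main__":
    cas = ToyCas()
    cas.upload(b"build.log")  # already present before the upload pass
    uploaded = upload_referenced_artifacts(cas, [b"build.log", b"test.xml"])
    print(len(uploaded))      # only the blob that was missing is uploaded
```

The missing-blobs query is what lets the client skip uploads for content the store already has.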
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.
Description of the bug:
In (at least some) cases, BES upload timing out does not seem to trigger proper cancellation of actions issued over RBE. In my case, I'm seeing:
and noticing that actions continue to run in Buildbarn despite the Bazel client (and indeed the VM it was running on) going away. From a cursory look at the BES code, it would seem that abruptly exiting on BES timeout is the intent.
It's possible that only RBE actions that required a retry are not properly canceled - this seems to be a pattern, and I don't yet have a counterexample.
Since bazel is exiting semi-cleanly (i.e. not getting SIGKILL) it should cancel actions executing via RBE, in order to free up remote resources.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately, I'm unable to reproduce this issue outside of a nightly long-running build; any tips on gathering more debugging info would be greatly appreciated!
Which operating system are you running Bazel on?
linux
What is the output of bazel info release?
release 6.3.2
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No; no ideas from those on Slack yet either
Any other information, logs, or outputs that you want to share?
No response