
Large actions with BwtB + dynamic execution never converge on local builds #23201

Open
jmmv opened this issue Aug 2, 2024 · 0 comments
Labels
P3 (We're not considering working on this, but happy to review a PR. No assignee) · team-Local-Exec (Issues and PRs for the Execution (Local) team) · type: bug

Comments

jmmv (Contributor) commented Aug 2, 2024

Description of the bug:

The scenario is like this:

  • We have some intermediate C++ libraries with lots of large inputs. Let's call one of these X.
  • We enable BwtB (Build without the Bytes) to avoid transferring unnecessary intermediates in the common case.
  • We enable dynamic execution for the whole build to optimize incremental builds like "rebuild a single C++ file, relink the intermediate library X, run the test" (a sketch of the relevant flags follows this list).

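For context, here is a rough sketch of the kind of configuration involved. The remote endpoint is a placeholder and this is not our exact setup, but these are the standard flags for BwtB and dynamic execution:

```
# .bazelrc sketch -- remote.example.com is a placeholder endpoint.

# Remote execution backend.
build --remote_executor=grpcs://remote.example.com

# BwtB: do not download remote action outputs to the local machine
# unless they are needed (e.g. as inputs to a locally executed action).
build --remote_download_minimal

# Dynamic execution: race local and remote execution for each spawn.
build --internal_spawn_scheduler
build --spawn_strategy=dynamic
```
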
When BwtB is disabled, dynamic execution in this scenario offers a massive speedup for the last step (running the test), and our developers really want that quick turnaround for these kinds of C++ edits.

However, when we enable BwtB, the build of X never converges on local execution when the network is relatively slow. The problem goes like this:

  • Bazel spawns the build of X locally and remotely.
  • Bazel starts downloading the inputs of X, some of which are very large.
  • The build of X finishes remotely.
  • Bazel cancels the local build of X. This interrupts the download of the large inputs halfway through, and the partially downloaded files are deleted from disk.

When running this build multiple times, one would expect the chain of actions to happen purely locally at some point. But that's never the case: the build of X is always remote.

The problem here is triggered by the network being relatively slow: because the downloads of some of the large inputs to X never complete, X never even gets a chance to start running locally. Thus, even if running X locally would be faster overall, the remote build always finishes before Bazel has downloaded all of the inputs.

I think Bazel should either keep the partial downloads on disk and try to resume them on a subsequent run, or continue the downloads in the background even if the local action is cancelled. I'm not sure which is preferable, though. The former seems hard to implement, and the latter could be problematic in large builds, leaving Bazel with a long queue of downloads that may ultimately be useless...

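For what it's worth, here is a conceptual sketch (in Java, since that is Bazel's implementation language) of the second idea: letting input downloads outlive the cancelled local branch so that a later build finds the files on disk. The class and method names are made up for illustration; this is not Bazel's actual prefetcher API:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: downloads are keyed by digest and shared, and they are
// deliberately NOT cancelled when the local branch of a dynamic action loses.
final class DetachedInputFetcher {
  private final Map<String, CompletableFuture<Void>> inFlight = new ConcurrentHashMap<>();

  /** Starts (or joins) a download; the returned future is not tied to the caller's lifetime. */
  CompletableFuture<Void> fetch(String digest, Runnable download) {
    CompletableFuture<Void> future =
        inFlight.computeIfAbsent(digest, d -> CompletableFuture.runAsync(download));
    // Drop the bookkeeping entry once the download finishes, successfully or not.
    future.whenComplete((unused, error) -> inFlight.remove(digest, future));
    return future;
  }

  /** Called when the local branch of a dynamic action is cancelled. */
  void onLocalBranchCancelled(String digest) {
    // Deliberately a no-op: the download keeps running in the background, so a
    // subsequent build (or the next local attempt) finds the input on disk.
  }

  public static void main(String[] args) {
    DetachedInputFetcher fetcher = new DetachedInputFetcher();
    CompletableFuture<Void> download =
        fetcher.fetch("digest-of-large-input",  // hypothetical digest
            () -> System.out.println("downloading large input..."));
    fetcher.onLocalBranchCancelled("digest-of-large-input");  // local branch lost the race
    download.join();  // ...but the download still completes
  }
}
```

The obvious downside, as noted above, is that nothing bounds the queue of background downloads, so some cancellation or eviction policy would still be needed for large builds.
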
Which category does this issue belong to?

Local Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

N/A

What is the output of bazel info release?

release 6.5.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added the team-Local-Exec (Issues and PRs for the Execution (Local) team) label Aug 2, 2024
@zhengwei143 zhengwei143 added the P3 (We're not considering working on this, but happy to review a PR. No assignee) label and removed the untriaged label Aug 6, 2024