
bazel 5.0.0rc2 hangs when both --bes_backend and --build_event_binary_file specified #14363

Closed
darl opened this issue Dec 1, 2021 · 10 comments
Labels: P1 (I'll work on this now. Assignee required), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments

darl commented Dec 1, 2021

Description of the problem / feature request:

--bes_backend is specified in our .bazelrc
--build_event_binary_file is added by the IntelliJ plugin

When both are specified, Bazel sometimes hangs.
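
For illustration, the combination roughly looks like this (a sketch only; the backend address and file name are taken from the repro command below, and the plugin adds its flag at sync time):

# .bazelrc (ours)
build --bes_backend=grpc://localhost:6000

# added by the IntelliJ plugin during sync
# --build_event_binary_file=build.log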

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel build //... --bes_backend=grpc://localhost:6000 --build_event_binary_file=build.log '--override_repository=intellij_aspect=/Users/darl/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/212.5457.46/IntelliJ IDEA.app.plugins/ijwb/aspect' --output_groups=intellij-resolve-java-direct-deps,intellij-info-generic,intellij-info-java-direct-deps --aspects=@intellij_aspect//:intellij_info_bundled.bzl%intellij_info_aspect 

What operating system are you running Bazel on?

MacOS, Linux

What's the output of bazel info release?

release 5.0.0rc2

Have you found anything relevant by searching the web?

Nothing similar

Any other information, logs, or outputs that you want to share?

Thread dump of Bazel when it hangs:
https://gist.github.com/darl/94141514150e09b2460030f283cc9f21
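
For anyone trying to capture a similar dump: one general way to do it (standard JVM tooling, not specific to this report, assuming jstack from a JDK is on the PATH) is to point jstack at the Bazel server process:

jstack "$(bazel info server_pid)" > bazel-threads.txt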

brentleyjones (Contributor):

cc: @coeuvre

brentleyjones (Contributor):

@meteorcloudy probably a release blocker?

darl (Author) commented Dec 1, 2021

I just found that part of my report was caused by a misbehaving BES.
But Bazel still hangs when syncing with IDEA.
I have updated the issue.

meteorcloudy (Member):

Can you confirm this doesn't happen with Bazel 4.2.1?

darl (Author) commented Dec 2, 2021

Yes, it works with 4.2.1.
It even works with 5.0.0rc1.

meteorcloudy added the P1 (I'll work on this now. Assignee required), release blocker, team-Remote-Exec (Issues and PRs for the Execution (Remote) team), and type: bug labels on Dec 2, 2021
meteorcloudy (Member) commented Dec 2, 2021

/cc @coeuvre Any guess as to what causes this and how to fix it?

coeuvre (Member) commented Dec 2, 2021

I didn't see any suspicious commits between 5.0.0rc1 and 5.0.0rc2. Can you please share a minimal repro?

darl (Author) commented Dec 3, 2021

After some time I have found:

  1. --remote_max_connections=200 works as a workaround (see the sketch after this list).
  2. I'm able to reproduce the issue with https://github.com/bazelbuild/rules_scala if I specify --remote_max_connections=5. See below.
  3. --bes_upload_mode=fully_async looks broken. After build completion, Bazel still waits for events to be uploaded (even without build_event_binary_file, which disables async mode according to the documentation).
  4. After build completion I see a lot of communication with the remote cache.
  5. I think the IntelliJ IDEA plugin's aspect increases the number of outputs, so Bazel needs to check many more files in the remote cache.
  6. I don't think build_event_binary_file is related to the issue, but it greatly increases wall time.
  7. When build_event_binary_file is enabled, there are some strange pauses in the communication with the remote cache (according to tcpdump).
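
As a concrete illustration of item 1, the workaround could look like this in a .bazelrc (a sketch only, not an officially recommended setting; 200 is simply the value that happened to work here):

# Raise the cap on concurrent gRPC connections so the pool is not exhausted.
build --remote_max_connections=200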

Steps to reproduce:

  1. We need the aspect from the IDEA plugin: https://github.com/bazelbuild/intellij/tree/master/aspect

     wget 'https://plugins.jetbrains.com/plugin/download?rel=true&updateId=147375' -O intellij_bazel.zip
     unzip intellij_bazel.zip

  2. Start a remote cache and a BES server:

     docker run --rm -d -p 9092:9092 buchgr/bazel-remote-cache:v2.1.2
     docker run -p 1985:1985 -p 8080:8080 -d gcr.io/flame-public/buildbuddy-app-onprem:latest

  3. Build rules_scala:

     git clone https://github.com/bazelbuild/rules_scala
     cd rules_scala

     # replace localhost with docker's host
     # and $HOME with the directory where the plugin was downloaded
     USE_BAZEL_VERSION=5.0.0rc2 bazel build //src/... //scala/... //private/... //scala_proto/... //junit/... //jmh/... --remote_cache=grpc://localhost:9092 --bes_backend=grpc://localhost:1985 '--override_repository=intellij_aspect=$HOME/ijwb/aspect' --output_groups=intellij-resolve-java-direct-deps,intellij-info-generic,intellij-info-java-direct-deps --aspects=@intellij_aspect//:intellij_info_bundled.bzl%intellij_info_aspect --remote_max_connections=5 --build_event_binary_file=build.log

After some time:

INFO: Elapsed time: 58.680s, Critical Path: 24.76s
INFO: 530 processes: 161 internal, 335 darwin-sandbox, 34 worker.
INFO: Build completed successfully, 530 total actions
  4. Run the same command again:
USE_BAZEL_VERSION=5.0.0rc2 bazel build //src/... //scala/... //private/... //scala_proto/... //junit/... //jmh/... --remote_cache=grpc://localhost:9092 --bes_backend=grpc://localhost:1985 '--override_repository=intellij_aspect=$HOME/ijwb/aspect' --output_groups=intellij-resolve-java-direct-deps,intellij-info-generic,intellij-info-java-direct-deps --aspects=@intellij_aspect//:intellij_info_bundled.bzl%intellij_info_aspect --remote_max_connections=5 --build_event_binary_file=build.log

And it hangs:

INFO: Invocation ID: 3a03e834-2939-4e41-8921-1eec5613d888
DEBUG: /private/var/tmp/_bazel_darl/289eca14d256fd3c8c2c4ac11f093b5c/external/bazel_toolchains/rules/rbe_repo/checked_in.bzl:106:14: buildkite_config not using checked in configs as bazel rc version was used
INFO: Analyzed 72 targets (1 packages loaded, 4 targets configured).
INFO: Found 72 targets...
INFO: Elapsed time: 0.301s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
Waiting for build events upload: 201s
  BinaryFormatFileTransport
  Build Event Service

coeuvre (Member) commented Dec 3, 2021

Thanks for the repro! I am looking into the fix.

coeuvre (Member) commented Dec 9, 2021

Found the root cause. Working on the fix.

coeuvre added a commit to coeuvre/bazel that referenced this issue Dec 14, 2021
With a recent change that limits the maximum number of gRPC connections by default, acquiring a connection can suspend a thread if no connection is available.

gRPC calls are scheduled onto a dedicated background thread pool. Workers in that pool are responsible for acquiring a connection before starting the RPC call.

There can be a race condition in which a worker thread handles some gRPC calls and then switches to a new call that acquires new connections. If the number of connections reaches the maximum, the worker thread is suspended and never gets a chance to switch back to the previous calls, so the connections held by those calls are never released.

This PR changes connection acquisition to not use a blocking get when acquiring gRPC connections.

Fixes bazelbuild#14363.

Closes bazelbuild#14416.

PiperOrigin-RevId: 416282883
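
To illustrate the pattern the commit message describes (a minimal, hypothetical sketch, not Bazel's actual code or the actual change in the PR; the class and method names are made up), here is a bounded connection pool whose blocking acquire can park the very worker threads that would otherwise release connections, contrasted with an asynchronous acquire that never parks a worker:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a bounded gRPC connection pool.
final class ConnectionPool {
  static final class Connection {}

  private final Deque<CompletableFuture<Connection>> waiters = new ArrayDeque<>();
  private int available;

  ConnectionPool(int maxConnections) {
    this.available = maxConnections;
  }

  // Problematic pattern: parks the calling worker thread until a connection is free.
  // If every pool worker parks here, no thread is left to release connections: deadlock.
  synchronized Connection acquireBlocking() throws InterruptedException {
    while (available == 0) {
      wait();
    }
    available--;
    return new Connection();
  }

  // Non-blocking pattern (the spirit of the fix): returns immediately and completes
  // the future once a connection is released, so worker threads are never parked.
  synchronized CompletableFuture<Connection> acquireAsync() {
    if (available > 0) {
      available--;
      return CompletableFuture.completedFuture(new Connection());
    }
    CompletableFuture<Connection> pending = new CompletableFuture<>();
    waiters.add(pending);
    return pending;
  }

  void release(Connection unused) {
    CompletableFuture<Connection> next;
    synchronized (this) {
      next = waiters.poll();
      if (next == null) {
        available++;
        notify(); // wake one blocked acquireBlocking() caller, if any
      }
    }
    if (next != null) {
      next.complete(new Connection()); // hand the freed slot straight to a waiter
    }
  }
}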
brentleyjones pushed a commit to brentleyjones/bazel that referenced this issue Dec 14, 2021
(cherry picked from commit ad663a7)
Wyverald pushed a commit that referenced this issue Dec 14, 2021
Bencodes pushed a commit to Bencodes/bazel that referenced this issue Jan 10, 2022