Bazel CI: Bazel server sometimes failed to bind a port when running inside integration tests #20743
Comments
The code related to the failure: bazel/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java Lines 414 to 447 in 1f2d576
We can temporarily set |
To mitigate #20743 PiperOrigin-RevId: 595652084 Change-Id: I349024873d86f9b57dffca3c53429bf8ec3ed453
This is reproducible on CI by running
See https://bazel-review.git.corp.google.com/c/bazel/+/236871/8 |
The issue doesn't seem to be reproducible with a simple Java binary trying to bind a port: |
Tried to bind with netty, still not reproducible: |
Tried to carve out the grpc + netty code from GrpcServerImpl.java and bind a socket in a simple Java binary, now it's reproducible: |
Well that's impressive work. I'm surprised that this is something specific to gRPC + Netty and neither Netty alone nor plain Java reproduces the problem. Maybe (Also, it's probably easier to choose either IPv6 or IPv4 for the testing code) |
I managed to reproduce with
built with the following BUILD file in the Bazel repo
and run
which failed on macstudio9 with (after a few tries)
Is it because |
This shouldn't happen randomly, right?
Retrying seems to work:
might be an acceptable workaround? @lberki |
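For readers following along, the retry idea can be sketched as below. This is a minimal illustration in Python, not the change that landed in Bazel (the server code discussed in this thread is Java, in GrpcServerImpl.java). The function name `bind_with_retry`, its parameters, and the default attempt count and delay are all assumptions made up for this sketch.

```python
import socket
import time


def bind_with_retry(host="::1", port=0, family=socket.AF_INET6,
                    attempts=5, delay_s=0.5, socket_factory=socket.socket):
    """Bind a TCP socket, retrying when bind() raises OSError.

    Hypothetical helper: the defaults here are illustrative and are not
    taken from Bazel. Returns the bound, listening socket, or re-raises
    the last OSError once all attempts have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        sock = socket_factory(family, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))  # port 0 lets the OS pick a free port
            sock.listen()
            return sock
        except OSError as e:
            sock.close()  # do not leak the descriptor of a failed attempt
            last_error = e
            if attempt < attempts:
                time.sleep(delay_s)
    raise last_error
```

A retry like this only papers over an intermittent failure. It does not explain why the bind fails in the first place, which is the concern raised in the replies that follow.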
I'm not really happy about it, but it sure beats disabling the sandbox on our CI machines.
To be precise, we are only turning Yeah, I guess this has to be fixed on the grpc side, so there isn't a quick fix likely. |
Fixes bazelbuild#20743 PiperOrigin-RevId: 595935153 Change-Id: I0409552aa92f3886c5abf3bd3ce50d67594dab7e
Hi, grpc team here. Not sure what you mean by "has to be fixed on our side". Wouldn't it rather be a For the sake of experiment, I've tried to run it on my Mac, and all worked well:
I'm not running on arm though: |
FWIW, from the symptoms I saw, it's not clear whether it's an issue with the sandbox or gRPC. Also, one could make the case that even if it's a bug in the sandbox, it would be nice to have a workaround in gRPC because gRPC is more malleable than the operating system.
Well, kudos to @hvadehra then :) -- Then the involvement of gRPC is probably limited to a workaround, if that.
Thanks for investigating this! @lberki To be clear, grpc-java and grpc-python share no code at all, so I doubt a workaround in grpc makes sense. I would understand if it was a bug in the Java implementation, or JDK-specific (or, for that matter, a bug in the Python implementation, or Python-interpreter-specific). Retrying on the app side, as you did in #20755, is a very reasonable approach.
I meant we reproduced it with some minimal Python code without grpc at all. So we are pretty sure this is a macOS sandbox problem rather than a grpc problem. Thanks for looking into this!
@meteorcloudy Could you share @hvadehra's reproducer?
But so far, we have only managed to reproduce it on certain CI machines. @hvadehra can probably share more findings?
@meteorcloudy That's a corp URL - could you push a branch or patch to GitHub? Thanks!
// server.py
repro one-liner: ^ On the problematic CI machines, when run in isolation, this fails reliably exactly every 15 seconds. If run in parallel with other bind attempts or test runs, one or both fail (once every 15 seconds). |
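The original server.py and one-liner did not survive in this copy of the thread, and the following is not a reconstruction of them. It is only a sketch, under stated assumptions, of how one might check whether bind failures recur at a fixed interval such as the 15 seconds described above. The names `probe_bind` and `run_probe`, the IPv4 loopback address, and the probe interval are all hypothetical.

```python
import socket
import time


def probe_bind(host="127.0.0.1", port=0, family=socket.AF_INET):
    """Attempt a single bind; return None on success, or the OSError."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))  # port 0: let the OS choose
        return None
    except OSError as e:
        return e


def run_probe(duration_s=60.0, interval_s=0.5, probe=probe_bind,
              clock=time.monotonic, sleep=time.sleep):
    """Probe repeatedly; return elapsed times (seconds) of each failure."""
    start = clock()
    failures = []
    while True:
        elapsed = clock() - start
        if elapsed >= duration_s:
            break
        if probe() is not None:
            failures.append(elapsed)
        sleep(interval_s)
    return failures


if __name__ == "__main__":
    # Roughly evenly spaced entries would suggest a periodic failure.
    print(run_probe())
```

On a machine without the problem the returned list should be empty. Whether this loop triggers the failure at all inside the macOS sandbox is not something this sketch can confirm.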
Looks like some recent change redirected me to the corp link. The original link is https://bazel-review.googlesource.com/c/bazel/+/237247. Hopefully it'll keep working.
@hvadehra Can you share |
Good: Bad: 7/10 are bad. The evidence notwithstanding, to our knowledge there shouldn't be and isn't any difference between them. |
https://buildkite.com/bazel/bazel-bazel/builds/26312#018d1722-0e79-47b4-82ed-9cc47487e05a Related issues: - #20743 - #5206 PiperOrigin-RevId: 599452705 Change-Id: I2fdccd9df513064e5bc9add4f1802d4c1ce9c6da
Related issues: - bazelbuild#20743 - bazelbuild#5206 PiperOrigin-RevId: 599754818 Change-Id: I228201d578b7459332aebfea6ab4d7c041b3e6c4
Set Xcode version to 15.1 on macOS arm64 machines. bazel_determinism_test seems to be flaky due to a non-deterministic issue in the clang compiler in Xcode 14.2. Fixes #20690 PiperOrigin-RevId: 598760276 Change-Id: Ibc46dfa64fe91f26acfa5091a07c17e3bf97f29c
____
Allow network for two Java tests to avoid binding issue on macOS sandbox https://buildkite.com/bazel/bazel-bazel/builds/26312#018d1722-0e79-47b4-82ed-9cc47487e05a Related issues: - #20743 - #5206 PiperOrigin-RevId: 599452705 Change-Id: I2fdccd9df513064e5bc9add4f1802d4c1ce9c6da
____
Allow network for StarlarkDebugServerTest Related issues: - #20743 - #5206 PiperOrigin-RevId: 599754818 Change-Id: I228201d578b7459332aebfea6ab4d7c041b3e6c4
____
Add mirror for embedded JDKs URLs. Reduce flakiness like: https://buildkite.com/bazel/bazel-bazel/builds/26343#018d1e09-4c23-404c-a307-7476e092c7ab PiperOrigin-RevId: 599759327 Change-Id: I43fa2ec996f03e77da926c2afeaca13cbf029a1b
Description of the bug:
There are many flaky tests on macOS failing with something like:
This started to happen after IPv6 was enabled on macOS machines due to recent infrastructure changes.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This error is so far only reproducible on CI when running a large number of tests together, see
https://buildkite.com/bazel/bazel-bazel/builds/26101#018cd24a-0086-4792-92bc-16b274588cb4
Which operating system are you running Bazel on?
macOS arm64
What is the output of bazel info release?
No response
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
Maybe related to #2486
Any other information, logs, or outputs that you want to share?
This seems to only happen with --sandbox_default_allow_network=false, which we use to block internet access for all integration tests.