bazel: make `go test` process timeout before bazel kills it
#86363
4f055ba
to
80670c0
Compare
"strings"

"github.com/alessio/shellescape"
"github.com/cockroachdb/errors"
Can you use the standard library `errors` instead of `cockroachdb/errors`? `cockroachdb/errors` is a pretty heavy dependency, we don't necessarily want to force people to build it.
build/bazelutil/bazel-generate.sh
Outdated
CONTENTS=$(bazel run //pkg/cmd/generate-test-suites --run_under="cd $PWD && ")
echo "$CONTENTS" > pkg/BUILD.bazel
fi

bazel run //pkg/cmd/generate-test-timeouts --run_under="cd $PWD && "
You can put this in the branch above next to `generate-test-suites`, or in a new branch that checks `BUILD.bazel` files and `*.bzl` files. If none of the `BUILD.bazel` or `*.bzl` files have changed from `master` then `generate-test-timeouts` doesn't need to be run.
Also we don't want to unconditionally build `buildozer`. Building small `go` executables adds a non-negligible amount of latency; we don't want to force people to do this for no reason.
This is optional, but consider just merging what you currently have in `generate-test-timeouts` into `generate-test-suites`. `generate-test-suites` already depends on `buildozer` and does similar munging in the `BUILD.bazel` files, so the logic may just fit right in there.
This is optional, but consider just merging what you currently have in generate-test-timeouts into generate-test-suites. generate-test-suites already depends on buildozer and does similar munging in the BUILD.bazel files, so the logic may just fit right in there.
Right, I adopted part of the code from `generate-test-suites`, but I thought it might be better to separate them because `generate-test-suites` outputs something to stdout which is then forwarded to `pkg/BUILD.bazel`. Maybe I should make it write its output to `pkg/BUILD.bazel` directly (i.e. in the binary, instead of forwarding stdout to `pkg/BUILD.bazel`)? Then move the timeout code to the same package and rename it to `generate-test-suites-and-timeouts`?
Maybe I should make it write its output to pkg/BUILD.bazel directly (ie in the binary, instead of forwarding stdout to pkg/BUILD.bazel)? Then move the timeout code to the same package and rename it to generate-test-suites-and-timeouts?
This approach makes sense to me. As for the new name, `generate-test-suites-and-timeouts` is not ideal since the binary actually does a few different things at this point. Maybe `generate-bazel-extra` or something like that?
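The idea discussed above (one binary that writes its output to a file itself and selects its steps through flags) could look roughly like the sketch below. The flag names `-gen_test_suites` and `-gen_tests_timeouts` are the ones quoted elsewhere in this thread; the `options` type, `parseOptions`, `writeSuites`, the scratch output path, and the placeholder file contents are all assumptions made for illustration, not the PR's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

// options records which generation steps were requested (hypothetical type).
type options struct {
	genTestSuites   bool
	genTestTimeouts bool
}

// parseOptions reads the two boolean flags from args.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("generate-bazel-extra", flag.ContinueOnError)
	fs.BoolVar(&o.genTestSuites, "gen_test_suites", false, "generate test suites")
	fs.BoolVar(&o.genTestTimeouts, "gen_tests_timeouts", false, "generate test timeouts")
	err := fs.Parse(args)
	return o, err
}

// writeSuites writes contents directly to path, replacing the shell-side
// redirection of stdout into the file.
func writeSuites(path string, contents []byte) error {
	return os.WriteFile(path, contents, 0o644)
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if o.genTestSuites {
		// A scratch path keeps this sketch from touching a real checkout.
		out := filepath.Join(os.TempDir(), "BUILD.bazel.sketch")
		if err := writeSuites(out, []byte("# generated test suites would go here\n")); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println("wrote", out)
	}
	if o.genTestTimeouts {
		fmt.Println("timeout generation would run here")
	}
}
```

Writing the file from inside the binary means any diagnostic output on stdout can no longer end up in the generated `BUILD.bazel` by accident.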
Should we generate timeouts when `*.bzl` files change? I can't find a reason to do that, but I don't know enough about everything we are doing with `.bzl` files, and I am wondering because we do update `pkg/BUILD.bazel` when a `*.bzl` file changes. My current understanding is that with timeouts we are only concerned about the test target size, so if `BUILD.bazel` files didn't change and our binary `pkg/cmd/generate-bazel-extra` didn't change then we are good?
Hmm. I'm not aware of any timeouts that would need to be updated when `.bzl` files change. But I think you do need to run `generate-test-suites` (especially the generation of `pkg/BUILD.bazel`) if `.bzl` files change.
Right, but I am thinking of doing this:
if ! (files_unchanged_from_upstream $(find_relevant ./pkg -name BUILD.bazel) $(find_relevant ./pkg/cmd/generate-bazel-extra -name BUILD.bazel -or -name '*.go')); then
  bazel build @com_github_bazelbuild_buildtools//buildozer:buildozer
  bazel run //pkg/cmd/generate-bazel-extra --run_under="cd $PWD && " -- -gen_test_suites -gen_tests_timeouts
elif files_unchanged_from_upstream $(find_relevant ./pkg -name '*.bzl'); then
  echo "Skipping //pkg/cmd/generate-bazel-extra (relevant files are unchanged from upstream)."
else
  # Single quotes keep the backticks literal instead of triggering command substitution.
  echo 'Skipping `generate tests timeouts` from //pkg/cmd/generate-bazel-extra (relevant files are unchanged from upstream).'
  bazel build @com_github_bazelbuild_buildtools//buildozer:buildozer
  bazel run //pkg/cmd/generate-bazel-extra --run_under="cd $PWD && " -- -gen_test_suites
fi
so that we can skip generating timeouts when we don't need to.
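To make the three branches of the proposed conditional easier to check, here is a small sketch of the same decision as a pure function. The `plan` type, the `decide` function, and its two boolean inputs are hypothetical; they only model the branching and do not run Bazel.

```go
package main

import "fmt"

// plan records which generation steps should run (hypothetical type).
type plan struct {
	suites   bool
	timeouts bool
}

// decide models the proposed branching:
//   - BUILD.bazel files or the generator's own sources changed: run both steps;
//   - otherwise, only *.bzl files changed: regenerate test suites only;
//   - otherwise: skip the generator entirely.
func decide(buildOrGeneratorChanged, bzlChanged bool) plan {
	switch {
	case buildOrGeneratorChanged:
		return plan{suites: true, timeouts: true}
	case bzlChanged:
		return plan{suites: true}
	default:
		return plan{}
	}
}

func main() {
	fmt.Printf("%+v\n", decide(true, false))
	fmt.Printf("%+v\n", decide(false, true))
	fmt.Printf("%+v\n", decide(false, false))
}
```

Note that the first branch takes priority, so timeouts are regenerated whenever `BUILD.bazel` files change, regardless of the `*.bzl` state.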
Seems sensible to me.
"large": 900,
"enormous": 3600,
}
for _, size := range []string{"small", "medium", "large", "enormous"} {
for size, timeout := range testSizeToDefaultTimeout
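One caveat when ranging over the map directly: Go does not guarantee map iteration order, so any generated output that depends on the order may differ between runs. The sketch below shows one way to keep it stable by sorting the keys first. The map's value type (`int` seconds) and the `sortedSizes` helper are assumptions for this example; the four values are the Bazel defaults quoted in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// testSizeToDefaultTimeout maps Bazel test sizes to default timeouts in
// seconds. The int value type is an assumption made for this sketch.
var testSizeToDefaultTimeout = map[string]int{
	"small":    60,
	"medium":   300,
	"large":    900,
	"enormous": 3600,
}

// sortedSizes returns the map's keys ordered by ascending timeout, so callers
// can iterate deterministically.
func sortedSizes(m map[string]int) []string {
	sizes := make([]string, 0, len(m))
	for size := range m {
		sizes = append(sizes, size)
	}
	sort.Slice(sizes, func(i, j int) bool { return m[sizes[i]] < m[sizes[j]] })
	return sizes
}

func main() {
	for _, size := range sortedSizes(testSizeToDefaultTimeout) {
		fmt.Println(size, testSizeToDefaultTimeout[size])
	}
}
```

If the per-size work is independent and order never reaches the output, ranging over the map directly, as suggested above, is simpler.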
5d18b47
to
87b0f0d
Compare
So this is ready for a second pass now other than GitHub CI failing. I tried to figure out what is wrong and it looks like it fails here but I wasn't able to understand how it ended in this state. Can you take a look at it when you review?
For the GitHub CI failure, that's pretty odd. I'm not seeing anything in your change that would cause the failure. Maybe try rebasing and re-running CI to see if it was intermittent?
pkg/cmd/generate-bazel-extra/main.go
Outdated
@@ -38,7 +41,32 @@ func runBuildozer(args []string) {
}
}

func main() {
func getTestTargets(testTargetSize string) ([]string, error) {
match, _ := regexp.MatchString("\\b(?:small|medium|large|enormous)\\b", testTargetSize)
Nit: Using `regexp` for this is pretty odd. I'd say factor out `testSizeToDefaultTimeout` so it is a global variable, and then you can do
if _, ok := testSizeToDefaultTimeout[testTargetSize]; !ok {
	....
}
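A further reason to prefer the map lookup: `regexp.MatchString` reports whether the input contains a match anywhere, so the unanchored pattern also accepts strings that merely include a valid size as a whole word. The sketch below contrasts the two checks. The `validateSize` and `regexAccepts` helpers and the map's `int` value type are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// testSizeToDefaultTimeout is the package-level map suggested in the review;
// the int (seconds) value type is an assumption made for this sketch.
var testSizeToDefaultTimeout = map[string]int{
	"small":    60,
	"medium":   300,
	"large":    900,
	"enormous": 3600,
}

// validateSize accepts only exact keys of the map.
func validateSize(size string) error {
	if _, ok := testSizeToDefaultTimeout[size]; !ok {
		return fmt.Errorf("unknown test size %q", size)
	}
	return nil
}

// regexAccepts reproduces the original unanchored pattern check.
func regexAccepts(size string) bool {
	matched, _ := regexp.MatchString(`\b(?:small|medium|large|enormous)\b`, size)
	return matched
}

func main() {
	// "small-ish" contains the word "small", so the regexp accepts it,
	// while the map lookup rejects it.
	fmt.Println(regexAccepts("small-ish"), validateSize("small-ish") != nil)
}
```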
57838d5
to
f2ad53a
Compare
Should this PR close that issue? This is helping us debug bazel timeouts (which is great), but TC timeouts would still produce no stack trace. Or am I missing something?
Assuming TC timeouts are configured to be longer than the underlying
f2ad53a
to
24d930f
Compare
This is true; I think the issue may be that we need to update our TC timeout then. Every couple of days I see a build that times out running go tests (e.g., https://teamcity.cockroachdb.com/viewLog.html?buildId=6144348&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_BazelUnitTests)
Hmm, that is a pretty odd case. 2 hours is very long for the unit test job, I'm not sure why it would have stalled to such an extent. We can look at bumping the timeout for this job, but also note I'll be increasing the size of our TeamCity agents to make these jobs faster.
So I rebased on Friday and still got the same error. Rebased today and interestingly I now get
While it is odd, it's happening somewhat frequently. Here are two more cases that came to my attention just this month (that I could remember off the top of my head, it's likely there were more):
Makes sense, hopefully that will drastically reduce these failures 👍
9727e5c
to
68160c3
Compare
@sjbarag Ahmad is consistently seeing this failure to build the web UI tests in GitHub CI (NOT Bazel CI). Is this error meaningful at all to you? We are pretty perplexed because this PR doesn't touch any UI or non-Bazel stuff at all.
@rickystewart There's basically no reason for Perhaps this was a fluke?
Wait I am confused, why does
Previously, bazel would kill the test process if it went beyond the timeout duration set for its size. This prevented us from knowing which tests timed out and also prevented us from getting their stack traces. This patch causes the `go test` process to time out before bazel kills it, so that we can tell which test timed out and get its stack trace. Closes cockroachdb#78185 Release justification: Non-production code changes Release note: None
99430c8
to
26192d2
Compare
@sjbarag just pointing out that it's still failing, in case you have any other thoughts
bors r+
Build succeeded:
@healthy-pod thanks for your work on this! When I'm running logic tests locally through
In cockroachdb#86363, we added a test.timeout arg to all go_test targets. When generate-logictest runs out of `bazel-generate.sh`, logic tests build files are re-created based on the template that was missing the `test.timeout` arg. This patch updates the template to include a test timeout. Release justification: Non-production code changes Release note: None
Thanks for catching this, it could have been easily missed until a test times out without a stack trace in TeamCity. It should be fixed by #87299
@@ -319,6 +319,7 @@ go_test(
"txn_recovery_integration_test.go",
"txn_wait_queue_test.go",
],
args = ["-test.timeout=3595s"],
@healthy-pod I am curious how these timeout values were derived?
Bazel sets the test timeout based on the test size. If the test size is not explicitly set then it will be `medium`. The default mappings are:
small: 60s timeout
medium: 300s timeout
large: 900s timeout
enormous: 3600s timeout
We let the `go test` process time out 5 seconds before bazel kills it, and that's how the quoted `3595s` (enormous) was derived.
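The derivation described above can be written out as a short sketch: take Bazel's default timeout for the target's size, subtract the 5-second margin, and format the result as a `-test.timeout` argument. The function name `goTestTimeoutArg`, the map name, and the constant name are hypothetical; the numbers are the ones given in this thread.

```go
package main

import "fmt"

// bazelDefaultTimeoutSeconds holds Bazel's default timeout per test size.
var bazelDefaultTimeoutSeconds = map[string]int{
	"small":    60,
	"medium":   300,
	"large":    900,
	"enormous": 3600,
}

// marginSeconds is how much earlier than Bazel the go test process times out.
const marginSeconds = 5

// goTestTimeoutArg returns the -test.timeout argument for a target of the
// given size. An empty size is treated as "medium", Bazel's default.
func goTestTimeoutArg(size string) (string, error) {
	if size == "" {
		size = "medium"
	}
	bazelTimeout, ok := bazelDefaultTimeoutSeconds[size]
	if !ok {
		return "", fmt.Errorf("unknown test size %q", size)
	}
	return fmt.Sprintf("-test.timeout=%ds", bazelTimeout-marginSeconds), nil
}

func main() {
	arg, err := goTestTimeoutArg("enormous")
	fmt.Println(arg, err)
}
```

Because the Go timeout fires first, the test binary itself reports the timeout and dumps goroutine stacks before Bazel kills the process.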
87299: generate-logictest: add test timeout to BUILD files template r=rickystewart a=healthy-pod In #86363, we added a test.timeout arg to all go_test targets. When generate-logictest runs out of `bazel-generate.sh`, logic tests build files are re-created based on the template that was missing the `test.timeout` arg and since the timeout script doesn't run, the timeouts get deleted. This patch updates the template to include a test timeout. Release justification: Non-production code changes Release note: None Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
In cockroachdb#86363, we added a timeout to tests at the test binary level. Tests running with `--config=race` however use a custom timeout, different from the original default values set by bazel based on the test size. This patch propagates those custom values to testrace in CI. Release justification: Non-production code changes Release note: None
86475: cli: support `COCKROACH_REDACTION_POLICY_MANAGED` env var r=knz a=abarganier Currently, log redaction policies have no way to discern their own runtime environment. Logged objects that may be considered sensitive and unsafe in on-prem deployments of CockroachDB might be otherwise safe when we're running within a managed service such as Cockroach Cloud. For example, CLI argument lists included as part of the `cockroach start` command are already known to those operating the managed service, so there's no reason we should be redacting this information from logs in this case. This patch adds the `--managed` flag to the start commands. This flag is plumbed through to the global logging config object where the log package has access to it. We also introduce `log.SafeManaged(s interface{})`, which conditionally marks an object with `redact.Safe()` depending on whether or not we are running as a managed service. This is inspired by the original `log.SafeOperational(s interface{})` function. I believe that this new `--managed` flag should not be advertised in our public documentation, as its intended use is for those running Cockroach Cloud. Release justification: low-risk, high benefit changes to existing functionality. The new CLI flag has a minimal impact on DB operations and provides high value reduction of log redaction, which will be necessary for support staff with our latest compliance requirements. Release note (cli change): `cockroach start` commands now have an additional `--managed` flag that can be used to indicate whether or not the node is running as part of a managed service (e.g. Cockroach Cloud). Perhaps this shouldn't be advertised in our public facing docs, as its only intended for use by those running Cockroach Cloud and not for on-prem deployments. 
Addresses #86316 86774: sql/schemachanger: version gate element creation r=Xiang-Gu a=ajwerner Commit 1: fix minSupportedVersion of `ADD COLUMN` in new schema changer from v22.1 to v22.2 Commit 2: We cannot create elements the old version of the code does not know about. Release justification: fixed mixed version incompatibility Release note: None 87317: sql: improve and clean up tracing a bit r=yuzefovich a=yuzefovich **tracing: omit distsql ids from SHOW TRACE** This commit removes the custom handling of tracing tags with `cockroach.` prefix when populating the output of SHOW TRACE. Previously, all tags with this prefix would be included into the "start span" message, possibly taking up multiple lines in the SHOW TRACE output. However, there is only one user of those tags - ids of different components of DistSQL infrastructure, and I don't think it's helpful to have those ids in the output at all, so this commit removes this ability and makes the "start span" message nicer. This special handling was introduced four years ago in 60978aa and at that time there might have been a reason to have some special handling of these tags (so that they become visible when viewing the jaeger trace), but that is not necessary anymore (I believe because we now always propagate all tags across nodes). Release justification: low-risk cleanup. Release note: None **execinfra: clean up ProcessorBase a bit** This commit performs the following cleanup: - it removes the redundant `InternalClose` implementations. At some point last year an "extended" version was introduced to take in a closure to be called when the processor is being closed. There is only one user for that, and it can itself do the necessary cleanup before calling `InternalClose` - it removes the update to `rowIdx` of `ProcOutputHelper` (which tracks how many rows the helper has emitted) when the processor is closed. 
The idea behind this was to protect from the future calls to `Next` method so that the helper doesn't emit more rows once it is closed, but it is not allowed by the interface anyway - once the processor is closed, no new calls to `Next` are allowed, so this protection was meaningless. However, what prompted me to look into this was the fact that the `rowIdx` field was being set to `MaxInt64` which would trip up the stats collection change in the following commit. Release justification: low-risk cleanup. Release note: None **sql: improve tracing of some things** This commit makes it so that we create a tracing span for all processors. Previously, out of performance considerations, we elided the spans for the columnarizer, materializer, planNodeToRowSource, and flowCoordinator, but given the improvements to tracing in the last year or so it doesn't seem necessary to do that anymore. In particular so given that we don't create tracing spans by default any way, only when the tracing is enabled for the statement. Additionally, this commit adds a couple of tags to the tracing span of the vectorized outbox (similar to what we have in the row-by-row engine). Release justification: low-risk improvement. Release note: None 87468: clusterversion: require env var to do poison dev upgrades r=dt a=dt Previously the offsetting of all in-development versions ensured that upgrading to one of these would mark the cluster as untrusted, dev-version-only, however the fact we did not offset already released versions meant that one could perform such an upgrade easily, by simply starting a dev binary in a stable release data directory, as upgrades happen by default automatically. This could lead to an inadvertent and irreversible conversion of a cluster to dev versions. This changes the behavior to default to offsetting _all_ versions, not just the the new ones, which has the effect of also offset the version _from which_ a binary is willing to upgrade. 
This significantly reduces the risk of inadvertently upgrading a cluster to a dev version, as by default, the dev version will refuse to start in a release-version's data directory. In some cases however it is useful to start a custom or development build in an existing data directory, e.g. a snapshot collected from production. For these cases, the env var COCKROACH_UPGRADE_TO_DEV_VERSION can be used to only offset the second defined version and above, meaning that the first version, which is typically the minBinaryVersion, is left alone, and that binary thus considers itself backwards compatible with that older release version and will thus be willing to start in / join that existing cluster. Release note: none. Release justification: bug fix in new functionality. 87474: ci: pass custom timeout to testrace in CI r=rickystewart a=healthy-pod In #86363, we added a timeout to tests at the test binary level. Tests running with `--config=race` however use a custom timeout, different from the original default values set by bazel based on the test size. This patch propagates those custom values to testrace in CI. Release justification: Non-production code changes Release note: None Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com> Co-authored-by: Xiang Gu <xiang@cockroachlabs.com> Co-authored-by: Andrew Werner <awerner32@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com> Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
We manage unit tests timeouts at two levels: 1. Bazel timeout, by default [60s,300s,900s,3600s] for [small,medium,large,enormous] targets. 2. Go timeout, set to 5 seconds less than the corresponding Bazel timeout [see cockroachdb#86363]. Previously, unit tests used the same timeouts both when running in `Bazel Essential CI` and elsewhere. As a result, enormous test targets inherited a timeout of 1 hour from Bazel's default timeout. This is way beyond the expected time needed by any test target in `Bazel Essential CI`. We can't change enormous targets to large ones for two reasons: 1. `Enormous` is also used to indicate the resources needed by a test target. 2. `Enormous` test targets may still need the large timeout when running locally. To make this possible, we needed to support setting an `attr` value to a `select` using Buildozer. This was done in bazelbuild/buildtools#1153. This change only affects the timeout of `enormous` test targets. It however makes it simple to customize the timeout of other test sizes if desired in the future. Release note: None Epic: none
102719: *: customize the timeouts used by unit tests in `Bazel Essential CI` r=rickystewart a=healthy-pod We manage unit tests timeouts at two levels: 1. Bazel timeout, by default [60s,300s,900s,3600s] for [small,medium,large,enormous] targets. 2. Go timeout, set to 5 seconds less than the corresponding Bazel timeout [see #86363]. Previously, unit tests used the same timeouts both when running in `Bazel Essential CI` and elsewhere. As a result, enormous test targets inherited a timeout of 1 hour from Bazel's default timeout. This is way beyond the expected time needed by any test target in `Bazel Essential CI`. We can't change enormous targets to large ones for two reasons: 1. `Enormous` is also used to indicate the resources needed by a test target. 2. `Enormous` test targets may still need the large timeout when running locally. To make this possible, we needed to support setting an `attr` value to a `select` using Buildozer. This was done in bazelbuild/buildtools#1153. This change only affects the timeout of `enormous` test targets. It however makes it simple to customize the timeout of other test sizes if desired in the future. Release note: None Epic: none Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
Previously, bazel would kill the test process if it went
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.
This patch causes the `go test` process to time out before
bazel kills it, so that we can tell which test timed out and
get its stack trace.
Closes #78185
Release justification: Non-production code changes
Release note: None