Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bazel: make go test process timeout before bazel kills it #86363

Merged
merged 1 commit into from
Aug 31, 2022

Conversation

healthy-pod
Copy link
Contributor

Previously, bazel will kill the test process if it goes
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.

This patch causes the go test process to timeout before
bazel kills it to allow us to know which test timed out and
get its stack trace.

Closes #78185

Release justification: Non-production code changes
Release note: None

@healthy-pod healthy-pod requested a review from a team as a code owner August 18, 2022 01:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

"strings"

"github.com/alessio/shellescape"
"github.com/cockroachdb/errors"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the standard library errors instead of cockroachdb/errors? cockroachdb/errors is a pretty heavy dependency, we don't necessarily want to force people to build it.

CONTENTS=$(bazel run //pkg/cmd/generate-test-suites --run_under="cd $PWD && ")
echo "$CONTENTS" > pkg/BUILD.bazel
fi

bazel run //pkg/cmd/generate-test-timeouts --run_under="cd $PWD && "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put this in the branch above next to generate-test-suites, or in a new branch that checks BUILD.bazel files and *.bzl files. If none of the BUILD.bazel or *.bzl files have changed from master then generate-test-timeouts doesn't need to be run.

Also we don't want to unconditionally build buildozer. Building small go executables is a non-negligible amount of latency, we don't want to force people to do this for no reason.

This is optional, but consider just merging what you currently have in generate-test-timeouts into generate-test-suites. generate-test-suites already depends on buildozer and does similar munging in the BUILD.bazel files, so the logic may just fit right in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is optional, but consider just merging what you currently have in generate-test-timeouts into generate-test-suites. generate-test-suites already depends on buildozer and does similar munging in the BUILD.bazel files, so the logic may just fit right in there.

Right I adopted part of the code from generate-test-suites but I thought it might be better to separate them because generate-test-suite outputs something to stdout which is then forwarded to pkg/BUILD.bazel. Maybe I should make it write its output to pkg/BUILD.bazel directly (ie in the binary, instead of forwarding stdout to pkg/BUILD.bazel)? Then move the timeout code to the same package and rename it to generate-test-suites-and-timeouts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should make it write its output to pkg/BUILD.bazel directly (ie in the binary, instead of forwarding stdout to pkg/BUILD.bazel)? Then move the timeout code to the same package and rename it to generate-test-suites-and-timeouts?

This approach makes sense to me. For the new name generate-test-suites-and-timeouts is not ideal since the binary actually does a few different things at this point. Maybe generate-bazel-extra or something like that?

Copy link
Contributor Author

@healthy-pod healthy-pod Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we generate timeouts when *.bzl files change? I can't find a reason to do that but I don't know enough about everything we are doing with .bzl files and I am wondering because we do update pkg/BUILD.bazel when a *.bzl file changes. My current understanding is that with timeouts we are only concerned about the test target size so if BUILD.bazel files didn't change and our binary pkg/cmd/generate-bazel-extra didn't change then we are good?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm not aware of any timeouts that would need to be updated when .bzl files change. But I think you do need to run generate-test-suites (especially the generation of pkg/BUILD.bazel) if .bzl files change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but I am thinking of doing this:

if ! (files_unchanged_from_upstream $(find_relevant ./pkg -name BUILD.bazel) $(find_relevant ./pkg/cmd/generate-bazel-extra -name BUILD.bazel -or -name '*.go')); then
  bazel build @com_github_bazelbuild_buildtools//buildozer:buildozer
  bazel run //pkg/cmd/generate-bazel-extra --run_under="cd $PWD && " -- -gen_test_suites -gen_tests_timeouts
elif files_unchanged_from_upstream $(find_relevant ./pkg -name '*.bzl'); then
  echo "Skipping //pkg/cmd/generate-bazel-extra (relevant files are unchanged from upstream)."
else
  echo "Skipping `generate tests timeouts` from //pkg/cmd/generate-bazel-extra (relevant files are unchanged from upstream)."
  bazel build @com_github_bazelbuild_buildtools//buildozer:buildozer
  bazel run //pkg/cmd/generate-bazel-extra --run_under="cd $PWD && " -- -gen_test_suites

so we can skip generating timeouts when we don't need to

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sensible to me.

"large": 900,
"enormous": 3600,
}
for _, size := range []string{"small", "medium", "large", "enormous"} {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for size, timeout := range testSizeToDefaultTimeout

@healthy-pod healthy-pod force-pushed the improve-ci-obs branch 3 times, most recently from 5d18b47 to 87b0f0d Compare August 18, 2022 22:07
@healthy-pod
Copy link
Contributor Author

So this is ready for a second pass now other than GitHub CI failing. I tried to figure out what is wrong and it looks like it fails here but I wasn't able to understand how it ended in this state. Can you take a look at it when you review?

Copy link
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the GitHub CI failure, that's pretty odd. I'm not seeing anything in your change that would cause the failure. Maybe try rebasing and re-running CI to see if it was intermittent?

@@ -38,7 +41,32 @@ func runBuildozer(args []string) {
}
}

func main() {
func getTestTargets(testTargetSize string) ([]string, error) {
match, _ := regexp.MatchString("\\b(?:small|medium|large|enormous)\\b", testTargetSize)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Using regexp for this is pretty odd. I'd say factor out testSizeToDefaultTimeout so it is a global variable, and then you can do

if _, ok := testSizeToDefaultTimeout[testTargetSize]; !ok {
    ....
}

@healthy-pod healthy-pod force-pushed the improve-ci-obs branch 2 times, most recently from 57838d5 to f2ad53a Compare August 19, 2022 23:56
@renatolabs
Copy link
Contributor

Closes #78185

Should this PR close that issue?

This is helping us debug bazel timeouts (which is great), but TC timeouts would still produce no stack trace. Or am I missing something?

@rickystewart
Copy link
Collaborator

This is helping us debug bazel timeouts (which is great), but TC timeouts would still produce no stack trace. Or am I missing something?

Assuming TC timeouts are configured to be longer than the underlying bazel test timeout, there's no problem, right?

@renatolabs
Copy link
Contributor

Assuming TC timeouts are configured to be longer than the underlying bazel test timeout, there's no problem, right?

This is true; I think the issue may be that we need to update our TC timeout then. Every couple of days I see a build that times out running go tests (e.g., https://teamcity.cockroachdb.com/viewLog.html?buildId=6144348&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_BazelUnitTests)

@rickystewart
Copy link
Collaborator

Assuming TC timeouts are configured to be longer than the underlying bazel test timeout, there's no problem, right?

This is true; I think the issue may be that we need to update our TC timeout then. Every couple of days I see a build that times out running go tests (e.g., teamcity.cockroachdb.com/viewLog.html?buildId=6144348&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_BazelUnitTests)

Hmm, that is a pretty odd case. 2 hours is very long for the unit test job, I'm not sure why it would have stalled to such an extent.

We can look at bumping the timeout for this job, but also note I'll be increasing the size of our TeamCity agents to make these jobs faster.

@healthy-pod
Copy link
Contributor Author

For the GitHub CI failure, that's pretty odd. I'm not seeing anything in your change that would cause the failure. Maybe try rebasing and re-running CI to see if it was intermittent?

So I rebased on Friday and still got the same error. Rebased today and interestingly I now get /bin/sh: 1: npm-run-all: not found instead of tsc not found.

@renatolabs
Copy link
Contributor

Hmm, that is a pretty odd case.

While it is odd, it's happening somewhat frequently. Here are two more cases that came to my attention just this month (that I could remember off the top of my head, it's likely there were more):

I'll be increasing the size of our TeamCity agents to make these jobs faster.

Makes sense, hopefully that will drastically reduce these failures 👍

@healthy-pod healthy-pod force-pushed the improve-ci-obs branch 7 times, most recently from 9727e5c to 68160c3 Compare August 23, 2022 02:12
@rickystewart
Copy link
Collaborator

@sjbarag Ahmad is consistently seeing this failure to build the web UI tests in GitHub CI (NOT Bazel CI). Is this error meaningful at all to you? We are pretty perplexed because this PR doesn't touch any UI or non-Bazel stuff at all.

make: Entering directory '/go/src/github.com/cockroachdb/cockroach/pkg/ui'
[02:27:01 ]  make[1]: Entering directory '/go/src/github.com/cockroachdb/cockroach'
[02:27:01 ]  [WARNING] Makefile is deprecated and will be removed shortly.
[02:27:01 ] Running make with -j16
[02:27:01 ]  GOPATH set to /go
[02:27:01 ]  Warning: 'bazel' not found (`brew install bazelisk` for macs)
[02:27:01 ]  build/node-run.sh -C pkg/ui/workspaces/eslint-plugin-crdb yarn build
[02:27:01 ]  build/node-run.sh -C pkg/ui/workspaces/cluster-ui yarn build
[02:27:02 ]  yarn run v1.22.19
[02:27:02 ]  yarn run v1.22.19
[02:27:02 ]    $ tsc
[02:27:02 ]  $ npm-run-all -p build:typescript build:bundle
[02:27:02 ]   /bin/sh: 1: tsc: not found
[02:27:02 ]  error Command failed with exit code 127.
[02:27:02 ] info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
[02:27:02 ]  make[1]: *** [Makefile:364: pkg/ui/workspaces/eslint-plugin-crdb/dist/index.js] Error 127
[02:27:02 ]  make[1]: *** Waiting for unfinished jobs....

@sjbarag
Copy link
Contributor

sjbarag commented Aug 30, 2022

@sjbarag Ahmad is consistently seeing this failure to build the web UI tests in GitHub CI (NOT Bazel CI). Is this error meaningful at all to you? We are pretty perplexed because this PR doesn't touch any UI or non-Bazel stuff at all.

make: Entering directory '/go/src/github.com/cockroachdb/cockroach/pkg/ui'
[02:27:01 ]  make[1]: Entering directory '/go/src/github.com/cockroachdb/cockroach'
[02:27:01 ]  [WARNING] Makefile is deprecated and will be removed shortly.
[02:27:01 ] Running make with -j16
[02:27:01 ]  GOPATH set to /go
[02:27:01 ]  Warning: 'bazel' not found (`brew install bazelisk` for macs)
[02:27:01 ]  build/node-run.sh -C pkg/ui/workspaces/eslint-plugin-crdb yarn build
[02:27:01 ]  build/node-run.sh -C pkg/ui/workspaces/cluster-ui yarn build
[02:27:02 ]  yarn run v1.22.19
[02:27:02 ]  yarn run v1.22.19
[02:27:02 ]    $ tsc
[02:27:02 ]  $ npm-run-all -p build:typescript build:bundle
[02:27:02 ]   /bin/sh: 1: tsc: not found
[02:27:02 ]  error Command failed with exit code 127.
[02:27:02 ] info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
[02:27:02 ]  make[1]: *** [Makefile:364: pkg/ui/workspaces/eslint-plugin-crdb/dist/index.js] Error 127
[02:27:02 ]  make[1]: *** Waiting for unfinished jobs....

@rickystewart There's basically no reason for tsc to not be found — yarn run adds the node_modules/.bin/ directory to the $PATH before executing the script in package.json, and that's where tsc is supposed to be. Interestingly, the most recent commit seems to have passed the "Github CI" job, and is only fialing on some Go tests: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/6244028?buildTab=dependencies&mode=list

Perhaps this was a fluke?

@healthy-pod
Copy link
Contributor Author

Wait I am confused, why does bors say "Build succeeded" and GitHub CI branch is staging? I am pretty sure I didn't bors this code change 🤔

Previously, bazel will kill the test process if it goes
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.

This patch causes the `go test` process to timeout before
bazel kills it to allow us to know which test timed out and
get its stack trace.

Closes cockroachdb#78185

Release justification: Non-production code changes
Release note: None
@healthy-pod
Copy link
Contributor Author

@sjbarag just pointing that it's still failing if you have any other thoughts

@rickystewart
Copy link
Collaborator

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 31, 2022

Build succeeded:

@craig craig bot merged commit b272795 into cockroachdb:master Aug 31, 2022
@yuzefovich
Copy link
Member

yuzefovich commented Sep 1, 2022

@healthy-pod thanks for your work on this!

When I'm running logic tests locally through dev, the generation of the logic tests removes the test timeout arg:

diff --git a/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel b/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
index 9297caca81..e79fb00dd8 100644
--- a/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
+++ b/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
@@ -5,7 +5,6 @@ go_test(
     name = "3node-tenant_test",
     size = "enormous",
     srcs = ["generated_test.go"],
-    args = ["-test.timeout=3595s"],
     data = [
         "//c-deps:libgeos",  # keep
         "//pkg/ccl/logictestccl:testdata",  # keep
...

healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Sep 1, 2022
In cockroachdb#86363, we added a test.timeout arg to all go_test targets.
When generate-logictest runs out of `bazel-generate.sh`,
logic tests build files are re-created based on the template
that was missing the `test.timeout` arg. This patch updates
the template to include a test timeout.

Release justification: Non-production code changes
Release note: None
@healthy-pod
Copy link
Contributor Author

@healthy-pod thanks for your work on this!

When I'm running logic tests locally through dev, the generation of the logic tests removes the test timeout arg:

diff --git a/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel b/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
index 9297caca81..e79fb00dd8 100644
--- a/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
+++ b/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel
@@ -5,7 +5,6 @@ go_test(
     name = "3node-tenant_test",
     size = "enormous",
     srcs = ["generated_test.go"],
-    args = ["-test.timeout=3595s"],
     data = [
         "//c-deps:libgeos",  # keep
         "//pkg/ccl/logictestccl:testdata",  # keep
...

Thanks for catching this, it could have been easily missed until a test times out without a stack trace in teamcity. It should be fixed by #87299

healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Sep 1, 2022
In cockroachdb#86363, we added a test.timeout arg to all go_test targets.
When generate-logictest runs out of `bazel-generate.sh`,
logic tests build files are re-created based on the template
that was missing the `test.timeout` arg. This patch updates
the template to include a test timeout.

Release justification: Non-production code changes
Release note: None
@@ -319,6 +319,7 @@ go_test(
"txn_recovery_integration_test.go",
"txn_wait_queue_test.go",
],
args = ["-test.timeout=3595s"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@healthy-pod I am curious how these timeout values were derived?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bazel sets the test timeout based on the test size. If the test size is not explicitly set then it will be medium. The default mappings are:

small: 60s timeout
medium: 300s timeout
large: 900s timeout
enormous: 3600s timeout

We let the go test process timeout 5 seconds before bazel kills it and that's how the quoted 3595s (enormous) was derived.

craig bot pushed a commit that referenced this pull request Sep 2, 2022
87299: generate-logictest: add test timeout to BUILD files template r=rickystewart a=healthy-pod

In #86363, we added a test.timeout arg to all go_test targets.
When generate-logictest runs out of `bazel-generate.sh`,
logic tests build files are re-created based on the template
that was missing the `test.timeout` arg and since the timeout
script doesn't run, the timeouts get deleted. This patch updates
the template to include a test timeout.

Release justification: Non-production code changes
Release note: None

Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Sep 7, 2022
In cockroachdb#86363, we added a timeout to tests at the test binary level.
Tests running with `--config=race` however use a custom
timeout, different from the original default values set by bazel
based on the test size.

This patch propagates those custom values to testrace in CI.

Release justification: Non-production code changes
Release note: None
craig bot pushed a commit that referenced this pull request Sep 8, 2022
86475: cli: support `COCKROACH_REDACTION_POLICY_MANAGED` env var r=knz a=abarganier

Currently, log redaction policies have no way to discern their own
runtime environment. Logged objects that may be considered sensitive
and unsafe in on-prem deployments of CockroachDB might be otherwise
safe when we're running within a managed service such as Cockroach
Cloud. For example, CLI argument lists included as part of the
`cockroach start` command are already known to those operating the
managed service, so there's no reason we should be redacting this
information from logs in this case.

This patch adds the `--managed` flag to the start commands. This
flag is plumbed through to the global logging config object where
the log package has access to it.

We also introduce `log.SafeManaged(s interface{})`, which conditionally
marks an object with `redact.Safe()` depending on whether or not we
are running as a managed service. This is inspired by the original
`log.SafeOperational(s interface{})` function.

I believe that this new `--managed` flag should not be advertised in
our public documentation, as its intended use is for those running
Cockroach Cloud.

Release justification: low-risk, high benefit changes to existing
functionality. The new CLI flag has a minimal impact on DB
operations and provides high value reduction of log redaction,
which will be necessary for support staff with our latest compliance
requirements.

Release note (cli change): `cockroach start` commands now have an
additional `--managed` flag that can be used to indicate whether
or not the node is running as part of a managed service (e.g.
Cockroach Cloud). Perhaps this shouldn't be advertised in our
public facing docs, as its only intended for use by those running
Cockroach Cloud and not for on-prem deployments.

Addresses #86316

86774: sql/schemachanger: version gate element creation r=Xiang-Gu a=ajwerner

Commit 1: fix minSupportedVersion of `ADD COLUMN` in new schema changer
from v22.1 to v22.2
Commit 2: We cannot create elements the old version of the code does not know about.

Release justification: fixed mixed version incompatibility
Release note: None

87317: sql: improve and clean up tracing a bit r=yuzefovich a=yuzefovich

**tracing: omit distsql ids from SHOW TRACE**

This commit removes the custom handling of tracing tags with
`cockroach.` prefix when populating the output of SHOW TRACE.
Previously, all tags with this prefix would be included into the "start
span" message, possibly taking up multiple lines in the SHOW TRACE
output. However, there is only one user of those tags - ids of different
components of DistSQL infrastructure, and I don't think it's helpful to
have those ids in the output at all, so this commit removes this ability
and makes the "start span" message nicer.

This special handling was introduced four years ago in
60978aa and at that time there might
have been a reason to have some special handling of these tags (so that
they become visible when viewing the jaeger trace), but that is not
necessary anymore (I believe because we now always propagate all tags
across nodes).

Release justification: low-risk cleanup.

Release note: None

**execinfra: clean up ProcessorBase a bit**

This commit performs the following cleanup:
- it removes the redundant `InternalClose` implementations. At some
point last year an "extended" version was introduced to take in
a closure to be called when the processor is being closed. There is only
one user for that, and it can itself do the necessary cleanup before
calling `InternalClose`
- it removes the update to `rowIdx` of `ProcOutputHelper` (which tracks
how many rows the helper has emitted) when the processor is closed. The
idea behind this was to protect from the future calls to `Next` method
so that the helper doesn't emit more rows once it is closed, but it is
not allowed by the interface anyway - once the processor is closed, no
new calls to `Next` are allowed, so this protection was meaningless.
However, what prompted me to look into this was the fact that the
`rowIdx` field was being set to `MaxInt64` which would trip up the stats
collection change in the following commit.

Release justification: low-risk cleanup.

Release note: None

**sql: improve tracing of some things**

This commit makes it so that we create a tracing span for all
processors. Previously, out of performance considerations, we elided the
spans for the columnarizer, materializer, planNodeToRowSource, and
flowCoordinator, but given the improvements to tracing in the last year
or so it doesn't seem necessary to do that anymore. In particular so
given that we don't create tracing spans by default any way, only when
the tracing is enabled for the statement.

Additionally, this commit adds a couple of tags to the tracing span of
the vectorized outbox (similar to what we have in the row-by-row
engine).

Release justification: low-risk improvement.

Release note: None

87468: clusterversion: require env var to do poison dev upgrades r=dt a=dt

Previously the offsetting of all in-development versions ensured that upgrading to one of these would mark the cluster as untrusted, dev-version-only, however the fact we did not offset already released versions meant that one could perform such an upgrade easily, by simply starting a dev binary in a stable release data directory, as upgrades happen by default automatically. This could lead to an inadvertent and irreversible conversion of a cluster to dev versions.

This changes the behavior to default to offsetting _all_ versions, not just the the new ones, which has the effect of also offset the version _from which_ a binary is willing to upgrade. This significantly reduces the risk of inadvertently upgrading a cluster to a dev version, as by default, the dev version will refuse to start in a release-version's data directory.

In some cases however it is useful to start a custom or development build in an existing data directory, e.g. a snapshot collected from production. For these cases, the env var COCKROACH_UPGRADE_TO_DEV_VERSION can be used to only offset the second defined version and above, meaning that the first version, which is typically the minBinaryVersion, is left alone, and that binary thus considers itself backwards compatible with that older release version and will thus be willing to start in / join that existing cluster.

Release note: none.

Release justification: bug fix in new functionality.

87474: ci: pass custom timeout to testrace in CI r=rickystewart a=healthy-pod

In #86363, we added a timeout to tests at the test binary level. Tests running with `--config=race` however use a custom timeout, different from the original default values set by bazel based on the test size.

This patch propagates those custom values to testrace in CI.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
@healthy-pod healthy-pod deleted the improve-ci-obs branch September 20, 2022 23:31
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Oct 3, 2022
In cockroachdb#86363, we added a timeout to tests at the test binary level.
Tests running with `--config=race` however use a custom
timeout, different from the original default values set by bazel
based on the test size.

This patch propagates those custom values to testrace in CI.

Release justification: Non-production code changes
Release note: None
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request May 2, 2023
We manage unit tests timeouts at two levels:
1. Bazel timeout, by default [60s,300s,900s,3600s] for [small,medium,large,enormous] targets.
2. Go timeout, set to 5 seconds less than the corresponding Bazel timeout [see cockroachdb#86363].

Previously, unit tests used the same timeouts both when running in `Bazel Essential
CI` and elsewhere. As a result, enormous test targets inherited a timeout of 1 hour
from Bazel's default timeout. This is way beyond the expected time needed by any
test target in `Bazel Essential CI`. We can't change enormous targets to large ones for
two reasons:
1. `Enormous` is also used to indicate the resources needed by a test target.
2. `Enormous` test targets may still need the large timeout when running locally.

To make this possible, we needed to support setting an `attr` value to a `select`
using Buildozer. This was done in bazelbuild/buildtools#1153.

This change only affects the timeout of `enormous` test targets. It however makes it
simple to customize the timeout of other test sizes if desired in the future.

Release note: None
Epic: none
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request May 2, 2023
We manage unit tests timeouts at two levels:
1. Bazel timeout, by default [60s,300s,900s,3600s] for [small,medium,large,enormous] targets.
2. Go timeout, set to 5 seconds less than the corresponding Bazel timeout [see cockroachdb#86363].

Previously, unit tests used the same timeouts both when running in `Bazel Essential
CI` and elsewhere. As a result, enormous test targets inherited a timeout of 1 hour
from Bazel's default timeout. This is way beyond the expected time needed by any
test target in `Bazel Essential CI`. We can't change enormous targets to large ones for
two reasons:
1. `Enormous` is also used to indicate the resources needed by a test target.
2. `Enormous` test targets may still need the large timeout when running locally.

To make this possible, we needed to support setting an `attr` value to a `select`
using Buildozer. This was done in bazelbuild/buildtools#1153.

This change only affects the timeout of `enormous` test targets. It however makes it
simple to customize the timeout of other test sizes if desired in the future.

Release note: None
Epic: none
craig bot pushed a commit that referenced this pull request May 6, 2023
102719: *: customize the timeouts used by unit tests in `Bazel Essential CI` r=rickystewart a=healthy-pod

We manage unit tests timeouts at two levels:
1. Bazel timeout, by default [60s,300s,900s,3600s] for [small,medium,large,enormous] targets.
2. Go timeout, set to 5 seconds less than the corresponding Bazel timeout [see #86363].

Previously, unit tests used the same timeouts both when running in `Bazel Essential CI` and elsewhere. As a result, enormous test targets inherited a timeout of 1 hour from Bazel's default timeout. This is way beyond the expected time needed by any test target in `Bazel Essential CI`. We can't change enormous targets to large ones for two reasons:
1. `Enormous` is also used to indicate the resources needed by a test target.
2. `Enormous` test targets may still need the large timeout when running locally.

To make this possible, we needed to support setting an `attr` value to a `select` using Buildozer. This was done in bazelbuild/buildtools#1153.

This change only affects the timeout of `enormous` test targets. It however makes it simple to customize the timeout of other test sizes if desired in the future.

Release note: None
Epic: none

Co-authored-by: healthy-pod <ahmad@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

logictest: improve observability of TeamCity timeouts
7 participants