Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test failure in staging/src/k8s.io/client-go/util/workqueue #125581

Open
tatsuhiro-t opened this issue Jun 19, 2024 · 11 comments · May be fixed by kubernetes/utils#312
Open

Flaky test failure in staging/src/k8s.io/client-go/util/workqueue #125581

tatsuhiro-t opened this issue Jun 19, 2024 · 11 comments · May be fixed by kubernetes/utils#312
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tatsuhiro-t
Copy link
Contributor

tatsuhiro-t commented Jun 19, 2024

What happened?

$ go test -run TestAddTwoFireEarly -race -count=10000 ./staging/src/k8s.io/client-go/util/workqueue/...
--- FAIL: TestAddTwoFireEarly (10.01s)
    delaying_queue_test.go:160: unexpected err: timed out waiting for the condition
FAIL
FAIL	k8s.io/client-go/util/workqueue	100.405s
FAIL

What did you expect to happen?

Tests always succeed

How can we reproduce it (as minimally and precisely as possible)?

Run

go test -run TestAddTwoFireEarly -race -count=10000 ./staging/src/k8s.io/client-go/util/workqueue/...

Repeat until it fails.

Anything else we need to know?

If FakeClock.Step is called after https://github.com/kubernetes/client-go/blob/d2f5fba1f82d119e09288c70ce87bccb3dc119e4/util/workqueue/delaying_queue.go#L279 and before https://github.com/kubernetes/client-go/blob/d2f5fba1f82d119e09288c70ce87bccb3dc119e4/util/workqueue/delaying_queue.go#L300,
the created timer fires at entry.readyAt.Sub(now) + FakeClock.time that has been adjusted by FakeClock.Step.
Because we do not call FakeClock.Step with the positive duration anymore, this timer never fire. Note now is the FakeClock.time before it is adjusted by FakeClock.Step.

The proposed fix is as follows.
First add NewDeadlineTimer(t *time.Time) Timer to k8s.io/utils/clock.Clock, which fires at the given t.
Second, replace NewTimer with NewDeadlineTimer at https://github.com/kubernetes/client-go/blob/d2f5fba1f82d119e09288c70ce87bccb3dc119e4/util/workqueue/delaying_queue.go#L300
Finally, call FakeClock.Step(0) at https://github.com/kubernetes/client-go/blob/d2f5fba1f82d119e09288c70ce87bccb3dc119e4/util/workqueue/delaying_queue_test.go#L234 so that FakeClock deals with waiters.

Kubernetes version

```console $ kubectl version # paste output here ```

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@tatsuhiro-t tatsuhiro-t added the kind/bug Categorizes issue or PR as related to a bug. label Jun 19, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 19, 2024
@tatsuhiro-t
Copy link
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 19, 2024
@mauri870
Copy link
Member

mauri870 commented Jun 19, 2024

Is that reproducible for you with x/tools/stress?

$ go test -c -race ./staging/src/k8s.io/client-go/util/workqueue/...
$ go run golang.org/x/tools/cmd/stress@latest -p 256  ./workqueue.test -test.v -test.run ^TestAddTwoFireEarly$

Edit: I got it after a couple thousand runs

5s: 512 runs so far, 0 failures
10s: 1152 runs so far, 0 failures
15s: 1789 runs so far, 0 failures
20s: 2424 runs so far, 0 failures

/tmp/go-stress-20240619T095442-4267781378
=== RUN   TestAddTwoFireEarly
    delaying_queue_test.go:160: unexpected err: timed out waiting for the condition
--- FAIL: TestAddTwoFireEarly (10.00s)
FAIL


ERROR: exit status 1

25s: 3063 runs so far, 1 failures (0.03%)

@mauri870
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 19, 2024
@mauri870
Copy link
Member

@tatsuhiro-t Your fix makes sense, are you planning on sending a PR? Thanks.

@mauri870
Copy link
Member

/assign

@mauri870
Copy link
Member

Just preassigning, if you want to work on it let me know.

@mauri870
Copy link
Member

/kind flake

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 19, 2024
@mauri870
Copy link
Member

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Jun 19, 2024
@tatsuhiro-t
Copy link
Contributor Author

Yes, let me work on this issue.

I have not tried x/tools/stress. Lots of trials are needed to reproduce it.

@mauri870
Copy link
Member

Thanks. I don't think -race is required to reproduce the issue. The race instrumentation might slow the runtime down just enough for this to reproduce more easier.

I had better luck reproducing with stress and a higher number of parallel processes:

go test -c ./staging/src/k8s.io/client-go/util/workqueue/...
go run golang.org/x/tools/cmd/stress@latest -p 1024 -failure unexpected ./workqueue.test -test.v -test.run ^TestAddTwoFireEarly$

/assign tatsuhiro-t

@tatsuhiro-t
Copy link
Contributor Author

Yeah, I used -race to save the number of runs and time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants