🐛 Quiet context.Canceled errors during shutdown #2745
Conversation
Hi @cbandy. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Could you add a test, please?
/ok-to-test
@alvaroaleman Thanks for taking a look. Test added in an amended commit.

/retest
pkg/manager/manager_test.go (Outdated)

    logs := []string{}
    options.Logger = funcr.NewJSON(func(object string) {
        logs = append(logs, object)
This is a data race, which is why the tests are failing. You need to protect logs with a mutex or some such.
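A minimal sketch of the suggested fix, assuming the test keeps the funcr.NewJSON sink from the excerpt above: every access to the shared slice, writes and reads alike, takes the same sync.Mutex. The surrounding program and names (mu, logs) are illustrative, not the actual patch:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/go-logr/logr/funcr"
)

func main() {
	var (
		mu   sync.Mutex
		logs []string
	)
	// Every write to the shared slice happens under the lock.
	logger := funcr.NewJSON(func(object string) {
		mu.Lock()
		defer mu.Unlock()
		logs = append(logs, object)
	}, funcr.Options{})

	// Simulate concurrent runnables logging during shutdown.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			logger.Info("stopping", "worker", i)
		}(i)
	}
	wg.Wait()

	// Reads must take the same lock as the writes.
	mu.Lock()
	defer mu.Unlock()
	fmt.Println(len(logs), "log lines captured")
}
```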
A mutex here solved a race among append calls, but it revealed a race between this line and the assertion on L1091 during the LeaderElection suite. My guess is that this goroutine is to blame:
controller-runtime/pkg/manager/internal.go, lines 423 to 427 in 2136860:

    // Start the leader election and all required runnables.
    {
        ctx, cancel := context.WithCancel(context.Background())
        cm.leaderElectionCancel = cancel
        go func() {
😞 I added a time.Sleep for that suite for now.
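An alternative to a fixed sleep, assuming the suite uses Gomega as controller-runtime's tests do, is to poll the captured logs under the same lock until the expected entry appears. The names and the matched substring below are placeholders:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	. "github.com/onsi/gomega"
)

func main() {
	g := NewGomega(func(message string, _ ...int) { fmt.Println(message) })

	var (
		mu   sync.Mutex
		logs []string
	)
	// Stand-in for the manager goroutine that logs during shutdown.
	go func() {
		time.Sleep(50 * time.Millisecond)
		mu.Lock()
		defer mu.Unlock()
		logs = append(logs, "error received after stop sequence was engaged")
	}()

	// Poll under the lock instead of sleeping a fixed amount.
	g.Eventually(func() []string {
		mu.Lock()
		defer mu.Unlock()
		return append([]string{}, logs...) // copy while holding the lock
	}).Should(ContainElement(ContainSubstring("error received after stop")))

	fmt.Println("expected log line observed")
}
```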
So the idea is that currently, with leader election, engageStopProcedure is still writing to the logger while the test is asserting on the logs?
My guess would be that it's because the RunnableFunc is not a LeaderElectionRunnable, so it won't wait for leader election before starting.
Okay, reproduced locally. ^^ doesn't make sense, though, as everything that doesn't implement LeaderElectionRunnable is treated as if it needs leader election.
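For context, LeaderElectionRunnable is the real interface from sigs.k8s.io/controller-runtime/pkg/manager; the rest of this sketch is an illustrative stand-in for how the manager groups runnables, not the verbatim code:

```go
package main

import (
	"context"
	"fmt"
)

// LeaderElectionRunnable matches the real interface in
// sigs.k8s.io/controller-runtime/pkg/manager.
type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

type Runnable interface {
	Start(ctx context.Context) error
}

type runnableFunc func(context.Context) error

func (f runnableFunc) Start(ctx context.Context) error { return f(ctx) }

// group is a stand-in for the manager's internal runnable groups.
type group struct{ members []Runnable }

func (g *group) add(r Runnable) { g.members = append(g.members, r) }

// add mirrors the dispatch described above: a runnable that does NOT
// implement LeaderElectionRunnable (or that returns true from it) is
// held until leader election completes.
func add(r Runnable, others, leaderElection *group) {
	if le, ok := r.(LeaderElectionRunnable); ok && !le.NeedLeaderElection() {
		others.add(r) // started right away
		return
	}
	leaderElection.add(r) // started only after the lease is won
}

func main() {
	var others, leaderElection group
	// A bare function value has no NeedLeaderElection method, so it
	// lands in the leader-election group, matching the observation above.
	add(runnableFunc(func(context.Context) error { return nil }), &others, &leaderElection)
	fmt.Println(len(others.members), len(leaderElection.members)) // prints: 0 1
}
```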
@cbandy @alvaroaleman If that makes sense to you both, let's add a godoc comment before the sleep and I think we're good
🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all? Should engageStopProcedure wait for the errChan-draining goroutine to finish?
> 🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all?

Not sure. I think the goal of this case err := <-cm.errChan: in engageStopProcedure is mostly to avoid running into deadlocks, because the errChan otherwise blocks.

> Should engageStopProcedure wait for the errChan-draining goroutine to finish?

With the way it currently works it can't, because this defer (defer close(stopComplete)) in Start is called too late.
I guess we could make the Start func wait for the errChan-draining goroutine to finish as well. It would be a bit nicer, but I'm not sure I want to introduce even more complexity into all of this. It's already pretty hard to reason about.
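A self-contained sketch of that idea, with illustrative names (errChan, stopComplete, drainDone) standing in for the manager's internals: Start would block on a done channel that the draining goroutine closes when it exits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func main() {
	errChan := make(chan error)
	stopComplete := make(chan struct{})
	drainDone := make(chan struct{})

	// Drain runnable errors until the stop sequence completes, then signal.
	go func() {
		defer close(drainDone)
		for {
			select {
			case err := <-errChan:
				if err != nil && !errors.Is(err, context.Canceled) {
					fmt.Println("error received after stop sequence was engaged:", err)
				}
			case <-stopComplete:
				return
			}
		}
	}()

	// Simulate runnables returning during shutdown.
	go func() {
		errChan <- context.Canceled           // quieted by this PR's check
		errChan <- errors.New("real failure") // still reported
		time.Sleep(10 * time.Millisecond)
		close(stopComplete)
	}()

	<-drainDone // Start would block here until draining has finished.
	fmt.Println("shutdown complete")
}
```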
    - case err, ok := <-cm.errChan:
    -     if ok {
    + case err := <-cm.errChan:
    +     if !errors.Is(err, context.Canceled) {
Why not keep checking the ok value here?
I don't see any calls to close errChan, so ok is always true (today). Also, in my experience, reading from a closed channel in select is for some kind of control flow (break or return or so), or the channel is assigned nil so that it is not selected again.
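For reference, a small self-contained example of that idiom: once a receive reports the channel closed (ok == false), the variable is set to nil so its case can never fire again. All names are illustrative.

```go
package main

import "fmt"

func main() {
	a := make(chan int)
	b := make(chan int)

	go func() { a <- 1; close(a) }()
	go func() { b <- 2; close(b) }()

	// Loop until both channels have been drained and nil-ed out.
	for a != nil || b != nil {
		select {
		case v, ok := <-a:
			if !ok {
				a = nil // a closed channel would otherwise be selected forever
				continue
			}
			fmt.Println("a:", v)
		case v, ok := <-b:
			if !ok {
				b = nil
				continue
			}
			fmt.Println("b:", v)
		}
	}
}
```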
> I don't see any calls to close errChan, so ok is always true (today).

Sounds okay to me. I guess if we ever start closing the channel, we'll have to check all usages of the channel anyway.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
Runnable implementations that return ctx.Err() cause a spurious "error received after stop" log message.
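The failure mode in one self-contained sketch: manager.RunnableFunc is the real controller-runtime type, while the channel plumbing here is illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// A Runnable of the shape the commit message describes: it blocks
	// until its context is canceled and then returns ctx.Err().
	r := manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err() // context.Canceled on a normal shutdown
	})

	ctx, cancel := context.WithCancel(context.Background())
	errChan := make(chan error, 1)
	go func() { errChan <- r.Start(ctx) }()

	time.Sleep(10 * time.Millisecond)
	cancel()

	// With this PR, the manager treats context.Canceled as a clean stop
	// rather than logging "error received after stop sequence was engaged".
	if err := <-errChan; errors.Is(err, context.Canceled) {
		fmt.Println("clean shutdown, no spurious error log")
	}
}
```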
Rebased with recommended changes. PTAL.
Thanks! And sorry this ended up taking so long
LGTM label has been added. Git tree hash: 00e7e5cbef412bab8bf37d3bc3f98327a20c6934
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, cbandy

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
Runnable implementations that return ctx.Err() cause a spurious "error received after stop" log message.
Fixes #1927