Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Quiet context.Canceled errors during shutdown #2745

Merged
merged 1 commit into from
Aug 4, 2024

Conversation

cbandy
Copy link
Contributor

@cbandy cbandy commented Mar 30, 2024

Runnable implementations that return ctx.Err() cause a spurious "error received after stop" log message.

Fixes #1927

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 30, 2024
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 30, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @cbandy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 30, 2024
Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test, please?
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 30, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 30, 2024
@cbandy
Copy link
Contributor Author

cbandy commented Mar 30, 2024

@alvaroaleman Thanks for taking a look. Test added in an amended commit.

@alvaroaleman
Copy link
Member

/retest


logs := []string{}
options.Logger = funcr.NewJSON(func(object string) {
logs = append(logs, object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a datarace which is why the tests are failing. You need to protect logs with a mutex or some such

Copy link
Contributor Author

@cbandy cbandy Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mutex here solved a race among append calls, but it revealed a race between this line and the assertion on L1091 during the LeaderElection suite. My guess is that this goroutine is to blame:

// Start the leader election and all required runnables.
{
ctx, cancel := context.WithCancel(context.Background())
cm.leaderElectionCancel = cancel
go func() {

😞 I added a time.Sleep for that suite for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is that currently with leaderelection engageStopProcedure is still writing the the logger while the test is asserting on the logs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess would be because the RunnableFunc is not a LeaderElectionRunnable. So it won't wait for leaderelection to close start

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay reproduced locally. ^^ doesn't make sense as everything that doesn't implement LeaderElectionRunnable is treated like it needs leader election

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason why it only occurs with leader election is because in that case we get one more log message:
image
(the other log messages are all produced "synchronously" within engageStopProcedure)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbandy @alvaroaleman If that makes sense to you both, let's add a godoc comment before the sleep and I think we're good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all?

Should engageStopProcedure wait for the errChan-draining goroutine to finish?

Copy link
Member

@sbueringer sbueringer Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 The fact that Start aggregates its err with stopErr makes me think it expects engageStopProcedure to return the outcome of stopping. Why does it inject a channel at all?

Not sure. I think the goal of this case case err := <-cm.errChan: in engageStopProcedure is mostly to avoid running into deadlocks because the errChan blocks.

Should engageStopProcedure wait for the errChan-draining goroutine to finish?

With the way it currently works it can't because this defer (defer close(stopComplete)) in Start is called too late.

Copy link
Member

@sbueringer sbueringer Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could wait in the Start func that the errChan-draining goroutine also finishes.

It would be a bit nicer, but I'm not sure if I want to introduce even more complexity into all of this. It's already pretty bad to reason about it.

Comment on lines -488 to +511
case err, ok := <-cm.errChan:
if ok {
case err := <-cm.errChan:
if !errors.Is(err, context.Canceled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep checking the ok value here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any calls to close errChan, so ok is always true (today).

Also, in my experience, reading from a closed channel in select is for some kind of control flow (break or return or so), or the channel is assigned nil so that it is not selected again.

Copy link
Member

@sbueringer sbueringer Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any calls to close errChan, so ok is always true (today).

Sounds okay to me. I guess if we ever start closing the channel we have to check all usages of the channel anyway.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2024
Runnable implementations that return ctx.Err() cause a spurious
"error received after stop" log message.
@cbandy
Copy link
Contributor Author

cbandy commented Aug 4, 2024

Rebased with recommended changes. PTAL.

Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And sorry this ended up taking so long

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 00e7e5cbef412bab8bf37d3bc3f98327a20c6934

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, cbandy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 89b5dee into kubernetes-sigs:main Aug 4, 2024
7 checks passed
@cbandy cbandy deleted the quiet-canceled branch August 4, 2024 23:43
@sbueringer
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: "error received after stop ..." is logged for context cancellation errors
6 participants