-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Kubernetes POD deployment goroutine leak #5693
Conversation
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Test Results2 699 tests +12 2 692 ✔️ +12 1m 54s ⏱️ -4s Results for commit 492d5de. ± Comparison against base commit 70eacc2. This pull request removes 3 and adds 15 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
pkg/corerp/handlers/kubernetes.go
Outdated
@@ -239,11 +241,11 @@ func (handler *kubernetesHandler) WatchUntilReady(ctx context.Context, informerF | |||
|
|||
deploymentInformer.AddEventHandler(handlers) | |||
// Start the informer | |||
informerFactory.Start(wait.NeverStop) | |||
informerFactory.WaitForCacheSync(wait.NeverStop) | |||
informerFactory.Start(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to validate this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes....go routine count is going down by doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah Sorry. I wanted to ask to add unit-test for this change.. As we discussed, I will add it since you're on vacation.
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
}() | ||
handler.startDeploymentInformer(ctx, item, doneCh, errCh) | ||
// This ensures that the informer is stopped when this function is returned. | ||
defer cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that we do not need to call cancel explicitly when the status is matched because of this defer cancel. In any situations, informer is cancelled.
func (handler *kubernetesHandler) Delete(ctx context.Context, options *DeleteOptions) error { | ||
identity := &resourcemodel.KubernetesIdentity{} | ||
if err := store.DecodeMap(options.Resource.Identity.Data, identity); err != nil { | ||
case err := <-errCh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol. we never tried to get error from the channel, which was bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
) | ||
|
||
var TestHook bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added timeout duration to kubernetesHandler struct. So we can avoid the test specific code in the logic. (I do not like the old pattern)
return | ||
} | ||
|
||
// There might be parallel deployments in progress, we need to make sure we are watching the right one | ||
if obj.Name != item.GetName() { | ||
return | ||
} | ||
|
||
// This channel is used only by test code to signal that the deployment is being watched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was used for testing. We can test without this.
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
@@ -223,38 +191,74 @@ func (handler *kubernetesHandler) WatchUntilReady(ctx context.Context, informerF | |||
return | |||
} | |||
|
|||
// This channel is used only by test code to signal that the deployment is being watched | |||
if eventHandlerInvokedCh != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary code for testing.
pkg/corerp/handlers/kubernetes.go
Outdated
} | ||
|
||
handler.watchUntilDeploymentReady(ctx, obj, readinessCh) | ||
handler.checkDeploymentStatus(ctx, obj, doneCh) | ||
}, | ||
UpdateFunc: func(old_obj, new_obj any) { | ||
old, ok := old_obj.(*v1.Deployment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how to test this with fakeImformer. :(
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
}, | ||
UpdateFunc: func(old_obj, new_obj any) { | ||
old, ok := old_obj.(*v1.Deployment) | ||
if !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the redundant check. This will never happen since it listens v1 deployment informer.
} | ||
|
||
deploymentInformer.AddEventHandler(handlers) | ||
// Start the informer | ||
informerFactory.Start(wait.NeverStop) | ||
informerFactory.WaitForCacheSync(wait.NeverStop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a good way since it waits all shared informers lol. we needs to wait with only infomers that we created here.
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - my comments here were minor, up to you whether you want to address them.
Radius functional test overview
Click here to see the list of tools in the current test run
Test Status⌛ Building Radius and pushing container images for functional tests... |
# Description Fixes goroutine leak in kubernetes handler. The informerfactory which was started in a goroutine was invoked with an option wait.NeverStop which caused the informer to keep the associated goroutines lingering around #5237 <!-- We strive to have all PR being opened based on an issue, where the problem or feature have been discussed prior to implementation. --> Fixes: #5237 ## Checklist Please make sure you've completed the relevant tasks for this PR, out of the following list: * [ ] Code compiles correctly * [ ] Adds necessary unit tests for change * [ ] Adds necessary E2E tests for change * [ ] Unit tests passing * [ ] Extended the documentation / Created issue for it ## Auto-generated summary <!-- GitHub Copilot for docs will auto-generate a summary of the PR --> <!-- copilot:all --> ### <samp>🤖 Generated by Copilot at d1fc1d4</samp> ### Summary 🚀🐛🧹 <!-- 1. 🚀 - This emoji can represent the improvement in performance and concurrency by using go routines and tracking their IDs. 2. 🐛 - This emoji can represent the bug fix and error handling in case of a panic, which can prevent crashes and data loss. 3. 🧹 - This emoji can represent the cleanup and refactoring of unused code and better logging and shutdown logic. --> Improved logging and error handling for async operations and Kubernetes informer in `armrpc` and `corerp` packages. Added `goroutineID` variable to `worker.go` to identify go routines. > _`goroutineID` logs_ > _panic and async tasks - /_ > _improved error handling_ ### Walkthrough * Generate and log go routine ID for async operation worker ([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L232-R235), [link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L245-R246)) * Use context for informer factory and logger in `WatchUntilReady` function ([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L188-R189), [link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L242-R244)) --------- Co-authored-by: Young Bu Park <youngp@microsoft.com>
Description
Fixes goroutine leak in kubernetes handler. The informerfactory which was started in a goroutine was invoked with an option wait.NeverStop which caused the informer to keep the associated goroutines lingering around
#5237
Fixes: #5237
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list:
Auto-generated summary
🤖 Generated by Copilot at d1fc1d4
Summary
🚀🐛🧹
Improved logging and error handling for async operations and Kubernetes informer in
armrpc
andcorerp
packages. AddedgoroutineID
variable toworker.go
to identify go routines.Walkthrough
WatchUntilReady
function (link, link)