Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in App Core RP - number of goroutines is increasing #5237

Closed
youngbupark opened this issue Mar 3, 2023 · 5 comments · Fixed by #5445 or #5693
Closed

Memory leak in App Core RP - number of goroutines is increasing #5237

youngbupark opened this issue Mar 3, 2023 · 5 comments · Fixed by #5445 or #5693
Labels
bug Something is broken or not working as expected triaged This issue has been reviewed and triaged

Comments

@youngbupark
Copy link

youngbupark commented Mar 3, 2023

Bug information

Steps to reproduce (required)

  1. Install prometheus and grafana dashboard
  2. Install radius
  3. Run functional tests

Observed behavior (required)

Link: https://radius-longhaul-wus3-awejb5drbjabh8h0.wus3.grafana.azure.com/d/Wh5JpyxVz/radius-overview?orgId=1&from=1677877199000&to=1677887999000

image

Number of goroutines of appcore-rp keeps growing while memory usage is increasing.

Desired behavior (required)

When GC executes, memory usage should be decreased and gorouine should be decresed.

Action items

We need to run load tests and find the memory leak by increased go routine.

AB#6467

@youngbupark youngbupark added the bug Something is broken or not working as expected label Mar 3, 2023
@rynowak
Copy link
Contributor

rynowak commented Mar 6, 2023

What immediately comes to mind is the async background processor. Not currently used in UCP but heavily used in appcore. Other than that there's very little concurrency in our server-side APIs since webservers are already "embarassingly parallel".

@youngbupark
Copy link
Author

What immediately comes to mind is the async background processor. Not currently used in UCP but heavily used in appcore. Other than that there's very little concurrency in our server-side APIs since webservers are already "embarassingly parallel".

Yeah nested go routines in async worker may have the bug or it could be Go GC behavior. We need to do some load test in order to see the root cause.

@shalabhms shalabhms added the triaged This issue has been reviewed and triaged label Mar 13, 2023
@youngbupark youngbupark changed the title Number of goroutine is increasing Memory leak in App Core RP - number of goroutines is increasing Mar 16, 2023
@AaronCrawfis AaronCrawfis removed the P0 label Mar 20, 2023
vinayada1 added a commit that referenced this issue Apr 10, 2023
# Description

Close opened buffers. Potential memory leak cause

#5237 

<!--
We strive to have all PR being opened based on an issue, where the
problem or feature have been discussed prior to implementation.
-->

Fixes: #issue_number

## Checklist

Please make sure you've completed the relevant tasks for this PR, out of
the following list:

* [ ] Code compiles correctly
* [ ] Adds necessary unit tests for change
* [ ] Adds necessary E2E tests for change
* [ ] Unit tests passing
* [ ] Extended the documentation / Created issue for it
youngbupark added a commit that referenced this issue May 4, 2023
# Description

otelhttp has the high cardinality problem which record all remote
address from client. This issue is tracked by
open-telemetry/opentelemetry-go-contrib#3765 .

This PR is adding workaround to prevent otelhttp from recording
remoteaddr as an attribute.

## Issue reference

<!--
We strive to have all PR being opened based on an issue, where the
problem or feature have been discussed prior to implementation.
-->

#5299
#5237

## Checklist

Please make sure you've completed the relevant tasks for this PR, out of
the following list:

* [ ] Code compiles correctly
* [ ] Adds necessary unit tests for change
* [ ] Adds necessary E2E tests for change
* [ ] Unit tests passing
* [ ] Extended the documentation / Created issue for it

---------

Co-authored-by: Yetkin Timocin <ytimocin@microsoft.com>
@vinayada1 vinayada1 mentioned this issue May 5, 2023
5 tasks
vinayada1 added a commit that referenced this issue May 5, 2023
# Description

Making some improvements found while investigating the memory leak
issue.

#5237 

<!--
We strive to have all PR being opened based on an issue, where the
problem or feature have been discussed prior to implementation.
-->

Fixes: #5237 

## Checklist

Please make sure you've completed the relevant tasks for this PR, out of
the following list:

* [ ] Code compiles correctly
* [ ] Adds necessary unit tests for change
* [ ] Adds necessary E2E tests for change
* [ ] Unit tests passing
* [ ] Extended the documentation / Created issue for it
@youngbupark
Copy link
Author

@vinayada1 UCP has been fixed, but memory and goroutines of 0.21 appcore-rp are still leaking now.

https://radius-grafana-a3d0dafefbhbegae.wus3.grafana.azure.com/d/Wh5JpyxVz/radius-overview?orgId=1&from=1686009600000&to=1686063599000

image

image

@youngbupark youngbupark reopened this Jun 6, 2023
youngbupark added a commit that referenced this issue Jun 13, 2023
# Description

Fixes goroutine leak in kubernetes handler. The informerfactory which
was started in a goroutine was invoked with an option wait.NeverStop
which caused the informer to keep the associated goroutines lingering
around

#5237 

<!--
We strive to have all PR being opened based on an issue, where the
problem or feature have been discussed prior to implementation.
-->

Fixes: #5237 

## Checklist

Please make sure you've completed the relevant tasks for this PR, out of
the following list:

* [ ] Code compiles correctly
* [ ] Adds necessary unit tests for change
* [ ] Adds necessary E2E tests for change
* [ ] Unit tests passing
* [ ] Extended the documentation / Created issue for it

## Auto-generated summary

<!--
GitHub Copilot for docs will auto-generate a summary of the PR
-->

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at d1fc1d4</samp>

### Summary
🚀🐛🧹

<!--
1. 🚀 - This emoji can represent the improvement in performance and
concurrency by using go routines and tracking their IDs.
2. 🐛 - This emoji can represent the bug fix and error handling in case
of a panic, which can prevent crashes and data loss.
3. 🧹 - This emoji can represent the cleanup and refactoring of unused
code and better logging and shutdown logic.
-->
Improved logging and error handling for async operations and Kubernetes
informer in `armrpc` and `corerp` packages. Added `goroutineID` variable
to `worker.go` to identify go routines.

> _`goroutineID` logs_
> _panic and async tasks - /_
> _improved error handling_

### Walkthrough
* Generate and log go routine ID for async operation worker
([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L232-R235),
[link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L245-R246))
* Use context for informer factory and logger in `WatchUntilReady`
function
([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L188-R189),
[link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L242-R244))

---------

Co-authored-by: Young Bu Park <youngp@microsoft.com>
@rynowak
Copy link
Contributor

rynowak commented Jun 13, 2023

Nice to see that graph go dooooowwwwwn

nithyatsu pushed a commit that referenced this issue Jun 21, 2023
# Description

Fixes goroutine leak in kubernetes handler. The informerfactory which
was started in a goroutine was invoked with an option wait.NeverStop
which caused the informer to keep the associated goroutines lingering
around

#5237 

<!--
We strive to have all PR being opened based on an issue, where the
problem or feature have been discussed prior to implementation.
-->

Fixes: #5237 

## Checklist

Please make sure you've completed the relevant tasks for this PR, out of
the following list:

* [ ] Code compiles correctly
* [ ] Adds necessary unit tests for change
* [ ] Adds necessary E2E tests for change
* [ ] Unit tests passing
* [ ] Extended the documentation / Created issue for it

## Auto-generated summary

<!--
GitHub Copilot for docs will auto-generate a summary of the PR
-->

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at d1fc1d4</samp>

### Summary
🚀🐛🧹

<!--
1. 🚀 - This emoji can represent the improvement in performance and
concurrency by using go routines and tracking their IDs.
2. 🐛 - This emoji can represent the bug fix and error handling in case
of a panic, which can prevent crashes and data loss.
3. 🧹 - This emoji can represent the cleanup and refactoring of unused
code and better logging and shutdown logic.
-->
Improved logging and error handling for async operations and Kubernetes
informer in `armrpc` and `corerp` packages. Added `goroutineID` variable
to `worker.go` to identify go routines.

> _`goroutineID` logs_
> _panic and async tasks - /_
> _improved error handling_

### Walkthrough
* Generate and log go routine ID for async operation worker
([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L232-R235),
[link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-5d9b9964242d77606f4868e3fe049978ea1c3f1903afcfeb60c76e19d9785fc5L245-R246))
* Use context for informer factory and logger in `WatchUntilReady`
function
([link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L188-R189),
[link](https://github.com/project-radius/radius/pull/5693/files?diff=unified&w=0#diff-033dc8a7c9b970cc30f7420ce8cf2f12a26258b276779683ba3f761c481bf0a7L242-R244))

---------

Co-authored-by: Young Bu Park <youngp@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken or not working as expected triaged This issue has been reviewed and triaged
Projects
None yet
4 participants