massive CPU usage at scale with v3.18.0 #557
I grabbed a CPU profile while attempting to reproduce the issue. It suggests the issue lies here: go-agent/v3/newrelic/instrumentation.go, line 47 (commit 5421539).
Basically, code-level metrics do not perform well at scale due to the caller lookups. And …
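For context, here is a minimal benchmark sketch, not taken from the agent's code, illustrating the cost difference between resolving the caller's source location on every call (roughly the per-request work implied by uncached caller lookups) and resolving it once and reusing it. The package and benchmark names are illustrative only.

```go
package clm

import (
	"runtime"
	"sync"
	"testing"
)

// BenchmarkCallerLookup resolves the caller's file and line on every
// iteration, roughly the per-call work of an uncached caller lookup.
func BenchmarkCallerLookup(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, _, _, _ = runtime.Caller(1)
	}
}

// BenchmarkCachedLookup resolves the location only once and reuses it
// on subsequent iterations.
func BenchmarkCachedLookup(b *testing.B) {
	var (
		once sync.Once
		file string
		line int
	)
	for i := 0; i < b.N; i++ {
		once.Do(func() {
			_, file, line, _ = runtime.Caller(1)
		})
		_, _ = file, line
	}
}
```

Running `go test -bench=.` on a sketch like this makes the relative cost of the per-call lookup visible.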
@nr-swilloughby Even if we did have CLM enabled, that doesn't mean we want it for every single handler. But if …
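For reference, the application-wide switch being contrasted here with per-handler control would look roughly like the sketch below. The option name ConfigCodeLevelMetricsEnabled is assumed rather than confirmed by this thread, so check the agent documentation for the exact v3.18.0 API; a single global flag like this does not give per-handler control.

```go
package main

import (
	"log"

	"github.com/newrelic/go-agent/v3/newrelic"
)

func main() {
	// Application-wide switch for code-level metrics. The exact option
	// name is an assumption; a global flag like this cannot express
	// "enable CLM for some handlers but not others".
	app, err := newrelic.NewApplication(
		newrelic.ConfigAppName("my-service"),
		newrelic.ConfigLicense("YOUR_NEW_RELIC_LICENSE_KEY"),
		newrelic.ConfigCodeLevelMetricsEnabled(false),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = app
}
```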
@rittneje ...and if you don't collect CLM for a particular wrapped function, I'll suppress the call to …
I didn't quite understand this. Do you mean having another application flag for whether …?
Another thing to consider is that even if you are manually passing …
Maybe the simplest approach is to conditionally call …
@rittneje I'm just brainstorming to find something that will be generally useful. The main idea is for CLM to "just work" without making the application devs add a lot of custom code like …
What I was thinking of above is the …
And yes, one way or the other I intend to make the logic around that internal …
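A rough sketch of the "conditionally call" idea discussed above, using hypothetical names (wrapHandler, clmEnabled) rather than the agent's actual API: the source-location lookup is skipped entirely unless code-level metrics are wanted, and when it does run it happens once at wrap time rather than on every request.

```go
package clmsketch

import (
	"net/http"
	"reflect"
	"runtime"
)

// wrapHandler is a hypothetical sketch, not the agent's real code: the
// wrapped function's source location is resolved only when code-level
// metrics are enabled, and only once, at wrap time.
func wrapHandler(clmEnabled bool, h http.HandlerFunc) http.HandlerFunc {
	var name, file string
	var line int
	if clmEnabled {
		// Resolve the wrapped function's location once, up front.
		if fn := runtime.FuncForPC(reflect.ValueOf(h).Pointer()); fn != nil {
			name = fn.Name()
			file, line = fn.FileLine(fn.Entry())
		}
	}
	return func(w http.ResponseWriter, r *http.Request) {
		if clmEnabled {
			// Here the agent would attach name, file, and line to the
			// transaction; this sketch just keeps the values in scope.
			_, _, _ = name, file, line
		}
		h(w, r)
	}
}
```

Resolving the location once per wrapped handler, instead of per request, would also sidestep the per-call lookup cost discussed earlier in the thread.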
@nr-swilloughby Further analysis reveals another cause of this issue. We were panicking a lot on this line due to a nil pointer dereference: go-agent/v3/newrelic/internal_txn.go, line 107 (commit b580d7f).
It is unclear exactly why, but evidently one of the …
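If those panics were being recovered somewhere upstream rather than crashing the process (an assumption; the thread doesn't say where they were caught), frequent panic/recover on a hot path is itself expensive and shows up as CPU time. A minimal benchmark sketch of that cost, with illustrative names:

```go
package panicsketch

import "testing"

// BenchmarkPanicRecover measures a panic that is immediately recovered.
func BenchmarkPanicRecover(b *testing.B) {
	for i := 0; i < b.N; i++ {
		func() {
			defer func() { _ = recover() }()
			panic("stand-in for the nil pointer dereference")
		}()
	}
}

// BenchmarkNoPanic is the baseline: the same deferred recover, no panic.
func BenchmarkNoPanic(b *testing.B) {
	for i := 0; i < b.N; i++ {
		func() {
			defer func() { _ = recover() }()
		}()
	}
}
```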
@nr-swilloughby Actually, the real issue is a race condition / bad multithreading: go-agent/v3/newrelic/instrumentation.go, line 47 (commit 5421539).
This is assigning the result of … That it is calling …
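A minimal sketch of the kind of race being described, with hypothetical type and field names rather than the agent's actual ones: lazily filling a shared cached location from concurrent requests without synchronization is a data race, whereas sync.Once (or a mutex) makes the same lazy-initialization pattern safe.

```go
package racesketch

import (
	"runtime"
	"sync"
)

type codeLocation struct {
	file string
	line int
}

// racyCache shows the unsafe pattern: concurrent callers can read and
// write loc at the same time, which the race detector flags when the
// cache is shared across goroutines.
type racyCache struct {
	loc *codeLocation
}

func (c *racyCache) get() *codeLocation {
	if c.loc == nil { // unsynchronized read
		_, file, line, _ := runtime.Caller(1)
		c.loc = &codeLocation{file: file, line: line} // unsynchronized write
	}
	return c.loc
}

// safeCache computes the location at most once and publishes it safely.
type safeCache struct {
	once sync.Once
	loc  codeLocation
}

func (c *safeCache) get() codeLocation {
	c.once.Do(func() {
		_, file, line, _ := runtime.Caller(1)
		c.loc = codeLocation{file: file, line: line}
	})
	return c.loc
}
```

Synchronizing the cache only addresses the race; whether the lookup should run at all when CLM is disabled is the separate concern raised earlier in the thread.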
@rittneje Yes, we're working on both of those.
I have experimental code in place to correct these issues. Testing (including adding more unit tests) is up next. I'm pressing on this so the fixes can be released ASAP.
After upgrading our services to v3.18.0 from v3.17.0, we noticed a massive increase in CPU usage (to almost 100%) in our services that have the most production load at scale. We do not have code-level metrics enabled. We did not change our configuration at all from v3.17.0.