Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

massive CPU usage at scale with v3.18.0 #557

Closed
rittneje opened this issue Aug 19, 2022 · 11 comments · Fixed by #566 or newrelic/docs-website#9155
Closed

massive CPU usage at scale with v3.18.0 #557

rittneje opened this issue Aug 19, 2022 · 11 comments · Fixed by #566 or newrelic/docs-website#9155
Assignees
Labels

Comments

@rittneje
Copy link
Contributor

rittneje commented Aug 19, 2022

After upgrading our services to v3.18.0 from v3.17.0, we noticed a massive increase in CPU usage (to almost 100%) in our services that have the most production load at scale. We do not have code-level metrics enabled. We did not change our configuration at all from v3.17.0.

@rittneje rittneje added the bug label Aug 19, 2022
@rittneje
Copy link
Contributor Author

I grabbed a cpu profile while attempting to reproduce the issue. It suggests the issue lies here:

options = append([]TraceOption{WithThisCodeLocation()}, options...)

Basically, code-level metrics do not perform well at scale due to the caller lookups. And WrapHandle is always fetching this information, even if CLM is disabled for the entire app. I don't know why WithThisCodeLocation() is being forcefully added instead of making the caller pass it in, but at the very least it should not be injected if the app has CLM disabled.

@nr-swilloughby
Copy link
Contributor

The WithThisCodeLocation() call is there to accurately identify the wrapping function as the function instead of looking elsewhere, although the user may override that if they choose. However, you are right, that shouldn't be called if CLM is disabled.

@nr-swilloughby nr-swilloughby self-assigned this Aug 20, 2022
@rittneje
Copy link
Contributor Author

@nr-swilloughby Even if we did have CLM enabled, that doesn't mean we want it for every single handler. But if WrapHandle always calls WithThisCodeLocation then there is no way to conditionally skip it.

@nr-swilloughby
Copy link
Contributor

nr-swilloughby commented Aug 20, 2022

@rittneje
Yeah I see that. I'm thinking I can make a simple change that would make the call to WithThisCodeLocation conditional on at least CLM being disabled, as well as allowing you to select CLM on/off for specific wrapped functions. Would it be useful to make wrapped functions be an entire class of things that can be selected on/off for CLM? So you could enable CLM except for wrapped functions if you wanted?

@nr-swilloughby
Copy link
Contributor

nr-swilloughby commented Aug 20, 2022

@rittneje ...and if you don't collect CLM for a particular wrapped function I'll suppress the call to WithThisCodeLocation as well.

@rittneje
Copy link
Contributor Author

rittneje commented Aug 20, 2022

Would it be useful to make wrapped functions be an entire class of things that can be selected on/off for CLM?

I didn't quite understand this. Do you mean having another application flag for whether WrapHandle and WrapHandler should auto-inject WithThisCodeLocation? If so I think that's reasonable.

if you don't collect CLM for a particular wrapped function I'll suppress the call to WithThisCodeLocation as well.

Another thing to consider is even if you are manually passing WithCodeLocation or similar explicitly to WrapHandle, at present it doesn't know that and will call WithThisCodeLocation anyway for no reason. And you can't really tell which trace options were provided without invoking all of them, so that complicates things.

Maybe the simplest approach is to conditionally call Transaction.SetOption after StartTransaction? Or make some special internal TraceOption that does the conditional?

@nr-swilloughby
Copy link
Contributor

@rittneje I'm just brainstorming to find something that will be generally useful. The main idea is for CLM to "just work" without making the application devs add a lot of custom code like WithThisCodeLocation everywhere, so we'd like the defaults to enable things to work without code changes. However, of course there are cases where CLM isn't wanted or isn't practical for some code, hence the various options to turn it off for individual transactions, etc.

What I was thinking of above is the ConfigCodeLevelMetricsScope option which is designed to select broad categories of traces to be included in CLM or not. Currently there's only one category (transactions) but I could add one for wrapped functions so that you could use that configuration option to globally include or exclude those entirely if that would be useful. I think it would be preferable to having to manually turn off each of them.

And yes, one way or other I intend to make the logic around that internal WithThisCodeLocation call avoid a needless invocation.

@rittneje
Copy link
Contributor Author

rittneje commented Aug 22, 2022

@nr-swilloughby Further analysis reveals another cause of this issue. We were panicking a lot on this line due to some nil pointer dereference.

net/http/server.go:3195: http: panic serving 127.0.0.6:42057: runtime error: invalid memory address or nil pointer dereference
goroutine 2152022 [running]:
net/http.(*conn).serve.func1()
net/http/server.go:1825 +0xbf
panic({0xce5340, 0x1653d40})
runtime/panic.go:844 +0x258
vendor/github.com/newrelic/go-agent/v3/newrelic.newTxn(0xc000137400, 0xc000476900, {0xc0006f8900, 0xc}, {0xc003dc0000, 0x1379a, 0xde5e1a?})
vendor/github.com/newrelic/go-agent/v3/newrelic/internal_txn.go:107 +0xe1
vendor/github.com/newrelic/go-agent/v3/newrelic.(*app).StartTransaction(0xc000137400, {0xc0006f8900, 0xc}, {0xc003dc0000, 0x1379a, 0x13800})
vendor/github.com/newrelic/go-agent/v3/newrelic/internal_app.go:511 +0x74
vendor/github.com/newrelic/go-agent/v3/newrelic.(*Application).StartTransaction(...)
vendor/github.com/newrelic/go-agent/v3/newrelic/application.go:23
vendor/github.com/newrelic/go-agent/v3/newrelic.WrapHandle.func1({0x109e908, 0xc000b920e0}, 0xc00007c100)
vendor/github.com/newrelic/go-agent/v3/newrelic/instrumentation.go:48 +0x208
net/http.HandlerFunc.ServeHTTP(0x109f430?, {0x109e908?, 0xc000b920e0?}, 0x1095400?)
net/http/server.go:2084 +0x2f

It is unclear exactly why, but evidently one of the TraceOptions is coming in as nil. It should be noted that we are not explicitly passing any options anywhere. And since we weren't seeing a 100% failure rate, that means it wasn't always panicking.

@rittneje
Copy link
Contributor Author

rittneje commented Aug 22, 2022

@nr-swilloughby Actually the real issue is a race condition/bad multithreading.

options = append([]TraceOption{WithThisCodeLocation()}, options...)

This is assigning the result of append into options, which is a variable that is shared by all the threads using that handler. So it keeps appending into the same slice, causing it to grow without bound. As the slice grows, that means it will spend more time iterating through the list of options. And sometimes, this race condition results in the aforementioned panic. Please fix this line to use a different variable with :=. Please also add test cases.

That it is calling WithThisCodeLocation for no reason is also a bug. So both issues need to be fixed.

@nr-swilloughby
Copy link
Contributor

@rittneje yes we're working on both of those.

@nr-swilloughby
Copy link
Contributor

I have experimental code in place to correct these issues. Testing (including adding more unit tests) is up next. Pressing on this so the fixes can be released ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants