-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid potential overhead of ThreadLocal.remove in LogbackMetrics #3952
base: main
Are you sure you want to change the base?
Avoid potential overhead of ThreadLocal.remove in LogbackMetrics #3952
Conversation
In one of our apps we're seeing that over time the size of the ThreadLocalMap underpinning the ThreadLocal class gets large (due to other libraries overusing/abusing ThreadLocals). When this happens `ThreadLocal.remove` can start to become a performance bottleneck as it does some `expungeStaleEntry` calls that iterate over the whole map. Because logger calls can be so frequent (even if they're suppressed this code is still called) the `TurboFilter` gets called _a lot_, and thus so does `ThreadLocal.remove`. Therefore Micrometer ends up slowing down all our code. I agree that technically this isn't Micrometer's fault, but the potential for it to happen in any application is still there and MM should really do what it can to avoid becoming a bottleneck. So this PR changes `LogbackMetrics` to remove the need for it to call `ThreadLocal.remove` and thus the potential for this bug. It does this by _not clearing the ThreadLocal after use_ and instead setting it to a 'holder' class that has its own boolean flag that gets flipped on/off inside `ignoreMetrics`. Running the LogbackBenchmarks this seems to be possibly faster than the regular code even when ThreadLocals aren't overloaded, so it seems like a decent improvement overall Of course the tradeoff is that now MM 'pollutes' all threads with an extraneous ThreadLocal that it never cleans up. However: 1. If a thread doesn't log(?) then it won't get a ThreadLocal. If it does then the chances are it will log again (soon?) so the ThreadLocal will be needed again then anyway. 2. It's not a big object, one boolean 3. It should be cleaned up when the threads die In developing this fix I did try a couple of other solutions: 1. Just call `ThreadLocal.set(false)` instead of `remove`. Unfortunately `set` does some similar cleanup to `remove` so the issue would still be there 2. Use an `AtomicBoolean` instead of a custom class to hold the flag. This wasn't performant enough as `AtomicBoolean` does a load of locking etc as it's meant to be used across threads - as these flags are explicitly only used on a single thread that locking is unnecessary overhead 3. Use a boolean[] as the flag holder. This seems a bit too ugly to me Note that there is a similar issue to this one that might have helped me here (micrometer-metrics#3891) as most of the time this is happening for me on some supressed `logger.debug` calls. However this would still be an issue for every non-suppressed logger call that my app made, without this fix.
@spmason Please sign the Contributor License Agreement! Click here to manually synchronize the status of this Pull Request. See the FAQ for frequently asked questions. |
@spmason Thank you for signing the Contributor License Agreement! |
Thank you for the PR and sorry that we did not get back to you sooner. The leak that this change is causing might not be a big issue for apps that are (re)using platform threads but with virtual threads this can be an issue: virtual threads can be created and dropped at a high rate so you can end-up with a lot of Thread.ofVirtual().start(() -> threadLocal.set(Boolean.TRUE)); I can see the following solutions:
|
Since this pull request was originally opened we've addressed the part about the ThreadLocal being called when the log level is not enabled in #3891, so this issue should be at least somewhat mitigated. I'm concerned about the following mentioned tradeoff this solution introduces:
So I'm a bit weary to adopt this solution as we have to think about how it may affect users with different run-time characteristics and usage in their applications. The whole use of ThreadLocal to avoid potentially infinitely recursive calls in the case a Counter implementation's increment method logs is unfortunate because I think in practice it only came about because of the Statsd implementation. Though the general idea stands that at the level of logging instrumentation we can't know that a Counter implementation doesn't itself log. |
In one of our apps we're seeing that over time the size of the ThreadLocalMap underpinning the ThreadLocal class gets large (due to other libraries overusing/abusing ThreadLocals). When this happens
ThreadLocal.remove
can start to become a performance bottleneck as it does someexpungeStaleEntry
calls that iterate over the whole map.Because logger calls can be so frequent (even if they're suppressed this code is still called) the
TurboFilter
gets called a lot, and thus so doesThreadLocal.remove
. Therefore Micrometer ends up slowing down all our code.I agree that technically this isn't Micrometer's fault, but the potential for it to happen in any application is still there and MM should really do what it can to avoid becoming a bottleneck.
So this PR changes
LogbackMetrics
to remove the need for it to callThreadLocal.remove
and thus the potential for this bug. It does this by not clearing the ThreadLocal after use and instead setting it to a 'holder' class that has its own boolean flag that gets flipped on/off insideignoreMetrics
.Running the LogbackBenchmarks this seems to be possibly faster than the regular code even when ThreadLocals aren't overloaded, so it seems like a decent improvement overall
Of course the tradeoff is that now micrometer 'pollutes' all threads with an extraneous ThreadLocal value that it never cleans up. However:
In developing this fix I did try a couple of other solutions:
ThreadLocal.set(false)
instead ofremove
. Unfortunatelyset
does some similar cleanup toremove
so the issue would still be thereAtomicBoolean
instead of a custom class to hold the flag. This wasn't performant enough asAtomicBoolean
does a load of locking etc as it's meant to be used across threads - as these flags are explicitly only used on a single thread that locking is unnecessary overheadNote that there is a similar issue to this one that might have helped this situation (#3891) as most of the time this is happening for me on some suppressed
logger.debug
calls. However even without that fix this would still be an issue for every non-suppressed logger call that my app made.