Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid potential overhead of ThreadLocal.remove in LogbackMetrics #3952

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

spmason
Copy link

@spmason spmason commented Jul 3, 2023

In one of our apps we're seeing that over time the size of the ThreadLocalMap underpinning the ThreadLocal class gets large (due to other libraries overusing/abusing ThreadLocals). When this happens ThreadLocal.remove can start to become a performance bottleneck as it does some expungeStaleEntry calls that iterate over the whole map.

Because logger calls can be so frequent (even if they're suppressed this code is still called) the TurboFilter gets called a lot, and thus so does ThreadLocal.remove. Therefore Micrometer ends up slowing down all our code.

I agree that technically this isn't Micrometer's fault, but the potential for it to happen in any application is still there and MM should really do what it can to avoid becoming a bottleneck.

So this PR changes LogbackMetrics to remove the need for it to call ThreadLocal.remove and thus the potential for this bug. It does this by not clearing the ThreadLocal after use and instead setting it to a 'holder' class that has its own boolean flag that gets flipped on/off inside ignoreMetrics.

Running the LogbackBenchmarks this seems to be possibly faster than the regular code even when ThreadLocals aren't overloaded, so it seems like a decent improvement overall

Of course the tradeoff is that now micrometer 'pollutes' all threads with an extraneous ThreadLocal value that it never cleans up. However:

  1. If a thread doesn't log(?) then it won't get a ThreadLocal. If it does then the chances are it will log again (soon?) so the ThreadLocal will be needed again then anyway.
  2. It's not a big object, one boolean
  3. It should be cleaned up when the threads die

In developing this fix I did try a couple of other solutions:

  1. Just call ThreadLocal.set(false) instead of remove. Unfortunately set does some similar cleanup to remove so the issue would still be there
  2. Use an AtomicBoolean instead of a custom class to hold the flag. This wasn't performant enough as AtomicBoolean does a load of locking etc as it's meant to be used across threads - as these flags are explicitly only used on a single thread that locking is unnecessary overhead
  3. Use a boolean[] as the flag holder. This seems a bit too ugly to me

Note that there is a similar issue to this one that might have helped this situation (#3891) as most of the time this is happening for me on some suppressed logger.debug calls. However even without that fix this would still be an issue for every non-suppressed logger call that my app made.

In one of our apps we're seeing that over time the size of the ThreadLocalMap underpinning the ThreadLocal class gets large (due to other libraries overusing/abusing ThreadLocals). When this happens `ThreadLocal.remove` can start to become a performance bottleneck as it does some `expungeStaleEntry` calls that iterate over the whole map.

Because logger calls can be so frequent (even if they're suppressed this code is still called) the `TurboFilter` gets called _a lot_, and thus so does `ThreadLocal.remove`. Therefore Micrometer ends up slowing down all our code.

I agree that technically this isn't Micrometer's fault, but the potential for it to happen in any application is still there and MM should really do what it can to avoid becoming a bottleneck.

So this PR changes `LogbackMetrics` to remove the need for it to call `ThreadLocal.remove` and thus the potential for this bug. It does this by _not clearing the ThreadLocal after use_ and instead setting it to a 'holder' class that has its own boolean flag that gets flipped on/off inside `ignoreMetrics`.

Running the LogbackBenchmarks this seems to be possibly faster than the regular code even when ThreadLocals aren't overloaded, so it seems like a decent improvement overall

Of course the tradeoff is that now MM 'pollutes' all threads with an extraneous ThreadLocal that it never cleans up. However:

1. If a thread doesn't log(?) then it won't get a ThreadLocal. If it does then the chances are it will log again (soon?) so the ThreadLocal will be needed again then anyway.
2. It's not a big object, one boolean
3. It should be cleaned up when the threads die

In developing this fix I did try a couple of other solutions:
1. Just call `ThreadLocal.set(false)` instead of `remove`. Unfortunately `set` does some similar cleanup to `remove` so the issue would still be there
2. Use an `AtomicBoolean` instead of a custom class to hold the flag. This wasn't performant enough as `AtomicBoolean` does a load of locking etc as it's meant to be used across threads - as these flags are explicitly only used on a single thread that locking is unnecessary overhead
3. Use a boolean[] as the flag holder. This seems a bit too ugly to me

Note that there is a similar issue to this one that might have helped me here (micrometer-metrics#3891) as most of the time this is happening for me on some supressed `logger.debug` calls. However this would still be an issue for every non-suppressed logger call that my app made, without this fix.
@pivotal-cla
Copy link

@spmason Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

@pivotal-cla
Copy link

@spmason Thank you for signing the Contributor License Agreement!

@shakuzen shakuzen added performance Issues related to general performance module: micrometer-core An issue that is related to our core module enhancement A general enhancement labels Jul 6, 2023
@jonatan-ivanov
Copy link
Member

Thank you for the PR and sorry that we did not get back to you sooner.

The leak that this change is causing might not be a big issue for apps that are (re)using platform threads but with virtual threads this can be an issue: virtual threads can be created and dropped at a high rate so you can end-up with a lot of BooleanHolder instances. I tested this with JDK 22 and I was able to create a significant amount of leak (well over 1GiB) just by doing this (a lot) even without introducing a BooleanHolder:

Thread.ofVirtual().start(() -> threadLocal.set(Boolean.TRUE));

I can see the following solutions:

  • Logback implementing this natively (through JMX?) so we don't need the filter neither the thread-local (please create an issue for Logback if you think this would work for you)
  • Scoped Values can solve this issue but the feature is still in preview
  • Making this somehow overridable, I'm not sure how though since ignoreMetrics is static

@shakuzen
Copy link
Member

shakuzen commented Aug 5, 2024

Because logger calls can be so frequent (even if they're suppressed this code is still called) the TurboFilter gets called a lot, and thus so does ThreadLocal.remove. Therefore Micrometer ends up slowing down all our code.

Since this pull request was originally opened we've addressed the part about the ThreadLocal being called when the log level is not enabled in #3891, so this issue should be at least somewhat mitigated.

I'm concerned about the following mentioned tradeoff this solution introduces:

Of course the tradeoff is that now micrometer 'pollutes' all threads with an extraneous ThreadLocal value that it never cleans up.

So I'm a bit weary to adopt this solution as we have to think about how it may affect users with different run-time characteristics and usage in their applications.

The whole use of ThreadLocal to avoid potentially infinitely recursive calls in the case a Counter implementation's increment method logs is unfortunate because I think in practice it only came about because of the Statsd implementation. Though the general idea stands that at the level of logging instrumentation we can't know that a Counter implementation doesn't itself log.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A general enhancement module: micrometer-core An issue that is related to our core module performance Issues related to general performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants