Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actuator is using CPU while app is doing nothing #11333

Closed
dgsqrs opened this issue Dec 12, 2017 · 20 comments
Closed

Actuator is using CPU while app is doing nothing #11333

dgsqrs opened this issue Dec 12, 2017 · 20 comments
Assignees
Labels
type: enhancement A general enhancement
Milestone

Comments

@dgsqrs
Copy link

dgsqrs commented Dec 12, 2017

Hi there.

If you add the spring-boot-starter-actuator as dependency, after the first http call, the app will consume 6%-7% CPU forever (on a few threads). Even when there is no active http connection nor application background thread working.

If I remove this dependency, the app is idle at 0% as expected.

Edit
Spring Boot version 2.0.0.0M7
Java 8
Archive.zip

The demo app have been generated with https://start.spring.io/

Steps to reproduce the issue

  1. Build the app
  2. Launch it using java -jar
  3. Monitor CPU usage using htop => Nothing special is seen
  4. Open a web browser and browse to localhost:8080 => error page is shown
  5. Check CPU usage on htop => 6-7% used endlessly by the app
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Dec 12, 2017
@philwebb
Copy link
Member

philwebb commented Dec 12, 2017

@dgsqrs What version of Spring Boot are you using? Can you share a small sample that shows the issue?

@philwebb philwebb added the status: waiting-for-feedback We need additional information before we can continue label Dec 12, 2017
@dgsqrs
Copy link
Author

dgsqrs commented Dec 13, 2017

Hi @philwebb .
I updated my initial post with the requested information.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Dec 13, 2017
@bclozel
Copy link
Member

bclozel commented Dec 13, 2017

I can reproduce this. After the first request, the CPU seems to be busy:

after-first-request

VisualVM shows this:
cpu

This behaviour seems to be related with the new metrics support, since excluding the autoconfiguration with @SpringBootApplication(exclude = MetricsAutoConfiguration.class) "fixes" this.

@jkschneider is this related to the GC pause detection or some specific instrumentation? Is the cost constant even when the application is busy or is this just something to keep the VM busy on purpose?

@jkschneider
Copy link
Contributor

jkschneider commented Dec 13, 2017

Is this related to the GC pause detection

Yes, I believe so (and other system-level pauses unrelated to GC).

Is the cost constant even when the application is busy

Yes.


I'm not sure if it's possible to improve this mechanism to use no CPU. We could offer a config option to tradeoff monitoring accuracy for CPU consumption, though it's unclear to me what the default should be.

More research is in order. Perhaps there is a way to accomplish accurate pause detection and indiscernible CPU consumption.

@snicoll snicoll added the for: team-attention An issue we'd like other members of the team to review label Dec 13, 2017
@philwebb
Copy link
Member

My initial feeling is that we should set the default to not consume CPU. I expect that GC pause detection is probably quite an advanced problem for most users and by the time that they need it they'll be quite a long way into their journey with metrics. Constant CPU usage, however, is probably going to get noticed.

tldr: I think this should be configurable and by default off. In the docs we should call it out.

@jkschneider
Copy link
Contributor

jkschneider commented Dec 13, 2017

I don't necessarily disagree with the default if the mechanism can't be improved, but this kind of reasoning really troubles me:

by the time that they need it they'll be quite a long way into their journey with metrics

Unless we compensate for pauses, they are basically looking at garbage. This lures folks into a false sense of security, the "blindfold with a lollipop" state of being.

I saw a study once (conveniently can't remember where) that suggested something like 20% of Java apps exhibited periodic pauses of 5 sec or greater. Even if both of these figures were far lower, it's still frightening. If nothing else, I want the default set of metrics people get to guide them to responsible operationalization, e.g. you should probably run more than one instance in prod and have some circuit breaker mechanism to control SLA boundaries.

I understand the temptation to view this as a day 2 problem. It's just not clear to me that many organizations ever even realize there is a problem in the first place.

@jkschneider
Copy link
Contributor

jkschneider commented Dec 13, 2017

At a cursory glance, I'm pretty confident we can implement a pause detector that trades off pause-length granularity for apparent CPU consumption. Then we can select a default that still captures well the eye-popping GC pauses.

@osialr
Copy link

osialr commented Dec 13, 2017

I found that setting the system property LatencyUtils.useActualTime to false disables the SimplePauseDetector but keeps the rest of micrometer running.

It switches LatencyUtils into a test mode such that TimeServices#sleepNanos will call wait() on a monitor that never fires.

@jkschneider
Copy link
Contributor

jkschneider commented Dec 13, 2017

I found that setting the system property LatencyUtils.useActualTime to false

Wow, the hacking prize goes to you today, @osialr :). Thanks for the workaround option.

@snicoll
Copy link
Member

snicoll commented Jan 3, 2018

@jkschneider do you have a micrometer issue that we can track?

@snicoll snicoll added priority: normal type: enhancement A general enhancement and removed for: team-attention An issue we'd like other members of the team to review status: feedback-provided Feedback has been provided status: waiting-for-triage An issue we've not yet triaged labels Jan 3, 2018
@snicoll snicoll added this to the 2.0.0.RC1 milestone Jan 3, 2018
@jkschneider
Copy link
Contributor

Testing the fix this morning. Here is the issue: micrometer-metrics/micrometer#328

@jkschneider
Copy link
Contributor

Summarizing the changes coming in Micrometer 1.0.0-rc.6:

The pause detector is configurable to an extent via MeterRegistry#config(). The default pause detector will still be a clock-drift detector which needs to wake up periodically to check for drift. By configuring the sleep interval (time between checks), you can control CPU consumption. We've selected a default in rc6 that reduces the CPU consumption to about 0.1% while still providing decent compensation for long pauses.

You can change the intervals on the clock-drift detector like so:

@Bean
MeterRegistryConfigurer registryConfig() {
   return r -> r.config().pauseDetector(new ClockDriftPauseDetector(sleepInt, pauseInt));
}

We also provide an alternate, noop implementation. Unless you are concerned about even the slightest amount of CPU use, this is not recommended:

@Bean
MeterRegistryConfigurer registryConfig() {
   return r -> r.config().pauseDetector(new NoPauseDetector());
}

It's possible that we'll find alternate pause detector implementations in the future. Perhaps the JDK will someday provide an event-based interface for GC events. Or we could probably provide a detector limited to GC events by scraping GC logs.

@bclozel
Copy link
Member

bclozel commented Jan 15, 2018

I've tried another round of testing using the latest Spring Boot snapshots; I'm not really seeing an improvement here and even seeing conflicting results.

Note that I'm looking at CPU usage on the PID line, which is the amount of CPU used by the process last time it was scheduled on a CPU. So "1%" means "1% on the CPU it was scheduled on", not "1% of the host capacity". This might be biased, but I'm trying to ignore other processes running on the host.

Results

Spring Boot and Actuator:
boot-actuator

Spring Boot and Actuator, after first request:
boot-actuator-first-request

Spring Boot and Actuator, NoPauseDetector:
boot-actuator-noop-pause-detector

Spring Boot and Actuator, NoPauseDetector after first request:
boot-actuator-no-pause-detector-first-request

Spring Boot without Actuator:
boot-without-actuator

Spring Boot with Actuator, but Metrics auto-configuration ignored:
boot-actuator-metrics-excluded

Summary

  • "Spring Boot without Actuator" and "Spring Boot with Metrics auto-configuration ignored" are still equivalent.
  • We now see the CPU usage even before the first request, which means some initialization code might have changed. CPU usage seems to decrease after the first request.
  • NoPauseDetector doesn't seem to make a big difference in my tests

There might be something else at play here and we need to look deeper.
Any idea @jkschneider @osialr ?

@jkschneider
Copy link
Contributor

Fixed a bug in micrometer-core that prevented the configured detector from being used
(micrometer-metrics/micrometer#350).

@philwebb
Copy link
Member

@jkschneider Is there a SNAPSHOT available that we can try? I'd like to test this before you cut RC8

@jkschneider
Copy link
Contributor

@philwebb Yes. Snapshots are published on every successful build to repo.spring.io

@philwebb
Copy link
Member

Thanks @jkschneider. Did you have a PR lined up for this as well? I can't remember what we discussed and if the upgrade is easy or not.

@bclozel
Copy link
Member

bclozel commented Jan 25, 2018

I've hacked a draft PR here.

I'm not sure about a few of those changes, especially in our test suite where we were previously counting measures (if I'm not mistaken). But the test suite is green.

@jkschneider
Copy link
Contributor

@bclozel Oops, I guess you got a little ahead of me.

I'll be piling more commits into: #11776

@snicoll snicoll self-assigned this Jan 29, 2018
@snicoll
Copy link
Member

snicoll commented Jan 29, 2018

OK so the update to the snapshot has fixed the issue for me. @dgsqrs if you can give 2.0.0.BUILD-SNAPSHOT a try that would be great.

@snicoll snicoll closed this as completed Jan 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

7 participants