
Pause Detection contributes to count when it should just contribute to total time. #844

Open
koa opened this issue Sep 12, 2018 · 30 comments

@koa

koa commented Sep 12, 2018

Situation:

  • Spring Boot 2.0.4
  • Prometheus
  • VMWare Server
  • high throughput (20k timers/s per instance)

Symptoms:

  • *_seconds_count is too high (by a factor of roughly 5k; there are huge jumps on a few counters)
  • 1 core is used 100% by org.LatencyUtils.PauseDetector$PauseDetectorThread.run after 1-8h of runtime:
   java.lang.Thread.State: RUNNABLE
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram$FixedBoundaryHistogram.leastLessThanOrEqualTo(TimeWindowFixedBoundaryHistogram.java:160)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram$FixedBoundaryHistogram.record(TimeWindowFixedBoundaryHistogram.java:140)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:60)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:35)
        at io.micrometer.core.instrument.distribution.AbstractTimeWindowHistogram.recordLong(AbstractTimeWindowHistogram.java:217)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:35)
        at io.micrometer.prometheus.PrometheusTimer.recordNonNegative(PrometheusTimer.java:69)
        at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176)
        at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135)
        at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121)
        at io.micrometer.core.instrument.AbstractTimer$$Lambda$448/1392734885.handlePauseEvent(Unknown Source)
        at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)

Workaround:
Switch off Pause Detection

@Bean
public MeterRegistryCustomizer<PrometheusMeterRegistry> registryConfig() {
    return r -> r.config().pauseDetector(new NoPauseDetector());
}
@jkschneider
Contributor

jkschneider commented Sep 12, 2018

I'll state my belief that the amount of work that the pause detector does never changes, that it has a fixed CPU overhead governed by its sleep and pause intervals. Maybe you can demonstrate that I'm wrong in this belief ;)

CPU overhead on pause detection should be inversely proportional to the sleep and pause intervals on the detector. By default they are both 100ms. Could you try setting them higher and observing again?

For example:

registry.config().pauseDetector(new ClockDriftPauseDetector(Duration.ofSeconds(1), Duration.ofSeconds(1)));

@koa
Author

koa commented Sep 13, 2018

Interesting.
The CPU waste seems to be reduced, but *_seconds_count still increases too much.
Before (with the default settings) the increment was too high by a factor of 5k; with your settings it is only about 5% too high.

@jkschneider
Contributor

Pause detection doesn't affect *_seconds_count; it only contributes latency to *_seconds_sum. Could you make a reproducible example?

CPU consumption depends on the hardware, so there is no single value we can choose that always costs 1% or something like that. We've just picked a default that is good for most modern infrastructure.

@koa
Author

koa commented Sep 13, 2018

How does it avoid incrementing *_seconds_count?
The stack trace above shows a call to record() on the timer (coming from the PauseDetector).
Every call to this method increments *_seconds_count by 1 (PrometheusTimer.java:63).
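
Roughly (a paraphrased sketch, not the actual Micrometer source), the recording path looks like the snippet below, so a compensation call from the pause detector bumps the throughput count as well as the total time:

// paraphrased sketch of PrometheusTimer's recording path (not the actual source)
protected void recordNonNegative(long amount, TimeUnit unit) {
    count.increment();                    // <-- the *_seconds_count increment
    totalTime.add(unit.toNanos(amount));  // <-- the *_seconds_sum contribution
    histogram.recordLong(unit.toNanos(amount));
}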

@jkschneider jkschneider changed the title Pause Detection goes wild on VMWare server Pause Detection contributes to count when it should just contribute to total time. Sep 13, 2018
@jkschneider jkschneider added this to the 1.0.7 milestone Sep 13, 2018
@jkschneider jkschneider added the bug A general bug label Sep 13, 2018
@jkschneider
Contributor

Thanks. I think I finally understand. So if it's OK I'll consider the CPU discussion over, since that seems to be working as intended. This issue then becomes about ensuring that pause detection contributes to total time and not throughput.

@koa
Author

koa commented Sep 14, 2018

In my situation, both symptoms are a bigger problem than a little jitter on the measured durations.

@jkschneider
Contributor

@koa Of course you are free to turn off pause detection if you'd like. I wouldn't underestimate how significant that jitter can be, though. A buddy at Netflix reported a 12-second VM pause in EC2 the other day.

@jkschneider jkschneider modified the milestones: 1.0.7, 1.0.8 Oct 15, 2018
@shakuzen
Member

Just reporting that we were also bitten by this bug on an application with long pauses, where our dashboards showed a spike in traffic of 800,000+ requests per second on an endpoint that really had 1 request per minute.

@jkschneider
Contributor

Wish we could have gotten it fixed in time for 1.1.0, but it is not trivial. Will have to put it in a patch release :/

@mweirauch
Contributor

mweirauch commented Nov 4, 2018

While looking into this, I think we should also review some of the Timer usage in JvmGcMetrics (e.g. jvm.gc.pause), which I assume is used there for convenience. I am not quite sure whether those timers should be enriched with pauses from the PauseDetector. (Digging into it a bit, they were at least part of the triggers, and it's beyond my knowledge whether LatencyUtils pause detection should or should not add to these timings, which we derive from JMX.)

@shakuzen did you observe some crazy jvm.gc.pause numbers during that "spike", too?

@shakuzen
Member

shakuzen commented Nov 6, 2018

did you observe some crazy jvm.gc.pause numbers during that "spike", too?

@mweirauch Yes. The jvm.gc.pause rate also shows an 800,000+ GCs per second spike on that instance.

@jkschneider
Contributor

@mweirauch Ouch, we don't currently have a way to disable pause detection on a single timer.

@lixiang4543

I also encountered this problem. How did you solve it?

@mweirauch
Contributor

@lixiang4543 It's currently not solved. You can mitigate it by setting a NoPauseDetector; see the first post.

@lixiang4543

@mweirauch How many versions does this problem affect? My Spring Boot version is 2.0.2; there is no PrometheusMeterRegistry class.

@lixiang4543

How did this problem arise?

@tobbco

tobbco commented Nov 28, 2018

How is this produced?

java.lang.Thread.State: RUNNABLE
at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176)
at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135)
at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121)
at io.micrometer.core.instrument.AbstractTimer$$Lambda$401/1892795103.handlePauseEvent(Unknown Source)
at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)

@595979101

I used Spring Boot 2.1, which has Micrometer 1.1, but still had this problem. *_seconds_count increases crazily, up to a billion, which is impossible. What do I do now?
Do we need to write our own set of rules to compute it and then override it?

@jkschneider
Contributor

@595979101 You can turn off pause detection altogether for now:

registry.config().pauseDetector(new NoPauseDetector());

@595979101

@jkschneider thank you very much.

@595979101

@jkschneider I'm sorry to bother you. I set NoPauseDetector, but *_service_count is still too high, and it happened after a pause. This phenomenon happens from time to time.

@jkschneider
Contributor

jkschneider commented Dec 31, 2018

@595979101 How on earth does service_seconds_sum go negative?

@595979101

@jkschneider I don't know, but there was a period of time before this happened when a pause occurred. I found a daemon thread from org.LatencyUtils.SimplePauseDetector. If I kill it, do you think there will be a problem?

@jkschneider
Contributor

I don't see how such a daemon thread would exist if you were using NoPauseDetector everywhere. Perhaps you are using it in some places and not others?

@shakuzen
Member

We will switch the default to the no-op NoPauseDetector in 1.0.10, 1.1.4, and 1.2.0 to work around this issue (see #1324). Hopefully in the future we will resolve this properly and restore the functionality lost by disabling it.
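
For anyone who still wants VM-pause compensation once that default lands, it can be opted back into explicitly (a sketch; the 100ms values mirror the old defaults mentioned earlier in this thread):

registry.config().pauseDetector(
        new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)));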

@addozhang

addozhang commented Jul 19, 2019

I reproduced this issue in debug mode on my local machine.

This happens when a pause occurs in the system: the PauseDetector detects the pause and compensates for the latency by invoking AbstractTimer#recordValueWithExpectedInterval(), which compensates the count together with the time.

When a long pause occurs, it falls into the compensation loop inside recordValueWithExpectedInterval. The looped operation keeps decreasing the estimatedInterval evaluated by TimeCappedMovingAverageIntervalEstimator, because in the #record method the estimator records each compensation interval as if it were a normal record.

During one captured latency compensation, nanoValue and expectedIntervalBetweenValueSamples were 6301897087 and 193.
The first is much larger than the second, so the loop will run 6301897087 / 193 times.
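
To put rough numbers on that (a back-of-the-envelope sketch using the values captured above, assuming one synthetic sample is recorded per expected interval):

long nanoValue = 6_301_897_087L;                 // captured pause length in ns
long expectedIntervalBetweenValueSamples = 193L; // captured estimated interval in ns
long syntheticSamples = nanoValue / expectedIntervalBetweenValueSamples;
// ~32.7 million extra increments of *_seconds_count from a single pause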


The benchmark test:

  1. JMeter: 2 threads, 20k requests total.

All goes well; the recorded count matches the request total.


  2. At SimplePauseDetector#L124, add a 300ms stop.

The recorded count is far more than the actual request count, and it keeps increasing even when no requests are coming in.


It was still climbing many minutes after JMeter was shut down.

@addozhang

addozhang commented Jul 25, 2019

Continuing my previous comment:

Reproduce

I reproduced the case in a Docker container by triggering a full GC via a heap dump every second; the count kept increasing even with no incoming requests.
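
As an illustrative alternative to the heap-dump approach (a sketch, not exactly what was used here), repeated full GCs can also be provoked from inside the JVM:

// request a full GC once per second to provoke pauses (sketch)
ScheduledExecutorService gcTrigger = Executors.newSingleThreadScheduledExecutor();
gcTrigger.scheduleAtFixedRate(System::gc, 1, 1, TimeUnit.SECONDS);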


Workaround solution

A customized PrometheusMeterRegistry which creates a customized timer that compensates only the time once a pause has occurred.


@markmaxey

markmaxey commented Mar 16, 2021

What is the status of this issue? I spent almost 2 days trying to figure out why my counts were off only to discover there was a background pause detector thread mutating the counts seemingly randomly.

I have a very simple way of reproducing the problem. Below is a parameterized JUnit test recording a Timer duration 10,000,000 times in another (single) thread. The one with the pause detector fails while the one without passes. Another test (not shown below) where the Timer is recorded in the same "main" thread without using another thread / executor actually passes. The only time it fails is when using pause detection in a multi-threaded context. No amount of synchronization helps because the pause detector runs in the background mutating the timer.

package test;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.pause.NoPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import static junit.framework.TestCase.assertEquals;

@RunWith(Parameterized.class)
public class TestTimerCount {
    private final long LOOP_TOTAL = 10000000;
    private final boolean usePauseDetector;

    public TestTimerCount(final boolean usePauseDetector) { this.usePauseDetector = usePauseDetector; }

    @Parameterized.Parameters(name = "{index}: test timer with usePauseDetector({0})")
    public static Collection<Boolean> data() { return Arrays.asList(true, false); }

    class Runner implements Runnable {
        final Timer cut;

        Runner() {
            Timer.Builder builder = Timer.builder(Boolean.toString(usePauseDetector));
            if (!usePauseDetector) builder.pauseDetector(new NoPauseDetector());
            cut = builder.register(new SimpleMeterRegistry());
        }

        @Override public void run() { cut.record(1, TimeUnit.MILLISECONDS); }
    }

    @Test public void test() {
        final ExecutorService executorService = Executors.newSingleThreadExecutor();
        final Runner runner = new Runner();

        for(int ndx = 0; ndx < LOOP_TOTAL; ndx++) executorService.submit(runner);

        try {
            executorService.shutdown();
            executorService.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {}

        assertEquals(LOOP_TOTAL, runner.cut.count());
    }
}

@shakuzen
Member

What is the status of this issue?

See #844 (comment). It should not be an issue for you if you are on a supported version of Micrometer and are not explicitly configuring a pause detector.

@markmaxey

@shakuzen - Agreed. That's what my example above shows. From my newbie perspective,

  1. I believe most people would expect the count to represent the number of samples collected by calling the Timer API. Letting the pause detector (or anything else) add additional samples causes confusion for users looking at the counts and rates. In my case, the counts were up to 3 times higher than the actual count. This is not a small, unnoticeable skew.
  2. At the very least, the "micrometer concepts" documentation on timer pause detectors should state that using a pause detector mutates the count as well as the values. It may be obvious once you read up on how the pause detector works or once you look at the code, but it isn't obvious from the documentation alone.
  3. I think my biggest concern is addressed by "Switch default pause detector to NoPauseDetector" #1324.

Thanks!
