
Pause Detection contributes to count when it should just contribute to total time. #844

Open
koa opened this issue Sep 12, 2018 · 30 comments

@koa

koa commented Sep 12, 2018

Situation:

  • Spring Boot 2.0.4
  • Prometheus
  • VMWare Server
  • high throughput (20k timers/s per instance)

Symptoms:

  • *_seconds_count is too high (by a factor of roughly 5k; there are huge jumps on a few counters)
  • 1 core is used 100% by org.LatencyUtils.PauseDetector$PauseDetectorThread.run after 1-8h of runtime:
   java.lang.Thread.State: RUNNABLE
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram$FixedBoundaryHistogram.leastLessThanOrEqualTo(TimeWindowFixedBoundaryHistogram.java:160)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram$FixedBoundaryHistogram.record(TimeWindowFixedBoundaryHistogram.java:140)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:60)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:35)
        at io.micrometer.core.instrument.distribution.AbstractTimeWindowHistogram.recordLong(AbstractTimeWindowHistogram.java:217)
        at io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram.recordLong(TimeWindowFixedBoundaryHistogram.java:35)
        at io.micrometer.prometheus.PrometheusTimer.recordNonNegative(PrometheusTimer.java:69)
        at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176)
        at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135)
        at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121)
        at io.micrometer.core.instrument.AbstractTimer$$Lambda$448/1392734885.handlePauseEvent(Unknown Source)
        at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)

Workaround:
Switch off Pause Detection

@Bean
public MeterRegistryCustomizer<PrometheusMeterRegistry> registryConfig() {
    return r -> r.config().pauseDetector(new NoPauseDetector());
}
@jkschneider
Contributor

jkschneider commented Sep 12, 2018

I'll state my belief that the amount of work that the pause detector does never changes, that it has a fixed CPU overhead governed by its sleep and pause intervals. Maybe you can demonstrate that I'm wrong in this belief ;)

CPU overhead on pause detection should be inversely proportional to the sleep and pause intervals on the detector. By default they are both 100ms. Could you try setting them higher and observing again?

For example:

registry.config().pauseDetector(new ClockDriftPauseDetector(Duration.ofSeconds(1), Duration.ofSeconds(1)));

@koa
Author

koa commented Sep 13, 2018

Interesting.
The CPU waste seems to be reduced, but *_seconds_count still increases too much.
Before (with the default settings) the increment was too high by a factor of 5k; with your settings it is only about 5% too high.

@jkschneider
Contributor

Pause detection doesn't affect *_seconds_count; it only contributes latency to *_seconds_sum. Could you make a reproducible example?

CPU consumption depends on the hardware, so there is no single value we can choose that always costs 1% or something like that. We've just picked a default that is good for most modern infrastructure.

@koa
Author

koa commented Sep 13, 2018

How does it avoid incrementing *_seconds_count?
The stack trace above shows a call to record() on the timer (coming from the PauseDetector).
Every call to this method increments *_seconds_count by 1 (PrometheusTimer.java:63).
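
Roughly (a paraphrased sketch, not the actual Micrometer source), the recording path looks like the snippet below, so a compensation call from the pause detector bumps the throughput count as well as the total time:

// paraphrased sketch of PrometheusTimer's recording path (not the actual source)
protected void recordNonNegative(long amount, TimeUnit unit) {
    count.increment();                    // <-- the *_seconds_count increment
    totalTime.add(unit.toNanos(amount));  // <-- the *_seconds_sum contribution
    histogram.recordLong(unit.toNanos(amount));
}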

@jkschneider jkschneider changed the title Pause Detection goes wild on VMWare server Pause Detection contributes to count when it should just contribute to total time. Sep 13, 2018
@jkschneider jkschneider added this to the 1.0.7 milestone Sep 13, 2018
@jkschneider jkschneider added the bug A general bug label Sep 13, 2018
@jkschneider
Contributor

Thanks. I think I finally understand. So if it's OK I'll consider the CPU discussion over, since that seems to be working as intended. This issue then becomes about ensuring that pause detection contributes to total time and not throughput.

@koa
Author

koa commented Sep 14, 2018

In my situation, both symptoms are a bigger problem than a little jitter on the measured durations.

@jkschneider
Contributor

@koa Of course you are free to turn off pause detection if you'd like. I wouldn't underestimate how significant that jitter can be, though. A buddy at Netflix reported a 12-second VM pause in EC2 the other day.

@jkschneider jkschneider modified the milestones: 1.0.7, 1.0.8 Oct 15, 2018
@shakuzen
Member

Just reporting that we were also bitten by this bug on an application with long pauses, where our dashboards showed a spike in traffic of 800,000+ requests per second on an endpoint that really had 1 request per minute.

@jkschneider
Contributor

Wish we could have gotten it fixed in time for 1.1.0, but it is not trivial. Will have to put it in a patch release :/

@mweirauch
Contributor

mweirauch commented Nov 4, 2018

While looking into this, I think we should also review some of the Timer usage in JvmGcMetrics (e.g. jvm.gc.pause), which I assume is used there for convenience. I am not quite sure whether those timers should be enriched with pauses from the PauseDetector. (Digging into it a bit, they were at least part of the triggers, and it's beyond my knowledge whether LatencyUtils pause detection should or should not add to these timings, which we derive from JMX.)

@shakuzen did you observe some crazy jvm.gc.pause numbers during that "spike", too?

@shakuzen
Member

shakuzen commented Nov 6, 2018

did you observe some crazy jvm.gc.pause numbers during that "spike", too?

@mweirauch Yes. The jvm.gc.pause rate also shows an 800,000+ GCs per second spike on that instance.

@jkschneider
Contributor

@mweirauch Ouch, we don't currently have a way to disable pause detection on a single timer.

@lixiang4543

I also encountered this problem. How did you solve it?

@mweirauch
Contributor

@lixiang4543 It's currently not solved. You can mitigate it by setting a NoPauseDetector; see the first post.

@lixiang4543

@mweirauch How many versions does this problem affect? My Spring Boot version is 2.0.2; there is no PrometheusMeterRegistry class.

@lixiang4543

How did this problem arise?

@tobbco

tobbco commented Nov 28, 2018

How is this produced?

java.lang.Thread.State: RUNNABLE
at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:176)
at io.micrometer.core.instrument.AbstractTimer.recordValueWithExpectedInterval(AbstractTimer.java:135)
at io.micrometer.core.instrument.AbstractTimer.lambda$initPauseDetector$1(AbstractTimer.java:121)
at io.micrometer.core.instrument.AbstractTimer$$Lambda$401/1892795103.handlePauseEvent(Unknown Source)
at org.LatencyUtils.PauseDetector$PauseDetectorThread.run(PauseDetector.java:107)

@595979101

I used Spring Boot 2.1, which has Micrometer 1.1, but still had this problem. *_seconds_count increases crazily, up to a billion, which is impossible. What do I do now?
Do we need to write our own set of rules to compute it and then override it?

@jkschneider
Contributor

@595979101 You can turn off pause detection altogether for now:

registry.config().pauseDetector(new NoPauseDetector());

@595979101

@jkschneider thank you very much.

@595979101

@jkschneider I'm sorry to bother you. I set NoPauseDetector, but *_service_count is still too high, and it happened after a pause. This phenomenon happens from time to time.

@jkschneider
Contributor

jkschneider commented Dec 31, 2018

@595979101 How on earth does service_seconds_sum go negative?

@595979101

@jkschneider I don't know, but there was a period of time before this happened when a pause occurred. I found a daemon thread from org.LatencyUtils.SimplePauseDetector. If I kill it, do you think there will be a problem?

@jkschneider
Contributor

I don't see how such a daemon thread would exist if you were using NoPauseDetector everywhere. Perhaps you are using it in some places and not others?

@shakuzen
Member

We will switch the default to the no-op NoPauseDetector in 1.0.10, 1.1.4, and 1.2.0 to work around this issue (see #1324). Hopefully in the future we will resolve this properly and restore the functionality lost by disabling it.
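
For anyone who still wants VM-pause compensation once that default lands, it can be opted back into explicitly (a sketch; the 100ms values mirror the old defaults mentioned earlier in this thread):

registry.config().pauseDetector(
        new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)));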

@addozhang

addozhang commented Jul 19, 2019

I reproduced this issue in debug mode on my local machine.

This happens when a pause occurs in the system: the PauseDetector detects the pause and compensates for the latency by invoking AbstractTimer#recordValueWithExpectedInterval(), which compensates the count together with the time.

When a long pause occurs, it falls into the compensation loop inside recordValueWithExpectedInterval. The looped operation keeps decreasing the estimatedInterval evaluated by TimeCappedMovingAverageIntervalEstimator, because in the #record method the estimator records each compensation interval as if it were a normal record.

During one captured latency compensation, nanoValue and expectedIntervalBetweenValueSamples were 6301897087 and 193.
The first is much larger than the second, so the loop will run 6301897087 / 193 times.
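
To put rough numbers on that (a back-of-the-envelope sketch using the values captured above, assuming one synthetic sample is recorded per expected interval):

long nanoValue = 6_301_897_087L;                 // captured pause length in ns
long expectedIntervalBetweenValueSamples = 193L; // captured estimated interval in ns
long syntheticSamples = nanoValue / expectedIntervalBetweenValueSamples;
// ~32.7 million extra increments of *_seconds_count from a single pause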


The benchmark test:

  1. JMeter: 2 threads, 20k requests total.

All goes well; the recorded count matches the request total.


  2. At SimplePauseDetector#L124, add a 300ms stop.

The recorded count is far more than the actual request count, and it keeps increasing even when no requests are coming in.


It was still climbing many minutes after JMeter was shut down.

@addozhang

addozhang commented Jul 25, 2019

Continuing my previous comment:

Reproduce

I reproduced the case in a Docker container by triggering a full GC via a heap dump every second; the count kept increasing even with no incoming requests.
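
As an illustrative alternative to the heap-dump approach (a sketch, not exactly what was used here), repeated full GCs can also be provoked from inside the JVM:

// request a full GC once per second to provoke pauses (sketch)
ScheduledExecutorService gcTrigger = Executors.newSingleThreadScheduledExecutor();
gcTrigger.scheduleAtFixedRate(System::gc, 1, 1, TimeUnit.SECONDS);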


Workaround solution

A customized PrometheusMeterRegistry which creates a customized timer that compensates only the time once a pause has occurred.


@markmaxey

markmaxey commented Mar 16, 2021

What is the status of this issue? I spent almost 2 days trying to figure out why my counts were off only to discover there was a background pause detector thread mutating the counts seemingly randomly.

I have a very simple way of reproducing the problem. Below is a parameterized JUnit test recording a Timer duration 10,000,000 times in another (single) thread. The one with the pause detector fails while the one without passes. Another test (not shown below) where the Timer is recorded in the same "main" thread without using another thread / executor actually passes. The only time it fails is when using pause detection in a multi-threaded context. No amount of synchronization helps because the pause detector runs in the background mutating the timer.

package test;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.pause.NoPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import static junit.framework.TestCase.assertEquals;

@RunWith(Parameterized.class)
public class TestTimerCount {
    private final long LOOP_TOTAL = 10000000;
    private final boolean usePauseDetector;

    public TestTimerCount(final boolean usePauseDetector) { this.usePauseDetector = usePauseDetector; }

    @Parameterized.Parameters(name = "{index}: test timer with usePauseDetector({0})")
    public static Collection<Boolean> data() { return Arrays.asList(true, false); }

    class Runner implements Runnable {
        final Timer cut;

        Runner() {
            Timer.Builder builder = Timer.builder(Boolean.toString(usePauseDetector));
            if (!usePauseDetector) builder.pauseDetector(new NoPauseDetector());
            cut = builder.register(new SimpleMeterRegistry());
        }

        @Override public void run() { cut.record(1, TimeUnit.MILLISECONDS); }
    }

    @Test public void test() {
        final ExecutorService executorService = Executors.newSingleThreadExecutor();
        final Runner runner = new Runner();

        for(int ndx = 0; ndx < LOOP_TOTAL; ndx++) executorService.submit(runner);

        try {
            executorService.shutdown();
            executorService.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {}

        assertEquals(LOOP_TOTAL, runner.cut.count());
    }
}

@shakuzen
Member

What is the status of this issue?

See #844 (comment). It should not be an issue for you if you are on a supported version of Micrometer and are not explicitly configuring a pause detector.

@markmaxey

@shakuzen - Agreed. That's what my example above shows. From my newbie perspective,

  1. I believe most people would expect the count to represent the number of samples collected by calling the Timer API. Letting the pause detector (or anything else) add additional samples causes confusion for users looking at the counts and rates. In my case, the counts were up to 3 times higher than the actual count. This is not a small, unnoticeable skew.
  2. At the very least, the "micrometer concepts" documentation on timer pause detectors should state that using a pause detector mutates the count as well as the values. It may be obvious once you read up on how the pause detector works or once you look at the code, but it isn't obvious from the documentation alone.
  3. I think my biggest concern is addressed by "Switch default pause detector to NoPauseDetector" #1324.

Thanks!
