Pause Detection contributes to count when it should just contribute to total time. #844
Comments
I'll state my belief that the amount of work that the pause detector does never changes, that it has a fixed CPU overhead governed by its sleep and pause intervals. Maybe you can demonstrate that I'm wrong in this belief ;) CPU overhead on pause detection should be inversely proportional to sleep and pause intervals on the detector. By default they are both 100ms. Could you try setting them higher and reobserve? For example:
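The snippet that followed this comment did not survive in this copy of the thread. As a minimal sketch of the suggestion, assuming Micrometer's `ClockDriftPauseDetector` and purely illustrative 500 ms values (not the values from the original comment), both intervals could be raised on a registry like this:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.distribution.pause.ClockDriftPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class PauseDetectorIntervals {
    public static void main(String[] args) {
        // SimpleMeterRegistry is a stand-in for whatever registry the application uses.
        MeterRegistry registry = new SimpleMeterRegistry();

        // First argument: how long the detector sleeps between checks.
        // Second argument: how long a stall must be before it counts as a pause.
        // 500 ms for both is an illustrative choice, up from the 100 ms default
        // mentioned above.
        registry.config().pauseDetector(
                new ClockDriftPauseDetector(Duration.ofMillis(500), Duration.ofMillis(500)));
    }
}
```

Timers created on the registry after this call pick up the configured detector.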
|
Interesting. |
Pause detection doesn't affect [...]. CPU consumption depends on the hardware, so there is no single value we can choose that always uses 1% or something like that. We've just picked a default that is good for most modern infrastructure. |
How does it avoid the increment of *_seconds_count? |
Thanks. I think I finally understand. So if it's OK I'll consider the CPU discussion over, since that seems to be working as intended. This issue then becomes about ensuring that pause detection contributes to total time and not throughput. |
Both symptoms are a bigger problem than a minimal jitter on measured durations (in my situation). |
@koa Of course you are free to turn off pause detection if you'd like. I wouldn't understate how significant that jitter can be though. Buddy at Netflix reported a 12 second VM pause in EC2 the other day. |
Just reporting that we were also bitten by this bug on an application having long pauses, where our dashboards showed a spike in traffic of 800,000+ requests per second on an endpoint that really had 1 request per minute. |
Wish we could have gotten it fixed in time for 1.1.0, but it is not trivial. Will have to put it in a patch release :/ |
When looking into this, I think we should review some @shakuzen did you observe some crazy |
@mweirauch Yes. The |
@mweirauch Ouch, we don't currently have a way to disable pause detection on a single timer. |
I also encountered this problem. How did you solve it? |
@lixiang4543 It's currently not solved. You can mitigate this by setting a |
@mweirauch How many versions does this problem affect? My Spring Boot version is 2.0.2, and there is no PrometheusMeterRegistry class. |
How did this problem arise? |
How to produce |
I am using Spring Boot 2.1, which ships Micrometer 1.1, but I still have this problem. *_seconds_count increases wildly, up to a billion, which is impossible. What should I do now? |
@595979101 You can turn off pause detection altogether for now:
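The snippet for this comment is missing from this copy of the thread. A minimal sketch of turning pause detection off, assuming a `SimpleMeterRegistry` as a stand-in for the application's registry, might look like this:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.distribution.pause.NoPauseDetector;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class DisablePauseDetection {
    public static void main(String[] args) {
        // Stand-in registry; replace with the registry the application actually uses.
        MeterRegistry registry = new SimpleMeterRegistry();

        // NoPauseDetector is a no-op: no background detector thread is started
        // and no pause compensation is applied to timers created afterwards.
        registry.config().pauseDetector(new NoPauseDetector());
    }
}
```

The call should run before any timers are registered, since timers that already exist keep the detector they were created with.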
|
@jkschneider thank you very much. |
@jkschneider I'm sorry to bother you. I set NoPauseDetector, but *_service_count is still too high, and it happened after the pause. This happens from time to time. |
@595979101 How on earth does |
@jkschneider I don't know. But a pause occurred some time before this happened. I found a daemon thread, org.LatencyUtils.SimplePauseDetector. If I kill it, do you think there will be a problem? |
I don't see how such a daemon thread would exist if you were using |
We will switch the default to the no-op |
I reproduced this issue in debug mode on my local machine. It shows up when a pause occurs in the system. The When a long pause occurs, it falls into the loop highlighted below. The looped operation will cause decreasing of During one captured latency compensation, The benchmark test:
All goes well. The request count is the same as the request total.
Request account is far more than request count. It keeps increasing even when no requests are coming in. |
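To illustrate why a compensation loop like the one described in this comment can inflate the count, here is a simplified model. It is not Micrometer's or LatencyUtils' actual code. It assumes the compensation back-fills one synthetic sample per expected interval across the pause, in the style of expected-interval histogram correction, and every class and method name in it is hypothetical. For contrast, it also shows a variant that adds the pause to total time only, which is what the issue title asks for.

```java
// Hypothetical, simplified model of a timer with two pause-compensation strategies.
final class PauseCompensationModel {
    long count;       // number of recorded samples (what *_seconds_count would report)
    long totalNanos;  // sum of recorded durations (what *_seconds_sum would report)

    // Record one real measurement.
    void record(long nanos) {
        count++;
        totalNanos += nanos;
    }

    // Strategy A (assumed): back-fill one synthetic sample per expected interval,
    // with linearly decreasing durations. Every synthetic sample bumps the count.
    void compensateWithSamples(long pauseNanos, long expectedIntervalNanos) {
        if (expectedIntervalNanos <= 0) {
            return; // nothing sensible to back-fill
        }
        for (long missing = pauseNanos - expectedIntervalNanos;
             missing >= expectedIntervalNanos;
             missing -= expectedIntervalNanos) {
            record(missing);
        }
    }

    // Strategy B: attribute the pause to total time only and leave the count alone.
    void compensateTotalOnly(long pauseNanos) {
        totalNanos += pauseNanos;
    }

    public static void main(String[] args) {
        long pause = 12_000_000_000L; // a 12 s pause, as in the EC2 report above
        long interval = 1_000_000L;   // assumed 1 ms expected interval between samples

        PauseCompensationModel a = new PauseCompensationModel();
        a.record(interval);           // one real request
        a.compensateWithSamples(pause, interval);
        System.out.println("back-fill count: " + a.count);  // prints 12000

        PauseCompensationModel b = new PauseCompensationModel();
        b.record(interval);           // one real request
        b.compensateTotalOnly(pause);
        System.out.println("total-only count: " + b.count); // prints 1
    }
}
```

In this model a single real request followed by one long pause is reported as thousands of requests under strategy A, and the smaller the expected interval, the larger the inflation. Strategy B keeps the count equal to the number of real requests.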
Continuing the previous comment:
reproduce
Reproduced the case in a Docker container by triggering a full GC via a heap dump every second; the kept increasing even with no incoming requests.
workaround solution
Customized |
What is the status of this issue? I spent almost 2 days trying to figure out why my counts were off only to discover there was a background pause detector thread mutating the counts seemingly randomly. I have a very simple way of reproducing the problem. Below is a parameterized JUnit test recording a Timer duration 10,000,000 times in another (single) thread. The one with the pause detector fails while the one without passes. Another test (not shown below) where the Timer is recorded in the same "main" thread without using another thread / executor actually passes. The only time it fails is when using pause detection in a multi-threaded context. No amount of synchronization helps because the pause detector runs in the background mutating the timer.
|
See #844 (comment). It should not be an issue for you if you are on a supported version of Micrometer and are not explicitly configuring a pause detector. |
@shakuzen - Agreed. That's what my example above shows. From my newbie perspective,
Thanks! |
Situation:
Symptoms:
Workaround:
Switch off Pause Detection