metrics: sliding histogram #26356
Conversation
Hmm, so... we kind of agree that the issue is real; the resetting thing was a bit of a wonky hack. That said, figuring out the exact issue and evaluating / fixing it might take a bit, as it's not really a priority for us. TL;DR: this PR might take a while, please bear with us :P
Sounds good. I tried to use a standard, improved solution from upstream libraries; happy to look at other solutions if desired. This particular solution has been working well for us.
To add an additional note for anyone who wants to use existing metrics to graph per-RPC-call request rates in Grafana:
The design of the metrics was refactored in #28035, so this PR would have to be updated to the new write/read interface split. Apologies again for not having gotten this reviewed in nearly a year!
Thanks for the ping, will get to this soon.
Use single function for creating bounded histogram sample for latency histograms
force-pushed from 43a342e to 00e202f

The current go-ethereum histogram metrics object resets whenever the metrics endpoint is scraped. The reset is done to prevent histograms from going stale for commands that are rarely called. However, it breaks many Prometheus assumptions and makes computing correct metrics difficult and unreliable.

The go-ethereum metrics package appears to be a fork of rcrowley/go-metrics, which in turn is a port of the Java library https://github.com/dropwizard/metrics. Per https://metrics.dropwizard.io/4.2.0/manual/core.html#exponentially-decaying-reservoirs, that library provides a `SlidingTimeWindowArrayReservoir` implementation:

> SlidingTimeWindowArrayReservoir is comparable with ExponentiallyDecayingReservoir in terms GC overhead and performance. As for required memory, SlidingTimeWindowArrayReservoir takes ~128 bits per stored measurement and you can simply calculate required amount of heap. Example: 10K measurements / sec with reservoir storing time of 1 minute will take 10000 * 60 * 128 / 8 = 9600000 bytes ~ 9 megabytes

Here is more information on the sampling error introduced by exponential-decay sampling: https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374
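For intuition, here is a minimal sketch of the sliding-time-window idea in Go. The names are invented for illustration, and this is not the PR's actual code (which, per the commit title, adds a single function for creating a bounded histogram sample against the metrics package's sample interface); a short usage demo follows at the end of this description.

```go
package main

import (
	"sync"
	"time"
)

// slidingSample is a minimal sliding-time-window reservoir: it keeps every
// measurement observed within the last `window` and evicts older ones, so
// percentiles are always computed over recent data without ever resetting.
type slidingSample struct {
	mu     sync.Mutex
	window time.Duration
	times  []time.Time
	values []int64
}

func newSlidingSample(window time.Duration) *slidingSample {
	return &slidingSample{window: window}
}

// Update records a measurement and evicts entries older than the window.
func (s *slidingSample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	s.trim(now)
	s.times = append(s.times, now)
	s.values = append(s.values, v)
}

// Values returns a copy of the measurements currently inside the window.
func (s *slidingSample) Values() []int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.trim(time.Now())
	out := make([]int64, len(s.values))
	copy(out, s.values)
	return out
}

// trim drops expired entries. Measurements are appended in time order, so
// everything before the first entry newer than the cutoff can be dropped.
func (s *slidingSample) trim(now time.Time) {
	cutoff := now.Add(-s.window)
	i := 0
	for i < len(s.times) && s.times[i].Before(cutoff) {
		i++
	}
	s.times = s.times[i:]
	s.values = s.values[i:]
}
```

Storing every measurement is what makes the memory bound in the quote above easy to compute: roughly 16 bytes per entry times the peak update rate times the window length.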
After implementing this change, per-RPC-call request rates display properly with a simple Grafana `rate()` graph, and rare get_log query timeouts that were previously being lost started showing up in the P9999 histogram graph.

I only replaced some of the `NewExpDecaySample` calls to be conservative, but it would likely be an improvement to eventually replace all of them.
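To see why a windowed reservoir keeps such rare timeouts visible, the sketch above can be extended with a nearest-rank percentile check. This is again illustrative only; it continues in the same file as the `slidingSample` sketch:

```go
// Requires adding "fmt", "math", and "sort" to the imports above.

// percentile returns the p-th quantile (0 < p <= 1) of the current window
// using the nearest-rank method.
func percentile(s *slidingSample, p float64) int64 {
	vs := s.Values()
	if len(vs) == 0 {
		return 0
	}
	sort.Slice(vs, func(i, j int) bool { return vs[i] < vs[j] })
	idx := int(math.Ceil(p*float64(len(vs)))) - 1
	if idx < 0 {
		idx = 0
	}
	return vs[idx]
}

func main() {
	s := newSlidingSample(time.Minute)
	for i := 0; i < 499; i++ {
		s.Update(5) // 499 fast calls, 5 ms each
	}
	s.Update(30000) // one rare 30-second timeout
	// The outlier stays in the reservoir until it ages out of the window,
	// so a high percentile reflects it instead of losing it to a reset.
	fmt.Println("p99.9 =", percentile(s, 0.999), "ms") // prints 30000
}
```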