
metrics: sliding histogram #26356

Open · wants to merge 3 commits into master

Conversation

joshuacolvin0
Contributor

The current go-ethereum histogram metrics object resets whenever the metrics
endpoint is scraped. The reset is done to prevent histograms from going stale
for commands that are rarely called. However, it breaks many Prometheus
assumptions (the exported values end up depending on the scrape schedule, and
a second scraper wipes out data the first one expects to see), which makes
computing correct metrics difficult and unreliable.

The go-ethereum metrics package appears to be a fork of
rcrowley/go-metrics, which is in turn a port of the Java library
https://github.com/dropwizard/metrics. Looking at
https://metrics.dropwizard.io/4.2.0/manual/core.html#exponentially-decaying-reservoirs
there is a SlidingTimeWindowArrayReservoir implementation:

> SlidingTimeWindowArrayReservoir is comparable with
> ExponentiallyDecayingReservoir in terms of GC overhead and performance. As for
> required memory, SlidingTimeWindowArrayReservoir takes ~128 bits per stored
> measurement and you can simply calculate required amount of heap.
> Example: 10K measurements / sec with reservoir storing time of 1 minute will
> take 10000 * 60 * 128 / 8 = 9600000 bytes ~ 9 megabytes

Here is more information on the sampling error introduced by Exponential Decay
sampling:
https://medium.com/expedia-group-tech/your-latency-metrics-could-be-misleading-you-how-hdrhistogram-can-help-9d545b598374
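
To make the idea concrete, here is a minimal, self-contained Go sketch of a sliding time-window sample in the spirit of SlidingTimeWindowArrayReservoir. It is not the code in this PR; the type and method names are made up for illustration. The key point is that measurements older than the window are dropped on access, and reading the sample never clears it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timedValue pairs one measurement with the time it was recorded.
type timedValue struct {
	at    time.Time
	value int64
}

// slidingSample keeps every measurement recorded within the last `window`
// and discards anything older. Reading it never clears it, so the exported
// percentiles do not depend on when (or how often) Prometheus scrapes.
type slidingSample struct {
	mu     sync.Mutex
	window time.Duration
	values []timedValue
}

func newSlidingSample(window time.Duration) *slidingSample {
	return &slidingSample{window: window}
}

// Update records one measurement with the current timestamp.
func (s *slidingSample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	s.trim(now)
	s.values = append(s.values, timedValue{at: now, value: v})
}

// Values returns the measurements still inside the window, oldest first.
func (s *slidingSample) Values() []int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.trim(time.Now())
	out := make([]int64, len(s.values))
	for i, tv := range s.values {
		out[i] = tv.value
	}
	return out
}

// trim drops measurements older than the window. Entries are appended in
// time order, so only the prefix can be stale.
func (s *slidingSample) trim(now time.Time) {
	cutoff := now.Add(-s.window)
	i := 0
	for i < len(s.values) && s.values[i].at.Before(cutoff) {
		i++
	}
	s.values = s.values[i:]
}

func main() {
	s := newSlidingSample(time.Minute)
	for i := int64(1); i <= 5; i++ {
		s.Update(i * 10)
	}
	fmt.Println(s.Values()) // [10 20 30 40 50]
}
```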

After implementing this change, per-RPC-call request rates display correctly with a simple Grafana rate() graph, and rare get_log query timeouts that were previously being lost now show up in the P9999 histogram graph.

I only replaced some of the NewExpDecaySample calls to be conservative, but it would likely be an improvement to replace all of them eventually.
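
For illustration, a call-site swap might look like the sketch below. metrics.NewRegisteredHistogram and metrics.NewExpDecaySample are existing go-ethereum APIs, but NewSlidingTimeWindowSample is a hypothetical constructor name standing in for whatever sliding-window sample this PR introduces, so the fragment will not compile as-is against upstream.

```go
package main

import (
	"time"

	"github.com/ethereum/go-ethereum/metrics"
)

// registerGetLogsHistogram shows the before/after shape of a call-site change.
// The metric name mirrors the rpc_duration_eth_getLogs_success series mentioned
// below. NOTE: NewSlidingTimeWindowSample is a placeholder name for illustration.
func registerGetLogsHistogram(sliding bool) metrics.Histogram {
	if !sliding {
		// Current pattern: exponentially decaying sample (reset when scraped).
		return metrics.NewRegisteredHistogram("rpc/duration/eth_getLogs/success", nil,
			metrics.NewExpDecaySample(1028, 0.015))
	}
	// Proposed pattern: keep roughly the last minute of raw measurements.
	return metrics.NewRegisteredHistogram("rpc/duration/eth_getLogs/success", nil,
		metrics.NewSlidingTimeWindowSample(time.Minute))
}
```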

@karalabe
Member

Hmm, so... we kind of agree that the issue is real, the resetting things were a bit of a wonky hack. That said, figuring out the exact issue and evaluating / fixing it might take a bit as it's not really a priority for us. TL;DR This PR might take a bit, please bear with us :P

@fjl changed the title from "Sliding histogram" to "metrics: sliding histogram" on Dec 14, 2022
@joshuacolvin0
Contributor Author

> Hmm, so... we kind of agree that the issue is real, the resetting things were a bit of a wonky hack. That said, figuring out the exact issue and evaluating / fixing it might take a bit as it's not really a priority for us. TL;DR This PR might take a bit, please bear with us :P

Sounds good. I tried to use a standard, improved solution from upstream libraries; happy to look at other solutions if desired. This particular solution has been working well for us.

@joshuacolvin0
Contributor Author

To add a note for anyone who wants to use the existing metrics to graph per-RPC-call request rates in Grafana:

  1. Do not use rate() or irate().
  2. Divide the count by the number of seconds between Prometheus scrapes. For example, if metrics are scraped by Prometheus every 30 seconds, you can graph rpc_duration_eth_getLogs_success_count{}/30. This assumes there is only one Prometheus scraper.

@holiman
Contributor

holiman commented Oct 23, 2023

The design of the metrics was refactored in #28035, so this PR would have to be updated to the new write/read interface split.

Apologies again for us not having gotten this reviewed in nearly a year!

@joshuacolvin0
Contributor Author

> The design of the metrics was refactored in #28035, so this PR would have to be updated to the new write/read interface split.
>
> Apologies again for us not having gotten this reviewed in nearly a year!

Thanks for the ping, will get to this soon

Commit: Use single function for creating bounded histogram sample