Description
Problem
I have a check that runs every 5 minutes. When I inspect the underlying data in Mimir, I see an inconsistent number of samples for each probe event. I have created a spreadsheet to demonstrate what I have versus what I would expect.
If you want to view the raw data and experiment, add yourself to my stack and see it here
In a five-minute period I would expect to see 3 samples per probe, but the count alternates between 2 and 4. There should be a consistent number of samples per execution.
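To make the per-window count easy to reproduce outside the spreadsheet, here is a minimal sketch that buckets sample timestamps into aligned five-minute windows and counts them. The function name `samples_per_window`, the half-open window convention `[start, start + 300)`, and the timestamps are all assumptions for illustration; they are not taken from Mimir or from the real data.

```python
from collections import Counter


def samples_per_window(timestamps, window_seconds=300):
    """Count samples per aligned window, keyed by window start (seconds)."""
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of its aligned window.
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical timestamps: evenly spaced samples give 3 per window.
print(samples_per_window([0, 100, 200, 300, 400, 500]))

# Hypothetical timestamps: the same six samples, shifted, give 2 then 4.
print(samples_per_window([10, 290, 310, 320, 330, 599]))
```

Running something like this against the exported sample timestamps would show whether the 2/4 alternation comes from where the samples fall relative to the window boundaries.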
The context of this change is ensuring we can accurately portray uptime. Given that we need a republish mechanism, I am thinking about how to integrate it with PromQL and Grafana's time range querying mechanism (evaluating periods in neat blocks of time, e.g. always 09:15:00 - 09:20:00 and 09:20:00 - 09:25:00 rather than 09:16:17 - 09:21:17). I am considering this query as a version for v4:
max by () (round(avg_over_time(probe_success{job="${job}", instance="${instance}", probe=~"${probe}"}[${interval}])))
The second issue I face with the republish mechanism is that I need an odd number of samples per probe execution for the above query to work accurately. In the spreadsheet above I've made a second tab showing two hypothetical scenarios for a check with a four-minute interval.
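The sample-count requirement comes down to ties: with an even number of 0/1 samples, the average can land exactly on 0.5, and PromQL's `round()` is documented to resolve ties by rounding up. The sketch below mimics `round(avg_over_time(...))` for a single window. The helper names and the sample lists are hypothetical, and it only models the arithmetic, not how Mimir selects samples for a range.

```python
import math


def promql_round(value):
    """Round to the nearest integer, resolving ties upward."""
    return math.floor(value + 0.5)


def window_result(samples):
    """Mimic round(avg_over_time(probe_success[...])) for one window."""
    return promql_round(sum(samples) / len(samples))


# Even sample counts: a half-failed window ties at 0.5 and rounds up to 1.
print(window_result([1, 0]))
print(window_result([1, 1, 0, 0]))

# Odd sample counts: no tie is possible, so the majority decides.
print(window_result([1, 0, 0]))
print(window_result([1, 1, 0]))
```

Under these assumptions, an even count reports a half-failed window as a success, while an odd count always follows the majority of samples.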
If we hard-code the two-minute interval, failures that occur alongside successes will never be reported.