Ensure republish mechanism creates consistent results #1253

Open
@ckbedwell

Problem

I have a check that runs every 5 minutes, and when I look at the underlying data in Mimir I see an inconsistent number of samples for each probe event. I have created a spreadsheet to demonstrate what I have versus what I would expect.

If you want to view the raw data and experiment, add yourself to my stack and see it here

In a five-minute period I would expect to see 3 samples per probe, but it alternates between 2 and 4 samples. There should be a consistent number of samples per execution.

The context of this change is ensuring we can accurately portray uptime. Given that we need a republish mechanism, I am thinking about how to integrate it with PromQL and Grafana's time range querying mechanism (evaluating periods in neat blocks of time, e.g. always 09:15:00 - 09:20:00 and 09:20:00 - 09:25:00 rather than 09:16:17 - 09:21:17). I am considering this query as a version for v4:

max by () (round(avg_over_time(probe_success{job="${job}", instance="${instance}", probe=~"${probe}"}[${interval}])))
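To make the "neat blocks of time" idea concrete, here is a minimal Python sketch of snapping an arbitrary timestamp to the aligned block that contains it. The function name `aligned_window` and the example date are hypothetical and only illustrate the arithmetic. This is not how Grafana or Mimir implement it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: blocks are aligned to the Unix epoch (an assumption).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def aligned_window(ts: datetime, interval: timedelta) -> tuple[datetime, datetime]:
    """Return the [start, end) block of length `interval` that contains `ts`."""
    # Integer division of two timedeltas gives the number of whole blocks elapsed.
    blocks_elapsed = (ts - EPOCH) // interval
    start = EPOCH + blocks_elapsed * interval
    return start, start + interval


# Example with an arbitrary date: 09:16:17 falls in the 09:15:00 - 09:20:00 block.
ts = datetime(2024, 1, 1, 9, 16, 17, tzinfo=timezone.utc)
start, end = aligned_window(ts, timedelta(minutes=5))
print(start.time(), end.time())
```

With blocks aligned like this, every evaluation of `[${interval}]` covers the same wall-clock span, which is what makes the sample count per block matter.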

The second issue I face with the republish mechanism is that I need an odd number of samples per probe execution for the above query to work accurately. In the spreadsheet above I've made a second tab showing two hypothetical scenarios for a check with a four-minute interval.
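A small Python sketch of why the sample count matters for the query above, under my reading of the problem: with an even number of samples, the average can land exactly on 0.5, and PromQL's `round()` resolves ties by rounding up. The helper names and the sample values are hypothetical. Python's built-in `round()` uses banker's rounding, so the sketch uses `floor(x + 0.5)` instead.

```python
import math


def promql_round(x: float) -> int:
    # PromQL's round() rounds ties up; Python's round() does not, so emulate it.
    return math.floor(x + 0.5)


def window_status(samples_by_probe: dict[str, list[int]]) -> int:
    """Mimic max by () (round(avg_over_time(probe_success[...]))) for one block.

    `samples_by_probe` maps a probe name to its 0/1 samples inside the block.
    """
    return max(
        promql_round(sum(samples) / len(samples))
        for samples in samples_by_probe.values()
    )


# Odd sample counts always have a clear majority.
print(window_status({"probe-a": [1, 1, 0]}))  # mostly successes
print(window_status({"probe-a": [1, 0, 0]}))  # mostly failures

# Even sample counts can tie at 0.5, which rounds up to a success.
print(window_status({"probe-a": [1, 0]}))
print(window_status({"probe-a": [1, 1, 0, 0]}))
```

In the tied cases the block is reported as up even though half of its samples are failures, which is the kind of inaccuracy an odd sample count avoids.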

If we hard-code the two-minute interval, failures that occur alongside successes will never get reported.
