The Zen of Prometheus #1783
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
## Instrument first, ask questions later

During development you will never know what questions you need to ask later. Software needs good instrumentation, it's not optional. Metrics are cheap. Use them generously.
A metric is cheap. Metrics are not cheap.
Metrics without labels are cheap, metrics with labels tend to blow up in cardinality. Making it seem that metrics are cheap would likely lead to more users falling into this trap.
Yeah, would be good to be more nuanced here and perhaps link to the cardinality section.
## Measure what users care

Do your users care if your database servers are down? Do they care about your CPU saturation? No, they care about what they experience. They care about whether they can access the page that they have requested and their results are fresh. Think in terms of latencies and availability. Let your [SLO](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/)s guide your instrumentation.
Uhm, you should be measuring all of those things.
You're conflating metrics with alerting here.
I think this should be phrased in a way that measuring all of those things is still important, but that instrumentation should additionally be suitable for SLO-based alerting.
## Avoid missing metrics

Time series that are not present until something happens are difficult to deal with. To avoid this, export 0 (or NaN if 0 would be misleading) for any time series you know may exist in advance. You have to initialize your metrics with the zero values to prevent broken dashboards and misfiring alerts. For more detailed explanation check out [Existential issues with metrics](https://www.robustperception.io/existential-issues-with-metrics).
NaN shouldn't be used like this.
You don't have to expose 0, but it's much trickier to deal with correctly if you don't.
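To make the point about exposing 0 concrete, here is a minimal sketch in plain Python. `CounterFamily` and the metric name `requests_failed_total` are hypothetical stand-ins, not the API of any Prometheus client library. The sketch only shows that pre-creating the label values you know in advance makes every series visible before the first event happens.

```python
class CounterFamily:
    """Hypothetical in-memory counter with one label (illustration only)."""

    def __init__(self, name, label):
        self.name = name
        self.label = label
        self._values = {}

    def init(self, label_value):
        # Create the series with value 0 so it is exposed before any event occurs.
        self._values.setdefault(label_value, 0.0)

    def inc(self, label_value, amount=1.0):
        # Series that were never initialised only appear on their first increment.
        self._values[label_value] = self._values.get(label_value, 0.0) + amount

    def expose(self):
        # Render each series in a form loosely resembling the text exposition format.
        return [
            '%s{%s="%s"} %s' % (self.name, self.label, value, count)
            for value, count in self._values.items()
        ]


requests_failed = CounterFamily("requests_failed_total", "reason")

# Pre-initialise every failure reason known in advance.
for reason in ("timeout", "refused"):
    requests_failed.init(reason)

requests_failed.inc("timeout")
lines = requests_failed.expose()
```

With this in place the `refused` series is present with value 0 even though no such failure has occurred, so a query over it returns a result rather than nothing.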
## Cardinality Matters

Every unique set of labels create a new timeseries. Use labels with care watch out what you put into your labels. Avoid cardinality explosion, unbounded labels will blow up Prometheus. And keep in mind that labels are multiplicative. You will have multiple labels, multiple target labels and targets.
The number of target labels doesn't matter for cardinality.
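A rough way to see why labels are multiplicative, and why it is the number of targets rather than the number of target labels that counts, is the back-of-the-envelope sketch below. The label names and value counts are made-up assumptions, and the result is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_value_counts, targets):
    """Upper bound on series for one metric: product of per-label value counts, times targets."""
    return prod(label_value_counts.values()) * targets


# Hypothetical instrumentation labels and how many distinct values each can take.
http_requests = {"method": 5, "status_code": 10, "handler": 20}

per_target = series_count(http_requests, targets=1)
fleet = series_count(http_requests, targets=50)

# Adding one unbounded label (here an assumed 10,000 user IDs) multiplies everything.
with_user_id = series_count({**http_requests, "user_id": 10_000}, targets=50)
```

Each target contributes exactly one combination of target label values, so adding another target label leaves the total unchanged while adding another target does not.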
## Naming is hard

One of the hardest problem in computer science. Metric name should only have a single meaning within a single job. And you should always guard against metric name collisions among the jobs on your system.
A metric name
What do you mean by "guard against"? It's perfectly normal for different jobs to share metric names, such as process_cpu_seconds_total. As worded this seems to be arguing for prefixing metric names with job names.
And as always, let your [SLO](https://www.youtube.com/watch?v=X99X-VDzxnw)s guide your bucket layout, create boundaries to match your SLO.

The histograms underneath are just counter with labels; where bucket boundaries used as labels. Be precautious while adding additional labels to your histograms. Remember *Labels are multiplicative* and [Cardinality Matters](#cardinality-matters).
counters
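The "counters with bucket boundaries as labels" idea can be sketched in a few lines of plain Python. This is a simplified model, not client library code; `bucket_series` and the chosen boundaries are assumptions for illustration. Each bucket is cumulative and identified by its upper bound in an `le` label.

```python
import math


def bucket_series(observations, boundaries):
    """Return cumulative bucket counts keyed by their `le` label value."""
    bounds = sorted(boundaries) + [math.inf]
    counts = {}
    for upper in bounds:
        label = "+Inf" if math.isinf(upper) else str(upper)
        # Cumulative: every observation at or below the bound is counted.
        counts[label] = sum(1 for value in observations if value <= upper)
    return counts


latencies = [0.05, 0.2, 0.2, 0.7, 3.0]
buckets = bucket_series(latencies, boundaries=[0.1, 0.5, 1.0])

# One series per bucket, plus the _sum and _count series.
series_per_label_combination = len(buckets) + 2
```

Three boundaries already yield six series, and every additional label value on the histogram multiplies that number again.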
## If you can graph it, you can alert on it

You can't look at dashboards 24/7. Prometheus unified metrics, dashboarding, and alerting. PromQL is the core of every Prometheus alert, and a PromQL query is the source of any graph on a dashboard. That is very powerful. Use it.
unifies
I doubt we were first.
## If you run it, then you should put an alert on it

Always have an alert -*at least*- on presence and healthiness of your targets. You can't that rely on the things and take action that you can't observe.
s/that//
Avoid missing targets and unhealthy targets. All of the client libraries provides `up` metric by default. Use it.
Client libraries do not produce up, that comes from Prometheus.
## Please five more minutes

Prometheus alerting rules let you to specify a time duration that determines after how long an alert should start firing according to given query, which is called `FOR`. If you don't specify it a single failed scrape could cause an alert to fire. You need more tolerance. Don't make it short and always specify it. And also don't make it too long. For more information check out [Setting Thresholds on Alerts](https://www.robustperception.io/setting-thresholds-on-alerts).
It's just `for` these days.
I'd recommend simplifying, and saying always at least 5 minutes
I think "always" is too strong. I would word it more along the lines that you should only do something else if you know exactly what you are doing.
I like the idea of adding this here a lot.
Of course, the original idea had a humorous touch and was meant to be taken with a grain of salt. As an "official" statement here on the project's website it will receive quite a lot of scrutiny…
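For readers weighing the "at least 5 minutes" suggestion, the effect of a `for` duration can be approximated with the simplified model below. It is a sketch of the pending/firing behaviour under assumed evaluation intervals, not Prometheus' actual rule evaluation code; `firing_states` and its parameters are hypothetical names.

```python
def firing_states(condition_results, eval_interval, for_duration):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    condition_results: one boolean per evaluation (did the alert expression match?).
    eval_interval and for_duration are in seconds.
    """
    states = []
    active_since = None
    for index, active in enumerate(condition_results):
        now = index * eval_interval
        if not active:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_duration:
            states.append("firing")
        else:
            states.append("pending")
    return states


blip = [False, True, False]

# Without a `for` duration, a single bad evaluation fires immediately.
no_tolerance = firing_states(blip, eval_interval=60, for_duration=0)

# With five minutes of tolerance, the same blip never gets past pending.
five_minutes = firing_states(blip, eval_interval=60, for_duration=300)

# A sustained problem still fires once it has lasted five minutes.
sustained = firing_states([True] * 7, eval_interval=60, for_duration=300)
```

The trade-off discussed above shows up directly: a longer duration filters out blips but also delays the notification for real, sustained problems by the same amount.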
The First and the most important rule, if you have to remember only one thing remember this one. Instrument all the things!

## Measure what users care
… care about?
---

## Counters rule and gauges suck
It's not a nuanced statement, it's a proverb!
It refers to not using gauges for something that can be represented as a counter. That's explained in the text below.
Signed-off-by: Kemal Akkoyun kakkoyun@gmail.com
As suggested and discussed on #1692, I would like to make The Zen of Prometheus an official part of the documentation. It would be great if you could help me refine it with your reviews.
@brian-brazil @beorn7 @SuperQ @juliusv @brancz @bwplotka @RichiH
Fix #1692