
The Zen of Prometheus #1783


Open
wants to merge 1 commit into main

Conversation

@kakkoyun (Member) commented Nov 4, 2020


As suggested and discussed in #1692, I would like to make The Zen of Prometheus an official part of the documentation.

It would be great if you could help me refine it with your reviews:
@brian-brazil @beorn7 @SuperQ @juliusv @brancz @bwplotka @RichiH

Fix #1692

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

## Instrument first, ask questions later

During development, you never know what questions you will need to ask later. Software needs good instrumentation; it's not optional. Metrics are cheap. Use them generously.
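
As a minimal sketch of what "instrument first" looks like in Go with client_golang (the metric name and handler are illustrative assumptions, not part of this PR):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Instrument up front: a counter is one line to add now and can answer
// questions you don't yet know you will have.
var pagesServed = promauto.NewCounter(prometheus.CounterOpts{
	Name: "app_pages_served_total",
	Help: "Total number of pages served.",
})

func handler(w http.ResponseWriter, r *http.Request) {
	pagesServed.Inc() // count every request, generously
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```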
Contributor

A metric is cheap. Metrics are not cheap.

Metrics without labels are cheap; metrics with labels tend to blow up in cardinality. Making it seem that metrics are cheap would likely lead to more users falling into this trap.

Member

Yeah, would be good to be more nuanced here and perhaps link to the cardinality section.


## Measure what users care about

Do your users care if your database servers are down? Do they care about your CPU saturation? No, they care about what they experience: whether they can access the page they requested and whether their results are fresh. Think in terms of latencies and availability. Let your [SLO](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/)s guide your instrumentation.
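
A sketch of user-facing instrumentation in Go with client_golang: counting page serves by the outcome the user actually saw gives an availability SLI you can set an SLO on (the metric and label names are illustrative assumptions):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Count page serves by what the user experienced, not by what the
// CPU did. The success/total ratio is an availability SLI.
var pageServes = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_page_serves_total",
	Help: "Page serves by user-visible outcome.",
}, []string{"outcome"})

func page(w http.ResponseWriter, r *http.Request) {
	if _, err := w.Write([]byte("fresh results")); err != nil {
		pageServes.WithLabelValues("error").Inc()
		return
	}
	pageServes.WithLabelValues("success").Inc()
}

func main() {
	http.HandleFunc("/", page)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```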
Contributor

Uhm, you should be measuring all of those things.

You're conflating metrics with alerting here.

Member

I think this should be phrased in a way that measuring all of those things is still important, but that instrumentation should additionally be suitable for SLO-based alerting.


## Avoid missing metrics

Time series that are not present until something happens are difficult to deal with. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance. You have to initialize your metrics with zero values to prevent broken dashboards and misfiring alerts. For a more detailed explanation, check out [Existential issues with metrics](https://www.robustperception.io/existential-issues-with-metrics).
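
A minimal sketch of this zero-initialization in Go with client_golang (the error kinds here are illustrative assumptions):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var errorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_errors_total",
	Help: "Errors encountered, by kind.",
}, []string{"kind"})

func init() {
	// Touching each known label value creates its child series at 0,
	// so the time series exists before the first error ever happens
	// and dashboards and alerts never see a gap.
	for _, kind := range []string{"timeout", "bad_request", "internal"} {
		errorsTotal.WithLabelValues(kind)
	}
}

func main() {}
```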
Contributor

NaN shouldn't be used like this.

You don't have to expose 0, but it's much trickier to deal with correctly if you don't.


## Cardinality Matters

Every unique set of labels creates a new time series. Use labels with care, and watch what you put into them. Avoid cardinality explosions: unbounded labels will blow up Prometheus. And keep in mind that labels are multiplicative: you will have multiple labels, multiple target labels, and multiple targets.
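
A sketch of what "multiplicative" means in practice, in Go with client_golang (the label sets and counts are illustrative assumptions):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Two bounded labels: ~5 methods x ~10 status codes is already up to
// 50 series for this single metric, on every target you scrape.
// An unbounded label (user ID, email, raw URL path) would multiply
// that without limit and eventually blow up Prometheus.
var httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_http_requests_total",
	Help: "HTTP requests by method and status code.",
}, []string{"method", "code"})

func main() {
	httpRequests.WithLabelValues("GET", "200").Inc() // bounded values only
}
```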
Contributor

The number of target labels doesn't matter for cardinality.


## Naming is hard

One of the hardest problems in computer science. A metric name should have only a single meaning within a single job. And you should always guard against metric name collisions among the jobs on your system.
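
One way to scope names, sketched with client_golang's Namespace and Subsystem fields (the "myapp" and "store" names are illustrative assumptions):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Rendered as myapp_store_queries_total: the prefix gives the name a
// single, unambiguous meaning within this job and makes a collision
// with some other job's generic "queries_total" far less likely.
var queries = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "myapp",
	Subsystem: "store",
	Name:      "queries_total",
	Help:      "Queries issued to the backing store.",
})

func main() { queries.Inc() }
```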
Contributor

A metric name

What do you mean by "guard against"? It's perfectly normal for different jobs to share metric names, such as process_cpu_seconds_total. As worded this seems to be arguing for prefixing metric names with job names.


And as always, let your [SLO](https://www.youtube.com/watch?v=X99X-VDzxnw)s guide your bucket layout; create boundaries that match your SLO.

Underneath, histograms are just counters with labels, where the bucket boundaries are used as labels. Be cautious when adding additional labels to your histograms. Remember, *labels are multiplicative*, and [Cardinality Matters](#cardinality-matters).
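
For example, a sketch of SLO-aligned buckets in Go with client_golang (assuming a hypothetical 300 ms latency SLO; the metric name and boundaries are illustrative):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A boundary at exactly 0.3 lets you read the fraction of requests
// within the assumed 300 ms SLO straight from the le="0.3" bucket.
// Every bucket is one more series, and any extra label multiplies
// them all; see Cardinality Matters.
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "app_request_duration_seconds",
	Help:    "Request duration in seconds.",
	Buckets: []float64{0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
})

func main() { requestDuration.Observe(0.27) }
```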
Contributor

counters


## If you can graph it, you can alert on it

You can't look at dashboards 24/7. Prometheus unifies metrics, dashboarding, and alerting. PromQL is the core of every Prometheus alert, and a PromQL query is the source of any graph on a dashboard. That is very powerful. Use it.
Contributor

unifies

I doubt we were first.


## If you run it, then you should put an alert on it

Always have an alert, *at least*, on the presence and healthiness of your targets. You can't rely on things, or take action on them, if you can't observe them.
Contributor

s/that//


Always have an alert, *at least*, on the presence and healthiness of your targets. You can't rely on things, or take action on them, if you can't observe them.

Avoid missing targets and unhealthy targets. Prometheus provides the `up` metric for every target by default. Use it.
Contributor

Client libraries do not produce `up`; that comes from Prometheus.


## Please five more minutes

Prometheus alerting rules let you specify a duration, called `FOR`, that determines how long the given query must keep failing before an alert starts firing. If you don't specify it, a single failed scrape could cause an alert to fire. You need more tolerance: always specify it, and don't make it too short. But don't make it too long either. For more information, check out [Setting Thresholds on Alerts](https://www.robustperception.io/setting-thresholds-on-alerts).
Contributor

It's just `for` these days.

I'd recommend simplifying and saying always at least 5 minutes.

Member

I think "always" is too strong. I would word it more along the lines that you should only do something else if you know exactly what you are doing.

@beorn7 (Member) left a comment

I like the idea of adding this here a lot.

Of course, the original idea had a humorous touch and was meant to be taken with a grain of salt. As an "official" statement here on the project's website, it will receive quite a lot of scrutiny…


## Instrument first, ask questions later

During development, you never know what questions you will need to ask later. Software needs good instrumentation; it's not optional. Metrics are cheap. Use them generously.
Member

Yeah, would be good to be more nuanced here and perhaps link to the cardinality section.


The first and most important rule: if you have to remember only one thing, remember this one. Instrument all the things!

## Measure what users care about
Member

… care about?


## Measure what users care about

Do your users care if your database servers are down? Do they care about your CPU saturation? No, they care about what they experience: whether they can access the page they requested and whether their results are fresh. Think in terms of latencies and availability. Let your [SLO](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/)s guide your instrumentation.
Member

I think this should be phrased in a way that measuring all of those things is still important, but that instrumentation should additionally be suitable for SLO-based alerting.


---

## Counters rule and gauges suck
Member

It's not a nuanced statement, it's a proverb!

It refers to not using gauges for something that can be represented as a counter. That's explained in the text below.


## Please five more minutes

Prometheus alerting rules let you specify a duration, called `FOR`, that determines how long the given query must keep failing before an alert starts firing. If you don't specify it, a single failed scrape could cause an alert to fire. You need more tolerance: always specify it, and don't make it too short. But don't make it too long either. For more information, check out [Setting Thresholds on Alerts](https://www.robustperception.io/setting-thresholds-on-alerts).
Member

I think "always" is too strong. I would word it more along the lines that you should only do something else if you know exactly what you are doing.
