fix(om2): histograms and negative observed values #2627

Open: krajorama wants to merge 4 commits into main from krajo/om2.0-nonosum

krajorama
Member

@krajorama krajorama commented Apr 17, 2025

OM1.0 required that the Sum of Histograms is not represented when there are negative observations in a histogram.

This PR removes the requirement in OM2.0, for the following reasons:
- The requirement was never implemented by the Go and Java instrumentation libraries. Enforcing it now would be a breaking change.
- The requirement makes it impossible to support users who want to measure the Sum anyway. Without the Sum you cannot, for example, calculate the average as Sum/Count.
- The PromQL engine does not take the Sum into account when doing counter reset detection, thus it does not matter that it can decrease.
- We already warned users in the documentation about the possibility of Sum decreasing and not being usable for rate() 10 years ago: PR.
- Native histograms will not take Sum into account when calculating counter resets during rate(), thus this problem won't come up.

Note 1: the Python reference implementation did follow the requirement.

Note 2: this PR does not make Sum mandatory, that is a different question.
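
To make the Sum/Count use case concrete, here is a minimal sketch of a classic histogram that keeps its Sum even when negative values are observed. The `ToyHistogram` class, its bucket boundaries, and the observed values are made up for illustration and do not correspond to any client library's API.

```python
import math


class ToyHistogram:
    """Illustrative cumulative histogram; not a real client library class."""

    def __init__(self, upper_bounds=(-1.0, 0.0, 1.0, math.inf)):
        self.upper_bounds = upper_bounds
        self.bucket_counts = [0] * len(upper_bounds)
        self.count = 0
        self.total = 0.0  # the "Sum" of all observed values

    def observe(self, value):
        # Buckets are cumulative: increment every bucket whose bound is >= value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.count += 1
        self.total += value


h = ToyHistogram()
sums = []
for v in (0.5, 0.75, -2.0, 0.25):
    h.observe(v)
    sums.append(h.total)

average = h.total / h.count
print(sums)     # the Sum after each observation
print(average)  # Sum / Count
```

In this run the Sum drops from 1.25 to -0.75 after the negative observation, so it is no longer monotonic, yet Sum/Count still gives the average of the observations (-0.125).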

@krajorama krajorama force-pushed the krajo/om2.0-nonosum branch from 7880f71 to bd3c521 Compare April 17, 2025 16:26
@beorn7
Member

beorn7 commented Apr 17, 2025

The PromQL engine does not take the Sum into account when doing counter reset detection,

This is only true for native histograms, but not for classic histograms.

(FTR: I proposed to improve the counter reset handling for summaries and classic histograms at KubeCon Berlin in 2017. My proposal was ultimately rejected, so I guess we should not change course now and instead encourage native histograms including NHCB.)
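
The effect on classic histograms can be sketched as follows. This is a simplified model of value-based counter reset handling (any decrease between two samples is treated as a reset, and the previous value is added back as a correction); it leaves out extrapolation and everything else a real `rate()` or `increase()` evaluation does. The function name and the sample values are hypothetical.

```python
def naive_counter_increase(samples):
    """Increase over a window with value-based reset detection (simplified)."""
    result = samples[-1] - samples[0]
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # A decrease is assumed to be a counter reset, so the value
            # seen just before the decrease is added back.
            result += prev
    return result


# Hypothetical _sum series of a classic histogram that observed a negative
# value between the second and third scrape (no real reset happened).
sum_series = [0.5, 1.25, -0.75, -0.5]

reported = naive_counter_increase(sum_series)
actual_change = sum_series[-1] - sum_series[0]
print(reported, actual_change)
```

Here the series really changed by -1.0, but the decrease caused by the negative observation is interpreted as a reset, so the computed increase is 0.25.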

@krajorama
Member Author

The PromQL engine does not take the Sum into account when doing counter reset detection,

This is only true for native histograms, but not for classic histograms.

(FTR: I proposed to improve the counter reset handling for summaries and classic histograms at KubeCon Berlin in 2017. My proposal was ultimately rejected, so I guess we should not change course now and instead encourage native histograms including NHCB.)

I've reworded the PR description and I'll copy the final text into the commit message once we agree on it.
Are you ok with making the change in the specification otherwise?

OM1.0 required that the Sum of Histograms is not represented when there
are negative observations in a histogram.

This PR removes this requirement in OM2.0 because:
- The requirement was never implemented by the Go and Java instrumentation
  libraries. Enforcing it now would be breaking.
- The requirement makes it impossible to implement the use case where the
  user wants to measure the Sum anyway.
- We already warned users in the documentation about the possibility of
  Sum decreasing and not being usable for rate() 10 years ago: #43.
- Native histograms will not take Sum into account when calculating
  counter resets during rate(), thus this problem won't come up.

Note: this PR does not make Sum mandatory, that is a different question.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the krajo/om2.0-nonosum branch from bd3c521 to 3b7d783 Compare April 18, 2025 09:26
@beorn7
Member

beorn7 commented Apr 19, 2025

I think the only way of solving this problem properly (beyond getting rid of classic histograms and summaries altogether) is to require PromQL to detect a counter reset in the sum via different means (historically by looking at the count, but nowadays we could also look at the CT).

I don't know how to solve this given that the Prometheus community has decided to not do that. Maybe just leaving it as is in practice (which is arguably what this PR proposes) is the least bad way, but I don't feel I should make this call about OMv2.
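
For illustration only, detecting a reset of the sum "via different means" could look roughly like the sketch below. It is not how PromQL behaves today; the `Scrape` type, its `created` field (standing in for a created timestamp, CT), and the helper names are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class Scrape(NamedTuple):
    count: int
    total: float                     # the histogram Sum at scrape time
    created: Optional[float] = None  # hypothetical created timestamp (CT)


def is_reset(prev, cur):
    # Prefer the created timestamp when both scrapes carry one,
    # otherwise fall back to the count, which can never legitimately decrease.
    if prev.created is not None and cur.created is not None:
        return cur.created > prev.created
    return cur.count < prev.count


def sum_delta(scrapes):
    """Change of the Sum over the scrapes, never using the Sum to detect resets."""
    delta = 0.0
    for prev, cur in zip(scrapes, scrapes[1:]):
        if is_reset(prev, cur):
            delta += cur.total               # series restarted from zero
        else:
            delta += cur.total - prev.total  # may legitimately be negative
    return delta


no_reset = [Scrape(1, 0.5), Scrape(2, 1.25), Scrape(3, -0.75), Scrape(4, -0.5)]
count_reset = [Scrape(3, -0.75), Scrape(1, 2.0)]
ct_reset = [Scrape(3, 1.0, created=100.0), Scrape(5, 0.5, created=200.0)]
print(sum_delta(no_reset), sum_delta(count_reset), sum_delta(ct_reset))
```

Because the Sum itself is never used to decide whether a reset happened, a decrease caused by negative observations is reported as a negative delta instead of being mistaken for a reset.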

@krajorama
Member Author

I think the only way of solving this problem properly (beyond getting rid of classic histograms and summaries altogether) is to require PromQL to detect a counter reset in the sum via different means (historically by looking at the count, but nowadays we could also look at the CT).

I don't know how to solve this given that the Prometheus community has decided to not do that. Maybe just leaving it as is in practice (which is arguably what this PR proposes) is the least bad way, but I don't feel I should make this call about OMv2.

I agree that the solution is native histograms, and this PR does not try to solve the problem of negative values in Sum. This PR is just about getting rid of a requirement that's not implemented by anyone and just makes things more complicated.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Member Author

cc @fstab @csmarchbanks

Member

@csmarchbanks csmarchbanks left a comment

Generally ok with this as the above paragraph only has the sum as a SHOULD. I didn't realize it was a MUST NOT for OM 1.0, I guess that means Java is not OpenMetrics compliant today.

@csmarchbanks
Member

Also, just to note the above comment - the requirement to not expose _sum when there are negative observations is implemented in client_python today which is the reference client for OpenMetrics. So I wouldn't say it is not implemented by anyone. That said, I don't think it needs to be a MUST, and the fact that I can no longer use averages with negative observations is a pretty big downside.
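
As a rough sketch of the two behaviours under discussion, the hypothetical renderer below either omits or keeps the `_sum` line of a histogram that can receive negative observations. It is not client_python's code: the function, its parameters, the metric name, and the simplified line format are invented for this example, and using a negative lowest bucket boundary as the trigger is an assumption.

```python
import math


def render_histogram(name, upper_bounds, cumulative_counts, total,
                     omit_sum_if_negative):
    """Render simplified exposition lines for one histogram (illustrative)."""
    lines = []
    for bound, bucket_count in zip(upper_bounds, cumulative_counts):
        le = "+Inf" if math.isinf(bound) else repr(float(bound))
        lines.append(f'{name}_bucket{{le="{le}"}} {bucket_count}')
    lines.append(f"{name}_count {cumulative_counts[-1]}")
    # Assumption: a negative lowest boundary means negative observations are possible.
    may_be_negative = upper_bounds[0] < 0
    if not (omit_sum_if_negative and may_be_negative):
        lines.append(f"{name}_sum {total}")
    return lines


bounds = (-1.0, 0.0, 1.0, math.inf)
counts = [1, 1, 4, 4]

strict = render_histogram("temp_delta", bounds, counts, -0.5, omit_sum_if_negative=True)
relaxed = render_histogram("temp_delta", bounds, counts, -0.5, omit_sum_if_negative=False)
print("\n".join(relaxed))
```

With the strict setting there is no `_sum` line, so the average cannot be derived from the exposed data; with the relaxed setting the last line is `temp_delta_sum -0.5` and Sum/Count is available again.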

@krajorama
Member Author

Also, just to note the above comment - the requirement to not expose _sum when there are negative observations is implemented in client_python today which is the reference client for OpenMetrics. So I wouldn't say it is not implemented by anyone. That said, I don't think it needs to be a MUST, and the fact that I can no longer use averages with negative observations is a pretty big downside.

noted

@krajorama
Member Author

Related issue about Sum allowing NaN or not: prometheus/client_golang#1275 (comment)

We agreed to just have good PR descriptions.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>