
Base-2 exponential histogram protocol support #322

Merged — 16 commits merged into open-telemetry:main on Sep 17, 2021

Conversation

@jmacd (Contributor) commented Jul 23, 2021

Adds a base-2 exponential histogram to OTLP metrics.

This protocol is copied from the OpenMetrics/Prometheus draft created by @beorn7, with comments written by me to summarize discussion in open-telemetry/opentelemetry-specification#1776 (particularly around the meaning of the zero bucket), following the design effort in OTEP 149 led by @yzhuge.

Note: Past debate and discussion in the OTel Metrics SIG has led us to use different types for different kinds of histogram, as opposed to burying a new oneof in the existing types. This proposal follows that decision, and I would prefer not to revisit that topic. The new ExponentialHistogram and ExponentialHistogramDataPoint types mimic the existing explicit histogram types for most of their fields, including attributes, timestamps, sum, count, and exemplars fields. As such, the real review here is on lines 483 onward, starting from the scale field.

@jmacd (Contributor, Author) commented Jul 26, 2021

@jsuereth and @beorn7 thank you. About the sum field, I copied the existing Histogram data types and modified the bucketing strategy, so this sum is the same as HistogramDataPoint. If #320 merges first, this PR would be updated to again be a copy of HistogramDataPoint with the same comment about sum.

// base = (2^(2^-scale))
//
// The histogram bucket identified by `index`, a signed integer,
// contains values that are less than or equal to (base^index) and
Review comment (Contributor):

This defines the value at base^index as the upper bound of a bucket. I think the lower bound is more intuitive, and computationally more convenient. "Round down" is generally more convenient in computing; for example, at scales at or below zero, the index can be computed by right-shifting the exponent field of the double representation.

The lower bound is also more intuitive for humans. For example, bucket 0 would be [1, base) with lower bounds, rather than (1/base, 1] with upper bounds. With lower bounds we can also say that buckets 0 and above map to values 1 and above, while negative buckets map to values less than 1: the neutral points of index space (0) and log space (1) coincide perfectly.

I know that Prometheus traditionally uses "le" upper-bound buckets, but do we have to keep that compatibility? And from the current OM protocol (https://github.com/prometheus/client_model/blob/61b6c1aac0648ff14fae91ab976b8f2b30c5c5d3/io/prometheus/client/metrics.proto#L63-L73), it is not obvious whether it still sticks to the upper-bound semantics. Maybe @beorn7 can clarify?
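
A minimal sketch of that "round down" computation for scales at or below zero, assuming the lower-bound convention where bucket `index` covers [base^index, base^(index+1)) and base = 2^(2^-scale); zero and subnormal values are ignored for brevity, and note that the "le" upper-bound convention would need an extra adjustment at exact powers of the base:

```go
package main

import (
	"fmt"
	"math"
)

// indexLowerBound returns the bucket index for scale <= 0 by extracting the
// unbiased IEEE 754 exponent and shifting it right by -scale. Arithmetic
// right shift rounds toward negative infinity, which is exactly "round down".
func indexLowerBound(v float64, scale int) int64 {
	bits := math.Float64bits(v)
	exponent := int64((bits>>52)&0x7ff) - 1023
	return exponent >> uint(-scale)
}

func main() {
	fmt.Println(indexLowerBound(4.0, 0))    // 2: bucket [4, 8) at base 2
	fmt.Println(indexLowerBound(0.5, 0))    // -1: bucket [0.5, 1)
	fmt.Println(indexLowerBound(100.0, -1)) // 3: bucket [64, 256) at base 4
}
```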

Review comment:

I don't follow the argument of "more intuitive for humans". In fact, I see it the other way around.

The computational argument is minor and shouldn't tip the balance. Positive scales will be the exception, and even then, the actual difference in computational or algorithmic complexity doesn't really make a dent.

More generally (and along the lines I have explained before), this is at a level of detail that I don't feel qualified to discuss while we are still experimenting to get the basics right. Little knobs like this can easily be tweaked later in the process, based on the insights gained.

Having said that, if I had to take a guess, I would say it's likely Prometheus will stick to the "less or equal" upper bounds because it is consistent with Prometheus's existing histograms, so that creating compatibility mappings is easier. There would need to be stronger arguments than "I save a single `if frac == 0.5 { sparseKey-- }` line in the code in the rare case of very low resolution histograms" to tip the balance.

Finally, to avoid misunderstandings: Everything you see in a "sparsehistogram" branch in a Prometheus repo is part of our experiment. There is no direct link to OpenMetrics, i.e. this is not part of some kind of "OpenMetrics 2.0 draft". First these experiments have to be done and insights have to be gained. Only then can we think about the actual implementation in a future Prometheus release, which will in turn inform how OpenMetrics will embrace sparse high-resolution histograms.

Review comment from @jmacd (Contributor, Author):

@yzhuge I was on the fence about this issue -- whether the upper or lower bound should be inclusive -- until the comment from @oertl here: https://github.com/open-telemetry/opentelemetry-proto/pull/322/files/e680b5938be129ac8410f7ca66f0a50bcc4c4b52#r681484304

You have both recommended against the zero-tolerance field described here, and the remark by @oertl shows how it can be added later in a way that is backwards compatible with exclusive lower-bound buckets. I recommend that we avoid unnecessary conflict with the Prometheus group on this and keep the "less-than-or-equal" upper bound.

Review comment (Contributor):

First: "inclusiveness/exclusiveness is irrelevant in practice". See the previous discussion at open-telemetry/oteps#149 (review) (click on "outdated" to show the comment "CharlesMasson reviewed on Mar 17"). We should avoid an inclusiveness/exclusiveness constraint in the OTel standard.

I don't see why the bucket index (base^index as upper vs. lower bound) is relevant to the zero region width. The indexed buckets are "regular" buckets that satisfy the bound = base^i formula. The zero-region bucket is a special bucket for zero and values near zero. https://github.com/open-telemetry/opentelemetry-proto/pull/322/files/e680b5938be129ac8410f7ca66f0a50bcc4c4b52#r681484304 seems to introduce a third category of bucket, [zero_tolerance, lowest_regular_bound]. That will add complexity. There was a proposal of linear buckets near zero, followed by exponential buckets further out. For now, I'd like to stick to a single zero-region bucket and a sequence of regular exponential buckets.

Previous version says: "
// zero_tolerance may be optionally set to convey the width of the
// zero region. When this is not set, implementations may assume
// the zero tolerance equals the smallest representable positive
// IEEE 754 double-precision floating point number, which is (2^−1074)."

This only "suggests" that the default is 2^-1074 (the smallest positive number a "double" can represent). It does not define the field as "the smallest nonnegative value that maps to a regular bucket". When it is not set, most producers mean "I don't know" or "I don't care". A consumer using 2^-1074 is assuming the narrowest possible zero region, which may or may not be what the producer meant. The key issue with defaults and backward compatibility is that the protobuf default value is 0, but the zero_tolerance default is "unknown". If we remap 0 to "unknown", we need another value for a logical 0. There are plenty of choices, like -1, but all of them are confusing when physical 0 is not logical 0.

A better solution is to use an additional boolean field such as "zero_tolerance_is_set". Booleans default to false in protobuf, so we know the field defaults to "not set" (i.e., "unknown"). When it is set, we use the zero_tolerance value as-is (including the protobuf default of 0) as the real zero_tolerance. Logically, this achieves a default of "null".
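
A minimal sketch of the consumer-side logic this pair of fields would imply; the field names are hypothetical and were never added to the merged protocol:

```go
package main

import "fmt"

// effectiveZeroTolerance interprets the hypothetical (zero_tolerance,
// zero_tolerance_is_set) pair: an unset flag means "unknown", while a set
// flag makes 0 a legitimate explicit tolerance.
func effectiveZeroTolerance(isSet bool, zeroTolerance float64) (float64, bool) {
	if !isSet {
		return 0, false // unknown, not zero
	}
	return zeroTolerance, true
}

func main() {
	fmt.Println(effectiveZeroTolerance(false, 0)) // 0 false => unknown
	fmt.Println(effectiveZeroTolerance(true, 0))  // 0 true  => explicitly zero
}
```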

Review comment (Contributor):

If we need to carry the explicit flag (see flags below) for "zero_tolerance_enabled", it's somewhat akin to what we proposed for handling monotonicity of sums in histograms. I'd suggest a similar approach if we need it.

From our (Google) perspective, we have an explicit zero bucket that ends where the first exponential bucket starts, so we don't really need an explicit tolerance here.

//
// Implementations MAY consider the zero bucket to have probability
// density equal to (zero_count / 2*zero_tolerance), if set.
double zero_tolerance = 8;
Review comment (Contributor):

I am not convinced of the value of the zero_tolerance field. From open-telemetry/opentelemetry-specification#1776 (comment):
I lean against recording the rounding threshold, even as an optional field in the OTel protocol. First, the question is how you are going to use it if you have it in the protocol. Besides showing the threshold to the user in some particular UI, there is not much use. And it actually creates confusion: for example, if it is larger than the lowest bucket, what does that mean (an overlapping bucket boundary?)? And since protobuf optional fields default to 0, when it is not set it looks like the threshold is 0, but in fact it may not be (the producer neglected to set it, or the producer does not know what it is). Then we may need a value for "unknown" (use NaN?). Tracking the threshold opens a can of worms.

Review comment:

I am also not sure the zero_tolerance field is really needed. However, if it is needed, I would rather define it as the largest nonnegative double-precision value that maps to the zero bucket. (Currently, it is defined as the smallest nonnegative value that maps to a regular bucket.) As a consequence, the default value would be 0 and not the "magic number" 2^-1074.

This definition would also be more consistent, as we would have exclusive lower bounds for all buckets. Consider the first regular bucket to which values may be mapped. With the proposed definition of zero_tolerance, this bucket would contain all values from the interval (zero_tolerance, base^i], where i is the bucket index. Hence, we would again have an exclusive lower bound, as for all other buckets, which contain values from (base^(i-1), base^i].
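
A sketch of the bucket assignment this definition implies, assuming upper-bound-inclusive ("le") buckets and zero_tolerance defined as the largest value that falls into the zero bucket (so the protobuf default of 0 means only exact zeros land there); floating-point rounding at boundaries is ignored for brevity:

```go
package main

import (
	"fmt"
	"math"
)

// assign returns the bucket for a nonnegative value: (0, true) for the zero
// bucket [0, zeroTolerance], or (i, false) for regular bucket i covering
// (base^(i-1), base^i].
func assign(v, zeroTolerance, base float64) (int64, bool) {
	if v <= zeroTolerance {
		return 0, true
	}
	return int64(math.Ceil(math.Log(v) / math.Log(base))), false
}

func main() {
	base := 2.0 // scale 0
	fmt.Println(assign(0, 0, base))   // zero bucket
	fmt.Println(assign(3, 0, base))   // regular bucket 2: (2, 4]
	fmt.Println(assign(0.5, 1, base)) // zero bucket when zero_tolerance = 1
}
```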

Review comment from @jmacd (Contributor, Author):

I like the suggestion by @oertl that zero_tolerance could be defined so that the zero value (default) is an inclusive upper bound.

I've removed this field from the protocol because it can be added later, by this logic, at which time the can of worms described by @yzhuge will have to be addressed. To add this field we will need to pin down more detail about how to merge histograms with different zero tolerances, for example.

@yzhuge (Contributor) commented Jul 28, 2021

Besides my in-line comments, I have a major one:

I am not convinced of the value of the sparse array encoding. It adds a lot of complexity. The counter array is separate from the span list, so it is easy to misalign them when encoding or decoding. Adding delta encoding on top of the counters adds another layer of complexity. Implementations will be error-prone and more expensive than a simple array (essentially one span, no delta encoding). Having no other encoding option requires all producers and consumers to implement this encoding/decoding scheme, which I feel is an undue burden.

For array compaction, there is always a trade-off between CPU, complexity, and space. I remember that in previous discussions some people were even concerned about the CPU cost of varint (the Protobuf default encoding of integer arrays) and favored "repeated fixed64 bucket_counts", which is the current spec for explicit-bucket histograms.

For final on-disk storage, heavy compression at the cost of CPU may be worth it, and the Prometheus/OM sparse array appears to be such a scheme. For intermediate results and transport, we need to balance CPU and space. In fact, for streaming, where a processor typically holds one value at a time, CPU (i.e., throughput) may be favored over space. In streaming, an object may need to pass through multiple stages of processing, going through deserialization and re-serialization at each stage. In such cases, reducing CPU cost is even more critical.

My preference for OTel bucket counters is a simple array of "repeated sint64 bucket_counts" (one span, no deltas, varint). We get varint for free from Protobuf. In NrSketch (https://github.com/newrelic-experimental/newrelic-sketch-java), the default of 320 buckets works very well in New Relic production as a "one size fits all" size. Our serialization uses a variable number of bytes per bucket, though it still requires the same number of bytes for every bucket in a histogram; it is not even as aggressive as varint on individual buckets. In practice, the typical serialized size is a few hundred bytes. At that size, fancy "sparse array" compression is not cost-effective in CPU or complexity.

In the spirit of keeping it simple, I'd like to see simple arrays in the OTel standard first. There are many sparse-array encoding methods; if we do go that way in the future, we will have to be very careful choosing one.
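
To illustrate the complexity being compared, here is a rough sketch of decoding a span-plus-delta representation of the kind discussed into the simple "one span, absolute counts" array described above. The Span layout and delta semantics are assumptions for this example, not the exact Prometheus/OM draft wire format:

```go
package main

import "fmt"

// Span is a hypothetical run of consecutive populated buckets: Offset is the
// gap (in bucket indexes) from the end of the previous span, Length is the
// number of buckets in the run.
type Span struct {
	Offset int32
	Length uint32
}

// decodeSpans expands a span+delta encoding into (startIndex, counts).
// Deltas hold one entry per populated bucket; each entry is the difference
// from the previous bucket's count (the first entry is an absolute count).
func decodeSpans(spans []Span, deltas []int64) (startIndex int64, counts []uint64) {
	index := int64(0)
	count := int64(0)
	d := 0
	first := true
	for _, s := range spans {
		index += int64(s.Offset)
		if first {
			startIndex = index
			first = false
		}
		// Fill the gap between spans with zero counts.
		for int64(len(counts)) < index-startIndex {
			counts = append(counts, 0)
		}
		for i := uint32(0); i < s.Length; i++ {
			count += deltas[d]
			d++
			counts = append(counts, uint64(count))
			index++
		}
	}
	return startIndex, counts
}

func main() {
	// Buckets 2..3 and 6..7 populated with counts 5, 3, (gap), 2, 2.
	spans := []Span{{Offset: 2, Length: 2}, {Offset: 2, Length: 2}}
	deltas := []int64{5, -2, -1, 0}
	start, counts := decodeSpans(spans, deltas)
	fmt.Println(start, counts) // 2 [5 3 0 0 2 2]
}
```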

@jmacd (Contributor, Author) commented Jul 28, 2021

In the spirit of keeping it simple, I'd like to see simple arrays in the OTel standard first.

You wrote a very convincing explanation, and you have my support. I will update this PR, unless you would like to take over @yzhuge? The key point is that we have to convert losslessly into the OpenMetrics protocol, and it looks like this just means agreeing to an optional width for the zero bucket.

@yzhuge (Contributor) commented Jul 29, 2021

There is another decision to make before we proceed:
Whether base^index is the lower or upper bound of the bucket at “index”.

The discussion started at #322 (comment). I am continuing it here so that it doesn't get buried or lost in the in-line discussion. I favor the lower-bound option. It has these pros:

  • It is mathematically cleaner. The neutral points of index space (0) and log space (1) coincide perfectly: buckets 0 and above map to values 1 and above, and negative buckets map to values less than 1. Bucket 0 covers [1, base) rather than (1/base, 1].
  • Both NrSketch and DDSketch use the lower bound.
  • It is often simpler to code. "Rounding down" by right-shifting an integer or casting a floating-point number to an integer is straightforward. The reduction in code complexity (and therefore the smaller chance of a bug) matters more here than the runtime gain of one or a few instructions.

Now the cons:

  • It differs from the "le" upper-bound convention of Prometheus explicit-bucket histograms. Note that the Prometheus exponential histogram project is still in progress and has not made a final decision on upper vs. lower bound. It would be ironic if OTel followed the older Prometheus convention while the newer Prometheus histograms went the other way.

So I am suggesting that OTel go with the lower bound. If we have to do an "off by one" conversion from the Prometheus index, that is no big deal. We need to look to the future more than the past.
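
For concreteness, a small sketch of the two index mappings under discussion (log-based, so subject to floating-point error at exact boundaries; real implementations would prefer exact integer arithmetic). The two conventions differ by at most one, which is the "off by one" conversion mentioned above:

```go
package main

import (
	"fmt"
	"math"
)

// With base = 2^(2^-scale), log_base(v) = log2(v) * 2^scale.

// indexLower: bucket i covers [base^i, base^(i+1)).
func indexLower(v float64, scale int) int64 {
	return int64(math.Floor(math.Log2(v) * math.Exp2(float64(scale))))
}

// indexUpper ("le"): bucket i covers (base^(i-1), base^i].
func indexUpper(v float64, scale int) int64 {
	return int64(math.Ceil(math.Log2(v) * math.Exp2(float64(scale))))
}

func main() {
	// Scale 0, base 2.
	fmt.Println(indexLower(3, 0), indexUpper(3, 0)) // 1 2: [2,4) vs (2,4]
	fmt.Println(indexLower(4, 0), indexUpper(4, 0)) // 2 2: [4,8) vs (2,4]
}
```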

@yzhuge (Contributor) commented Jul 29, 2021

A minor point: the bucket index needs to be a 64-bit int. At higher scales, the index can go beyond 32 bits.

A "double" has 64 bits, and therefore roughly 2^63 distinct values in each of the positive and negative ranges (fewer, in fact, because of the special code points for Inf, NaN, etc.). It is perfectly valid to divide those values into more than 2^32 buckets. As a standard, we need to handle extreme cases.

Such high-scale use cases more likely arise when the contrast (dynamic range) of the dataset is very small, rather than when somebody really needs 0.00001% relative error. For example, if a dataset ranges from 1000 to 1000.1, a contrast of 1.0001, we need high scales to have enough buckets to split up that small range.

In such use cases, the exponential buckets actually resemble linear buckets. In the example above, the bucket at the end of the range is only 1.0001 times as wide as the one at the beginning. This shows that although the exponential histogram is designed for high-dynamic-range data, it can handle low-dynamic-range data too.

Something like NrSketch, given a high enough default initial scale, can handle low-range datasets with its default config. Right now the default initial scale is 12, with a base of 1.00016, which cannot quite resolve a range of 1.0001. This made me think about raising the default. In New Relic production, the default initial scale is 32, which has a base of 1 + 1.6E-10, so it can handle ranges as low as 1 + 1E-8 out of the box. The cost of a higher initial scale is that it has to downscale to the more common scale of 4 or 5 in most cases (i.e., a cost on the common case so that we can cover extreme cases). But the number of downscaling operations is finite and small compared with the millions or billions of values inserted into a high-throughput histogram.

BTW, NrSketch can downscale by more than one scale at once. For example, if the second value it ingests is 1.1x the first, it will downscale directly from 32 (or whatever the initial scale is) to scale 11. A round of downscaling iterates over the bucket window at the old scale; on the first downscaling, the old window size is exactly one (one value in one bucket), so the first downscaling is very cheap. Thus the cost of a high initial scale is very small, and switching from low-dynamic-range mode to high-dynamic-range mode usually happens within the first few values ingested.
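
The arithmetic behind that "32 down to 11" example, as I understand it (my own sketch, not NrSketch's code): the number of buckets needed is about log2(ratio) * 2^scale, so the largest scale that fits a 320-bucket budget for a ratio of 1.1 is 11:

```go
package main

import (
	"fmt"
	"math"
)

// maxScaleFor returns the largest scale at which values spanning `ratio`
// still fit within maxBuckets exponential buckets.
func maxScaleFor(ratio float64, maxBuckets int) int {
	return int(math.Floor(math.Log2(float64(maxBuckets) / math.Log2(ratio))))
}

func main() {
	fmt.Println(maxScaleFor(1.1, 320)) // 11
}
```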

@jmacd (Contributor, Author) commented Aug 6, 2021

I've reviewed all the feedback, and it doesn't seem like more is coming.

I am not convinced of the value of the sparse array encoding.

I think I agree.

Whether base^index is the lower or upper bound of the bucket at “index”.

I'm inclined to follow your guidance, @yzhuge.

Different from the "le" upper bound convention in Prometheus explicit histograms.

We've repeatedly made statements that the boundary condition doesn't matter in theory or in practice, which is how we justified simply changing that OTLP definition to match Prometheus for explicit boundaries, which made sense. I think we can use the same argument to go the other way in this case. 😀

I would rather define it as the largest nonnegative double-precision value that maps to the zero bucket.

This makes sense! Thanks @oertl.

This implies a number of changes in this protocol, which I will gladly follow up on next week.

@jmacd (Contributor, Author) commented Aug 9, 2021

@open-telemetry/specs-metrics-approvers please take a look.

@lquerel left a review comment:

LGTM once the oneof is fixed.

@jmacd (Contributor, Author) commented Sep 10, 2021

@oertl and @yzhuge I updated this text to use the inclusive lower bound.

TL;DR I will give detailed feedback in this comment and a summary in the following comment.

A number of the questions being discussed at the end of this thread can be treated not as protocol issues, but as issues present at runtime and/or issues with the compiled code. We are discussing the advisable ranges for scale, the width of the bucket index, and how to handle hardware imprecision, all of them interrelated topics.

For scale, I don't see anything mathematically wrong with large values (or negative values), just a limited range of useful outcomes.

For the bucket index, we've already specified varint encoding for the relevant offset, so any decision about 32-bit vs 64-bit width is made by the protocol compiler, and we can generate new code to support wider values. We don't need to change the protocol to take advantage of large indexes, but we do need to change the generated code. If we went ahead with sint32 as the type for the offset field, then OTel collectors would not be able to manipulate or inspect large-scale histograms without code changes.

These are connected by the hardware precision issue, because at certain scales we have a limited ability to represent the entire range of IEEE floating point values and/or we can represent distinctions that the IEEE floating point values cannot distinguish.

@yzhuge has explained his motivation for not restricting the scale or the bucket index. @oertl points out that indexes below -2^63 and above 2^63-1 will be difficult to work with. Whether indexes are limited to 32-bit or 64-bit, a hypothetical auto-ranging histogram has to be aware that the maximum scale, subject to index-width constraints, is limited by the exponent of the data (i.e., your best scale depends on the data and the width of the index, but there was always going to be a limit). I believe a good auto-ranging histogram can be built for 32-bit indexes. If we have such a thing and users are unsatisfied with it because it does not have enough precision for their data ranges, then it will not require a protocol change to widen the bucket index; it will only require recompiling a bunch of code. I suspect that before we need 64-bit index support, we will want to support a sparse encoding the way Prometheus has proposed.

@yzhuge, for example, if New Relic wants to support 64-bit bucket indexes internally, it can maintain a copy of the .proto file with `sint32 offset = 1;` replaced by `sint64 offset = 1;`, and nothing will break. This would be a good way to convince the community to adopt 64-bit bucket widths in the OTel collector.
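
That wire-compatibility claim follows from how protobuf encodes sint32 and sint64: both use the same zigzag varint representation, so a value that fits in 32 bits produces identical bytes either way. A small sketch of the encoding rules (hand-rolled for illustration, not generated protobuf code):

```go
package main

import "fmt"

// zigzag encodes a signed integer the way protobuf sint32/sint64 fields do;
// both field types share this varint wire format, which is why swapping
// sint32 for sint64 in the .proto is wire-compatible for values that fit.
func zigzag(x int64) uint64 {
	return uint64((x << 1) ^ (x >> 63))
}

// varint emits the base-128 encoding used for all varint fields.
func varint(x uint64) []byte {
	var out []byte
	for x >= 0x80 {
		out = append(out, byte(x)|0x80)
		x >>= 7
	}
	return append(out, byte(x))
}

func main() {
	offset := int64(-12345)
	// These bytes are what either a sint32 or a sint64 field would carry.
	fmt.Printf("% x\n", varint(zigzag(offset))) // f1 c0 01
}
```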

@jmacd (Contributor, Author) commented Sep 10, 2021

@open-telemetry/specs-metrics-approvers to summarize the current state of this PR:

We have deliberated on every detail and I believe this is in its final form. There are questions about number widths in the .proto that can change in the future because we are using varint-encoding for the fields in question.

A number of questions have come up that I think should be specified in the data model, not the protocol.

An example: implementations are free to use log() to compute the bucket index, which may lead to inconsistent bucket calculations across platforms and could lead to inconsistencies between exemplar data and the buckets (e.g., the bucket index calculated on the server for the max is an empty bucket, because the client calculated the adjacent bucket).
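
For illustration (my own example, not from the PR): two ways of computing a scale-0 bucket index, one using log() and one using the IEEE 754 exponent bits, can disagree for values just below a bucket boundary, which is exactly the cross-platform consistency concern described above:

```go
package main

import (
	"fmt"
	"math"
)

// Lower-bound convention at scale 0: bucket i covers [2^i, 2^(i+1)).

func indexByLog(v float64) int64 {
	return int64(math.Floor(math.Log2(v)))
}

func indexByBits(v float64) int64 {
	return int64((math.Float64bits(v)>>52)&0x7ff) - 1023
}

func main() {
	v := math.Nextafter(math.Exp2(50), 0) // just below 2^50, exact index is 49
	// On typical platforms the log-based result rounds up to 50 here,
	// while the bit-based result is 49.
	fmt.Println(indexByLog(v), indexByBits(v))
}
```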

A second example: implementations may reject data with scale or bucket index that are too large or small for them to handle. We are trying to avoid being overly prescriptive about maximum scale, but when histogram buckets map onto values that cannot be expressed by an available IEEE floating point value, I think implementations should be free to decide how to proceed.

I think we should accept this PR in its current form and end the debate. We have demonstrated three prototypes. I'll follow up next week with proposed text for the data model. Please review.

@jsuereth (Contributor):
Still approve, and +1 to hashing out further specification on acceptable precision and calculation consistency in the data model specification.

@jamesmoessis:
I'm trying to understand the description of bucket_counts.

    // Count is an array of counts, where count[i] carries the count
    // of the bucket at index (offset+i).  count[i] is the count of
    // values greater than or equal to base^(offset+i) and less than
    // base^(offset+i+1).

For simplicity, assume my offset and scale are both 0, so my bucket lower bounds would be 1, 2, 4, 8, 16, etc. Does this mean that values below 1 and greater than 0 don't fall into any bucket? Is this what you are describing as "subnormal values"? It's certainly possible I'm missing something super obvious here, so please point it out if I am!

@yzhuge (Contributor) commented Sep 15, 2021

@jamesmoessis the index and offset can be negative, and base^index is less than 1 when the index is negative (for example, 2^-1 = 0.5). It never reaches 0, though, which is why there is a separate count for zero.
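
A small worked example of that layout (scale 0, lower-bound convention, so bucket index i covers [2^i, 2^(i+1))): with a negative offset, values between 0 and 1 land in negative indexes rather than falling outside the histogram. Illustrative only, not generated OTLP code:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	offset := int64(-2)
	counts := []uint64{1, 3, 4, 2} // bucket indexes -2, -1, 0, 1
	for i, c := range counts {
		index := offset + int64(i)
		lower := math.Exp2(float64(index))
		fmt.Printf("index %2d: [%g, %g) count=%d\n", index, lower, 2*lower, c)
	}
}
```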

@jamesmoessis:
@yzhuge thank you for the explanation, that makes sense.

@jmacd (Contributor, Author) commented Sep 16, 2021

This issue captures the need to specify more of the data model:
open-telemetry/opentelemetry-specification#1932

I propose we merge this after there's a PR up for review on 1932.

@jmacd (Contributor, Author) commented Sep 16, 2021

I am working on adding this to the Metrics data model here:
open-telemetry/opentelemetry-specification#1935

It is still a work in progress, but I feel we can merge this PR!

@jsuereth jsuereth merged commit 4f5d71f into open-telemetry:main Sep 17, 2021
@jsuereth (Contributor):

Agreed this can be merged, and we can hash out the details in the data model portion.

@jmacd jmacd deleted the jmacd/expohist branch September 17, 2021 17:24
@jmacd (Contributor, Author) commented Sep 17, 2021

Thank you to @yzhuge, @oertl, @beorn7, @giltene and the OpenTelemetry community who contributed to this design. 🚀
