Problem Statement
OM proto is not currently adopted (the Prometheus libraries and the main binary are not aware of it).
The Prometheus ecosystem still uses and invests in the Prometheus proto, although there was a past attempt to deprecate it (proto3 version). It is currently on its way to becoming the default scrape configuration (it is already the default for native histograms and a number of other feature flags).
Given that, it is not clear whether, as part of the OM 2.0 WG, we should continue with OM proto, improve it, or remove it from OM completely and recommend the existing Prometheus proto. Note that this topic is separate from the OM text format, which is the main focus of OM 2.0.
OM Proto vs Prometheus Proto
The protocols are pretty similar: both use a similar MetricFamily abstraction and have similar gauge, counter, histogram and summary structures. They do differ in a few ways, though:
OM Proto:
- Uses a repeated proto oneof for `Metric.MetricPoint` on the already repeated `Metric`. The `repeated` part is interesting because it potentially encourages sending multiple points (e.g. historical ones too), not only current values; not sure if that was intended.
- Every value can be either double or int.
- Defines `MetricSet`, which blocks major optimizations possible with the PrometheusProto delimited format.
- Uses a non-recommended package name format (nit).
- Lacks native histogram support.
- Is versioned.
Prometheus Proto:
- Uses a non-repeated "implicit oneof" defined directly in `Metric` for each metric value.
- Every value has to be double.
- Supports native histograms.
- Uses the delimited format, which allows sending each metric family in a separate message and so enables streaming parsers.
- Misses the `Info` and `StateSet` MetricTypes (both are interpreted as gauges in Prometheus as of now).
- Has inconsistent timestamps. Some messages use `google.protobuf.Timestamp timestamp = 3; // OpenMetrics-style.`, some use `int64 timestamp_ms = 6;`. The latter is easier (and faster) to use, but `0` means not set, which blocks the use of the exact 0 millisecond timestamp (implicitly accepted in many places in Prometheus, e.g. Remote Write).
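The timestamp ambiguity in the last point can be shown with a minimal sketch. It assumes a scalar `timestamp_ms` field whose zero value doubles as "not set", as described above; the `Sample` class and `explicit_timestamp` helper are hypothetical names used only for illustration, not part of any Prometheus library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical stand-in for a message with a scalar timestamp field."""
    value: float
    timestamp_ms: int = 0  # assumption: 0 is the default and also means "not set"


def explicit_timestamp(sample: Sample) -> Optional[int]:
    """Return the timestamp, or None when the field looks unset."""
    # The reader cannot tell "unset" apart from "exactly the Unix epoch".
    return sample.timestamp_ms if sample.timestamp_ms != 0 else None


no_timestamp = Sample(value=1.0)                  # sender left the field out
at_epoch = Sample(value=1.0, timestamp_ms=0)      # sender meant 0 ms exactly
later = Sample(value=1.0, timestamp_ms=1_700_000_000_000)

print(explicit_timestamp(no_timestamp))  # None
print(explicit_timestamp(at_epoch))      # None as well: the 0 ms sample is lost
print(explicit_timestamp(later))         # 1700000000000
```

A message-typed timestamp avoids this because presence is tracked separately from the value, at the cost of an extra nested message per sample.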
To sum up, PrometheusProto is closer to what Prometheus implements now, including native histograms. It also enables somewhat more efficient parsing. On the other hand, OM Proto is consistent with the OM 1.0 types and makes it a bit easier (?) to send historical samples for the same series. OM proto is also strictly versioned (read below for why that is important).
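To illustrate why a delimited format helps streaming parsers, here is a rough sketch of varint length-prefixed framing using only the Python standard library. The payloads are opaque byte strings standing in for serialized MetricFamily messages (an assumption made to keep the example free of generated protobuf code), and the function names are hypothetical.

```python
import io
from typing import BinaryIO, Iterator, Optional


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one varint; return None on a clean end of stream."""
    shift = 0
    result = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write one message as <varint length><payload>."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield messages one at a time, without buffering the whole body."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated message")
        yield payload


# Placeholder payloads; real ones would be serialized MetricFamily messages.
families = [b"family-one", b"x" * 300, b""]
buf = io.BytesIO()
for family in families:
    write_delimited(buf, family)
buf.seek(0)
decoded = list(read_delimited(buf))
```

With a single wrapping message such as `MetricSet`, a standard decoder typically needs the full body before it can hand out the first family, whereas the framing above lets a consumer process and discard each family as it arrives.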
Protobuf versioning
During WG discussions, a point was made about protobuf versioning: that it does not need strict minor/patch versioning, because we can make a lot of changes without breaking users or user interaction.
I would argue that, in the world of data-heavy network protocols like OM or Remote Write, that is not true in practice. Generally, we need to use the same versioning structure as for the text format.
Examples:
- We add a `schemaURL` attribute to MetricFamily one day. Adding a field with this new information is not a breaking change. However, without a concrete minor version bump, this change won't be well announced. The same applies if our text format makes skipping unknown lines a MUST.
- The addition of the Info and StateSet metric types to Prometheus Proto. One could say it's not a breaking change. Normally, adding fields to protobuf is not breaking, and in terms of protocol correctness it's true that it will not crash encoding/decoding. However, such a change is in practice semantically breaking, because when an SDK/client upgrades and starts to generate a MetricFamily for e.g. the `Info` type, it has to decide where to put it: (a) as the new `Info` type, (b) as the old `Gauge` type, deprecated for info metrics, or (c) both. To not break users it would need to be (c), but that is not practical for complexity and efficiency reasons (duplicated data that does not compress easily is sent over the network, and duplicates have to be detected on parse).
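The three options in the second example can be sketched as follows. This is a toy model, not a real encoder: metric families are plain dictionaries, `encode_info` and `old_consumer` are hypothetical names, and the older consumer is assumed to skip families whose type it does not recognize.

```python
from enum import Enum
from typing import Dict, List


class InfoEncoding(Enum):
    NEW_TYPE = "a"  # emit only the new Info type
    GAUGE = "b"     # emit only the old Gauge representation
    BOTH = "c"      # emit both to keep every consumer working


def encode_info(name: str, labels: Dict[str, str], mode: InfoEncoding) -> List[dict]:
    """Build the families an upgraded SDK would send for one info metric."""
    families = []
    if mode in (InfoEncoding.NEW_TYPE, InfoEncoding.BOTH):
        families.append({"name": name, "type": "INFO", "labels": labels})
    if mode in (InfoEncoding.GAUGE, InfoEncoding.BOTH):
        families.append({"name": name, "type": "GAUGE", "labels": labels, "value": 1.0})
    return families


# Assumption: a consumer built before Info existed knows only these types.
OLD_CONSUMER_TYPES = {"COUNTER", "GAUGE", "SUMMARY", "HISTOGRAM"}


def old_consumer(families: List[dict]) -> List[str]:
    """Names of the families an older consumer would ingest."""
    return [f["name"] for f in families if f["type"] in OLD_CONSUMER_TYPES]


labels = {"version": "1.2.3"}
only_new = encode_info("build_info", labels, InfoEncoding.NEW_TYPE)
only_gauge = encode_info("build_info", labels, InfoEncoding.GAUGE)
both = encode_info("build_info", labels, InfoEncoding.BOTH)

print(old_consumer(only_new))    # []: the metric silently disappears
print(old_consumer(only_gauge))  # ['build_info']
print(len(both))                 # 2: the same data is sent twice
```

Option (a) decodes without errors yet drops data for older consumers, which is the sense in which the change is semantically breaking even though it is wire-compatible.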
To sum up, some versioning and content negotiation might be needed for protobuf protocols as well.
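One way content negotiation could look is sketched below. The protobuf media type and its `proto` and `encoding` parameters match what Prometheus sends in its `Accept` header today; the `version` parameter on that media type, its assumed default of `1.0.0`, and the `negotiate` helper are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional, Tuple

PROTO_TYPE = "application/vnd.google.protobuf"


def parse_accept(header: str) -> List[Tuple[float, str, Dict[str, str]]]:
    """Split an Accept header into (q, media type, params), highest q first."""
    offers = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        params = {}
        for part in parts[1:]:
            key, _, value = part.partition("=")
            params[key.strip().lower()] = value.strip()
        q = float(params.pop("q", "1"))
        offers.append((q, media_type, params))
    # Python's sort is stable, so equal q values keep their header order.
    offers.sort(key=lambda offer: offer[0], reverse=True)
    return offers


def negotiate(header: str, supported: List[str]) -> Optional[str]:
    """Pick the first acceptable protobuf version the server supports."""
    for q, media_type, params in parse_accept(header):
        if q <= 0 or media_type != PROTO_TYPE:
            continue
        # Hypothetical: a missing version parameter is treated as 1.0.0.
        version = params.get("version", "1.0.0")
        if version in supported:
            return version
    return None


base = PROTO_TYPE + ";proto=io.prometheus.client.MetricFamily;encoding=delimited"
accept = base + ";version=1.1.0;q=0.6," + base + ";version=1.0.0;q=0.5"

print(negotiate(accept, ["1.0.0", "1.1.0"]))  # 1.1.0
print(negotiate(accept, ["1.0.0"]))           # 1.0.0
print(negotiate("text/plain;version=0.0.4", ["1.0.0"]))  # None
```

With something along these lines, a scraper could advertise that it understands a newer proto revision, and an SDK could choose a single representation per metric instead of sending duplicates.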
Proposed solution
Implementing Protobuf support efficiently was a big task; PrometheusProto enables streaming and is already adopted. There are also not many differences from OM Proto that would motivate the ecosystem to adopt OM proto.
Perhaps the best course of action would be:
- Deprecate the OM 1.0 Proto.
- Release the OM 2.0 without Protobuf schema.
- Release an official versioned spec document (1.0/0.1?) for PrometheusProto (in the prometheus.io docs) and iterate on it (e.g. 1.0/1.1/2.0, with OM types at some point and a decision about 0 timestamps). Put the proto in one official place (prometheus/prometheus and the buf registry) and remove the gogo parts (doable now with the new custom parser).
Pros:
- Allowing separate versioning/lifetime for text vs proto (also a downside, maybe consistency is useful).
- Iterating on the adopted protocol instead of on an unused one, which would risk lower adoption in the future.
- No need to reimplement parsers.
- The most efficient option and we know even existing proto parsing has a lot of overhead (until we fix magic suffixes).
- Clear state of PrometheusProto.
- Less work?
Cons:
- Losing the "OM" badge for the protobuf protocol, although OM has been part of Prometheus since last year.
- Inconsistency between OM 1.0 and 2.0.
- Impacting existing OM 1.0 Proto users (we don't know of any, but there might be some).
Alternatives considered
- Iterate on OM Proto 1.0 in OM 2.0, deprecate PrometheusProto.
We could add native histograms in OM 2.0. For efficiency, we could introduce the delimited format. But then we would more or less reimplement PrometheusProto under the OM umbrella, which is the Prometheus umbrella now. Perhaps not worth it?
Iterating on the adopted protocol feels better for the ecosystem too.
- Develop a completely new OM Proto 2.0 in OM 2.0, deprecate PrometheusProto.
Interesting, but do we have the resources for this? The only benefit I see is the opportunity to rethink the "MetricFamily" concept, which does not exist (and does not make sense) in Prometheus. That would only be a readability improvement, nothing more 🤔
- Deprecate all proto protocols
At some point that was the intention. However, protobuf was useful for experiments (for the last few years it has been the only protocol with practical native histogram support), and it is likely to be more efficient once Prometheus switches to complex types and we finalize the gogo/custom generator aspect.