Conversation
The alternative, which uses one metric instrument per producer outcome
and one metric instrument per consumer outcome, has known
difficulties. To define a ratio between any one outcome and the total
requires a metric formula defined by all the outcomes. On other hand,
Suggested change:

requires a metric formula defined by all the outcomes. On other hand,
requires a metric formula defined by all the outcomes. On the other hand,
#### Producer and Consumer instruments

We choose to specify two metric instruments for use counting outcomes, |
Not sure if this is what was intended but this doesn't read right to me.
Suggested change:

We choose to specify two metric instruments for use counting outcomes,
We choose to specify two metric instruments for use in counting outcomes,
- `otelcol_consumed_items`: Received and inserted data items (Collector)
- `otelcol_produced_items`: Exported, dropped, and discarded items (Collector)
The producer/consumer terminology makes these definitions a bit confusing for me. Intuitively I would expect inserted items to be a producer behavior, and dropped/discarded items to be a consumer behavior.
Yeah -- I had this same realization, the terms feel ambiguous.
How would you feel about `otelcol_input_items` and `otelcol_output_items`?
Much clearer
consumer outcomes. In an ideal pipeline, a conservation rule exists
between what goes in (i.e., is consumed) and what goes out (i.e., is
produced). The use of producer and consumer metric instruments is
designed to enable this form of consistency check. When the pipeline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an accounting perspective, I see why we would want to group received + inserted items (so that this total matches exported + dropped + discarded). But the language here is difficult to reconcile with the external vs internal nature of the operations.
Taking a step back, I agree with the categories you've identified (received, exported, inserted, discarded, dropped), but there are several ways to organize them. This proposal organizes the categories in terms of incremental (received, inserted) vs decremental (discarded, dropped, exported) because it gives us the desirable property that the two instruments should be equal. However, I wonder if these same categories can be modeled in a different way while still giving us the ability to check consistency.
Would it be enough that all categories should sum to 0 by subtracting the decremental operations from the incremental ones? Organized according to real data flow, it would be `received - discarded + inserted - (dropped + exported) = 0`. I think that by separating the incremental from the decremental, it allows this to work for backends, but alternately, could we require that the decremental categories are reported as negative numbers within the same instrument? To me this seems more intuitive but I'm not sure all backends can handle this.
I'm happy with the equation `received - discarded + inserted - (dropped + exported) = 0`.
I don't think I see a difference between `received` and `inserted`. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, and could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the `otel.component` attribute, is that enough to distinguish received and inserted?

My thinking, in creating a `discarded` and `dropped` designation specifically, was to have enough decomposition in the data that you could perform the equation as you wrote it, meaning to count received (receivers), subtract discarded, add received (processors), subtract dropped, leaving exported, which is the thing you'll compare with the next segment, potentially.
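Spelled out as a running tally, that decomposition is (a restatement of the same equation, not a new rule):

```
Received(receivers) - Discarded + Inserted - Dropped = Exported
```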
Continuing --
Your suggestion about negative-values, as opposed to the positive-only expression I've used, brings to mind several related topics. I think this is the "best" way to do it from a metrics data model perspective, but I want to point out other ways we can map these metric events.
Consider that each item of telemetry entering the pipeline has with it an associated trace context. There is:
a. The UpDownCounter formulation -- for every item arriving, add 1; for every item departing, subtract 1. This can tell us the number of items for attribute sets that are symmetric. If we add one for every item that is input/consumed, then subtract one for every item that is output/produced, the resulting tally is a number of in-flight items, but this mapping has to ignore the outcome/success labels for the +1/-1 to balance out.
b. The Span formulation -- when the receiver starts a new request (or the processor inserts some new data), there is an effective span start event (or a log about the arrival of some telemetry) for some items of telemetry. When the outcome is known for those points (having called the follower), there is a span finish event which can be annotated w/ the subtotal for each outcome/success matching the number of items consumed.
c. The LogRecord formulation -- (same as span formula, but one log record per event, vs span start/end events).
I'm afraid to keep adding text to the document, but I would go further with the above suggestions. If we are using metrics to monitor the health of all the SDKs, then we will be missing a signal when the metrics SDK itself is failing. I want the metrics SDK to have a span encapsulating each export operation.
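To make the UpDownCounter formulation (a) concrete, here is a minimal Go sketch against the `go.opentelemetry.io/otel` metric API. The meter name, instrument name, and attribute are illustrative assumptions, not part of this proposal:

```
package pipelinemetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// inflightTracker demonstrates formulation (a): +1 for every item
// entering the pipeline, -1 when its outcome is known. The running
// sum is the number of in-flight items.
type inflightTracker struct {
	inflight metric.Int64UpDownCounter
	attrs    metric.MeasurementOption
}

func newInflightTracker(component string) (*inflightTracker, error) {
	meter := otel.Meter("otelcol/pipeline")
	ctr, err := meter.Int64UpDownCounter("otelcol_inflight_items")
	if err != nil {
		return nil, err
	}
	return &inflightTracker{
		inflight: ctr,
		attrs:    metric.WithAttributes(attribute.String("otel.component", component)),
	}, nil
}

// onConsume records items arriving (received or inserted).
func (t *inflightTracker) onConsume(ctx context.Context, items int64) {
	t.inflight.Add(ctx, items, t.attrs)
}

// onProduce records items departing (exported, dropped, or discarded).
// Note there is no outcome attribute: the +1/-1 pairs only balance if
// the attribute sets are symmetric, as discussed above.
func (t *inflightTracker) onProduce(ctx context.Context, items int64) {
	t.inflight.Add(ctx, -items, t.attrs)
}
```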
> I don't think I see a difference between received and inserted. If the telemetry has the component name, it'll be clear whether it was a processor or a receiver, and could be just a semantic question. If we added another attribute to identify the kind of component, or required it to be included in the otel.component attribute, is that enough to distinguish received and inserted?

Looks like I missed an important part of the design: processors are responsible for counting items only when the number changes while passing through a processor.

I was thinking that we should report "received" and "exported" for processors in order to account for situations where data streams are merged. For example, a collector pipeline with two receivers will combine streams into the first processor, so from that processor's perspective it seems important to report the total "received". Likewise, similar problems could arise from receivers or exporters used in multiple pipelines.
To use a concrete example:

    pipelines:
      logs/1:
        receivers: [R1, R2]
        processors: [P1]
        exporters: [E1, E2]
      logs/2:
        receivers: [R1]
        processors: [P2]
        exporters: [E1]
| component | received | discarded | inserted | dropped | exported |
|---|---|---|---|---|---|
| R1 | 10 | - | - | - | - |
| R2 | 20 | - | - | - | - |
| P1 | 30 | 25 | 0 | - | 5 |
| P2 | 10 | 10 | 2 | - | 2 |
| E1 | - | - | - | 0 | 7 |
| E2 | - | - | - | 0 | 5 |
In this example, it seems much easier to understand what's going on with `P1` when it reports receiving 30.
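(A quick check: both pipelines in this example balance under the `received - discarded + inserted - (dropped + exported) = 0` equation from earlier in this thread:)

```
logs/1: 30 received - 25 discarded + 0 inserted - 5 exported = 0
logs/2: 10 received - 10 discarded + 2 inserted - 2 exported = 0
```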
My earlier design for this proposal included what you're suggesting -- the idea that every processor in the pipeline will independently report complete totals. I think this is excessive, since there is a lot of redundancy, but the problem can be framed this way. In fact, the current design can be applied the way you describe by a simple redefinition rule -- if you consider a pipeline segment to be an individual receiver, an individual processor, or an individual exporter, you'll get the metrics you're expecting. I think this might even be appropriate in complex pipelines.
The defect I'm aware of, when each processor counts independent totals, is that it becomes easy to aggregate adjacent pipeline segments together, which results in overcounting from a pipeline perspective. This is not a problem unique to processor metrics -- the problem arises when a metric query aggregates more than one collector belonging to the same pipeline, or more than one exporter, or more than one processor. My goal is to make it easy to write queries that encompass whole pipeline segments without overcounting.
In my current proposal, if you aggregate the total for `otelcol_consumed_items`, grouping by all attributes to a single total, the result will be the number of collector pipeline segments times the number of items. If you restrict your query to one segment (meaning one pipeline and one collector), then the aggregate equals the number of items. This property holds because each segment has one exporter and one receiver.
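Schematically (with hypothetical attribute names for the segment identity):

```
sum(otelcol_consumed_items)                              == items * segments
sum(otelcol_consumed_items{pipeline="P", collector="C"}) == items
```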
Since there are multiple processors in a pipeline segment, if each processor counts a total, then the aggregate for that segment will equal the number of processors times the number of items, which is not a useful measure to compare against adjacent pipeline segments. When each processor reports a total, you have to aggregate down to an individual processor to understand its behavior. But then, the logic to check whether the receiver and exporter are consistent, given processor behavior, becomes complicated at best -- the aggregation would have to filter the `dropped` and `discarded` categories from the processor metrics, and then we'd be able to recover the pipeline equations in this proposal.
This is why I ended up proposing that processors count changes in item count, because the changes in item count aggregate correctly despite multiple processors.
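A minimal Go sketch of that rule, assuming `consumed` and `produced` are the two counters from this proposal and using a hypothetical `otel.outcome` attribute for the category:

```
package pipelinemetrics

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// reportProcessorDelta applies the "count only changes" rule: a processor
// records inserted items when it adds data and discarded items when it
// removes data. When the item count is unchanged it records nothing, so
// per-segment totals are not inflated by the number of processors.
func reportProcessorDelta(ctx context.Context, consumed, produced metric.Int64Counter, before, after int64) {
	switch {
	case after > before:
		// Items added by the processor count on the consumer side.
		consumed.Add(ctx, after-before,
			metric.WithAttributes(attribute.String("otel.outcome", "inserted")))
	case after < before:
		// Items removed by the processor count on the producer side.
		produced.Add(ctx, before-after,
			metric.WithAttributes(attribute.String("otel.outcome", "discarded")))
	}
}
```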
Thanks for explaining further. The tradeoffs are tough here, but if we're defining a segment as having only one receiver and one exporter, it excludes a large percentage (maybe a substantial majority?) of collector configurations. Even in a simple pipeline like the one below, change counts for `P1` have little meaning.
    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]
Question about the example, specifically.
Why are there two paths between R1 and E1? This fact will make it difficult to monitor the pipeline, because it appears to double the input on purpose. The pipeline equations will show this happening, but it will be up to interpretation to say whether it's on purpose or not.
The way I would monitor the setup in your example is to compute all the paths for which I expect the conservation rule to hold. They are:
    (R1 + R2) -> P1 -> E1
    (R1 + R2) -> P1 -> E2
    R1 -> P2 -> E1
Since two paths lead to E1, the pipeline equations have to be combined. For E1, the equation will include a factor of 2 for R1.
    2*Received(R1) + Received(R2) = Dropped(P1) + Dropped(P2) + Exported(E1)
This kind of calculation can be automated and derived from the metrics I'm proposing, if you have the graph. I mean, if you want to know that P1 received 30 items of telemetry, just add R1 and R2's consumed item totals, that should be easy.
> we're defining a segment as having only one receiver and one exporter
This is an interesting statement -- I've definitely not been clear on this topic. I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds. The sum of all receivers reaches every exporter, and that is a pipeline segment, so your second example,
    receivers: [R1, R2]
    processors: [P1]
    exporters: [E1]
is exactly the kind of simple pipeline segment that will be easy to monitor, and it will be easy to monitor even if it has a bunch of processors too.
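For that configuration, the segment equations from this proposal reduce to (a sketch):

```
Consumed(Segment) == Received(R1) + Received(R2) + Inserted(P1)
Produced(Segment) == Discarded(P1) + Dropped(E1) + Exported(E1)
Consumed(Segment) == Produced(Segment)
```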
> Why are there two paths between R1 and E1?
I agree it's likely not useful. It's a contrived example but I wanted to include the full set of possible stream merges and fan outs:
- single pipeline concerns
  - merge before first processor
  - fan out after last processor
- inter-pipeline concerns
  - fan out after receiver shared by multiple pipelines
  - merge before exporter shared by multiple pipelines
> (R1 + R2) -> P1 -> E1
> (R1 + R2) -> P1 -> E2

I think this is perhaps where I'm getting tripped up. Could we define a segment as being able to have more than one receiver? This still aggregates correctly. I see why we cannot include multiple exporters, because data is fanned out within the segment, but the fanout that occurs when a receiver is shared between pipelines does not affect the counts for an individual pipeline.
> I didn't mean to say one receiver and one exporter. I meant all receivers and one exporter, because that's where the conservation principle holds.
I commented before seeing this but I see we arrived at the same conclusion. 👍
Thanks for re-writing this @jmacd, just a few comments. Will the diagram included in this PR be updated to represent the concepts of producers/consumers?
## Explanation

This document proposes two metric instrument semantics threefour |
is it three or four?
Trying to say I've only defined two semantics, consumed and produced. Then, I prefix the SDK or Collector part to make 4 logical metric instruments, but then I exclude one (reasons stated in "SDK-specific considerations"), leaving three.
I believe @codeboten is referring to the `threefour` on line 16.
Pipeline components included in this specification are:

- OpenTelemetry SDKs: As telemetry producers, these components are the
start of a pipeline. These components also
missing end of the sentence
The first equation:

```
Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
```
Suggested change:

Consumed(Segment) == Recieved(Segment) + Inserted(Segment)
Consumed(Segment) == Received(Segment) + Inserted(Segment)
The producer categories, leading to the second pipeline segment equation:

- **Exported**: An attempt was made to export the telemetry to a following pipeline segment
is this an attempt or rather the data was successfully exported to the follower?
An attempt. When the attempt is made, there is at least some expectation that the next pipeline segment has seen the data. Exported includes success and failed cases, and I'm not sure how I can change the words to improve this understanding. I mean to count cases where an RPC was made, essentially, whether it fails or not, because it sets up our expectation for the next segment.
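Put another way, using the `otel.success` attribute proposed below:

```
Exported == Exported{otel.success=true} + Exported{otel.success=false}
```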
So, just to be clear, which metric do we use for an exporter that failed to even establish a connection to a downstream receiver?

For example, if I configure the collector with an OTLP exporter with a bad endpoint, and the HTTP/gRPC connection cannot be made, the export will "fail" but there is no expectation that any following receiver will ever see the data (so we won't count it).

It seems `Exported` doesn't fit here by your definition. Would it be `Dropped`?
> It seems Exported doesn't fit here by your definition. Would it be Dropped?
Yes
possible to verify this and warn about improper accounting during
shutdown.

These equations allow are useful in the abstract, because , without ordering
Suggested change:

These equations allow are useful in the abstract, because , without ordering
These equations allow are useful in the abstract, without ordering
@jmacd thanks for working on this. After yesterday's collector SIG meeting it is important to move this work forward so we can get to a stable semantic convention the collector can rely on so we can sort out its metric names once and for all. Let me know how I can help.
- `otelcol_consumed_items`: The number of items received or inserted into a pipeline.
- `otelcol_produced_items`: The number of items discarded, dropped, or exported by a Collector pipeline segment.
- `otelsdk_produced_items`: The number of items discarded, dropped, or exported by a SDK pipeline segment.
If `otelcol` and `otelsdk` are namespacing these metrics, should the names be:

- `otelcol.consumed_items`
- `otelcol.produced_items`
- `otelsdk.produced_items`
### Recommended conventional attributes

- `otel.success` (boolean): This is true or false depending on whether the |
Is `otel` being used to namespace these attributes so they wouldn't conflict with other attribute names? I think we should add some more clarity in the name to make it clear these are attributes of an otel pipeline: how do you feel about the `otel.pipeline.` prefix?
@kristinapathak is taking over this work from me. (I thought that I had already stated this!)
## Explanation

This document proposes two metric instrument semantics threefour |
Suggested change:

This document proposes two metric instrument semantics threefour
This document proposes two metric instrument semantics three
An arrangement of pipeline components acting as a single unit, such as
implemented by the OpenTelemetry Collector, is called a segment. Each
segment consists of a receiver, zero or more processors, and an
exporter. The terms "following" and "preceding" apply to pipeline
If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple, logical `segments`?

How about when the `routingconnector` is used? Will this be multiple `segments` contained within a single Collector instance?
@0x006EA1E5, I would be happy to continue this discussion on this new PR, but my short response is:

> If an OTel Collector pipeline is configured with more than one receiver / exporter, is this then considered to be multiple, logical segments?

Yes! A single Collector pipeline can have multiple segments.

> How about when the routingconnector is used? Will this be multiple segments contained within a single Collector instance?
My new PR includes an example with the spanmetrics connector, but the short answer is also yes. 🙂 A connector is both the end of one segment and the start of the following one. I'm not as familiar with the routing connector so will look into it more to get a better understanding. It looks like it would be a good example to include.
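A rough sketch of the connector case, assuming a connector C that forwards items unchanged: C ends segment N and starts segment N+1, so the boundary condition would be

```
Exported(Segment N, C) == Received(Segment N+1, C)
```

(Connectors that transform data, like spanmetrics, would break this simple equality, which is part of why they make good examples.)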
pipeline. The preceding component ("preceder") produces data that is
consumed by the following component ("follower").

An arrangement of pipeline components acting as a single unit, such as
Is the intention that there will be similar `otelcol_*_items` metrics for the segments as well as the components? It's not clear to me how these two concepts apply here.

When it comes to "data loss", I am often more interested in the network boundary between "segments", e.g., when using the loadbalancingexporter to route to a following Collector instance.

Currently, I compare the component-level `loadbalancingexporter` and following `otlpreceiver` metrics to try to understand data loss, but really what I care about is the segment-level view.
@0x006EA1E5, I'm working on writing out more details on data loss between segments. Here is my current scribble that looks at how a resource exhausted response would look.
Closing in favor of #259.
Derived from #238. Ready for interested reviewers.
Still needs examples.
Diagram needs to be updated.