open-telemetry · jmacd · Oct 28, 2023 · Oct 31, 2023 · Oct 31, 2023 · Dec 15, 2023
commit 38fbc0883a14ba5cd8a7322b4886e7143634aac0
diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
@@ -4,30 +4,127 @@ Propose a uniform standard for telemetry pipeline metrics generated by
 OpenTelemetry SDKs and Collectors with support for several levels of
 detail.
 
-**WIP**: This document has been edited recently, based on reviewer
-feedback.  Since it has changed substantially, I removed a lot of
-text.  I will restore this document after sharing the revisions with
-reviewers.
-
 ## Motivation
 
 OpenTelemetry desires to standardize conventions for the metrics
 emitted by SDKs about success and failure of telemetry reporting. At
-the same time, the OpenTelemetry Collector is becoming a stable and
-critical part of the ecosystem, and it has existing conventions which
-are expected to connect with metrics emitted by SDKs.
+the same time, the OpenTelemetry Collector has existing conventions
+which are expected to connect with metrics emitted by SDKs and have
+similar definitions.
+
+## Explanation
 
 We use the term "pipeline" to describe an arrangement of system
 components which produce, consume, and process telemetry on its way
-from the point of origin to the endpoint(s) in its journey.
-
-## Explanation
+from the point of origin to the endpoint(s) in its journey.  Pipeline
+components included in this specification are:
+
+- OpenTelemetry SDKs: As telemetry producers, these components are the
+  start of a pipeline.
+- OpenTelemetry Collectors: The OpenTelemetry collector contains an
+  arrangement of components which may act as both consumers and
+  producers.
+
+The term "ahead of", in reference to a pipeline, refers to the
+component or chain of components that produced data being consumed by
+the component in question.  The "next stage" of the pipeline refers to
+the component in the pipeline that consumes the data produced by the
+component in question.
 
 ### Detailed design
 
-The proposed metric instrument would be named distinctly depending on
+#### Pipeline outcome is not response status
+
+In this specification, it is important to recognize that the outcome
+registered by a component in its pipeline metrics does not necessarily
+match the response code that it returns from Export to the component
+ahead of it in the pipeline.
+
+For example, a memory-limiter component may drop data and return a
+resource-exhausted status to the component ahead of it, a receiver.
+The receiver will indicate `exhausted`, while the memory limiter will
+indicate `dropped` because it was the origin of the exhausted
+condition.
+
+For example, an exporter involved in a fan-out arrangement may be
+configured to suppress errors, aware that the producer will see an
+error if any of the fanned-out components returns an error.  The
+exporter will be configured to return success immediately and later,
+depending on the actual outcome, may use one of the `deferred:` outcomes
+to report failures which the component ahead of it did not see.
+
+#### Single metric vs. Many
+
+We choose to specify a single metric instrument for counting
+outcomes (with metric attributes), as opposed to the use of
+per-outcome metric instruments (without metric attributes).
+
+The alternative, which uses one metric instrument per outcome, has
+known difficulties.  To define a ratio between any one outcome and the
+total requires a metric formula defined by all the outcomes.  On the other
+hand, it is common practice using OpenTelemetry metrics to aggregate
+by attribute.  It is possible and convenient, when a single metric
+instrument is used, to define ratios and build area charts from a
+single metric instrument.
+
+The use of exclusive counters, one per outcome, is also logically
+confusing.  Existing OpenTelemetry collector metrics for exporters
+have both `sent` and `send_failed` metrics.  A user could easily
+believe that the failure ratio is defined as `send_failed / sent`,
+since (logically) something has to be sent before the send can fail.
+The correct failure ratio, using exclusive counters, is `send_failed /
+(sent + send_failed)`, but from experience, users can easily miss this
+detail.  Moreover, when exclusive counters have been defined in this
+manner, it is impossible to define new outcomes without updating
+every formula.
+
+#### Distinct names for SDKs and Collectors
+
+Because these run in the same place.
+
+#### Pipeline equations: Consumed = Dropped-or-Discarded + Produced
+
+Dropped and discarded are special
+
+#### Timeout is special
+
+Prefer timeout to dropped, e.g., you may drop because timeout expired
+on arrival.  Call this timeout, not dropped.
+
+#### Resource-exhausted is special
+
+Be specific about this one; it impacts SLOs.  Do not apply to "this"
+component.
+
+#### Deferred outcomes are special
+
+Important because users are blind to these outcomes, so "terminal" in
+a sense.
+
+#### Coalescing Processor metrics
+
+The definitions for the `discarded`, `dropped` (and `deferred:dropped`)
+outcomes are special because they are terminal in the pipeline.  When
+an item is discarded or dropped, only the component counts these
+outcomes, while components ahead of this component will see success or
+a retryable response code.
+
+The normal behavior of a processor component in the OpenTelemetry
+Collector, except when it decides to drop or discard, is to pass
+telemetry through to the next stage in the pipeline.  It would lead to
+substantial redundancy for a sequence of processors to individually
+count pass-through outcomes, since for outcomes other than `discarded`, `dropped`, and `deferred:dropped` 
+
+Considering a sequence of adjacent components
+
+[WIP]
+
+The proposed metric instruments are named differently, depending on
 whether it is a collector or an SDK, to prevent accidental aggregation
-of these timeseries.  The specified counter names would be:
+of these timeseries.
+
+
+The specified counter names are:
 
 - `otelsdk.producer.items`: count of successful and failed items of
   telemetry produced, by signal type, by an OpenTelemetry SDK.
@@ -36,7 +133,7 @@ of these timeseries.  The specified counter names would be:
   receiver component.
 - `otelcol.processor.items`: count of successful and failed items of
   telemetry processed, by signal type, by an OpenTelemetry Collector
-  receiver component.
+  receiver component.  Two mode alternatives are specified, see below.
 - `otelcol.exporter.items`: count of successful and failed items of
   telemetry processed, by signal type, by an OpenTelemetry Collector
   receiver component.
@@ -49,8 +146,9 @@ of these timeseries.  The specified counter names would be:
   way than `otel.success`, with recommended values specified below.
 - `otel.signal` (string): This is the name of the signal (e.g., "logs",
   "metrics", "traces")
-- `otel.name` (string): Name of the component in a pipeline.
-- `otel.pipeline` (string): Name of the pipeline in a collector.
+- `otel.component` (string): Name of the component in a pipeline.
+- `otel.pipeline` 
+(string): Name of the pipeline in a collector.
 
 ### Specified `otel.outcome` attribute values
 
@@ -64,64 +162,66 @@ For success=true:
 
 - `accepted`: Indicates a normal, synchronous request success case.
   The item was consumed by the next stage of the pipeline, which
-  returned success.  Note the item could have been suppressed by a
+  returned success.  Note the item could have been deferred by a
   subsequent component, but as far as this component knows, the 
   request successful.
-- `suppressed:<any other outcome>`: When the true
-  outcome is not known at the time of counting, and the compnent
-  intentionally returns success to its producer.  Examples are given
-  below.
-
-For both success=true and success=false, there is a special outcome
-indicating items did not reach the next stage in the pipeline,
-considered "dropped".  When comparing pipeline metrics from one stage
-to the next, those which are dropped by a component are expected not
-to appear in totals of the subequent pipeline.
-
-- `dropped`: Processors may use this to indicate both success and
-  failure, for example include sampling processors and filtering
-  processors, which successfully avoid sending data based on
-  configuration.  For all components, dropped with success=false
-  indicates that the component introduced an original failure and did
-  not send to the next stage in the pipeline.
-
-For success=false, transient and potentially retryable:
-
-- `deadline_exceeded`: The item was in the process of being sent but the request
-  timed out, or its deadline was exceeded.
-- `resource_exhausted`: The item was handled by the next stage of the
-  pipeline, which returned an error code indicating that it was
-  overloaded.  If the resource being exhausted is local and the item
-  was not handled by the next stage of the pipeline, use `dropped`.
+- `discarded`: Indicates a successful outcome in which the next stage
+  of the pipeline does not handle the event, as by a sampling
+  processor.
+- `deferred:<failure outcome>`: Deferred cases are where the
+  caller receives a success response and the true outcome is failure,
+  but this is not known until later.  The item is counted as
+  `deferred:` combined with the failure outcome that would otherwise
+  have been counted.
+
+For success=false, transient and potentially retryable cases:
+
+- `dropped`: The component introduced an original failure and did not
+  send to the next stage in the pipeline.
+- `timeout`: The item was in the process of being sent but the request
+  timed out, or its deadline was exceeded.  In this case, it is
+  undetermined whether the consuming pipeline saw the item or not.
+- `exhausted`: The item was handled by the next stage of the pipeline,
+  which returned an error code indicating that it was overloaded.  If
+  the resource being exhausted is local and the item was not handled
+  by the next stage of the pipeline, record the item `dropped` and
+  return a resource-exhausted status code to the producer, who will
+  record an `exhausted` outcome.
 - `retryable`: The item was handled by the next stage of the pipeline,
   which returned a retryable error status not covered by any of the
   above values.
 
-For success=false, permanent category:
+For success=false, permanent cases:
 
 - `rejected`: The item was handled by the next stage of the pipeline,
   which returned a permanent error status or partial success status
   indicating that some items could not be accepted.
+- `unknown`: May be used when the component is suppressing errors and
+  not actually counting successes and failures.  As a special case,
+  the outcome `deferred:unknown` indicates that a success response 
+  was given and no information about the actual outcome is available.
+
+##
+
 
 
 #### Success, Outcome matrix
 
-| Caller Success | Metrics Success | Outcome                      | Meaning                                                           |
-|----------------|-----------------|------------------------------|-------------------------------------------------------------------|
-| true           | true            | accepted                     | Send succeeded (synchronous or not)                               |
-| true           | true            | dropped                      | Dropped by intention                                              |
-| false          | false           | dropped                      | Producer saw the component return failure, request was not sent   |
-| false          | false           | deadline_exceeded            | Producer saw the component return failure, request timed out      |
-| false          | false           | resource_exhausted           | Producer saw the component return failure, insufficient resources |
-| false          | false           | retryable                    | Producer saw the component return other non-permanent condition   |
-| false          | false           | rejected                     | Producer saw the component return a permanent condition           |
-| true           | false           | supressed:accepted           | Producer saw success; eventually accepted                         |
-| true           | false           | supressed:dropped            | Producer saw success; request was not sent                        |
-| true           | false           | supressed:deadline_exceeded  | Producer saw success; request sent, timed out                     |
-| true           | false           | supressed:resource_exhausted | Producer saw success; request sent, insufficient resources        |
-| true           | false           | supressed:retryable          | Producer saw success; request sent, other non-permanent condition |
-| true           | false           | supressed:rejected           | Producer saw success; request sent, permanent condition           |
-| true           | false           | supressed:unknown            | Producer saw success; no effort to report true outcome            |
+| Outcome            | Export Attempted? | Caller Success? | Metrics Success? | Meaning                                                       |
+|--------------------|-------------------|-----------------|------------------|---------------------------------------------------------------|
+| accepted           | true              | true            | true             | Data (successfully) sent                                      |
+| discarded          | false             | true            | true             | Data (successfully) discarded                                 |
+| dropped            | false             | false           | false            | Request never started, error returned                         |
+| timeout            | true              | false           | false            | Request started, timed out, error returned                    |
+| exhausted          | true              | false           | false            | Request started, insufficient resources, error returned       |
+| retryable          | true              | false           | false            | Request started, retryable error status, error returned       |
+| rejected           | true              | false           | false            | Request completed, permanent error status, error returned     |
+| deferred:dropped   | false             | true            | false            | Request never started, error NOT returned                     |
+| deferred:timeout   | true              | true            | false            | Request started, timed out, error NOT returned                |
+| deferred:exhausted | true              | true            | false            | Request started, insufficient resources, error NOT returned   |
+| deferred:retryable | true              | true            | false            | Request started, retryable error status, error NOT returned   |
+| deferred:rejected  | true              | true            | false            | Request completed, permanent error status, error NOT returned |
+| deferred:unknown   | true              | true            | false            | Request has unknown outcome, error NOT returned               |
 
 #### Examples of each outcome
 
@@ -134,48 +234,48 @@ stage in the pipeline while blocking the producer.
 
 A processor was configured with instructions not to pass certain data.
 
-##### Success, Suppressed-Accepted
+##### Success, Deferred-Accepted
 
 A component returned success to its producer, and later the outcome
 was successful.
 
-##### Failure, Dropped and Success, Suppressed-Dropped
+##### Failure, Dropped and Success, Deferred-Dropped
 
-(If suppressed: A component returned success to its producer, then ...)
+(If deferred: A component returned success to its producer, then ...)
 
 The component never sent the item(s) due to limits in effect.  For
 example, shutdown was ordered and the queue could not be drained in
 time due to a limit on parallelism.
 
-##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded
+##### Failure, Deadline exceeded and Success, Deferred-Deadline exceeded
 
-(If suppressed: A component returned success to its producer, then ...)
+(If deferred: A component returned success to its producer, then ...)
 
 The component attempted sending the item(s), but the item(s) did not
 succeed before the deadline expired.  If there were attempts to retry,
 this is outcome of the final attempt.
 
-##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted
+##### Failure, Resource exhausted and Success, Deferred-Resource exhausted
 
-(If suppressed: A component returned success to its producer, then ...)
+(If deferred: A component returned success to its producer, then ...)
 
 The component attempted sending the item(s), but the consumer
 indicated its (or its consumers') resources were exceeded.  If there
 were attempts to retry, this is outcome of the final attempt.
 
-##### Failure, Retryable and Success, Suppressed-Retryable
+##### Failure, Retryable and Success, Deferred-Retryable
 
-(If suppressed: A component returned success to its producer, then ...)
+(If deferred: A component returned success to its producer, then ...)
 
 A component returned success to its producer, and then it attempted
 sending the item(s), but the consumer indicated some kind of transient
 condition other than deadline- or resource-related (e.g., connection
 not accepted).  If there were attempts to retry, this is outcome of
 the final attempt.
 
-##### Failure, Rejected and Success, Suppressed-Rejected
+##### Failure, Rejected and Success, Deferred-Rejected
 
-(If suppressed: A component returned success to its producer, then ...)
+(If deferred: A component returned success to its producer, then ...)
 
 A compmnent returned success to its producer, and then it attempted
 sending the item(s), but the consumer returned a permanent error.