This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

WIP: Pipeline monitoring metrics #249

Closed
wants to merge 21 commits into from
text
jmacd committed Feb 6, 2024
commit 38fbc0883a14ba5cd8a7322b4886e7143634aac0
242 changes: 171 additions & 71 deletions text/metrics/0238-pipeline-monitoring.md
@@ -4,30 +4,127 @@ Propose a uniform standard for telemetry pipeline metrics generated by
OpenTelemetry SDKs and Collectors with support for several levels of
detail.

**WIP**: This document has been edited recently, based on reviewer
feedback. Since it has changed substantially, I removed a lot of
text. I will restore this document after sharing the revisions with
reviewers.

## Motivation

OpenTelemetry desires to standardize conventions for the metrics
emitted by SDKs about success and failure of telemetry reporting. At
the same time, the OpenTelemetry Collector is becoming a stable and
critical part of the ecosystem, and it has existing conventions which
are expected to connect with metrics emitted by SDKs.
the same time, the OpenTelemetry Collector has existing conventions
which are expected to connect with metrics emitted by SDKs and have
similar definitions.

## Explanation

We use the term "pipeline" to describe an arrangement of system
components which produce, consume, and process telemetry on its way
from the point of origin to the endpoint(s) in its journey.

## Explanation
from the point of origin to the endpoint(s) in its journey. Pipeline
components included in this specification are:

- OpenTelemetry SDKs: As telemetry producers, these components are the
start of a pipeline.
- OpenTelemetry Collectors: The OpenTelemetry collector contains an
arrangement of components which may act as both consumers and
producers.

The term "ahead of", in reference to a pipeline, refers to the
component or chain of components that produced data being consumed by
the component in question. The "next stage" of the pipeline refers to
the component in the pipeline that consumes the data produced by the
component in question.

### Detailed design

The proposed metric instrument would be named distinctly depending on
#### Pipeline outcome is not response status

In this specification, it is important to recognize that the outcome
registered by a component in its pipeline metrics does not necessarily
match the response code that it returns from Export to the component
ahead of it in the pipeline.

For example, a memory-limiter component may drop data and return a
resource-exhausted status to the component ahead of it, a receiver.
The receiver will indicate `exhausted`, while the memory limiter will
indicate `dropped` because it was the origin of the exhausted
condition.

For example, an exporter involved in a fan-out arrangement may be
configured to suppress errors, aware that the producer will see an
error if any of the fanned-out components returns an error. The
exporter will be configured to return success immediately and later,
depending on the actual outcome, may use one of the `deferred:` outcomes
to report failures which the component ahead of it did not see.
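The memory-limiter example above can be sketched as follows. This is a minimal illustration, not Collector code: the `outcomes` counter, the `ResourceExhausted` exception, and both function names are hypothetical stand-ins, and the stage after the limiter is not modeled.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the outcome counter,
# keyed by (component name, outcome).
outcomes = Counter()


class ResourceExhausted(Exception):
    """Hypothetical status returned to the component ahead in the pipeline."""


def memory_limiter_consume(items, over_limit):
    if over_limit:
        # The limiter originated the failure, so it records `dropped`...
        outcomes[("memory_limiter", "dropped")] += len(items)
        # ...while returning a resource-exhausted status to its caller.
        raise ResourceExhausted()
    # The next stage is not modeled here; it is assumed to succeed.
    outcomes[("memory_limiter", "accepted")] += len(items)


def receiver_consume(items, over_limit):
    try:
        memory_limiter_consume(items, over_limit)
    except ResourceExhausted:
        # The receiver only saw the status code, so it records `exhausted`.
        outcomes[("receiver", "exhausted")] += len(items)
        raise
    outcomes[("receiver", "accepted")] += len(items)


try:
    receiver_consume(["span-1", "span-2"], over_limit=True)
except ResourceExhausted:
    pass  # the receiver's own caller would see this status
```

The same two items appear under different outcomes in the two components, which is the distinction this section draws between the recorded outcome and the returned status.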

#### Single metric vs. Many

We choose to specify a single metric instrument for counting
outcomes (with metric attributes), as opposed to the use of
per-outcome metric instruments (without metric attributes).

The alternative, which uses one metric instrument per outcome, has
known difficulties. To define a ratio between any one outcome and the
total requires a metric formula defined by all the outcomes. On the other
hand, it is common practice when using OpenTelemetry metrics to aggregate
by attribute. When a single metric instrument is used, it is possible and
convenient to define ratios and build area charts from that one
instrument.

The use of exclusive counters, one per outcome, is also logically
confusing. Existing OpenTelemetry collector metrics for exporters
have both `sent` and `send_failed` metrics. A user could easily
believe that the failure ratio is defined as `send_failed / sent`,
since (logically) something has to be sent before the send can fail.
The correct failure ratio, using exclusive counters, is `send_failed /
(sent + send_failed)`, but from experience, users can easily miss this
detail. Moreover, when exclusive counters have been defined in this
manner, defining a new outcome requires updating every formula.
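The arithmetic behind this argument can be sketched as follows. All counts are hypothetical, and `outcome_ratio` is an illustrative helper rather than part of any OpenTelemetry API.

```python
# Exclusive counters, as in the existing `sent` / `send_failed` metrics.
sent, send_failed = 90, 10

naive_ratio = send_failed / sent                    # the easy mistake
correct_ratio = send_failed / (sent + send_failed)  # the intended ratio

# Single instrument: one count per outcome attribute value.
counts = {"accepted": 90, "timeout": 6, "rejected": 4}


def outcome_ratio(counts, outcome):
    """Ratio of one outcome to the total across all outcome values."""
    total = sum(counts.values())
    return counts.get(outcome, 0) / total if total else 0.0


# A newly defined outcome only adds a key; the formula is unchanged.
counts["exhausted"] = 0
timeout_ratio = outcome_ratio(counts, "timeout")
```

With a single instrument, the denominator is always the sum over the attribute, so adding an outcome value does not require revisiting existing ratios.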

#### Distinct names for SDKs and Collectors

Because these run in the same place.

#### Pipeline equations: Consumed = Dropped-or-Discarded + Produced

Dropped and discarded are special
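As a worked instance of the equation in the heading, the sketch below uses hypothetical counts for a single component. Whether `deferred:dropped` items belong in the dropped term is an assumption left to the caller here.

```python
def produced(consumed, discarded, dropped):
    """Items passed to the next stage: Consumed = Dropped-or-Discarded + Produced."""
    return consumed - (discarded + dropped)


# Hypothetical counts for one processor.
consumed, discarded, dropped = 100, 20, 5
passed = produced(consumed, discarded, dropped)

# The equation balances for this component.
assert consumed == (discarded + dropped) + passed
```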

#### Timeout is special

Prefer timeout to dropped, e.g., you may drop because timeout expired
on arrival. Call this timeout, not dropped.

#### Resource-exhausted is special

Be specific about this one, since it impacts SLOs. Do not apply it to
"this" component.

#### Deferred outcomes are special

Important because users are blind to these outcomes, so "terminal" in
a sense.

#### Coalescing Processor metrics

The definitions for the `discarded`, `dropped` (and `deferred:dropped`)
outcomes are special because they are terminal in the pipeline. When
an item is discarded or dropped, only that component counts these
outcomes, while components ahead of this component will see success or
a retryable response code.

The normal behavior of a processor component in the OpenTelemetry
Collector, except when it decides to drop or discard, is to pass
telemetry through to the next stage in the pipeline. It would lead to
substantial redundancy for a sequence of processors to individually
count pass-through outcomes, since for outcomes other than `discarded`, `dropped`, and `deferred:dropped`

Considering a sequence of adjacent components

[WIP]

The proposed metric instruments are named differently, depending on
whether it is a collector or an SDK, to prevent accidental aggregation
of these timeseries. The specified counter names would be:
of these timeseries.


The specified counter names are:

- `otelsdk.producer.items`: count of successful and failed items of
telemetry produced, by signal type, by an OpenTelemetry SDK.
@@ -36,7 +133,7 @@ of these timeseries. The specified counter names would be:
receiver component.
- `otelcol.processor.items`: count of successful and failed items of
telemetry processed, by signal type, by an OpenTelemetry Collector
processor component.
processor component. Two mode alternatives are specified, see below.
- `otelcol.exporter.items`: count of successful and failed items of
telemetry processed, by signal type, by an OpenTelemetry Collector
exporter component.
@@ -49,8 +146,9 @@ of these timeseries. The specified counter names would be:
way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
"metrics", "traces")
- `otel.name` (string): Name of the component in a pipeline.
- `otel.pipeline` (string): Name of the pipeline in a collector.
- `otel.component` (string): Name of the component in a pipeline.
- `otel.pipeline`
(string): Name of the pipeline in a collector.

### Specified `otel.outcome` attribute values

@@ -64,64 +162,66 @@ For success=true:

- `accepted`: Indicates a normal, synchronous request success case.
The item was consumed by the next stage of the pipeline, which
returned success. Note the item could have been suppressed by a
returned success. Note the item could have been deferred by a
subsequent component, but as far as this component knows, the
request was successful.
- `suppressed:<any other outcome>`: When the true
outcome is not known at the time of counting, and the component
intentionally returns success to its producer. Examples are given
below.

For both success=true and success=false, there is a special outcome
indicating items did not reach the next stage in the pipeline,
considered "dropped". When comparing pipeline metrics from one stage
to the next, those which are dropped by a component are expected not
to appear in totals of the subsequent pipeline.

- `dropped`: Processors may use this to indicate both success and
failure. Examples of success include sampling processors and filtering
processors, which successfully avoid sending data based on
configuration. For all components, dropped with success=false
indicates that the component introduced an original failure and did
not send to the next stage in the pipeline.

For success=false, transient and potentially retryable:

- `deadline_exceeded`: The item was in the process of being sent but the request
timed out, or its deadline was exceeded.
- `resource_exhausted`: The item was handled by the next stage of the
pipeline, which returned an error code indicating that it was
overloaded. If the resource being exhausted is local and the item
was not handled by the next stage of the pipeline, use `dropped`.
- `discarded`: Indicates a successful outcome in which the next stage
of the pipeline does not handle the event, as by a sampling
processor.
- `deferred:<failure outcome>`: Deferred cases are where the
caller receives a success response and the true outcome is failure,
but this is not known until later. The item is counted as
`deferred:` combined with the failure outcome that would otherwise
have been counted.

For success=false, transient and potentially retryable cases:

- `dropped`: The component introduced an original failure and did not
send to the next stage in the pipeline.
- `timeout`: The item was in the process of being sent but the request
timed out, or its deadline was exceeded. In this case, it
is undetermined whether the consuming pipeline saw the item or not.
- `exhausted`: The item was handled by the next stage of the pipeline,
which returned an error code indicating that it was overloaded. If
the resource being exhausted is local and the item was not handled
by the next stage of the pipeline, record the item `dropped` and
return a resource-exhausted status code to the producer, who will
record an `exhausted` outcome.
- `retryable`: The item was handled by the next stage of the pipeline,
which returned a retryable error status not covered by any of the
above values.

For success=false, permanent category:
For success=false, permanent cases:

- `rejected`: The item was handled by the next stage of the pipeline,
which returned a permanent error status or partial success status
indicating that some items could not be accepted.
- `unknown`: May be used when the component is suppressing errors and
not actually counting successes and failures. As a special case,
the outcome `deferred:unknown` indicates that a success response
was given and no information about the actual outcome is available.

##



#### Success, Outcome matrix

| Caller Success | Metrics Success | Outcome | Meaning |
|----------------|-----------------|------------------------------|-------------------------------------------------------------------|
| true | true | accepted | Send succeeded (synchronous or not) |
| true | true | dropped | Dropped by intention |
| false | false | dropped | Producer saw the component return failure, request was not sent |
| false | false | deadline_exceeded | Producer saw the component return failure, request timed out |
| false | false | resource_exhausted | Producer saw the component return failure, insufficient resources |
| false | false | retryable | Producer saw the component return other non-permanent condition |
| false | false | rejected | Producer saw the component return a permanent condition |
| true | false | suppressed:accepted | Producer saw success; eventually accepted |
| true | false | suppressed:dropped | Producer saw success; request was not sent |
| true | false | suppressed:deadline_exceeded | Producer saw success; request sent, timed out |
| true | false | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources |
| true | false | suppressed:retryable | Producer saw success; request sent, other non-permanent condition |
| true | false | suppressed:rejected | Producer saw success; request sent, permanent condition |
| true | false | suppressed:unknown | Producer saw success; no effort to report true outcome |
| Outcome | Export Attempted? | Caller Success? | Metrics Success? | Meaning |
|--------------------|-------------------|-----------------|------------------|---------------------------------------------------------------|
| accepted | true | true | true | Data (successfully) sent |
| discarded | false | true | true | Data (successfully) discarded |
| dropped | false | false | false | Request never started, error returned |
| timeout | true | false | false | Request started, timed out, error returned |
| exhausted | true | false | false | Request started, insufficient resources, error returned |
| retryable | true | false | false | Request started, retryable error status, error returned |
| rejected | true | false | false | Request completed, permanent error status, error returned |
| deferred:dropped | false | true | false | Request never started, error NOT returned |
| deferred:timeout | true | true | false | Request started, timed out, error NOT returned |
| deferred:exhausted | true | true | false | Request started, insufficient resources, error NOT returned |
| deferred:retryable | true | true | false | Request started, retryable error status, error NOT returned |
| deferred:rejected | true | true | false | Request completed, permanent error status, error NOT returned |
| deferred:unknown | true | true | false | Request has unknown outcome, error NOT returned |
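The matrix with the "Export Attempted?" column can be encoded as a lookup table, which makes it easy to total the failures that callers never saw. This is a sketch only: `Row`, `MATRIX`, and `hidden_failures` are hypothetical names, and the values are transcribed from the table rows.

```python
from typing import NamedTuple


class Row(NamedTuple):
    export_attempted: bool
    caller_success: bool
    metrics_success: bool


# One entry per row of the matrix, in the same column order.
MATRIX = {
    "accepted": Row(True, True, True),
    "discarded": Row(False, True, True),
    "dropped": Row(False, False, False),
    "timeout": Row(True, False, False),
    "exhausted": Row(True, False, False),
    "retryable": Row(True, False, False),
    "rejected": Row(True, False, False),
    "deferred:dropped": Row(False, True, False),
    "deferred:timeout": Row(True, True, False),
    "deferred:exhausted": Row(True, True, False),
    "deferred:retryable": Row(True, True, False),
    "deferred:rejected": Row(True, True, False),
    "deferred:unknown": Row(True, True, False),
}


def hidden_failures(counts):
    """Sum items whose caller saw success although the metric records failure."""
    return sum(
        n
        for outcome, n in counts.items()
        if MATRIX[outcome].caller_success and not MATRIX[outcome].metrics_success
    )
```

For example, `hidden_failures({"accepted": 10, "deferred:timeout": 3, "rejected": 2, "deferred:unknown": 1})` counts only the two `deferred:` entries, since every `deferred:` row pairs caller success with metrics failure.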

#### Examples of each outcome

@@ -134,48 +234,48 @@ stage in the pipeline while blocking the producer.

A processor was configured with instructions not to pass certain data.

##### Success, Suppressed-Accepted
##### Success, Deferred-Accepted

A component returned success to its producer, and later the outcome
was successful.

##### Failure, Dropped and Success, Suppressed-Dropped
##### Failure, Dropped and Success, Deferred-Dropped

(If suppressed: A component returned success to its producer, then ...)
(If deferred: A component returned success to its producer, then ...)

The component never sent the item(s) due to limits in effect. For
example, shutdown was ordered and the queue could not be drained in
time due to a limit on parallelism.

##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded
##### Failure, Deadline exceeded and Success, Deferred-Deadline exceeded

(If suppressed: A component returned success to its producer, then ...)
(If deferred: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the item(s) did not
succeed before the deadline expired. If there were attempts to retry,
this is the outcome of the final attempt.

##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted
##### Failure, Resource exhausted and Success, Deferred-Resource exhausted

(If suppressed: A component returned success to its producer, then ...)
(If deferred: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated its (or its consumers') resources were exceeded. If there
were attempts to retry, this is the outcome of the final attempt.

##### Failure, Retryable and Success, Suppressed-Retryable
##### Failure, Retryable and Success, Deferred-Retryable

(If suppressed: A component returned success to its producer, then ...)
(If deferred: A component returned success to its producer, then ...)

A component returned success to its producer, and then it attempted
sending the item(s), but the consumer indicated some kind of transient
condition other than deadline- or resource-related (e.g., connection
not accepted). If there were attempts to retry, this is the outcome of
the final attempt.

##### Failure, Rejected and Success, Suppressed-Rejected
##### Failure, Rejected and Success, Deferred-Rejected

(If suppressed: A component returned success to its producer, then ...)
(If deferred: A component returned success to its producer, then ...)

A component returned success to its producer, and then it attempted
sending the item(s), but the consumer returned a permanent error.