Add troubleshooting notes (#660)
## Which problem is this PR solving?

Document the troubleshooting steps taken while investigating [this
support
question](https://cloud-native.slack.com/archives/CGG7NFUJ3/p1699539239671519).

One of the problems faced by users was caused by [OpenTelemetry
Collector Contrib
v0.85.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.85.0)
introducing a breaking change to enable normalized metric names by
default:

> `prometheusexporters`: Append prometheus type and unit suffixes by
default in prometheus exporters.
(open-telemetry/opentelemetry-collector-contrib#26488)
Suffixes can be disabled by setting add_metric_suffixes to false on the
exporter.

Relates to: jaegertracing/jaeger#4957

## Description of the changes
- Adds the following troubleshooting guides:
  - Inspecting the Prometheus queries that Jaeger makes to fetch data for the Monitor tab.
  - Inspecting OpenTelemetry config to troubleshoot a possible cause for missing error metrics.
- Updates are only applied from the version in which SPM defaulted to supporting the
  spanmetrics connector, which was
  [v1.49.0](https://github.com/jaegertracing/jaeger/releases/tag/v1.49.0).

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
~- [ ] I have added unit tests for the new functionality~
~- [ ] I have run lint and test steps successfully~

---------

Signed-off-by: Albert Teoh <albert@packsmith.io>
Co-authored-by: Albert Teoh <albert@packsmith.io>
albertteoh and Albert Teoh authored Nov 18, 2023
1 parent aacddd0 commit 855ca70
Showing 4 changed files with 260 additions and 8 deletions.
67 changes: 65 additions & 2 deletions content/docs/1.49/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
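
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.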

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
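
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.49
```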

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/1.50/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
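
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.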

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
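
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.50
```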

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/1.51/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
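
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.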

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
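
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.51
```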

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/next-release/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
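
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.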

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
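
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:latest
```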

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
