Add troubleshooting notes (#660)
## Which problem is this PR solving?

Document the troubleshooting steps taken while investigating [this
support
question](https://cloud-native.slack.com/archives/CGG7NFUJ3/p1699539239671519).

One of the problems faced by users was caused by [OpenTelemetry
Collector Contrib
v0.85.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.85.0)
introducing a breaking change to enable normalized metric names by
default:

> `prometheusexporters`: Append prometheus type and unit suffixes by
default in prometheus exporters.
(open-telemetry/opentelemetry-collector-contrib#26488)
Suffixes can be disabled by setting add_metric_suffixes to false on the
exporter.

Relates to: jaegertracing/jaeger#4957

## Description of the changes
- Adds the following troubleshooting guides:
  - Inspecting the Prometheus queries that Jaeger makes to fetch data for the Monitor tab.
  - Inspecting OpenTelemetry config to troubleshoot a possible cause for missing error metrics.
- Updates are only applied from the version in which SPM defaulted to supporting the
  spanmetrics connector, which was
  [v1.49.0](https://github.com/jaegertracing/jaeger/releases/tag/v1.49.0).

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
~- [ ] I have added unit tests for the new functionality~
~- [ ] I have run lint and test steps successfully~

---------

Signed-off-by: Albert Teoh <albert@packsmith.io>
Co-authored-by: Albert Teoh <albert@packsmith.io>
albertteoh and Albert Teoh authored Nov 18, 2023
1 parent aacddd0 commit 855ca70
Showing 4 changed files with 260 additions and 8 deletions.
67 changes: 65 additions & 2 deletions content/docs/1.49/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
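
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.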

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
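
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.49
```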

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/1.50/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
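
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.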

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
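
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.50
```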

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/1.51/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
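
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.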

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
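
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:1.51
```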

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
67 changes: 65 additions & 2 deletions content/docs/next-release/spm.md
@@ -262,14 +262,77 @@ the problem.
### Query Prometheus

Graphs may still appear empty even when the above Jaeger metrics indicate successful reads
from Prometheus. In this case, query Prometheus directly on any one of these metrics:
from Prometheus. In this case, query Prometheus directly on any of these metrics:

- `latency_bucket`
- `duration_bucket`
- `duration_milliseconds_bucket`
- `duration_seconds_bucket`
- `calls`
- `calls_total`

You should expect to see these counters increasing as spans are being emitted
by services to the OpenTelemetry Collector.
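
For example, a quick way to confirm this is to query Prometheus over its HTTP API. The
following is only a sketch: it assumes Prometheus is reachable at `localhost:9090`, and the
metric may be named `calls` or `calls_total` depending on your OpenTelemetry Collector version.

```shell
# Assumes Prometheus listens on localhost:9090; adjust the host/port to your setup.
# Repeat the query with "calls", "duration_bucket" or "duration_milliseconds_bucket"
# to see which metric names are actually present.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=calls_total'
```

Re-running the query after more traffic has been generated should show the counter values increasing.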

### Viewing Logs

If the above metrics are present in Prometheus but are not appearing in the Monitor
tab, it means there is a discrepancy between the metric names Jaeger expects to find in
Prometheus and those that are actually available.

This can be confirmed by increasing the log level via the following environment
variable:

```shell
LOG_LEVEL=debug
```

This should output logs that resemble the following:
```json
{
"level": "debug",
"ts": 1688042343.4464543,
"caller": "metricsstore/reader.go:245",
"msg": "Prometheus query results",
"results": "",
"query": "sum(rate(calls{service_name =~ \"driver\", span_kind =~ \"SPAN_KIND_SERVER\"}[10m])) by (service_name,span_name)",
"range":
{
"Start": "2023-06-29T12:34:03.081Z",
"End": "2023-06-29T12:39:03.081Z",
"Step": 60000000000
}
}
```

In this instance, suppose the OpenTelemetry Collector's `prometheusexporter` introduced
a breaking change that appends a `_total` suffix to counter metrics and the duration unit to
histogram metric names (e.g. `duration_milliseconds_bucket`). As we discovered,
Jaeger is looking for the `calls` (and `duration_bucket`) metric names,
while the OpenTelemetry Collector is writing `calls_total` (and `duration_milliseconds_bucket`).

The resolution, in this specific case, is to set environment variables that tell Jaeger
to normalize the metric names, so that it searches for `calls_total` and
`duration_milliseconds_bucket` instead:

```shell
PROMETHEUS_QUERY_NORMALIZE_CALLS=true
PROMETHEUS_QUERY_NORMALIZE_DURATION=true
```
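
For example, if you run Jaeger all-in-one as a Docker container, these variables can be
passed with `-e` flags. This is only a hedged sketch: the image tag, the Prometheus URL and
the rest of the command line are assumptions to be adapted to your own deployment.

```shell
# Hypothetical invocation: the image tag, Prometheus URL and any other flags
# are placeholders for whatever your deployment actually uses.
docker run \
  -e METRICS_STORAGE_TYPE=prometheus \
  -e PROMETHEUS_SERVER_URL=http://prometheus:9090 \
  -e PROMETHEUS_QUERY_NORMALIZE_CALLS=true \
  -e PROMETHEUS_QUERY_NORMALIZE_DURATION=true \
  jaegertracing/all-in-one:latest
```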

### Checking OpenTelemetry Collector Config

If there are error spans appearing in Jaeger, but no corresponding error metrics:

- Check that the raw metrics generated by the spanmetrics connector in Prometheus
  (as listed above: `calls`, `calls_total`, `duration_bucket`, etc.) contain
  the `status.code` label on the metrics that the error spans should belong to.
- If there are no `status.code` labels, check the OpenTelemetry Collector
configuration file, particularly for the presence of the following configuration:
```yaml
exclude_dimensions: ['status.code']
```
This label is used by Jaeger to determine if a request is erroneous.

### Inspect the OpenTelemetry Collector

If the above `latency_bucket` and `calls_total` metrics are empty, then it could
