This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

fix(grafana): use osm_request_duration_ms for latency graphs #4297

Merged

snehachhabria merged 1 commit into openservicemesh:main from fixMetricsBugs on Nov 2, 2021

Conversation

@jaellio (Contributor) commented on Oct 21, 2021

Description:

The envoy_cluster_upstream_rq_time metric was referenced in the pre-configured OSM
Grafana dashboards but was not being scraped by Prometheus, which resulted in
graphs with no data.

This PR replaces envoy_cluster_upstream_rq_time with the existing
osm_request_duration_ms SMI metric to display latency in the mesh. Currently, the latency
graphs are present in the pod-to-service, service-to-service, and workload-to-service
dashboards. Unlike envoy_cluster_upstream_rq_time, the osm_request_duration_ms metric
does not capture the source or destination service. Therefore, the latency graphs no longer
fit on the dashboards that allow the user to specify a source service or to see latencies
labeled with the Envoy cluster name (which includes the destination service name).

This PR removes the latency graphs from the pod-to-service, service-to-service, and
workload-to-service dashboards and creates a new dashboard for workload-to-workload
metrics. Additionally, to improve clarity, "Source" is added to the appropriate dashboard
variables.
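As a rough illustration, the latency panels on the new workload-to-workload dashboard can be built from the osm_request_duration_ms histogram with a query shaped like the sketch below. The quantile, rate window, template-variable names, and the source_pod/destination_pod label names are illustrative assumptions; the authoritative panel definitions are in the dashboard JSON in this PR.

```promql
# P90 request latency between a source and a destination workload,
# computed from the SMI histogram Prometheus already scrapes.
# $source_pod and $destination_pod are assumed Grafana template variables.
histogram_quantile(
  0.90,
  sum by (le) (
    rate(osm_request_duration_ms_bucket{
      source_pod="$source_pod",
      destination_pod="$destination_pod"
    }[5m])
  )
)
```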

Note:
An earlier version of this PR added envoy_cluster_upstream_rq_time to the Prometheus
ConfigMap. The discussion surrounding this initial change can be found below.

OSM Workload to Workload Metrics (New): [screenshot]

OSM Workload to Service Metrics: [screenshot]

OSM Pod to Service Metrics: [screenshot]

OSM Service to Service Metrics: [screenshot]

Testing done:

Latency graphs that depended on the envoy_cluster_upstream_rq_time_bucket
histogram rendered as expected with the osm_request_duration_ms histogram. The
functionality of the variables on the new dashboard was also verified.
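For reference, one quick way to confirm the histogram is actually being scraped before wiring it into a panel (an illustrative check, not taken from the PR):

```promql
# Returns a non-empty result only if Prometheus is receiving
# osm_request_duration_ms samples from the mesh.
sum(rate(osm_request_duration_ms_bucket[5m])) > 0
```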

Affected area:

Functional Area
New Functionality [ ]
CI System [ ]
CLI Tool [ ]
Certificate Management [ ]
Control Plane [ ]
Demo [ ]
Documentation [ ]
Egress [ ]
Ingress [ ]
Install [ ]
Networking [ ]
Observability [x]
Performance [ ]
SMI Policy [ ]
Security [ ]
Sidecar Injection [ ]
Tests [ ]
Upgrade [ ]
Other [ ]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? No

    • Did you notify the maintainers and provide attribution?
  2. Is this a breaking change? No

@codecov-commenter commented on Oct 21, 2021

Codecov Report

Merging #4297 (fef7817) into main (30a2f05) will decrease coverage by 0.04%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #4297      +/-   ##
==========================================
- Coverage   69.16%   69.11%   -0.05%     
==========================================
  Files         211      211              
  Lines       14251    14251              
==========================================
- Hits         9856     9849       -7     
- Misses       4347     4354       +7     
  Partials       48       48              
Flag        Coverage Δ
unittests   69.11% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
pkg/crdconversion/crdconversion.go    69.17% <0.00%> (-5.27%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 30a2f05...fef7817.

@jaellio jaellio marked this pull request as ready for review October 21, 2021 17:35
@jaellio jaellio requested a review from a team as a code owner October 21, 2021 17:35
@michelleN (Contributor) left a comment

@jaellio did you already manually test this?

@shashankram (Member) left a comment

Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.

@snehachhabria (Contributor) commented

> Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.

As far as I recollect, most of the bucket-type metrics were removed for this reason.

@shashankram (Member) commented

> Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.
>
> As far as I recollect, most of the bucket-type metrics were removed for this reason.

@snehachhabria That seems to be the case based on commit dc51517.

@shashankram shashankram added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2021
@eduser25 (Contributor) commented

If memory serves, this was really heavy on memory consumption per pod in Prometheus.
I'd suggest running the numbers to check that this is feasible before merging.
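One rough way to run those numbers (an illustrative sketch, not from the PR discussion): compare the number of active time series each histogram contributes. Both queries assume the corresponding metric is being scraped when they are run.

```promql
# Active series from the Envoy latency histogram
# (per-upstream-cluster buckets; the cardinality that was a concern):
count(envoy_cluster_upstream_rq_time_bucket)

# Active series from the SMI histogram this PR switches to,
# which is already in the Prometheus scrape config:
count(osm_request_duration_ms_bucket)
```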

@jaellio jaellio requested a review from a team as a code owner October 27, 2021 02:12
@jaellio jaellio force-pushed the fixMetricsBugs branch 3 times, most recently from fef7817 to 84a8d48 on October 28, 2021 04:19
@jaellio jaellio linked an issue Oct 28, 2021 that may be closed by this pull request
@jaellio jaellio changed the title fix(prometheus): add envoy metric to prometheus configmap fix(grafana): use osm_request_duration_ms for latency graphs Oct 28, 2021
@jaellio jaellio marked this pull request as draft October 28, 2021 14:04
Commit message:

This PR replaces envoy_cluster_upstream_rq_time with the existing
osm_request_duration_ms SMI metric to display latency in the mesh.
Currently, the latency graphs are present in the pod to service,
service to service, and workload to service dashboards. Unlike
envoy_cluster_upstream_rq_time, the osm_request_duration_ms metric
does not capture the source or destination service. Therefore, the
latency graphs no longer fit on the dashboards that allow the user
to specify a source service or see the latencies labeled with the
envoy cluster name (which includes the destination service name).

This PR removes the latency graphs from the pod to service, service
to service, and workload to service dashboards and creates a new
dashboard for workload to workload metrics.

Signed-off-by: jaellio <jaellio@microsoft.com>
@jaellio jaellio marked this pull request as ready for review October 28, 2021 21:29
@shashankram shashankram added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 2, 2021
@snehachhabria snehachhabria merged commit 1515ee4 into openservicemesh:main Nov 2, 2021
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Development

Successfully merging this pull request may close these issues.

Collect request latency metrics
6 participants