This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

fix(grafana): use osm_request_duration_ms for latency graphs #4297

Merged

snehachhabria merged 1 commit into openservicemesh:main from fixMetricsBugs on Nov 2, 2021

Conversation

@jaellio (Contributor) commented on Oct 21, 2021

Description:

The envoy_cluster_upstream_rq_time metric was referenced in the pre-configured OSM
Grafana dashboards but was not being scraped by Prometheus, which resulted in
graphs with no data.

This PR replaces envoy_cluster_upstream_rq_time with the existing
osm_request_duration_ms SMI metric to display latency in the mesh. Currently, the latency
graphs are present in the pod-to-service, service-to-service, and workload-to-service
dashboards. Unlike envoy_cluster_upstream_rq_time, the osm_request_duration_ms metric
does not capture the source or destination service. Therefore, the latency graphs no longer
fit on the dashboards that allow the user to specify a source service or to see latencies
labeled with the Envoy cluster name (which includes the destination service name).

This PR removes the latency graphs from the pod-to-service, service-to-service, and
workload-to-service dashboards and creates a new dashboard for workload-to-workload
metrics. Additionally, to improve clarity, "Source" is added to the appropriate dashboard
variables.
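As a rough illustration, the latency panels on the new workload-to-workload dashboard can be built from the osm_request_duration_ms histogram with a query shaped like the sketch below. The quantile, rate window, template-variable names, and the source_pod/destination_pod label names are illustrative assumptions; the authoritative panel definitions are in the dashboard JSON in this PR.

```promql
# P90 request latency between a source and a destination workload,
# computed from the SMI histogram Prometheus already scrapes.
# $source_pod and $destination_pod are assumed Grafana template variables.
histogram_quantile(
  0.90,
  sum by (le) (
    rate(osm_request_duration_ms_bucket{
      source_pod="$source_pod",
      destination_pod="$destination_pod"
    }[5m])
  )
)
```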

Note:
An earlier version of this PR added envoy_cluster_upstream_rq_time to the Prometheus
ConfigMap. The discussion surrounding this initial change can be found below.

OSM Workload to Workload Metrics (New): [screenshot]

OSM Workload to Service Metrics: [screenshot]

OSM Pod to Service Metrics: [screenshot]

OSM Service to Service Metrics: [screenshot]

Testing done:

Latency graphs that depended on the envoy_cluster_upstream_rq_time_bucket
histogram rendered as expected with the osm_request_duration_ms histogram. The
functionality of the variables on the new dashboard was also verified.
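For reference, one quick way to confirm the histogram is actually being scraped before wiring it into a panel (an illustrative check, not taken from the PR):

```promql
# Returns a non-empty result only if Prometheus is receiving
# osm_request_duration_ms samples from the mesh.
sum(rate(osm_request_duration_ms_bucket[5m])) > 0
```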

Affected area:

Functional Area
New Functionality [ ]
CI System [ ]
CLI Tool [ ]
Certificate Management [ ]
Control Plane [ ]
Demo [ ]
Documentation [ ]
Egress [ ]
Ingress [ ]
Install [ ]
Networking [ ]
Observability [x]
Performance [ ]
SMI Policy [ ]
Security [ ]
Sidecar Injection [ ]
Tests [ ]
Upgrade [ ]
Other [ ]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? No

    • Did you notify the maintainers and provide attribution?
  2. Is this a breaking change? No

@codecov-commenter commented on Oct 21, 2021

Codecov Report

Merging #4297 (fef7817) into main (30a2f05) will decrease coverage by 0.04%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #4297      +/-   ##
==========================================
- Coverage   69.16%   69.11%   -0.05%     
==========================================
  Files         211      211              
  Lines       14251    14251              
==========================================
- Hits         9856     9849       -7     
- Misses       4347     4354       +7     
  Partials       48       48              
Flag        Coverage Δ
unittests   69.11% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
pkg/crdconversion/crdconversion.go    69.17% <0.00%> (-5.27%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 30a2f05...fef7817.

@jaellio jaellio marked this pull request as ready for review October 21, 2021 17:35
@jaellio jaellio requested a review from a team as a code owner October 21, 2021 17:35
@michelleN (Contributor) left a comment

@jaellio did you already manually test this?

@shashankram (Member) left a comment

Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.

@snehachhabria (Contributor) commented

> Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.

As far as I recollect, most of the bucket-type metrics were removed for this reason.

@shashankram (Member) commented

> Adding @eduser25 to comment in case this was intentionally removed. I recollect a few metrics that were removed to not blow up the Prometheus storage.
>
> As far as I recollect, most of the bucket-type metrics were removed for this reason.

@snehachhabria That seems to be the case based on commit dc51517.

@shashankram shashankram added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2021
@eduser25 (Contributor) commented

If memory serves, this was really heavy on memory consumption per pod in Prometheus.
I'd suggest running the numbers to check that this is feasible before merging.
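One rough way to run those numbers (an illustrative sketch, not from the PR discussion): compare the number of active time series each histogram contributes. Both queries assume the corresponding metric is being scraped when they are run.

```promql
# Active series from the Envoy latency histogram
# (per-upstream-cluster buckets; the cardinality that was a concern):
count(envoy_cluster_upstream_rq_time_bucket)

# Active series from the SMI histogram this PR switches to,
# which is already in the Prometheus scrape config:
count(osm_request_duration_ms_bucket)
```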

@jaellio jaellio requested a review from a team as a code owner October 27, 2021 02:12
@jaellio jaellio force-pushed the fixMetricsBugs branch 3 times, most recently from fef7817 to 84a8d48 on October 28, 2021 04:19
@jaellio jaellio linked an issue Oct 28, 2021 that may be closed by this pull request
@jaellio jaellio changed the title fix(prometheus): add envoy metric to prometheus configmap fix(grafana): use osm_request_duration_ms for latency graphs Oct 28, 2021
@jaellio jaellio marked this pull request as draft October 28, 2021 14:04
Commit message:

This PR replaces envoy_cluster_upstream_rq_time with the existing
osm_request_duration_ms SMI metric to display latency in the mesh.
Currently, the latency graphs are present in the pod to service,
service to service, and workload to service dashboards. Unlike
envoy_cluster_upstream_rq_time, the osm_request_duration_ms metric
does not capture the source or destination service. Therefore, the
latency graphs no longer fit on the dashboards that allow the user
to specify a source service or see the latencies labeled with the
envoy cluster name (which includes the destination service name).

This PR removes the latency graphs from the pod to service, service
to service, and workload to service dashboards and creates a new
dashboard for workload to workload metrics.

Signed-off-by: jaellio <jaellio@microsoft.com>
@jaellio jaellio marked this pull request as ready for review October 28, 2021 21:29
@shashankram shashankram added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 2, 2021
@snehachhabria snehachhabria merged commit 1515ee4 into openservicemesh:main Nov 2, 2021
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Development

Successfully merging this pull request may close these issues.

Collect request latency metrics
6 participants