Note: For the screenshots, you can store all of your answer images in the answer-img
directory.
TODO: Run a kubectl command to show the running pods and services for all components. Take a screenshot of the output and include it here to verify the installation.
NOTE: As instructed by several Knowledge Base posts, I put all of the Jaeger services into the default namespace along with the apps themselves; otherwise Jaeger cannot see the apps. That is why Jaeger is not in the observability namespace. (See the screenshot below: Jaeger runs in the default namespace, where the apps are located.)
TODO: Expose Grafana to the internet and then set up Prometheus as a data source. Provide a screenshot of the home page after logging into Grafana.
TODO: Create a dashboard in Grafana that shows Prometheus as a source. Take a screenshot and include it here.
TODO: Describe, in your own words, what the SLIs are, based on an SLO of monthly uptime and request response time.
Actual monthly uptime and request response time are SLIs (service level indicators, or metrics) that can be used to determine whether an SLO has been met. For example, let's say that our SLO for uptime is at least 99.99% per month, and our SLO for request response time is less than 3 seconds on average. At 99.99% uptime, the error budget works out to roughly 4.3 minutes of downtime in a 30-day month (0.01% of 43,200 minutes).
The following general SLIs can directly or indirectly affect whether these SLOs are met. They can be further broken down and visualized via specific metrics.
- Uptime - the percentage of time that a service is active and available
- Failure rate for a service. This could potentially affect monthly uptime.
- Latency - how long it takes to respond to a request. This directly measures the request response time.
- Saturation - how heavy a load the servers are under. High saturation could cause downtime and slow response times.
- Traffic - how much traffic the servers are getting. There may be times when servers get a spike in traffic, which could directly impact response times and uptime as well.
TODO: It is important to know why we want to measure certain metrics for our customer. Describe in detail 5 metrics to measure these SLIs.
4 Golden Signals to measure service level
- Latency — request time (in ms)
- Traffic — how stressed the system is (based on the number of HTTP requests/sec)
- Errors — how many failed HTTP responses are there? 4xx & 5xx errors.
- Saturation — is too much memory or CPU being used compared to the overall capacity of a service or its configuration?
SLIs to measure the 4 Golden Signals. These SLIs can overlap in terms of which signals affect them; to determine what is affecting a given SLI, further research into tracing on the apps will be needed, as well as analysis of dependencies in the cluster and its environment. They are all interdependent. (A sketch of the Flask instrumentation that exposes these request metrics follows the list below.)
- Measure by response type and service: Flask HTTP requests status 200, 500, 400 (Errors)
- Failed responses per second (Errors, Traffic, Saturation)
- Uptime: frontend, trial, backend (Latency, Traffic, Saturation, Errors)
- Pods health: Pods not ready (Latency / Errors)
- Pods health: Pod restarts by namespace (Could be caused by any number of things : Errors in applications, traffic)
- Average Response time (Latency, Traffic, Saturation)
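The flask_http_request_total and flask_http_request_duration_seconds metrics used in the PromQL queries further down are the default metrics exported by the prometheus_flask_exporter library. A minimal sketch of how an app would expose them, assuming that library is what the reference apps use (the route here is illustrative):

```python
# Minimal sketch: exposing the Flask request metrics referenced above,
# assuming the apps use the prometheus_flask_exporter library.
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)

# Registers a /metrics endpoint and exports flask_http_request_total and
# flask_http_request_duration_seconds, labelled by HTTP status (among other labels),
# which is what the PromQL queries below group and filter by.
metrics = PrometheusMetrics(app)


@app.route("/")
def homepage():
    return "Hello World"


if __name__ == "__main__":
    app.run()
```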
TODO: Create a dashboard to measure the uptime of the frontend and backend services. We will also want to measure 4xx and 5xx errors. Create a dashboard that shows these values over a 24 hour period and take a screenshot.
TODO: We will create a Jaeger span to measure the processes on the backend. Once you fill in the span, provide a screenshot of it here.
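A minimal sketch of what such a span could look like, assuming the backend is a Flask app instrumented with the jaeger_client library; the service, span, and tag names here are illustrative rather than the exact ones used in the project:

```python
# Minimal sketch of adding a Jaeger span to a Flask endpoint,
# assuming the backend uses the jaeger_client library.
from flask import Flask, jsonify
from jaeger_client import Config

app = Flask(__name__)


def init_tracer(service_name):
    """Initialise a Jaeger tracer that samples every request and reports spans."""
    config = Config(
        config={
            "sampler": {"type": "const", "param": 1},
            "logging": True,
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()


tracer = init_tracer("backend")


@app.route("/api")
def my_api():
    # Everything inside this block is timed and reported to Jaeger as one span.
    with tracer.start_span("my-api") as span:
        answer = "something"
        span.set_tag("answer", answer)
    return jsonify(response=answer)
```

With the const sampler set to 1, every request is sampled, so a span shows up in the Jaeger UI for each call to the endpoint.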
TODO: Now that the trace is running, let's add the metric to our current Grafana dashboard. Once this is completed, provide a screenshot of it here.
TODO: Using the template below, write a trouble ticket for the developers to explain the errors that you are seeing (400, 500, latency) and to let them know the file that is causing the issue.
TROUBLE TICKET
Name: 500 Error on backend/app/app.py
Date: November 23 2021, 13:24:50
Subject: 500 Error in my-api endpoint, failed to resolve
Affected Area: File "/app/app.py", line 107, in my_api answer = something # This will create an error
Severity: Severe
Description: NameError - name 'something' is not defined; the unhandled exception causes the endpoint to return an HTTP 500.
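For reference, a hypothetical reconstruction of the failing pattern the ticket describes (the actual code lives in backend/app/app.py, line 107): referencing the undefined name raises NameError, which Flask surfaces as the HTTP 500 seen in the metrics.

```python
# Hypothetical reconstruction of the failing endpoint described in the ticket;
# the real code is in backend/app/app.py (around line 107).
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/api")
def my_api():
    answer = something  # NameError: name 'something' is not defined -> HTTP 500
    return jsonify(response=answer)
```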
TODO: We want to create an SLO guaranteeing that our application has a 99.95% uptime per month. Name three SLIs that you would use to measure the success of this SLO.
- Latency
- Errors
- Saturation
TODO: Now that we have our SLIs and SLOs, create KPIs to accurately measure these metrics. We will make a dashboard for this, but first write them down here:
Latency
- request time (in ms) for successful requests
- request time (in ms) for failed requests
- round-trip request time on the network - measured using ping and traceroute
Errors : It is important to understand what kind of errors are happening in the application. This can be done with Jaeger Tracing.
- 500 errors - these are more severe: the application is unable to start or crashes completely during execution of a request.
- 400 errors - these are less severe but also need urgent attention.
- What percentage of overall requests result in 200 as opposed to 400 or 500 errors
Saturation
- % of the CPU allocated to a service (as configured in its YAML, for example) that is in use
- % CPU still available on the host
- Total number of requests received over time. Are there spikes in usage?
Example PromQL Queries for Some Metrics. For some metrics I was unable to find queries (ping and traceroute). For others I could not establish a direct 1-to-1 relationship from KPI to PromQL query.
Response types : Flask HTTP requests status 200, 500, 400
- sum(flask_http_request_total{container=~"backend|frontend|trial",status="500"}) by (status,container)
- sum(flask_http_request_total{container=~"backend|frontend|trial",status="400"}) by (status,container)
- sum(flask_http_request_total{container=~"backend|frontend|trial",status="200"}) by (status,container)
Failed responses per second
- sum(rate(flask_http_request_duration_seconds_count{status!="200"}[30s]))
Uptime : frontend, trial, backend
- sum(up{container=~"frontend"}) by (pod)
- sum(up{container=~"trial"}) by (pod)
- sum(up{container=~"backend"}) by (pod)
Pods health : Pods not ready
- sum by (namespace) (kube_pod_status_ready{condition="false"})
Pods health : Pod restarts by namespace
- sum by (namespace)(changes(kube_pod_status_ready{condition="true"}[5m]))
Average Response time (Latency)
- rate(flask_http_request_duration_seconds_sum{status="200"}[1d])/ rate(flask_http_request_duration_seconds_count{status="200"}[1d])
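As a quick way to sanity-check these queries outside of Grafana, they can be run against Prometheus's HTTP API (GET /api/v1/query). A minimal sketch, assuming Prometheus has been made reachable on localhost:9090 (for example via kubectl port-forward; the exact service name depends on how Prometheus was installed):

```python
# Minimal sketch: running one of the PromQL queries above via the Prometheus HTTP API.
# Assumes Prometheus is reachable at localhost:9090, e.g. through kubectl port-forward.
import requests

PROMETHEUS_URL = "http://localhost:9090"


def instant_query(expr: str):
    """Run an instant query and return the list of series in the result."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    # Average response time for successful requests over the last day (see the KPI list above).
    expr = (
        'rate(flask_http_request_duration_seconds_sum{status="200"}[1d])'
        ' / rate(flask_http_request_duration_seconds_count{status="200"}[1d])'
    )
    for series in instant_query(expr):
        print(series["metric"], series["value"])
```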
TODO: Create a Dashboard containing graphs that capture all the metrics of your KPIs and adequately representing your SLIs and SLOs. Include a screenshot of the dashboard here, and write a text description of what graphs are represented in the dashboard.
Dashboard : See the dashboard screenshot. Panels listed are:
- Flask HTTP request total: Status "200"
- Flask HTTP request exceptions
- 5xx Errors last 24 hours
- 4xx Errors last 24 hours
- Failed responses per second
- Uptime Frontend Service Last 24 hours
- Uptime Backend Service Last 24 hours
- Uptime Trial Service Last 24 hours
- Pods Per Namespace
- Pods that were not ready
- Pod restarts per namespace
- CPU Usage
- Latency: Average response time