Service level operator abstracts and automates the service level of Kubernetes applications by generation SLI & SLOs to be consumed easily by dashboards and alerts and allow that the SLI/SLO's live with the application flow.
This operator interacts with Kubernetes using the CRDs as a way to define application service levels and generating output service level metrics.
Although this operator is though to interact with different backends and generate different output backends, at this moment only uses Prometheus as input and output backend.
For this example the output and input backend will be Prometheus.
First you will need to define a CRD with your service SLI & SLOs. In this case we have a service that has an SLO on 99.99 availability, and the SLI is that 5xx are considered errors.
apiVersion: measure.slok.xyz/v1alpha1
kind: ServiceLevel
metadata:
name: awesome-service
spec:
serviceLevelObjectives:
- name: "9999_http_request_lt_500"
description: 99.99% of requests must be served with <500 status code.
disable: false
availabilityObjectivePercent: 99.99
serviceLevelIndicator:
prometheus:
address: http://myprometheus:9090
totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
output:
prometheus:
labels:
team: a-team
iteration: "3"
The Operator will generate the SLI and SLO in this prometheus format:
# HELP service_level_sli_result_count_total Is the number of times an SLI result has been processed.
# TYPE service_level_sli_result_count_total counter
service_level_sli_result_count_total{service_level="awesome-service",slo="9999_http_request_lt_500"} 1708
# HELP service_level_sli_result_error_ratio_total Is the error or failure ratio of an SLI result.
# TYPE service_level_sli_result_error_ratio_total counter
service_level_sli_result_error_ratio_total{service_level="awesome-service",slo="9999_http_request_lt_500"} 0.40508550763795764
# HELP service_level_slo_objective_ratio Is the objective of the SLO in ratio unit.
# TYPE service_level_slo_objective_ratio gauge
service_level_slo_objective_ratio{service_level="awesome-service",slo="9999_http_request_lt_500"} 0.9998999999999999
The operator will query and create new metrics based on the SLOs caulculations at regular intervals (see --resync-seconds
flag).
The approach that has been taken to generate the SLI results is based on how Google uses and manages SLIs, SLOs and error budgets
In the manifest the SLI is made of 2 prometheus metrics:
- The total of requests:
sum(increase(http_request_total{host="awesome_service_io"}[2m]))
- The total number of failed requests:
sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
By expresing what are the total count on SLI result processing and the error ratio processed the operator will generate the SLO metrics for this service.
Like is seen in the above output the operator generates 3 metrics:
service_level_sli_result_error_ratio_total
: The downtime/error ratio (0-1) of the service.service_level_sli_result_count_total
: The total count of SLI processed total, in other words, what would be the ratio if the service would be 100% correct all the time becasue ratios are from 0 to 1.service_level_slo_objective_ratio
: The objective of the SLO in ratio. This metrics is't processed at all (only changed to ratio unit), but is important to create error budget quries, alerts...
With these metrics we can build availability graphs based on % and error budget burns.
The approach of using counters (instead of gauges) to store the total counts and the error/downtime total gives us the ability to get SLO/SLI rates, increments, speed... in the different time ranges (check query examples section) and is safer in case of missed scrapes, SLI calculation errors... In other words this approach gives us flexibility and safety.
Is important to note that like every metrics this is not exact and is a aproximation (good one but an approximation after all)
There is a grafana dashboard] to show the SLO's status.
This will output the availability rate of a service based.
1 - (
rate(service_level_sli_result_error_ratio_total[1m])
/
rate(service_level_sli_result_count_total[1m])
) * 100
This will output the availability rate of a service based.
1 - (
increase(service_level_sli_result_error_ratio_total[24h])
/
increase(service_level_sli_result_count_total[24h])
) * 100
The way this operator abstracts the SLI results it's easy to get the error budget burn rate without a time range projection, this is because the calculation is constant and based on ratios (0-1) instead of duration, rps, processed messages...
To know the error budget burn rate we need to get the errors in a interval (eg 5m
):
increase(service_level_sli_result_error_ratio_total{service_level="${service_level}", slo="${slo}"}[5m])
And to get the maximum burn rate that we can afford so we don't consume all the error budget would be:
(1 - service_level_slo_objective_ratio{service_level="${service_level}", slo="${slo}"})
* increase(service_level_sli_result_count_total{service_level="${service_level}", slo="${slo}"}[${interval}])
This query gets the error budget ratio that we have (eg: for a 99.99% SLO would be 0.0001
ratio) and multiplies for the total SLI result counts that we could get in the same range as the previous query (5m
), this gives us the max error ratio that we can burn for the given SLO.
Calculating the burndown charts is a little bit more tricky.
- Taking the previous example we are calculating error budget based on 1 month, this are 43200m (30 * 24 * 60).
- Our SLO objective is 99.99 (in ratio: 0.9998999999999999)
- Error budget is based in a 100% for 30d that decrements when availability is less than 99.99% (like the SLO specifies).
(
(
(1 - service_level_slo_objective_ratio) * 43200 * increase(service_level_sli_result_count_total[1m])
-
increase(service_level_sli_result_error_ratio_total[${range}])
)
/
(
(1 - service_level_slo_objective_ratio) * 43200 * increase(service_level_sli_result_count_total[1m])
)
) * 100
Let's decompose the query.
(1 - service_level_slo_objective_ratio) * 43200 * increase(service_level_sli_result_count_total[1m])
is the total ratio measured in 1m (sucess + failures) multiplied by the number of minutes in a month and the error budget ratio(1-0.9998999999999999). In other words this is the total (sum) number of error budget for 1 month we have.
increase(service_level_sli_result_error_ratio_total[${range}])
this is the SLO error sum that we had in ${range} (range changes over time, the first day of the month will be 1d, the 15th of the month will be 15d).
So (1 - service_level_slo_objective_ratio) * 43200 * increase(service_level_sli_result_count_total[1m]) - increase(service_level_sli_result_error_ratio_total[${range}])
returns the number of remaining error budget we have after ${range}
.
If we take that last part and divide for the total error budget we have for the month ((1 - service_level_slo_objective_ratio) * 43200 * increase(service_level_sli_result_count_total[1m])
) this returns us a ratio of the error budget consumed. Multiply by 100 and we have the percent of error budget consumed after ${range}
.
The operator gives the SLI/SLOs in the same format so we could create 1 alert for all of our SLOs, or be more specific and filter by labels.
groups:
- name: slo.rules
rules:
- alert: SLOErrorBudgetBurnRateTooFast
expr: |
increase(service_level_sli_result_error_ratio_total[1h])
>
(
(1 - service_level_slo_objective_ratio)
* increase(service_level_sli_result_count_total[1h])
)
for: 10m
labels:
severity: critical
team: a-team
annotations:
summary: The SLO error budget burn rate is too fast
description: The error rate in 24h for the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is too fast, at this rate the service error budget will be fully consumed.
This alert alerts when the total errors in 1h is greater than the specified error budget based on the SLO. In other words this would mean that if we continue with this error rate we will consume the error budget in less time that we want.