Record error budget metrics
The error budget for an SLO defines the acceptable amount of unreliability that can occur within a given period without violating the SLO.
This document describes how to write an SLO spec to record remaining error budgets.
Update the spec to record remaining error budget ratio
We assume that the example service has a rolling four-week availability SLO with a 99% compliance target.
To record the remaining error budget, we update the previous SLO spec to define an objective field and a rolling four-week window.
name: example
slos:
  - name: availability
    objective: # (1)!
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    windows:
      - name: window-5m
        rolling:
          duration: 5m
      - name: window-4w # (2)!
        rolling:
          duration: 4w
- UPDATED: Added the objective field to define a 99% SLO with a rolling four-week window.
- UPDATED: Added a four-week rolling window, window-4w, to record the error rate and error budget.
Note that the new objective field in the availability SLO specifies a ratio field with a target compliance ratio of 99%, and a windowRef field that refers to a rolling four-week window defined in the windows field.
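To make the 99% target concrete, the following minimal sketch converts an objective ratio into an error budget and into a number of tolerated failed requests. The function names and the traffic figure of 1,000,000 requests per window are hypothetical and chosen only for illustration; they are not part of slom.

```python
def error_budget_ratio(objective_ratio: float) -> float:
    """Return the fraction of requests that may fail under the objective."""
    if not 0.0 < objective_ratio < 1.0:
        raise ValueError("objective_ratio must be strictly between 0 and 1")
    return 1.0 - objective_ratio


def allowed_failures(total_requests: int, objective_ratio: float) -> float:
    """Return how many failed requests the budget tolerates in one window."""
    return total_requests * error_budget_ratio(objective_ratio)


if __name__ == "__main__":
    # Assumption: 1,000,000 requests are served during the four-week window.
    print(f"error budget ratio: {error_budget_ratio(0.99):.4f}")
    print(f"allowed failures: {allowed_failures(1_000_000, 0.99):.0f}")
```

Under these assumptions, a 99% objective leaves a budget of about 1% of requests, or roughly 10,000 failed requests out of 1,000,000 in the window.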
Generate a Prometheus rule file
After updating the SLO spec file, run the slom generate prometheus-rule command to generate a Prometheus rule file based on the SLO spec.
The command prints the following output.
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
The output contains a new recording rule, job:slom_error_budget:ratio_rate4w, that records the ratio of the remaining error budget to the initial budget, (100 - SLO)%, which is 1% for this 99% objective.
The value of this metric provides the following insights:
- A value of 1 indicates that no failures have impacted reliability within the time window.
- A positive value means the service is compliant with the SLO at that point in time.
- A negative value signals that the SLO has been breached at that point in time.
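The interpretation above can be checked with a small sketch that mirrors the arithmetic of the generated expression, 1 - error_ratio / (1 - 0.99). The function name and the sample error ratios are hypothetical and serve only to illustrate the three cases; Prometheus evaluates the actual rule.

```python
def remaining_error_budget(error_ratio: float, objective_ratio: float) -> float:
    """Return the remaining budget as a fraction of the initial budget.

    Mirrors the recording rule: 1 - error_ratio / (1 - objective_ratio).
    """
    budget = 1.0 - objective_ratio
    if budget <= 0.0:
        raise ValueError("objective_ratio must be below 1")
    return 1.0 - error_ratio / budget


if __name__ == "__main__":
    # Hypothetical four-week error ratios, from no errors to twice the budget.
    for error_ratio in (0.0, 0.005, 0.01, 0.02):
        remaining = remaining_error_budget(error_ratio, 0.99)
        print(f"error ratio {error_ratio:.3f} -> remaining budget {remaining:.2f}")
```

With a 99% objective, an error ratio of 0 leaves the full budget (1), an error ratio of 0.5% leaves about half of it, an error ratio of 1% leaves approximately none, and an error ratio of 2% yields a value of about -1, which indicates a breach.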
For more details about the generated Prometheus rules, please refer to the Prometheus reference.