Record error budget metrics

The error budget for an SLO defines the acceptable amount of unreliability that can occur within a given period without violating the SLO.
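
To make the definition concrete, the following minimal sketch works through the arithmetic for a request-based SLO. The function names `error_budget_ratio` and `allowed_failures` and the traffic figure are hypothetical and chosen only for illustration; they are not part of slom.

```python
def error_budget_ratio(objective: float) -> float:
    """Return the fraction of requests that may fail under the compliance target."""
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    return 1.0 - objective


def allowed_failures(objective: float, total_requests: int) -> float:
    """Return how many failed requests the budget tolerates within the window."""
    return error_budget_ratio(objective) * total_requests


if __name__ == "__main__":
    # Assumed traffic: 1,000,000 requests in the window, 99% compliance target.
    print(round(allowed_failures(0.99, 1_000_000)))
```

Under these assumptions, a 99% target leaves a budget of 1% of requests, so roughly 10,000 of the 1,000,000 requests could fail before the SLO is violated.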

This document describes how to write an SLO spec to record remaining error budgets.

Update the spec to record remaining error budget ratio

We assume that the example service has a rolling four-week availability SLO with a 99% compliance target.

To record the remaining error budget, we update the previous SLO spec to define an objective field and a rolling four-week window.

example.yaml

name: example

slos:
  - name: availability
    objective: # (1)!
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    windows:
      - name: window-5m
        rolling:
          duration: 5m
      - name: window-4w # (2)!
        rolling:
          duration: 4w

(1) UPDATED: Added an objective field to define a 99% SLO over a rolling four-week window.
(2) UPDATED: Added a rolling four-week window, window-4w, to record the error rate and the error budget.

Note that the new objective field in the availability SLO specifies a ratio field with a target compliance ratio of 99%, and a windowRef field that refers to the rolling four-week window defined in the windows field.

Generate a Prometheus rule file

After updating the SLO spec file, run the slom generate prometheus-rule command to generate a Prometheus rule file based on the SLO spec.

slom generate prometheus-rule example.yaml

The command prints the following output.

groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example

The output contains a new recording rule, job:slom_error_budget:ratio_rate4w, that records the ratio of the remaining error budget to the initial budget of (100 - SLO)%, which is 1% in this example. The value of this metric provides the following insights:

A value of 1 indicates that no failures have impacted reliability within the time window.
A positive value means the service is compliant with the SLO at that point in time.
A negative value signals that the SLO has been breached at that point in time.
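
The three cases above can be checked with a small sketch that mirrors the expression of the generated rule, `1 - error_ratio / (1 - 0.99)`. The function name `remaining_error_budget` and the sample error ratios are hypothetical and used only for illustration.

```python
def remaining_error_budget(error_ratio: float, objective: float) -> float:
    """Return the remaining budget as a ratio of the initial budget.

    Mirrors the generated rule expression: 1 - error_ratio / (1 - objective).
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    return 1.0 - error_ratio / (1.0 - objective)


if __name__ == "__main__":
    objective = 0.99
    # Assumed four-week error ratios: no errors, half the budget spent, budget overspent.
    for error_ratio in (0.0, 0.005, 0.02):
        budget = remaining_error_budget(error_ratio, objective)
        print(f"error ratio {error_ratio:.3f} -> remaining budget {budget:.2f}")
```

With a 99% objective, an error ratio of 0 leaves the full budget (1), an error ratio of 0.5% leaves about half of it, and an error ratio of 2% yields a negative value of about -1, meaning the budget has been overspent by its full size.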

For more details about the generated Prometheus rules, please refer to the Prometheus reference.