Record SLI metrics

This document describes how to write an SLO spec that records SLI metrics for an example service.

Create a spec to record the error rate

The example below is a minimal SLO spec file that records the error rate of HTTP requests from the http_requests_total metric. This is equivalent to the SLI for an availability SLO.

example.yaml
name: example

slos:
  - name: availability
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    windows:
      - name: window-5m
        rolling:
          duration: 5m

The name field contains the name of the SLO spec. An SLO spec file corresponds to a single service or critical user journey, so we set this field to the service name, example.

The slos field contains the list of SLO declarations. Here, we start by declaring a single SLO, availability, for this service.

The indicator field contains the configuration for the service level indicator (SLI). Because we use Prometheus to monitor our service, we write the indicator configuration under the prometheus field.

The errorRatio field in the prometheus indicator specifies a PromQL query that calculates the ratio of errors to all requests. The range of the rate function is left unspecified by using the $window placeholder, which allows slom to generate rules that derive error rates over different time windows. The query also retains the job label by using sum by, because Prometheus may be monitoring other services as well.
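To make the role of the $window placeholder concrete, here is a minimal Python sketch of the substitution. It only illustrates the idea with the standard library's string.Template; it is not slom's actual implementation, and the helper name expand_window is hypothetical.

```python
from string import Template

# The errorRatio query from example.yaml, with the $window placeholder intact.
ERROR_RATIO = (
    'sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) / '
    'sum by (job) (rate(http_requests_total{job="example"}[$window]))'
)


def expand_window(query: str, window: str) -> str:
    """Replace every $window placeholder with a concrete range such as "5m"."""
    return Template(query).substitute(window=window)


if __name__ == "__main__":
    # One query per time window; only the range selector differs.
    for window in ("5m", "1h"):
        print(expand_window(ERROR_RATIO, window))
```

The same query text can therefore serve several windows, with only the range inside the square brackets changing.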

The level field in the prometheus indicator specifies the aggregation level of the query. We set this field to [job], since the query retains the job label as described above.
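The effect of aggregating at the job level can be illustrated with a small Python sketch. The per-series request rates below are made-up numbers, and error_ratio_by_job is a hypothetical helper that mimics sum by (job) followed by the division; it is not part of slom or Prometheus. Unlike the query in the spec, it does not filter on job="example", so that the effect of keeping the job label is visible.

```python
import re
from collections import defaultdict

# Hypothetical per-series request rates (requests/second), keyed by (job, code).
RATES = {
    ("example", "200"): 95.0,
    ("example", "500"): 5.0,
    ("other", "200"): 48.0,
    ("other", "404"): 2.0,
}


def error_ratio_by_job(rates: dict[tuple[str, str], float]) -> dict[str, float]:
    """Sum non-2xx rates and all rates per job, then divide them per job."""
    errors: dict[str, float] = defaultdict(float)
    total: dict[str, float] = defaultdict(float)
    for (job, code), rate in rates.items():
        total[job] += rate
        # Same intent as the code!~"2.." matcher: anything but a 2xx code is an error.
        if not re.fullmatch("2..", code):
            errors[job] += rate
    # One ratio per job, analogous to one output series per job label value.
    return {job: errors[job] / total[job] for job in errors if total[job] > 0}


if __name__ == "__main__":
    for job, ratio in sorted(error_ratio_by_job(RATES).items()):
        print(f"{job}: {ratio:.2%}")
```

Because the result keeps one entry per job, the ratios of different services stay separate even though they are computed the same way.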

The windows field specifies a list of time windows for recording the SLI. For now, we specify only a 5-minute rolling window to record the current error rate. Later, we will add longer time windows for recording error budgets and issuing alerts.

Generate a Prometheus rule file

Run the slom generate prometheus-rule command to generate a Prometheus rule file based on the SLO spec.

slom generate prometheus-rule example.yaml

The following output will be displayed.

groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example

This is a Prometheus rule file, a file that defines recording rules and alerting rules for Prometheus to evaluate. This one contains two recording rules that

record the 5-minute error rate as the job:slom_error:ratio_rate5m metric, and
record the metadata for the SLO as the slom_slo metric.

For more details about the generated Prometheus rules, please refer to the Prometheus reference.
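Once Prometheus has loaded the rule file, the recorded SLI can be read back like any other metric. The sketch below builds an instant-query URL for the Prometheus HTTP API (/api/v1/query). The base URL http://localhost:9090 and the helper name instant_query_url are assumptions made for illustration; they are not defined by slom.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint, /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Select the 5-minute error ratio recorded for the example SLO.
SLI_QUERY = 'job:slom_error:ratio_rate5m{slom_id="example-availability"}'

if __name__ == "__main__":
    # Assumed local Prometheus address; adjust it to your deployment.
    print(instant_query_url("http://localhost:9090", SLI_QUERY))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the current value of the recorded error ratio, provided that Prometheus is reachable at that address.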