Alert on error budget consumption
slom supports specifying alerting rules that trigger when a certain portion of the error budget has been consumed within the SLO window.
Update the spec to alert on error budget consumption
Suppose we want to trigger an alert when 90% of the error budget has been consumed within the current four-week rolling SLO window.
To achieve this, update the previous SLO spec to include a new alert with an errorBudget field, as shown below.
example.yaml
name: example
slos:
- name: availability
  objective:
    ratio: 0.99
    windowRef: window-4w
  indicator:
    prometheus:
      errorRatio: >-
        sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
        sum by (job) (rate(http_requests_total{job="example"}[$window]))
      level:
      - job
  alerts:
  - burnRate:
      consumedBudgetRatio: 0.02
      multiWindows:
        shortWindowRef: window-5m
        longWindowRef: window-1h
    alerter:
      prometheus:
        name: SLOHighBurnRate
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
  - burnRate:
      consumedBudgetRatio: 0.1
      multiWindows:
        shortWindowRef: window-6h
        longWindowRef: window-3d
    alerter:
      prometheus:
        name: SLOHighBurnRate
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - errorBudget: # (1)!
      consumedBudgetRatio: 0.9
    alerter:
      prometheus:
        name: SLOTooMuchErrorBudgetConsumed
        labels:
          severity: page
        annotations:
          description: 90% of the error budget has been consumed in the current SLO window
windows:
- name: window-5m
  rolling:
    duration: 5m
- name: window-1h
  rolling:
    duration: 1h
- name: window-6h
  rolling:
    duration: 6h
- name: window-3d
  rolling:
    duration: 3d
- name: window-4w
  rolling:
    duration: 4w
- UPDATED: Added errorBudget alert
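Unlike the burnRate alerts above, which fire when the budget is being consumed too quickly over short windows, the errorBudget alert tracks cumulative consumption over the whole four-week SLO window: with consumedBudgetRatio: 0.9, it fires once 90% of the budget has been consumed, i.e. once the remaining budget ratio drops to 1 - 0.9 = 0.1.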
Generate a Prometheus rule file
After updating the SLO spec file, run the slom generate prometheus-rule command to generate a Prometheus rule file from the SLO spec.
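For example, an invocation along the following lines prints the generated rules to standard output (the positional example.yaml argument is an assumption here, not confirmed usage; check slom's help output for the exact arguments):

slom generate prometheus-rule example.yaml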
Then, the following output will be displayed.
groups:
- name: slom:example-availability:default
  rules:
  - record: job:slom_error:ratio_rate5m
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate1h
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate6h
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[6h])) / sum by (job) (rate(http_requests_total{job="example"}[6h]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate3d
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate4w
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error_budget:ratio_rate4w
    expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - alert: SLOHighBurnRate
    expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009 and job:slom_error:ratio_rate5m{slom_id="example-availability"} > 13.44 * 0.010000000000000009
    labels:
      severity: page
    annotations:
      description: 2% of the error budget has been consumed within 1 hour
  - alert: SLOHighBurnRate
    expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009 and job:slom_error:ratio_rate6h{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
    labels:
      severity: ticket
    annotations:
      description: 10% of the error budget has been consumed within 3 days
  - alert: SLOTooMuchErrorBudgetConsumed
    expr: job:slom_error_budget:ratio_rate4w{slom_id="example-availability"} <= 1 - 0.9
    labels:
      severity: page
    annotations:
      description: 90% of the error budget has been consumed in the current SLO window
- name: slom:example-availability:meta
  rules:
  - record: slom_slo
    expr: 0.99
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
Note the new alerting rule SLOTooMuchErrorBudgetConsumed, which is triggered when 90% of the error budget has been consumed in the current SLO window.
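Two of the generated rules implement this alert together: the recording rule job:slom_error_budget:ratio_rate4w computes the remaining error budget as a ratio of the total budget, and the alerting rule fires once that ratio falls to 1 - 0.9 = 0.1 or below. As a worked example with hypothetical numbers, suppose 0.9% of requests failed over the four-week window:

remaining budget ratio = 1 - 0.009 / (1 - 0.99) = 1 - 0.9 = 0.1

Since 0.1 <= 1 - 0.9, the SLOTooMuchErrorBudgetConsumed alert fires. (Incidentally, the 0.010000000000000009 factor in the burn-rate expressions is simply 1 - 0.99 evaluated in floating-point arithmetic.) To put the rules into effect, save the command's output to a file and reference it from the rule_files section of your Prometheus configuration.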