# Alert on error budget burn rate
It is generally recommended to trigger an alert when a certain portion of the SLO error budget has been consumed, in order to prevent the budget from being fully depleted. Google's SRE Workbook introduces the concept of burn rate as a technique for implementing this kind of alerting.
slom also supports specifying alerting rules based on the burn rate or error budget consumption over a certain lookback period.
## Alert on single burn rate
Suppose we want to trigger a page when 2% of the error budget is consumed within an hour. With the 4-week (28-day) SLO window, this corresponds to alerting on a burn rate of 2% * 28 days / 1 hour = 13.44.
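As a general rule (this just restates the arithmetic above; it is not additional slom behavior), the burn rate follows from the consumed budget ratio, the SLO window, and the alert lookback window:

$$
\text{burn rate} = \frac{\text{consumed budget ratio} \times \text{SLO window}}{\text{alert window}} = \frac{0.02 \times 28 \times 24\,\mathrm{h}}{1\,\mathrm{h}} = 13.44
$$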
To generate a Prometheus rule file for such an alert, update the previous SLO spec as shown below.
```yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate: # (1)!
          consumedBudgetRatio: 0.02
          singleWindow:
            windowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h # (2)!
    rolling:
      duration: 1h
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added a `burnRate` alert to page someone when 2% of the error budget is consumed within one hour.
2. UPDATED: Added a one-hour rolling window `window-1h` for the alert.
The `alerts` field contains alert specifications. The `name` field in `alerts` items specifies the name of the alert.

The `burnRate` field in `alerts` items signifies a burn-rate-based alert. We set `consumedBudgetRatio` to `0.02`, so that an alert is triggered when 2% of the error budget is consumed within the window. Also, we set the window period to 1 hour by specifying `window-1h` in the `singleWindow` field.

The `alerter` field in `alerts` items defines how alerts are triggered. In this case, we use the `prometheus` alerter, which allows you to configure the alert name, labels, and annotations as described in the official guide.
After updating the SLO spec file, run the `slom generate prometheus-rule` command to generate a Prometheus rule file based on the SLO spec. The following output will be displayed.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```
You can see that there is a new alerting rule, `SLOHighBurnRate`, which fires when the burn rate exceeds 13.44.
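The threshold in the generated expression is simply the burn rate multiplied by the error budget, `13.44 * (1 - 0.99)`; the long decimal in the output is just the floating-point representation of `0.01`. Conceptually, the alert compares the 1-hour error ratio against that product, as in this equivalent hand-written (not generated) PromQL:

```promql
# Fires when the 1-hour error ratio exceeds burn_rate * (1 - SLO target)
job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * (1 - 0.99)
```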
## Alert on multiple burn rates
It is a good idea to trigger alerts over different time windows to capture two kinds of issues: fast burn and slow burn.
- Fast burn: Alerts are triggered when a significant portion of the error budget is consumed in a short time window (e.g., one hour). This detects urgent issues and typically pages someone for immediate attention.
- Slow burn: Alerts are triggered when a portion of the error budget is consumed over a longer window (e.g., three days). This detects issues that tend to go unnoticed but can gradually exhaust the error budget. A ticket is usually filed for resolution during regular working hours.
You can configure alerts on multiple burn rates by adding more items to the `alerts` field.
The example below configures alerts for:
- Fast burn: Triggered when 2% of the error budget is consumed within 1 hour (burn rate = 2% * 28 days / 1 hour = 13.44).
- Slow burn: Triggered when 10% of the error budget is consumed within 3 days (burn rate = 10% * 28 days / 3 days ≈ 0.933).
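The slow-burn rate comes from the same relation as before, just with the larger budget ratio and the longer lookback window:

$$
\text{burn rate} = \frac{0.1 \times 28\,\mathrm{d}}{3\,\mathrm{d}} = \frac{2.8}{3} \approx 0.933
$$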
```yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate:
          consumedBudgetRatio: 0.02
          singleWindow:
            windowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
      - burnRate: # (1)!
          consumedBudgetRatio: 0.1
          singleWindow:
            windowRef: window-3d
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: ticket
            annotations:
              description: 10% of the error budget has been consumed within 3 days
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h
    rolling:
      duration: 1h
  - name: window-3d # (2)!
    rolling:
      duration: 3d
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added the slow burn alert.
2. UPDATED: Added a three-day rolling window `window-3d` for the alert on slow burn.
After running `slom generate prometheus-rule` for the updated spec file, you will see that a new alerting rule for slow burn has been added.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate3d
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```
## Alert with multiple windows
The "6: Multiwindow, Multi-Burn-Rate Alerts" section of Google's SRE Workbook recommends combining the burn rate alerting rule with a shorter window (e.g., 1/12 of the original window). This approach reduces the alert reset time and minimizes the number of false positives.
The updated example below configures:
- A 5-minute short window for the fast burn alert (whose long window is 1 hour).
- A 6-hour short window for the slow burn alert (whose long window is 3 days).
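Both short windows follow the 1/12 ratio suggested in the SRE Workbook:

$$
\frac{1\,\mathrm{h}}{12} = 5\,\mathrm{min}, \qquad \frac{3\,\mathrm{d}}{12} = \frac{72\,\mathrm{h}}{12} = 6\,\mathrm{h}
$$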
```yaml
# example.yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate:
          consumedBudgetRatio: 0.02
          multiWindows: # (1)!
            shortWindowRef: window-5m
            longWindowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
      - burnRate:
          consumedBudgetRatio: 0.1
          multiWindows: # (2)!
            shortWindowRef: window-6h
            longWindowRef: window-3d
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: ticket
            annotations:
              description: 10% of the error budget has been consumed within 3 days
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h
    rolling:
      duration: 1h
  - name: window-6h # (3)!
    rolling:
      duration: 6h
  - name: window-3d
    rolling:
      duration: 3d
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added the 5m short window `window-5m` with `multiWindows`.
2. UPDATED: Added the 6h short window `window-6h` with `multiWindows`.
3. UPDATED: Added a short window `window-6h` for slow burn.
You will notice that the updated alert specifications use `multiWindows` instead of `singleWindow` to configure short time windows.
After running `slom generate prometheus-rule` for the updated spec file, you will see that the alerting rules now also check the short windows.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate6h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[6h])) / sum by (job) (rate(http_requests_total{job="example"}[6h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate3d
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009 and job:slom_error:ratio_rate5m{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009 and job:slom_error:ratio_rate6h{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```