# Alert on error budget burn rate
It is generally recommended to trigger an alert when a certain portion of the SLO error budget has been consumed, in order to prevent the budget from being fully depleted. Google's SRE Workbook introduces the concept of burn rate as a technique for implementing this kind of alerting.
slom also supports specifying alerting rules based on the burn rate or error budget consumption over a certain lookback period.
## Alert on single burn rate
Suppose we want to trigger a page when 2% of the error budget is consumed within an hour. With the 4-week (28-day) SLO window, this corresponds to alerting on a burn rate of 2% * 28 days / 1 hour = 13.44.
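As a general rule (this just restates the arithmetic above; it is not additional slom behavior), the burn rate follows from the consumed budget ratio, the SLO window, and the alert lookback window:

$$
\text{burn rate} = \frac{\text{consumed budget ratio} \times \text{SLO window}}{\text{alert window}} = \frac{0.02 \times 28 \times 24\,\mathrm{h}}{1\,\mathrm{h}} = 13.44
$$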
To generate a Prometheus rule file for such an alert, update the previous SLO spec as shown below.
```yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate: # (1)!
          consumedBudgetRatio: 0.02
          singleWindow:
            windowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h # (2)!
    rolling:
      duration: 1h
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added a `burnRate` alert to page someone when 2% of the error budget is consumed within one hour.
2. UPDATED: Added a one-hour rolling window `window-1h` for the alert.
The `alerts` field contains alert specifications. The `name` field in `alerts` items specifies the name of the alert.

The `burnRate` field in `alerts` items signifies a burn-rate-based alert. We set `consumedBudgetRatio` to `0.02`, so that an alert is triggered when 2% of the error budget is consumed within the window. Also, we set the window period to 1 hour by specifying `window-1h` in the `singleWindow` field.

The `alerter` field in `alerts` items defines how alerts are triggered. In this case, we use the `prometheus` alerter, which allows you to configure the alert name, labels, and annotations as described in the official guide.
After updating the SLO spec file, run the `slom generate prometheus-rule` command to generate a Prometheus rule file based on the SLO spec. The following output will be displayed.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```
You can see that there is a new alerting rule, `SLOHighBurnRate`, which fires when the burn rate exceeds 13.44.
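The threshold in the generated expression is simply the burn rate multiplied by the error budget, `13.44 * (1 - 0.99)`; the long decimal in the output is just the floating-point representation of `0.01`. Conceptually, the alert compares the 1-hour error ratio against that product, as in this equivalent hand-written (not generated) PromQL:

```promql
# Fires when the 1-hour error ratio exceeds burn_rate * (1 - SLO target)
job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * (1 - 0.99)
```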
## Alert on multiple burn rates
It is a good idea to trigger alerts over different time windows to capture two kinds of issues: fast burn and slow burn.
- Fast burn: Alerts are triggered when a significant portion of the error budget is consumed in a short time window (e.g., one hour). This detects urgent issues and typically pages someone for immediate attention.
- Slow burn: Alerts are triggered when a portion of the error budget is consumed over a longer window (e.g., three days). This detects issues that tend to go unnoticed but can gradually exhaust the error budget. A ticket is usually filed for resolution during regular working hours.
You can configure alerts on multiple burn rates by adding more items to the `alerts` field.
The example below configures alerts for:
- Fast burn: Triggered when 2% of the error budget is consumed within 1 hour (burn rate = 2% * 28 days / 1 hour = 13.44).
- Slow burn: Triggered when 10% of the error budget is consumed within 3 days (burn rate = 10% * 28 days / 3 days ≈ 0.933).
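The slow-burn rate comes from the same relation as before, just with the larger budget ratio and the longer lookback window:

$$
\text{burn rate} = \frac{0.1 \times 28\,\mathrm{d}}{3\,\mathrm{d}} = \frac{2.8}{3} \approx 0.933
$$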
```yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate:
          consumedBudgetRatio: 0.02
          singleWindow:
            windowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
      - burnRate: # (1)!
          consumedBudgetRatio: 0.1
          singleWindow:
            windowRef: window-3d
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: ticket
            annotations:
              description: 10% of the error budget has been consumed within 3 days
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h
    rolling:
      duration: 1h
  - name: window-3d # (2)!
    rolling:
      duration: 3d
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added the slow burn alert.
2. UPDATED: Added a three-day rolling window `window-3d` for the alert on slow burn.
After running `slom generate prometheus-rule` for the updated spec file, you will see that a new alerting rule for slow burn has been added.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate3d
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```
## Alert with multiple windows
The "6: Multiwindow, Multi-Burn-Rate Alerts" section of Google's SRE Workbook recommends combining the burn rate alerting rule with a shorter window (e.g., 1/12 of the original window). This approach reduces the alert reset time and minimizes the number of false positives.
The updated example below configures:
- A 5-minute short window for the fast burn alert (whose long window is 1 hour).
- A 6-hour short window for the slow burn alert (whose long window is 3 days).
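Both short windows follow the 1/12 ratio suggested in the SRE Workbook:

$$
\frac{1\,\mathrm{h}}{12} = 5\,\mathrm{min}, \qquad \frac{3\,\mathrm{d}}{12} = \frac{72\,\mathrm{h}}{12} = 6\,\mathrm{h}
$$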
```yaml
# example.yaml
name: example
slos:
  - name: availability
    objective:
      ratio: 0.99
      windowRef: window-4w
    indicator:
      prometheus:
        errorRatio: >-
          sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
          sum by (job) (rate(http_requests_total{job="example"}[$window]))
        level:
          - job
    alerts:
      - burnRate:
          consumedBudgetRatio: 0.02
          multiWindows: # (1)!
            shortWindowRef: window-5m
            longWindowRef: window-1h
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: page
            annotations:
              description: 2% of the error budget has been consumed within 1 hour
      - burnRate:
          consumedBudgetRatio: 0.1
          multiWindows: # (2)!
            shortWindowRef: window-6h
            longWindowRef: window-3d
        alerter:
          prometheus:
            name: SLOHighBurnRate
            labels:
              severity: ticket
            annotations:
              description: 10% of the error budget has been consumed within 3 days
windows:
  - name: window-5m
    rolling:
      duration: 5m
  - name: window-1h
    rolling:
      duration: 1h
  - name: window-6h # (3)!
    rolling:
      duration: 6h
  - name: window-3d
    rolling:
      duration: 3d
  - name: window-4w
    rolling:
      duration: 4w
```
1. UPDATED: Added the 5m short window `window-5m` with `multiWindows`.
2. UPDATED: Added the 6h short window `window-6h` with `multiWindows`.
3. UPDATED: Added a short window `window-6h` for slow burn.
You will notice that the updated alert specifications use `multiWindows` instead of `singleWindow` to configure short time windows.
After running `slom generate prometheus-rule` for the updated spec file, you will see that the alerting rules now also check the short windows.
```yaml
groups:
  - name: slom:example-availability:default
    rules:
      - record: job:slom_error:ratio_rate5m
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate1h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate6h
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[6h])) / sum by (job) (rate(http_requests_total{job="example"}[6h]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate3d
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error:ratio_rate4w
        expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - record: job:slom_error_budget:ratio_rate4w
        expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009 and job:slom_error:ratio_rate5m{slom_id="example-availability"} > 13.44 * 0.010000000000000009
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
      - alert: SLOHighBurnRate
        expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009 and job:slom_error:ratio_rate6h{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - name: slom:example-availability:meta
    rules:
      - record: slom_slo
        expr: 0.99
        labels:
          slom_id: example-availability
          slom_slo: availability
          slom_spec: example
```