Alert on error budget consumption
slom supports specifying alerting rules that trigger when a certain portion of the error budget has been consumed within the SLO window.
Update the spec to alert on error budget consumption
Suppose we want to trigger an alert when 90% of the error budget has been consumed within the current four-week rolling SLO window.
To achieve this, update the previous SLO spec to include a new alert with an errorBudget field, as shown below.
example.yaml
name: example
slos:
- name: availability
  objective:
    ratio: 0.99
    windowRef: window-4w
  indicator:
    prometheus:
      errorRatio: >-
        sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[$window])) /
        sum by (job) (rate(http_requests_total{job="example"}[$window]))
      level:
      - job
  alerts:
  - burnRate:
      consumedBudgetRatio: 0.02
      multiWindows:
        shortWindowRef: window-5m
        longWindowRef: window-1h
    alerter:
      prometheus:
        name: SLOHighBurnRate
        labels:
          severity: page
        annotations:
          description: 2% of the error budget has been consumed within 1 hour
  - burnRate:
      consumedBudgetRatio: 0.1
      multiWindows:
        shortWindowRef: window-6h
        longWindowRef: window-3d
    alerter:
      prometheus:
        name: SLOHighBurnRate
        labels:
          severity: ticket
        annotations:
          description: 10% of the error budget has been consumed within 3 days
  - errorBudget: # (1)!
      consumedBudgetRatio: 0.9
    alerter:
      prometheus:
        name: SLOTooMuchErrorBudgetConsumed
        labels:
          severity: page
        annotations:
          description: 90% of the error budget has been consumed in the current SLO window
windows:
- name: window-5m
  rolling:
    duration: 5m
- name: window-1h
  rolling:
    duration: 1h
- name: window-6h
  rolling:
    duration: 6h
- name: window-3d
  rolling:
    duration: 3d
- name: window-4w
  rolling:
    duration: 4w
- UPDATED: Added errorBudget alert
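Unlike the burnRate alerts above, which fire when the budget is being consumed too quickly over short windows, the errorBudget alert tracks cumulative consumption over the whole four-week SLO window: with consumedBudgetRatio: 0.9, it fires once 90% of the budget has been consumed, i.e. once the remaining budget ratio drops to 1 - 0.9 = 0.1.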
Generate a Prometheus rule file
After updating the SLO spec file, run the slom generate prometheus-rule command to generate a Prometheus rule file from the SLO spec.
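For example, an invocation along the following lines prints the generated rules to standard output (the positional example.yaml argument is an assumption here, not confirmed usage; check slom's help output for the exact arguments):

slom generate prometheus-rule example.yaml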
Then, the following output will be displayed.
groups:
- name: slom:example-availability:default
  rules:
  - record: job:slom_error:ratio_rate5m
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[5m])) / sum by (job) (rate(http_requests_total{job="example"}[5m]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate1h
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[1h])) / sum by (job) (rate(http_requests_total{job="example"}[1h]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate6h
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[6h])) / sum by (job) (rate(http_requests_total{job="example"}[6h]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate3d
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[3d])) / sum by (job) (rate(http_requests_total{job="example"}[3d]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error:ratio_rate4w
    expr: sum by (job) (rate(http_requests_total{job="example", code!~"2.."}[4w])) / sum by (job) (rate(http_requests_total{job="example"}[4w]))
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - record: job:slom_error_budget:ratio_rate4w
    expr: 1 - job:slom_error:ratio_rate4w{slom_id="example-availability"} / (1 - 0.99)
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
  - alert: SLOHighBurnRate
    expr: job:slom_error:ratio_rate1h{slom_id="example-availability"} > 13.44 * 0.010000000000000009 and job:slom_error:ratio_rate5m{slom_id="example-availability"} > 13.44 * 0.010000000000000009
    labels:
      severity: page
    annotations:
      description: 2% of the error budget has been consumed within 1 hour
  - alert: SLOHighBurnRate
    expr: job:slom_error:ratio_rate3d{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009 and job:slom_error:ratio_rate6h{slom_id="example-availability"} > 0.9333333333333333 * 0.010000000000000009
    labels:
      severity: ticket
    annotations:
      description: 10% of the error budget has been consumed within 3 days
  - alert: SLOTooMuchErrorBudgetConsumed
    expr: job:slom_error_budget:ratio_rate4w{slom_id="example-availability"} <= 1 - 0.9
    labels:
      severity: page
    annotations:
      description: 90% of the error budget has been consumed in the current SLO window
- name: slom:example-availability:meta
  rules:
  - record: slom_slo
    expr: 0.99
    labels:
      slom_id: example-availability
      slom_slo: availability
      slom_spec: example
Note the new alerting rule SLOTooMuchErrorBudgetConsumed, which is triggered when 90% of the error budget has been consumed in the current SLO window.
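Two of the generated rules implement this alert together: the recording rule job:slom_error_budget:ratio_rate4w computes the remaining error budget as a ratio of the total budget, and the alerting rule fires once that ratio falls to 1 - 0.9 = 0.1 or below. As a worked example with hypothetical numbers, suppose 0.9% of requests failed over the four-week window:

remaining budget ratio = 1 - 0.009 / (1 - 0.99) = 1 - 0.9 = 0.1

Since 0.1 <= 1 - 0.9, the SLOTooMuchErrorBudgetConsumed alert fires. (Incidentally, the 0.010000000000000009 factor in the burn-rate expressions is simply 1 - 0.99 evaluated in floating-point arithmetic.) To put the rules into effect, save the command's output to a file and reference it from the rule_files section of your Prometheus configuration.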