Defining SLOs and Writing Burn Rate Alerts in Prometheus
Turn a reliability goal into an alerting rule. This tutorial shows you how to express an SLO as a Prometheus query, calculate burn rates, and write multiwindow alerts that page you before users notice.
Before you begin
- Prometheus running with application metrics
- Basic PromQL knowledge
- A way to load rules: the PrometheusRule CRD (kube-prometheus-stack) or a rule file on a plain Prometheus install
Most teams write threshold alerts: "page me if error rate > 5%." The problem: a 1% error rate sustained for three days consumes your entire monthly error budget, but never fires the alert. Burn rate alerts fix this.
The Concepts
SLO (Service Level Objective): A target for reliability. Example: 99.9% of requests succeed over 30 days.
Error budget: The allowable failure. A 99.9% SLO leaves 0.1% for errors: 43.2 minutes of downtime, or 100 bad requests per 100,000, over 30 days.
Burn rate: How fast you're consuming the error budget. Burn rate 1 means you'll exhaust the budget exactly at the end of the window. Burn rate 10 means you'll exhaust it in 1/10th the time (3 days for a 30-day budget).
Why multiwindow: A spike that lasts 2 minutes at burn rate 100 consumes roughly 0.5% of a 30-day budget (100 × 2 min / 43,200 min), real but not page-worthy. A sustained burn rate of 5 for 6 hours is serious. Comparing a short and a long window separates spikes from sustained degradation.
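The relationship between burn rate and time-to-exhaustion can be sketched numerically. A minimal helper (hypothetical, not part of any Prometheus tooling):

```python
# Time to exhaust the error budget at a constant burn rate.
# Burn rate 1 means the budget lasts exactly the full SLO window.
def days_to_exhaustion(burn_rate, window_days=30.0):
    return window_days / burn_rate

print(days_to_exhaustion(1))   # 30.0 -- sustainable for the whole window
print(days_to_exhaustion(10))  # 3.0  -- budget gone in a tenth of the window
```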
Step 1: Define Your SLI (Service Level Indicator)
Start with what you're measuring. For an HTTP service, availability SLI:
# SLI: ratio of successful requests
sum(rate(http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
For an HTTP service, a latency SLI (the ratio of requests served under 500ms; holding this ratio at 99% is equivalent to "P99 < 500ms"):
# Ratio of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Pick one to start. Most teams begin with availability.
Step 2: Express the SLO
For a 99.9% availability SLO over 30 days:
- Error rate threshold: 1 - 0.999 = 0.001 (0.1%)
- Error budget in seconds: 30 * 24 * 3600 * 0.001 = 2592 seconds = 43.2 minutes
- Burn rate 1 error rate: 0.001 (the SLO target)
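The bullet-point arithmetic above is simple enough to verify directly:

```python
# Arithmetic behind the 99.9% / 30-day numbers above.
slo = 0.999
window_days = 30

error_budget_ratio = 1 - slo                          # allowed failure fraction
budget_minutes = window_days * 24 * 60 * error_budget_ratio

print(f"{error_budget_ratio:.4f}")  # 0.0010
print(f"{budget_minutes:.1f}")      # 43.2
```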
Step 3: Calculate Burn Rates for Alert Windows
The Google SRE Workbook recommends two pairs of windows:
| Alert tier | Short window | Long window | Burn rate | Budget consumed |
|---|---|---|---|---|
| Critical (page) | 5m | 1h | 14.4x | 2% in 1h |
| Warning (ticket) | 30m | 6h | 6x | 5% in 6h |
At burn rate 14.4 with a 99.9% SLO:
- Error rate = 0.001 × 14.4 = 1.44%
- You'd exhaust the 30-day budget in 30/14.4 ≈ 2 days
The 5m/1h pair fires when you have a current spike (5m) that's sustained (1h), preventing false positives from brief spikes.
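The burn rates in the table fall out of the budget-consumed column. A sketch of the derivation, assuming a 30-day SLO period:

```python
# Burn rate implied by "spend fraction F of the budget within window W":
# burn_rate = F * slo_period / W  (period and window in the same unit)
def burn_rate(budget_fraction, window_hours, period_days=30.0):
    return budget_fraction * period_days * 24 / window_hours

print(round(burn_rate(0.02, 1), 1))  # 14.4 -- critical tier
print(round(burn_rate(0.05, 6), 1))  # 6.0  -- warning tier
```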
Step 4: Write the Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo-my-app-availability
      interval: 30s
      rules:
        # --- Recording rules for reuse ---
        # Error ratio over 5 minutes
        - record: job:http_error_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))
        # Error ratio over 30 minutes
        - record: job:http_error_ratio:rate30m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{namespace="production"}[30m]))
        # Error ratio over 1 hour
        - record: job:http_error_ratio:rate1h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{namespace="production"}[1h]))
        # Error ratio over 6 hours
        - record: job:http_error_ratio:rate6h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{namespace="production"}[6h]))
        # --- Alert rules ---
        # Critical: fast burn, 2% of the monthly budget in 1 hour.
        # Fires only when the 5m AND 1h burn rates are both above threshold.
        - alert: AvailabilitySLOBurnRateCritical
          expr: |
            job:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            job:http_error_ratio:rate1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "SLO burn rate critical: fast error budget exhaustion"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 5m window.
              At this burn rate (14.4x or more), the 30-day error budget is
              exhausted in roughly 2 days or less.
              Runbook: https://wiki.internal/runbooks/availability-slo
        # Warning: slow burn, 5% of the monthly budget in 6 hours
        - alert: AvailabilitySLOBurnRateWarning
          expr: |
            job:http_error_ratio:rate30m > (6 * 0.001)
            and
            job:http_error_ratio:rate6h > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "SLO burn rate elevated: error budget draining"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 30m window.
              At this burn rate (6x or more), the 30-day error budget is
              exhausted in roughly 5 days or less.
Apply it:
kubectl apply -f slo-rules.yaml
Step 5: Track Remaining Error Budget
Add a recording rule that computes the remaining budget as a fraction:
# Remaining error budget (as a fraction of the budget) over a 30-day rolling window
- record: job:slo_error_budget_remaining:ratio
  expr: |
    1 - (
      sum(increase(http_requests_total{namespace="production",status_code=~"5.."}[30d]))
      /
      sum(increase(http_requests_total{namespace="production"}[30d]))
    ) / 0.001
A value of 1.0 means full budget. 0 means exhausted. Negative means you've overshot.
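The same formula as plain arithmetic, to make those three cases concrete (a hypothetical helper mirroring the recording rule):

```python
# Remaining budget = 1 - observed_error_ratio / allowed_error_ratio
def budget_remaining(error_ratio, budget_ratio=0.001):
    return 1 - error_ratio / budget_ratio

print(budget_remaining(0.0))     # 1.0  -- full budget
print(budget_remaining(0.0005))  # 0.5  -- half consumed
print(budget_remaining(0.002))   # -1.0 -- overshot by a full budget
```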
Alert when budget is nearly exhausted:
- alert: ErrorBudgetNearlyExhausted
  expr: job:slo_error_budget_remaining:ratio < 0.1
  labels:
    severity: warning
  annotations:
    summary: "Error budget below 10%"
    description: "Only {{ $value | humanizePercentage }} of the 30-day error budget remains"
Step 6: Build a Grafana SLO Dashboard
Panel 1: Current error rate vs. SLO threshold
# Error rate (red line)
job:http_error_ratio:rate5m
# SLO threshold (green dashed line) — add as constant at 0.001
Panel 2: Burn rate over time
job:http_error_ratio:rate1h / 0.001
Add thresholds at 1 (burn rate 1 = sustainable), 6 (warning), 14.4 (critical).
Panel 3: Remaining error budget
job:slo_error_budget_remaining:ratio * 100
Stat panel with color thresholds: green > 50%, yellow > 10%, red ≤ 10%.
Step 7: Validate the Rules
Trigger a brief error spike to test:
# If you have a test endpoint that returns 500s
for i in $(seq 1 100); do
curl -s -o /dev/null http://my-app.production.svc.cluster.local/error
done
Watch the burn rate panels in Grafana. The critical alert has a for: 2m clause, so it won't fire for a 10-second spike — that's intentional.
Check Prometheus alerts at http://localhost:9090 → Alerts. You should see the rules in INACTIVE state normally, and PENDING/FIRING during a real degradation.
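If you have promtool available, you can also unit-test the rules offline instead of generating real traffic. A sketch, assuming the rule group above is saved as a plain Prometheus rule file named slo-rules.yaml (just the groups list, without the PrometheusRule CRD wrapper); run it with promtool test rules slo-rules_test.yaml:

```yaml
# slo-rules_test.yaml
rule_files:
  - slo-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # ~9% error ratio, well above the 1.44% critical threshold
      - series: 'http_requests_total{namespace="production",status_code="500"}'
        values: '0+10x120'
      - series: 'http_requests_total{namespace="production",status_code="200"}'
        values: '0+100x120'
    alert_rule_test:
      - eval_time: 1h
        alertname: AvailabilitySLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              slo: availability
```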
Choosing SLO Targets
Don't start with 99.99%. Start with what you can actually measure historically:
# What was your actual availability over the last 30 days?
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[30d]))
/
sum(increase(http_requests_total{namespace="production"}[30d]))
If you've been at 99.7%, set the SLO at 99.5% for the first quarter. Tighten it as you improve reliability.
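One quick sanity check when picking that first target: compare the candidate budget to your measured error rate. Hypothetical numbers below; substitute the result of the query above:

```python
# How many times your current error rate fits inside the candidate budget.
measured_availability = 0.997  # e.g. last 30 days, from the increase() query
candidate_slo = 0.995

headroom = (1 - candidate_slo) / (1 - measured_availability)
print(f"{headroom:.2f}")  # 1.67 -- budget is ~1.7x the current error rate
```

A headroom comfortably above 1 means the SLO is achievable today; below 1 means you would start the quarter already burning budget.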
We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.
Struggling with this in production?
We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.