Defining SLOs and Writing Burn Rate Alerts in Prometheus
Turn a reliability goal into an alerting rule. This tutorial shows you how to express an SLO as a Prometheus query, calculate burn rates, and write multiwindow alerts that page you before users notice.
Before you begin
- Prometheus running with application metrics
- Basic PromQL knowledge
- A way to load rules: the PrometheusRule CRD (kube-prometheus-stack) or a rule file on a plain Prometheus install
Most teams write threshold alerts: "page me if error rate > 5%." The problem: a 1% error rate sustained for three days consumes your entire monthly error budget, but never fires the alert. Burn rate alerts fix this.
The Concepts
SLO (Service Level Objective): A target for reliability. Example: 99.9% of requests succeed over 30 days.
Error budget: The allowable failure. A 99.9% SLO leaves 0.1% for errors: 43.2 minutes of downtime, or 100 bad requests per 100,000, over 30 days.
Burn rate: How fast you're consuming the error budget. Burn rate 1 means you'll exhaust the budget exactly at the end of the window. Burn rate 10 means you'll exhaust it in 1/10th the time (3 days for a 30-day budget).
Why multiwindow: A spike that lasts 2 minutes at burn rate 100 consumes roughly 0.5% of a 30-day budget (100 × 2 min / 43,200 min), real but not page-worthy. A sustained burn rate of 5 for 6 hours is serious. Comparing a short and a long window separates spikes from sustained degradation.
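The relationship between burn rate and time-to-exhaustion can be sketched numerically. A minimal helper (hypothetical, not part of any Prometheus tooling):

```python
# Time to exhaust the error budget at a constant burn rate.
# Burn rate 1 means the budget lasts exactly the full SLO window.
def days_to_exhaustion(burn_rate, window_days=30.0):
    return window_days / burn_rate

print(days_to_exhaustion(1))   # 30.0 -- sustainable for the whole window
print(days_to_exhaustion(10))  # 3.0  -- budget gone in a tenth of the window
```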
Step 1: Define Your SLI (Service Level Indicator)
Start with what you're measuring. For an HTTP service, availability SLI:
# SLI: ratio of successful requests
sum(rate(http_requests_total{status_code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
For an HTTP service, a latency SLI (the ratio of requests served under 500ms; holding this ratio at 99% is equivalent to "P99 < 500ms"):
# Ratio of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Pick one to start. Most teams begin with availability.
Step 2: Express the SLO
For a 99.9% availability SLO over 30 days:
- Error rate threshold: 1 - 0.999 = 0.001 (0.1%)
- Error budget in seconds: 30 * 24 * 3600 * 0.001 = 2592 seconds = 43.2 minutes
- Burn rate 1 error rate: 0.001 (the SLO target)
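The bullet-point arithmetic above is simple enough to verify directly:

```python
# Arithmetic behind the 99.9% / 30-day numbers above.
slo = 0.999
window_days = 30

error_budget_ratio = 1 - slo                          # allowed failure fraction
budget_minutes = window_days * 24 * 60 * error_budget_ratio

print(f"{error_budget_ratio:.4f}")  # 0.0010
print(f"{budget_minutes:.1f}")      # 43.2
```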
Step 3: Calculate Burn Rates for Alert Windows
The Google SRE Workbook recommends two pairs of windows:
| Alert tier | Short window | Long window | Burn rate | Budget consumed |
|---|---|---|---|---|
| Critical (page) | 5m | 1h | 14.4x | 2% in 1h |
| Warning (ticket) | 30m | 6h | 6x | 5% in 6h |
At burn rate 14.4 with a 99.9% SLO:
- Error rate = 0.001 × 14.4 = 1.44%
- You'd exhaust the 30-day budget in 30/14.4 ≈ 2 days
The 5m/1h pair fires when you have a current spike (5m) that's sustained (1h), preventing false positives from brief spikes.
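The burn rates in the table fall out of the budget-consumed column. A sketch of the derivation, assuming a 30-day SLO period:

```python
# Burn rate implied by "spend fraction F of the budget within window W":
# burn_rate = F * slo_period / W  (period and window in the same unit)
def burn_rate(budget_fraction, window_hours, period_days=30.0):
    return budget_fraction * period_days * 24 / window_hours

print(round(burn_rate(0.02, 1), 1))  # 14.4 -- critical tier
print(round(burn_rate(0.05, 6), 1))  # 6.0  -- warning tier
```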
Step 4: Write the Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: slo-my-app-availability
      interval: 30s
      rules:
        # --- Recording rules for reuse ---
        # Error ratio over 5 minutes
        - record: job:http_error_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))
        # Error ratio over 30 minutes
        - record: job:http_error_ratio:rate30m
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{namespace="production"}[30m]))
        # Error ratio over 1 hour
        - record: job:http_error_ratio:rate1h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{namespace="production"}[1h]))
        # Error ratio over 6 hours
        - record: job:http_error_ratio:rate6h
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{namespace="production"}[6h]))
        # --- Alert rules ---
        # Critical: fast burn, 2% of the monthly budget in 1 hour.
        # Fires only when the 5m AND 1h burn rates are both above threshold.
        - alert: AvailabilitySLOBurnRateCritical
          expr: |
            job:http_error_ratio:rate5m > (14.4 * 0.001)
            and
            job:http_error_ratio:rate1h > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            slo: availability
          annotations:
            summary: "SLO burn rate critical: fast error budget exhaustion"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 5m window.
              At this burn rate (14.4x or more), the 30-day error budget is
              exhausted in roughly 2 days or less.
              Runbook: https://wiki.internal/runbooks/availability-slo
        # Warning: slow burn, 5% of the monthly budget in 6 hours
        - alert: AvailabilitySLOBurnRateWarning
          expr: |
            job:http_error_ratio:rate30m > (6 * 0.001)
            and
            job:http_error_ratio:rate6h > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
            slo: availability
          annotations:
            summary: "SLO burn rate elevated: error budget draining"
            description: |
              Error rate {{ $value | humanizePercentage }} over the 30m window.
              At this burn rate (6x or more), the 30-day error budget is
              exhausted in roughly 5 days or less.
Apply it:
kubectl apply -f slo-rules.yaml
Step 5: Track Remaining Error Budget
Add a recording rule that computes the remaining budget as a fraction:
# Remaining error budget (as a fraction of the budget) over a 30-day rolling window
- record: job:slo_error_budget_remaining:ratio
  expr: |
    1 - (
      sum(increase(http_requests_total{namespace="production",status_code=~"5.."}[30d]))
      /
      sum(increase(http_requests_total{namespace="production"}[30d]))
    ) / 0.001
A value of 1.0 means full budget. 0 means exhausted. Negative means you've overshot.
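The same formula as plain arithmetic, to make those three cases concrete (a hypothetical helper mirroring the recording rule):

```python
# Remaining budget = 1 - observed_error_ratio / allowed_error_ratio
def budget_remaining(error_ratio, budget_ratio=0.001):
    return 1 - error_ratio / budget_ratio

print(budget_remaining(0.0))     # 1.0  -- full budget
print(budget_remaining(0.0005))  # 0.5  -- half consumed
print(budget_remaining(0.002))   # -1.0 -- overshot by a full budget
```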
Alert when budget is nearly exhausted:
- alert: ErrorBudgetNearlyExhausted
  expr: job:slo_error_budget_remaining:ratio < 0.1
  labels:
    severity: warning
  annotations:
    summary: "Error budget below 10%"
    description: "Only {{ $value | humanizePercentage }} of the 30-day error budget remains"
Step 6: Build a Grafana SLO Dashboard
Panel 1: Current error rate vs. SLO threshold
# Error rate (red line)
job:http_error_ratio:rate5m
# SLO threshold (green dashed line) — add as constant at 0.001
Panel 2: Burn rate over time
job:http_error_ratio:rate1h / 0.001
Add thresholds at 1 (burn rate 1 = sustainable), 6 (warning), 14.4 (critical).
Panel 3: Remaining error budget
job:slo_error_budget_remaining:ratio * 100
Stat panel with color thresholds: green > 50%, yellow > 10%, red ≤ 10%.
Step 7: Validate the Rules
Trigger a brief error spike to test:
# If you have a test endpoint that returns 500s
for i in $(seq 1 100); do
curl -s -o /dev/null http://my-app.production.svc.cluster.local/error
done
Watch the burn rate panels in Grafana. The critical alert has a for: 2m clause, so it won't fire for a 10-second spike — that's intentional.
Check Prometheus alerts at http://localhost:9090 → Alerts. You should see the rules in INACTIVE state normally, and PENDING/FIRING during a real degradation.
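If you have promtool available, you can also unit-test the rules offline instead of generating real traffic. A sketch, assuming the rule group above is saved as a plain Prometheus rule file named slo-rules.yaml (just the groups list, without the PrometheusRule CRD wrapper); run it with promtool test rules slo-rules_test.yaml:

```yaml
# slo-rules_test.yaml
rule_files:
  - slo-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # ~9% error ratio, well above the 1.44% critical threshold
      - series: 'http_requests_total{namespace="production",status_code="500"}'
        values: '0+10x120'
      - series: 'http_requests_total{namespace="production",status_code="200"}'
        values: '0+100x120'
    alert_rule_test:
      - eval_time: 1h
        alertname: AvailabilitySLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              slo: availability
```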
Choosing SLO Targets
Don't start with 99.99%. Start with what you can actually measure historically:
# What was your actual availability over the last 30 days?
sum(increase(http_requests_total{namespace="production",status_code!~"5.."}[30d]))
/
sum(increase(http_requests_total{namespace="production"}[30d]))
If you've been at 99.7%, set the SLO at 99.5% for the first quarter. Tighten it as you improve reliability.
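One quick sanity check when picking that first target: compare the candidate budget to your measured error rate. Hypothetical numbers below; substitute the result of the query above:

```python
# How many times your current error rate fits inside the candidate budget.
measured_availability = 0.997  # e.g. last 30 days, from the increase() query
candidate_slo = 0.995

headroom = (1 - candidate_slo) / (1 - measured_availability)
print(f"{headroom:.2f}")  # 1.67 -- budget is ~1.7x the current error rate
```

A headroom comfortably above 1 means the SLO is achievable today; below 1 means you would start the quarter already burning budget.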
We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.
Struggling with this in production?
We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.