Platform Engineering
13 min read · May 6, 2026

SRE Incident Management: Runbooks, On-Call Rotations, and Post-Mortems

Incidents are inevitable. The difference between teams that recover in minutes and teams that recover in hours is preparation: runbooks that make diagnosis fast, on-call rotations that distribute load fairly, and post-mortems that prevent recurrence. This is the operational framework that mature platform engineering teams use.

Coding Protocols Team

An incident at 3 AM with no context is a nightmare. An incident at 3 AM with a runbook that tells you what to check first, what the blast radius is, and what the rollback procedure is — that's a recoverable situation. The difference between these two outcomes is almost entirely preparation: documented runbooks, a practiced on-call rotation, and post-mortems that improve the system rather than assign blame.

This is the operational framework for platform teams running production Kubernetes infrastructure.


Incident Severity Classification

Define severity upfront — it determines response time, escalation path, and communication cadence:

| Severity | Definition | Response | Example |
|----------|------------|----------|---------|
| SEV-1 | Complete service outage, revenue impact | Immediate (24/7), escalate to leadership | API gateway down, all requests failing |
| SEV-2 | Significant degradation, partial outage | 30 min response time, team coordination | 20% error rate, latency degraded 3x |
| SEV-3 | Minor degradation, single component | Next business day, ticket | One region slow, background job failing |
| SEV-4 | No user impact, internal tooling | Scheduled | Monitoring gap, CI pipeline slow |

Classification rules:

  • When in doubt, escalate severity (de-escalate later)
  • SEV-1/2 always get an incident channel, incident commander, and status page update
  • SEV-3/4 tracked in ticketing system (Jira/Linear), not paged
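
These SEV levels usually enter the alerting stack as a `severity` label on each rule, which the Alertmanager routing in the next section matches on. A minimal sketch of that mapping (alert names, metrics, and thresholds here are illustrative, not from a real rule set):

```yaml
# Severity labels on PrometheusRule alerts carry the SEV ladder:
# severity=critical → SEV-1/2 page, severity=warning → SEV-2/3 Slack notification
- alert: ApiGatewayDown
  expr: up{job="api-gateway"} == 0
  for: 1m
  labels:
    severity: critical    # Pages 24/7 via the routing below
- alert: BackgroundJobSlow
  expr: job_duration_seconds{quantile="0.99"} > 300
  for: 15m
  labels:
    severity: warning     # Slack notification; ticket if it persists
```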

Alerting → Incident Pipeline

```yaml
# Alertmanager routing — SEV-1 alerts go to PagerDuty and Slack immediately
alertmanager:
  config:
    route:
      group_by: [alertname, namespace, severity]
      group_wait: 30s
      receiver: default
      routes:
        - matchers:
            - severity=critical
          receiver: pagerduty-sev1
          repeat_interval: 30m    # Re-notify every 30 min while still firing
        - matchers:
            - severity=warning
          receiver: slack-sev2
          group_wait: 5m          # Batch warnings for 5 min before notifying

    receivers:
      - name: pagerduty-sev1
        pagerduty_configs:
          - routing_key: "xxxxx"
            description: "{{ .CommonAnnotations.summary }}"
            severity: "{{ .CommonLabels.severity }}"
            details:
              runbook: "{{ .CommonAnnotations.runbook }}"    # Link to runbook in alert
              namespace: "{{ .CommonLabels.namespace }}"
              cluster: "{{ .CommonLabels.cluster }}"

      - name: slack-sev2
        slack_configs:
          - channel: "#incidents"
            title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
            text: |
              *Summary:* {{ .CommonAnnotations.summary }}
              *Runbook:* {{ .CommonAnnotations.runbook }}
              *Namespace:* {{ .CommonLabels.namespace }}
```
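
Routing trees like this are easy to get subtly wrong. `amtool` (Alertmanager's CLI) can show which receiver a given label set would hit; a quick sanity check, assuming a recent amtool release:

```bash
# Verify which receiver a critical production alert would route to
amtool config routes test --config.file=alertmanager.yml \
  severity=critical namespace=production
# Expected output: pagerduty-sev1
```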

Runbook Structure

A runbook has exactly one job: help the on-call engineer diagnose and resolve an incident under pressure, with limited sleep. Structure for maximum usability:

````markdown
# [Service Name] — [Alert Name]

## Overview
One sentence: what is this service and why does this alert fire?
Firing condition: `<PromQL expression>` - fires when X exceeds Y for Z minutes.

## Impact
- Users affected: [who, estimated count]
- Data loss risk: [yes/no, what kind]
- Downstream dependencies: [services that depend on this one]

## First Response (< 5 minutes)
1. Confirm alert is real: `kubectl get pods -n production -l app=payments-api`
2. Check error rate: [Grafana dashboard URL]
3. Check recent deployments: `kubectl rollout history deployment/payments-api -n production`

## Diagnosis

### Is it a deployment issue?
- Recent rollout in last 2 hours?
  ```bash
  kubectl rollout history deployment/payments-api -n production
  # If yes → consider rollback (see Recovery section)
  ```
- Check pod logs:
  ```bash
  kubectl logs -n production -l app=payments-api --tail=100 | grep ERROR
  ```

### Is it a dependency issue?
- Database connectivity:
  ```bash
  kubectl exec -n production -it deploy/payments-api -- \
    pg_isready -h db.production.svc -U payments
  ```
- External API status: [status page URL]

### Is it a resource issue?
- OOMKill events:
  ```bash
  kubectl describe pod -n production -l app=payments-api | grep -A5 OOMKilled
  ```
- Node pressure:
  ```bash
  kubectl top nodes
  kubectl describe nodes | grep -A5 Conditions
  ```

## Recovery

### Rollback deployment
```bash
kubectl rollout undo deployment/payments-api -n production
kubectl rollout status deployment/payments-api -n production --timeout=5m
```

### Scale up temporarily
```bash
kubectl scale deployment payments-api --replicas=10 -n production
```

### Restart pods (last resort — note this has brief traffic impact)
```bash
kubectl rollout restart deployment/payments-api -n production
```

## Escalation
- If not resolved in 30 min: page [Team Lead name, PagerDuty profile URL]
- If database involved: page [DBA contact]
- If AWS infrastructure: [AWS support plan URL]

## Post-Incident
- Update status page
- Create post-mortem issue in [Linear/Jira project URL]
- Link incident timeline in Slack thread
````
Store runbooks in the same Git repository as the service (co-located, versioned alongside the code), linked from the alert annotation:

```yaml
# PrometheusRule alert — always include a runbook annotation
- alert: PaymentsApiHighErrorRate
  expr: ...
  annotations:
    summary: "Payments API error rate > 1%"
    runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-error-rate.md"
```
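
To keep that annotation from silently going missing, a CI step can fail the build when any alert rule lacks a runbook link. A sketch using yq and jq (assumes mikefarah's yq v4; the rules file name is illustrative):

```bash
# Fail CI if any alert rule is missing a runbook annotation
missing=$(yq -o=json prometheus-rules.yaml | \
  jq -r '.groups[].rules[] | select(.alert and (.annotations.runbook | not)) | .alert')
if [ -n "$missing" ]; then
  echo "Alerts missing runbook annotation: $missing"
  exit 1
fi
```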

On-Call Rotation Design

Rotation Principles

  • Follow-the-sun: Split primary on-call by timezone so nobody is paged at 3 AM local time. The US team covers US daytime hours; the EU team covers EU daytime hours (which span US nights), with an overlap window for handover (see the schedule sketch after this list).
  • Minimum team size for on-call: 4 people per rotation (pager fatigue sets in with fewer). Below 4, consider escalation-only on-call (alerts go to Slack; escalate to a primary only for SEV-1).
  • Weekly rotations balance load; bi-weekly rotations reduce context-switching overhead for complex systems.
  • Shadow on-call for new team members: they shadow a full rotation before taking primary.
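
Follow-the-sun is typically implemented with layer restrictions: each regional layer is only eligible for on-call during its local window. A sketch against the PagerDuty REST API v2 (token, user IDs, and schedule name are placeholders; field names follow PagerDuty's published schedule schema, so verify against current docs):

```bash
# EU layer: on-call 08:00-20:00 local time; a mirror-image US layer covers the rest
curl -s -X POST https://api.pagerduty.com/schedules \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "schedule": {
      "name": "platform-oncall-eu",
      "time_zone": "Europe/Berlin",
      "schedule_layers": [{
        "start": "2026-05-11T08:00:00+02:00",
        "rotation_virtual_start": "2026-05-11T08:00:00+02:00",
        "rotation_turn_length_seconds": 604800,
        "users": [
          {"user": {"id": "PABC123", "type": "user_reference"}},
          {"user": {"id": "PDEF456", "type": "user_reference"}}
        ],
        "restrictions": [{
          "type": "daily_restriction",
          "start_time_of_day": "08:00:00",
          "duration_seconds": 43200
        }]
      }]
    }
  }'
```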

PagerDuty Configuration

```yaml
# Example on-call schedule (via PagerDuty API or Terraform)
# Escalation policy:
#   Layer 1: Primary on-call (15 min acknowledge window)
#   Layer 2: Secondary on-call (if primary doesn't respond)
#   Layer 3: Team lead (final escalation)
#   Layer 4: Service owner / manager (for extended SEV-1)
```
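
A minimal sketch of the first three layers as an API object, again assuming PagerDuty's REST API v2 (the schedule and user IDs are placeholders; layer 4 would be one more rule):

```bash
# Escalation policy: primary (15 min ack window) → secondary → team lead
curl -s -X POST https://api.pagerduty.com/escalation_policies \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "escalation_policy": {
      "type": "escalation_policy",
      "name": "platform-sev1",
      "num_loops": 2,
      "escalation_rules": [
        {"escalation_delay_in_minutes": 15,
         "targets": [{"id": "PSCHED1", "type": "schedule_reference"}]},
        {"escalation_delay_in_minutes": 15,
         "targets": [{"id": "PSCHED2", "type": "schedule_reference"}]},
        {"escalation_delay_in_minutes": 30,
         "targets": [{"id": "PLEAD01", "type": "user_reference"}]}
      ]
    }
  }'
```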

On-Call Metrics

Track weekly to prevent burnout:

  • Pages per week per rotation member (target: < 5 actionable pages/week)
  • Time-to-acknowledge (target: < 5 min for SEV-1)
  • Time-to-resolve (target: SEV-1 < 1h, SEV-2 < 4h)
  • Alert noise ratio: actionable vs spurious pages (target: > 80% actionable)

High noise ratio (many non-actionable pages) means alerts need tuning — alerts that can be safely ignored train on-call engineers to ignore all alerts.
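
Two of these metrics can be read straight from the monitoring stack itself: Alertmanager's self-telemetry counts notifications per integration, and Prometheus's built-in `ALERTS` series shows which rules fire most. Illustrative queries (the second approximates noise by time spent firing, not discrete firings):

```promql
# Pages sent to PagerDuty over the last 7 days (Alertmanager self-metrics)
sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))

# Noisiest alert rules over the last 30 days, by time spent in the firing state
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d])))
```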


Post-Mortem Methodology

Goal: Understand what happened, why it happened, and how to prevent recurrence. Not to assign blame.

Template:

```markdown
## Post-Mortem: [Service] [Brief Description]

**Date:** 2026-05-09
**Duration:** 45 minutes (14:22 UTC - 15:07 UTC)
**Severity:** SEV-2
**Impact:** 23% increase in payment error rate, ~500 failed transactions

---

### Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:18 | payments-api deployment v1.4.3 rolled out |
| 14:22 | PagerDuty alert: error rate > 5% |
| 14:24 | Incident channel created, IC assigned |
| 14:31 | Root cause identified: DB connection pool exhausted |
| 14:45 | Mitigation: connection pool size increased via ConfigMap |
| 15:07 | Error rate back to baseline, incident resolved |

---

### Root Cause
v1.4.3 changed the database query pattern from batch to N+1, increasing connection usage by 8x.
The default connection pool (maxConnections: 5) was insufficient for the new query pattern.

### Contributing Factors
1. Load test did not cover the new query pattern under production-level concurrency
2. No alert for database connection pool saturation
3. ConfigMap for DB_POOL_SIZE not documented in the deployment runbook

### Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add Prometheus alert for connection pool usage > 80% | @alice | 2026-05-16 | P1 |
| Add N+1 query detection to CI (sqlcommenter) | @bob | 2026-05-23 | P2 |
| Update runbook with connection pool tuning instructions | @alice | 2026-05-12 | P2 |
| Add DB connection pool metrics to Grafana dashboard | @charlie | 2026-05-16 | P2 |
```

Post-mortem principles:

  • Write within 48-72 hours while details are fresh
  • Include the full timeline, including delays in detection
  • Action items must have owners and due dates — without these, they don't get done
  • Share with the broader engineering team (blameless transparency reduces future incidents)

Frequently Asked Questions

How do I reduce alert noise?

  1. Set appropriate thresholds. An alert that fires on a 1% error-rate spike during a deployment is noise; 5% sustained for 5 minutes is actionable.
  2. Use `for:` in PrometheusRules. Most transient issues resolve themselves; `for: 5m` eliminates flapping alerts (see the sketch after this list).
  3. Group related alerts. Alertmanager's `group_by` and `group_wait` batch related alerts into a single page. Ten alerts for the same root cause should page once.
  4. Regular alert review. Monthly, look at pages from the last 30 days. Any alert that fired 5+ times without requiring action is a candidate for removal or threshold tuning.
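
Putting the first two together, a tuned rule might look like this sketch (the metric name, job label, and threshold are illustrative; the runbook URL follows the pattern shown earlier):

```yaml
# Error-rate alert tuned to avoid flapping: sustained 5% 5xx rate, not a blip
- alert: PaymentsApiHighErrorRate
  expr: |
    sum(rate(http_requests_total{job="payments-api", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="payments-api"}[5m])) > 0.05
  for: 5m                     # Condition must hold for 5 minutes before firing
  labels:
    severity: critical        # Routes to pagerduty-sev1 via the Alertmanager config above
  annotations:
    summary: "Payments API 5xx rate above 5% for 5 minutes"
    runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-error-rate.md"
```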

What tools do platform teams use for incident management?

PagerDuty or Opsgenie for on-call routing and escalation. Slack for incident coordination (a dedicated channel per incident). Atlassian Statuspage (statuspage.io) for customer communication. Linear or Jira for tracking post-mortem action items. Grafana for real-time dashboards during incidents.


For the alerting infrastructure (PrometheusRules, Alertmanager routing) that triggers incidents, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring. For SLOs and error budgets that define when an incident's impact exceeds acceptable thresholds, see SLOs, Error Budgets, and Burn Rate Alerts. For the Golden Paths that embed runbook links and playbooks into the developer self-service platform, see Platform Engineering: Golden Paths and Developer Self-Service.

Building an incident management practice for a platform engineering team? Talk to us at Coding Protocols — we help platform teams design on-call rotations, runbook libraries, and post-mortem processes that improve reliability over time.

Related Topics

SRE
Incident Management
Runbooks
On-Call
Post-Mortem
Platform Engineering
Operations
Reliability
