SRE Incident Management: Runbooks, On-Call Rotations, and Post-Mortems
An incident at 3 AM with no context is a nightmare. An incident at 3 AM with a runbook that tells you what to check first, what the blast radius is, and what the rollback procedure is — that's a recoverable situation. Incidents are inevitable; the difference between teams that recover in minutes and teams that recover in hours is almost entirely preparation: documented runbooks, on-call rotations that distribute load fairly, and post-mortems that improve the system rather than assign blame.
This is the operational framework for platform teams running production Kubernetes infrastructure.
Incident Severity Classification
Define severity upfront — it determines response time, escalation path, and communication cadence:
| Severity | Definition | Response | Example |
|---|---|---|---|
| SEV-1 | Complete service outage, revenue impact | Immediate (24/7), escalate to leadership | API gateway down, all requests failing |
| SEV-2 | Significant degradation, partial outage | 30 min response time, team coordination | 20% error rate, latency degraded 3x |
| SEV-3 | Minor degradation, single component | Next business day, ticket | One region slow, background job failing |
| SEV-4 | No user impact, internal tooling | Scheduled | Monitoring gap, CI pipeline slow |
Classification rules:
- When in doubt, escalate severity (de-escalate later)
- SEV-1/2 always get an incident channel, incident commander, and status page update
- SEV-3/4 tracked in ticketing system (Jira/Linear), not paged
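These severity levels map directly onto alert labels, which is what downstream routing keys on. A minimal PrometheusRule sketch, assuming the Prometheus Operator CRD; the service name, expressions, and runbook URLs are illustrative, not from a real deployment:

```yaml
# Hypothetical PrometheusRule: the severity label attached here is what
# Alertmanager routing matches on to decide page vs. Slack vs. ticket.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-alerts
  namespace: production
spec:
  groups:
    - name: payments-api
      rules:
        - alert: PaymentsApiDown
          expr: up{job="payments-api"} == 0
          for: 2m
          labels:
            severity: critical   # SEV-1 path: pages immediately
          annotations:
            summary: "Payments API is down"
            runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/api-down.md"
        - alert: PaymentsApiHighLatency
          expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])) > 2
          for: 10m
          labels:
            severity: warning    # SEV-2 path: Slack, team coordination
          annotations:
            summary: "Payments API p99 latency above 2s"
            runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-latency.md"
```

SEV-3/4 conditions would either carry a lower severity label routed to a ticket-creating receiver, or not alert at all and surface in dashboards.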
Alerting → Incident Pipeline
```yaml
# Alertmanager routing — SEV-1 alerts go to PagerDuty and Slack immediately
alertmanager:
  config:
    route:
      group_by: [alertname, namespace, severity]
      group_wait: 30s
      receiver: default
      routes:
        - matchers:
            - severity=critical
          receiver: pagerduty-sev1
          repeat_interval: 30m  # Re-alert every 30 min if not acknowledged
        - matchers:
            - severity=warning
          receiver: slack-sev2
          group_wait: 5m  # Group warnings for 5 min before paging

    receivers:
      - name: pagerduty-sev1
        pagerduty_configs:
          - routing_key: "xxxxx"
            description: "{{ .CommonAnnotations.summary }}"
            severity: "{{ .CommonLabels.severity }}"
            details:
              runbook: "{{ .CommonAnnotations.runbook }}"  # Link to runbook in alert
              namespace: "{{ .CommonLabels.namespace }}"
              cluster: "{{ .CommonLabels.cluster }}"

      - name: slack-sev2
        slack_configs:
          - channel: "#incidents"
            title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
            text: |
              *Summary:* {{ .CommonAnnotations.summary }}
              *Runbook:* {{ .CommonAnnotations.runbook }}
              *Namespace:* {{ .CommonLabels.namespace }}
```
Runbook Structure
A runbook has exactly one job: help the on-call engineer diagnose and resolve an incident under pressure, with limited sleep. Structure for maximum usability:
````markdown
# [Service Name] — [Alert Name]

## Overview
One sentence: what is this service and why does this alert fire?
Firing condition: `<PromQL expression>` — fires when X exceeds Y for Z minutes.

## Impact
- Users affected: [who, estimated count]
- Data loss risk: [yes/no, what kind]
- Downstream dependencies: [services that depend on this one]

## First Response (< 5 minutes)
1. Confirm alert is real: `kubectl get pods -n production -l app=payments-api`
2. Check error rate: [Grafana dashboard URL]
3. Check recent deployments: `kubectl rollout history deployment/payments-api -n production`

## Diagnosis

### Is it a deployment issue?
- Recent rollout in last 2 hours?
  ```bash
  kubectl rollout history deployment/payments-api -n production
  # If yes → consider rollback (see Recovery section)
  ```
- Check pod logs:
  ```bash
  kubectl logs -n production -l app=payments-api --tail=100 | grep ERROR
  ```

### Is it a dependency issue?
- Database connectivity:
  ```bash
  kubectl exec -n production -it deploy/payments-api -- \
    pg_isready -h db.production.svc -U payments
  ```
- External API status: [status page URL]

### Is it a resource issue?
- OOMKill events:
  ```bash
  kubectl describe pod -n production -l app=payments-api | grep -A5 OOMKilled
  ```
- Node pressure:
  ```bash
  kubectl top nodes
  kubectl describe nodes | grep -A5 Conditions
  ```

## Recovery

### Rollback deployment
```bash
kubectl rollout undo deployment/payments-api -n production
kubectl rollout status deployment/payments-api -n production --timeout=5m
```

### Scale up temporarily
```bash
kubectl scale deployment payments-api --replicas=10 -n production
```

### Restart pods (last resort — note this has brief traffic impact)
```bash
kubectl rollout restart deployment/payments-api -n production
```

## Escalation
- If not resolved in 30 min: page [Team Lead name, PagerDuty profile URL]
- If database involved: page [DBA contact]
- If AWS infrastructure: [AWS support plan URL]

## Post-Incident
- Update status page
- Create post-mortem issue in [Linear/Jira project URL]
- Link incident timeline in Slack thread
````
Store runbooks in the same Git repository as the service (co-located, versioned alongside the code), linked from the alert annotation:
```yaml
# PrometheusRule alert — always include runbook annotation
- alert: PaymentsApiHighErrorRate
  expr: ...
  annotations:
    summary: "Payments API error rate > 1%"
    runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-error-rate.md"
```
On-Call Rotation Design
Rotation Principles
- Follow-the-sun: Split primary on-call by timezone to avoid 3 AM pages. US team covers US business hours and US nights; EU team covers EU hours. Overlap window for handover.
- Minimum team size for on-call: 4 people per rotation (pager fatigue with fewer). Below 4, consider escalation-only on-call (alert to Slack, escalate to primary only for SEV-1).
- Weekly rotations balance load; bi-weekly reduces context-switching overhead for complex systems.
- Shadow on-call for new team members: shadow a rotation before taking primary.
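Follow-the-sun routing is usually implemented with PagerDuty schedule layers, but it can also be enforced at the Alertmanager layer. A sketch using time intervals; the receiver names, hours, and timezone are assumptions for illustration, and `active_time_intervals` requires Alertmanager v0.24 or later:

```yaml
# Sketch: route critical alerts to the EU rotation during EU business hours,
# falling through to the US rotation the rest of the time.
time_intervals:
  - name: eu-business-hours
    time_intervals:
      - times:
          - start_time: "08:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]
        location: Europe/Berlin

route:
  receiver: default
  routes:
    - matchers:
        - severity=critical
      receiver: pagerduty-eu-primary
      active_time_intervals: [eu-business-hours]
    - matchers:
        - severity=critical
      receiver: pagerduty-us-primary
```

Routes are evaluated in order, so the EU route claims critical alerts during its active window and the US route catches everything else.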
PagerDuty Configuration
```yaml
# Example on-call schedule (via PagerDuty API or Terraform)
# Escalation policy:
#   Layer 1: Primary on-call (15 min acknowledge window)
#   Layer 2: Secondary on-call (if primary doesn't respond)
#   Layer 3: Team lead (final escalation)
#   Layer 4: Service owner / manager (for extended SEV-1)
```
On-Call Metrics
Track weekly to prevent burnout:
- Pages per week per rotation member (target: < 5 actionable pages/week)
- Time-to-acknowledge (target: < 5 min for SEV-1)
- Time-to-resolve (target: SEV-1 < 1h, SEV-2 < 4h)
- Alert noise ratio: actionable vs spurious pages (target: > 80% actionable)
High noise ratio (many non-actionable pages) means alerts need tuning — alerts that can be safely ignored train on-call engineers to ignore all alerts.
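One structural way to cut noise is Alertmanager inhibition: suppress lower-severity pages when a higher-severity alert for the same scope is already firing, so one root cause pages once. A sketch; the `equal` labels are an assumption and should match how your alerts are actually labeled:

```yaml
# Sketch: while any critical alert fires in a namespace, mute warning-level
# alerts in that same namespace (the critical page already covers them).
inhibit_rules:
  - source_matchers:
      - severity=critical
    target_matchers:
      - severity=warning
    equal: [namespace, cluster]
```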
Post-Mortem Methodology
Goal: Understand what happened, why it happened, and how to prevent recurrence. Not to assign blame.
Template:
```markdown
## Post-Mortem: [Service] [Brief Description]

**Date:** 2026-05-09
**Duration:** 45 minutes (14:22 UTC - 15:07 UTC)
**Severity:** SEV-2
**Impact:** 23% increase in payment error rate, ~500 failed transactions

---

### Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:18 | payments-api deployment v1.4.3 rolled out |
| 14:22 | PagerDuty alert: error rate > 5% |
| 14:24 | Incident channel created, IC assigned |
| 14:31 | Root cause identified: DB connection pool exhausted |
| 14:45 | Mitigation: connection pool size increased via ConfigMap |
| 15:07 | Error rate back to baseline, incident resolved |

---

### Root Cause
v1.4.3 changed the database query pattern from batch to N+1, increasing connection usage by 8x.
The default connection pool (maxConnections: 5) was insufficient for the new query pattern.

### Contributing Factors
1. Load test did not cover the new query pattern under production-level concurrency
2. No alert for database connection pool saturation
3. ConfigMap for DB_POOL_SIZE not documented in the deployment runbook

### Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add Prometheus alert for connection pool usage > 80% | @alice | 2026-05-16 | P1 |
| Add N+1 query detection to CI (sqlcommenter) | @bob | 2026-05-23 | P2 |
| Update runbook with connection pool tuning instructions | @alice | 2026-05-12 | P2 |
| Add DB connection pool metrics to Grafana dashboard | @charlie | 2026-05-16 | P2 |
```
Post-mortem principles:
- Write within 48-72 hours while details are fresh
- Include all timelines, including delays in detection
- Action items must have owners and due dates — without these, they don't get done
- Share with the broader engineering team (blameless transparency reduces future incidents)
Frequently Asked Questions
How do I reduce alert noise?
- Set appropriate thresholds. An alert that fires for a 1% error rate spike during a deployment is noise; 5% sustained for 5 minutes is actionable.
- Use `for:` in PrometheusRules. Most transient issues resolve themselves; `for: 5m` eliminates flapping alerts.
- Group related alerts. Alertmanager's `group_by` and `group_wait` batch related alerts into a single page. Ten alerts for the same root cause should page once.
- Regular alert review. Monthly: look at pages from the last 30 days. Any alert that fired 5+ times without requiring action is a candidate for removal or threshold tuning.
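The monthly review can be driven by a query instead of memory. One option is a recording rule over Prometheus's built-in `ALERTS` series; the rule name below is illustrative, and note that counting samples while firing is a rough proxy for distinct firing episodes, not an exact count:

```yaml
# Sketch: make "which alerts fired most in the last 30 days" a one-line query.
groups:
  - name: alert-hygiene
    rules:
      - record: alertname:firing_samples:count_30d
        expr: count_over_time(ALERTS{alertstate="firing"}[30d])
```

Sorting the result descending by value surfaces the loudest alerts first; cross-reference those against the incident log to find the ones that never required action.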
What tools do platform teams use for incident management?
PagerDuty and OpsGenie for on-call routing and escalation. Slack for incident coordination (dedicated #incidents channel per incident). Statuspage.io or Atlassian Statuspage for customer communication. Linear or Jira for action item tracking from post-mortems. Grafana for real-time dashboards during incidents.
For the alerting infrastructure (PrometheusRules, Alertmanager routing) that triggers incidents, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring. For SLOs and error budgets that define when an incident's impact exceeds acceptable thresholds, see SLOs, Error Budgets, and Burn Rate Alerts. For the Golden Paths that embed runbook links and playbooks into the developer self-service platform, see Platform Engineering: Golden Paths and Developer Self-Service.
Building an incident management practice for a platform engineering team? Talk to us at Coding Protocols — we help platform teams design on-call rotations, runbook libraries, and post-mortem processes that improve reliability over time.


