Platform Engineering
13 min read · May 6, 2026

SRE Incident Management: Runbooks, On-Call Rotations, and Post-Mortems

Incidents are inevitable. The difference between teams that recover in minutes and teams that recover in hours is preparation: runbooks that make diagnosis fast, on-call rotations that distribute load fairly, and post-mortems that prevent recurrence. This is the operational framework that mature platform engineering teams use.

Coding Protocols Team

An incident at 3 AM with no context is a nightmare. An incident at 3 AM with a runbook that tells you what to check first, what the blast radius is, and what the rollback procedure is — that's a recoverable situation. The difference between these two outcomes is almost entirely preparation: documented runbooks, a practiced on-call rotation, and post-mortems that improve the system rather than assign blame.

This is the operational framework for platform teams running production Kubernetes infrastructure.


Incident Severity Classification

Define severity upfront — it determines response time, escalation path, and communication cadence:

| Severity | Definition | Response | Example |
|----------|------------|----------|---------|
| SEV-1 | Complete service outage, revenue impact | Immediate (24/7), escalate to leadership | API gateway down, all requests failing |
| SEV-2 | Significant degradation, partial outage | 30 min response time, team coordination | 20% error rate, latency degraded 3x |
| SEV-3 | Minor degradation, single component | Next business day, ticket | One region slow, background job failing |
| SEV-4 | No user impact, internal tooling | Scheduled | Monitoring gap, CI pipeline slow |

Classification rules:

  • When in doubt, escalate severity (de-escalate later)
  • SEV-1/2 always get an incident channel, incident commander, and status page update
  • SEV-3/4 tracked in ticketing system (Jira/Linear), not paged
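
These SEV levels usually enter the alerting stack as a `severity` label on each rule, which the Alertmanager routing in the next section matches on. A minimal sketch of that mapping (alert names, metrics, and thresholds here are illustrative, not from a real rule set):

```yaml
# Severity labels on PrometheusRule alerts carry the SEV ladder:
# severity=critical → SEV-1/2 page, severity=warning → SEV-2/3 Slack notification
- alert: ApiGatewayDown
  expr: up{job="api-gateway"} == 0
  for: 1m
  labels:
    severity: critical    # Pages 24/7 via the routing below
- alert: BackgroundJobSlow
  expr: job_duration_seconds{quantile="0.99"} > 300
  for: 15m
  labels:
    severity: warning     # Slack notification; ticket if it persists
```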

Alerting → Incident Pipeline

```yaml
# Alertmanager routing — SEV-1 alerts go to PagerDuty and Slack immediately
alertmanager:
  config:
    route:
      group_by: [alertname, namespace, severity]
      group_wait: 30s
      receiver: default
      routes:
        - matchers:
            - severity=critical
          receiver: pagerduty-sev1
          repeat_interval: 30m    # Re-notify every 30 min while still firing
        - matchers:
            - severity=warning
          receiver: slack-sev2
          group_wait: 5m          # Batch warnings for 5 min before notifying

    receivers:
      - name: pagerduty-sev1
        pagerduty_configs:
          - routing_key: "xxxxx"
            description: "{{ .CommonAnnotations.summary }}"
            severity: "{{ .CommonLabels.severity }}"
            details:
              runbook: "{{ .CommonAnnotations.runbook }}"    # Link to runbook in alert
              namespace: "{{ .CommonLabels.namespace }}"
              cluster: "{{ .CommonLabels.cluster }}"

      - name: slack-sev2
        slack_configs:
          - channel: "#incidents"
            title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
            text: |
              *Summary:* {{ .CommonAnnotations.summary }}
              *Runbook:* {{ .CommonAnnotations.runbook }}
              *Namespace:* {{ .CommonLabels.namespace }}
```
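
Routing trees like this are easy to get subtly wrong. `amtool` (Alertmanager's CLI) can show which receiver a given label set would hit; a quick sanity check, assuming a recent amtool release:

```bash
# Verify which receiver a critical production alert would route to
amtool config routes test --config.file=alertmanager.yml \
  severity=critical namespace=production
# Expected output: pagerduty-sev1
```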

Runbook Structure

A runbook has exactly one job: help the on-call engineer diagnose and resolve an incident under pressure, with limited sleep. Structure for maximum usability:

````markdown
# [Service Name] — [Alert Name]

## Overview
One sentence: what is this service and why does this alert fire?
Firing condition: `<PromQL expression>` - fires when X exceeds Y for Z minutes.

## Impact
- Users affected: [who, estimated count]
- Data loss risk: [yes/no, what kind]
- Downstream dependencies: [services that depend on this one]

## First Response (< 5 minutes)
1. Confirm alert is real: `kubectl get pods -n production -l app=payments-api`
2. Check error rate: [Grafana dashboard URL]
3. Check recent deployments: `kubectl rollout history deployment/payments-api -n production`

## Diagnosis

### Is it a deployment issue?
- Recent rollout in last 2 hours?
  ```bash
  kubectl rollout history deployment/payments-api -n production
  # If yes → consider rollback (see Recovery section)
  ```
- Check pod logs:
  ```bash
  kubectl logs -n production -l app=payments-api --tail=100 | grep ERROR
  ```

### Is it a dependency issue?
- Database connectivity:
  ```bash
  kubectl exec -n production -it deploy/payments-api -- \
    pg_isready -h db.production.svc -U payments
  ```
- External API status: [status page URL]

### Is it a resource issue?
- OOMKill events:
  ```bash
  kubectl describe pod -n production -l app=payments-api | grep -A5 OOMKilled
  ```
- Node pressure:
  ```bash
  kubectl top nodes
  kubectl describe nodes | grep -A5 Conditions
  ```

## Recovery

### Rollback deployment
```bash
kubectl rollout undo deployment/payments-api -n production
kubectl rollout status deployment/payments-api -n production --timeout=5m
```

### Scale up temporarily
```bash
kubectl scale deployment payments-api --replicas=10 -n production
```

### Restart pods (last resort — note this has brief traffic impact)
```bash
kubectl rollout restart deployment/payments-api -n production
```

## Escalation
- If not resolved in 30 min: page [Team Lead name, PagerDuty profile URL]
- If database involved: page [DBA contact]
- If AWS infrastructure: [AWS support plan URL]

## Post-Incident
- Update status page
- Create post-mortem issue in [Linear/Jira project URL]
- Link incident timeline in Slack thread
````
Store runbooks in the same Git repository as the service (co-located, versioned alongside the code), linked from the alert annotation:

```yaml
# PrometheusRule alert — always include a runbook annotation
- alert: PaymentsApiHighErrorRate
  expr: ...
  annotations:
    summary: "Payments API error rate > 1%"
    runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-error-rate.md"
```
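
To keep that annotation from silently going missing, a CI step can fail the build when any alert rule lacks a runbook link. A sketch using yq and jq (assumes mikefarah's yq v4; the rules file name is illustrative):

```bash
# Fail CI if any alert rule is missing a runbook annotation
missing=$(yq -o=json prometheus-rules.yaml | \
  jq -r '.groups[].rules[] | select(.alert and (.annotations.runbook | not)) | .alert')
if [ -n "$missing" ]; then
  echo "Alerts missing runbook annotation: $missing"
  exit 1
fi
```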

On-Call Rotation Design

Rotation Principles

  • Follow-the-sun: Split primary on-call by timezone so nobody is paged at 3 AM local time. The US team covers US daytime hours; the EU team covers EU daytime hours (which span US nights), with an overlap window for handover (see the schedule sketch after this list).
  • Minimum team size for on-call: 4 people per rotation (pager fatigue sets in with fewer). Below 4, consider escalation-only on-call (alerts go to Slack; escalate to a primary only for SEV-1).
  • Weekly rotations balance load; bi-weekly rotations reduce context-switching overhead for complex systems.
  • Shadow on-call for new team members: they shadow a full rotation before taking primary.
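
Follow-the-sun is typically implemented with layer restrictions: each regional layer is only eligible for on-call during its local window. A sketch against the PagerDuty REST API v2 (token, user IDs, and schedule name are placeholders; field names follow PagerDuty's published schedule schema, so verify against current docs):

```bash
# EU layer: on-call 08:00-20:00 local time; a mirror-image US layer covers the rest
curl -s -X POST https://api.pagerduty.com/schedules \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "schedule": {
      "name": "platform-oncall-eu",
      "time_zone": "Europe/Berlin",
      "schedule_layers": [{
        "start": "2026-05-11T08:00:00+02:00",
        "rotation_virtual_start": "2026-05-11T08:00:00+02:00",
        "rotation_turn_length_seconds": 604800,
        "users": [
          {"user": {"id": "PABC123", "type": "user_reference"}},
          {"user": {"id": "PDEF456", "type": "user_reference"}}
        ],
        "restrictions": [{
          "type": "daily_restriction",
          "start_time_of_day": "08:00:00",
          "duration_seconds": 43200
        }]
      }]
    }
  }'
```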

PagerDuty Configuration

```yaml
# Example on-call schedule (via PagerDuty API or Terraform)
# Escalation policy:
#   Layer 1: Primary on-call (15 min acknowledge window)
#   Layer 2: Secondary on-call (if primary doesn't respond)
#   Layer 3: Team lead (final escalation)
#   Layer 4: Service owner / manager (for extended SEV-1)
```
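
A minimal sketch of the first three layers as an API object, again assuming PagerDuty's REST API v2 (the schedule and user IDs are placeholders; layer 4 would be one more rule):

```bash
# Escalation policy: primary (15 min ack window) → secondary → team lead
curl -s -X POST https://api.pagerduty.com/escalation_policies \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "escalation_policy": {
      "type": "escalation_policy",
      "name": "platform-sev1",
      "num_loops": 2,
      "escalation_rules": [
        {"escalation_delay_in_minutes": 15,
         "targets": [{"id": "PSCHED1", "type": "schedule_reference"}]},
        {"escalation_delay_in_minutes": 15,
         "targets": [{"id": "PSCHED2", "type": "schedule_reference"}]},
        {"escalation_delay_in_minutes": 30,
         "targets": [{"id": "PLEAD01", "type": "user_reference"}]}
      ]
    }
  }'
```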

On-Call Metrics

Track weekly to prevent burnout:

  • Pages per week per rotation member (target: < 5 actionable pages/week)
  • Time-to-acknowledge (target: < 5 min for SEV-1)
  • Time-to-resolve (target: SEV-1 < 1h, SEV-2 < 4h)
  • Alert noise ratio: actionable vs spurious pages (target: > 80% actionable)

High noise ratio (many non-actionable pages) means alerts need tuning — alerts that can be safely ignored train on-call engineers to ignore all alerts.
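
Two of these metrics can be read straight from the monitoring stack itself: Alertmanager's self-telemetry counts notifications per integration, and Prometheus's built-in `ALERTS` series shows which rules fire most. Illustrative queries (the second approximates noise by time spent firing, not discrete firings):

```promql
# Pages sent to PagerDuty over the last 7 days (Alertmanager self-metrics)
sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))

# Noisiest alert rules over the last 30 days, by time spent in the firing state
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d])))
```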


Post-Mortem Methodology

Goal: Understand what happened, why it happened, and how to prevent recurrence. Not to assign blame.

Template:

```markdown
## Post-Mortem: [Service] [Brief Description]

**Date:** 2026-05-09
**Duration:** 45 minutes (14:22 UTC - 15:07 UTC)
**Severity:** SEV-2
**Impact:** 23% increase in payment error rate, ~500 failed transactions

---

### Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:18 | payments-api deployment v1.4.3 rolled out |
| 14:22 | PagerDuty alert: error rate > 5% |
| 14:24 | Incident channel created, IC assigned |
| 14:31 | Root cause identified: DB connection pool exhausted |
| 14:45 | Mitigation: connection pool size increased via ConfigMap |
| 15:07 | Error rate back to baseline, incident resolved |

---

### Root Cause
v1.4.3 changed the database query pattern from batch to N+1, increasing connection usage by 8x.
The default connection pool (maxConnections: 5) was insufficient for the new query pattern.

### Contributing Factors
1. Load test did not cover the new query pattern under production-level concurrency
2. No alert for database connection pool saturation
3. ConfigMap for DB_POOL_SIZE not documented in the deployment runbook

### Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add Prometheus alert for connection pool usage > 80% | @alice | 2026-05-16 | P1 |
| Add N+1 query detection to CI (sqlcommenter) | @bob | 2026-05-23 | P2 |
| Update runbook with connection pool tuning instructions | @alice | 2026-05-12 | P2 |
| Add DB connection pool metrics to Grafana dashboard | @charlie | 2026-05-16 | P2 |
```

Post-mortem principles:

  • Write within 48-72 hours while details are fresh
  • Include the full timeline, including delays in detection
  • Action items must have owners and due dates — without these, they don't get done
  • Share with the broader engineering team (blameless transparency reduces future incidents)

Frequently Asked Questions

How do I reduce alert noise?

  1. Set appropriate thresholds. An alert that fires on a 1% error-rate spike during a deployment is noise; 5% sustained for 5 minutes is actionable.
  2. Use `for:` in PrometheusRules. Most transient issues resolve themselves; `for: 5m` eliminates flapping alerts (see the sketch after this list).
  3. Group related alerts. Alertmanager's `group_by` and `group_wait` batch related alerts into a single page. Ten alerts for the same root cause should page once.
  4. Regular alert review. Monthly, look at pages from the last 30 days. Any alert that fired 5+ times without requiring action is a candidate for removal or threshold tuning.
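
Putting the first two together, a tuned rule might look like this sketch (the metric name, job label, and threshold are illustrative; the runbook URL follows the pattern shown earlier):

```yaml
# Error-rate alert tuned to avoid flapping: sustained 5% 5xx rate, not a blip
- alert: PaymentsApiHighErrorRate
  expr: |
    sum(rate(http_requests_total{job="payments-api", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="payments-api"}[5m])) > 0.05
  for: 5m                     # Condition must hold for 5 minutes before firing
  labels:
    severity: critical        # Routes to pagerduty-sev1 via the Alertmanager config above
  annotations:
    summary: "Payments API 5xx rate above 5% for 5 minutes"
    runbook: "https://github.com/my-org/payments-api/blob/main/docs/runbooks/high-error-rate.md"
```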

What tools do platform teams use for incident management?

PagerDuty or Opsgenie for on-call routing and escalation. Slack for incident coordination (a dedicated channel per incident). Atlassian Statuspage (statuspage.io) for customer communication. Linear or Jira for tracking post-mortem action items. Grafana for real-time dashboards during incidents.


For the alerting infrastructure (PrometheusRules, Alertmanager routing) that triggers incidents, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring. For SLOs and error budgets that define when an incident's impact exceeds acceptable thresholds, see SLOs, Error Budgets, and Burn Rate Alerts. For the Golden Paths that embed runbook links and playbooks into the developer self-service platform, see Platform Engineering: Golden Paths and Developer Self-Service.

Building an incident management practice for a platform engineering team? Talk to us at Coding Protocols — we help platform teams design on-call rotations, runbook libraries, and post-mortem processes that improve reliability over time.

Related Topics

SRE
Incident Management
Runbooks
On-Call
Post-Mortem
Platform Engineering
Operations
Reliability
