Roadmap · 2026

SRE Roadmap

From SRE principles to platform maturity — an opinionated path for engineers who own production reliability. Covers the mindset, the metrics, the tooling, and the team model.

8 stages · 44 skill nodes · ~5 months to cover all stages

1. SRE Principles & Culture
2. Operational Foundations
3. Observability
4. SLOs & Error Budgets
5. Alerting & On-Call
6. Incident Management
7. Reliability Engineering
8. Platform SRE Maturity

SRE Principles & Culture

1–2 weeks

SRE is a mindset before it's a toolset. Understand the contract between reliability and velocity before touching any tooling.

SRE vs DevOps vs Ops

What makes SRE distinct: engineering solutions to operations problems, not just running them.

Error budgets

The core insight: reliability has a cost, and whether to spend the error budget on shipping features or on reliability work is a business decision.

Toil defined

Manual, repetitive, automatable work that scales with traffic. Measuring toil is the first step to eliminating it.

Blameless postmortems

Systems fail; cultures that punish individuals hide the real causes. How to run postmortems that actually fix things.

Production readiness

Launch checklists, capacity reviews, and handoff criteria that prevent on-call surprises.

Operational Foundations

2–3 weeks

SREs debug production under pressure. Linux internals, networking, and cluster tooling are the instruments you reach for first.

Linux internals for SREs

Process states, memory pressure signals (OOM scores), file descriptors, and I/O wait — what each tells you during an incident.

Networking under the hood

TCP connection states, TIME_WAIT storms, DNS TTLs, and reading ss/netstat output during a degradation.

Linux Networking Cheat Sheet
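A quick way to spot a TIME_WAIT storm is to count connections per TCP state. The sketch below parses text in the shape that `ss -tan` prints; the sample output, addresses, and the helper name `count_tcp_states` are illustrative assumptions, not captured from a real host.

```python
from collections import Counter

# Illustrative sample shaped like `ss -tan` output; every address is made up.
SAMPLE_SS_OUTPUT = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:443         0.0.0.0:*
ESTAB      0      0      10.0.0.5:443        10.0.0.9:51234
TIME-WAIT  0      0      10.0.0.5:443        10.0.0.9:51240
TIME-WAIT  0      0      10.0.0.5:443        10.0.0.7:40112
CLOSE-WAIT 1      0      10.0.0.5:443        10.0.0.8:33990
"""


def count_tcp_states(ss_output: str) -> Counter:
    """Count connections per TCP state from `ss -tan` style text."""
    states = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            states[fields[0]] += 1  # first column is the socket state
    return states


for state, count in count_tcp_states(SAMPLE_SS_OUTPUT).most_common():
    print(f"{state:<12}{count}")
```

On a live host the same function could be fed the captured output of `ss -tan`; a TIME-WAIT count that dwarfs ESTAB is usually the cue to look at connection reuse.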

kubectl for debugging

Events, describe, exec, port-forward, top — the daily debugging toolkit for cluster-hosted services.

Kubectl Cheat Sheet

k9s for cluster ops

Navigate pods, stream logs, exec into containers, and check resource pressure without memorising every kubectl flag.

k9s Cheat Sheet

CLI data wrangling: jq & curl

Parse JSON API responses, query cluster metadata, and test endpoints from the terminal during incidents.

jq Cheat Sheet

Observability

3–4 weeks

Observability is the practice of making a system's internal state legible from its outputs. You can't respond to what you can't see.

The three pillars

Metrics answer 'what', logs answer 'why', traces answer 'where'. Each has a different cost and granularity tradeoff.

Prometheus architecture

Scrape intervals, TSDB retention, relabelling, and remote write for long-term storage.

PromQL in depth

rate(), irate(), histogram_quantile(), absent(), and recording rules for query performance.

PromQL Cheat Sheet
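It helps to know what histogram_quantile() does with bucket counts before trusting a p95 panel. The following is a simplified sketch of the linear interpolation it performs, assuming cumulative buckets with non-negative upper bounds and a final +Inf bucket; the function and the sample bucket values are illustrative, and edge cases in the real implementation are not reproduced.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes non-negative bounds and a final +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: fall back to the
                # highest finite upper bound, if there is one.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

The even-spread assumption is why bucket boundaries matter: a quantile can only be as precise as the bucket it falls into.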

Structured logging & log parsing

JSON logs, field cardinality, and building filters that narrow to root cause without drowning in noise.

Log Parser Sandbox
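Structured logs pay off when a filter can narrow by field instead of by regex. A minimal sketch, assuming one JSON object per line; the field names (`level`, `service`, `msg`) and sample records are hypothetical.

```python
import json


def filter_logs(lines, **criteria):
    """Return parsed JSON log records whose fields equal every criterion."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as stack trace fragments
        if not isinstance(record, dict):
            continue
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches


# Hypothetical log lines; the schema is an assumption for illustration.
sample_lines = [
    '{"level": "info", "service": "checkout", "msg": "order placed"}',
    '{"level": "error", "service": "checkout", "msg": "payment timeout"}',
    '{"level": "error", "service": "search", "msg": "index unavailable"}',
    "plain text line that is not JSON",
]

for record in filter_logs(sample_lines, level="error", service="checkout"):
    print(record["msg"])
```

Filtering on low-cardinality fields first (level, service) and only then on high-cardinality ones (request ID, user ID) tends to keep queries cheap.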

Distributed tracing

OpenTelemetry instrumentation, trace context propagation, and how traces reveal the latency breakdown across services.

Blog: OpenTelemetry Migration
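Trace context propagation is, at the wire level, one HTTP header. The sketch below parses and extends a W3C `traceparent` header (version 00: version, trace ID, parent span ID, flags) by hand to show what an OpenTelemetry SDK propagator does for you; the helper names are assumptions, and real services should rely on the SDK rather than code like this.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    context = parse_traceparent(header)
    if context is None:
        return None
    span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

The trace ID staying constant across hops is what lets a backend stitch spans from different services into one latency breakdown.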

USE and RED methods

USE (Utilization, Saturation, Errors) for resources; RED (Rate, Errors, Duration) for services — systematic dashboard design.

SLOs & Error Budgets

2–3 weeks

SLOs are the contract between reliability and product velocity. Getting them right changes every conversation about risk and priority.

Defining meaningful SLIs

Request success rate, latency percentiles, and availability — picking indicators that reflect the user experience, not server health.
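An SLI is usually a ratio of good events to valid events. A minimal sketch of two request-based indicators follows, assuming that 5xx responses count as failures and that the latency threshold is chosen by the team; the `RequestSample` type and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # latency as the user experienced it


def availability_sli(samples):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not samples:
        raise ValueError("no requests in window")
    good = sum(1 for s in samples if s.status < 500)
    return good / len(samples)


def latency_sli(samples, threshold_ms):
    """Fraction of requests served at or under the latency threshold."""
    if not samples:
        raise ValueError("no requests in window")
    fast = sum(1 for s in samples if s.latency_ms <= threshold_ms)
    return fast / len(samples)


window = [
    RequestSample(200, 120.0),
    RequestSample(200, 340.0),
    RequestSample(503, 30.0),
    RequestSample(404, 80.0),  # a client error still counts as served
]
print(availability_sli(window), latency_sli(window, threshold_ms=300))
```

Whether 4xx responses count as good, bad, or invalid events is a per-service decision; the choice above is only one option.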

Setting SLO targets

Why 99.9% and 99.99% are different businesses. How to negotiate targets with product teams using historical data.

SLO/SLI Calculator
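The gap between 99.9% and 99.99% is easiest to see as minutes of allowed downtime. A small sketch, assuming a 30-day window and treating the budget as full-outage minutes (partial degradation would consume it more slowly); the function name is illustrative.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} allows {minutes:.1f} min per 30 days")
```

Over 30 days, 99.9% leaves about 43 minutes and 99.99% about 4 minutes, which is roughly the difference between a human responding in time and needing automated failover.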

Error budget burn rates

Fast burn vs slow burn, 1-hour and 6-hour windows, and why a 14.4× burn rate (2% of a 30-day budget gone in one hour) needs a page at 2 AM.

Blog: SLOs, Error Budgets & Burn Rates
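Burn rate is the observed error ratio divided by the error ratio the SLO allows; a rate of 1 spends the budget exactly over the SLO window. A minimal sketch, assuming a 30-day window; the function names and example numbers are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    allowed_error_ratio = 1.0 - slo_target
    if allowed_error_ratio <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed_error_ratio


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_days * 24 / rate


# Example: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} h")
```

In this example the rate works out to about 14.4, so a month of budget would last roughly 50 hours.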

Toil budget tracking

Capping toil at 50%, tracking it per sprint, and using the remaining capacity for reliability projects.
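Tracking toil per sprint can be as simple as logged hours against capacity. A rough sketch under that assumption; the sprint names, hour figures, and helper names are hypothetical, and the 50% cap is the figure from the text above.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of engineering time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def sprints_over_cap(sprints: dict, cap: float = 0.5) -> list:
    """Names of sprints whose toil share exceeded the cap."""
    return [
        name
        for name, (toil_hours, total_hours) in sprints.items()
        if toil_fraction(toil_hours, total_hours) > cap
    ]


# Hypothetical team data: sprint -> (toil hours, total engineering hours).
history = {"sprint-1": (30, 80), "sprint-2": (46, 80), "sprint-3": (40, 80)}
print(sprints_over_cap(history))
```

A sprint sitting exactly at the cap is not flagged here; whether the boundary counts as a breach is a team convention.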

The SLO review cycle

Quarterly reviews, when to tighten vs loosen targets, and how to present error budget status to stakeholders.

Alerting & On-Call

2–3 weeks

A page that wakes someone at 3 AM is a product decision. Every alert must be actionable, urgent, and tied to user impact.

Alert rule design principles

Actionable, urgent, and non-flappy. Every alert should have a runbook link and a clear remediation path.

Prometheus Alert Rule Builder

Multi-window burn rate alerts

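The Google SRE Workbook approach: 2% of the budget consumed in 1 hour, or 5% in 6 hours, triggers a page; 10% consumed in 3 days triggers a ticket. Each condition is also checked against a shorter window so the alert stops firing once the burn stops.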
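The "percent of budget in N hours" framing converts directly into a burn rate threshold: the budget fraction times the SLO window, divided by the alert window. A minimal sketch, assuming a 30-day SLO window and a paired short window that must also exceed the threshold; the function names and example readings are illustrative.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def should_alert(long_window_rate: float, short_window_rate: float,
                 threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold.

    The short window makes the alert reset quickly once the burn stops.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


one_hour = burn_rate_threshold(0.02, alert_window_hours=1)   # about 14.4
six_hour = burn_rate_threshold(0.05, alert_window_hours=6)   # about 6
print(one_hour, six_hour)

# Example readings: the 1-hour average is still high, but the last
# 5 minutes have recovered, so this condition does not fire.
print(should_alert(long_window_rate=16.0, short_window_rate=0.8,
                   threshold=one_hour))
```

In PromQL the two rates would typically come from recording rules over the long and short ranges, combined with `and`.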

Runbook design

Structured, machine-readable runbooks that a half-asleep on-call responder can follow without context.

Escalation & rotation design

Primary/secondary rotations, escalation timeouts, and business-hours vs 24×7 coverage strategies.

Alert fatigue elimination

Auditing alert volume, silencing cause-only alerts that have no user-visible symptom, and raising the bar for what pages rather than opens a ticket.

Incident Management

2–3 weeks

Incident response is a skill. The teams that recover fastest have practised the process — not just the technology.

Severity classification

SEV1–SEV4 definitions, blast radius estimation, and escalation criteria that avoid both under- and over-declaring.

Incident Triage Playbook

The incident commander role

Separating diagnosis from communication. The IC keeps the bridge clear; engineers focus on mitigation.

Diagnosing common failures

Latency spikes, OOMKills, memory leaks, disk full — systematic triage for the most frequent production failure modes.

Latency Spike Playbook

Gateway errors: 502/503/504

Distinguishing upstream failures from proxy misconfig, and the checks that confirm which it is.

502/503/504 Debugger

Blameless postmortems

Timeline reconstruction, contributing factors, and action items with owners — the document that prevents recurrence.

Status pages & communication

What to say, when to say it, and how to communicate under uncertainty without making customers more anxious.

Reliability Engineering

3–4 weeks

Proactive reliability work — finding failure modes before users do — is what separates SRE from reactive ops.

Chaos engineering

Principles of controlled failure injection: blast radius, steady state, hypotheses, and rollback. Chaos Mesh and LitmusChaos for Kubernetes.

Load testing & capacity planning

k6, Locust, and wrk for baseline profiling. Translating load test results into resource provisioning decisions.

Dependency resilience

Circuit breakers, retries with jitter, timeouts, and bulkheads — the patterns that prevent cascading failures.
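Retries without jitter synchronise clients and can turn a blip into a thundering herd. The sketch below shows exponential backoff with full jitter (a random delay between zero and a capped exponential bound); the function names, defaults, and the choice of which exceptions are retryable are assumptions for illustration.

```python
import random
import time


def full_jitter_delay(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def call_with_retries(operation, max_retries=3, base_s=0.1, cap_s=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random):
    """Call `operation`, retrying transient failures with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted, surface the failure to the caller
            sleep(full_jitter_delay(attempt, base_s, cap_s, rng))


# Usage: a hypothetical dependency that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_dependency():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"


print(call_with_retries(flaky_dependency, sleep=lambda seconds: None))
```

Retries multiply load on a struggling dependency, so they are usually paired with a per-call timeout and a circuit breaker that stops calling once failures pile up.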

Autoscaling for reliability

HPA and VPA for resource-based scaling, KEDA for event-driven scaling: keeping headroom without over-provisioning.

Blog: KEDA Event-Driven Autoscaling
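The HPA scaling decision follows a formula documented by Kubernetes: desired replicas equal the current replica count times the ratio of the current metric to the target, rounded up. A simplified sketch follows, assuming a single metric and the default 10% tolerance; stabilisation windows, min/max bounds, and pod readiness handling are left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation: ceil(current * currentMetric / target)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    return math.ceil(current_replicas * ratio)


# Example: 4 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

A lower target leaves more headroom for spikes at the cost of more idle capacity, which is the trade-off the node above describes.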

eBPF-based observability

Kernel-level tracing with low overhead and no application code changes: Cilium and Tetragon for network and security observability.

Blog: eBPF & Platform Engineering

Performance profiling

CPU flame graphs, memory allocation profiling, and identifying hot paths that don't show up in standard metrics.

Platform SRE Maturity

Ongoing

Mature SRE teams spend more time eliminating classes of failure than responding to individual incidents.

Toil elimination at scale

Automating runbooks, self-healing controllers, and building platform abstractions that remove whole categories of toil.

Multi-cluster reliability

Active-active vs active-passive, cross-cluster failover, and managing SLOs across multiple regions.

Blog: Multi-Cluster Patterns & Pitfalls

SLO-based capacity planning

Using error budget burn projections to predict when capacity will become a reliability risk — before it pages.

Embedded vs centralised SRE

The two team models, when each makes sense, and how to transition between them as the organisation grows.

Platform engineering overlap

Where SRE ends and platform engineering begins — golden paths, self-service infra, and shared reliability standards.

Platform Engineering Roadmap

Measuring reliability programme health

DORA metrics, alert-to-page ratios, postmortem action item completion rates, and toil percentage trends.

Put it into practice

The toolkit has SLO calculators, PromQL references, incident playbooks, and alert rule builders for the stages above — no account required.

Explore the Toolkit · Platform Engineering Roadmap · DevSecOps Roadmap