From SRE principles to platform maturity — an opinionated path for engineers who own production reliability. Covers the mindset, the metrics, the tooling, and the team model.
SRE is a mindset before it's a toolset. Understand the contract between reliability and velocity before touching any tooling.
SRE vs DevOps vs Ops
What makes SRE distinct: engineering solutions to operations problems, not just running them.
Error budgets
The core insight: reliability has a cost, and spending the budget on features vs. reliability is a business decision.
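A minimal sketch of the budget arithmetic, assuming a request-based SLI. The function name and the example numbers are illustrative and not taken from any particular tool:

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    # The budget is the number of failures the SLO tolerates for this traffic.
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# Hypothetical month: 99.9% SLO, 1,000,000 requests, 400 failures.
# 1,000 failures are allowed, so roughly 60% of the budget is left.
print(f"{error_budget_remaining(0.999, 1_000_000, 400):.0%}")
```

A positive result is budget that can be spent on risky launches; a negative one is an argument for prioritising reliability work.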
Toil defined
Manual, repetitive, automatable work that scales with traffic. Measuring toil is the first step to eliminating it.
Blameless postmortems
Systems fail; cultures that punish individuals hide the real causes. How to run postmortems that actually fix things.
Production readiness
Launch checklists, capacity reviews, and handoff criteria that prevent on-call surprises.
SREs debug production under pressure. Linux internals, networking, and cluster tooling are the instruments you reach for first.
Linux internals for SREs
Process states, memory pressure signals (OOM scores), file descriptors, and I/O wait — what each tells you during an incident.
Networking under the hood
TCP connection states, TIME_WAIT storms, DNS TTLs, and reading ss/netstat output during a degradation.
Linux Networking Cheat Sheet
kubectl for debugging
Events, describe, exec, port-forward, top — the daily debugging toolkit for cluster-hosted services.
Kubectl Cheat Sheet
k9s for cluster ops
Navigate pods, stream logs, exec into containers, and check resource pressure without memorising every kubectl flag.
k9s Cheat Sheet
CLI data wrangling: jq & curl
Parse JSON API responses, query cluster metadata, and test endpoints from the terminal during incidents.
jq Cheat Sheet
Observability is the practice of making a system's internal state legible from its outputs. You can't respond to what you can't see.
The three pillars
Metrics answer 'what', logs answer 'why', traces answer 'where'. Each has a different cost and granularity tradeoff.
Prometheus architecture
Scrape intervals, TSDB retention, relabelling, and remote write for long-term storage.
PromQL in depth
rate(), irate(), histogram_quantile(), absent(), and recording rules for query performance.
PromQL Cheat Sheet
Structured logging & log parsing
JSON logs, field cardinality, and building filters that narrow to root cause without drowning in noise.
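A small sketch of field-based filtering over JSON log lines. The field names (`level`, `service`, `msg`) and the sample records are hypothetical and only there to show the shape of the filter:

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **match: object) -> Iterator[dict]:
    """Yield JSON log records whose fields equal every key/value in `match`."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as stack-trace fragments
        if isinstance(record, dict) and all(
            record.get(key) == value for key, value in match.items()
        ):
            yield record


# Hypothetical log lines for illustration only.
sample = [
    '{"level": "info", "service": "checkout", "msg": "order placed"}',
    '{"level": "error", "service": "checkout", "msg": "payment timeout"}',
    '{"level": "error", "service": "search", "msg": "index unavailable"}',
    "plain text line that is not JSON",
]

for rec in filter_logs(sample, level="error", service="checkout"):
    print(rec["msg"])
```

Filtering on low-cardinality fields first (level, service) and only then on high-cardinality ones (request ID, user ID) tends to keep the result set manageable.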
Log Parser Sandbox
Distributed tracing
OpenTelemetry instrumentation, trace context propagation, and how traces reveal the latency breakdown across services.
Blog: OpenTelemetry Migration
USE and RED methods
USE (Utilization, Saturation, Errors) for resources; RED (Rate, Errors, Duration) for services — systematic dashboard design.
SLOs are the contract between reliability and product velocity. Getting them right changes every conversation about risk and priority.
Defining meaningful SLIs
Request success rate, latency percentiles, and availability — picking indicators that reflect the user experience, not server health.
Setting SLO targets
Why 99.9% and 99.99% are different businesses. How to negotiate targets with product teams using historical data.
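To make the gap between the two targets concrete, here is a rough calculation of allowed downtime, assuming a time-based availability SLI over a 30-day window (the function name and the window length are assumptions):

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


for target in (0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%}: about {minutes:.1f} minutes per 30 days")
```

At 99.9% the budget is about 43 minutes per 30 days; at 99.99% it is just over 4 minutes, which is typically less than the time a human needs to acknowledge a page and start diagnosing.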
SLO/SLI Calculator
Error budget burn rates
Fast burn vs slow burn, 1-hour and 6-hour windows, and why a 5× burn rate needs a page at 2 AM.
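Burn rate is the observed error ratio divided by the error ratio the SLO allows. A sketch of that definition, assuming a 30-day SLO period (the function names are illustrative):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_days * 24 / rate


# Hypothetical incident: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(0.005, 0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in: {hours_to_exhaustion(rate):.0f} hours")
```

In this example the burn rate is about 5×, which would exhaust a full 30-day budget in roughly 144 hours, or six days, if nothing changes.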
Blog: SLOs, Error Budgets & Burn Rates
Toil budget tracking
Capping toil at 50%, tracking it per sprint, and using the remaining capacity for reliability projects.
The SLO review cycle
Quarterly reviews, when to tighten vs loosen targets, and how to present error budget status to stakeholders.
A page that wakes someone at 3 AM is a product decision. Every alert must be actionable, urgent, and tied to user impact.
Alert rule design principles
Actionable, urgent, and non-flappy. Every alert should have a runbook link and a clear remediation path.
Prometheus Alert Rule Builder
Multi-window burn rate alerts
The Google SRE Workbook approach for a 30-day SLO: 2% of the budget consumed in 1 hour or 5% in 6 hours triggers a page; 10% in 3 days triggers a ticket.
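A rule of the form "X% of the budget consumed in W hours" translates into a burn-rate threshold. The sketch below derives that threshold and applies a two-window check, assuming a 30-day SLO period and a short confirmation window alongside the long one (the function names are illustrative):

```python
def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_days: int = 30
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_days * 24 / window_hours


def should_alert(long_window_rate: float, short_window_rate: float, threshold: float) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert resets soon after recovery.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


one_hour = burn_rate_threshold(0.02, 1)   # 2% of the budget in 1 hour
six_hour = burn_rate_threshold(0.05, 6)   # 5% of the budget in 6 hours
print(f"1h threshold: {one_hour:.1f}x, 6h threshold: {six_hour:.1f}x")

# Hypothetical measured burn rates over the long and short windows.
print(should_alert(long_window_rate=16.0, short_window_rate=15.2, threshold=one_hour))
print(should_alert(long_window_rate=16.0, short_window_rate=0.3, threshold=one_hour))
```

With these inputs the thresholds come out at roughly 14.4× and 6×. The second check returns `False` because the short window shows the burn has already stopped.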
Runbook design
Structured, machine-readable runbooks that a half-asleep on-call responder can follow without context.
Escalation & rotation design
Primary/secondary rotations, escalation timeouts, and business-hours vs 24×7 coverage strategies.
Alert fatigue elimination
Auditing alert volume, silencing alerts that fire without user-visible impact, and raising the threshold for pages vs tickets.
Incident response is a skill. The teams that recover fastest have practised the process — not just the technology.
Severity classification
SEV1–SEV4 definitions, blast radius estimation, and escalation criteria that avoid both under- and over-declaring.
Incident Triage Playbook
The incident commander role
Separating diagnosis from communication. The IC keeps the bridge clear; engineers focus on mitigation.
Diagnosing common failures
Latency spikes, OOMKills, memory leaks, disk full — systematic triage for the most frequent production failure modes.
Latency Spike Playbook
Gateway errors: 502/503/504
Distinguishing upstream failures from proxy misconfig, and the checks that confirm which it is.
502/503/504 Debugger
Blameless postmortems
Timeline reconstruction, contributing factors, and action items with owners — the document that prevents recurrence.
Status pages & communication
What to say, when to say it, and how to communicate under uncertainty without making customers more anxious.
Proactive reliability work — finding failure modes before users do — is what separates SRE from reactive ops.
Chaos engineering
Principles of controlled failure injection: blast radius, steady state, hypotheses, and rollback. Chaos Mesh and LitmusChaos for Kubernetes.
Load testing & capacity planning
k6, Locust, and wrk for baseline profiling. Translating load test results into resource provisioning decisions.
Dependency resilience
Circuit breakers, retries with jitter, timeouts, and bulkheads — the patterns that prevent cascading failures.
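As one illustration of these patterns, here is a minimal sketch of retries with capped exponential backoff and full jitter. The function name, the defaults, and the injectable `sleep` and `rng` parameters are assumptions made for the example, not the API of any library:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    op: Callable[[], T],
    attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> T:
    """Call `op` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)]
            # so that many clients do not retry in lockstep.
            sleep(rng(0.0, min(cap, base * 2 ** attempt)))
    raise AssertionError("loop always returns or raises")
```

In real use the `except` clause would usually be narrowed to errors that are safe to retry, and the whole call would sit behind a timeout so that retries cannot pile up on a dependency that is already struggling.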
Autoscaling for reliability
HPA, VPA, KEDA for event-driven scaling — keeping headroom without over-provisioning.
Blog: KEDA Event-Driven Autoscaling
eBPF-based observability
Kernel-level tracing with low overhead and no application code changes; Cilium and Tetragon for network and security observability.
Blog: eBPF & Platform Engineering
Performance profiling
CPU flame graphs, memory allocation profiling, and identifying hot paths that don't show up in standard metrics.
Mature SRE teams spend more time eliminating classes of failure than responding to individual incidents.
Toil elimination at scale
Automating runbooks, self-healing controllers, and building platform abstractions that remove whole categories of toil.
Multi-cluster reliability
Active-active vs active-passive, cross-cluster failover, and managing SLOs across multiple regions.
Blog: Multi-Cluster Patterns & Pitfalls
SLO-based capacity planning
Using error budget burn projections to predict when capacity will become a reliability risk — before it pages.
Embedded vs centralised SRE
The two team models, when each makes sense, and how to transition between them as the organisation grows.
Platform engineering overlap
Where SRE ends and platform engineering begins — golden paths, self-service infra, and shared reliability standards.
Platform Engineering Roadmap
Measuring reliability programme health
DORA metrics, alert-to-page ratios, postmortem action item completion rates, and toil percentage trends.
The toolkit has SLO calculators, PromQL references, incident playbooks, and alert rule builders for the stages above — no account required.