Observability Engineering Hub
You can't fix what you can't see. Build SLOs, write alert rules, parse logs, and respond to incidents with tools designed for on-call engineers.
SLO / SLI Calculator
Calculate SLI, error budget, and allowed downtime.
Log Parser Sandbox
Filter logs by regex and level in your browser.
Incident Checklist Editor
Create and export incident checklists in markdown.
Prometheus Alert Rule Builder
Build alert rules with labels and annotations.
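For a sense of the arithmetic behind an SLO calculator, here is a minimal Python sketch. It assumes a request-based SLI (good events divided by total events) and a fixed window measured in days. The function name `slo_report` and its field names are hypothetical and are not taken from the calculator itself.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                       # observed fraction of good events
    error_budget: float              # allowed fraction of bad events (1 - target)
    budget_consumed: float           # share of the error budget already used
    allowed_downtime_minutes: float  # full-outage minutes the window tolerates


def slo_report(good_events: int, total_events: int,
               slo_target: float, window_days: int) -> SloReport:
    """Hypothetical helper: derive SLI, error budget, and allowed downtime."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")

    sli = good_events / total_events
    error_budget = 1 - slo_target
    # A value above 1.0 means the budget for this window is exhausted.
    budget_consumed = (1 - sli) / error_budget
    # Treats the budget as time: minutes of total outage the window allows.
    allowed_downtime_minutes = error_budget * window_days * 24 * 60
    return SloReport(sli, error_budget, budget_consumed, allowed_downtime_minutes)


report = slo_report(good_events=999_500, total_events=1_000_000,
                    slo_target=0.999, window_days=30)
print(f"SLI: {report.sli:.4%}")
print(f"Budget consumed: {report.budget_consumed:.0%}")
print(f"Allowed downtime: {report.allowed_downtime_minutes:.1f} min")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime, and 500 failed requests out of one million consume about half of the error budget.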
On-Call Playbooks
Structured runbooks for the most common production incidents — written for engineers under pressure.
Latency Spike Playbook
p95/p99 checks across DB, cache, and upstream services.
Error Rate Playbook
Find top failing endpoints, trace recent deploys, and roll back.
Disk Full Incident
Log rotation, tmp cleanup, and disk pressure resolution steps.
Memory Leak Identification
Identify and diagnose memory leaks in production containers.
Not sure where to start?
Answer 2–3 questions and the Troubleshooting Wizard will route you to the exact playbook for your incident.
PromQL Cheat Sheet
The query patterns you actually use on-call — rate, histogram_quantile, absent, and recording rules.
- Error rate and latency queries
- Aggregation across labels
- Alert expression patterns
Prometheus vs Datadog
Self-hosted vs managed observability — cost model, cardinality limits, and migration tradeoffs.
- Total cost of ownership
- High-cardinality label handling
- When to switch and when not to


