Observability Engineering Hub

You can't fix what you can't see. Build SLOs, write alert rules, parse logs, and respond to incidents with tools designed for on-call engineers.

SLO / SLI Calculator

Calculate SLI, error budget, and allowed downtime.

Open Tool
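
For a sense of the arithmetic involved, here is a minimal sketch in Python. It is not the calculator's actual implementation: the function names `sli` and `error_budget`, the 30-day default window, and the treatment of the SLO as a simple availability fraction are all assumptions made for illustration.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator as the fraction of good events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Error budget and allowed downtime for an availability SLO.

    slo_target is a fraction such as 0.999 (assumed, not a percentage).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    budget_fraction = 1 - slo_target          # share of the window allowed to fail
    window_minutes = window_days * 24 * 60    # total minutes in the SLO window
    return {
        "budget_fraction": budget_fraction,
        "allowed_downtime_minutes": budget_fraction * window_minutes,
    }


if __name__ == "__main__":
    budget = error_budget(0.999, window_days=30)
    print(f"Allowed downtime: {budget['allowed_downtime_minutes']:.1f} minutes")
    print(f"Current SLI: {sli(9_990, 10_000):.4f}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime.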

Log Parser Sandbox

Filter logs by regex and level in your browser.

Open Tool
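
Filtering by regex and level can be sketched in a few lines of Python. This is an illustration only, not the sandbox's code: the line format (`<timestamp> <LEVEL> <message>`), the level ordering in `LEVELS`, and the `filter_logs` name are all assumptions.

```python
import re
from typing import Iterable, Iterator, Optional

# Assumed severity ordering; real log formats may use other names.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}

# Assumed line format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def filter_logs(
    lines: Iterable[str],
    pattern: Optional[str] = None,
    min_level: str = "INFO",
) -> Iterator[str]:
    """Yield lines at or above min_level whose message matches pattern."""
    threshold = LEVELS[min_level]
    matcher = re.compile(pattern) if pattern else None
    for line in lines:
        parsed = LINE_RE.match(line)
        if parsed is None:
            continue  # skip lines that do not fit the assumed format
        if LEVELS.get(parsed["level"], 0) < threshold:
            continue  # unknown levels are treated as lowest severity
        if matcher and not matcher.search(parsed["msg"]):
            continue
        yield line


if __name__ == "__main__":
    sample = [
        "2024-01-01T00:00:00Z INFO service started",
        "2024-01-01T00:00:01Z ERROR timeout calling db",
        "2024-01-01T00:00:02Z DEBUG cache miss",
    ]
    for hit in filter_logs(sample, pattern=r"timeout", min_level="WARN"):
        print(hit)
```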

Incident Checklist Editor

Create and export incident checklists in markdown.

Open Tool

Prometheus Alert Rule Builder

Build alert rules with labels and annotations.

Open Tool

On-Call Playbooks

Structured runbooks for the most common production incidents — written for engineers under pressure.

Latency Spike Playbook

p95/p99 checks across DB, cache, and upstream services.

Open Playbook

Error Rate Playbook

Find top failing endpoints, trace recent deploys, and roll back.

Open Playbook

Disk Full Incident

Log rotation, tmp cleanup, and disk pressure resolution steps.

Open Playbook

Memory Leak Identification

Identify and diagnose memory leaks in production containers.

Open Playbook

Not sure where to start?

Answer two or three questions and the Troubleshooting Wizard routes you to the playbook that matches your incident.

Launch Wizard

PromQL Cheat Sheet

The query patterns you actually use on-call — rate, histogram_quantile, absent, and recording rules.

Error rate and latency queries
Aggregation across labels
Alert expression patterns

Read Cheat Sheet

Prometheus vs Datadog

Self-hosted vs managed observability — cost model, cardinality limits, and migration tradeoffs.

Total cost of ownership
High-cardinality label handling
When to switch and when not to

Read Comparison

Observability Deep-Dives

View all observability guides

OpenTelemetry Migration Guide

How to move from Datadog or New Relic to OpenTelemetry (OTel) without losing visibility.

SLOs That Actually Work

Moving beyond "99.9%" to error budgets and burn-rate alerting.
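
As a rough sketch of the idea, burn rate is the observed error ratio divided by the error budget, and a multi-window check pages only when both a long and a short window are burning fast. The function names here are hypothetical, and the 14.4 threshold is the commonly cited value for a 1-hour/5-minute window pair on a 30-day SLO (it corresponds to spending 2% of the budget in one hour); treat it as an assumption to tune, not a rule from the guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return error_ratio / (1 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; see lead-in
) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The short window keeps the alert from firing long after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # 2% errors against a 99.9% SLO burns budget about 20x too fast.
    print(round(burn_rate(0.02, 0.999), 1))
    print(should_page(0.02, 0.02, 0.999))
    print(should_page(0.02, 0.001, 0.999))
```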

eBPF Observability with Hubble

Service maps and flow observability without sidecars or per-pod agents.