Monitoring Kubernetes with Prometheus and Grafana
Deploy the kube-prometheus-stack with Helm, understand what it collects out of the box, build a dashboard for your application, and set up your first alert rule — all in under an hour.
Before you begin
- A running Kubernetes cluster
- Helm 3 installed
- kubectl configured
- At least 2 CPU cores and 4Gi of memory free in the cluster
You don't need to configure Prometheus from scratch. The kube-prometheus-stack Helm chart deploys Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alert rules that cover the entire Kubernetes stack — nodes, pods, deployments, PVCs, and more.
This tutorial gets you from zero to a working monitoring stack, then shows you how to add your own application metrics.
Step 1: Install the kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword=changeme \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi
This deploys:
- Prometheus (metrics collection and storage)
- Grafana (dashboards and visualisation)
- Alertmanager (alert routing and deduplication)
- kube-state-metrics (exposes Kubernetes object state as metrics)
- node-exporter (exposes host-level metrics: CPU, memory, disk, network)
Wait for everything to start:
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=180s
kubectl get pods -n monitoring
Step 2: Access Grafana
Forward Grafana's port locally:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
Open http://localhost:3000. Log in with admin / changeme.
Navigate to Dashboards → Browse. You'll find 30+ pre-built dashboards:
- Kubernetes / Cluster — overall cluster health
- Kubernetes / Nodes — per-node CPU, memory, disk
- Kubernetes / Pods — per-pod resource usage
- Kubernetes / Workloads — deployment/daemonset/statefulset status
Step 3: Access Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
Open http://localhost:9090. This is Prometheus's built-in query UI.
Try a few queries:
# CPU usage per pod (5-minute average)
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
# Memory usage per pod
container_memory_working_set_bytes{namespace="production"}
# Number of ready replicas per deployment
kube_deployment_status_replicas_ready{namespace="production"}
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
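These metrics also combine well. Dividing real usage by requested resources shows how close pods run to their allocations; a sketch using the request metric from kube-state-metrics (v2+ metric name), assuming the same production namespace:

```
# CPU usage as a fraction of CPU requests, per pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"}) by (pod)
```

A value near 1 means the pod is consuming everything it asked for; consistently low values suggest over-provisioned requests.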
Step 4: Instrument Your Application
To expose custom metrics from your application, use a Prometheus client library.
Node.js (prom-client):
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Counter: total HTTP requests
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status_code'],
  registers: [register]
});

// Histogram: request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register]
});

// Instrument all routes: register this middleware before your route handlers
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.path
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, status_code: res.statusCode });
    end({ status_code: res.statusCode });
  });
  next();
});

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
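Once the app is running, a GET to /metrics returns plain-text Prometheus exposition format. With the counter above, the response includes lines like these (values are illustrative):

```
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status_code="200"} 42
http_requests_total{method="POST",status_code="500"} 3
```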
Go (prometheus/client_golang):
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    }, []string{"method", "status_code"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: []float64{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5},
    }, []string{"method", "route"})
)

func main() {
    // Expose the metrics endpoint; instrument your handlers with the
    // vectors above, e.g. httpRequestsTotal.WithLabelValues("GET", "200").Inc().
    // :8080 is an example port.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Step 5: Tell Prometheus to Scrape Your App
Create a ServiceMonitor — a CRD that kube-prometheus-stack uses to configure scraping:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: kube-prometheus-stack  # must match the Helm release label
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
kubectl apply -f servicemonitor.yaml
Your application's Service must have a port named http (or whatever you specify in endpoints.port). Verify Prometheus is scraping it at http://localhost:9090 → Status → Targets.
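For reference, a matching Service might look like this (the app label and port numbers are assumptions to line up with the ServiceMonitor above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app              # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: http             # must match endpoints.port in the ServiceMonitor
      port: 8080
      targetPort: 8080
```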
Step 6: Build a Grafana Dashboard for Your App
In Grafana, click the + icon → Dashboard → Add visualization.
Panel 1: Request rate
sum(rate(http_requests_total{namespace="production"}[2m])) by (status_code)
Set visualization type: Time series. Set legend to {{status_code}}.
Panel 2: P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m])) by (le, route)
)
Panel 3: Error rate (5xx)
sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[2m]))
/
sum(rate(http_requests_total{namespace="production"}[2m]))
Set threshold to 0.01 (1% error rate = red).
Save the dashboard. Click Share → Export → save the JSON to your repo to version it.
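If you'd rather provision the dashboard declaratively, the chart's Grafana sidecar (enabled by default) loads any ConfigMap labelled grafana_dashboard. A sketch, assuming the chart's default sidecar settings, with your exported JSON inlined:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # picked up by the Grafana dashboard sidecar
data:
  my-app-dashboard.json: |
    { ...paste the exported dashboard JSON here... }
```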
Step 7: Create an Alert Rule
Alert when error rate exceeds 1% for 5 minutes:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx error rate on my-app"
            description: "Error rate is {{ $value | humanizePercentage }}. Investigate logs."
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
kubectl apply -f prometheus-rules.yaml
Check the rule at http://localhost:9090 → Alerts. It should appear in the INACTIVE state (not firing yet).
Step 8: Configure Alertmanager
By default, Alertmanager doesn't route alerts anywhere. Configure Slack notifications:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    route:
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-alerts
      routes:
        - match:
            severity: critical
          receiver: slack-critical
    receivers:
      - name: slack-alerts
        slack_configs:
          - channel: "#alerts"
            title: "{{ .CommonAnnotations.summary }}"
            text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
      - name: slack-critical
        slack_configs:
          - channel: "#oncall"
            title: "CRITICAL: {{ .CommonAnnotations.summary }}"
The secret name must match the Alertmanager StatefulSet created by the chart (alertmanager-<release>-alertmanager). Note that the operator generates this secret from the chart's alertmanager.config value, so a later helm upgrade can overwrite manual edits; for a long-lived setup, put this config in your values file instead.
kubectl apply -f alertmanager-config.yaml
# Restart Alertmanager to pick it up
kubectl rollout restart statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring
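To confirm the new config loaded, port-forward Alertmanager (the service name assumes the release name from Step 1) and inspect its UI:

```
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
```

Open http://localhost:9093 → Status and verify the routing tree matches what you applied.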
Persistent Storage in Production
The storageSpec in Step 1 creates a PersistentVolumeClaim for Prometheus. For Grafana, add:
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=5Gi \
--reuse-values
Without persistent storage, your dashboards and alert history disappear on pod restart.
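As the --set flags accumulate, it's easier to keep them in a values file. A sketch consolidating the settings used in this tutorial (adjust sizes and the password to taste):

```yaml
# values.yaml
grafana:
  adminPassword: changeme
  persistence:
    enabled: true
    size: 5Gi
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
```

Then apply it with: helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml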
We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.
Struggling with this in production?
We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.