DevOps & Platform

Monitoring Kubernetes with Prometheus and Grafana

Beginner · 45 min to complete · 11 min read

Deploy the kube-prometheus-stack with Helm, understand what it collects out of the box, build a dashboard for your application, and set up your first alert rule — all in under an hour.

Before you begin

  • A running Kubernetes cluster
  • Helm 3 installed
  • kubectl configured
  • At least 2 CPU and 4Gi memory available in the cluster

Tags: Prometheus · Grafana · Kubernetes · Monitoring · Observability

You don't need to configure Prometheus from scratch. The kube-prometheus-stack Helm chart deploys Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alert rules that cover the entire Kubernetes stack — nodes, pods, deployments, PVCs, and more.

This tutorial gets you from zero to a working monitoring stack, then shows you how to add your own application metrics.

Step 1: Install the kube-prometheus-stack

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi

This deploys:

  • Prometheus (metrics collection and storage)
  • Grafana (dashboards and visualisation)
  • Alertmanager (alert routing and deduplication)
  • kube-state-metrics (exposes Kubernetes object state as metrics)
  • node-exporter (exposes host-level metrics: CPU, memory, disk, network)

Wait for everything to start:

bash
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=180s
kubectl get pods -n monitoring

Step 2: Access Grafana

Forward Grafana's port locally:

bash
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Open http://localhost:3000. Log in with admin / changeme.

Navigate to Dashboards → Browse. You'll find 30+ pre-built dashboards:

  • Kubernetes / Cluster — overall cluster health
  • Kubernetes / Nodes — per-node CPU, memory, disk
  • Kubernetes / Pods — per-pod resource usage
  • Kubernetes / Workloads — deployment/daemonset/statefulset status

Step 3: Access Prometheus

bash
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring

Open http://localhost:9090. This is Prometheus's built-in query UI.

Try a few queries:

promql
# CPU usage per pod (5-minute average)
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage per pod
container_memory_working_set_bytes{namespace="production"}

# Number of ready replicas per deployment
kube_deployment_status_replicas_ready{namespace="production"}

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

Step 4: Instrument Your Application

To expose custom metrics from your application, use a Prometheus client library.

Node.js (prom-client):

javascript
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Counter: total HTTP requests
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status_code'],
  registers: [register]
});

// Histogram: request duration
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register]
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Instrument a route
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.path
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, status_code: res.statusCode });
    end({ status_code: res.statusCode });
  });
  next();
});

Go (prometheus/client_golang):

go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    }, []string{"method", "status_code"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: []float64{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5},
    }, []string{"method", "route"})
)

// In your main():
http.Handle("/metrics", promhttp.Handler())
// then start the server, e.g. http.ListenAndServe(":8080", nil)

Step 5: Tell Prometheus to Scrape Your App

Create a ServiceMonitor — a CRD that kube-prometheus-stack uses to configure scraping:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: kube-prometheus-stack   # Must match the Helm release label
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
bash
kubectl apply -f servicemonitor.yaml

Your application's Service must have a port named http (or whatever you specify in endpoints.port). Verify Prometheus is scraping it at http://localhost:9090 → Status → Targets.
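For reference, a matching Service might look like this (a sketch — the `app: my-app` label and port numbers are illustrative and must line up with your Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
    - name: http         # matched by endpoints.port in the ServiceMonitor
      port: 80
      targetPort: 8080   # the container port serving /metrics
```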

Step 6: Build a Grafana Dashboard for Your App

In Grafana, click the + icon → Dashboard → Add visualization.

Panel 1: Request rate

promql
sum(rate(http_requests_total{namespace="production"}[2m])) by (status_code)

Set visualization type: Time series. Set legend to {{status_code}}.

Panel 2: P95 latency

promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m])) by (le, route)
)

Panel 3: Error rate (5xx)

promql
sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[2m]))
/
sum(rate(http_requests_total{namespace="production"}[2m]))

Set threshold to 0.01 (1% error rate = red).

Save the dashboard. Click Share → Export → save the JSON to your repo to version it.
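If you'd rather deploy the dashboard alongside your other manifests than import it by hand, the chart's Grafana sidecar can load it from a ConfigMap — a sketch, assuming the chart's default sidecar label `grafana_dashboard: "1"`; paste the exported JSON into the data field:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for ConfigMaps with this label
data:
  my-app.json: |
    { "title": "my-app", "panels": [] }   # replace with the exported dashboard JSON
```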

Step 7: Create an Alert Rule

Alert when error rate exceeds 1% for 5 minutes:

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="production",status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="production"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx error rate on my-app"
            description: "Error rate is {{ $value | humanizePercentage }} — investigate logs"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
bash
kubectl apply -f prometheus-rules.yaml

Check the rule at http://localhost:9090 → Alerts. It should appear as Inactive (not firing yet).

Step 8: Configure Alertmanager

By default, Alertmanager doesn't route alerts anywhere. Configure Slack notifications:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager   # the Secret the operator mounts into Alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

    route:
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-alerts
      routes:
        - match:
            severity: critical
          receiver: slack-critical

    receivers:
      - name: slack-alerts
        slack_configs:
          - channel: "#alerts"
            title: "{{ .CommonAnnotations.summary }}"
            text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

      - name: slack-critical
        slack_configs:
          - channel: "#oncall"
            title: "CRITICAL: {{ .CommonAnnotations.summary }}"
bash
kubectl apply -f alertmanager-config.yaml
# Restart Alertmanager to pick it up
kubectl rollout restart statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring

Persistent Storage in Production

The storageSpec in Step 1 creates a PersistentVolumeClaim for Prometheus. For Grafana, add:

bash
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=5Gi \
  --reuse-values

Without persistent storage, your dashboards and alert history disappear on pod restart.
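As the list of `--set` flags grows, it's easier to collect them in a values file. A sketch consolidating the settings used in this tutorial:

```yaml
# values.yaml
grafana:
  adminPassword: changeme
  persistence:
    enabled: true
    size: 5Gi
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
```

Apply it with `helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`, which also keeps the configuration reviewable in version control.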

We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.

Struggling with this in production?

We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.