Observability

Install Prometheus and Grafana on Kubernetes with kube-prometheus-stack

Intermediate · 25 min to complete · 14 min read

Deploy the full kube-prometheus-stack — Prometheus, Alertmanager, and Grafana — with a single Helm install. Access pre-built cluster dashboards, write a custom alert rule, and configure Alertmanager to route alerts to Slack.

Before you begin

  • A running Kubernetes cluster
  • kubectl configured with cluster-admin access
  • Helm 3 installed
  • A Slack workspace (optional — for alert routing)

Tags: Prometheus, Grafana, Alertmanager, Kubernetes, Observability, Helm, Monitoring

Running a Kubernetes cluster without metrics is flying blind. You need to know when nodes are under memory pressure, when pods are crash-looping, when deployment rollouts stall, and when your API latency spikes before users notice. kube-prometheus-stack gives you all of that in a single Helm install: Prometheus scrapes every Kubernetes component, Grafana visualises it with pre-built dashboards, and Alertmanager routes firing alerts wherever you need them.

Step 1: Add the Prometheus Community Helm repository

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Step 2: Create the monitoring namespace

bash
kubectl create namespace monitoring

Step 3: Install kube-prometheus-stack

This installs Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, and 20+ pre-built dashboards:

bash
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

The retention=15d setting keeps 15 days of metrics on disk. In production, avoid passing the Grafana password via --set: it ends up in your shell history and in the Helm release Secret in plaintext. Use --values grafana-values.yaml or --set grafana.admin.existingSecret=my-grafana-secret instead. For longer retention, consider Thanos or VictoriaMetrics as a remote storage backend.
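As a sketch, a grafana-values.yaml using an existing Secret might look like this (the Secret name and key names are illustrative — you create that Secret yourself beforehand):

```yaml
# grafana-values.yaml — pass with: helm install ... --values grafana-values.yaml
grafana:
  admin:
    existingSecret: my-grafana-secret   # a Secret you create in the monitoring namespace
    userKey: admin-user                 # key holding the admin username
    passwordKey: admin-password         # key holding the admin password
```

This keeps the credential out of your shell history and out of the Helm release Secret.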

Step 4: Wait for all pods to start

bash
kubectl get pods -n monitoring --watch

Expected output after 2–3 minutes:

NAME                                                     READY   STATUS
alertmanager-kube-prometheus-stack-alertmanager-0        2/2     Running
kube-prometheus-stack-grafana-7d9f8c6b5-xk4p2            2/2     Running
kube-prometheus-stack-kube-state-metrics-6b9d-r4n7l      1/1     Running
kube-prometheus-stack-operator-5c9b9b9-k2j9m             1/1     Running
kube-prometheus-stack-prometheus-node-exporter-4wqxt     1/1     Running
prometheus-kube-prometheus-stack-prometheus-0            2/2     Running

Step 5: Access Grafana

Port-forward Grafana to your local machine:

bash
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Open http://localhost:3000 and log in with:

  • Username: admin
  • Password: the password you set in step 3
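If you lose the password, you can read it back from the Secret the chart created. The Secret and key names below follow the chart defaults for a release named kube-prometheus-stack:

```shell
# Decode the Grafana admin password from the chart-managed Secret.
kubectl get secret kube-prometheus-stack-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```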

Step 6: Explore the pre-built dashboards

Navigate to Dashboards → Browse. You'll find dashboards for:

  • Kubernetes / Compute Resources / Cluster — CPU/memory usage across the cluster
  • Kubernetes / Compute Resources / Namespace (Pods) — per-pod resource usage
  • Kubernetes / Networking / Cluster — network traffic by namespace
  • Node Exporter / Full — detailed per-node OS metrics
  • Kubernetes / Persistent Volumes — PVC usage and capacity

The most useful one to check first is Kubernetes / Compute Resources / Cluster — it shows you immediately if any namespace is consuming a disproportionate share of cluster resources.

Step 7: Access Prometheus directly

Port-forward Prometheus to explore raw metrics and test PromQL queries:

bash
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring

Open http://localhost:9090 and try a query:

promql
sum(rate(container_cpu_usage_seconds_total{namespace!="kube-system",container!=""}[5m])) by (namespace)

This shows CPU usage rate per namespace over the last 5 minutes — useful for identifying which teams are consuming the most compute. The container!="" filter excludes pause containers, which expose the same metric at the pod level and would otherwise cause double-counting.
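The same query can be run over Prometheus's HTTP API, which is handy for scripting. This assumes the port-forward above is still active on localhost:9090:

```shell
# GET /api/v1/query; -G plus --data-urlencode lets curl URL-encode the expression.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{namespace!="kube-system",container!=""}[5m])) by (namespace)'
```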

Step 8: Create a custom alert rule

PrometheusRule resources let you add alert rules without editing any ConfigMaps. The release: kube-prometheus-stack label tells the Prometheus Operator to pick up this rule:

bash
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # must match your helm install release name
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash-looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted more than once in the last 15 minutes"
    - alert: NodeHighMemoryPressure
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} has less than 10% memory available"
        description: "Available memory on {{ $labels.instance }} has been below 10% for 5 minutes"
EOF

Verify Prometheus picked up the rule:

bash
# Check in the Prometheus UI at http://localhost:9090/rules
# Or via kubectl:
kubectl get prometheusrule custom-alerts -n monitoring
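You can also confirm the rules loaded over Prometheus's API (assumes the port-forward from step 7 is still running on localhost:9090):

```shell
# List every loaded rule name; PodCrashLooping and NodeHighMemoryPressure
# should appear alongside the chart's built-in rules.
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"[^"]*"' | sort -u
```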

Step 9: Configure Alertmanager for Slack notifications

Create a Slack incoming webhook in your workspace (Your workspace → Apps → Incoming Webhooks), then update the Alertmanager configuration:

bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: slack-critical
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
      - match:
          severity: critical
        receiver: slack-critical
      - match:
          severity: warning
        receiver: slack-warnings
    receivers:
    - name: slack-critical
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#platform-alerts'
        title: ':red_circle: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true
    - name: slack-warnings
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#platform-warnings'
        title: ':warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true
EOF

Set api_url per receiver rather than in global: global.slack_api_url was deprecated in Alertmanager 0.22 and produces warnings in current versions.

Alertmanager automatically reloads its configuration when the Secret is updated.

Step 10: Verify Alertmanager received the config

Port-forward Alertmanager:

bash
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring

Open http://localhost:9093 and check the Status page to confirm your receivers are configured correctly.
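To exercise the Slack route end to end, you can push a synthetic alert through Alertmanager's v2 API. This is a sketch: it assumes the port-forward above is still running and that your webhook URL is live; the alert name and labels are made up for the test:

```shell
# Write a synthetic critical alert and POST it to the v2 alerts endpoint.
# startsAt is optional and defaults to now; the alert resolves itself
# after resolve_timeout (5m in the config above).
cat <<'EOF' > /tmp/test-alert.json
[
  {
    "labels": {
      "alertname": "RoutingTest",
      "severity": "critical",
      "namespace": "monitoring"
    },
    "annotations": {
      "description": "Synthetic alert to verify Slack routing"
    }
  }
]
EOF
curl -s -XPOST -H 'Content-Type: application/json' \
  --data @/tmp/test-alert.json http://localhost:9093/api/v2/alerts
```

If the routing works, a message should land in #platform-alerts within group_wait (30 seconds).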

What you built

Your cluster has full metrics coverage: Prometheus scrapes every node, pod, and Kubernetes component every 30 seconds. Grafana visualises 15 days of history with pre-built dashboards you can use immediately. PrometheusRule resources let any team add alert rules without touching the central configuration. Alertmanager routes firing alerts to the right Slack channels with deduplication and grouping so you're not flooded with repeated notifications.

We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.

Struggling with this in production?

We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.