Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry in Production
Metrics, logs, and traces are not optional on a production Kubernetes platform. Here's how to build an observability stack with Prometheus, Grafana, and OpenTelemetry that gives you the signals you need without drowning in noise or cost.

A Kubernetes cluster without observability is a black box. Pods restart and you don't know why. Latency spikes and you can't tell which service. Node pressure triggers evictions and by the time you investigate, the evidence is gone.
A Kubernetes cluster with too much observability is a different problem: $40,000/month in Datadog bills, Prometheus running out of disk every two weeks, dashboards nobody looks at, and alert fatigue that trains your team to ignore PagerDuty.
This guide covers the production-calibrated observability stack — what to collect, how to structure it, and the specific configurations that prevent it from becoming unmanageable.
The Three Pillars in a Kubernetes Context
Metrics — numeric time-series data. CPU, memory, request rate, error rate, latency percentiles. Prometheus is the standard; every Kubernetes component exposes Prometheus-format metrics.
Logs — structured or unstructured event streams from application and system processes. Loki (Grafana's log aggregation system), Elasticsearch, or CloudWatch Logs are the common stores.
Traces — distributed request traces showing how a request flows through multiple services. OpenTelemetry is the standard instrumentation layer; Tempo (Grafana), Jaeger, or Zipkin are common backends.
All three feed into Grafana for visualisation and alerting. The integration point is OpenTelemetry Collector — a vendor-neutral pipeline for collecting, processing, and exporting all three signal types to any backend.
The kube-prometheus-stack
The fastest path to production-ready Kubernetes metrics is kube-prometheus-stack (formerly prometheus-operator). It deploys:
- Prometheus Operator — manages Prometheus and Alertmanager as Kubernetes resources
- Prometheus — metrics collection and storage
- Alertmanager — alert routing and deduplication
- Grafana — dashboards and visualisation
- kube-state-metrics — cluster-state metrics (Deployment replicas, Pod status, PVC status)
- node-exporter — node-level metrics (CPU, memory, disk, network per node)
- Pre-built dashboards for Kubernetes cluster overview, node resources, pod resources
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 67.0.0 \
  --values prometheus-values.yaml \
  --wait
```

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d        # How long to keep raw metrics
    retentionSize: 50GB   # Size-based retention cap
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
    # Scrape all ServiceMonitors and PodMonitors in any namespace
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: "change-me-via-secret"
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com

# Disable default rules that don't apply to this cluster
defaultRules:
  rules:
    kubeProxy: false   # Disable if using Cilium (no kube-proxy)
```

ServiceMonitor and PodMonitor
The Prometheus Operator uses ServiceMonitor and PodMonitor CRDs to configure scrape targets. Instead of editing prometheus.yml, you create these resources alongside your application.
```yaml
# ServiceMonitor for an HTTP service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
  namespace: production
  labels:
    app: api
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics        # Named port on the Service
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - production
```

```yaml
# PodMonitor for pods without a Service (DaemonSets, sidecars)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
```

The Prometheus Operator watches for these resources cluster-wide and automatically adds the scrape targets to the Prometheus configuration. Application teams create their own ServiceMonitor in their namespace; there is no central Prometheus config file to edit. For a deep dive into ServiceMonitor, PodMonitor, AlertmanagerConfig, and multi-namespace monitoring, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring.
Recording Rules: Taming Cardinality
Raw Prometheus metrics can be expensive to query. A rate() over a high-cardinality metric at query time scans millions of time series. Recording rules pre-compute expensive queries and store the results as new, lower-cardinality metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: api.rules
      interval: 30s
      rules:
        # Pre-compute request rate by service
        - record: job:http_requests_total:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, namespace)

        # Pre-compute error rate
        - record: job:http_errors_total:rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, namespace)
            /
            sum(rate(http_requests_total[5m])) by (job, namespace)

        # P95 latency per service
        - record: job:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, namespace)
            )
```

Dashboard queries reference the recording rule metrics (job:http_requests_total:rate5m) instead of the raw metric. Dashboards load in milliseconds; alerts evaluate in milliseconds.
Alerting
AlertManager routes alerts to Slack, PagerDuty, email, or any webhook. Alert rules are defined as PrometheusRule resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.alerts
      rules:
        - alert: HighErrorRate
          expr: job:http_errors_total:rate5m > 0.05
          for: 5m
          labels:
            severity: warning
            team: api
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 5
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
```

Alert fatigue prevention:
- Every alert needs a runbook_url — if there's no runbook, there's no alert
- for: 5m means the condition must be true for 5 continuous minutes before firing — prevents alerts on transient spikes
- Use severity: warning liberally and severity: critical sparingly — critical should mean "wake someone up now"
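Routing is the other half of the picture: with the Prometheus Operator, alert routing can itself be declared as a namespaced AlertmanagerConfig resource. A minimal sketch follows, where the receiver names, Slack channel, and referenced Secrets (the webhook URL and PagerDuty routing key) are placeholders you would create yourself:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: default-routing
  namespace: monitoring
spec:
  route:
    receiver: slack-warnings            # default: everything goes to Slack
    groupBy: ["alertname", "namespace"]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      - matchers:
          - name: severity
            value: critical
            matchType: "="
        receiver: pagerduty-oncall      # only critical pages a human
  receivers:
    - name: slack-warnings
      slackConfigs:
        - channel: "#platform-alerts"
          apiURL:
            name: slack-webhook         # Secret holding the webhook URL
            key: url
    - name: pagerduty-oncall
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-events-key  # Secret holding the Events API routing key
            key: routingKey
```

Note that the operator scopes an AlertmanagerConfig to alerts from its own namespace by default; cluster-wide defaults are usually set in the chart's alertmanager.config values instead.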
Logs with Loki
Loki (Grafana's log aggregation system) is the cost-effective alternative to Elasticsearch for Kubernetes logs. It indexes only metadata (labels), not the full log content — dramatically cheaper at scale.
```bash
helm upgrade --install loki grafana/loki \
  --namespace monitoring \
  --version 6.22.0 \
  --values loki-values.yaml
```

```yaml
# loki-values.yaml — S3-backed storage for production
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      region: us-east-1
      bucketnames: my-loki-logs
  schema_config:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
```

Promtail ships logs from pods to Loki:

```bash
helm upgrade --install promtail grafana/promtail \
  --namespace monitoring \
  --set loki.serviceName=loki
```

Promtail runs as a DaemonSet, reads container logs from the node's /var/log/pods/ directory, and adds Kubernetes metadata labels (pod name, namespace, container name) before shipping to Loki.
In Grafana, logs appear in the Explore view and can be correlated with metrics using shared labels (pod name, namespace). For a production-focused setup using Fluent Bit (a lighter-weight alternative to Promtail) with Loki on EKS, including S3 backend configuration and LogQL querying, see Kubernetes Logging: Fluent Bit and Grafana Loki.
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTEL) is the CNCF standard for application instrumentation. The OpenTelemetry Collector is the central processing pipeline — applications send traces, metrics, and logs to it; it processes and exports to backends.
```bash
helm upgrade --install opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --values otel-values.yaml
```

```yaml
# otel-values.yaml
mode: deployment   # or daemonset for node-local collection

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
    # Add Kubernetes metadata to all telemetry
    k8sattributes:
      extract:
        metadata:
          - k8s.namespace.name
          - k8s.pod.name
          - k8s.deployment.name
          - k8s.node.name

  exporters:
    otlp/tempo:
      endpoint: tempo:4317
      tls:
        insecure: true
    prometheusremotewrite:
      endpoint: http://kube-prometheus-stack-prometheus:9090/api/v1/write
    loki:
      endpoint: http://loki:3100/loki/api/v1/push

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [prometheusremotewrite]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [loki]
```

Applications instrument with the OTEL SDK for their language and send to opentelemetry-collector.monitoring.svc.cluster.local:4317.
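Pointing an application at the Collector is mostly configuration rather than code. A minimal sketch, assuming a hypothetical api Deployment in the production namespace, using the standard OTEL SDK environment variables:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # placeholder image
          env:
            - name: OTEL_SERVICE_NAME
              value: "api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://opentelemetry-collector.monitoring.svc.cluster.local:4317"
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production"
```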
Grafana Tempo stores traces and integrates with Grafana for trace visualisation and exemplar linking (jump from a metric spike directly to the trace that caused it).
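Registering Tempo and Loki as Grafana datasources can ride along in the same kube-prometheus-stack values file. A sketch, assuming the in-cluster Service names used above and a Tempo HTTP query endpoint on port 3200 (adjust both to your deployment):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      url: http://loki:3100
    - name: Tempo
      type: tempo
      uid: tempo
      url: http://tempo:3200          # Tempo's HTTP query port (deployment-specific)
      jsonData:
        tracesToLogsV2:
          datasourceUid: loki         # jump from a span to the pod's logs
          filterByTraceID: true
```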
Cardinality: The Cost Killer
High cardinality is the most common cause of Prometheus cost explosions. Cardinality is the number of unique time series — each unique combination of metric name and label values is one series.
Dangerous label patterns:
```go
// WRONG — user_id in a label creates one series per user
prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_total",
}, []string{"method", "path", "status", "user_id"}) // user_id = millions of series

// RIGHT — aggregate at a higher level
prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_total",
}, []string{"method", "status_class"}) // status_class = "2xx", "4xx", "5xx"
```

Never use these as metric label values: user IDs, request IDs, session tokens, email addresses, or any other high-cardinality identifier.
Cardinality audit:
```bash
# Find the highest-cardinality metrics
curl -s http://prometheus:9090/api/v1/label/__name__/values | \
  jq -r '.data[]' | while read -r metric; do
    count=$(curl -sG http://prometheus:9090/api/v1/query \
      --data-urlencode "query=count({__name__=\"${metric}\"})" | \
      jq -r '.data.result[0].value[1] // "0"')
    echo "$count $metric"
  done | sort -rn | head -20
```

For managed Prometheus (Amazon Managed Prometheus, Grafana Cloud), high cardinality directly maps to cost. Identify and fix high-cardinality metrics before they accumulate.
The Grafana Dashboard Hierarchy
Avoid dashboard sprawl — 200 dashboards that nobody uses is worse than 10 dashboards that everyone knows.
Tier 1: Platform Overview — one dashboard, visible to everyone. Cluster health, node utilisation, top 10 CPU/memory consumers, active incidents.
Tier 2: Service Health — one dashboard per service. RED metrics (Rate, Errors, Duration) for the service. Links to runbooks. Owned by the service team.
Tier 3: Deep Dive — detailed dashboards for specific investigation scenarios. Database query latency breakdown, Karpenter provisioning events, etcd latency. Used during incidents, not for routine monitoring.
Store dashboards as JSON in Git and provision via Grafana's dashboards ConfigMap or the Grafana Operator. Dashboard-as-code prevents the "who changed this dashboard?" problem.
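With kube-prometheus-stack, the Grafana sidecar loads any ConfigMap carrying the grafana_dashboard label, so provisioning a dashboard can be a single manifest committed to Git. A sketch, with a trimmed placeholder for the dashboard JSON:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-api-service-health
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # default label the sidecar watches for
data:
  api-service-health.json: |
    {
      "title": "API Service Health",
      "tags": ["tier-2", "api"],
      "panels": []
    }
```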
Frequently Asked Questions
How much storage does Prometheus need?
A rough formula: (samples/second) × (bytes/sample) × (seconds of retention). Prometheus stores approximately 1–2 bytes per sample. A cluster scraping 100,000 metrics every 30 seconds generates ~3,333 samples/second. At 2 bytes/sample and 15 days (1,296,000 seconds): ~8.6GB. Add 30–50% overhead for safety. In practice, 50–100GB is comfortable for a medium cluster with 15-day retention.
Should I use Thanos or Cortex for long-term storage?
For clusters where you need >30 days of metrics or cross-cluster federation, yes. Thanos sidecars alongside Prometheus upload blocks to S3/GCS, enabling unlimited retention and global querying. For most teams, 15–30 days of local Prometheus storage is sufficient and significantly simpler. Add Thanos when you hit a retention or federation requirement, not by default.
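When that requirement does arrive, the Thanos sidecar is switched on in the same values file. A minimal sketch, assuming a pre-created Secret named thanos-objstore that holds the objstore.yml bucket configuration:

```yaml
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:            # SecretKeySelector per the Prometheus CRD
        name: thanos-objstore         # Secret containing objstore.yml
        key: objstore.yml
```

The sidecar only uploads TSDB blocks; querying them still requires Thanos Query and Store Gateway components deployed separately.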
What's the difference between Grafana Loki and Elasticsearch for Kubernetes logs?
Loki indexes only labels (pod name, namespace, container); full-text search runs at query time via regex. Elasticsearch indexes all log content for fast full-text search. Loki is 10–50× cheaper at comparable log volume. Choose Loki unless you have a strong need for fast full-text log search — most Kubernetes operational queries filter by pod, namespace, or label, which Loki handles efficiently.
Is OpenTelemetry production-ready?
Yes. The OTEL Collector and the language SDKs for Go, Java, Python, Node.js, and .NET are all stable. The specification for traces (v1.0) and metrics (v1.0) is stable. Logs are stable as of OTEL spec 1.0 (2024). The migration path from vendor agents to OTEL is straightforward — see OpenTelemetry Migration from Vendor Agents.
How do I reduce Prometheus memory usage?
Primary levers: reduce cardinality (fewer unique label combinations), increase scrape interval (60s instead of 15s for non-critical metrics), use recording rules to reduce query complexity, set --storage.tsdb.retention.size to cap storage and force old blocks to be deleted. For large clusters, horizontal sharding via Thanos or VictoriaMetrics is the scaling path.
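Two of those levers can be applied per target on the ServiceMonitor itself. A sketch, assuming a hypothetical batch-worker service whose Go runtime metrics are never queried:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: batch-worker-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: batch-worker
  endpoints:
    - port: metrics
      interval: 60s                    # non-critical: scrape half as often
      metricRelabelings:
        # Drop series we never query before they reach the TSDB
        - sourceLabels: [__name__]
          regex: "go_gc_.*|go_memstats_.*"
          action: drop
```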
For a deep dive into Prometheus Operator CRDs (ServiceMonitor, AlertmanagerConfig, PrometheusRule) and multi-namespace RBAC, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring. For the full production monitoring stack with persistent storage, remote write, and recording rules, see Prometheus and Grafana on Kubernetes: Production Monitoring Stack. For the SLO and alert design that sits on top of this stack, see SLOs, Error Budgets, and Burn Rate Alerts. For migrating from vendor agents to OpenTelemetry, see OpenTelemetry Migration from Vendor Agents.
Building an observability stack for a production Kubernetes platform? Talk to us at Coding Protocols — we help platform teams implement monitoring that gives signal without noise.


