Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry in Production
Metrics, logs, and traces are not optional on a production Kubernetes platform. Here's how to build an observability stack with Prometheus, Grafana, and OpenTelemetry that gives you the signals you need without drowning in noise or cost.

A Kubernetes cluster without observability is a black box. Pods restart and you don't know why. Latency spikes and you can't tell which service. Node pressure triggers evictions and by the time you investigate, the evidence is gone.
A Kubernetes cluster with too much observability is a different problem: $40,000/month in Datadog bills, Prometheus running out of disk every two weeks, dashboards nobody looks at, and alert fatigue that trains your team to ignore PagerDuty.
This guide covers the production-calibrated observability stack — what to collect, how to structure it, and the specific configurations that prevent it from becoming unmanageable.
The Three Pillars in a Kubernetes Context
Metrics — numeric time-series data. CPU, memory, request rate, error rate, latency percentiles. Prometheus is the standard; every Kubernetes component exposes Prometheus-format metrics.
Logs — structured or unstructured event streams from application and system processes. Loki (Grafana's log aggregation system), Elasticsearch, or CloudWatch Logs are the common stores.
Traces — distributed request traces showing how a request flows through multiple services. OpenTelemetry is the standard instrumentation layer; Tempo (Grafana), Jaeger, or Zipkin are common backends.
All three feed into Grafana for visualisation and alerting. The integration point is OpenTelemetry Collector — a vendor-neutral pipeline for collecting, processing, and exporting all three signal types to any backend.
The kube-prometheus-stack
The fastest path to production-ready Kubernetes metrics is kube-prometheus-stack (formerly prometheus-operator). It deploys:
- Prometheus Operator — manages Prometheus and Alertmanager as Kubernetes resources
- Prometheus — metrics collection and storage
- Alertmanager — alert routing and deduplication
- Grafana — dashboards and visualisation
- kube-state-metrics — cluster-state metrics (Deployment replicas, Pod status, PVC status)
- node-exporter — node-level metrics (CPU, memory, disk, network per node)
- Pre-built dashboards for Kubernetes cluster overview, node resources, pod resources
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 67.0.0 \
  --values prometheus-values.yaml \
  --wait
```

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d        # How long to keep raw metrics
    retentionSize: 50GB   # Size-based retention cap
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
    # Scrape all ServiceMonitors and PodMonitors in any namespace
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: "change-me-via-secret"
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.example.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com

# Disable default rules that don't apply to this cluster
defaultRules:
  rules:
    kubeProxy: false   # Disable if using Cilium (no kube-proxy)
```

ServiceMonitor and PodMonitor
The Prometheus Operator uses ServiceMonitor and PodMonitor CRDs to configure scrape targets. Instead of editing prometheus.yml, you create these resources alongside your application.
```yaml
# ServiceMonitor for an HTTP service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
  namespace: production
  labels:
    app: api
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics        # Named port on the Service
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - production
```

```yaml
# PodMonitor for pods without a Service (DaemonSets, sidecars)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
```

The Prometheus Operator watches for these resources cluster-wide and automatically adds the scrape targets to the Prometheus configuration. Application teams create their own ServiceMonitor in their namespace; there is no central Prometheus config file to edit. For a deep dive into ServiceMonitor, PodMonitor, AlertmanagerConfig, and multi-namespace monitoring, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring.
Recording Rules: Taming Cardinality
Raw Prometheus metrics can be expensive to query. A rate() over a high-cardinality metric at query time scans millions of time series. Recording rules pre-compute expensive queries and store the results as new, lower-cardinality metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: api.rules
      interval: 30s
      rules:
        # Pre-compute request rate by service
        - record: job:http_requests_total:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, namespace)

        # Pre-compute error rate
        - record: job:http_errors_total:rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, namespace)
            /
            sum(rate(http_requests_total[5m])) by (job, namespace)

        # P95 latency per service
        - record: job:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, namespace)
            )
```

Dashboard queries reference the recording rule metrics (job:http_requests_total:rate5m) instead of the raw metric. Dashboards load in milliseconds; alerts evaluate in milliseconds.
Alerting
AlertManager routes alerts to Slack, PagerDuty, email, or any webhook. Alert rules are defined as PrometheusRule resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
spec:
  groups:
    - name: api.alerts
      rules:
        - alert: HighErrorRate
          expr: job:http_errors_total:rate5m > 0.05
          for: 5m
          labels:
            severity: warning
            team: api
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 5
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
```

Alert fatigue prevention:
- Every alert needs a runbook_url — if there's no runbook, there's no alert
- for: 5m means the condition must be true for 5 continuous minutes before firing — prevents alerts on transient spikes
- Use severity: warning liberally and severity: critical sparingly — critical should mean "wake someone up now"
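Routing is the other half of the picture: with the Prometheus Operator, alert routing can itself be declared as a namespaced AlertmanagerConfig resource. A minimal sketch follows, where the receiver names, Slack channel, and referenced Secrets (the webhook URL and PagerDuty routing key) are placeholders you would create yourself:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: default-routing
  namespace: monitoring
spec:
  route:
    receiver: slack-warnings            # default: everything goes to Slack
    groupBy: ["alertname", "namespace"]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    routes:
      - matchers:
          - name: severity
            value: critical
            matchType: "="
        receiver: pagerduty-oncall      # only critical pages a human
  receivers:
    - name: slack-warnings
      slackConfigs:
        - channel: "#platform-alerts"
          apiURL:
            name: slack-webhook         # Secret holding the webhook URL
            key: url
    - name: pagerduty-oncall
      pagerdutyConfigs:
        - routingKey:
            name: pagerduty-events-key  # Secret holding the Events API routing key
            key: routingKey
```

Note that the operator scopes an AlertmanagerConfig to alerts from its own namespace by default; cluster-wide defaults are usually set in the chart's alertmanager.config values instead.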
Logs with Loki
Loki (Grafana's log aggregation system) is the cost-effective alternative to Elasticsearch for Kubernetes logs. It indexes only metadata (labels), not the full log content — dramatically cheaper at scale.
```bash
helm upgrade --install loki grafana/loki \
  --namespace monitoring \
  --version 6.22.0 \
  --values loki-values.yaml
```

```yaml
# loki-values.yaml — S3-backed storage for production
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      region: us-east-1
      bucketnames: my-loki-logs
  schema_config:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
```

Promtail ships logs from pods to Loki:

```bash
helm upgrade --install promtail grafana/promtail \
  --namespace monitoring \
  --set loki.serviceName=loki
```

Promtail runs as a DaemonSet, reads container logs from the node's /var/log/pods/ directory, and adds Kubernetes metadata labels (pod name, namespace, container name) before shipping to Loki.
In Grafana, logs appear in the Explore view and can be correlated with metrics using shared labels (pod name, namespace). For a production-focused setup using Fluent Bit (a lighter-weight alternative to Promtail) with Loki on EKS, including S3 backend configuration and LogQL querying, see Kubernetes Logging: Fluent Bit and Grafana Loki.
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTEL) is the CNCF standard for application instrumentation. The OpenTelemetry Collector is the central processing pipeline — applications send traces, metrics, and logs to it; it processes and exports to backends.
```bash
helm upgrade --install opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --values otel-values.yaml
```

```yaml
# otel-values.yaml
mode: deployment   # or daemonset for node-local collection

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
    # Add Kubernetes metadata to all telemetry
    k8sattributes:
      extract:
        metadata:
          - k8s.namespace.name
          - k8s.pod.name
          - k8s.deployment.name
          - k8s.node.name

  exporters:
    otlp/tempo:
      endpoint: tempo:4317
      tls:
        insecure: true
    prometheusremotewrite:
      endpoint: http://kube-prometheus-stack-prometheus:9090/api/v1/write
    loki:
      endpoint: http://loki:3100/loki/api/v1/push

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [prometheusremotewrite]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [loki]
```

Applications instrument with the OTEL SDK for their language and send to opentelemetry-collector.monitoring.svc.cluster.local:4317.
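Pointing an application at the Collector is mostly configuration rather than code. A minimal sketch, assuming a hypothetical api Deployment in the production namespace, using the standard OTEL SDK environment variables:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # placeholder image
          env:
            - name: OTEL_SERVICE_NAME
              value: "api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://opentelemetry-collector.monitoring.svc.cluster.local:4317"
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=production"
```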
Grafana Tempo stores traces and integrates with Grafana for trace visualisation and exemplar linking (jump from a metric spike directly to the trace that caused it).
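Registering Tempo and Loki as Grafana datasources can ride along in the same kube-prometheus-stack values file. A sketch, assuming the in-cluster Service names used above and a Tempo HTTP query endpoint on port 3200 (adjust both to your deployment):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      url: http://loki:3100
    - name: Tempo
      type: tempo
      uid: tempo
      url: http://tempo:3200          # Tempo's HTTP query port (deployment-specific)
      jsonData:
        tracesToLogsV2:
          datasourceUid: loki         # jump from a span to the pod's logs
          filterByTraceID: true
```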
Cardinality: The Cost Killer
High cardinality is the most common cause of Prometheus cost explosions. Cardinality is the number of unique time series — each unique combination of metric name and label values is one series.
Dangerous label patterns:
```go
// WRONG — user_id in a label creates one series per user
prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_total",
}, []string{"method", "path", "status", "user_id"}) // user_id = millions of series

// RIGHT — aggregate at a higher level
prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_total",
}, []string{"method", "status_class"}) // status_class = "2xx", "4xx", "5xx"
```

Never use these as metric label values: user IDs, request IDs, session tokens, email addresses, or any other high-cardinality identifier.
Cardinality audit:
```bash
# Find the highest-cardinality metrics
curl -s http://prometheus:9090/api/v1/label/__name__/values | \
  jq -r '.data[]' | while read -r metric; do
    count=$(curl -sG http://prometheus:9090/api/v1/query \
      --data-urlencode "query=count({__name__=\"${metric}\"})" | \
      jq -r '.data.result[0].value[1] // "0"')
    echo "$count $metric"
  done | sort -rn | head -20
```

For managed Prometheus (Amazon Managed Prometheus, Grafana Cloud), high cardinality directly maps to cost. Identify and fix high-cardinality metrics before they accumulate.
The Grafana Dashboard Hierarchy
Avoid dashboard sprawl — 200 dashboards that nobody uses is worse than 10 dashboards that everyone knows.
Tier 1: Platform Overview — one dashboard, visible to everyone. Cluster health, node utilisation, top 10 CPU/memory consumers, active incidents.
Tier 2: Service Health — one dashboard per service. RED metrics (Rate, Errors, Duration) for the service. Links to runbooks. Owned by the service team.
Tier 3: Deep Dive — detailed dashboards for specific investigation scenarios. Database query latency breakdown, Karpenter provisioning events, etcd latency. Used during incidents, not for routine monitoring.
Store dashboards as JSON in Git and provision via Grafana's dashboards ConfigMap or the Grafana Operator. Dashboard-as-code prevents the "who changed this dashboard?" problem.
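With kube-prometheus-stack, the Grafana sidecar loads any ConfigMap carrying the grafana_dashboard label, so provisioning a dashboard can be a single manifest committed to Git. A sketch, with a trimmed placeholder for the dashboard JSON:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-api-service-health
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # default label the sidecar watches for
data:
  api-service-health.json: |
    {
      "title": "API Service Health",
      "tags": ["tier-2", "api"],
      "panels": []
    }
```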
Frequently Asked Questions
How much storage does Prometheus need?
A rough formula: (samples/second) × (bytes/sample) × (seconds of retention). Prometheus stores approximately 1–2 bytes per sample. A cluster scraping 100,000 metrics every 30 seconds generates ~3,333 samples/second. At 2 bytes/sample and 15 days (1,296,000 seconds): ~8.6GB. Add 30–50% overhead for safety. In practice, 50–100GB is comfortable for a medium cluster with 15-day retention.
Should I use Thanos or Cortex for long-term storage?
For clusters where you need >30 days of metrics or cross-cluster federation, yes. Thanos sidecars alongside Prometheus upload blocks to S3/GCS, enabling unlimited retention and global querying. For most teams, 15–30 days of local Prometheus storage is sufficient and significantly simpler. Add Thanos when you hit a retention or federation requirement, not by default.
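When that requirement does arrive, the Thanos sidecar is switched on in the same values file. A minimal sketch, assuming a pre-created Secret named thanos-objstore that holds the objstore.yml bucket configuration:

```yaml
prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:            # SecretKeySelector per the Prometheus CRD
        name: thanos-objstore         # Secret containing objstore.yml
        key: objstore.yml
```

The sidecar only uploads TSDB blocks; querying them still requires Thanos Query and Store Gateway components deployed separately.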
What's the difference between Grafana Loki and Elasticsearch for Kubernetes logs?
Loki indexes only labels (pod name, namespace, container); full-text search runs at query time via regex. Elasticsearch indexes all log content for fast full-text search. Loki is 10–50× cheaper at comparable log volume. Choose Loki unless you have a strong need for fast full-text log search — most Kubernetes operational queries filter by pod, namespace, or label, which Loki handles efficiently.
Is OpenTelemetry production-ready?
Yes. The OTEL Collector and the language SDKs for Go, Java, Python, Node.js, and .NET are all stable. The specification for traces (v1.0) and metrics (v1.0) is stable. Logs are stable as of OTEL spec 1.0 (2024). The migration path from vendor agents to OTEL is straightforward — see OpenTelemetry Migration from Vendor Agents.
How do I reduce Prometheus memory usage?
Primary levers: reduce cardinality (fewer unique label combinations), increase scrape interval (60s instead of 15s for non-critical metrics), use recording rules to reduce query complexity, set --storage.tsdb.retention.size to cap storage and force old blocks to be deleted. For large clusters, horizontal sharding via Thanos or VictoriaMetrics is the scaling path.
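Two of those levers can be applied per target on the ServiceMonitor itself. A sketch, assuming a hypothetical batch-worker service whose Go runtime metrics are never queried:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: batch-worker-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: batch-worker
  endpoints:
    - port: metrics
      interval: 60s                    # non-critical: scrape half as often
      metricRelabelings:
        # Drop series we never query before they reach the TSDB
        - sourceLabels: [__name__]
          regex: "go_gc_.*|go_memstats_.*"
          action: drop
```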
For a deep dive into Prometheus Operator CRDs (ServiceMonitor, AlertmanagerConfig, PrometheusRule) and multi-namespace RBAC, see Prometheus Operator: ServiceMonitor, AlertManager, and Production Monitoring. For the full production monitoring stack with persistent storage, remote write, and recording rules, see Prometheus and Grafana on Kubernetes: Production Monitoring Stack. For the SLO and alert design that sits on top of this stack, see SLOs, Error Budgets, and Burn Rate Alerts. For migrating from vendor agents to OpenTelemetry, see OpenTelemetry Migration from Vendor Agents.
Building an observability stack for a production Kubernetes platform? Talk to us at Coding Protocols — we help platform teams implement monitoring that gives signal without noise.


