AWS CloudWatch: Logs, Metrics, Alarms, and Container Insights for EKS
CloudWatch is AWS's native observability service — logs, metrics, alarms, dashboards, and distributed tracing via X-Ray. For EKS, Container Insights adds Kubernetes-aware metrics and log aggregation. This guide covers CloudWatch Logs (log groups and retention), Metrics and metric math, Alarms including composite alarms, the CloudWatch Agent for custom metrics, Container Insights setup on EKS, Embedded Metric Format (EMF) for high-cardinality metrics, and X-Ray distributed tracing for service-to-service visibility.

CloudWatch is the default observability layer on AWS. Every AWS service emits metrics to CloudWatch, EC2 instances and EKS nodes can ship logs and custom metrics, and X-Ray provides distributed tracing across AWS services and application code. For EKS, Container Insights adds pod-level metrics and structured log collection.
Using CloudWatch well means understanding its cost structure (you pay per metric, per log ingestion, per alarm) and its data model (metrics are dimensioned time series; logs are searchable streams). Naively dumping all logs and metrics into CloudWatch at high resolution gets expensive quickly.
CloudWatch Logs
Log Groups and Log Streams
CloudWatch Logs organizes data into log groups (an application or service) and log streams (a single source — one pod, one Lambda invocation, one EC2 instance).
# Create a log group with retention
aws logs create-log-group \
  --log-group-name /eks/prod/payments-api

# Set retention policy (default: never expire)
# Valid values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653
aws logs put-retention-policy \
  --log-group-name /eks/prod/payments-api \
  --retention-in-days 30

Set retention on every log group. The default is never-expire, and log storage costs $0.03/GB/month (us-east-1). A busy cluster generating 10 GB/day adds roughly 300 GB — about $9/month in storage — every month, and that cost compounds indefinitely if logs never expire.
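A quick way to enforce this is to audit for log groups that still have no retention policy — a minimal boto3 sketch (the 30-day value is an illustrative default, not a recommendation):

import boto3

logs = boto3.client("logs")

# retentionInDays is absent from describe_log_groups output
# when a log group is set to never expire
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:
            print(f"no retention: {group['logGroupName']}")
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=30,  # illustrative default — pick per group
            )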
CloudWatch Logs Insights
CloudWatch Logs Insights is a query language for searching and analyzing log data:
# Find error rate over the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
| sort error_count desc
# Find slow requests (parse structured JSON logs)
fields @timestamp, @message
| parse @message '{"level":"*","latency_ms":*,' as level, latency
| filter latency > 1000
| stats avg(latency) as avg_latency, count() as request_count by bin(1m)
# Top error types
fields @timestamp, @message
| filter @message like /Exception/
| parse @message '* at *' as exception_type, location
| stats count() as occurrences by exception_type
| sort occurrences desc
| limit 20
# Run a query from the CLI
aws logs start-query \
  --log-group-name /eks/prod/payments-api \
  --start-time $(date -d '-1 hour' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'

# Get results (poll until complete)
aws logs get-query-results --query-id <query-id>

Insights queries are charged per GB scanned. Partition log groups by service and set short retention to keep query costs predictable.
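Because start-query returns immediately, scripts need to poll for results — a minimal boto3 sketch (log group and query string reuse the example above); the statistics field in the final response reports bytesScanned, which is what Insights bills on:

import time
import boto3

logs = boto3.client("logs")

# Start an asynchronous Logs Insights query over the last hour
resp = logs.start_query(
    logGroupName="/eks/prod/payments-api",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
)

# Poll until the query reaches a terminal state
while True:
    result = logs.get_query_results(queryId=resp["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})

print(result["statistics"])  # includes bytesScanned — the billable amount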
CloudWatch Metrics
Metric Model
Every CloudWatch metric is identified by:
- Namespace: `AWS/EKS`, `AWS/RDS`, `AWS/EC2`, or your custom namespace
- Metric name: `CPUUtilization`, `ReplicaLag`, `RequestCount`
- Dimensions: key-value pairs that filter the metric — `ClusterName=prod-cluster`, `Namespace=payments`
# Get EKS node CPU utilization for the last hour
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_cpu_utilization \
  --dimensions Name=ClusterName,Value=prod-cluster \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average

# Get available metrics for a namespace
aws cloudwatch list-metrics --namespace ContainerInsights

Custom Metrics with PutMetricData
# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "Payments/API" \
  --metric-data '[
    {
      "MetricName": "OrdersProcessed",
      "Dimensions": [
        {"Name": "Environment", "Value": "prod"},
        {"Name": "Region", "Value": "us-east-1"}
      ],
      "Value": 42,
      "Unit": "Count",
      "Timestamp": "2026-05-10T14:00:00Z"
    }
  ]'

Custom metrics cost $0.30/metric/month for the first 10,000 metrics. High-cardinality metrics (one metric per user, per request ID) can generate thousands of dimension combinations and become expensive. Use EMF (Embedded Metric Format) for high-cardinality metrics — it's cheaper and doesn't require separate API calls.
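Before reaching for EMF, PutMetricData call volume can often be reduced with client-side aggregation: publish one statistic set per flush interval instead of one call per observation. A boto3 sketch (the metric name, dimension, and in-memory batch are illustrative):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Observations collected in-process since the last flush
latencies_ms = [112.0, 98.5, 187.2, 145.3]

# One API call publishes the whole batch as a statistic set
cloudwatch.put_metric_data(
    Namespace="Payments/API",
    MetricData=[{
        "MetricName": "RequestLatency",
        "Dimensions": [{"Name": "Environment", "Value": "prod"}],
        "StatisticValues": {
            "SampleCount": len(latencies_ms),
            "Sum": sum(latencies_ms),
            "Minimum": min(latencies_ms),
            "Maximum": max(latencies_ms),
        },
        "Unit": "Milliseconds",
    }],
)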
Metric Math
CloudWatch Metric Math computes derived metrics from existing ones:
# Example: error rate = errors / total requests
aws cloudwatch get-metric-data \
  --metric-data-queries '[
    {
      "Id": "errors",
      "MetricStat": {
        "Metric": {"Namespace": "Payments/API", "MetricName": "ErrorCount"},
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "total",
      "MetricStat": {
        "Metric": {"Namespace": "Payments/API", "MetricName": "RequestCount"},
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "error_rate",
      "Expression": "errors / total * 100",
      "Label": "Error Rate (%)"
    }
  ]' \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

CloudWatch Alarms
Standard Alarms
# Alarm when API error rate > 1% for 5 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name payments-api-error-rate-high \
  --alarm-description "Payments API error rate exceeded 1%" \
  --namespace "Payments/API" \
  --metric-name ErrorRate \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1.0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts \
  --ok-actions arn:aws:sns:us-east-1:012345678901:platform-alerts

Key alarm configuration:
- `evaluation-periods`: the number of consecutive periods the metric must breach the threshold before the alarm fires. Use 3–5 periods to avoid noise from transient spikes.
- `treat-missing-data`: one of `notBreaching` (missing = OK), `breaching` (missing = alarm), `missing` (the default — the alarm transitions to INSUFFICIENT_DATA when data is absent), or `ignore` (the current alarm state is maintained). Use `notBreaching` for metrics that are only emitted when there's traffic.
- `datapoints-to-alarm`: the number of breaching datapoints within `evaluation-periods` needed to alarm. Defaults to `evaluation-periods`. Set it lower for M-of-N alerting — alarm if 3 of 5 periods breach, as sketched below.
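A boto3 sketch of the M-of-N variant — the alarm fires when any 3 of the last 5 one-minute periods breach (names and threshold reuse the example above):

import boto3

cloudwatch = boto3.client("cloudwatch")

# M-of-N alerting: 3 breaching datapoints out of 5 evaluation periods
cloudwatch.put_metric_alarm(
    AlarmName="payments-api-error-rate-high",
    Namespace="Payments/API",
    MetricName="ErrorRate",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    DatapointsToAlarm=3,  # fire on any 3 of the 5 periods
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:012345678901:platform-alerts"],
)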
Composite Alarms
Composite alarms combine multiple alarms with logical operators. Use composite alarms to reduce alert noise — alert on "high error rate AND high latency" rather than independently:
aws cloudwatch put-composite-alarm \
  --alarm-name payments-api-degraded \
  --alarm-description "Payments API is degraded (high errors AND high latency)" \
  --alarm-rule "ALARM(payments-api-error-rate-high) AND ALARM(payments-api-p99-latency-high)" \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts-critical \
  --actions-suppressor payments-api-maintenance-window \
  --actions-suppressor-wait-period 60 \
  --actions-suppressor-extension-period 120

The actions suppressor prevents the composite alarm from firing during a maintenance window — while the suppressor alarm is in ALARM state, the composite alarm's actions are suppressed.
Container Insights for EKS
Container Insights provides Kubernetes-aware metrics: pod CPU/memory, node utilization, cluster-level aggregations, and application-level container restart counts.
Installation
# Install CloudWatch Agent + Fluent Bit using the AWS-provided addon
aws eks create-addon \
  --cluster-name prod-cluster \
  --addon-name amazon-cloudwatch-observability \
  --addon-version v1.7.0-eksbuild.1 \
  --service-account-role-arn arn:aws:iam::012345678901:role/CloudWatchAgentRole

# Or deploy via kubectl using the quick-start manifest
ClusterName=prod-cluster
RegionName=us-east-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml \
  | sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/\"On\"/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/" \
  | kubectl apply -f -

IAM permissions for the CloudWatch Agent (attach to the node instance role or use IRSA):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Resource": "*"
    }
  ]
}

Container Insights Metrics
Container Insights publishes metrics to the ContainerInsights namespace:
| Metric | Dimensions | Purpose |
|---|---|---|
| pod_cpu_utilization | ClusterName, Namespace, PodName | Per-pod CPU % |
| pod_memory_utilization | ClusterName, Namespace, PodName | Per-pod memory % |
| pod_cpu_reserved_capacity | ClusterName, Namespace | Requested CPU vs node capacity |
| pod_memory_reserved_capacity | ClusterName, Namespace | Requested memory vs node capacity |
| node_cpu_utilization | ClusterName, NodeName | Node CPU % |
| node_memory_utilization | ClusterName, NodeName | Node memory % |
| pod_number_of_container_restarts | ClusterName, Namespace, PodName | CrashLoopBackOff detection |
# Create alarm for CrashLoopBackOff detection
aws cloudwatch put-metric-alarm \
  --alarm-name eks-pod-restart-storm \
  --namespace ContainerInsights \
  --metric-name pod_number_of_container_restarts \
  --dimensions Name=ClusterName,Value=prod-cluster Name=Namespace,Value=payments \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts

Fluent Bit Log Routing
Container Insights installs Fluent Bit as a DaemonSet to collect container logs and route them to CloudWatch Logs. Each container's stdout/stderr is automatically collected.
Configure Fluent Bit output to route logs to different log groups by namespace:
# ConfigMap for Fluent Bit configuration
# The AWS Container Insights Fluent Bit config uses the "application.*" tag prefix
# (set by the tail input plugin). "kube.*" is a Fluentd convention — not used here.
[OUTPUT]
    Name cloudwatch_logs
    Match application.payments.*
    region us-east-1
    log_group_name /eks/prod/payments
    log_stream_prefix pod-
    auto_create_group On

[OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region us-east-1
    log_group_name /eks/prod/default
    log_stream_prefix pod-
    auto_create_group On

Note that Fluent Bit delivers a record to every [OUTPUT] whose Match pattern matches its tag — with the two blocks above, payments logs match both patterns and would be shipped twice. Make the patterns non-overlapping (for example by retagging payments records with a rewrite_tag filter) to avoid double-shipping.

Embedded Metric Format (EMF)
EMF lets applications emit high-cardinality metrics by embedding metric metadata in structured log events. CloudWatch parses these events and creates metrics without separate PutMetricData API calls. The logs are billed at log ingestion rates (cheaper than custom metrics for high cardinality).
import json
import time

def emit_metric(metric_name, value, dimensions):
    """Emit a metric using EMF to stdout (CloudWatch picks it up from logs)"""
    emf_event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "Payments/API",
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": "Milliseconds"}]
                }
            ]
        },
        metric_name: value,
        **dimensions
    }
    print(json.dumps(emf_event))

# Usage — emits a per-request metric with CustomerTier and Region dimensions
emit_metric(
    "RequestLatency",
    value=145.3,
    dimensions={"CustomerTier": "premium", "Region": "us-east-1"}
)

EMF is particularly useful for per-request metrics where the cardinality of dimensions (e.g., CustomerTier, APIVersion) would create too many unique metric streams via PutMetricData.
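If hand-rolling the EMF envelope feels error-prone, AWS also publishes a client library that emits the same format — a sketch using aws-embedded-metrics (assuming the PyPI package is installed; the function and values are illustrative):

from aws_embedded_metrics import metric_scope

# metric_scope injects a metrics logger and flushes the EMF event on return
@metric_scope
def handle_request(latency_ms, metrics):
    metrics.set_namespace("Payments/API")
    metrics.set_dimensions({"CustomerTier": "premium"})
    metrics.put_metric("RequestLatency", latency_ms, "Milliseconds")

handle_request(145.3)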
X-Ray Distributed Tracing
X-Ray provides request tracing across services — from the frontend through API gateways, into microservices, and out to databases and downstream APIs.
SDK Integration
# Python — instrument with X-Ray
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch common libraries (requests, boto3, psycopg2, etc.)
patch_all()

xray_recorder.configure(service='payments-api')

@xray_recorder.capture('process_payment')
def process_payment(payment_data):
    # This function creates a subsegment in the current trace
    with xray_recorder.in_subsegment('validate_card') as subsegment:
        subsegment.put_metadata('card_type', payment_data['card_type'])
        result = validate_card(payment_data)
    return result

// Go — instrument with X-Ray
import "github.com/aws/aws-xray-sdk-go/xray"

func ProcessPayment(ctx context.Context, data PaymentData) error {
    ctx, seg := xray.BeginSegment(ctx, "payments-api")
    defer seg.Close(nil)

    _, subSeg := xray.BeginSubsegment(ctx, "validate-card")
    if err := validateCard(ctx, data); err != nil {
        subSeg.Close(err)
        return err
    }
    subSeg.Close(nil)
    return nil
}

X-Ray Daemon on EKS
X-Ray requires the X-Ray daemon to receive trace data from the SDK and forward it to the X-Ray service:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xray-daemon
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: xray-daemon
  template:
    metadata:
      labels:
        app: xray-daemon
    spec:
      containers:
      - name: xray-daemon
        image: amazon/aws-xray-daemon:3.x
        ports:
        - containerPort: 2000
          protocol: UDP
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 32m
            memory: 24Mi
      serviceAccountName: xray-daemon  # needs IAM permissions xray:PutTraceSegments and xray:PutTelemetryRecords

Applications send trace data to the daemon via UDP on port 2000. The daemon batches and forwards to the X-Ray service.
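The SDK can be pointed at the daemon programmatically — a minimal sketch with the Python SDK (the DNS name assumes a ClusterIP Service named xray-daemon exposing UDP 2000 in the monitoring namespace; the DaemonSet above needs such a Service for this name to resolve):

from aws_xray_sdk.core import xray_recorder

# Route trace data to the in-cluster daemon instead of the default 127.0.0.1:2000
xray_recorder.configure(
    service='payments-api',
    daemon_address='xray-daemon.monitoring.svc.cluster.local:2000',
)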
Alternatively, set the daemon address through an environment variable:

# Environment variable — SDK will send traces to this host:port
AWS_XRAY_DAEMON_ADDRESS=xray-daemon.monitoring.svc.cluster.local:2000

Frequently Asked Questions
How do I reduce CloudWatch costs?
The main cost drivers and how to address them:
- Log ingestion: set retention policies on all log groups. Filter logs at the source (Fluent Bit) to drop debug/trace-level logs before they reach CloudWatch. Ship lower-priority logs to S3 instead. To find the biggest storage offenders, see the sketch after this list.
- Custom metrics: use EMF for high-cardinality application metrics instead of PutMetricData. Keep non-critical metrics at standard 1-minute resolution — high-resolution metrics don't cost more to store, but high-resolution alarms do (see below).
- Alarms: delete alarms for metrics you no longer care about — each alarm costs $0.10/month (standard, 60-second period) or $0.30/month (high-resolution, 10- or 30-second period).
- Logs Insights queries: queries are charged per GB scanned. Use shorter time windows and specific log groups rather than querying all logs.
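To see where log storage money is going, rank log groups by stored bytes — a boto3 sketch (describe_log_groups returns storedBytes for each group):

import boto3

logs = boto3.client("logs")

# Collect every log group with its stored size
groups = []
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    groups.extend(page["logGroups"])

# Top 10 by stored bytes — the biggest storage spenders
for g in sorted(groups, key=lambda g: g.get("storedBytes", 0), reverse=True)[:10]:
    size_gb = g.get("storedBytes", 0) / 1e9
    retention = g.get("retentionInDays", "never expires")
    print(f"{g['logGroupName']}: {size_gb:.1f} GB (retention: {retention})")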
What's the difference between CloudWatch and Prometheus/Grafana for EKS?
| | CloudWatch | Prometheus + Grafana |
|---|---|---|
| Setup | Managed, enabled via add-on | Self-hosted or Amazon Managed Prometheus |
| Kubernetes awareness | Container Insights adds K8s dimensions | Native K8s integration via kube-state-metrics |
| Query language | CloudWatch Metrics Insights (SQL-like) | PromQL |
| Retention | Metrics: 15 months with automatic rollups; logs: configurable | Configurable (typically shorter due to storage cost) |
| Cost | Per metric/alarm/log GB | Storage cost + compute for Prometheus |
| Alerting | CloudWatch Alarms → SNS | Alertmanager → PagerDuty/Slack |
| Dashboards | CloudWatch Dashboards | Grafana (more powerful) |
For teams already on AWS with simple metrics needs, CloudWatch with Container Insights is the path of least resistance. For complex metrics, custom dashboards, or multi-cloud monitoring, Prometheus + Grafana gives more flexibility. Amazon Managed Service for Prometheus removes the Prometheus operational burden.
Can CloudWatch Alarms trigger EKS scaling?
CloudWatch Alarms can trigger any action via SNS. For EKS autoscaling:
- HPA: scales on the Kubernetes Metrics API (metrics-server or the Prometheus Adapter), not CloudWatch directly
- KEDA: can use CloudWatch as a scaler — scale pods based on any CloudWatch metric
- Karpenter: node scaling is triggered by unschedulable pods, not CloudWatch alarms
For scheduled scaling (scale up before business hours), use KEDA's cron scaler or EventBridge Scheduler to call the Kubernetes API.
For Prometheus and Grafana as a CloudWatch complement for richer EKS metrics, see Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry in Production. For KEDA that scales EKS workloads based on CloudWatch metrics, see KEDA: Event-Driven Autoscaling for Kubernetes. For the IAM role that grants the CloudWatch Agent and X-Ray daemon permission to publish metrics and traces, see AWS IAM: Roles, Policies, and IRSA for EKS.
Setting up Container Insights for a production EKS cluster, designing CloudWatch alarms for an SLO-based alerting strategy, or integrating X-Ray tracing across a microservices architecture? Talk to us at Coding Protocols — we help platform teams build observability stacks that surface problems before they become incidents.


