14 min read · May 4, 2026

AWS CloudWatch: Logs, Metrics, Alarms, and Container Insights for EKS

CloudWatch is AWS's native observability service — logs, metrics, alarms, dashboards, and distributed tracing via X-Ray. For EKS, Container Insights adds Kubernetes-aware metrics and log aggregation. This covers CloudWatch Logs with log groups and retention, Metrics and metric math, Alarms with composite alarms, CloudWatch Agent for custom metrics, Container Insights setup on EKS, EMF (Embedded Metric Format) for high-cardinality metrics, and X-Ray distributed tracing for service-to-service visibility.

Coding Protocols Team
Platform Engineering

CloudWatch is the default observability layer on AWS. Every AWS service emits metrics to CloudWatch, EC2 instances and EKS nodes can ship logs and custom metrics, and X-Ray provides distributed tracing across AWS services and application code. For EKS, Container Insights adds pod-level metrics and structured log collection.

Using CloudWatch well means understanding its cost structure (you pay per metric, per log ingestion, per alarm) and its data model (metrics are dimensioned time series; logs are searchable streams). Naively dumping all logs and metrics into CloudWatch at high resolution gets expensive quickly.


CloudWatch Logs

Log Groups and Log Streams

CloudWatch Logs organizes data into log groups (an application or service) and log streams (a single source — one pod, one Lambda invocation, one EC2 instance).

bash
# Create a log group with retention
aws logs create-log-group \
  --log-group-name /eks/prod/payments-api

# Set retention policy (default: never expire)
aws logs put-retention-policy \
  --log-group-name /eks/prod/payments-api \
  --retention-in-days 30    # 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653

Set retention on every log group. The default is never-expire, and log storage costs $0.03/GB-month (us-east-1). A busy cluster generating 10 GB/day accumulates roughly $9/month in storage charges after the first month alone — and because the data never expires, that bill keeps growing every month.
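A quick way to enforce this is to audit for never-expire groups: in the `describe_log_groups` API response, the `retentionInDays` key is simply absent when no retention policy is set. A minimal sketch (the helper name is ours; the dict shape matches the boto3/API response, which you would collect via the `describe_log_groups` paginator):

```python
def groups_without_retention(log_groups):
    """Return names of log groups whose logs never expire.

    `log_groups` is the "logGroups" list from logs.describe_log_groups();
    the API omits "retentionInDays" when no retention policy is set.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]

# Example response fragment — the second group has no retention policy
sample = [
    {"logGroupName": "/eks/prod/payments-api", "retentionInDays": 30},
    {"logGroupName": "/eks/prod/orders-api"},
]
```

Running this on `sample` flags `/eks/prod/orders-api`; in a real audit you would feed it every page of `describe_log_groups` output.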

CloudWatch Logs Insights

CloudWatch Logs Insights is a query language for searching and analyzing log data:

# Find error rate over the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m) as time_window
| sort time_window desc

# Find slow requests (parse structured JSON logs)
fields @timestamp, @message
| parse @message '{"level":"*","latency_ms":*,' as level, latency
| filter latency > 1000
| stats avg(latency) as avg_latency, count() as request_count by bin(1m)

# Top error types
fields @timestamp, @message
| filter @message like /Exception/
| parse @message '* at *' as exception_type, location
| stats count() as occurrences by exception_type
| sort occurrences desc
| limit 20
bash
# Run a query from the CLI
aws logs start-query \
  --log-group-name /eks/prod/payments-api \
  --start-time $(date -d '-1 hour' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'

# Get results (poll until complete)
aws logs get-query-results --query-id <query-id>

Insights queries are charged per GB scanned. Partition log groups by service and set short retention to keep query costs predictable.
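Because `start-query` is asynchronous, scripts have to poll until the status leaves `Running`. A small sketch of that loop, with the fetch step injected so it works with any client (`poll_query_results` is our name, not an AWS API):

```python
import time

def poll_query_results(fetch, interval=1.0, max_attempts=30):
    """Poll a Logs Insights query until it reaches a terminal status.

    `fetch` is any zero-arg callable returning a GetQueryResults-shaped
    dict — e.g. `lambda: logs.get_query_results(queryId=qid)` with a
    boto3 CloudWatch Logs client.
    """
    for _ in range(max_attempts):
        resp = fetch()
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(interval)
    raise TimeoutError("Logs Insights query did not finish in time")
```

With boto3 you would pass `lambda: logs.get_query_results(queryId=qid)` and inspect `resp["results"]` once the status is `Complete`.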


CloudWatch Metrics

Metric Model

Every CloudWatch metric is identified by:

  • Namespace: AWS/EKS, AWS/RDS, AWS/EC2, or your custom namespace
  • Metric name: CPUUtilization, ReplicaLag, RequestCount
  • Dimensions: key-value pairs that filter the metric — ClusterName=prod-cluster, Namespace=payments
bash
# Get EKS node CPU utilization for the last hour
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_cpu_utilization \
  --dimensions Name=ClusterName,Value=prod-cluster \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average

# Get available metrics for a namespace
aws cloudwatch list-metrics --namespace ContainerInsights

Custom Metrics with PutMetricData

bash
# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace "Payments/API" \
  --metric-data '[
    {
      "MetricName": "OrdersProcessed",
      "Dimensions": [
        {"Name": "Environment", "Value": "prod"},
        {"Name": "Region", "Value": "us-east-1"}
      ],
      "Value": 42,
      "Unit": "Count",
      "Timestamp": "2026-05-10T14:00:00Z"
    }
  ]'

Custom metrics cost $0.30/metric/month for the first 10,000 metrics. High-cardinality metrics (one metric per user, per request ID) can generate thousands of dimension combinations and become expensive. Use EMF (Embedded Metric Format) for high-cardinality metrics — it's cheaper and doesn't require separate API calls.

Metric Math

CloudWatch Metric Math computes derived metrics from existing ones:

bash
# Example: error rate = errors / total requests
aws cloudwatch get-metric-data \
  --metric-data-queries '[
    {
      "Id": "errors",
      "MetricStat": {
        "Metric": {"Namespace": "Payments/API", "MetricName": "ErrorCount"},
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "total",
      "MetricStat": {
        "Metric": {"Namespace": "Payments/API", "MetricName": "RequestCount"},
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "error_rate",
      "Expression": "errors / total * 100",
      "Label": "Error Rate (%)"
    }
  ]' \
  --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

CloudWatch Alarms

Standard Alarms

bash
# Alarm when API error rate > 1% for 5 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name payments-api-error-rate-high \
  --alarm-description "Payments API error rate exceeded 1%" \
  --namespace "Payments/API" \
  --metric-name ErrorRate \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1.0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts \
  --ok-actions arn:aws:sns:us-east-1:012345678901:platform-alerts

Key alarm configuration:

  • evaluation-periods: number of consecutive periods the metric must breach the threshold before the alarm fires. Use 3–5 periods to avoid noise from transient spikes.
  • treat-missing-data: notBreaching (missing = OK), breaching (missing = alarm), ignore (the alarm maintains its current state), missing (the alarm transitions to INSUFFICIENT_DATA). Use notBreaching for metrics that are only emitted when there's traffic.
  • datapoints-to-alarm: the number of breaching datapoints within evaluation-periods needed to alarm. Defaults to evaluation-periods. Set lower than evaluation-periods for M-of-N alerting: alarm if 3 of 5 periods breach.
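The M-of-N behavior is easy to misread, so here is a deliberately simplified model of the evaluation (GreaterThanThreshold semantics only; real CloudWatch evaluation also handles missing data, premature datapoints, and percentile statistics):

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """ALARM if at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints breach the threshold.
    `datapoints_to_alarm` defaults to `evaluation_periods` (N-of-N)."""
    m = datapoints_to_alarm or evaluation_periods
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for v in window if v > threshold)
    return "ALARM" if breaches >= m else "OK"

# Error-rate samples (%): one transient spike does not fire a 5-of-5 alarm...
state_spike = alarm_state([0.2, 4.0, 0.3, 0.1, 0.2], threshold=1.0, evaluation_periods=5)
# ...but 3-of-5 alerting catches a flapping breach
state_flap = alarm_state([2.0, 0.3, 1.5, 0.4, 3.0], threshold=1.0,
                         evaluation_periods=5, datapoints_to_alarm=3)
```

Here `state_spike` stays `OK` while `state_flap` goes to `ALARM` — the trade-off between noise suppression and sensitivity that datapoints-to-alarm controls.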

Composite Alarms

Composite alarms combine multiple alarms with logical operators. Use composite alarms to reduce alert noise — alert on "high error rate AND high latency" rather than independently:

bash
aws cloudwatch put-composite-alarm \
  --alarm-name payments-api-degraded \
  --alarm-description "Payments API is degraded (high errors AND high latency)" \
  --alarm-rule "ALARM(payments-api-error-rate-high) AND ALARM(payments-api-p99-latency-high)" \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts-critical \
  --actions-suppressor payments-api-maintenance-window \
  --actions-suppressor-wait-period 60 \
  --actions-suppressor-extension-period 120

The actions-suppressor prevents the composite alarm from firing during a maintenance window — when the suppressor alarm is in ALARM state, the composite alarm's actions are suppressed.


Container Insights for EKS

Container Insights provides Kubernetes-aware metrics: pod CPU/memory, node utilization, cluster-level aggregations, and application-level container restart counts.

Installation

bash
# Install CloudWatch Agent + Fluent Bit using the AWS-provided addon
aws eks create-addon \
  --cluster-name prod-cluster \
  --addon-name amazon-cloudwatch-observability \
  --addon-version v1.7.0-eksbuild.1 \
  --service-account-role-arn arn:aws:iam::012345678901:role/CloudWatchAgentRole

# Or deploy via kubectl using the quick-start manifest
ClusterName=prod-cluster
RegionName=us-east-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml \
  | sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/On/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/" \
  | kubectl apply -f -

IAM permissions for the CloudWatch Agent (attach to node instance role or use IRSA):

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Resource": "*"
    }
  ]
}

Container Insights Metrics

Container Insights publishes metrics to the ContainerInsights namespace:

| Metric | Dimensions | Purpose |
| --- | --- | --- |
| pod_cpu_utilization | ClusterName, Namespace, PodName | Per-pod CPU % |
| pod_memory_utilization | ClusterName, Namespace, PodName | Per-pod memory % |
| pod_cpu_reserved_capacity | ClusterName, Namespace | Requested CPU vs node capacity |
| pod_memory_reserved_capacity | ClusterName, Namespace | Requested memory vs node capacity |
| node_cpu_utilization | ClusterName, NodeName | Node CPU % |
| node_memory_utilization | ClusterName, NodeName | Node memory % |
| pod_number_of_container_restarts | ClusterName, Namespace, PodName | CrashLoopBackOff detection |
bash
# Create alarm for CrashLoopBackOff detection
aws cloudwatch put-metric-alarm \
  --alarm-name eks-pod-restart-storm \
  --namespace ContainerInsights \
  --metric-name pod_number_of_container_restarts \
  --dimensions Name=ClusterName,Value=prod-cluster Name=Namespace,Value=payments \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts

Fluent Bit Log Routing

Container Insights installs Fluent Bit as a DaemonSet to collect container logs and route them to CloudWatch Logs. Each container's stdout/stderr is automatically collected.

Configure Fluent Bit output to route logs to different log groups by namespace:

conf
# Fluent Bit OUTPUT sections (classic-mode config, typically mounted via a ConfigMap)
# The AWS Container Insights Fluent Bit config uses the "application.*" tag prefix
# (set by the tail input plugin). "kube.*" is a Fluentd convention — not used here.
# Note: Fluent Bit delivers a record to EVERY output whose Match pattern fits,
# so the catch-all application.* output below also receives payments logs.
# For exclusive routing, use non-overlapping patterns or a rewrite_tag filter.
[OUTPUT]
    Name cloudwatch_logs
    Match application.payments.*
    region us-east-1
    log_group_name /eks/prod/payments
    log_stream_prefix pod-
    auto_create_group On

[OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region us-east-1
    log_group_name /eks/prod/default
    log_stream_prefix pod-
    auto_create_group On

Embedded Metric Format (EMF)

EMF lets applications emit high-cardinality metrics by embedding metric metadata in structured log events. CloudWatch parses these events and creates metrics without separate PutMetricData API calls. The logs are billed at log ingestion rates (cheaper than custom metrics for high cardinality).

python
import json
import time

def emit_metric(metric_name, value, dimensions):
    """Emit a metric using EMF to stdout (CloudWatch picks it up from logs)"""
    emf_event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "Payments/API",
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": "Milliseconds"}]
                }
            ]
        },
        metric_name: value,
        **dimensions
    }
    print(json.dumps(emf_event))

# Usage — emits a per-request metric with CustomerTier and Region dimensions
emit_metric(
    "RequestLatency",
    value=145.3,
    dimensions={"CustomerTier": "premium", "Region": "us-east-1"}
)

EMF is particularly useful for per-request metrics where the cardinality of dimensions (e.g., CustomerTier, APIVersion) would create too many unique metric streams via PutMetricData.
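To see why cardinality matters, multiply out the dimension values — each unique combination becomes its own metric stream billed at the custom-metric rate. A back-of-the-envelope helper (the $0.30 figure is the first-tier us-east-1 rate quoted above; verify against current pricing):

```python
from math import prod

def custom_metric_cost(dimension_values, price_per_metric=0.30):
    """Return (stream_count, monthly_cost) if every dimension-value
    combination were published via PutMetricData as its own metric."""
    streams = prod(len(values) for values in dimension_values.values())
    return streams, streams * price_per_metric

# 3 tiers x 20 API versions x 4 regions = 240 metric streams
streams, cost = custom_metric_cost({
    "CustomerTier": ["free", "standard", "premium"],
    "APIVersion": [f"v{i}" for i in range(20)],
    "Region": ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"],
})
```

Even this modest dimension set yields 240 streams (~$72/month); add a per-customer dimension and the product explodes, which is exactly the case EMF's log-ingestion pricing handles better.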


X-Ray Distributed Tracing

X-Ray provides request tracing across services — from the frontend through API gateways, into microservices, and out to databases and downstream APIs.

SDK Integration

python
# Python — instrument with X-Ray
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch common libraries (requests, boto3, psycopg2, etc.)
patch_all()

xray_recorder.configure(service='payments-api')

@xray_recorder.capture('process_payment')
def process_payment(payment_data):
    # This function creates a subsegment in the current trace
    with xray_recorder.in_subsegment('validate_card') as subsegment:
        subsegment.put_metadata('card_type', payment_data['card_type'])
        result = validate_card(payment_data)
    return result
go
// Go — instrument with X-Ray
import (
    "context"

    "github.com/aws/aws-xray-sdk-go/xray"
)

func ProcessPayment(ctx context.Context, data PaymentData) error {
    ctx, seg := xray.BeginSegment(ctx, "payments-api")
    defer seg.Close(nil)

    // Use the subsegment's context so downstream calls attach to it
    subCtx, subSeg := xray.BeginSubsegment(ctx, "validate-card")
    if err := validateCard(subCtx, data); err != nil {
        subSeg.Close(err)
        return err
    }
    subSeg.Close(nil)
    return nil
}

X-Ray Daemon on EKS

X-Ray requires the X-Ray daemon to receive trace data from the SDK and forward it to the X-Ray service:

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xray-daemon
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: xray-daemon
  template:
    metadata:
      labels:
        app: xray-daemon
    spec:
      containers:
        - name: xray-daemon
          image: amazon/aws-xray-daemon:3.x
          ports:
            - containerPort: 2000
              protocol: UDP
          resources:
            limits:
              memory: 128Mi
            requests:
              cpu: 32m
              memory: 24Mi
      serviceAccountName: xray-daemon    # Needs IAM permission: xray:PutTraceSegments, xray:PutTelemetryRecords

Applications send trace data to the daemon via UDP on port 2000. The daemon batches and forwards to the X-Ray service.
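The wire format is simple: each UDP datagram is a one-line JSON header, a newline, then a segment document. A hand-rolled sketch for illustration — the field layout follows the X-Ray segment document format, but in practice the SDK builds and sends these for you:

```python
import json
import os
import socket
import time

def make_segment(name):
    """Build a minimal X-Ray segment document with locally generated IDs."""
    now = time.time()
    return {
        "name": name,
        "id": os.urandom(8).hex(),                                # 16 hex chars
        "trace_id": f"1-{int(now):08x}-{os.urandom(12).hex()}",   # 1-<epoch hex>-<random>
        "start_time": now,
        "end_time": now,
    }

def encode_datagram(segment):
    """UDP payload: JSON header line, newline, then the segment document."""
    header = json.dumps({"format": "json", "version": 1})
    return (header + "\n" + json.dumps(segment)).encode("utf-8")

# To actually emit (requires a reachable daemon, so left commented out):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(encode_datagram(make_segment("payments-api")),
#             ("xray-daemon.monitoring.svc.cluster.local", 2000))
```

Seeing the raw payload makes the daemon's role concrete: it is just a local UDP sink that batches these documents and calls PutTraceSegments on your behalf.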

Configure the X-Ray SDK to use the daemon's address:

bash
# Environment variable — SDK will send traces to this host:port
AWS_XRAY_DAEMON_ADDRESS=xray-daemon.monitoring.svc.cluster.local:2000

Frequently Asked Questions

How do I reduce CloudWatch costs?

The main cost drivers and how to address them:

  1. Log ingestion: Set log retention policies on all log groups. Use log filtering at the source (Fluent Bit) to drop debug/trace-level logs before they reach CloudWatch. Ship lower-priority logs to S3 instead.

  2. Custom metrics: Use EMF for high-cardinality application metrics instead of PutMetricData. Reduce metric resolution to 1 minute (Standard resolution) for non-critical metrics — high-resolution metrics don't cost more to store, but alarms with 10-second or 30-second evaluation periods cost $0.30/month vs $0.10/month for standard 60-second alarms.

  3. Alarms: Disable alarms for metrics you no longer care about — each alarm costs $0.10/month (standard, 60-second period) or $0.30/month (high-resolution, 10-second or 30-second period).

  4. Logs Insights queries: Queries are charged per GB scanned. Use shorter time windows and specific log groups rather than querying all logs.
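Putting these levers together, a rough steady-state estimator for a single log group (rates are the us-east-1 list prices cited in this article — $0.50/GB ingestion, $0.03/GB-month storage, $0.005/GB scanned — so verify against current pricing):

```python
def monthly_log_cost(gb_per_day, retention_days, gb_scanned_per_month=0.0,
                     ingest_rate=0.50, storage_rate=0.03, scan_rate=0.005):
    """Approximate monthly CloudWatch Logs cost (USD) for one log group.

    Storage assumes steady state: the group holds `retention_days`
    worth of logs at all times.
    """
    ingestion = gb_per_day * 30 * ingest_rate
    storage = gb_per_day * retention_days * storage_rate
    queries = gb_scanned_per_month * scan_rate
    return round(ingestion + storage + queries, 2)

# 10 GB/day with 30-day retention: ingestion ($150) dwarfs storage ($9)
```

The example also shows why filtering at the source matters most: ingestion dominates the bill long before storage or queries do.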

What's the difference between CloudWatch and Prometheus/Grafana for EKS?

| | CloudWatch | Prometheus + Grafana |
| --- | --- | --- |
| Setup | Managed, enabled via add-on | Self-hosted or Amazon Managed Prometheus |
| Kubernetes awareness | Container Insights adds K8s dimensions | Native K8s integration via kube-state-metrics |
| Query language | CloudWatch Metrics Insights (SQL-like) | PromQL |
| Retention | Configurable (default 15 months) | Configurable (typically shorter due to storage cost) |
| Cost | Per metric/alarm/log GB | Storage cost + compute for Prometheus |
| Alerting | CloudWatch Alarms → SNS | Alertmanager → PagerDuty/Slack |
| Dashboards | CloudWatch Dashboards | Grafana (more powerful) |

For teams already on AWS with simple metrics needs, CloudWatch with Container Insights is the path of least resistance. For complex metrics, custom dashboards, or multi-cloud monitoring, Prometheus + Grafana gives more flexibility. Amazon Managed Service for Prometheus removes the Prometheus operational burden.

Can CloudWatch Alarms trigger EKS scaling?

CloudWatch Alarms can trigger any action via SNS. For EKS autoscaling:

  • HPA: HPA uses the Kubernetes Metrics API (metrics-server or Prometheus Adapter), not CloudWatch directly
  • KEDA: KEDA can use CloudWatch as a scaler — scale pods based on any CloudWatch metric
  • Karpenter: node scaling is triggered by unschedulable pods, not CloudWatch alarms directly

For scheduled scaling (scale up before business hours), use KEDA's cron scaler or EventBridge Scheduler to call the Kubernetes API.


For Prometheus and Grafana as a CloudWatch complement for richer EKS metrics, see Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry in Production. For KEDA that scales EKS workloads based on CloudWatch metrics, see KEDA: Event-Driven Autoscaling for Kubernetes. For the IAM role that grants the CloudWatch Agent and X-Ray daemon permission to publish metrics and traces, see AWS IAM: Roles, Policies, and IRSA for EKS.

Setting up Container Insights for a production EKS cluster, designing CloudWatch alarms for an SLO-based alerting strategy, or integrating X-Ray tracing across a microservices architecture? Talk to us at Coding Protocols — we help platform teams build observability stacks that surface problems before they become incidents.
