Cloud Engineering

Rightsizing AWS Costs: Finding and Fixing Overprovisioned Resources

Beginner · 25 min to complete · 9 min read

Most AWS bills carry 20–40% waste from overprovisioned EC2 instances, underused RDS instances, and forgotten resources. This tutorial shows you how to find that waste systematically and reduce it without impacting reliability.

Before you begin

  • AWS account with billing access
  • AWS CLI configured
  • Basic understanding of EC2 and RDS

Tags: AWS · Cost Optimization · FinOps · EC2 · EKS

AWS bills grow because it's easier to overprovision than to tune. A t3.xlarge "just to be safe" instead of a t3.large doubles the instance cost. An RDS db.r5.large averaging 10% CPU is wasting 90% of its compute.

This tutorial gives you a systematic process to find the waste and act on it.

The Four Categories of AWS Waste

  1. Overprovisioned resources — right type, wrong size (most common)
  2. Unused resources — running but doing nothing (snapshots, idle EBS, stopped instances)
  3. Wrong purchase type — On-Demand when Reserved or Spot would be cheaper
  4. Wrong storage class — S3 Standard for cold data, gp2 instead of gp3

Step 1: Get the High-Level Picture with Cost Explorer

In the AWS Console → Cost Management → Cost Explorer:

  1. Set date range to last 3 months
  2. Group by Service — see what's driving the bill
  3. Group by Usage Type — see EC2 instances, data transfer, storage separately
  4. Filter to your top spending service, group by Instance Type

Look for:

  • Instance types with low utilisation (Cost Explorer shows rightsizing recommendations)
  • Data transfer costs (often hidden, can be 20% of bill)
  • NAT Gateway data processed (usually avoidable)

Enable Cost Explorer Rightsizing Recommendations: Cost Management → Rightsizing Recommendations. This uses CloudWatch CPU metrics to suggest instance downsizes.

Step 2: Find Underutilised EC2 Instances

bash
# Find instances with avg CPU < 10% over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -v-14d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Average

# Better: use AWS Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --query "instanceRecommendations[?finding=='Overprovisioned'].[instanceArn,recommendationOptions[0].instanceType,utilizationMetrics[0].value]" \
  --output table
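One portability note: `-v-14d` is BSD/macOS `date` syntax and fails on Linux, where GNU `date` spells it `-d '14 days ago'` (the snapshot queries in Step 4 use the same flag). A small helper that works on both, as a sketch:

```bash
# Print "N days ago" as an ISO-8601 UTC timestamp on both GNU (Linux)
# and BSD (macOS) date. Tries GNU syntax first, falls back to BSD.
days_ago() {
  date -u -d "$1 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v-"$1"d +%Y-%m-%dT%H:%M:%SZ
}

# Usage: --start-time "$(days_ago 14)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
days_ago 14
```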

AWS Compute Optimizer uses 14 days of CloudWatch data to recommend downsizes. Enable it first:

bash
aws compute-optimizer update-enrollment-status --status Active

Wait 24 hours for it to process your account, then check recommendations:

bash
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned

Step 3: Rightsize EKS Pod Resource Requests

In Kubernetes, requests determine which node a pod lands on. Overprovisioned requests leave nodes underutilised — you pay for capacity that sits idle.

Find pods with low CPU utilisation using Prometheus:

promql
# Pods using less than 20% of their CPU request (over 24h)
(
  sum by (pod, namespace) (
    rate(container_cpu_usage_seconds_total{container!=""}[24h])
  )
  /
  sum by (pod, namespace) (
    kube_pod_container_resource_requests{resource="cpu", container!=""}
  )
) < 0.2

promql
# Memory: pods using less than 30% of their memory request
(
  sum by (pod, namespace) (
    container_memory_working_set_bytes{container!=""}
  )
  /
  sum by (pod, namespace) (
    kube_pod_container_resource_requests{resource="memory", container!=""}
  )
) < 0.3
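If Prometheus isn't available, the same ratio can be approximated by joining `kubectl top pods` usage with the pods' requests. The arithmetic itself is simple — here it is as an awk sketch over made-up sample data (columns: pod, CPU usage in millicores, CPU request in millicores):

```bash
# Flag pods using less than 20% of their CPU request.
# The sample data below is hypothetical; feed it real usage/request pairs.
flag_low_cpu() {
  awk '$3 > 0 && $2 / $3 < 0.2 { printf "%s uses %.0f%% of its request\n", $1, 100 * $2 / $3 }'
}

flag_low_cpu <<'EOF'
api-server  30   500
worker      450  500
batch-job   40   1000
EOF
# api-server uses 6% of its request
# batch-job uses 4% of its request
```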

Install the Vertical Pod Autoscaler (VPA) in recommendation mode to get automated suggestions:

bash
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Create a VPA object in recommendation mode (won't change pods automatically)
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # Recommend only, don't change pods
EOF

# After 24h, check recommendations
kubectl describe vpa my-app-vpa -n production

VPA recommendations:

  Container Recommendations:
    Container Name:  app
    Lower Bound:
      Cpu:     25m
      Memory:  64Mi
    Target:          ← use this
      Cpu:     100m
      Memory:  256Mi
    Upper Bound:
      Cpu:     500m
      Memory:  512Mi
    Uncapped Target:
      Cpu:     87m
      Memory:  230Mi

Update your deployment with the Target values. Then re-check in a week.
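`kubectl set resources` can apply the Target without hand-editing YAML. As a sketch, a tiny wrapper that prints the command for review before you run it against a cluster (deployment, namespace, and container names taken from the VPA example above):

```bash
# Build (and print) the kubectl command that applies a VPA Target as new
# resource requests. Review the printed command, then run it.
apply_target_cmd() {
  printf 'kubectl set resources deployment %s -n %s --containers=%s --requests=cpu=%s,memory=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

apply_target_cmd my-app production app 100m 256Mi
```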

Step 4: Find Unused Resources

Idle Elastic IPs (historically billed only while unattached — and since early 2024 AWS charges for all public IPv4 addresses, so an unattached EIP is pure waste):

bash
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[AllocationId,PublicIp]" \
  --output table

Unattached EBS volumes:

bash
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query "Volumes[*].[VolumeId,Size,CreateTime]" \
  --output table

Old snapshots (the query below filters by age only — more than 30 days — so review each one before deleting):

bash
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -u -v-30d +%Y-%m-%d)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table

Load balancers with no targets:

bash
aws elbv2 describe-target-groups \
  --query "TargetGroups[*].[TargetGroupArn,TargetType]" \
  --output text | while read -r arn type; do
    count=$(aws elbv2 describe-target-health \
      --target-group-arn "$arn" \
      --query "length(TargetHealthDescriptions)" \
      --output text)
    if [ "$count" = "0" ]; then
      echo "Empty target group: $arn"
    fi
  done

Old RDS snapshots:

bash
aws rds describe-db-snapshots \
  --query "DBSnapshots[?SnapshotCreateTime<='$(date -u -v-30d +%Y-%m-%d)' && SnapshotType=='manual'].[DBSnapshotIdentifier,AllocatedStorage,SnapshotCreateTime]" \
  --output table
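Manual snapshots are billed per GB-month, so totalling the AllocatedStorage column shows what the old ones actually cost. A sketch, assuming roughly $0.095/GB-month for RDS snapshot storage in us-east-1 (check your region's price):

```bash
# Total GB held in old manual snapshots and a rough monthly cost.
# Price assumed: $0.095/GB-month (us-east-1 RDS snapshot storage).
snapshot_cost() {
  awk '{ gb += $1 } END { printf "%d GB ~ $%.2f/month\n", gb, gb * 0.095 }'
}

printf '100\n250\n' | snapshot_cost
# 350 GB ~ $33.25/month
```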

Step 5: Switch gp2 EBS to gp3

gp3 is cheaper than gp2 at the same baseline performance and includes 3,000 IOPS regardless of volume size (gp2's baseline scales at 3 IOPS/GB, minimum 100). For most volumes, gp3 is a drop-in replacement that costs roughly 20% less.

bash
# Find all gp2 volumes and modify them in place
# (each volume can be modified at most once every 6 hours)
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query "Volumes[*].[VolumeId,Size]" \
  --output text | while read -r vol_id size; do
    echo "Modifying $vol_id ($size GB) to gp3..."
    aws ec2 modify-volume --volume-id "$vol_id" --volume-type gp3
  done

The modification is live — no downtime, no detaching. It takes a few minutes per volume.
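The roughly 20% price difference is easy to estimate up front. A sketch, assuming us-east-1 list prices of $0.10/GB-month for gp2 and $0.08/GB-month for gp3 (check your region's pricing):

```bash
# Sum gp2 volume sizes (GB, one per line) and estimate monthly savings
# from moving them to gp3. Prices assumed: gp2 $0.10, gp3 $0.08 per GB-month.
gp3_savings() {
  awk '{ gb += $1 } END { printf "$%.2f/month across %d GB\n", gb * (0.10 - 0.08), gb }'
}

printf '100\n500\n1000\n' | gp3_savings
# $32.00/month across 1600 GB
```

Pipe the Size column from the describe-volumes query above into it.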

For EKS, update the StorageClass:

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"

Step 6: Reserved Instances for Predictable Workloads

For EC2 instances that run 24/7, a 1-year Reserved Instance typically saves around 40% over On-Demand; a 3-year commitment saves around 60%.

Only reserve instance types and sizes you're confident won't change. Use Compute Savings Plans instead of RIs for more flexibility — they apply to any instance type and family automatically.
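Before committing, sanity-check the discount against the actual rates for your instance type. The arithmetic as a sketch (the prices below are illustrative, not current AWS list prices):

```bash
# Percentage saved by a committed rate vs On-Demand, both in $/hour.
savings_pct() {
  awk -v od="$1" -v committed="$2" 'BEGIN { printf "%.0f%%\n", 100 * (od - committed) / od }'
}

savings_pct 0.1664 0.0998   # hypothetical On-Demand vs 1-yr committed rate
# 40%
```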

Check your On-Demand spend:

bash
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-04-01 \
  --granularity MONTHLY \
  --filter '{"Dimensions":{"Key":"PURCHASE_TYPE","Values":["On Demand"]}}' \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE

For EKS node groups, use Spot instances for stateless workloads with a fallback to On-Demand:

bash
# --node-role and --subnets are required — the ARN and subnet IDs below are placeholders
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name spot-workers \
  --capacity-type SPOT \
  --instance-types t3.medium t3.large t3a.medium \
  --scaling-config minSize=2,maxSize=20,desiredSize=5 \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --subnets subnet-aaaa1111 subnet-bbbb2222

Specifying three or more instance types reduces the chance that a capacity shortage in a single Spot pool interrupts all of your nodes at once.

Tracking Progress

Set a monthly budget alert:

bash
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "Monthly AWS Budget",
    "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}]
  }]'

Review Cost Explorer weekly for the first month after changes. Cost reductions take a full billing cycle to show up clearly.
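When reviewing, track the month-over-month delta rather than eyeballing totals. A quick sketch (the figures are hypothetical):

```bash
# Month-over-month percentage change between two Cost Explorer totals.
mom_change() {
  awk -v prev="$1" -v curr="$2" 'BEGIN { printf "%+.1f%%\n", 100 * (curr - prev) / prev }'
}

mom_change 1200 950   # last month $1200, this month $950
# -20.8%
```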

We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.

Struggling with this in production?

We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.