Rightsizing AWS Costs: Finding and Fixing Overprovisioned Resources
Most AWS bills have 20–40% waste from overprovisioned EC2 instances, underused RDS, and forgotten resources. This tutorial shows you how to find it systematically and reduce it without impacting reliability.
Before you begin
- AWS account with billing access
- AWS CLI configured
- Basic understanding of EC2 and RDS
AWS bills grow because it's easier to overprovision than to tune. A t3.xlarge "just to be safe" instead of a t3.medium quadruples the instance cost. An RDS db.r5.large averaging 10% CPU is wasting 90% of its compute.
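The "just to be safe" premium is easy to quantify. A quick back-of-the-envelope sketch — the hourly rates below are illustrative examples, not current AWS list prices:

```shell
# Approximate monthly cost at ~730 hours/month, using illustrative rates
XLARGE_HOURLY=0.1664   # example t3.xlarge On-Demand rate (not a current quote)
MEDIUM_HOURLY=0.0416   # example t3.medium On-Demand rate (not a current quote)

xlarge_monthly=$(awk -v r="$XLARGE_HOURLY" 'BEGIN { printf "%.2f", r * 730 }')
medium_monthly=$(awk -v r="$MEDIUM_HOURLY" 'BEGIN { printf "%.2f", r * 730 }')

echo "t3.xlarge: \$${xlarge_monthly}/month, t3.medium: \$${medium_monthly}/month"
```

At these rates the gap is roughly $90/month per instance, before any RI or Savings Plan discount.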
This tutorial gives you a systematic process to find the waste and act on it.
The Four Categories of AWS Waste
- Overprovisioned resources — right type, wrong size (most common)
- Unused resources — running but doing nothing (snapshots, idle EBS, stopped instances)
- Wrong purchase type — On-Demand when Reserved or Spot would be cheaper
- Wrong storage class — S3 Standard for cold data, gp2 instead of gp3
Step 1: Get the High-Level Picture with Cost Explorer
In the AWS Console → Cost Management → Cost Explorer:
- Set date range to last 3 months
- Group by Service — see what's driving the bill
- Group by Usage Type — see EC2 instances, data transfer, storage separately
- Filter to your top spending service, group by Instance Type
Look for:
- Instance types with low utilisation (Cost Explorer shows rightsizing recommendations)
- Data transfer costs (often hidden, can be 20% of bill)
- NAT Gateway data processed (usually avoidable)
Enable Cost Explorer Rightsizing Recommendations: Cost Management → Rightsizing Recommendations. This uses CloudWatch CPU metrics to suggest instance downsizes.
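The console views above have a CLI equivalent. A sketch that builds a month-to-date Cost Explorer query grouped by service — the dates are computed portably, and the command is echoed rather than executed so you can review it first:

```shell
# Month-to-date spend grouped by service, via the Cost Explorer API.
# `date -u +%Y-%m-01` works on both GNU (Linux) and BSD (macOS) date.
START=$(date -u +%Y-%m-01)
END=$(date -u +%Y-%m-%d)

# Echoed rather than run; remove the `echo` to execute against your account
echo aws ce get-cost-and-usage \
  --time-period "Start=${START},End=${END}" \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

Widen the time period to three months to match the console view described above.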
Step 2: Find Underutilised EC2 Instances
# Find instances with avg CPU < 10% over 14 days
# Note: `date -v-14d` is BSD/macOS syntax; on Linux (GNU date) use
# `date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ` instead
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -u -v-14d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Average
# Better: use AWS Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --query "instanceRecommendations[?finding=='OVER_PROVISIONED'].[instanceArn,recommendationOptions[0].instanceType,utilizationMetrics[0].value]" \
  --output table
AWS Compute Optimizer uses 14 days of CloudWatch data to recommend downsizes. Enable it first:
aws compute-optimizer update-enrollment-status --status Active
Wait 24 hours for it to process your account, then check recommendations:
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=finding,values=OVER_PROVISIONED
Step 3: Rightsize EKS Pod Resource Requests
In Kubernetes, requests determine which node a pod lands on. Overprovisioned requests leave nodes underutilised — you pay for capacity that sits idle.
Find pods with low CPU utilisation using Prometheus:
# Pods using less than 20% of their CPU request (over 24h)
(
  sum by (pod, namespace) (
    rate(container_cpu_usage_seconds_total{container!=""}[24h])
  )
/
  sum by (pod, namespace) (
    kube_pod_container_resource_requests{resource="cpu", container!=""}
  )
) < 0.2
# Memory: pods using less than 30% of their memory request
(
  sum by (pod, namespace) (
    container_memory_working_set_bytes{container!=""}
  )
/
  sum by (pod, namespace) (
    kube_pod_container_resource_requests{resource="memory", container!=""}
  )
) < 0.3
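To sanity-check the thresholds, here is the same ratio the PromQL computes, worked on hypothetical numbers — a pod using about 30m of CPU against a 500m request:

```shell
# usage / request; below 0.2 means the pod is a downsizing candidate
USAGE_CORES=0.030    # ~30m of CPU actually used (hypothetical)
REQUEST_CORES=0.500  # 500m requested (hypothetical)

ratio=$(awk -v u="$USAGE_CORES" -v r="$REQUEST_CORES" \
  'BEGIN { printf "%.2f", u / r }')
echo "utilisation ratio: $ratio"
```

A ratio of 0.06 is well under the 0.2 threshold — this pod's request could likely be cut several-fold.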
Install the Vertical Pod Autoscaler (VPA) in recommendation mode to get automated suggestions:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Create a VPA object in recommendation mode (won't change pods automatically)
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't change pods
EOF
# After 24h, check recommendations
kubectl describe vpa my-app-vpa -n production
VPA recommendations:
Container Recommendations:
  Container Name:  app
  Lower Bound:
    Cpu:     25m
    Memory:  64Mi
  Target:                ← use this
    Cpu:     100m
    Memory:  256Mi
  Upper Bound:
    Cpu:     500m
    Memory:  512Mi
  Uncapped Target:
    Cpu:     87m
    Memory:  230Mi
Update your deployment with the Target values. Then re-check in a week.
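Applying the Target values is a one-liner with `kubectl set resources`. A sketch, using the example names from this tutorial (my-app in the production namespace) and the Target values shown above — the command is composed and echoed rather than executed, so review it before running against your cluster:

```shell
# Compose the kubectl command that sets requests to the VPA Target values.
# my-app / production are the hypothetical names used throughout this tutorial.
CMD="kubectl set resources deployment my-app -n production --requests=cpu=100m,memory=256Mi"
echo "$CMD"
```

Note this restarts the pods (it changes the pod template), so roll it out like any other deployment change.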
Step 4: Find Unused Resources
Idle Elastic IPs (AWS now bills for all public IPv4 addresses, and an unattached EIP is pure waste):
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[AllocationId,PublicIp]" \
  --output table
Unattached EBS volumes:
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query "Volumes[*].[VolumeId,Size,CreateTime]" \
  --output table
EBS snapshots older than 30 days (the query filters on age only — verify nothing still depends on a snapshot before deleting it):
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -u -v-30d +%Y-%m-%d)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table
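The `date -u -v-30d` used in the snapshot queries is BSD/macOS syntax and fails on Linux. A portable way to compute the cutoff, detecting which `date` flavour is installed:

```shell
# Cutoff date 30 days ago, working on both GNU (Linux) and BSD (macOS) date
if date -u -d '30 days ago' +%Y-%m-%d >/dev/null 2>&1; then
  CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%d)   # GNU date
else
  CUTOFF=$(date -u -v-30d +%Y-%m-%d)             # BSD/macOS date
fi
echo "Cutoff: $CUTOFF"
```

Substitute `$CUTOFF` into the `--query` expressions in place of the inline `$(date ...)` call.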
Target groups with no registered targets (often attached to idle load balancers):
aws elbv2 describe-target-groups \
  --query "TargetGroups[*].[TargetGroupArn,TargetType]" \
  --output text | while read -r arn type; do
  count=$(aws elbv2 describe-target-health \
    --target-group-arn "$arn" \
    --query "length(TargetHealthDescriptions)" \
    --output text)
  if [ "$count" = "0" ]; then
    echo "Empty target group: $arn"
  fi
done
Old manual RDS snapshots (again using BSD/macOS date syntax; on Linux use `date -u -d '30 days ago' +%Y-%m-%d`):
aws rds describe-db-snapshots \
  --query "DBSnapshots[?SnapshotCreateTime<='$(date -u -v-30d +%Y-%m-%d)' && SnapshotType=='manual'].[DBSnapshotIdentifier,AllocatedStorage,SnapshotCreateTime]" \
  --output table
Step 5: Switch gp2 EBS to gp3
gp3 is cheaper than gp2 at the same baseline performance and gives you 3000 IOPS free (vs gp2's baseline that scales with volume size). For most volumes, gp3 is a drop-in replacement that costs 20% less.
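To see what that 20% means in absolute terms, a quick calculation at illustrative per-GB prices (us-east-1-style rates, not current quotes):

```shell
GP2_PER_GB=0.10   # illustrative $/GB-month for gp2 (not a current quote)
GP3_PER_GB=0.08   # illustrative $/GB-month for gp3 (not a current quote)
SIZE_GB=500       # hypothetical volume size

saving=$(awk -v s="$SIZE_GB" -v a="$GP2_PER_GB" -v b="$GP3_PER_GB" \
  'BEGIN { printf "%.2f", s * (a - b) }')
echo "Monthly saving for ${SIZE_GB} GB: \$${saving}"
```

Small per volume, but it compounds: a fleet with a few terabytes of gp2 recovers meaningful money for a zero-downtime change.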
# Find all gp2 volumes and convert them to gp3
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query "Volumes[*].[VolumeId,Size]" \
  --output text | while read -r vol_id size; do
  echo "Modifying $vol_id ($size GB) to gp3..."
  aws ec2 modify-volume --volume-id "$vol_id" --volume-type gp3
done
The modification happens live, with no downtime and no detaching. It takes a few minutes per volume, and each volume can only be modified once every six hours.
For EKS, update the StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
Step 6: Reserved Instances for Predictable Workloads
For EC2 instances that run 24/7, a 1-year Reserved Instance typically saves around 40% over On-Demand, and a 3-year commitment around 60%. Exact discounts vary by instance family, region, and payment option.
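As a worked example at an illustrative On-Demand rate — both the hourly price and the discount are ballpark assumptions, not quotes:

```shell
OD_HOURLY=0.1664   # illustrative On-Demand $/hour (hypothetical rate)
RI_DISCOUNT=0.40   # ballpark 1-year, no-upfront RI discount (assumption)

# 8760 hours in a year of 24/7 operation
od_annual=$(awk -v r="$OD_HOURLY" 'BEGIN { printf "%.0f", r * 8760 }')
ri_annual=$(awk -v r="$OD_HOURLY" -v d="$RI_DISCOUNT" \
  'BEGIN { printf "%.0f", r * 8760 * (1 - d) }')
echo "On-Demand: \$${od_annual}/yr, 1-yr RI: \$${ri_annual}/yr"
```

Roughly $580/year back per always-on instance at this rate; multiply across a fleet before deciding whether commitment management is worth the overhead.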
Only reserve instance types and sizes you're confident won't change. Use Compute Savings Plans instead of RIs for more flexibility — they apply to any instance type and family automatically.
Check your On-Demand spend:
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-04-01 \
  --granularity MONTHLY \
  --filter '{"Dimensions":{"Key":"PURCHASE_TYPE","Values":["On Demand"]}}' \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE
For EKS node groups, use Spot instances for stateless workloads with a fallback to On-Demand:
# create-nodegroup also requires --node-role and --subnets for your
# cluster; they're omitted here for brevity
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name spot-workers \
  --capacity-type SPOT \
  --instance-types t3.medium t3.large t3a.medium \
  --scaling-config minSize=2,maxSize=20,desiredSize=5
A mix of three or more instance types reduces the chance that Spot interruptions hit all your capacity in a single AZ at once.
Tracking Progress
Set a monthly budget alert:
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "Monthly AWS Budget",
    "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}]
  }]'
Review Cost Explorer weekly for the first month after changes. Cost reductions take a full billing cycle to show up clearly.
We built Podscape to simplify Kubernetes workflows like this — logs, events, and cluster state in one interface, without switching tools.
Struggling with this in production?
We help teams fix these exact issues. Our engineers have deployed these patterns across production environments at scale.