AWS ElastiCache: Redis, Valkey, and Caching Patterns for Production
ElastiCache provides managed Redis and Valkey (the open-source Redis fork) on AWS. Caching reduces database load, cuts latency, and absorbs traffic spikes — but only when the cache is correctly populated, invalidated, and sized. This guide covers ElastiCache architecture (cluster mode, replication groups, Multi-AZ failover), the major caching patterns (cache-aside, write-through, write-behind), session storage, distributed locking, and the operational considerations that distinguish a well-tuned cache from a liability.

Caching is one of the highest-leverage optimizations in any production system: a well-placed cache turns a 10ms database query into a 0.1ms memory read and reduces database connection load by orders of magnitude. ElastiCache manages the operational work — provisioning, patching, failover, backup — so you can focus on cache design.
The hard part is not the infrastructure — it's deciding what to cache, for how long, how to invalidate it, and what happens when the cache is cold or evicts unexpectedly.
ElastiCache Architecture
Replication Groups
A replication group is the primary ElastiCache deployment unit for Redis/Valkey. It consists of one primary node (accepts reads and writes) and up to 5 replica nodes (serve reads, used for failover).
```bash
# Create a replication group with 1 primary + 2 replicas
aws elasticache create-replication-group \
  --replication-group-id prod-cache \
  --replication-group-description "Production cache layer" \
  --engine valkey \
  --engine-version 7.2.6 \
  --cache-node-type cache.r7g.xlarge \
  --num-cache-clusters 3 \
  --multi-az-enabled \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --cache-subnet-group-name prod-cache-subnets \
  --security-group-ids sg-cache-group
```
Multi-AZ with automatic failover: when the primary fails, ElastiCache promotes the replica with the least replication lag to primary. Failover typically completes in under 60 seconds, though it can take up to 2 minutes under heavy load or DNS propagation delays. During failover, DNS updates so the primary endpoint resolves to the new primary — clients that reconnect automatically after failure (standard Redis client behavior) recover without code changes.
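Since recovery depends on clients reconnecting, it's worth configuring timeouts and retries explicitly. A minimal sketch using redis-py's built-in retry support — the endpoint and backoff values are illustrative:

```python
import redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff

# Retry transient failures with exponential backoff so the client survives
# the brief window where the primary endpoint re-resolves after failover
r = redis.Redis(
    host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    socket_timeout=2,            # fail fast instead of hanging on a dead primary
    socket_connect_timeout=2,
    retry=Retry(ExponentialBackoff(cap=1.0, base=0.1), retries=5),
    retry_on_error=[redis.ConnectionError, redis.TimeoutError],
)
```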
Cluster Mode
Cluster mode distributes data across multiple shards, each a replication group. This is required when your dataset exceeds the memory of a single node or when your write throughput exceeds what a single primary can handle.
```bash
# Create a cluster-mode-enabled replication group with 3 shards, 2 replicas per shard
aws elasticache create-replication-group \
  --replication-group-id prod-cache-cluster \
  --replication-group-description "Clustered production cache" \
  --engine valkey \
  --engine-version 7.2.6 \
  --cache-node-type cache.r7g.large \
  --num-node-groups 3 \
  --replicas-per-node-group 2 \
  --multi-az-enabled \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --cache-subnet-group-name prod-cache-subnets
```
With 3 shards, each shard owns approximately 1/3 of the keyspace. Keys are mapped to hash slots 0–16383 via CRC16; each shard owns a contiguous range. Multi-key operations (MGET, transactions with MULTI/EXEC) only work when all keys belong to the same shard — use hash tags like `{user:123}:profile` to force related keys to the same shard.
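A sketch of what this looks like from redis-py's cluster-aware client — the configuration endpoint is a placeholder:

```python
from redis.cluster import RedisCluster

# Cluster-aware client: discovers shards via the cluster configuration endpoint
rc = RedisCluster(
    host='prod-cache-cluster.abc123.clustercfg.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
)

# Without the {user:123} hash tag these keys could hash to different shards
# and MGET would fail with a CROSSSLOT error; the tag pins them to one slot
rc.set('{user:123}:profile', '{"plan": "pro"}')
rc.set('{user:123}:settings', '{"theme": "dark"}')
profile, settings = rc.mget('{user:123}:profile', '{user:123}:settings')
```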
Cluster mode vs non-cluster mode:
| | Single-shard (non-cluster) | Cluster mode |
|---|---|---|
| Data distribution | Single primary | 1–500 shards |
| Max memory | One node's memory | Shards × node memory |
| Cross-key operations | Supported | Only within same shard/hash tag |
| Scaling | Vertical (larger node) | Horizontal (more shards) |
| Connection | Single endpoint | Cluster-aware client required |
Serverless ElastiCache
ElastiCache Serverless scales automatically based on demand — no node type selection, no capacity planning, no cluster resizing. You pay for data stored and ECPUs consumed.
```bash
# Create a serverless cache
aws elasticache create-serverless-cache \
  --serverless-cache-name prod-serverless-cache \
  --engine valkey \
  --cache-usage-limits '{
    "DataStorage": {"Maximum": 10, "Unit": "GB"},
    "ECPUPerSecond": {"Maximum": 10000}
  }' \
  --subnet-ids subnet-private-1a subnet-private-1b \
  --security-group-ids sg-cache-group
```
ElastiCache Serverless is appropriate when the workload is unpredictable or bursty, or when you want to eliminate capacity planning. For sustained high-throughput workloads, provisioned instances are typically more cost-efficient.
Caching Patterns
Cache-Aside (Lazy Loading)
The application manages the cache: check the cache first, fetch from the database on a miss, populate the cache, return the result.
```python
import redis
import json
import time

r = redis.Redis(host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com', port=6379, ssl=True)

def get_user(user_id: str) -> dict | None:
    cache_key = f'user:{user_id}'

    # Try cache first
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss — fetch from database (db is your application's DB client)
    user = db.query('SELECT * FROM users WHERE id = %s', user_id)
    if user is None:
        return None

    # Populate cache with TTL
    r.setex(cache_key, 3600, json.dumps(user))  # Expire in 1 hour
    return user
```
Advantages:
- Only caches data that's actually requested — no wasted memory
- Cache failures don't break the application (falls back to database)
- Data in cache is fresh relative to when it was last requested
Disadvantages:
- First request after expiry (or cold start) hits the database — cache stampede risk when many clients simultaneously request the same expired key
- Data can be stale for the TTL duration — a write to the database doesn't automatically update the cache
Cache stampede mitigation: use a mutex pattern where only one client fetches from the database on a miss while the others wait for it to populate the cache:
```python
def get_user_with_lock(user_id: str) -> dict | None:
    cache_key = f'user:{user_id}'
    lock_key = f'lock:user:{user_id}'

    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Try to acquire lock (NX = set only if not exists, EX = expire in 5s)
    acquired = r.set(lock_key, '1', nx=True, ex=5)
    if acquired:
        try:
            user = db.query('SELECT * FROM users WHERE id = %s', user_id)
            r.setex(cache_key, 3600, json.dumps(user))
            return user
        finally:
            r.delete(lock_key)
    else:
        # Wait for the lock holder to populate the cache, then retry
        time.sleep(0.05)
        cached = r.get(cache_key)
        return json.loads(cached) if cached else get_user_with_lock(user_id)
```
Write-Through
The application writes to cache and database simultaneously on every write. The cache is always in sync with the database.
```python
def update_user(user_id: str, updates: dict) -> dict:
    # Write to database
    user = db.execute(
        'UPDATE users SET name=%s, email=%s WHERE id=%s RETURNING *',
        updates['name'], updates['email'], user_id
    )

    # Write to cache — keeps it in sync
    cache_key = f'user:{user_id}'
    r.setex(cache_key, 3600, json.dumps(user))

    return user
```
Advantages:
- Cache is always consistent with the database
- No stale reads after writes
Disadvantages:
- Every write hits both cache and database — write latency increases slightly
- Caches data that may never be read (wastes memory for write-heavy infrequently-read data)
Combining with cache-aside: write-through on writes + cache-aside on reads is the most common production pattern. Writes keep the cache current; reads populate on miss.
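A common variant is to invalidate on write instead of updating the cache: delete the key after the database write and let the next cache-aside read repopulate it. A minimal sketch, reusing the `r` and `db` handles from the examples above:

```python
def update_user_invalidating(user_id: str, updates: dict) -> dict:
    # Database is the source of truth — write it first
    user = db.execute(
        'UPDATE users SET name=%s, email=%s WHERE id=%s RETURNING *',
        updates['name'], updates['email'], user_id
    )
    # Delete instead of set: avoids caching rows that are written often
    # but rarely read; the next get_user() call repopulates on miss
    r.delete(f'user:{user_id}')
    return user
```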
Write-Behind (Write-Back)
Write to cache immediately, write to the database asynchronously in the background. The cache is the system of record temporarily.
Write-behind is rare for primary application data because it risks data loss if the cache fails before the async write completes. It's appropriate for high-frequency updates where write coalescing matters (e.g., real-time analytics counters, gaming leaderboards) where losing a few updates is acceptable.
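As an illustration, a minimal write-behind sketch for page-view counters, assuming the `r` and `db` handles from the earlier examples — key names and the flush interval are arbitrary:

```python
import threading
import redis

def record_page_view(page_id: str):
    # Absorb high-frequency writes in Redis; the database sees one
    # coalesced UPDATE per page per flush interval instead of one per view
    r.hincrby('pending:page_views', page_id, 1)

def flush_page_views():
    try:
        # Atomically swap the live hash out so increments arriving
        # mid-flush land in a fresh hash and are never lost
        r.rename('pending:page_views', 'flushing:page_views')
    except redis.ResponseError:
        pass  # nothing pending this interval
    else:
        pending = r.hgetall('flushing:page_views')
        r.delete('flushing:page_views')
        for page_id, count in pending.items():
            db.execute('UPDATE pages SET views = views + %s WHERE id = %s',
                       int(count), page_id.decode())
    # If Redis fails between flushes, up to one interval of counts is lost
    threading.Timer(10.0, flush_page_views).start()
```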
Session Storage
Redis is commonly used for server-side session storage — sessions are stored in Redis instead of a database, providing fast reads and automatic expiry.
```python
import secrets
import time
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
# r: the Redis client from the earlier examples

def create_session(user_id: str, ttl: int = 86400) -> str:
    session_id = secrets.token_urlsafe(32)  # Cryptographically secure random token
    session_data = {'userId': user_id, 'createdAt': int(time.time())}
    r.setex(f'session:{session_id}', ttl, json.dumps(session_data))
    return session_id

def get_session(session_id: str) -> dict | None:
    data = r.get(f'session:{session_id}')
    if data is None:
        return None
    # Optionally refresh TTL on each request (sliding expiry)
    r.expire(f'session:{session_id}', 86400)
    return json.loads(data)

def invalidate_session(session_id: str):
    r.delete(f'session:{session_id}')
```
Redis TTL handles session expiry automatically — no background cleanup job needed. Store only the session ID in the cookie; keep session data server-side in Redis.
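Wiring these helpers into routes is straightforward — a hedged sketch, where `authenticate_user` is a placeholder for your credential check:

```python
@app.route('/login', methods=['POST'])
def login():
    user_id = authenticate_user(request.json['email'], request.json['password'])
    session_id = create_session(user_id)
    resp = jsonify({'status': 'ok'})
    # HttpOnly + Secure: sent by the browser, unreadable from JavaScript
    resp.set_cookie('session_id', session_id,
                    httponly=True, secure=True, samesite='Lax')
    return resp

@app.route('/me')
def me():
    session = get_session(request.cookies.get('session_id', ''))
    if session is None:
        return jsonify({'error': 'unauthenticated'}), 401
    return jsonify({'userId': session['userId']})
```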
Distributed Locking
Redis SET NX (set if not exists) provides a primitive for distributed mutual exclusion:
```python
import uuid

class RedisLock:
    def __init__(self, redis_client, lock_key: str, ttl: int = 30):
        self.r = redis_client
        self.lock_key = lock_key
        self.ttl = ttl
        self.lock_value = str(uuid.uuid4())  # Unique value so other holders can't release this lock

    def acquire(self) -> bool:
        return self.r.set(self.lock_key, self.lock_value, nx=True, ex=self.ttl) is not None

    def release(self):
        # Lua script ensures atomic check-and-delete — prevents releasing a lock you don't hold
        script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.r.eval(script, 1, self.lock_key, self.lock_value)

    def __enter__(self):
        if not self.acquire():
            raise RuntimeError(f'Could not acquire lock: {self.lock_key}')
        return self

    def __exit__(self, *args):
        self.release()

# Usage
with RedisLock(r, 'lock:process-order:ord-123', ttl=30):
    process_order('ord-123')
```
The Lua script is critical — it atomically checks that the lock value matches before deleting, preventing a race where client A's lock expires, client B acquires the lock, and client A's release then deletes client B's lock.
Caveats: Redis distributed locks are advisory, not strict. Under network partition or Redis failover, a lock holder may believe they hold the lock while another client also holds it. For most workloads (deduplication, rate limiting, idempotency), this level of locking is sufficient. For critical mutual exclusion (financial transactions), use database-level locking or DynamoDB conditional writes instead.
Cache Sizing and Eviction
Memory Sizing
Size your cache based on the working set — the data that's actually requested regularly. A cache that's 20% of your database size typically achieves 80%+ hit rates for most workloads (hot data concentrates).
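A back-of-envelope calculation with illustrative numbers — the node memory figure is approximate:

```python
db_size_gb = 200           # total dataset in the primary database
working_set_ratio = 0.20   # hot fraction you expect to serve from cache
overhead = 1.25            # key/metadata overhead plus headroom for
                           # replication buffers and fragmentation

cache_gb = db_size_gb * working_set_ratio * overhead   # = 50 GB
# A cache.r7g.xlarge has roughly 26 GiB of memory, so this working set
# needs either a larger node type or 2-3 shards in cluster mode.
```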
```bash
# Check current memory usage and eviction stats
redis-cli -h prod-cache.abc123.ng.0001.use1.cache.amazonaws.com -p 6379 --tls INFO memory \
  | grep -E "used_memory_human|mem_fragmentation_ratio"
redis-cli ... INFO stats | grep -E "evicted_keys|keyspace_hits|keyspace_misses"
```
Cache hit rate = `keyspace_hits / (keyspace_hits + keyspace_misses)`. A hit rate below 90% suggests the cache is too small, TTLs are too short, or your access pattern is too uniform to benefit from caching.
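The same counters are available programmatically — a small helper using redis-py's INFO command:

```python
def cache_hit_rate(client: redis.Redis) -> float:
    stats = client.info('stats')  # same counters the CLI grep above surfaces
    hits = stats['keyspace_hits']
    misses = stats['keyspace_misses']
    total = hits + misses
    return hits / total if total else 0.0

print(f'hit rate: {cache_hit_rate(r):.2%}')
```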
Eviction Policies
When memory is full, Redis evicts keys based on the configured maxmemory-policy:
| Policy | Behavior |
|---|---|
| `noeviction` | Reject new writes when memory is full (returns an error) |
| `allkeys-lru` | Evict the least recently used key from all keys |
| `volatile-lru` | Evict the least recently used key, considering only keys with a TTL |
| `allkeys-lfu` | Evict the least frequently used key (better for skewed access) |
| `volatile-ttl` | Evict the key with the nearest expiry |
For general-purpose caching, allkeys-lru or allkeys-lfu (for skewed access patterns) are the right defaults. Use noeviction only for data structures that must not be evicted (queues, sorted sets for leaderboards) — but ensure memory is sized to never fill completely.
ElastiCache does not allow direct CONFIG SET on managed instances — configure the eviction policy via a parameter group, then attach it to the replication group:
```bash
# Create a parameter group and set the eviction policy
aws elasticache create-cache-parameter-group \
  --cache-parameter-group-name prod-cache-params \
  --cache-parameter-group-family valkey7 \
  --description "Production cache parameters"

aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name prod-cache-params \
  --parameter-name-values ParameterName=maxmemory-policy,ParameterValue=allkeys-lru

# Attach the parameter group to the replication group
aws elasticache modify-replication-group \
  --replication-group-id prod-cache \
  --cache-parameter-group-name prod-cache-params
```
Monitoring
Key ElastiCache CloudWatch metrics:
| Metric | What to watch |
|---|---|
| `CacheHits` / `CacheMisses` | Hit rate — alert if it drops below 90% |
| `Evictions` | Nonzero = memory pressure; consider scaling up |
| `DatabaseMemoryUsagePercentage` | Alert at 80% — above this, eviction or OOM risk |
| `ReplicationLag` | Replica lag in seconds — alert if > 1s |
| `CurrConnections` | Active connections — connection pool exhaustion shows up here |
| `NetworkBytesIn` / `NetworkBytesOut` | Bandwidth — large objects or high throughput |
| `EngineCPUUtilization` | CPU-bound operations (large KEYS scans, SORT, complex Lua) |
```bash
# Create alarm for high eviction rate (cache memory pressure)
aws cloudwatch put-metric-alarm \
  --alarm-name prod-cache-evictions \
  --namespace AWS/ElastiCache \
  --metric-name Evictions \
  --dimensions Name=CacheClusterId,Value=prod-cache-0001-001 \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts
```
Frequently Asked Questions
What's the difference between Valkey and Redis on ElastiCache?
Valkey is the open-source fork of Redis created in 2024 under the Linux Foundation after Redis changed its license. Valkey is API-compatible with Redis, and AWS now recommends Valkey for new deployments. For existing Redis clusters, migration is transparent — the same client libraries work without changes.
AWS also offers ElastiCache for Redis OSS, but Valkey is the forward-looking choice on AWS.
How do I handle cache invalidation?
Cache invalidation is hard. The three workable strategies:
- TTL-based expiry: let keys expire naturally. Simple, eventually consistent. The stale window is the TTL duration. Right for most read-heavy data that tolerates brief staleness.
- Event-driven invalidation: when data changes in the database, publish an event (to SNS, EventBridge, or a write to a stream) that triggers deletion of the corresponding cache key. Immediate consistency, more complex to implement — a sketch follows this list.
- Versioned keys: instead of deleting the old key, write a new version (`user:123:v2`) and change the version reference in a separate key (`user:123:version` = `v2`). Old versions naturally expire. Avoids delete races but uses more memory.
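A minimal sketch of the event-driven option — a hypothetical Lambda subscribed to an SNS topic that the write path publishes to; the message shape is an assumption:

```python
import json
import redis

r = redis.Redis(host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
                port=6379, ssl=True)

def handler(event, context):
    for record in event['Records']:
        # Assumed message shape: {"entity": "user", "id": "123"}
        message = json.loads(record['Sns']['Message'])
        r.delete(f"{message['entity']}:{message['id']}")
```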
For most production systems, TTL-based expiry with write-through on updates is the right starting point. Add event-driven invalidation only for data that must be immediately consistent after writes (user authentication state, billing information).
When should I NOT use ElastiCache?
- As a primary database: Redis/Valkey is an in-memory store. Without persistence configured (RDB snapshots or AOF logging), data is lost on restart. Even with persistence, ElastiCache's backup is not a substitute for Aurora or RDS for your primary data.
- For large objects > 1 MB: Redis is optimized for small keys. Large objects increase serialization cost, consume disproportionate memory, and are slow to evict. Store large blobs in S3, keep references in Redis — see the sketch after this list.
- When you need relational queries: if your cache lookup logic requires JOIN-style operations, you're working against the tool. Use RDS or DynamoDB with appropriate indexes instead.
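A sketch of the S3-pointer pattern from the large-objects bullet above — the bucket name and key layout are illustrative, and `r` is the Redis client from earlier:

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'prod-report-artifacts'  # hypothetical bucket

def cache_report(report_id: str, pdf_bytes: bytes) -> None:
    # The multi-megabyte blob goes to S3; Redis holds only a small pointer
    key = f'reports/{report_id}.pdf'
    s3.put_object(Bucket=BUCKET, Key=key, Body=pdf_bytes)
    r.setex(f'report:{report_id}:s3key', 3600, key)

def get_report(report_id: str) -> bytes | None:
    key = r.get(f'report:{report_id}:s3key')
    if key is None:
        return None
    obj = s3.get_object(Bucket=BUCKET, Key=key.decode())
    return obj['Body'].read()
```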
How do I connect securely to ElastiCache?
ElastiCache should live in a private subnet — never expose it to the internet. Use:
- VPC security groups: allow access only from application security groups (not CIDR blocks)
- In-transit encryption: `--transit-encryption-enabled` uses TLS for all client-server communication
- Auth tokens: `--auth-token <token>` for Redis AUTH — ElastiCache supports RBAC for Valkey 7.2+ clusters
```python
import redis

# Connect with TLS + auth token
r = redis.Redis(
    host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    ssl_cert_reqs='required',
    ssl_ca_certs='/etc/ssl/certs/ca-certificates.crt',  # System CA bundle; path varies by OS
    password='your-auth-token',  # Load from Secrets Manager
    decode_responses=True,
)
```
For DynamoDB as the primary database that ElastiCache offloads read traffic from, see AWS DynamoDB: Data Modeling, Capacity, Indexes, and Streams. For Lambda functions that use ElastiCache connections (initialized outside the handler for connection reuse), see AWS Lambda: Functions, Event Sources, Layers, and Serverless Patterns.
Sizing an ElastiCache cluster for a high-traffic API, implementing cache invalidation for a write-heavy workload, or debugging cache stampede causing database overload under traffic spikes? Talk to us at Coding Protocols — we help platform teams design caching layers that eliminate database bottlenecks without creating cache consistency problems.


