AWS ElastiCache: Redis, Valkey, and Caching Patterns for Production
ElastiCache provides managed Redis and Valkey (the open-source Redis fork) on AWS. Caching reduces database load, cuts latency, and absorbs traffic spikes — but only when the cache is correctly populated, invalidated, and sized. This guide covers ElastiCache architecture (cluster mode, replication groups, Multi-AZ failover), the major caching patterns (cache-aside, write-through, write-behind), session storage, distributed locking, and the operational considerations that distinguish a well-tuned cache from a liability.

Caching is one of the highest-leverage optimizations in any production system: a well-placed cache turns a 10ms database query into a 0.1ms memory read and reduces database connection load by orders of magnitude. ElastiCache manages the operational work — provisioning, patching, failover, backup — so you can focus on cache design.
The hard part is not the infrastructure — it's deciding what to cache, for how long, how to invalidate it, and what happens when the cache is cold or evicts unexpectedly.
ElastiCache Architecture
Replication Groups
A replication group is the primary ElastiCache deployment unit for Redis/Valkey. It consists of one primary node (accepts reads and writes) and up to 5 replica nodes (serve reads, used for failover).
```bash
# Create a replication group with 1 primary + 2 replicas
aws elasticache create-replication-group \
  --replication-group-id prod-cache \
  --replication-group-description "Production cache layer" \
  --engine valkey \
  --engine-version 7.2.6 \
  --cache-node-type cache.r7g.xlarge \
  --num-cache-clusters 3 \
  --multi-az-enabled \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --cache-subnet-group-name prod-cache-subnets \
  --security-group-ids sg-cache-group
```
Multi-AZ with automatic failover: when the primary fails, ElastiCache promotes the replica with the least replication lag to primary. Failover typically completes in under 60 seconds, though it can take up to 2 minutes under heavy load or DNS propagation delays. During failover, DNS updates so the primary endpoint resolves to the new primary — clients that reconnect automatically after failure (standard Redis client behavior) recover without code changes.
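Since recovery depends on clients reconnecting, it's worth configuring timeouts and retries explicitly. A minimal sketch using redis-py's built-in retry support — the endpoint and backoff values are illustrative:

```python
import redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff

# Retry transient failures with exponential backoff so the client survives
# the brief window where the primary endpoint re-resolves after failover
r = redis.Redis(
    host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    socket_timeout=2,            # fail fast instead of hanging on a dead primary
    socket_connect_timeout=2,
    retry=Retry(ExponentialBackoff(cap=1.0, base=0.1), retries=5),
    retry_on_error=[redis.ConnectionError, redis.TimeoutError],
)
```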
Cluster Mode
Cluster mode distributes data across multiple shards, each a replication group. This is required when your dataset exceeds the memory of a single node or when your write throughput exceeds what a single primary can handle.
```bash
# Create a cluster-mode-enabled replication group with 3 shards, 2 replicas per shard
aws elasticache create-replication-group \
  --replication-group-id prod-cache-cluster \
  --replication-group-description "Clustered production cache" \
  --engine valkey \
  --engine-version 7.2.6 \
  --cache-node-type cache.r7g.large \
  --num-node-groups 3 \
  --replicas-per-node-group 2 \
  --multi-az-enabled \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --cache-subnet-group-name prod-cache-subnets
```
With 3 shards, each shard owns approximately 1/3 of the keyspace. Keys are mapped to hash slots 0–16383 via CRC16; each shard owns a contiguous range. Multi-key operations (MGET, transactions with MULTI/EXEC) only work when all keys belong to the same shard — use hash tags like `{user:123}:profile` to force related keys to the same shard.
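A sketch of what this looks like from redis-py's cluster-aware client — the configuration endpoint is a placeholder:

```python
from redis.cluster import RedisCluster

# Cluster-aware client: discovers shards via the cluster configuration endpoint
rc = RedisCluster(
    host='prod-cache-cluster.abc123.clustercfg.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
)

# Without the {user:123} hash tag these keys could hash to different shards
# and MGET would fail with a CROSSSLOT error; the tag pins them to one slot
rc.set('{user:123}:profile', '{"plan": "pro"}')
rc.set('{user:123}:settings', '{"theme": "dark"}')
profile, settings = rc.mget('{user:123}:profile', '{user:123}:settings')
```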
Cluster mode vs non-cluster mode:
| | Single-shard (non-cluster) | Cluster mode |
|---|---|---|
| Data distribution | Single primary | 1–500 shards |
| Max memory | One node's memory | Shards × node memory |
| Cross-key operations | Supported | Only within same shard/hash tag |
| Scaling | Vertical (larger node) | Horizontal (more shards) |
| Connection | Single endpoint | Cluster-aware client required |
Serverless ElastiCache
ElastiCache Serverless scales automatically based on demand — no node type selection, no capacity planning, no cluster resizing. You pay for data stored and ECPUs consumed.
```bash
# Create a serverless cache
aws elasticache create-serverless-cache \
  --serverless-cache-name prod-serverless-cache \
  --engine valkey \
  --cache-usage-limits '{
    "DataStorage": {"Maximum": 10, "Unit": "GB"},
    "ECPUPerSecond": {"Maximum": 10000}
  }' \
  --subnet-ids subnet-private-1a subnet-private-1b \
  --security-group-ids sg-cache-group
```
ElastiCache Serverless is appropriate when the workload is unpredictable or bursty, or when you want to eliminate capacity planning. For sustained high-throughput workloads, provisioned instances are typically more cost-efficient.
Caching Patterns
Cache-Aside (Lazy Loading)
The application manages the cache: check the cache first, fetch from the database on a miss, populate the cache, return the result.
```python
import redis
import json
import time

r = redis.Redis(host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com', port=6379, ssl=True)

def get_user(user_id: str) -> dict | None:
    cache_key = f'user:{user_id}'

    # Try cache first
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss — fetch from database (db is your application's DB client)
    user = db.query('SELECT * FROM users WHERE id = %s', user_id)
    if user is None:
        return None

    # Populate cache with TTL
    r.setex(cache_key, 3600, json.dumps(user))  # Expire in 1 hour
    return user
```
Advantages:
- Only caches data that's actually requested — no wasted memory
- Cache failures don't break the application (falls back to database)
- Data in cache is fresh relative to when it was last requested
Disadvantages:
- First request after expiry (or cold start) hits the database — cache stampede risk when many clients simultaneously request the same expired key
- Data can be stale for the TTL duration — a write to the database doesn't automatically update the cache
Cache stampede mitigation: use a mutex pattern where only one client fetches from the database on a miss while the others wait for it to populate the cache:
```python
def get_user_with_lock(user_id: str) -> dict | None:
    cache_key = f'user:{user_id}'
    lock_key = f'lock:user:{user_id}'

    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Try to acquire lock (NX = set only if not exists, EX = expire in 5s)
    acquired = r.set(lock_key, '1', nx=True, ex=5)
    if acquired:
        try:
            user = db.query('SELECT * FROM users WHERE id = %s', user_id)
            r.setex(cache_key, 3600, json.dumps(user))
            return user
        finally:
            r.delete(lock_key)
    else:
        # Wait for the lock holder to populate the cache, then retry
        time.sleep(0.05)
        cached = r.get(cache_key)
        return json.loads(cached) if cached else get_user_with_lock(user_id)
```
Write-Through
The application writes to cache and database simultaneously on every write. The cache is always in sync with the database.
```python
def update_user(user_id: str, updates: dict) -> dict:
    # Write to database
    user = db.execute(
        'UPDATE users SET name=%s, email=%s WHERE id=%s RETURNING *',
        updates['name'], updates['email'], user_id
    )

    # Write to cache — keeps it in sync
    cache_key = f'user:{user_id}'
    r.setex(cache_key, 3600, json.dumps(user))

    return user
```
Advantages:
- Cache is always consistent with the database
- No stale reads after writes
Disadvantages:
- Every write hits both cache and database — write latency increases slightly
- Caches data that may never be read (wastes memory for write-heavy infrequently-read data)
Combining with cache-aside: write-through on writes + cache-aside on reads is the most common production pattern. Writes keep the cache current; reads populate on miss.
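A common variant is to invalidate on write instead of updating the cache: delete the key after the database write and let the next cache-aside read repopulate it. A minimal sketch, reusing the `r` and `db` handles from the examples above:

```python
def update_user_invalidating(user_id: str, updates: dict) -> dict:
    # Database is the source of truth — write it first
    user = db.execute(
        'UPDATE users SET name=%s, email=%s WHERE id=%s RETURNING *',
        updates['name'], updates['email'], user_id
    )
    # Delete instead of set: avoids caching rows that are written often
    # but rarely read; the next get_user() call repopulates on miss
    r.delete(f'user:{user_id}')
    return user
```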
Write-Behind (Write-Back)
Write to cache immediately, write to the database asynchronously in the background. The cache is the system of record temporarily.
Write-behind is rare for primary application data because it risks data loss if the cache fails before the async write completes. It's appropriate for high-frequency updates where write coalescing matters (e.g., real-time analytics counters, gaming leaderboards) where losing a few updates is acceptable.
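As an illustration, a minimal write-behind sketch for page-view counters, assuming the `r` and `db` handles from the earlier examples — key names and the flush interval are arbitrary:

```python
import threading
import redis

def record_page_view(page_id: str):
    # Absorb high-frequency writes in Redis; the database sees one
    # coalesced UPDATE per page per flush interval instead of one per view
    r.hincrby('pending:page_views', page_id, 1)

def flush_page_views():
    try:
        # Atomically swap the live hash out so increments arriving
        # mid-flush land in a fresh hash and are never lost
        r.rename('pending:page_views', 'flushing:page_views')
    except redis.ResponseError:
        pass  # nothing pending this interval
    else:
        pending = r.hgetall('flushing:page_views')
        r.delete('flushing:page_views')
        for page_id, count in pending.items():
            db.execute('UPDATE pages SET views = views + %s WHERE id = %s',
                       int(count), page_id.decode())
    # If Redis fails between flushes, up to one interval of counts is lost
    threading.Timer(10.0, flush_page_views).start()
```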
Session Storage
Redis is commonly used for server-side session storage — sessions are stored in Redis instead of a database, providing fast reads and automatic expiry.
```python
import secrets
import time
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
# r: the Redis client from the earlier examples

def create_session(user_id: str, ttl: int = 86400) -> str:
    session_id = secrets.token_urlsafe(32)  # Cryptographically secure random token
    session_data = {'userId': user_id, 'createdAt': int(time.time())}
    r.setex(f'session:{session_id}', ttl, json.dumps(session_data))
    return session_id

def get_session(session_id: str) -> dict | None:
    data = r.get(f'session:{session_id}')
    if data is None:
        return None
    # Optionally refresh TTL on each request (sliding expiry)
    r.expire(f'session:{session_id}', 86400)
    return json.loads(data)

def invalidate_session(session_id: str):
    r.delete(f'session:{session_id}')
```
Redis TTL handles session expiry automatically — no background cleanup job needed. Store only the session ID in the cookie; keep session data server-side in Redis.
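Wiring these helpers into routes is straightforward — a hedged sketch, where `authenticate_user` is a placeholder for your credential check:

```python
@app.route('/login', methods=['POST'])
def login():
    user_id = authenticate_user(request.json['email'], request.json['password'])
    session_id = create_session(user_id)
    resp = jsonify({'status': 'ok'})
    # HttpOnly + Secure: sent by the browser, unreadable from JavaScript
    resp.set_cookie('session_id', session_id,
                    httponly=True, secure=True, samesite='Lax')
    return resp

@app.route('/me')
def me():
    session = get_session(request.cookies.get('session_id', ''))
    if session is None:
        return jsonify({'error': 'unauthenticated'}), 401
    return jsonify({'userId': session['userId']})
```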
Distributed Locking
Redis SET NX (set if not exists) provides a primitive for distributed mutual exclusion:
```python
import uuid

class RedisLock:
    def __init__(self, redis_client, lock_key: str, ttl: int = 30):
        self.r = redis_client
        self.lock_key = lock_key
        self.ttl = ttl
        self.lock_value = str(uuid.uuid4())  # Unique value so other holders can't release this lock

    def acquire(self) -> bool:
        return self.r.set(self.lock_key, self.lock_value, nx=True, ex=self.ttl) is not None

    def release(self):
        # Lua script ensures atomic check-and-delete — prevents releasing a lock you don't hold
        script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.r.eval(script, 1, self.lock_key, self.lock_value)

    def __enter__(self):
        if not self.acquire():
            raise RuntimeError(f'Could not acquire lock: {self.lock_key}')
        return self

    def __exit__(self, *args):
        self.release()

# Usage
with RedisLock(r, 'lock:process-order:ord-123', ttl=30):
    process_order('ord-123')
```
The Lua script is critical — it atomically checks that the lock value matches before deleting, preventing a race where client A's lock expires, client B acquires the lock, and client A's release then deletes client B's lock.
Caveats: Redis distributed locks are advisory, not strict. Under network partition or Redis failover, a lock holder may believe they hold the lock while another client also holds it. For most workloads (deduplication, rate limiting, idempotency), this level of locking is sufficient. For critical mutual exclusion (financial transactions), use database-level locking or DynamoDB conditional writes instead.
Cache Sizing and Eviction
Memory Sizing
Size your cache based on the working set — the data that's actually requested regularly. A cache that's 20% of your database size typically achieves 80%+ hit rates for most workloads (hot data concentrates).
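A back-of-envelope calculation with illustrative numbers — the node memory figure is approximate:

```python
db_size_gb = 200           # total dataset in the primary database
working_set_ratio = 0.20   # hot fraction you expect to serve from cache
overhead = 1.25            # key/metadata overhead plus headroom for
                           # replication buffers and fragmentation

cache_gb = db_size_gb * working_set_ratio * overhead   # = 50 GB
# A cache.r7g.xlarge has roughly 26 GiB of memory, so this working set
# needs either a larger node type or 2-3 shards in cluster mode.
```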
```bash
# Check current memory usage and eviction stats
redis-cli -h prod-cache.abc123.ng.0001.use1.cache.amazonaws.com -p 6379 --tls INFO memory \
  | grep -E "used_memory_human|mem_fragmentation_ratio"
redis-cli ... INFO stats | grep -E "evicted_keys|keyspace_hits|keyspace_misses"
```
Cache hit rate = `keyspace_hits / (keyspace_hits + keyspace_misses)`. A hit rate below 90% suggests the cache is too small, TTLs are too short, or your access pattern is too uniform to benefit from caching.
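The same counters are available programmatically — a small helper using redis-py's INFO command:

```python
def cache_hit_rate(client: redis.Redis) -> float:
    stats = client.info('stats')  # same counters the CLI grep above surfaces
    hits = stats['keyspace_hits']
    misses = stats['keyspace_misses']
    total = hits + misses
    return hits / total if total else 0.0

print(f'hit rate: {cache_hit_rate(r):.2%}')
```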
Eviction Policies
When memory is full, Redis evicts keys based on the configured maxmemory-policy:
| Policy | Behavior |
|---|---|
| `noeviction` | Reject new writes when memory is full (returns an error) |
| `allkeys-lru` | Evict the least recently used key from all keys |
| `volatile-lru` | Evict the least recently used key, considering only keys with a TTL |
| `allkeys-lfu` | Evict the least frequently used key (better for skewed access) |
| `volatile-ttl` | Evict the key with the nearest expiry |
For general-purpose caching, allkeys-lru or allkeys-lfu (for skewed access patterns) are the right defaults. Use noeviction only for data structures that must not be evicted (queues, sorted sets for leaderboards) — but ensure memory is sized to never fill completely.
ElastiCache does not allow direct CONFIG SET on managed instances — configure the eviction policy via a parameter group, then attach it to the replication group:
```bash
# Create a parameter group and set the eviction policy
aws elasticache create-cache-parameter-group \
  --cache-parameter-group-name prod-cache-params \
  --cache-parameter-group-family valkey7 \
  --description "Production cache parameters"

aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name prod-cache-params \
  --parameter-name-values ParameterName=maxmemory-policy,ParameterValue=allkeys-lru

# Attach the parameter group to the replication group
aws elasticache modify-replication-group \
  --replication-group-id prod-cache \
  --cache-parameter-group-name prod-cache-params
```
Monitoring
Key ElastiCache CloudWatch metrics:
| Metric | What to watch |
|---|---|
| `CacheHits` / `CacheMisses` | Hit rate — alert if it drops below 90% |
| `Evictions` | Nonzero = memory pressure; consider scaling up |
| `DatabaseMemoryUsagePercentage` | Alert at 80% — above this, eviction or OOM risk |
| `ReplicationLag` | Replica lag in seconds — alert if > 1s |
| `CurrConnections` | Active connections — connection pool exhaustion shows up here |
| `NetworkBytesIn` / `NetworkBytesOut` | Bandwidth — large objects or high throughput |
| `EngineCPUUtilization` | CPU-bound operations (large KEYS scans, SORT, complex Lua) |
```bash
# Create alarm for high eviction rate (cache memory pressure)
aws cloudwatch put-metric-alarm \
  --alarm-name prod-cache-evictions \
  --namespace AWS/ElastiCache \
  --metric-name Evictions \
  --dimensions Name=CacheClusterId,Value=prod-cache-0001-001 \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:012345678901:platform-alerts
```
Frequently Asked Questions
What's the difference between Valkey and Redis on ElastiCache?
Valkey is the open-source fork of Redis created in 2024 under the Linux Foundation after Redis changed its license. Valkey is API-compatible with Redis, and AWS now recommends Valkey for new deployments. For existing Redis clusters, migration is transparent — the same client libraries work without changes.
AWS also offers ElastiCache for Redis OSS, but Valkey is the forward-looking choice on AWS.
How do I handle cache invalidation?
Cache invalidation is hard. The three workable strategies:
- TTL-based expiry: let keys expire naturally. Simple, eventually consistent. The stale window is the TTL duration. Right for most read-heavy data that tolerates brief staleness.
- Event-driven invalidation: when data changes in the database, publish an event (to SNS, EventBridge, or a write to a stream) that triggers deletion of the corresponding cache key. Immediate consistency, more complex to implement — a sketch follows this list.
- Versioned keys: instead of deleting the old key, write a new version (`user:123:v2`) and change the version reference in a separate key (`user:123:version` = `v2`). Old versions naturally expire. Avoids delete races but uses more memory.
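A minimal sketch of the event-driven option — a hypothetical Lambda subscribed to an SNS topic that the write path publishes to; the message shape is an assumption:

```python
import json
import redis

r = redis.Redis(host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
                port=6379, ssl=True)

def handler(event, context):
    for record in event['Records']:
        # Assumed message shape: {"entity": "user", "id": "123"}
        message = json.loads(record['Sns']['Message'])
        r.delete(f"{message['entity']}:{message['id']}")
```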
For most production systems, TTL-based expiry with write-through on updates is the right starting point. Add event-driven invalidation only for data that must be immediately consistent after writes (user authentication state, billing information).
When should I NOT use ElastiCache?
- As a primary database: Redis/Valkey is an in-memory store. Without persistence configured (RDB snapshots or AOF logging), data is lost on restart. Even with persistence, ElastiCache's backup is not a substitute for Aurora or RDS for your primary data.
- For large objects > 1 MB: Redis is optimized for small keys. Large objects increase serialization cost, consume disproportionate memory, and are slow to evict. Store large blobs in S3, keep references in Redis — see the sketch after this list.
- When you need relational queries: if your cache lookup logic requires JOIN-style operations, you're working against the tool. Use RDS or DynamoDB with appropriate indexes instead.
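A sketch of the S3-pointer pattern from the large-objects bullet above — the bucket name and key layout are illustrative, and `r` is the Redis client from earlier:

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'prod-report-artifacts'  # hypothetical bucket

def cache_report(report_id: str, pdf_bytes: bytes) -> None:
    # The multi-megabyte blob goes to S3; Redis holds only a small pointer
    key = f'reports/{report_id}.pdf'
    s3.put_object(Bucket=BUCKET, Key=key, Body=pdf_bytes)
    r.setex(f'report:{report_id}:s3key', 3600, key)

def get_report(report_id: str) -> bytes | None:
    key = r.get(f'report:{report_id}:s3key')
    if key is None:
        return None
    obj = s3.get_object(Bucket=BUCKET, Key=key.decode())
    return obj['Body'].read()
```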
How do I connect securely to ElastiCache?
ElastiCache should live in a private subnet — never expose it to the internet. Use:
- VPC security groups: allow access only from application security groups (not CIDR blocks)
- In-transit encryption: `--transit-encryption-enabled` uses TLS for all client-server communication
- Auth tokens: `--auth-token <token>` for Redis AUTH — ElastiCache supports RBAC for Valkey 7.2+ clusters
```python
import redis

# Connect with TLS + auth token
r = redis.Redis(
    host='prod-cache.abc123.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    ssl_cert_reqs='required',
    ssl_ca_certs='/etc/ssl/certs/ca-certificates.crt',  # System CA bundle; path varies by OS
    password='your-auth-token',  # Load from Secrets Manager
    decode_responses=True,
)
```
For DynamoDB as the primary database that ElastiCache offloads read traffic from, see AWS DynamoDB: Data Modeling, Capacity, Indexes, and Streams. For Lambda functions that use ElastiCache connections (initialized outside the handler for connection reuse), see AWS Lambda: Functions, Event Sources, Layers, and Serverless Patterns.
Sizing an ElastiCache cluster for a high-traffic API, implementing cache invalidation for a write-heavy workload, or debugging cache stampede causing database overload under traffic spikes? Talk to us at Coding Protocols — we help platform teams design caching layers that eliminate database bottlenecks without creating cache consistency problems.


