Problem #3: Design a Web Cache — Part 2 (Sections 9-14)
Stage 2: Deep Technical Drill-Down
9. Deep Dive on Key Components
9.1 Eviction Policies Compared
LRU (Least Recently Used)
How it works:
- Maintain a doubly-linked list + hash map
- On access: move item to head of list
- On eviction: remove item from tail of list
- O(1) for get, set, and eviction
Example:
Cache capacity: 3
Access sequence: A B C D A B E
[A] → cache: [A]
[A B] → cache: [A, B]
[A B C] → cache: [A, B, C] (full)
[B C D] → cache: [B, C, D] (evict A - least recently used)
[C D A] → cache: [C, D, A] (A misses, is re-added; evict B)
[D A B] → cache: [D, A, B] (B misses, is re-added; evict C)
[A B E] → cache: [A, B, E] (evict D - least recently used)

Redis uses approximated LRU: Instead of tracking exact access order (which requires a linked list and extra memory), Redis samples N random keys and evicts the one with the oldest access time. This is ~95% as effective as true LRU with much less memory overhead.
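A minimal sketch of the hash map + doubly-linked list structure (illustrative type and member names, not thread-safe):

```csharp
using System;
using System.Collections.Generic;

public class LruCache<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _map = new();
    private readonly LinkedList<(TKey Key, TValue Value)> _order = new(); // head = most recent

    public LruCache(int capacity) => _capacity = capacity;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);   // O(1): unlink the node
            _order.AddFirst(node); // O(1): move to head (most recently used)
            value = node.Value.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        if (_map.TryGetValue(key, out var existing))
        {
            _order.Remove(existing);
            _map.Remove(key);
        }
        else if (_map.Count >= _capacity)
        {
            var lru = _order.Last!; // tail = least recently used
            _order.RemoveLast();    // O(1) eviction
            _map.Remove(lru.Value.Key);
        }
        _map[key] = _order.AddFirst((key, value));
    }
}
```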
LFU (Least Frequently Used)
How it works:
- Track access count per item
- On eviction: remove item with lowest count
- Break ties with LRU (oldest among least frequent)
- Redis 4.0+ supports approximated LFU
When to prefer over LRU:
- LFU is scan-resistant: a one-off bulk read does not flush the hot set
- If a batch job reads 1M items once, LRU evicts all the hot items
- LFU keeps hot items because their access frequency stays high (see the sketch below)
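A simple sketch of the counting approach (illustrative names, not thread-safe; eviction here is an O(n) scan, whereas production implementations keep frequency buckets for O(1) eviction, and Redis approximates counts via sampling):

```csharp
using System.Collections.Generic;
using System.Linq;

public class LfuCache<TKey, TValue> where TKey : notnull
{
    private record Entry(TValue Value, long Count, long LastAccess);

    private readonly int _capacity;
    private readonly Dictionary<TKey, Entry> _map = new();
    private long _clock; // logical clock for LRU tie-breaking

    public LfuCache(int capacity) => _capacity = capacity;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_map.TryGetValue(key, out var e))
        {
            _map[key] = e with { Count = e.Count + 1, LastAccess = ++_clock };
            value = e.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        if (!_map.ContainsKey(key) && _map.Count >= _capacity)
        {
            // Evict lowest count; break ties with oldest access (LRU among least frequent)
            var victim = _map.OrderBy(kv => kv.Value.Count)
                             .ThenBy(kv => kv.Value.LastAccess)
                             .First().Key;
            _map.Remove(victim);
        }
        _map[key] = new Entry(value, 1, ++_clock);
    }
}
```

TTL-Based Expiration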
Two types:
1. Absolute expiration: Expires at a fixed time (e.g., 10 minutes from creation)
- Good for: data that changes on a schedule, rate limit windows
2. Sliding expiration: Expires after N seconds of inactivity
- Good for: sessions, frequently accessed data that should stay cached while active
Both can be combined: "Expire after 1 hour OR after 5 minutes of inactivity, whichever comes first."
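A sketch with .NET's built-in MemoryCache (Microsoft.Extensions.Caching.Memory), where both limits can be set on one entry; sessionData is a stand-in payload:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());
var sessionData = new { UserId = 42 }; // stand-in payload

cache.Set("session:42", sessionData, new MemoryCacheEntryOptions
{
    AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1), // hard cap from creation
    SlidingExpiration = TimeSpan.FromMinutes(5)              // resets on every access
});
```

9.2 Cache Invalidation Strategies (The Hard Problem)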
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton
Strategy 1: TTL-Based (Passive)
Set TTL on every cache entry → eventually expires.
Staleness window = TTL duration.
Pros: Simple, no active invalidation needed
Cons: Stale data served until TTL expires
When to use: When bounded staleness (e.g., 5 minutes) is acceptable

Strategy 2: Event-Driven Invalidation (Active)
On data change → publish invalidation event → all caches remove the key.
Uses Redis Pub/Sub or Kafka.
Pros: Near-real-time consistency
Cons: Complexity, missed events cause stale cache
When to use: When staleness must be < 1 second (a Pub/Sub subscriber sketch appears in section 10.2)

Strategy 3: Write-Through (Inline)
On write → update cache AND database in the same transaction.
Pros: Cache is always consistent
Cons: Write latency increases, caches data that may never be read
When to use: When staleness is absolutely unacceptable

Strategy 4: Version-Based (Optimistic)
Each cache entry has a version number.
On read: compare cache version with DB version.
If mismatch: invalidate and refetch.
Pros: No stale reads ever
Cons: Requires version tracking in DB (extra column), extra read on every cache hit
When to use: Financial data, inventory counts (see the sketch below)
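A sketch of the read path. Versioned<T> is our own illustrative wrapper; _l2Cache and _db are the same hypothetical helpers used in the other snippets in this design:

```csharp
public record Versioned<T>(T Value, long Version);

async Task<T> GetWithVersionCheckAsync<T>(string key)
{
    var cached = await _l2Cache.GetAsync<Versioned<T>>(key);
    var dbVersion = await _db.GetVersionAsync(key); // the "extra read" from the cons above

    if (cached is not null && cached.Version == dbVersion)
        return cached.Value; // cache is current

    // Miss or version mismatch: refetch the authoritative row and repopulate
    var fresh = await _db.GetAsync<T>(key);
    await _l2Cache.SetAsync(key, new Versioned<T>(fresh, dbVersion), CacheOptions.Default);
    return fresh;
}
```

Invalidation Comparison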
| Strategy | Staleness | Complexity | Consistency | Performance |
|---|---|---|---|---|
| TTL-only | Bounded by TTL | Very Low | Eventual | Best |
| Event-driven | ~100ms | Medium | Near-real-time | Good |
| Write-through | 0ms | Medium | Strong | Write penalty |
| Version-based | 0ms | High | Strong | Read penalty |
Recommended default: TTL + Event-driven for critical keys. This gives you fast reads AND near-real-time consistency where it matters.
9.3 Thundering Herd Protection (Critical for Interviews)
The Problem
When a popular cache key expires, hundreds of concurrent requests simultaneously miss the cache and hit the database with the same query. This can overload the DB.
Solution: Distributed Lock with Single-Flight
```csharp
async Task<T> GetOrSetAsync<T>(string key, Func<Task<T>> factory, CacheOptions options)
{
    // 1. Try cache
    var cached = await _l2Cache.GetAsync<T>(key);
    if (cached is not null) return cached;

    // 2. Try to acquire lock (only one caller proceeds to DB)
    var lockKey = $"lock:{key}";
    var acquired = await _lockProvider.TryAcquireAsync(lockKey, TimeSpan.FromSeconds(5));
    if (acquired)
    {
        try
        {
            // Double-check: another thread may have populated while we waited for lock
            cached = await _l2Cache.GetAsync<T>(key);
            if (cached is not null) return cached;

            // 3. Fetch from DB (only this caller does this)
            var value = await factory();

            // 4. Populate cache
            await _l2Cache.SetAsync(key, value, options);
            return value;
        }
        finally
        {
            await _lockProvider.ReleaseAsync(lockKey);
        }
    }
    else
    {
        // 5. Lock not acquired — another caller is fetching. Wait and retry cache.
        await Task.Delay(50); // Brief wait
        return await GetOrSetAsync(key, factory, options); // Retry (will hit cache)
    }
}
```

9.4 Serialization Strategy
| Format | Size | Speed | Human-Readable | .NET Support |
|---|---|---|---|---|
| System.Text.Json | Medium | Fast | Yes | Built-in |
| MessagePack | Small (30-50% of JSON) | Very fast | No | MessagePack-CSharp |
| Protobuf | Small | Very fast | No | Google.Protobuf |
| MemoryPack | Smallest | Fastest | No | .NET 7+ |
Recommendation: Start with System.Text.Json (zero dependencies). Switch to MessagePack when cache memory or bandwidth becomes a concern.
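A quick sanity check of the size difference: a sketch that serializes the same sample object both ways (the Product type is made up for the demo; MessagePack requires the MessagePack-CSharp package):

```csharp
using System;
using System.Text.Json;
using MessagePack;

var p = new Product { Id = 42, Name = "Widget", Price = 9.99m };

byte[] json = JsonSerializer.SerializeToUtf8Bytes(p);
byte[] msgpack = MessagePackSerializer.Serialize(p);

Console.WriteLine($"JSON: {json.Length} bytes, MessagePack: {msgpack.Length} bytes");

// Sample payload type; the [Key(n)] indexes replace property names on the
// wire, which is where much of MessagePack's size win comes from.
[MessagePackObject]
public class Product
{
    [Key(0)] public int Id { get; set; }
    [Key(1)] public string Name { get; set; } = "";
    [Key(2)] public decimal Price { get; set; }
}
```

For typical DTOs the MessagePack payload lands around a third to half the JSON size, consistent with the table above.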
10. Bottlenecks and Trade-offs
10.1 Cache Consistency vs. Performance
| Decision | Trade-off |
|---|---|
| Longer TTL | Better hit rate, more stale data |
| Shorter TTL | Fresher data, more cache misses |
| Write-through | Zero staleness, higher write latency |
| Event-driven invalidation | Near-real-time, but complexity + risk of missed events |
10.2 L1 Cache Consistency
- Each app instance has its own L1 cache
- After invalidation, some instances may still have stale L1 entries
- Mitigation: Short L1 TTL (30-60s) + Redis Pub/Sub invalidation
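A sketch of the Pub/Sub mitigation with StackExchange.Redis: every instance subscribes to an invalidation channel and drops keys from its local IMemoryCache on each message (the channel name and wiring are illustrative):

```csharp
using Microsoft.Extensions.Caching.Memory;
using StackExchange.Redis;

public class L1InvalidationListener
{
    public L1InvalidationListener(IConnectionMultiplexer redis, IMemoryCache l1)
    {
        // Each message carries the key to drop from the local L1 cache
        redis.GetSubscriber().Subscribe(
            RedisChannel.Literal("cache-invalidation"),
            (_, key) => l1.Remove(key.ToString()));
    }
}

// Publisher side, called wherever data changes:
// await redis.GetSubscriber().PublishAsync(
//     RedisChannel.Literal("cache-invalidation"), "product:42");
```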
10.3 Hot Key Problem
- A single extremely popular key (e.g., homepage product) can overload one Redis shard
- Mitigations:
- L1 cache absorbs most reads (L2 only hit on L1 miss)
- Key replication: store the key on multiple shards with different suffixes
- Local caching with short TTL for known hot keys
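A sketch of the key-replication mitigation, reusing the hypothetical _l2Cache client from the earlier snippets: write N suffixed copies (which typically hash to different shards), then spread reads across them at random:

```csharp
private const int HotKeyReplicas = 8;

async Task SetHotKeyAsync<T>(string key, T value, CacheOptions options)
{
    // Write every copy; each "{key}:{i}" hashes to its own cluster slot
    for (var i = 0; i < HotKeyReplicas; i++)
        await _l2Cache.SetAsync($"{key}:{i}", value, options);
}

Task<T?> GetHotKeyAsync<T>(string key) =>
    // Read a random copy, spreading load across shards
    _l2Cache.GetAsync<T>($"{key}:{Random.Shared.Next(HotKeyReplicas)}");
```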
10.4 Cold Start / Cache Warming
- After a deploy or cache flush, hit rate drops to 0%
- Mitigations:
- Rolling deploys (only one instance cold at a time)
- Cache warming worker: preloads popular keys from DB on startup
- Gradual traffic shift: route small % of traffic first
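A sketch of the warming worker as a .NET BackgroundService; _db.GetTopKeysAsync, _l2Cache, and CacheOptions.Default are hypothetical stand-ins for however you track popular keys and access the DB/cache:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public class CacheWarmingWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // e.g., top 1000 keys by request count over the last hour
        var popularKeys = await _db.GetTopKeysAsync(1000, stoppingToken);

        foreach (var key in popularKeys)
        {
            if (stoppingToken.IsCancellationRequested) break;

            // Preload each key so the first real request is already a hit
            var value = await _db.GetAsync<object>(key);
            await _l2Cache.SetAsync(key, value, CacheOptions.Default);
        }
    }
}
```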
10.5 Big Key Problem
- Caching large objects (1MB+) slows Redis and consumes bandwidth
- Mitigations:
- Don't cache objects > 100KB
- Compress large values before caching
- Split large objects into smaller keys
11. Failure Scenarios
11.1 Redis Primary Crash
- Impact: All cache reads/writes fail
- Behavior: Fail-open to database (higher latency, more DB load)
- Recovery: Redis Sentinel promotes replica (< 10s), clients reconnect
- Monitoring: Alert on Redis connection errors, track DB fallback rate
11.2 Cache Miss Storm (Thundering Herd)
- Scenario: Popular key expires, 1000 requests simultaneously hit DB
- Handling: Distributed lock ensures only one request fetches from DB
- Alternative: "Stale-while-revalidate" — serve stale value while one request refreshes
11.3 Cache Poisoning
- Scenario: Corrupt or wrong data stored in cache
- Impact: All readers get bad data until TTL expires
- Handling: Version-based validation, manual cache flush, monitoring for data anomalies
11.4 Memory Exhaustion
- Scenario: Cache grows beyond available memory
- Impact: Eviction increases, hit rate drops, Redis may OOM
- Prevention: Set `maxmemory` + `maxmemory-policy` (e.g., `allkeys-lru`)
- Monitoring: Track memory usage, eviction rate
11.5 L1/L2 Inconsistency
- Scenario: Key is invalidated in L2 but still present in L1 of some instances
- Impact: Some users see stale data briefly
- Handling: Short L1 TTL (30-60s) bounds the window. Redis Pub/Sub propagates invalidation to all L1 caches.
11.6 Pub/Sub Message Loss
- Scenario: Invalidation event lost → L1 caches retain stale data
- Impact: Stale data served until L1 TTL expires
- Handling: L1 TTL acts as the safety net. For critical data, don't use L1 cache.
11.7 Serialization/Deserialization Failure
- Scenario: Schema change makes cached data incompatible
- Impact: Deserialization throws, cache returns error
- Handling: Catch deserialization errors → treat as cache miss → fetch from DB
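A sketch of that fallback, assuming System.Text.Json as the serializer (JsonException is its failure type) and the hypothetical _l2Cache helper from the earlier snippets:

```csharp
async Task<T> GetResilientAsync<T>(string key, Func<Task<T>> factory, CacheOptions options)
{
    try
    {
        var cached = await _l2Cache.GetAsync<T>(key);
        if (cached is not null) return cached;
    }
    catch (JsonException)
    {
        // Schema drift: the cached payload no longer matches T.
        // Drop the bad entry and fall through to the DB path.
        await _l2Cache.RemoveAsync(key);
    }

    var value = await factory();
    await _l2Cache.SetAsync(key, value, options);
    return value;
}
```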
12. Scaling Strategy
Phase 1: Single Redis Instance
App Instances (N) → Single Redis (primary + replica)
- Handles up to 100K ops/sec
- Simple, cost-effective

Phase 2: Redis Cluster
App Instances → Redis Cluster (6+ nodes)
- Horizontal scaling: add shards as data grows
- Each shard handles its portion of the key space

Phase 3: Multi-Level Caching
L1 (In-process) → L2 (Redis Cluster) → L3 (CDN for static content)
- L1: ~100MB per instance, 30-60s TTL
- L2: ~100GB shared, 5-30min TTL
- L3: CDN for HTML, images, API responses

Phase 4: Multi-Region
Per-region L2 Redis → Cross-region async replication
- Each region has independent L2 cache
- Writes go to primary region → replicate to others
- Accept brief cross-region staleness

13. Security Design
13.1 Data Protection
- Encryption in transit: TLS between app and Redis (Redis 6+)
- Encryption at rest: If caching sensitive data (PII, tokens), encrypt values before caching
- Access control: Redis ACLs (Redis 6+) — restrict which services can access which key prefixes
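A sketch of encrypting values before caching with AES-GCM from System.Security.Cryptography (the two-argument AesGcm constructor assumes .NET 8+; key storage and rotation, e.g., via a vault, are out of scope here):

```csharp
using System.Security.Cryptography;

public static class CachePayloadCrypto
{
    public static byte[] Encrypt(byte[] plaintext, byte[] key)
    {
        var nonce = RandomNumberGenerator.GetBytes(AesGcm.NonceByteSizes.MaxSize); // 12 bytes
        var tag = new byte[AesGcm.TagByteSizes.MaxSize];                           // 16 bytes
        var ciphertext = new byte[plaintext.Length];

        using var aes = new AesGcm(key, tag.Length);
        aes.Encrypt(nonce, plaintext, ciphertext, tag);

        // Cache value = nonce + tag + ciphertext; decryption slices these
        // back apart and calls aes.Decrypt with the same layout.
        var payload = new byte[nonce.Length + tag.Length + ciphertext.Length];
        nonce.CopyTo(payload, 0);
        tag.CopyTo(payload, nonce.Length);
        ciphertext.CopyTo(payload, nonce.Length + tag.Length);
        return payload;
    }
}
```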
13.2 Cache Poisoning Prevention
- Only allow authenticated services to write to cache
- Validate data integrity on cache read (checksum or signature)
- Audit log for cache flush operations
13.3 Sensitive Data
- Never cache: passwords, credit card numbers, API secrets
- Cache with encryption: PII (names, emails), session tokens
- Cache freely: public product data, configuration, reference data
14. Observability
14.1 Key Metrics
```
# Counters
cache_operations_total{operation="get|set|delete", result="hit|miss|error", level="l1|l2"}
cache_invalidations_total{method="ttl|explicit|pubsub"}
cache_serialization_errors_total

# Gauges
cache_keys_total{level="l1|l2"}
cache_memory_bytes{level="l1|l2"}
cache_hit_rate{level="l1|l2"}

# Histograms
cache_operation_duration_ms{operation, level}
cache_value_size_bytes
```
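One way to emit these from a .NET service is the built-in System.Diagnostics.Metrics API (a sketch; an OpenTelemetry exporter can publish the Meter to Prometheus, and the "WebCache" meter name is illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class CacheMetrics
{
    private static readonly Meter Meter = new("WebCache");

    private static readonly Counter<long> Operations =
        Meter.CreateCounter<long>("cache_operations_total");

    // Record one get, tagged so dashboards can slice by result and level
    public static void RecordGet(string level, bool hit) =>
        Operations.Add(1,
            new KeyValuePair<string, object?>("operation", "get"),
            new KeyValuePair<string, object?>("result", hit ? "hit" : "miss"),
            new KeyValuePair<string, object?>("level", level));
}
```

14.2 SLOs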
| SLI | SLO |
|---|---|
| Cache hit latency (L1) | p99 < 0.1ms |
| Cache hit latency (L2) | p99 < 2ms |
| Hit rate (combined L1+L2) | > 85% |
| Availability | 99.99% |
| Staleness (event-driven keys) | < 5 seconds |
14.3 Alerting
| Alert | Condition | Severity |
|---|---|---|
| Hit rate drop | < 70% for 5 minutes | P2 (Warning) |
| Redis unreachable | Connection errors > 0 for 30s | P1 (Critical) |
| Memory usage high | > 80% of maxmemory | P2 (Warning) |
| Eviction rate spike | > 1000/sec | P3 (Info) |
| DB fallback rate high | > 30% of reads hitting DB | P2 (Warning) |