
Problem #3: Design a Web Cache — Part 2 (Sections 9-14)

Stage 2: Deep Technical Drill-Down


9. Deep Dive on Key Components

9.1 Eviction Policies Compared

LRU (Least Recently Used)

```text
How it works:
- Maintain a doubly-linked list + hash map
- On access: move item to head of list
- On eviction: remove item from tail of list
- O(1) for get, set, and eviction

Example:
  Cache capacity: 3
  Access sequence: A B C D A B E

  Access A → cache: [A]
  Access B → cache: [A, B]
  Access C → cache: [A, B, C] (full)
  Access D → cache: [B, C, D] (miss; evict A, the least recently used)
  Access A → cache: [C, D, A] (miss; A was already evicted, so evict B)
  Access B → cache: [D, A, B] (miss; evict C)
  Access E → cache: [A, B, E] (miss; evict D)
```

Redis uses approximated LRU: Instead of tracking exact access order (which requires a linked list and extra memory), Redis samples N random keys and evicts the one with the oldest access time. This is ~95% as effective as true LRU with much less memory overhead.
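
To make the list-plus-map mechanics concrete, here is a minimal, non-thread-safe sketch in C#. The class and member names are ours for illustration, not a library API:

```csharp
using System.Collections.Generic;

// Minimal LRU cache: Dictionary gives O(1) lookup; LinkedList tracks recency
// (head = most recently used, tail = least recently used). Not thread-safe.
public class LruCache<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _map = new();
    private readonly LinkedList<(TKey Key, TValue Value)> _order = new();

    public LruCache(int capacity) => _capacity = capacity;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);       // O(1): we hold the node reference
            _order.AddFirst(node);     // mark as most recently used
            value = node.Value.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        if (_map.Remove(key, out var existing))
            _order.Remove(existing);                // overwrite: drop the old node
        else if (_map.Count >= _capacity)
        {
            var lru = _order.Last!;                 // tail = least recently used
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);             // evict
        }
        _map[key] = _order.AddFirst((key, value));  // insert as most recent
    }
}
```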

LFU (Least Frequently Used)

```text
How it works:
- Track access count per item
- On eviction: remove item with lowest count
- Break ties with LRU (oldest among least frequent)
- Redis 4.0+ supports approximated LFU

When to prefer over LRU:
- LFU is scan-resistant; LRU is not
- If a batch job reads 1M items once, LRU evicts all the hot items to make room
- LFU keeps the hot items, because their access counts stay high
```

TTL-Based Expiration

```text
Two types:
1. Absolute expiration: Expires at a fixed time (e.g., 10 minutes from creation)
   - Good for: data that changes on a schedule, rate limit windows

2. Sliding expiration: Expires after N seconds of inactivity
   - Good for: sessions, frequently accessed data that should stay cached while active

Both can be combined: "Expire after 1 hour OR after 5 minutes of inactivity, whichever comes first"
```
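
In .NET, Microsoft.Extensions.Caching.Memory expresses both policies directly, and they compose as described above. A minimal sketch; the key and the cached value are illustrative:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());
var userProfile = new { Name = "Ada" };   // placeholder value

// Entry dies after 1 hour no matter what, or after 5 idle minutes,
// whichever comes first.
cache.Set("session:42", userProfile, new MemoryCacheEntryOptions
{
    AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1), // hard ceiling
    SlidingExpiration = TimeSpan.FromMinutes(5)              // reset on each access
});
```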

9.2 Cache Invalidation Strategies (The Hard Problem)

"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton

Strategy 1: TTL-Based (Passive)

```text
Set TTL on every cache entry → eventually expires.
Staleness window = TTL duration.

Pros: Simple, no active invalidation needed
Cons: Stale data served until TTL expires
When to use: When bounded staleness (e.g., 5 minutes) is acceptable
```

Strategy 2: Event-Driven Invalidation (Active)

```text
On data change → publish invalidation event → all caches remove the key.
Uses Redis Pub/Sub or Kafka.

Pros: Near-real-time consistency
Cons: Complexity, missed events cause stale cache
When to use: When staleness must be < 1 second
```
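
A sketch of the publisher/subscriber pair using StackExchange.Redis with Redis Pub/Sub; the channel name, key format, and connection string are illustrative:

```csharp
using Microsoft.Extensions.Caching.Memory;
using StackExchange.Redis;

var l1Cache = new MemoryCache(new MemoryCacheOptions());
var redis = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var sub = redis.GetSubscriber();

// Every app instance subscribes; on an event, it drops the key from its L1.
await sub.SubscribeAsync(RedisChannel.Literal("cache:invalidate"), (_, key) =>
{
    l1Cache.Remove(key.ToString());
});

// Writer side: after committing a data change, announce the affected key.
await sub.PublishAsync(RedisChannel.Literal("cache:invalidate"), "product:42");
```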

Strategy 3: Write-Through (Inline)

```text
On write → update the cache and the database together, synchronously, in the same write path.

Pros: Cache is always consistent
Cons: Write latency increases, caches data that may never be read
When to use: When staleness is absolutely unacceptable
```

Strategy 4: Version-Based (Optimistic)

```text
Each cache entry has a version number.
On read: compare cache version with DB version.
If mismatch: invalidate and refetch.

Pros: No stale reads ever
Cons: Requires version tracking in DB (extra column), extra read on every cache hit
When to use: Financial data, inventory counts
```
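
A sketch of the version-checked read path, in the same style as the code in section 9.3; `_l2Cache`, `_db`, and the method names are illustrative stand-ins:

```csharp
// Trust the cached value only if its version still matches the row version
// in the database; otherwise invalidate and refetch.
public record VersionedEntry<T>(T Value, long Version);

async Task<T> GetValidatedAsync<T>(string key)
{
    var entry = await _l2Cache.GetAsync<VersionedEntry<T>>(key);
    var dbVersion = await _db.GetVersionAsync(key);    // cheap read: version column only

    if (entry is not null && entry.Version == dbVersion)
        return entry.Value;                            // cache is current

    var fresh = await _db.LoadAsync<T>(key);           // mismatch or miss: refetch
    await _l2Cache.SetAsync(key, new VersionedEntry<T>(fresh, dbVersion));
    return fresh;
}
```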

Invalidation Comparison

| Strategy      | Staleness      | Complexity | Consistency    | Performance   |
|---------------|----------------|------------|----------------|---------------|
| TTL-only      | Bounded by TTL | Very Low   | Eventual       | Best          |
| Event-driven  | ~100ms         | Medium     | Near-real-time | Good          |
| Write-through | 0ms            | Medium     | Strong         | Write penalty |
| Version-based | 0ms            | High       | Strong         | Read penalty  |

Recommended default: TTL + Event-driven for critical keys. This gives you fast reads AND near-real-time consistency where it matters.


9.3 Thundering Herd Protection (Critical for Interviews)

The Problem

When a popular cache key expires, hundreds of concurrent requests simultaneously miss the cache and hit the database with the same query. This can overload the DB.

Solution: Distributed Lock with Single-Flight

```csharp
async Task<T> GetOrSetAsync<T>(string key, Func<Task<T>> factory, CacheOptions options)
{
    // 1. Try cache
    var cached = await _l2Cache.GetAsync<T>(key);
    if (cached is not null) return cached;

    // 2. Try to acquire lock (only one caller proceeds to DB)
    var lockKey = $"lock:{key}";
    var acquired = await _lockProvider.TryAcquireAsync(lockKey, TimeSpan.FromSeconds(5));

    if (acquired)
    {
        try
        {
            // Double-check: another caller may have populated the cache before we acquired the lock
            cached = await _l2Cache.GetAsync<T>(key);
            if (cached is not null) return cached;

            // 3. Fetch from DB (only this caller does this)
            var value = await factory();

            // 4. Populate cache
            await _l2Cache.SetAsync(key, value, options);
            return value;
        }
        finally
        {
            await _lockProvider.ReleaseAsync(lockKey);
        }
    }
    else
    {
        // 5. Lock not acquired: another caller is fetching. Wait briefly, then retry.
        // The 5s lock TTL bounds the wait if the holder dies.
        await Task.Delay(50);
        return await GetOrSetAsync(key, factory, options); // Usually a cache hit by now
    }
}
```

9.4 Serialization Strategy

| Format           | Size                   | Speed     | Human-Readable | .NET Support       |
|------------------|------------------------|-----------|----------------|--------------------|
| System.Text.Json | Medium                 | Fast      | Yes            | Built-in           |
| MessagePack      | Small (30-50% of JSON) | Very fast | No             | MessagePack-CSharp |
| Protobuf         | Small                  | Very fast | No             | Google.Protobuf    |
| MemoryPack       | Smallest               | Fastest   | No             | .NET 7+            |

Recommendation: Start with System.Text.Json (zero dependencies). Switch to MessagePack when cache memory or bandwidth becomes a concern.
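
For reference, the MessagePack-CSharp switch looks roughly like this; the `Product` type is a hypothetical example:

```csharp
using MessagePack;

// Round-trips as a compact binary payload, typically 30-50% of the JSON size.
byte[] bytes = MessagePackSerializer.Serialize(new Product { Id = 42, Name = "Widget" });
Product back = MessagePackSerializer.Deserialize<Product>(bytes);

// The default resolver requires explicit member keys.
[MessagePackObject]
public class Product
{
    [Key(0)] public int Id { get; set; }
    [Key(1)] public string Name { get; set; } = "";
}
```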


10. Bottlenecks and Trade-offs

10.1 Cache Consistency vs. Performance

| Decision                   | Trade-off                                               |
|----------------------------|---------------------------------------------------------|
| Longer TTL                 | Better hit rate, more stale data                        |
| Shorter TTL                | Fresher data, more cache misses                         |
| Write-through              | Zero staleness, higher write latency                    |
| Event-driven invalidation  | Near-real-time, but complexity + risk of missed events  |

10.2 L1 Cache Consistency

  • Each app instance has its own L1 cache
  • After invalidation, some instances may still have stale L1 entries
  • Mitigation: Short L1 TTL (30-60s) + Redis Pub/Sub invalidation

10.3 Hot Key Problem

  • A single extremely popular key (e.g., homepage product) can overload one Redis shard
  • Mitigations:
    1. L1 cache absorbs most reads (L2 only hit on L1 miss)
    2. Key replication: store the key on multiple shards with different suffixes (see the sketch after this list)
    3. Local caching with short TTL for known hot keys
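
A minimal sketch of the key-replication scheme; `ICache`, the replica count, and the key format are assumptions for illustration:

```csharp
using System;
using System.Threading.Tasks;

const int Replicas = 4;   // illustrative; tune to shard count and key heat

// Write the hot value under N suffixed keys; each suffix usually hashes to a
// different shard, so reads spread across the cluster.
async Task SetHotKeyAsync(ICache cache, string key, string value)
{
    for (var i = 0; i < Replicas; i++)
        await cache.SetAsync($"{key}:{i}", value);
}

// Read a random replica to spread the load.
async Task<string?> GetHotKeyAsync(ICache cache, string key)
    => await cache.GetAsync($"{key}:{Random.Shared.Next(Replicas)}");

public interface ICache   // stand-in for the L2 cache client used elsewhere
{
    Task<string?> GetAsync(string key);
    Task SetAsync(string key, string value);
}
```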

10.4 Cold Start / Cache Warming

  • After a deploy or cache flush, hit rate drops to 0%
  • Mitigations:
    1. Rolling deploys (only one instance cold at a time)
    2. Cache warming worker: preloads popular keys from DB on startup
    3. Gradual traffic shift: route a small % of traffic to the cold instance first

10.5 Big Key Problem

  • Caching large objects (1MB+) slows Redis and consumes bandwidth
  • Mitigations:
    1. Don't cache objects > 100KB
    2. Compress large values before caching (see the sketch after this list)
    3. Split large objects into smaller keys
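
A sketch of the compression mitigation using the built-in GZipStream; the 100KB threshold echoes the guideline above:

```csharp
using System.IO;
using System.IO.Compression;

// Compress values above a size threshold before they go to Redis.
static byte[] MaybeCompress(byte[] raw, int thresholdBytes = 100 * 1024)
{
    if (raw.Length < thresholdBytes) return raw;

    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
        gzip.Write(raw, 0, raw.Length);
    return output.ToArray();   // caller must record that the value is compressed
}
```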

11. Failure Scenarios

11.1 Redis Primary Crash

  • Impact: All cache reads/writes fail
  • Behavior: Fail-open to database (higher latency, more DB load)
  • Recovery: Redis Sentinel promotes replica (< 10s), clients reconnect
  • Monitoring: Alert on Redis connection errors, track DB fallback rate

11.2 Cache Miss Storm (Thundering Herd)

  • Scenario: Popular key expires, 1000 requests simultaneously hit DB
  • Handling: Distributed lock ensures only one request fetches from DB
  • Alternative: "Stale-while-revalidate" serves the stale value while one request refreshes it in the background (sketched below)
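
A sketch of stale-while-revalidate in the style of section 9.3; `Entry<T>` and `_l2Cache` are illustrative, and the entry's hard TTL in Redis is assumed to be longer than the soft TTL stored inside the value:

```csharp
public record Entry<T>(T Value, DateTimeOffset SoftExpiry);

async Task<T> GetSwrAsync<T>(string key, Func<Task<T>> factory, TimeSpan softTtl)
{
    var entry = await _l2Cache.GetAsync<Entry<T>>(key);
    if (entry is not null)
    {
        if (entry.SoftExpiry <= DateTimeOffset.UtcNow)
            _ = RefreshAsync(key, factory, softTtl);   // fire-and-forget refresh
        return entry.Value;                            // serve stale (or fresh) immediately
    }
    return await RefreshAsync(key, factory, softTtl);  // true miss: caller must wait
}

async Task<T> RefreshAsync<T>(string key, Func<Task<T>> factory, TimeSpan softTtl)
{
    var value = await factory();
    await _l2Cache.SetAsync(key, new Entry<T>(value, DateTimeOffset.UtcNow + softTtl));
    return value;
}
```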

11.3 Cache Poisoning

  • Scenario: Corrupt or wrong data stored in cache
  • Impact: All readers get bad data until TTL expires
  • Handling: Version-based validation, manual cache flush, monitoring for data anomalies

11.4 Memory Exhaustion

  • Scenario: Cache grows beyond available memory
  • Impact: Eviction increases, hit rate drops, Redis may OOM
  • Prevention: Set maxmemory + maxmemory-policy (e.g., allkeys-lru)
  • Monitoring: Track memory usage, eviction rate
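
Concretely, the prevention step is two lines of Redis configuration; the values here are illustrative:

```text
# redis.conf (values illustrative)
maxmemory 4gb
maxmemory-policy allkeys-lru   # or allkeys-lfu (Redis 4.0+)
```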

11.5 L1/L2 Inconsistency

  • Scenario: Key is invalidated in L2 but still present in L1 of some instances
  • Impact: Some users see stale data briefly
  • Handling: Short L1 TTL (30-60s) bounds the window. Redis Pub/Sub propagates invalidation to all L1 caches.

11.6 Pub/Sub Message Loss

  • Scenario: Invalidation event lost → L1 caches retain stale data
  • Impact: Stale data served until L1 TTL expires
  • Handling: L1 TTL acts as the safety net. For critical data, don't use L1 cache.

11.7 Serialization/Deserialization Failure

  • Scenario: Schema change makes cached data incompatible
  • Impact: Deserialization throws, cache returns error
  • Handling: Catch deserialization errors → treat as cache miss → fetch from DB
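
In code, that handling is a try/catch around the deserialize step; `Product`, `_l2Cache`, and `LoadFromDbAsync` are illustrative names:

```csharp
using System.Text.Json;

async Task<Product> GetProductAsync(string key, byte[]? cachedBytes)
{
    if (cachedBytes is not null)
    {
        try
        {
            var cached = JsonSerializer.Deserialize<Product>(cachedBytes);
            if (cached is not null) return cached;
        }
        catch (JsonException)
        {
            await _l2Cache.DeleteAsync(key);   // drop the incompatible entry
        }
    }
    return await LoadFromDbAsync(key);         // degrade to a cache miss
}
```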

12. Scaling Strategy

Phase 1: Single Redis Instance

```text
App Instances (N) → Single Redis (primary + replica)
- Handles up to 100K ops/sec
- Simple, cost-effective
```

Phase 2: Redis Cluster

```text
App Instances → Redis Cluster (6+ nodes)
- Horizontal scaling: add shards as data grows
- Each shard handles its portion of the key space
```

Phase 3: Multi-Level Caching

```text
L1 (In-process) → L2 (Redis Cluster) → L3 (CDN for static content)
- L1: ~100MB per instance, 30-60s TTL
- L2: ~100GB shared, 5-30min TTL
- L3: CDN for HTML, images, API responses
```

Phase 4: Multi-Region

```text
Per-region L2 Redis → Cross-region async replication
- Each region has independent L2 cache
- Writes go to primary region → replicate to others
- Accept brief cross-region staleness
```

13. Security Design

13.1 Data Protection

  • Encryption in transit: TLS between app and Redis (Redis 6+)
  • Encryption at rest: If caching sensitive data (PII, tokens), encrypt values before caching (see the sketch after this list)
  • Access control: Redis ACLs (Redis 6+) — restrict which services can access which key prefixes
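
A sketch of value encryption with AES-GCM (the two-parameter AesGcm constructor is .NET 8+); key management via a KMS or secret store is out of scope here:

```csharp
using System.Linq;
using System.Security.Cryptography;

// Encrypt a serialized value before it is written to the cache. The caller
// stores nonce + tag + ciphertext together as the cache value.
static byte[] EncryptForCache(byte[] plaintext, byte[] key)
{
    var nonce = RandomNumberGenerator.GetBytes(12);   // AES-GCM standard nonce size
    var tag = new byte[16];
    var ciphertext = new byte[plaintext.Length];

    using var aes = new AesGcm(key, tagSizeInBytes: 16);
    aes.Encrypt(nonce, plaintext, ciphertext, tag);

    return nonce.Concat(tag).Concat(ciphertext).ToArray();
}
```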

13.2 Cache Poisoning Prevention

  • Only allow authenticated services to write to cache
  • Validate data integrity on cache read (checksum or signature)
  • Audit log for cache flush operations

13.3 Sensitive Data

  • Never cache: passwords, credit card numbers, API secrets
  • Cache with encryption: PII (names, emails), session tokens
  • Cache freely: public product data, configuration, reference data

14. Observability

14.1 Key Metrics

```text
# Counters
cache_operations_total{operation="get|set|delete", result="hit|miss|error", level="l1|l2"}
cache_invalidations_total{method="ttl|explicit|pubsub"}
cache_serialization_errors_total

# Gauges
cache_keys_total{level="l1|l2"}
cache_memory_bytes{level="l1|l2"}
cache_hit_rate{level="l1|l2"}

# Histograms
cache_operation_duration_ms{operation, level}
cache_value_size_bytes
```
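
These map directly onto System.Diagnostics.Metrics; a sketch of emitting the main counter (the meter name and tag values are illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("WebCache");
var ops = meter.CreateCounter<long>("cache_operations_total");

// Record an L1 hit on a get.
ops.Add(1,
    new KeyValuePair<string, object?>("operation", "get"),
    new KeyValuePair<string, object?>("result", "hit"),
    new KeyValuePair<string, object?>("level", "l1"));
```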

14.2 SLOs

| SLI                           | SLO         |
|-------------------------------|-------------|
| Cache hit latency (L1)        | p99 < 0.1ms |
| Cache hit latency (L2)        | p99 < 2ms   |
| Hit rate (combined L1+L2)     | > 85%       |
| Availability                  | 99.99%      |
| Staleness (event-driven keys) | < 5 seconds |

14.3 Alerting

| Alert                 | Condition                     | Severity      |
|-----------------------|-------------------------------|---------------|
| Hit rate drop         | < 70% for 5 minutes           | P2 (Warning)  |
| Redis unreachable     | Connection errors > 0 for 30s | P1 (Critical) |
| Memory usage high     | > 80% of maxmemory            | P2 (Warning)  |
| Eviction rate spike   | > 1000/sec                    | P3 (Info)     |
| DB fallback rate high | > 30% of reads hitting DB     | P2 (Warning)  |