
Problem #3: Design a Web Cache — Part 2 (Sections 9-14)

Stage 2: Deep Technical Drill-Down


9. Deep Dive on Key Components

9.1 Eviction Policies Compared

LRU (Least Recently Used)

```text
How it works:
- Maintain a doubly-linked list + hash map
- On access: move item to head of list
- On eviction: remove item from tail of list
- O(1) for get, set, and eviction

Example:
  Cache capacity: 3
  Access sequence: A B C D A B E

  Access A → cache: [A]
  Access B → cache: [A, B]
  Access C → cache: [A, B, C] (full)
  Access D → cache: [B, C, D] (miss; evict A, the least recently used)
  Access A → cache: [C, D, A] (miss; A was already evicted, so evict B)
  Access B → cache: [D, A, B] (miss; evict C)
  Access E → cache: [A, B, E] (miss; evict D)
```

Redis uses approximated LRU: Instead of tracking exact access order (which requires a linked list and extra memory), Redis samples N random keys and evicts the one with the oldest access time. This is ~95% as effective as true LRU with much less memory overhead.
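
To make the list-plus-map mechanics concrete, here is a minimal, non-thread-safe sketch in C#. The class and member names are ours for illustration, not a library API:

```csharp
using System.Collections.Generic;

// Minimal LRU cache: Dictionary gives O(1) lookup; LinkedList tracks recency
// (head = most recently used, tail = least recently used). Not thread-safe.
public class LruCache<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _map = new();
    private readonly LinkedList<(TKey Key, TValue Value)> _order = new();

    public LruCache(int capacity) => _capacity = capacity;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);       // O(1): we hold the node reference
            _order.AddFirst(node);     // mark as most recently used
            value = node.Value.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        if (_map.Remove(key, out var existing))
            _order.Remove(existing);                // overwrite: drop the old node
        else if (_map.Count >= _capacity)
        {
            var lru = _order.Last!;                 // tail = least recently used
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);             // evict
        }
        _map[key] = _order.AddFirst((key, value));  // insert as most recent
    }
}
```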

LFU (Least Frequently Used)

```text
How it works:
- Track access count per item
- On eviction: remove item with lowest count
- Break ties with LRU (oldest among least frequent)
- Redis 4.0+ supports approximated LFU

When to prefer over LRU:
- LFU is scan-resistant; LRU is not
- If a batch job reads 1M items once, LRU evicts all the hot items to make room
- LFU keeps the hot items, because their access counts stay high
```

TTL-Based Expiration

```text
Two types:
1. Absolute expiration: Expires at a fixed time (e.g., 10 minutes from creation)
   - Good for: data that changes on a schedule, rate limit windows

2. Sliding expiration: Expires after N seconds of inactivity
   - Good for: sessions, frequently accessed data that should stay cached while active

Both can be combined: "Expire after 1 hour OR after 5 minutes of inactivity, whichever comes first"
```
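
In .NET, Microsoft.Extensions.Caching.Memory expresses both policies directly, and they compose as described above. A minimal sketch; the key and the cached value are illustrative:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());
var userProfile = new { Name = "Ada" };   // placeholder value

// Entry dies after 1 hour no matter what, or after 5 idle minutes,
// whichever comes first.
cache.Set("session:42", userProfile, new MemoryCacheEntryOptions
{
    AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1), // hard ceiling
    SlidingExpiration = TimeSpan.FromMinutes(5)              // reset on each access
});
```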

9.2 Cache Invalidation Strategies (The Hard Problem)

"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton

Strategy 1: TTL-Based (Passive)

```text
Set TTL on every cache entry → eventually expires.
Staleness window = TTL duration.

Pros: Simple, no active invalidation needed
Cons: Stale data served until TTL expires
When to use: When bounded staleness (e.g., 5 minutes) is acceptable
```

Strategy 2: Event-Driven Invalidation (Active)

```text
On data change → publish invalidation event → all caches remove the key.
Uses Redis Pub/Sub or Kafka.

Pros: Near-real-time consistency
Cons: Complexity, missed events cause stale cache
When to use: When staleness must be < 1 second
```
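
A sketch of the publisher/subscriber pair using StackExchange.Redis with Redis Pub/Sub; the channel name, key format, and connection string are illustrative:

```csharp
using Microsoft.Extensions.Caching.Memory;
using StackExchange.Redis;

var l1Cache = new MemoryCache(new MemoryCacheOptions());
var redis = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var sub = redis.GetSubscriber();

// Every app instance subscribes; on an event, it drops the key from its L1.
await sub.SubscribeAsync(RedisChannel.Literal("cache:invalidate"), (_, key) =>
{
    l1Cache.Remove(key.ToString());
});

// Writer side: after committing a data change, announce the affected key.
await sub.PublishAsync(RedisChannel.Literal("cache:invalidate"), "product:42");
```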

Strategy 3: Write-Through (Inline)

```text
On write → update the cache and the database together, synchronously, in the same write path.

Pros: Cache is always consistent
Cons: Write latency increases, caches data that may never be read
When to use: When staleness is absolutely unacceptable
```

Strategy 4: Version-Based (Optimistic)

```text
Each cache entry has a version number.
On read: compare cache version with DB version.
If mismatch: invalidate and refetch.

Pros: No stale reads ever
Cons: Requires version tracking in DB (extra column), extra read on every cache hit
When to use: Financial data, inventory counts
```
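
A sketch of the version-checked read path, in the same style as the code in section 9.3; `_l2Cache`, `_db`, and the method names are illustrative stand-ins:

```csharp
// Trust the cached value only if its version still matches the row version
// in the database; otherwise invalidate and refetch.
public record VersionedEntry<T>(T Value, long Version);

async Task<T> GetValidatedAsync<T>(string key)
{
    var entry = await _l2Cache.GetAsync<VersionedEntry<T>>(key);
    var dbVersion = await _db.GetVersionAsync(key);    // cheap read: version column only

    if (entry is not null && entry.Version == dbVersion)
        return entry.Value;                            // cache is current

    var fresh = await _db.LoadAsync<T>(key);           // mismatch or miss: refetch
    await _l2Cache.SetAsync(key, new VersionedEntry<T>(fresh, dbVersion));
    return fresh;
}
```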

Invalidation Comparison

| Strategy      | Staleness      | Complexity | Consistency    | Performance   |
|---------------|----------------|------------|----------------|---------------|
| TTL-only      | Bounded by TTL | Very Low   | Eventual       | Best          |
| Event-driven  | ~100ms         | Medium     | Near-real-time | Good          |
| Write-through | 0ms            | Medium     | Strong         | Write penalty |
| Version-based | 0ms            | High       | Strong         | Read penalty  |

Recommended default: TTL + Event-driven for critical keys. This gives you fast reads AND near-real-time consistency where it matters.


9.3 Thundering Herd Protection (Critical for Interviews)

The Problem

When a popular cache key expires, hundreds of concurrent requests simultaneously miss the cache and hit the database with the same query. This can overload the DB.

Solution: Distributed Lock with Single-Flight

```csharp
async Task<T> GetOrSetAsync<T>(string key, Func<Task<T>> factory, CacheOptions options)
{
    // 1. Try cache
    var cached = await _l2Cache.GetAsync<T>(key);
    if (cached is not null) return cached;

    // 2. Try to acquire lock (only one caller proceeds to DB)
    var lockKey = $"lock:{key}";
    var acquired = await _lockProvider.TryAcquireAsync(lockKey, TimeSpan.FromSeconds(5));

    if (acquired)
    {
        try
        {
            // Double-check: another caller may have populated the cache before we acquired the lock
            cached = await _l2Cache.GetAsync<T>(key);
            if (cached is not null) return cached;

            // 3. Fetch from DB (only this caller does this)
            var value = await factory();

            // 4. Populate cache
            await _l2Cache.SetAsync(key, value, options);
            return value;
        }
        finally
        {
            await _lockProvider.ReleaseAsync(lockKey);
        }
    }
    else
    {
        // 5. Lock not acquired: another caller is fetching. Wait briefly, then retry.
        // The 5s lock TTL bounds the wait if the holder dies.
        await Task.Delay(50);
        return await GetOrSetAsync(key, factory, options); // Usually a cache hit by now
    }
}
```

9.4 Serialization Strategy

| Format           | Size                   | Speed     | Human-Readable | .NET Support       |
|------------------|------------------------|-----------|----------------|--------------------|
| System.Text.Json | Medium                 | Fast      | Yes            | Built-in           |
| MessagePack      | Small (30-50% of JSON) | Very fast | No             | MessagePack-CSharp |
| Protobuf         | Small                  | Very fast | No             | Google.Protobuf    |
| MemoryPack       | Smallest               | Fastest   | No             | .NET 7+            |

Recommendation: Start with System.Text.Json (zero dependencies). Switch to MessagePack when cache memory or bandwidth becomes a concern.
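
For reference, the MessagePack-CSharp switch looks roughly like this; the `Product` type is a hypothetical example:

```csharp
using MessagePack;

// Round-trips as a compact binary payload, typically 30-50% of the JSON size.
byte[] bytes = MessagePackSerializer.Serialize(new Product { Id = 42, Name = "Widget" });
Product back = MessagePackSerializer.Deserialize<Product>(bytes);

// The default resolver requires explicit member keys.
[MessagePackObject]
public class Product
{
    [Key(0)] public int Id { get; set; }
    [Key(1)] public string Name { get; set; } = "";
}
```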


10. Bottlenecks and Trade-offs

10.1 Cache Consistency vs. Performance

| Decision                   | Trade-off                                               |
|----------------------------|---------------------------------------------------------|
| Longer TTL                 | Better hit rate, more stale data                        |
| Shorter TTL                | Fresher data, more cache misses                         |
| Write-through              | Zero staleness, higher write latency                    |
| Event-driven invalidation  | Near-real-time, but complexity + risk of missed events  |

10.2 L1 Cache Consistency

  • Each app instance has its own L1 cache
  • After invalidation, some instances may still have stale L1 entries
  • Mitigation: Short L1 TTL (30-60s) + Redis Pub/Sub invalidation

10.3 Hot Key Problem

  • A single extremely popular key (e.g., homepage product) can overload one Redis shard
  • Mitigations:
    1. L1 cache absorbs most reads (L2 only hit on L1 miss)
    2. Key replication: store the key on multiple shards with different suffixes (see the sketch after this list)
    3. Local caching with short TTL for known hot keys
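
A minimal sketch of the key-replication scheme; `ICache`, the replica count, and the key format are assumptions for illustration:

```csharp
using System;
using System.Threading.Tasks;

const int Replicas = 4;   // illustrative; tune to shard count and key heat

// Write the hot value under N suffixed keys; each suffix usually hashes to a
// different shard, so reads spread across the cluster.
async Task SetHotKeyAsync(ICache cache, string key, string value)
{
    for (var i = 0; i < Replicas; i++)
        await cache.SetAsync($"{key}:{i}", value);
}

// Read a random replica to spread the load.
async Task<string?> GetHotKeyAsync(ICache cache, string key)
    => await cache.GetAsync($"{key}:{Random.Shared.Next(Replicas)}");

public interface ICache   // stand-in for the L2 cache client used elsewhere
{
    Task<string?> GetAsync(string key);
    Task SetAsync(string key, string value);
}
```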

10.4 Cold Start / Cache Warming

  • After a deploy or cache flush, hit rate drops to 0%
  • Mitigations:
    1. Rolling deploys (only one instance cold at a time)
    2. Cache warming worker: preloads popular keys from DB on startup
    3. Gradual traffic shift: route a small % of traffic to the cold instance first

10.5 Big Key Problem

  • Caching large objects (1MB+) slows Redis and consumes bandwidth
  • Mitigations:
    1. Don't cache objects > 100KB
    2. Compress large values before caching (see the sketch after this list)
    3. Split large objects into smaller keys
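
A sketch of the compression mitigation using the built-in GZipStream; the 100KB threshold echoes the guideline above:

```csharp
using System.IO;
using System.IO.Compression;

// Compress values above a size threshold before they go to Redis.
static byte[] MaybeCompress(byte[] raw, int thresholdBytes = 100 * 1024)
{
    if (raw.Length < thresholdBytes) return raw;

    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
        gzip.Write(raw, 0, raw.Length);
    return output.ToArray();   // caller must record that the value is compressed
}
```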

11. Failure Scenarios

11.1 Redis Primary Crash

  • Impact: All cache reads/writes fail
  • Behavior: Fail-open to database (higher latency, more DB load)
  • Recovery: Redis Sentinel promotes replica (< 10s), clients reconnect
  • Monitoring: Alert on Redis connection errors, track DB fallback rate

11.2 Cache Miss Storm (Thundering Herd)

  • Scenario: Popular key expires, 1000 requests simultaneously hit DB
  • Handling: Distributed lock ensures only one request fetches from DB
  • Alternative: "Stale-while-revalidate" serves the stale value while one request refreshes it in the background (sketched below)
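
A sketch of stale-while-revalidate in the style of section 9.3; `Entry<T>` and `_l2Cache` are illustrative, and the entry's hard TTL in Redis is assumed to be longer than the soft TTL stored inside the value:

```csharp
public record Entry<T>(T Value, DateTimeOffset SoftExpiry);

async Task<T> GetSwrAsync<T>(string key, Func<Task<T>> factory, TimeSpan softTtl)
{
    var entry = await _l2Cache.GetAsync<Entry<T>>(key);
    if (entry is not null)
    {
        if (entry.SoftExpiry <= DateTimeOffset.UtcNow)
            _ = RefreshAsync(key, factory, softTtl);   // fire-and-forget refresh
        return entry.Value;                            // serve stale (or fresh) immediately
    }
    return await RefreshAsync(key, factory, softTtl);  // true miss: caller must wait
}

async Task<T> RefreshAsync<T>(string key, Func<Task<T>> factory, TimeSpan softTtl)
{
    var value = await factory();
    await _l2Cache.SetAsync(key, new Entry<T>(value, DateTimeOffset.UtcNow + softTtl));
    return value;
}
```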

11.3 Cache Poisoning

  • Scenario: Corrupt or wrong data stored in cache
  • Impact: All readers get bad data until TTL expires
  • Handling: Version-based validation, manual cache flush, monitoring for data anomalies

11.4 Memory Exhaustion

  • Scenario: Cache grows beyond available memory
  • Impact: Eviction increases, hit rate drops, Redis may OOM
  • Prevention: Set maxmemory + maxmemory-policy (e.g., allkeys-lru)
  • Monitoring: Track memory usage, eviction rate
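
Concretely, the prevention step is two lines of Redis configuration; the values here are illustrative:

```text
# redis.conf (values illustrative)
maxmemory 4gb
maxmemory-policy allkeys-lru   # or allkeys-lfu (Redis 4.0+)
```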

11.5 L1/L2 Inconsistency

  • Scenario: Key is invalidated in L2 but still present in L1 of some instances
  • Impact: Some users see stale data briefly
  • Handling: Short L1 TTL (30-60s) bounds the window. Redis Pub/Sub propagates invalidation to all L1 caches.

11.6 Pub/Sub Message Loss

  • Scenario: Invalidation event lost → L1 caches retain stale data
  • Impact: Stale data served until L1 TTL expires
  • Handling: L1 TTL acts as the safety net. For critical data, don't use L1 cache.

11.7 Serialization/Deserialization Failure

  • Scenario: Schema change makes cached data incompatible
  • Impact: Deserialization throws, cache returns error
  • Handling: Catch deserialization errors → treat as cache miss → fetch from DB
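
In code, that handling is a try/catch around the deserialize step; `Product`, `_l2Cache`, and `LoadFromDbAsync` are illustrative names:

```csharp
using System.Text.Json;

async Task<Product> GetProductAsync(string key, byte[]? cachedBytes)
{
    if (cachedBytes is not null)
    {
        try
        {
            var cached = JsonSerializer.Deserialize<Product>(cachedBytes);
            if (cached is not null) return cached;
        }
        catch (JsonException)
        {
            await _l2Cache.DeleteAsync(key);   // drop the incompatible entry
        }
    }
    return await LoadFromDbAsync(key);         // degrade to a cache miss
}
```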

12. Scaling Strategy

Phase 1: Single Redis Instance

```text
App Instances (N) → Single Redis (primary + replica)
- Handles up to 100K ops/sec
- Simple, cost-effective
```

Phase 2: Redis Cluster

```text
App Instances → Redis Cluster (6+ nodes)
- Horizontal scaling: add shards as data grows
- Each shard handles its portion of the key space
```

Phase 3: Multi-Level Caching

```text
L1 (In-process) → L2 (Redis Cluster) → L3 (CDN for static content)
- L1: ~100MB per instance, 30-60s TTL
- L2: ~100GB shared, 5-30min TTL
- L3: CDN for HTML, images, API responses
```

Phase 4: Multi-Region

```text
Per-region L2 Redis → Cross-region async replication
- Each region has independent L2 cache
- Writes go to primary region → replicate to others
- Accept brief cross-region staleness
```

13. Security Design

13.1 Data Protection

  • Encryption in transit: TLS between app and Redis (Redis 6+)
  • Encryption at rest: If caching sensitive data (PII, tokens), encrypt values before caching (see the sketch after this list)
  • Access control: Redis ACLs (Redis 6+) — restrict which services can access which key prefixes
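
A sketch of value encryption with AES-GCM (the two-parameter AesGcm constructor is .NET 8+); key management via a KMS or secret store is out of scope here:

```csharp
using System.Linq;
using System.Security.Cryptography;

// Encrypt a serialized value before it is written to the cache. The caller
// stores nonce + tag + ciphertext together as the cache value.
static byte[] EncryptForCache(byte[] plaintext, byte[] key)
{
    var nonce = RandomNumberGenerator.GetBytes(12);   // AES-GCM standard nonce size
    var tag = new byte[16];
    var ciphertext = new byte[plaintext.Length];

    using var aes = new AesGcm(key, tagSizeInBytes: 16);
    aes.Encrypt(nonce, plaintext, ciphertext, tag);

    return nonce.Concat(tag).Concat(ciphertext).ToArray();
}
```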

13.2 Cache Poisoning Prevention

  • Only allow authenticated services to write to cache
  • Validate data integrity on cache read (checksum or signature)
  • Audit log for cache flush operations

13.3 Sensitive Data

  • Never cache: passwords, credit card numbers, API secrets
  • Cache with encryption: PII (names, emails), session tokens
  • Cache freely: public product data, configuration, reference data

14. Observability

14.1 Key Metrics

```text
# Counters
cache_operations_total{operation="get|set|delete", result="hit|miss|error", level="l1|l2"}
cache_invalidations_total{method="ttl|explicit|pubsub"}
cache_serialization_errors_total

# Gauges
cache_keys_total{level="l1|l2"}
cache_memory_bytes{level="l1|l2"}
cache_hit_rate{level="l1|l2"}

# Histograms
cache_operation_duration_ms{operation, level}
cache_value_size_bytes
```
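
These map directly onto System.Diagnostics.Metrics; a sketch of emitting the main counter (the meter name and tag values are illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("WebCache");
var ops = meter.CreateCounter<long>("cache_operations_total");

// Record an L1 hit on a get.
ops.Add(1,
    new KeyValuePair<string, object?>("operation", "get"),
    new KeyValuePair<string, object?>("result", "hit"),
    new KeyValuePair<string, object?>("level", "l1"));
```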

14.2 SLOs

| SLI                           | SLO         |
|-------------------------------|-------------|
| Cache hit latency (L1)        | p99 < 0.1ms |
| Cache hit latency (L2)        | p99 < 2ms   |
| Hit rate (combined L1+L2)     | > 85%       |
| Availability                  | 99.99%      |
| Staleness (event-driven keys) | < 5 seconds |

14.3 Alerting

| Alert                 | Condition                     | Severity      |
|-----------------------|-------------------------------|---------------|
| Hit rate drop         | < 70% for 5 minutes           | P2 (Warning)  |
| Redis unreachable     | Connection errors > 0 for 30s | P1 (Critical) |
| Memory usage high     | > 80% of maxmemory            | P2 (Warning)  |
| Eviction rate spike   | > 1000/sec                    | P3 (Info)     |
| DB fallback rate high | > 30% of reads hitting DB     | P2 (Warning)  |