
Problem #32: Design a Distributed File System (GFS/HDFS) — Full Deep Dive


1. Problem Statement

Design a distributed file system that stores petabytes of data across thousands of commodity machines, optimized for large sequential reads/writes (not random access), with automatic replication and fault tolerance.

Think: Google File System (GFS), Hadoop Distributed File System (HDFS), Amazon EFS.


2-3. Requirements

Core

| ID | Requirement |
|------|-------------|
| FR-1 | Store large files (GBs to TBs) across a cluster |
| FR-2 | Read file (sequential reads optimized) |
| FR-3 | Write/append to file (append-heavy, not random writes) |
| FR-4 | Automatic replication (3 copies by default) |
| FR-5 | Fault tolerance (survive node failures transparently) |
| FR-6 | Namespace management (directories, file metadata) |
| FR-7 | Support concurrent readers (multiple clients reading the same file) |

Non-Functional

| Requirement | Target |
|--------------|--------|
| Throughput | High aggregate bandwidth (100+ GB/s across the cluster) |
| File size | Optimized for 100 MB+ files (not small files) |
| Latency | Sequential reads: high throughput over low latency |
| Storage | Petabytes |
| Availability | 99.9% (commodity hardware fails regularly) |

4. Key Design Assumptions (GFS Paper)

```text
GFS was designed with specific workload assumptions:
  1. Files are LARGE (multi-GB). Small files exist but are not the optimization target.
  2. Reads are mostly SEQUENTIAL (streaming). Random reads are rare.
  3. Writes are mostly APPENDS (log files, analytics output). Random writes are rare.
  4. Hardware FAILS regularly (commodity servers, cheap disks).
  5. High sustained THROUGHPUT is more important than low LATENCY.

These assumptions drive every design decision.
Contrast with: traditional file systems (small files, random I/O, low latency).
```

5-8. Architecture

GFS/HDFS Architecture

Core Concepts

```text
Chunk (Block):
  - Files split into fixed-size chunks (64 MB in GFS, 128 MB in HDFS)
  - Each chunk stored as a regular Linux file on a chunk server
  - Each chunk replicated to 3 chunk servers (default)
  - Chunk identified by a globally unique 64-bit chunk handle

Why 64-128 MB chunks?
  - Reduces metadata on the master (1 PB / 64 MB = 16M chunks, manageable in RAM)
  - Reduces client-master interactions (fewer chunks to look up)
  - Enables large sequential I/O (read an entire chunk without seeking)
  - Trade-off: small files waste space (a 1 KB file still occupies a chunk entry)

Master (NameNode):
  - Single node (simple by design, but a single point of failure)
  - Stores ALL metadata in memory:
    → Namespace (directory tree): ~64 bytes per file/directory
    → File → chunk mapping: ~64 bytes per chunk
    → Chunk → location mapping: from chunk server heartbeats (not persisted)
  - 1 PB of data → ~16M chunks → ~1 GB of metadata → fits in RAM easily
```
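The metadata arithmetic above can be checked with a few lines of Python. The 64-bytes-per-chunk figure is the rough estimate from the text, not an exact on-disk layout:

```python
# Back-of-the-envelope: master metadata needed to track 1 PB of data.
CHUNK_SIZE = 64 * 2**20          # 64 MB (GFS default chunk size)
BYTES_PER_CHUNK_ENTRY = 64       # rough per-chunk metadata cost (estimate from text)

def master_metadata_bytes(total_data_bytes: int) -> int:
    """Estimate in-memory metadata for the file -> chunk mapping."""
    num_chunks = total_data_bytes // CHUNK_SIZE
    return num_chunks * BYTES_PER_CHUNK_ENTRY

one_pb = 2**50
chunks = one_pb // CHUNK_SIZE            # 16,777,216 chunks (~16M)
meta = master_metadata_bytes(one_pb)     # 2**30 bytes = 1 GiB — fits in RAM
```

This is exactly why the chunk size matters: at 4 KB blocks the same math yields ~256 billion entries, which no single machine could hold in memory.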

Read Flow
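A minimal Python sketch of the read path: the client maps a byte offset to a chunk index, asks the master for that chunk's handle and replica locations, then reads directly from a chunk server (the master never touches file data). All class and variable names here are illustrative toys, not real GFS/HDFS APIs:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB

class Master:
    """Toy master: maps (path, chunk_index) -> (chunk handle, replica locations)."""
    def __init__(self):
        self.chunks = {}

    def lookup(self, path, chunk_index):
        return self.chunks[(path, chunk_index)]

def read(master, chunkservers, path, offset, length):
    """Read `length` bytes at `offset` (single-chunk case, for brevity)."""
    idx = offset // CHUNK_SIZE               # which chunk holds this offset
    handle, locations = master.lookup(path, idx)
    within = offset % CHUNK_SIZE             # byte offset inside the chunk
    server = chunkservers[locations[0]]      # pick a replica (nearest, in practice)
    return server[handle][within:within + length]

# Usage: one file, one chunk, three replicas
master = Master()
master.chunks[("/logs/app.log", 0)] = ("c1", ["cs1", "cs2", "cs3"])
servers = {name: {"c1": b"hello world"} for name in ("cs1", "cs2", "cs3")}
assert read(master, servers, "/logs/app.log", 6, 5) == b"world"
```

Note that the data path (client ↔ chunk server) is completely separate from the control path (client ↔ master), which is what keeps the single master from becoming a bandwidth bottleneck.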

Write Flow (Append — GFS Pipeline)
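A toy sketch of the append pipeline: data is first staged along the replica chain (decoupled from control flow), then the client asks the primary to commit; the primary assigns a serial number and every replica applies mutations in that same order. Class and method names are illustrative, not the real protocol messages:

```python
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.staged = {}   # handle -> data buffered but not yet applied
        self.chunks = {}   # handle -> committed bytes

    def push(self, handle, data):
        self.staged[handle] = data          # phase 1: stage data only

    def apply(self, handle):
        # phase 2: append staged data in the primary-chosen order
        self.chunks[handle] = self.chunks.get(handle, b"") + self.staged.pop(handle)

def append(data, handle, primary, secondaries, serials={}):
    # Phase 1: push data along the chain (client -> R1 -> R2 -> R3)
    for server in [primary] + secondaries:
        server.push(handle, data)
    # Phase 2: primary assigns a serial number, applies, then orders secondaries
    serial = serials.get(handle, 0)
    serials[handle] = serial + 1
    primary.apply(handle)
    for s in secondaries:
        s.apply(handle)                     # same order on every replica
    return serial

p, s1, s2 = ChunkServer("primary"), ChunkServer("sec1"), ChunkServer("sec2")
append(b"rec1,", "c1", p, [s1, s2])
append(b"rec2,", "c1", p, [s1, s2])
# all three replicas now hold the records in identical order
```

The real pipeline streams data hop-by-hop (each server forwards as soon as bytes arrive), so each network link runs at full bandwidth; this sketch only captures the ordering guarantee.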


9. Deep Dive: Master Fault Tolerance

```text
The master is a single point of failure. How do we protect it?

1. Operation Log (WAL):
   - Every metadata mutation is logged to disk BEFORE responding
   - Log replicated to remote backup machines
   - On master crash → new master replays the operation log

2. Checkpoints:
   - Periodically snapshot the in-memory metadata to disk
   - Compact B-tree-like format for fast loading
   - Recovery: load the latest checkpoint + replay log entries after the checkpoint

3. Shadow Masters (Standby NameNode):
   - Read replicas of the master
   - Serve read-only metadata requests (slightly stale)
   - Can be promoted to primary on master failure

4. HDFS HA (modern approach):
   - Active/Standby NameNode pair
   - Shared edit log via JournalNodes (quorum-based)
   - ZooKeeper for automatic failover
   - Failover time: 30-60 seconds

Chunk location data is NOT persisted by the master:
   - Chunk servers report their chunks on startup and in regular heartbeats
   - Master rebuilds the chunk → location map from these reports
   - This simplifies master recovery (less state to persist)
```
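The checkpoint-plus-log recovery in items 1-2 can be sketched as a toy model (a real NameNode's edit log and fsimage are far richer; these structures and op names are invented for illustration):

```python
class MasterState:
    """Toy in-memory namespace: path -> list of chunk handles.
    Chunk *locations* are deliberately absent — they come from heartbeats."""
    def __init__(self):
        self.files = {}

    def apply(self, op):
        kind, path, arg = op
        if kind == "create":
            self.files[path] = []
        elif kind == "add_chunk":
            self.files[path].append(arg)
        elif kind == "delete":
            del self.files[path]

def recover(checkpoint, log_after_checkpoint):
    """New master: load the latest checkpoint, then replay later log entries."""
    state = MasterState()
    state.files = {path: list(chunks) for path, chunks in checkpoint.items()}
    for op in log_after_checkpoint:      # only mutations logged AFTER the snapshot
        state.apply(op)
    return state

# Usage: snapshot had /a with one chunk; the log adds a chunk and a new file
m = recover({"/a": ["c1"]}, [("add_chunk", "/a", "c2"), ("create", "/b", None)])
```

Checkpointing exists purely to bound replay time: without it, a master restarted after months of operation would have to replay every mutation since day one.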

Handling Chunk Server Failures

```text
Detection:
  - Chunk server sends a heartbeat to the master every 3 seconds
  - No heartbeat for 30 seconds → master marks the server as dead

Recovery:
  - Master identifies under-replicated chunks (< 3 replicas)
  - Prioritizes: chunks with 1 replica before chunks with 2 replicas
  - Instructs healthy chunk servers to copy chunks
  - Respects rack-awareness: replicas spread across racks

Rack-aware placement (default):
  Replica 1: same rack as the writer (low latency)
  Replica 2: different rack (survives rack failure)
  Replica 3: same rack as replica 2 but a different node

This survives: 1 node failure, 1 rack failure, 1 disk failure.
```
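The re-replication priority and rack-aware target selection above can be sketched in a few lines (helper names and data shapes are illustrative, not a real master's scheduler):

```python
def replication_priority(chunk_replicas, target=3):
    """Order under-replicated chunks: fewest surviving replicas first,
    since a 1-replica chunk is one failure away from data loss."""
    under = [(h, reps) for h, reps in chunk_replicas.items() if len(reps) < target]
    return sorted(under, key=lambda item: len(item[1]))

def pick_targets(existing_racks, servers_by_rack, needed):
    """Prefer servers on racks that do not already hold a replica."""
    targets = []
    for rack, servers in servers_by_rack.items():
        if rack in existing_racks:
            continue                      # spreading across racks survives rack loss
        for server in servers:
            if len(targets) < needed:
                targets.append(server)
    return targets

# Usage: c1 has lost two replicas, c2 one, c3 none
order = replication_priority({"c1": ["s1"], "c2": ["s1", "s2"],
                              "c3": ["s1", "s2", "s3"]})
new_homes = pick_targets({"rack1"}, {"rack1": ["s1"], "rack2": ["s2", "s3"]}, 1)
```

A production master also throttles re-replication bandwidth so recovery traffic does not starve client reads, but the ordering logic is the core idea.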

10-14. Trade-offs, Interview

Key trade-offs:

| Decision | Trade-off |
|----------|-----------|
| Large chunks (64 MB) | Pros: less metadata, fewer master calls, good throughput. Cons: wasted space for small files; hot spots if a small file is popular |
| Single master | Pros: simple consistency. Cons: SPOF, metadata scalability limit. Mitigated by shadow masters and HA failover |
| Append-only writes | Pros: simpler concurrency (no random-write conflicts). Cons: can't update in place (must rewrite the entire file) |
| Chain replication (pipeline) | Pros: maximizes bandwidth. Cons: higher latency (data travels through the chain) |

Interview tips:

  1. Architecture: single master (metadata in RAM) + many chunk servers (data on disk)
  2. Large chunk size (64-128 MB) — explain WHY
  3. Write pipeline (client → primary → secondary chain) — maximizes bandwidth
  4. Lease mechanism for primary election per chunk
  5. Master fault tolerance (operation log + checkpoints + HA failover)
  6. Rack-aware replication for durability

Practice Mode

5 Questions

  1. "Why are chunks 64 MB instead of 4 KB like a regular filesystem?" → Optimized for large sequential reads/writes. Reduces master metadata (fits in RAM). Fewer client-master round trips. Trade-off: small files waste space.
  2. "How does the write pipeline work?" → Client pushes data to primary → primary forwards to secondaries (chain). Then client sends write request to primary → primary assigns serial number → forwards to secondaries. Chain maximizes bandwidth.
  3. "What happens when a chunk server dies?" → Master detects via missed heartbeats (30s). Identifies under-replicated chunks. Instructs healthy servers to re-replicate. Rack-aware placement ensures fault tolerance.
  4. "How is the master protected from being a SPOF?" → Operation log replicated to backups. Periodic checkpoints. Shadow/standby masters. HDFS HA: active/standby with shared journal and ZooKeeper failover.
  5. "Why doesn't the master persist chunk locations?" → Chunk servers report their chunks via heartbeats. Master rebuilds the mapping dynamically. Simplifies recovery — no stale location data.