
Problem #32: Design a Distributed File System (GFS/HDFS) — Full Deep Dive


1. Problem Statement

Design a distributed file system that stores petabytes of data across thousands of commodity machines, optimized for large sequential reads/writes (not random access), with automatic replication and fault tolerance.

Think: Google File System (GFS), Hadoop Distributed File System (HDFS), Amazon EFS.


2-3. Requirements

Core

| ID | Requirement |
|------|-------------|
| FR-1 | Store large files (GBs to TBs) across a cluster |
| FR-2 | Read file (sequential reads optimized) |
| FR-3 | Write/append to file (append-heavy, not random writes) |
| FR-4 | Automatic replication (3 copies by default) |
| FR-5 | Fault tolerance (survive node failures transparently) |
| FR-6 | Namespace management (directories, file metadata) |
| FR-7 | Support concurrent readers (multiple clients reading the same file) |

Non-Functional

| Requirement | Target |
|--------------|--------|
| Throughput | High aggregate bandwidth (100+ GB/s across the cluster) |
| File size | Optimized for 100 MB+ files (not small files) |
| Latency | Sequential reads: high throughput over low latency |
| Storage | Petabytes |
| Availability | 99.9% (commodity hardware fails regularly) |

4. Key Design Assumptions (GFS Paper)

```text
GFS was designed with specific workload assumptions:
  1. Files are LARGE (multi-GB). Small files exist but are not the optimization target.
  2. Reads are mostly SEQUENTIAL (streaming). Random reads are rare.
  3. Writes are mostly APPENDS (log files, analytics output). Random writes are rare.
  4. Hardware FAILS regularly (commodity servers, cheap disks).
  5. High sustained THROUGHPUT is more important than low LATENCY.

These assumptions drive every design decision.
Contrast with: traditional file systems (small files, random I/O, low latency).
```

5-8. Architecture

GFS/HDFS Architecture

Core Concepts

```text
Chunk (Block):
  - Files split into fixed-size chunks (64 MB in GFS, 128 MB in HDFS)
  - Each chunk stored as a regular Linux file on a chunk server
  - Each chunk replicated to 3 chunk servers (default)
  - Chunk identified by a globally unique 64-bit chunk handle

Why 64-128 MB chunks?
  - Reduces metadata on the master (1 PB / 64 MB = 16M chunks, manageable in RAM)
  - Reduces client-master interactions (fewer chunks to look up)
  - Enables large sequential I/O (read an entire chunk without seeking)
  - Trade-off: small files waste space (a 1 KB file still occupies a chunk entry)

Master (NameNode):
  - Single node (simple by design, but a single point of failure)
  - Stores ALL metadata in memory:
    → Namespace (directory tree): ~64 bytes per file/directory
    → File → chunk mapping: ~64 bytes per chunk
    → Chunk → location mapping: from chunk server heartbeats (not persisted)
  - 1 PB of data → ~16M chunks → ~1 GB of metadata → fits in RAM easily
```
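The metadata arithmetic above can be checked with a few lines of Python. The 64-bytes-per-chunk figure is the rough estimate from the text, not an exact on-disk layout:

```python
# Back-of-the-envelope: master metadata needed to track 1 PB of data.
CHUNK_SIZE = 64 * 2**20          # 64 MB (GFS default chunk size)
BYTES_PER_CHUNK_ENTRY = 64       # rough per-chunk metadata cost (estimate from text)

def master_metadata_bytes(total_data_bytes: int) -> int:
    """Estimate in-memory metadata for the file -> chunk mapping."""
    num_chunks = total_data_bytes // CHUNK_SIZE
    return num_chunks * BYTES_PER_CHUNK_ENTRY

one_pb = 2**50
chunks = one_pb // CHUNK_SIZE            # 16,777,216 chunks (~16M)
meta = master_metadata_bytes(one_pb)     # 2**30 bytes = 1 GiB — fits in RAM
```

This is exactly why the chunk size matters: at 4 KB blocks the same math yields ~256 billion entries, which no single machine could hold in memory.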

Read Flow
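A minimal Python sketch of the read path: the client maps a byte offset to a chunk index, asks the master for that chunk's handle and replica locations, then reads directly from a chunk server (the master never touches file data). All class and variable names here are illustrative toys, not real GFS/HDFS APIs:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB

class Master:
    """Toy master: maps (path, chunk_index) -> (chunk handle, replica locations)."""
    def __init__(self):
        self.chunks = {}

    def lookup(self, path, chunk_index):
        return self.chunks[(path, chunk_index)]

def read(master, chunkservers, path, offset, length):
    """Read `length` bytes at `offset` (single-chunk case, for brevity)."""
    idx = offset // CHUNK_SIZE               # which chunk holds this offset
    handle, locations = master.lookup(path, idx)
    within = offset % CHUNK_SIZE             # byte offset inside the chunk
    server = chunkservers[locations[0]]      # pick a replica (nearest, in practice)
    return server[handle][within:within + length]

# Usage: one file, one chunk, three replicas
master = Master()
master.chunks[("/logs/app.log", 0)] = ("c1", ["cs1", "cs2", "cs3"])
servers = {name: {"c1": b"hello world"} for name in ("cs1", "cs2", "cs3")}
assert read(master, servers, "/logs/app.log", 6, 5) == b"world"
```

Note that the data path (client ↔ chunk server) is completely separate from the control path (client ↔ master), which is what keeps the single master from becoming a bandwidth bottleneck.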

Write Flow (Append — GFS Pipeline)
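A toy sketch of the append pipeline: data is first staged along the replica chain (decoupled from control flow), then the client asks the primary to commit; the primary assigns a serial number and every replica applies mutations in that same order. Class and method names are illustrative, not the real protocol messages:

```python
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.staged = {}   # handle -> data buffered but not yet applied
        self.chunks = {}   # handle -> committed bytes

    def push(self, handle, data):
        self.staged[handle] = data          # phase 1: stage data only

    def apply(self, handle):
        # phase 2: append staged data in the primary-chosen order
        self.chunks[handle] = self.chunks.get(handle, b"") + self.staged.pop(handle)

def append(data, handle, primary, secondaries, serials={}):
    # Phase 1: push data along the chain (client -> R1 -> R2 -> R3)
    for server in [primary] + secondaries:
        server.push(handle, data)
    # Phase 2: primary assigns a serial number, applies, then orders secondaries
    serial = serials.get(handle, 0)
    serials[handle] = serial + 1
    primary.apply(handle)
    for s in secondaries:
        s.apply(handle)                     # same order on every replica
    return serial

p, s1, s2 = ChunkServer("primary"), ChunkServer("sec1"), ChunkServer("sec2")
append(b"rec1,", "c1", p, [s1, s2])
append(b"rec2,", "c1", p, [s1, s2])
# all three replicas now hold the records in identical order
```

The real pipeline streams data hop-by-hop (each server forwards as soon as bytes arrive), so each network link runs at full bandwidth; this sketch only captures the ordering guarantee.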


9. Deep Dive: Master Fault Tolerance

```text
The master is a single point of failure. How do we protect it?

1. Operation Log (WAL):
   - Every metadata mutation is logged to disk BEFORE responding
   - Log replicated to remote backup machines
   - On master crash → new master replays the operation log

2. Checkpoints:
   - Periodically snapshot the in-memory metadata to disk
   - Compact B-tree-like format for fast loading
   - Recovery: load the latest checkpoint + replay log entries after the checkpoint

3. Shadow Masters (Standby NameNode):
   - Read replicas of the master
   - Serve read-only metadata requests (slightly stale)
   - Can be promoted to primary on master failure

4. HDFS HA (modern approach):
   - Active/Standby NameNode pair
   - Shared edit log via JournalNodes (quorum-based)
   - ZooKeeper for automatic failover
   - Failover time: 30-60 seconds

Chunk location data is NOT persisted by the master:
   - Chunk servers report their chunks on startup and in regular heartbeats
   - Master rebuilds the chunk → location map from these reports
   - This simplifies master recovery (less state to persist)
```
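The checkpoint-plus-log recovery in items 1-2 can be sketched as a toy model (a real NameNode's edit log and fsimage are far richer; these structures and op names are invented for illustration):

```python
class MasterState:
    """Toy in-memory namespace: path -> list of chunk handles.
    Chunk *locations* are deliberately absent — they come from heartbeats."""
    def __init__(self):
        self.files = {}

    def apply(self, op):
        kind, path, arg = op
        if kind == "create":
            self.files[path] = []
        elif kind == "add_chunk":
            self.files[path].append(arg)
        elif kind == "delete":
            del self.files[path]

def recover(checkpoint, log_after_checkpoint):
    """New master: load the latest checkpoint, then replay later log entries."""
    state = MasterState()
    state.files = {path: list(chunks) for path, chunks in checkpoint.items()}
    for op in log_after_checkpoint:      # only mutations logged AFTER the snapshot
        state.apply(op)
    return state

# Usage: snapshot had /a with one chunk; the log adds a chunk and a new file
m = recover({"/a": ["c1"]}, [("add_chunk", "/a", "c2"), ("create", "/b", None)])
```

Checkpointing exists purely to bound replay time: without it, a master restarted after months of operation would have to replay every mutation since day one.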

Handling Chunk Server Failures

```text
Detection:
  - Chunk server sends a heartbeat to the master every 3 seconds
  - No heartbeat for 30 seconds → master marks the server as dead

Recovery:
  - Master identifies under-replicated chunks (< 3 replicas)
  - Prioritizes: chunks with 1 replica before chunks with 2 replicas
  - Instructs healthy chunk servers to copy chunks
  - Respects rack-awareness: replicas spread across racks

Rack-aware placement (default):
  Replica 1: same rack as the writer (low latency)
  Replica 2: different rack (survives rack failure)
  Replica 3: same rack as replica 2 but a different node

This survives: 1 node failure, 1 rack failure, 1 disk failure.
```
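The re-replication priority and rack-aware target selection above can be sketched in a few lines (helper names and data shapes are illustrative, not a real master's scheduler):

```python
def replication_priority(chunk_replicas, target=3):
    """Order under-replicated chunks: fewest surviving replicas first,
    since a 1-replica chunk is one failure away from data loss."""
    under = [(h, reps) for h, reps in chunk_replicas.items() if len(reps) < target]
    return sorted(under, key=lambda item: len(item[1]))

def pick_targets(existing_racks, servers_by_rack, needed):
    """Prefer servers on racks that do not already hold a replica."""
    targets = []
    for rack, servers in servers_by_rack.items():
        if rack in existing_racks:
            continue                      # spreading across racks survives rack loss
        for server in servers:
            if len(targets) < needed:
                targets.append(server)
    return targets

# Usage: c1 has lost two replicas, c2 one, c3 none
order = replication_priority({"c1": ["s1"], "c2": ["s1", "s2"],
                              "c3": ["s1", "s2", "s3"]})
new_homes = pick_targets({"rack1"}, {"rack1": ["s1"], "rack2": ["s2", "s3"]}, 1)
```

A production master also throttles re-replication bandwidth so recovery traffic does not starve client reads, but the ordering logic is the core idea.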

10-14. Trade-offs, Interview

Key trade-offs:

| Decision | Trade-off |
|----------|-----------|
| Large chunks (64 MB) | Pros: less metadata, fewer master calls, good throughput. Cons: wasted space for small files; hot spots if a small file is popular |
| Single master | Pros: simple consistency. Cons: SPOF, metadata scalability limit. Mitigated by shadow masters and HA failover |
| Append-only writes | Pros: simpler concurrency (no random-write conflicts). Cons: can't update in place (must rewrite the entire file) |
| Chain replication (pipeline) | Pros: maximizes bandwidth. Cons: higher latency (data travels through the chain) |

Interview tips:

  1. Architecture: single master (metadata in RAM) + many chunk servers (data on disk)
  2. Large chunk size (64-128 MB) — explain WHY
  3. Write pipeline (client → primary → secondary chain) — maximizes bandwidth
  4. Lease mechanism for primary election per chunk
  5. Master fault tolerance (operation log + checkpoints + HA failover)
  6. Rack-aware replication for durability

Practice Mode

5 Questions

  1. "Why are chunks 64 MB instead of 4 KB like a regular filesystem?" → Optimized for large sequential reads/writes. Reduces master metadata (fits in RAM). Fewer client-master round trips. Trade-off: small files waste space.
  2. "How does the write pipeline work?" → Client pushes data to primary → primary forwards to secondaries (chain). Then client sends write request to primary → primary assigns serial number → forwards to secondaries. Chain maximizes bandwidth.
  3. "What happens when a chunk server dies?" → Master detects via missed heartbeats (30s). Identifies under-replicated chunks. Instructs healthy servers to re-replicate. Rack-aware placement ensures fault tolerance.
  4. "How is the master protected from being a SPOF?" → Operation log replicated to backups. Periodic checkpoints. Shadow/standby masters. HDFS HA: active/standby with shared journal and ZooKeeper failover.
  5. "Why doesn't the master persist chunk locations?" → Chunk servers report their chunks via heartbeats. Master rebuilds the mapping dynamically. Simplifies recovery — no stale location data.