Problem #32: Design a Distributed File System (GFS/HDFS) — Full Deep Dive
1. Problem Statement
Design a distributed file system that stores petabytes of data across thousands of commodity machines, optimized for large sequential reads/writes (not random access), with automatic replication and fault tolerance.
Think: Google File System (GFS), Hadoop Distributed File System (HDFS), Amazon EFS.
2-3. Requirements
Core
| ID | Requirement |
|---|---|
| FR-1 | Store large files (GBs to TBs) across a cluster |
| FR-2 | Read file (sequential reads optimized) |
| FR-3 | Write/append to file (append-heavy, not random writes) |
| FR-4 | Automatic replication (3 copies default) |
| FR-5 | Fault tolerance (survive node failures transparently) |
| FR-6 | Namespace management (directories, file metadata) |
| FR-7 | Support concurrent readers (multiple clients reading same file) |
Non-Functional
| Requirement | Target |
|---|---|
| Throughput | High aggregate bandwidth (100+ GB/sec across cluster) |
| File size | Optimized for 100 MB+ files (not small files) |
| Latency | Not a priority; sustained throughput matters more than per-request latency |
| Storage | Petabytes |
| Availability | 99.9% (commodity hardware fails regularly) |
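A quick back-of-envelope check of these targets, as a hedged sketch: the per-node figures (12 TB of usable disk, a 10 Gb/s NIC) are illustrative assumptions, not part of the requirements.

```python
# Back-of-envelope cluster sizing (illustrative assumptions, not spec values).
PB = 10**15

raw_data = 1 * PB            # logical data stored (scale target: petabytes)
replication = 3              # FR-4: three copies of every chunk
disk_per_node = 12e12        # ~12 TB usable disk per node (assumed)
nic_bps = 10e9               # 10 Gb/s NIC per node (assumed)
target_bandwidth = 100e9     # 100 GB/s aggregate read target (NFR table)

nodes_for_capacity = raw_data * replication / disk_per_node
nodes_for_bandwidth = target_bandwidth / (nic_bps / 8)

print(f"nodes for capacity:  {nodes_for_capacity:,.0f}")   # ~250
print(f"nodes for bandwidth: {nodes_for_bandwidth:,.0f}")  # ~80
# At this scale, capacity rather than NIC bandwidth usually drives node count.
```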
4. Key Design Assumptions (GFS Paper)
```text
GFS was designed with specific workload assumptions:
1. Files are LARGE (multi-GB). Small files exist but aren't optimized for.
2. Reads are mostly SEQUENTIAL (streaming). Random reads are rare.
3. Writes are mostly APPENDS (log files, analytics output). Random writes are rare.
4. Hardware FAILS regularly (commodity servers, cheap disks).
5. High sustained THROUGHPUT is more important than low LATENCY.
These assumptions drive every design decision.
Contrast with: traditional file systems (small files, random I/O, low latency).
```
5-8. Architecture
GFS/HDFS Architecture
Core Concepts
```text
Chunk (Block):
- Files split into fixed-size chunks (64 MB in GFS, 128 MB in HDFS)
- Each chunk stored as a regular Linux file on a chunk server
- Each chunk replicated to 3 chunk servers (default)
- Chunk identified by globally unique 64-bit chunk handle
Why 64-128 MB chunks?
- Reduces metadata on master (1 PB / 64 MB = 16M chunks, manageable in RAM)
- Reduces client-master interactions (fewer chunks to look up)
- Enables large sequential I/O (read an entire chunk without seeking)
- Trade-off: small files waste space (a 1 KB file still occupies a chunk entry)
```
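The metadata arithmetic behind the large-chunk choice is easy to verify; a minimal sketch, assuming the ~64-byte-per-entry figure quoted above:

```python
# Rough master-metadata sizing for 1 PB of data in 64 MB chunks.
PB, MB, KB = 10**15, 10**6, 10**3

chunks_64mb = 1 * PB // (64 * MB)        # ~15.6 million chunk entries
bytes_per_entry = 64                     # approximate per-chunk cost (see above)

print(f"64 MB chunks: {chunks_64mb:,} entries, "
      f"~{chunks_64mb * bytes_per_entry / 1e9:.1f} GB of metadata")  # ~1.0 GB

# With 4 KB blocks (a typical local-filesystem size) the same petabyte would
# need ~250 billion entries — far too many to keep in one node's RAM.
print(f"4 KB blocks:  {1 * PB // (4 * KB):,} entries")
```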
```text
Master (NameNode):
- Single node (simplified, but single point of failure)
- Stores ALL metadata in memory:
→ Namespace (directory tree): ~64 bytes per file/directory
→ File → chunk mapping: ~64 bytes per chunk
→ Chunk → location mapping: from chunk server heartbeats (not persisted)
- 1 PB of data → ~16M chunks → ~1 GB of metadata → fits in RAM easily
```
Read Flow
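The read path can be summarized in a few lines. A minimal sketch with in-memory stand-ins (class and method names here are hypothetical, not a real GFS/HDFS API): the client asks the master for metadata only, then streams chunk data straight from a chunk server.

```python
# Illustrative GFS-style read path (hypothetical names, simplified).
CHUNK_SIZE = 64 * 2**20  # 64 MB

class ChunkServer:
    def __init__(self):
        self.chunks = {}                              # chunk handle -> bytes
    def read_chunk(self, handle, offset, n):
        return self.chunks[handle][offset:offset + n]

class Master:
    def __init__(self):
        self.files = {}                               # path -> [(handle, replicas)]
    def lookup(self, path, chunk_index):
        return self.files[path][chunk_index]          # metadata only, no file data

def read(master, path, offset, length):
    out = bytearray()
    while length > 0:
        idx, off = divmod(offset, CHUNK_SIZE)
        handle, replicas = master.lookup(path, idx)   # 1. metadata from master (cached by real clients)
        server = replicas[0]                          # 2. real clients pick the closest replica
        n = min(length, CHUNK_SIZE - off)
        out += server.read_chunk(handle, off, n)      # 3. data flows chunk server -> client
        offset, length = offset + n, length - n
    return bytes(out)

# usage
cs = ChunkServer(); cs.chunks["h1"] = b"hello world"
m = Master(); m.files["/logs/a"] = [("h1", [cs])]
print(read(m, "/logs/a", 0, 5))                       # b'hello'
```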
Write Flow (Append — GFS Pipeline)
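A matching sketch of the append path (same caveat: hypothetical names, not a real API). Data is pushed replica-to-replica along a chain; only after every replica has buffered it does the client send the write request to the primary, which picks the offset (serial order) and forwards that decision to the secondaries.

```python
# Illustrative GFS-style record append (hypothetical names, simplified).

class Replica:
    def __init__(self):
        self.staged = {}                  # data_id -> bytes awaiting a write order
        self.chunk = bytearray()          # committed chunk contents

    def push(self, data_id, data, chain):
        self.staged[data_id] = data       # data flow: buffer, then forward down the chain
        if chain:
            chain[0].push(data_id, data, chain[1:])

    def apply(self, data_id, offset):
        data = self.staged.pop(data_id)   # control flow: write at the offset chosen by the primary
        self.chunk[offset:offset + len(data)] = data

def record_append(primary, secondaries, data_id, data):
    chain = [primary] + secondaries                  # GFS actually starts with the *nearest* replica
    chain[0].push(data_id, data, chain[1:])          # 1. pipeline the data along the chain
    offset = len(primary.chunk)                      # 2. primary serializes the append
    for replica in chain:
        replica.apply(data_id, offset)               # 3. every replica applies the same order
    return offset

# usage
p, s1, s2 = Replica(), Replica(), Replica()
record_append(p, [s1, s2], "d1", b"log line\n")
assert p.chunk == s1.chunk == s2.chunk == b"log line\n"
```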
9. Deep Dive: Master Fault Tolerance
```text
The master is a single point of failure. How to protect it?
1. Operation Log (WAL):
- Every metadata mutation logged to disk BEFORE responding
- Log replicated to remote backup machines
- On master crash → new master replays operation log
2. Checkpoints:
- Periodically snapshot the in-memory metadata to disk
- Compact B-tree format for fast loading
- Recovery: load latest checkpoint + replay log entries after checkpoint (see the sketch after this block)
3. Shadow Masters (Standby NameNode):
- Read replicas of the master
- Serve read-only metadata requests (slightly stale)
- Can be promoted to primary on master failure
4. HDFS HA (modern approach):
- Active/Standby NameNode pair
- Shared edit log via JournalNodes (Quorum-based)
- ZooKeeper for automatic failover
- Failover time: 30-60 seconds
```
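Items 1-2 (operation log + checkpoints) boil down to write-ahead logging with periodic snapshots. A minimal in-memory sketch, assuming the hypothetical names below stand in for durable, replicated storage:

```python
# Illustrative checkpoint + operation-log recovery (hypothetical, simplified).
import json

class MasterMetadata:
    def __init__(self):
        self.namespace = {}            # path -> list of chunk handles (in-memory state)
        self.oplog = []                # stand-in for the replicated operation log (WAL)
        self.checkpoint = ({}, 0)      # (snapshot, log position covered by it)

    def mutate(self, op, path, handle=None):
        self.oplog.append(json.dumps([op, path, handle]))    # 1. log BEFORE applying
        self._apply(self.namespace, op, path, handle)         # 2. then mutate in memory

    def take_checkpoint(self):
        snapshot = {p: list(h) for p, h in self.namespace.items()}
        self.checkpoint = (snapshot, len(self.oplog))

    def recover(self):
        # Load the latest checkpoint, then replay only the log suffix after it.
        snapshot, pos = self.checkpoint
        state = {p: list(h) for p, h in snapshot.items()}
        for entry in self.oplog[pos:]:
            self._apply(state, *json.loads(entry))
        self.namespace = state

    @staticmethod
    def _apply(state, op, path, handle):
        if op == "create":
            state[path] = []
        elif op == "add_chunk":
            state[path].append(handle)

# usage: crash after a checkpoint plus one more logged mutation
m = MasterMetadata()
m.mutate("create", "/logs/a"); m.mutate("add_chunk", "/logs/a", "h1")
m.take_checkpoint()
m.mutate("add_chunk", "/logs/a", "h2")
m.namespace = {}                        # simulate losing the in-memory state
m.recover()
assert m.namespace == {"/logs/a": ["h1", "h2"]}
```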
```text
Chunk location data is NOT persisted by master:
- Chunk servers report their chunks via heartbeats on startup
- Master rebuilds chunk→location map from heartbeats
- This simplifies master recovery (less state to persist)
```
Handling Chunk Server Failures
```text
Detection:
- Chunk server sends heartbeat to master every 3 seconds
- No heartbeats for a longer window → master marks the server dead (HDFS flags a DataNode stale after 30 s and dead after ~10 minutes)
Recovery:
- Master identifies under-replicated chunks (< 3 replicas)
- Prioritizes: chunks with 1 replica > chunks with 2 replicas
- Instructs healthy chunk servers to copy chunks
- Respects rack-awareness: replicas spread across racks
```
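The prioritization and placement logic above can be sketched as a scheduling function (hypothetical names; the real logic lives in the master's / NameNode's replication monitor): fix the most at-risk chunks first, and prefer racks the chunk does not yet cover.

```python
# Illustrative re-replication planning (hypothetical, simplified).

def plan_rereplication(chunks, live_nodes, target=3):
    """chunks: {handle: [(node, rack), ...]} surviving replicas per chunk.
       live_nodes: [(node, rack), ...] healthy chunk servers.
       Returns (handle, destination_node) copy tasks, most at-risk chunks first."""
    tasks = []
    # Chunks with the fewest surviving replicas are re-replicated first.
    for handle, replicas in sorted(chunks.items(), key=lambda kv: len(kv[1])):
        needed = target - len(replicas)
        used_nodes = {node for node, _ in replicas}
        used_racks = {rack for _, rack in replicas}
        for node, rack in live_nodes:
            if needed == 0:
                break
            if node in used_nodes:
                continue
            # Force a second rack before allowing further same-rack copies.
            if rack in used_racks and len(used_racks) < 2:
                continue
            tasks.append((handle, node))
            used_nodes.add(node); used_racks.add(rack)
            needed -= 1
    return tasks

# usage: h1 is down to one replica, so it is scheduled before h2
chunks = {"h1": [("n1", "r1")], "h2": [("n1", "r1"), ("n2", "r2")]}
nodes = [("n3", "r1"), ("n4", "r2"), ("n5", "r3")]
print(plan_rereplication(chunks, nodes))
# [('h1', 'n4'), ('h1', 'n5'), ('h2', 'n3')]
```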
```text
Rack-aware placement (default):
Replica 1: on the writer's node (same rack) — low write latency
Replica 2: different rack (survives rack failure)
Replica 3: same rack as replica 2 but different node
This survives: 1 node failure, 1 rack failure, 1 disk failure.
```
10-14. Trade-offs, Interview
Key trade-offs:
| Decision | Trade-off |
|---|---|
| Large chunks (64 MB) | Pros: less metadata, fewer master calls, good throughput. Cons: wasted space for small files, hot spots if small file is popular |
| Single master | Pros: simple consistency. Cons: SPOF, metadata scalability limit. Mitigated: shadow masters, HA failover |
| Append-only writes | Pros: simpler concurrency (no random write conflicts). Cons: can't update in place (must rewrite entire file) |
| Chain replication (pipeline) | Pros: maximizes bandwidth. Cons: higher latency (data travels through chain) |
Interview tips:
- Architecture: single master (metadata in RAM) + many chunk servers (data on disk)
- Large chunk size (64-128 MB) — explain WHY
- Write pipeline (data pushed along a replica chain, control through the primary) — maximizes bandwidth
- Lease mechanism for primary election per chunk (see the sketch after this list)
- Master fault tolerance (operation log + checkpoints + HA failover)
- Rack-aware replication for durability
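The per-chunk lease from the tips above, as a hedged sketch (hypothetical names; the ~60-second initial lease and heartbeat-piggybacked extensions are from the GFS paper):

```python
# Illustrative per-chunk lease management on the master (hypothetical, simplified).
import time

LEASE_SECONDS = 60                         # initial lease length; renewable

class LeaseManager:
    def __init__(self):
        self.leases = {}                   # chunk handle -> (primary, expiry time)

    def primary_for(self, handle, replicas):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if time.time() < expiry:
            return primary                 # an unexpired lease stays with its holder
        primary = replicas[0]              # otherwise grant the lease to a live replica
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary

    def extend(self, handle):
        # Primaries renew their lease via requests piggybacked on heartbeats.
        primary, _ = self.leases[handle]
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
```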
Practice Mode
5 Questions
- "Why are chunks 64 MB instead of 4 KB like a regular filesystem?" → Optimized for large sequential reads/writes. Reduces master metadata (fits in RAM). Fewer client-master round trips. Trade-off: small files waste space.
- "How does the write pipeline work?" → Client pushes data to primary → primary forwards to secondaries (chain). Then client sends write request to primary → primary assigns serial number → forwards to secondaries. Chain maximizes bandwidth.
- "What happens when a chunk server dies?" → Master detects via missed heartbeats (30s). Identifies under-replicated chunks. Instructs healthy servers to re-replicate. Rack-aware placement ensures fault tolerance.
- "How is the master protected from being a SPOF?" → Operation log replicated to backups. Periodic checkpoints. Shadow/standby masters. HDFS HA: active/standby with shared journal and ZooKeeper failover.
- "Why doesn't the master persist chunk locations?" → Chunk servers report their chunks via heartbeats. Master rebuilds the mapping dynamically. Simplifies recovery — no stale location data.