Data-Intensive Applications Learning Repository
A hands-on .NET 10 learning repository for understanding the core principles behind data-intensive applications. Inspired by Martin Kleppmann's Designing Data-Intensive Applications, this project provides runnable simulations, comparison benchmarks, and reference documentation covering replication, partitioning, database selection, file formats, and system design tradeoffs.
What You Will Learn

| Topic | Description |
| --- | --- |
| Replication | Leader-follower, multi-leader, and leaderless replication strategies; conflict resolution; consistency guarantees; failover |
| Partitioning & Sharding | Hash, range, consistent hashing, list, time-based, composite partitioning; rebalancing; hot spot mitigation |
| Database Selection | SQL vs Document vs Graph vs Vector vs Time-Series vs Key-Value vs Wide-Column; decision matrices; real-world scenarios |
| File Formats | CSV, JSON, XML, Avro, Parquet, ORC, Protobuf; row vs columnar storage; schema evolution |
| System Design Tradeoffs | Consistency vs availability, normalization vs denormalization, latency vs throughput; comprehensive decision guides |

Repository Map
Repository Structure
```text
DistributedSystemWithBigData/
├── docs/
│   ├── replication/
│   │   ├── replication-overview.md # All replication strategies compared
│   │   ├── leader-follower.md # Single-leader deep dive
│   │   ├── multi-leader.md # Multi-leader + conflict resolution
│   │   ├── leaderless.md # Dynamo-style quorum replication
│   │   ├── consistency-and-lag.md # Replication lag + consistency models
│   │   └── failure-scenarios.md # Failover, split-brain, data loss
│   ├── partitioning/
│   │   ├── partitioning-overview.md # Partitioning vs sharding taxonomy
│   │   ├── hash-partitioning.md # Hash-based key distribution
│   │   ├── range-partitioning.md # Range-based for time-series + scans
│   │   ├── consistent-hashing.md # Hash ring + vnodes
│   │   ├── tenant-sharding.md # Multi-tenant isolation strategies
│   │   ├── hot-partitions-and-rebalancing.md # Skew detection + rebalancing
│   │   └── partition-key-selection.md # How to choose partition keys
│   ├── database-selection/
│   │   ├── sql-vs-nosql.md # Relational vs non-relational
│   │   ├── document-db-guide.md # MongoDB, Couchbase deep dive
│   │   ├── when-to-use-graph-db.md # Neo4j, Neptune use cases
│   │   ├── when-to-use-vector-db.md # Pinecone, pgvector, RAG
│   │   ├── when-to-use-timeseries-db.md # TimescaleDB, InfluxDB
│   │   ├── when-to-use-key-value-db.md # Redis, DynamoDB patterns
│   │   ├── when-to-use-wide-column-db.md # Cassandra, ScyllaDB
│   │   └── decision-matrix.md # Comprehensive comparison + flowcharts
│   ├── file-formats/
│   │   ├── csv.md # CSV deep dive
│   │   ├── json.md # JSON + JSON Lines
│   │   ├── xml.md # XML + XSD + SOAP
│   │   ├── avro.md # Avro + Schema Registry
│   │   ├── parquet.md # Columnar analytics format
│   │   ├── orc.md # Hive-native columnar format
│   │   ├── format-comparison-matrix.md # All formats compared
│   │   └── schema-evolution.md # Schema evolution patterns
│   ├── system-design-scenarios/
│   │   ├── ecommerce-platform.md # Orders, catalog, search, cache
│   │   ├── iot-platform.md # Sensor ingestion + time-series
│   │   ├── fraud-detection.md # Graph-based fraud ring detection
│   │   ├── observability-pipeline.md # Metrics, logs, traces
│   │   └── rag-search-platform.md # Vector DB + RAG architecture
│   ├── interview-revision-cheatsheet.md # Master cheat sheet for all topics
│   └── architecture-decision-flow.md # Decision flowcharts (Mermaid)
├── src/
│   ├── DataIntensiveLearning.Core/ # Shared domain models + abstractions
│   │   ├── Enums/ # DatabaseType, ReplicationType, etc.
│   │   ├── Interfaces/ # IPartitioner, IReplicationSimulator
│   │   └── Models/ # DataRecord, ReplicaNode, Partition
│   ├── DataIntensiveLearning.Simulations/ # Runnable console simulations
│   │   ├── Program.cs # Entry point (run with --scenario)
│   │   └── Replication/
│   │       ├── LeaderFollowerSimulator.cs # Sync/async replication + failover
│   │       ├── MultiLeaderSimulator.cs # Conflict detection + resolution
│   │       └── LeaderlessSimulator.cs # Quorum reads/writes + read repair
│   ├── DataIntensiveLearning.Partitioning/ # Partitioning strategy implementations
│   │   ├── HashPartitioner.cs # Hash-mod with distribution analysis
│   │   ├── RangePartitioner.cs # Boundary-based range partitioning
│   │   ├── ConsistentHashPartitioner.cs # Hash ring with virtual nodes
│   │   ├── ListPartitioner.cs # Explicit key-to-partition mapping
│   │   ├── TimeBasedPartitioner.cs # Time-bucketed partitioning
│   │   ├── CompositePartitioner.cs # Hash + time two-level partitioning
│   │   └── SkewAnalyzer.cs # Hot partition detection
│   ├── DataIntensiveLearning.FileFormats/ # File format examples
│   │   ├── CsvFormatExample.cs # CSV read/write
│   │   ├── JsonFormatExample.cs # JSON serialization
│   │   ├── XmlFormatExample.cs # XML serialization + namespaces
│   │   ├── AvroFormatExample.cs # Avro schema + evolution concepts
│   │   ├── ParquetFormatExample.cs # Columnar storage concepts
│   │   ├── OrcFormatExample.cs # ORC vs Parquet comparison
│   │   ├── ProtobufFormatExample.cs # Binary encoding + schema evolution
│   │   └── Formats/ # Reusable format implementations
│   ├── DataIntensiveLearning.DatabaseSelection/
│   │   ├── DatabaseSelectionAdvisor.cs # Rule-based recommendation engine
│   │   └── Models/ # WorkloadProfile, DatabaseRecommendation
│   └── DataIntensiveLearning.Api/ # ASP.NET Core API scaffold
├── tests/
│   └── DataIntensiveLearning.UnitTests/
│       ├── Partitioning/ # Hash, range, consistent hash, list, time, composite tests
│       ├── Replication/ # Quorum, leaderless simulator tests
│       ├── FileFormats/ # CSV, JSON serialization tests
│       └── DatabaseSelection/ # Advisor recommendation tests
├── docker/
│   └── Dockerfile.api # .NET 10 API container
├── docker-compose.yml # PostgreSQL, Redis, MongoDB, Neo4j, TimescaleDB
├── DataIntensiveLearning.sln # .NET solution file
└── README.md # This file
```
Study Paths
Path 1: Beginner (Foundations First)
Replication — How data is copied across nodes and why consistency is hard.
Partitioning — How to split data across machines for horizontal scale.
Database Selection — Choosing the right database for each workload.
File Formats — How serialization and storage format choices affect performance.
System Design — Apply everything in realistic design exercises.
Path 2: Interview Prep (Fast Track)
Interview Cheat Sheet — One-page summaries and common questions.
Architecture Decision Flow — Decision flowcharts for databases, formats, partitioning.
Database Decision Matrix — "Choose X when Y" tables.
System Design Scenarios — Practice articulating design decisions.
Path 3: Deep Dive by Topic
Topic-by-Topic Navigation
Replication

| Document | What You'll Learn |
| --- | --- |
| replication-overview.md | All three topologies compared, sync vs async, when to use each |
| leader-follower.md | Single-leader write flow, read replicas, failover process |
| multi-leader.md | Multi-datacenter replication, conflict resolution (LWW, merge, CRDTs) |
| leaderless.md | Quorum reads/writes (W + R > N), read repair, anti-entropy, sloppy quorums |
| consistency-and-lag.md | Read-after-write, monotonic reads, bounded staleness, causal consistency |
| failure-scenarios.md | Leader failure, split-brain, network partitions, data loss scenarios |

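
The quorum rule listed for leaderless.md (W + R > N) can be checked in a few lines. The following is a minimal sketch, not code from this repository: `QuorumConfig` and its members are hypothetical names chosen for illustration, and the repository's `LeaderlessSimulator` may model quorums differently.

```csharp
using System;

// Hypothetical helper for illustration; not part of the repository.
public readonly record struct QuorumConfig(int N, int W, int R)
{
    // When W + R > N, every read set shares at least one replica with every write set.
    public bool SetsOverlap => W + R > N;

    // Replicas that may be down while writes (or reads) can still reach their quorum.
    public int WriteFaultTolerance => N - W;
    public int ReadFaultTolerance => N - R;
}

public static class QuorumDemo
{
    public static void Main()
    {
        var configs = new[]
        {
            new QuorumConfig(N: 3, W: 2, R: 2), // overlapping quorums
            new QuorumConfig(N: 3, W: 1, R: 1), // fast, but reads may miss the latest write
            new QuorumConfig(N: 5, W: 3, R: 3), // tolerates two unavailable replicas
        };

        foreach (var c in configs)
        {
            Console.WriteLine(
                $"N={c.N} W={c.W} R={c.R} overlap={c.SetsOverlap} " +
                $"writeTolerance={c.WriteFaultTolerance} readTolerance={c.ReadFaultTolerance}");
        }
    }
}
```

Overlap only guarantees that a read contacts at least one replica that acknowledged the write. Sloppy quorums and concurrent writes can still produce stale or conflicting reads, which is why read repair and anti-entropy appear alongside the quorum rule.
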
Partitioning & Sharding

| Document | What You'll Learn |
| --- | --- |
| partitioning-overview.md | Partitioning vs sharding, horizontal vs vertical, OLTP vs OLAP |
| hash-partitioning.md | Hash functions, distribution analysis, the rehashing problem |
| range-partitioning.md | Boundary selection, time-series advantages, hot spot risks |
| consistent-hashing.md | Hash ring, virtual nodes, minimal redistribution on scaling |
| tenant-sharding.md | Multi-tenant isolation levels, geo-sharding, GDPR compliance |
| hot-partitions-and-rebalancing.md | Skew detection, key salting, rebalancing strategies |
| partition-key-selection.md | How to choose keys, cardinality, composite keys, common mistakes |

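
To make the rehashing problem concrete, the sketch below counts how many keys change partition when a hash-mod scheme grows from 4 to 5 partitions. It is an illustration under stated assumptions, not the repository's `HashPartitioner`: `RehashDemo` is a hypothetical name, and the loop counter stands in for hash values that are assumed to be uniformly distributed already.

```csharp
using System;

// Hypothetical demo class; not part of the repository.
public static class RehashDemo
{
    // Hash-mod routing: partition index = hash mod partition count.
    public static int Route(uint hash, int partitions) => (int)(hash % (uint)partitions);

    // Fraction of sampled hashes whose partition changes when the
    // partition count goes from oldCount to newCount.
    public static double MovedFraction(int sampleSize, int oldCount, int newCount)
    {
        int moved = 0;
        for (uint hash = 0; hash < sampleSize; hash++)
        {
            if (Route(hash, oldCount) != Route(hash, newCount))
                moved++;
        }
        return (double)moved / sampleSize;
    }

    public static void Main()
    {
        double moved = MovedFraction(sampleSize: 10_000, oldCount: 4, newCount: 5);
        Console.WriteLine($"4 -> 5 partitions: moved fraction = {moved}");
    }
}
```

For this sample the moved fraction is 0.8, even though filling a fifth partition evenly only requires moving about one fifth of the keys. Consistent hashing aims to close that gap.
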
Database Selection

| Document | What You'll Learn |
| --- | --- |
| sql-vs-nosql.md | Relational vs non-relational: when to choose each |
| document-db-guide.md | MongoDB, Couchbase: flexible schema, embedding vs referencing |
| when-to-use-graph-db.md | Neo4j, Neptune: fraud detection, recommendations, knowledge graphs |
| when-to-use-vector-db.md | Pinecone, pgvector: semantic search, RAG, embeddings |
| when-to-use-timeseries-db.md | TimescaleDB, InfluxDB: IoT, metrics, downsampling |
| when-to-use-key-value-db.md | Redis, DynamoDB: caching, sessions, rate limiting |
| when-to-use-wide-column-db.md | Cassandra, ScyllaDB: massive write throughput, global distribution |
| decision-matrix.md | Decision trees, comparison tables, "choose X when Y" guides |

File Formats

| Document | What You'll Learn |
| --- | --- |
| csv.md | Universal exchange, limitations, when to use |
| json.md | APIs, config, JSONL streaming, JSON Schema |
| xml.md | Enterprise integration, namespaces, SOAP |
| avro.md | Kafka serialization, schema evolution, Schema Registry |
| parquet.md | Columnar analytics, predicate pushdown, compression |
| orc.md | Hive ecosystem, stripe indexing, ACID support |
| format-comparison-matrix.md | All formats compared, decision tree, pipeline diagrams |
| schema-evolution.md | Backward/forward/full compatibility, migration patterns |

System Design Scenarios
Getting Started
Prerequisites
Running the Simulations
```bash
# Run ALL simulations (replication, partitioning, file formats, db selection)
dotnet run --project src/DataIntensiveLearning.Simulations

# Run a specific simulation category
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection
```
Running Tests
```bash
# Run all tests
dotnet test

# Run tests with detailed output
dotnet test --verbosity normal

# Run a specific test category
dotnet test --filter "FullyQualifiedName~Partitioning"
dotnet test --filter "FullyQualifiedName~Replication"
dotnet test --filter "FullyQualifiedName~DatabaseSelection"
```
Docker Compose Setup
The docker-compose.yml provides a full local environment with multiple database engines for hands-on comparison.
```bash
# Start all services
docker-compose up -d

# Verify all containers are healthy
docker-compose ps

# Stop all services
docker-compose down

# Stop and remove all data volumes
docker-compose down -v
```
Default connection strings (see docker-compose.yml for credentials):

| Service | Connection | Purpose |
| --- | --- | --- |
| PostgreSQL | `Host=localhost;Port=5432;Database=datalearning;Username=admin;Password=admin` | SQL/relational patterns |
| Redis | `localhost:6379` | Caching, sessions, key-value patterns |
| MongoDB | `mongodb://admin:admin@localhost:27017` | Document DB patterns |
| Neo4j | `bolt://localhost:7687` (neo4j/adminpassword) | Graph DB patterns |
| TimescaleDB | `Host=localhost;Port=5433;Database=timeseries;Username=admin;Password=admin` | Time-series patterns |

Key Tradeoffs at a Glance

| Tradeoff | Option A | Option B | When to Choose A | When to Choose B |
| --- | --- | --- | --- | --- |
| Consistency vs. Availability | Strong consistency | Eventual consistency | Financial transactions, inventory | Social feeds, analytics, caching |
| Normalization vs. Denormalization | Normalized (3NF) | Denormalized | Write-heavy, data integrity critical | Read-heavy, query performance critical |
| Row vs. Columnar Storage | Row-oriented | Column-oriented | OLTP, point lookups, frequent writes | OLAP, aggregations, scan-heavy queries |
| SQL vs. NoSQL | Relational DB | Document/KV/Graph DB | Complex joins, ACID needed | Flexible schema, horizontal scale |
| Replication: Sync vs. Async | Synchronous | Asynchronous | Durability guarantees required | Low latency, high throughput |
| Partitioning: Hash vs. Range | Hash partitioning | Range partitioning | Even distribution, no range queries | Range scans, time-series data |
| Binary vs. Text Formats | Avro/Protobuf/Parquet | JSON/CSV/XML | Performance, bandwidth, large data | Human readability, debugging, interchange |
| Dedicated vs. Embedded DB | Purpose-built (Neo4j, InfluxDB) | Extension (pgvector, TimescaleDB) | Max performance, specialized features | Reduce operational complexity |

What's in the Code
Replication Simulations

| Simulator | What It Demonstrates |
| --- | --- |
| `LeaderFollowerSimulator` | Sync/async replication, replication lag, stale reads, follower failure/recovery, leader failover |
| `MultiLeaderSimulator` | Cross-datacenter replication, write conflicts, LWW resolution, custom merge functions |
| `LeaderlessSimulator` | Quorum writes (W nodes), quorum reads (R nodes), read repair, anti-entropy, sloppy quorums |

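
As a rough picture of last-write-wins (LWW) resolution, the sketch below keeps whichever of two conflicting versions carries the later timestamp. The types and names (`VersionedValue`, `LastWriteWins`) are hypothetical and chosen for illustration; the repository's `MultiLeaderSimulator` may represent versions and tie-breaking differently.

```csharp
using System;

// Hypothetical types for illustration; not part of the repository.
public readonly record struct VersionedValue(string Value, long TimestampMs, string NodeId);

public static class LastWriteWins
{
    // Keeps the version with the larger timestamp. Ties are broken by node ID
    // so that every replica deterministically picks the same winner.
    public static VersionedValue Resolve(VersionedValue a, VersionedValue b)
    {
        if (a.TimestampMs != b.TimestampMs)
            return a.TimestampMs > b.TimestampMs ? a : b;
        return string.CompareOrdinal(a.NodeId, b.NodeId) >= 0 ? a : b;
    }

    public static void Main()
    {
        // Two leaders accept concurrent writes to the same key.
        var fromEu = new VersionedValue("cart=[book]", TimestampMs: 1_000, NodeId: "eu-west");
        var fromUs = new VersionedValue("cart=[pen]", TimestampMs: 1_005, NodeId: "us-east");

        var winner = Resolve(fromEu, fromUs);
        Console.WriteLine($"winner: {winner.Value} from {winner.NodeId}");
        // prints "winner: cart=[pen] from us-east"
    }
}
```

The write from `eu-west` is discarded without any error, which is the data-loss risk that makes custom merge functions or CRDTs attractive when both updates matter.
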
Partitioning Implementations

| Partitioner | What It Demonstrates |
| --- | --- |
| `HashPartitioner` | Hash-mod routing, distribution analysis, rehashing problem, hash function comparison |
| `RangePartitioner` | Boundary-based routing, range query support, partition pruning |
| `ConsistentHashPartitioner` | Hash ring, virtual nodes, minimal redistribution on node add/remove |
| `ListPartitioner<T>` | Explicit mapping, geo-routing, tenant isolation, default partition |
| `TimeBasedPartitioner` | Time-bucketed partitioning, retention policies, partition plans |
| `CompositePartitioner` | Two-level (hash + time) like Cassandra partition key + clustering column |
| `SkewAnalyzer` | CV calculation, hot partition detection, distribution health check |

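
The CV (coefficient of variation) mentioned for `SkewAnalyzer` is the standard deviation of per-partition record counts divided by their mean. The sketch below shows one way to compute it and flag hot partitions. `SkewSketch`, its method names, and the 2x-mean hot threshold are assumptions made for illustration, not the repository's actual API or defaults.

```csharp
using System;
using System.Linq;

// Hypothetical sketch; the repository's SkewAnalyzer may differ.
public static class SkewSketch
{
    // CV = population standard deviation / mean of per-partition record counts.
    // Assumes a non-empty array; 0 means perfectly even distribution.
    public static double CoefficientOfVariation(int[] counts)
    {
        double mean = counts.Average();
        if (mean == 0) return 0;
        double variance = counts.Sum(c => (c - mean) * (c - mean)) / counts.Length;
        return Math.Sqrt(variance) / mean;
    }

    // Returns indexes of partitions holding more than `factor` times the mean load.
    // The default factor of 2.0 is an arbitrary illustrative threshold.
    public static int[] HotPartitions(int[] counts, double factor = 2.0)
    {
        double mean = counts.Average();
        return Enumerable.Range(0, counts.Length)
            .Where(i => counts[i] > factor * mean)
            .ToArray();
    }

    public static void Main()
    {
        int[] balanced = { 100, 100, 100, 100 };
        int[] skewed = { 10, 10, 10, 370 };

        Console.WriteLine($"balanced CV = {CoefficientOfVariation(balanced):F3}");
        Console.WriteLine($"skewed CV   = {CoefficientOfVariation(skewed):F3}");
        Console.WriteLine($"hot partitions: [{string.Join(", ", HotPartitions(skewed))}]");
    }
}
```

Both samples hold 400 records in total, so the mean is the same. Only the CV (0 versus roughly 1.56) and the hot-partition list reveal that the fourth partition in the skewed sample carries most of the load.
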
Database Selection Advisor
The DatabaseSelectionAdvisor evaluates workload profiles (read/write ratio, query complexity, consistency needs, scale, latency, data model) and recommends the best database type with reasoning, alternatives, and warnings. Run the predefined scenarios to see recommendations for e-commerce, IoT, fraud detection, RAG, caching, and more.
File Format Examples

| Example | What It Demonstrates |
| --- | --- |
| CSV/JSON/XML | Serialization, deserialization, format comparison |
| Avro | Schema evolution, compatibility modes, Schema Registry concepts |
| Parquet | Row vs columnar storage, predicate pushdown, column pruning, compression |
| ORC | Stripe indexing, ACID support, ORC vs Parquet comparison |
| Protobuf | Binary encoding (varint, tags), schema evolution, size comparison |

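
The varint encoding listed for Protobuf stores an unsigned integer in 7-bit groups, least significant group first, with the high bit of each byte marking that more bytes follow. The sketch below is a minimal standalone illustration of that scheme. The `Varint` class is a hypothetical name, not the repository's `ProtobufFormatExample`, and a real project would normally rely on a Protobuf library rather than hand-rolled encoding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical standalone sketch of base-128 varint encoding.
public static class Varint
{
    public static byte[] Encode(ulong value)
    {
        var bytes = new List<byte>();
        while (value >= 0x80)
        {
            // Emit the low 7 bits and set the continuation bit.
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value); // final byte, continuation bit clear
        return bytes.ToArray();
    }

    public static ulong Decode(ReadOnlySpan<byte> data)
    {
        ulong result = 0;
        int shift = 0;
        foreach (byte b in data)
        {
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result; // last byte reached
            shift += 7;
        }
        throw new FormatException("Truncated varint.");
    }

    public static void Main()
    {
        byte[] encoded = Encode(300);
        Console.WriteLine(Convert.ToHexString(encoded)); // prints "AC02"
        Console.WriteLine(Decode(encoded));              // prints "300"
    }
}
```

Small values fit in a single byte (1 encodes as `01`), which is a large part of why Protobuf payloads tend to come out smaller than the equivalent JSON text.
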
License
This repository is for educational purposes. See LICENSE for details.