
Data-Intensive Applications Learning Repository

A hands-on .NET 10 learning repository for understanding the core principles behind data-intensive applications. Inspired by Martin Kleppmann's Designing Data-Intensive Applications, this project provides runnable simulations, comparison benchmarks, and reference documentation covering replication, partitioning, database selection, file formats, and system design tradeoffs.


What You Will Learn

| Topic | Description |
| --- | --- |
| Replication | Leader-follower, multi-leader, and leaderless replication strategies; conflict resolution; consistency guarantees; failover |
| Partitioning & Sharding | Hash, range, consistent hashing, list, time-based, composite partitioning; rebalancing; hot spot mitigation |
| Database Selection | SQL vs Document vs Graph vs Vector vs Time-Series vs Key-Value vs Wide-Column; decision matrices; real-world scenarios |
| File Formats | CSV, JSON, XML, Avro, Parquet, ORC, Protobuf; row vs columnar storage; schema evolution |
| System Design Tradeoffs | Consistency vs availability, normalization vs denormalization, latency vs throughput; comprehensive decision guides |

Repository Structure

```text
DistributedSystemWithBigData/
├── docs/
│   ├── replication/
│   │   ├── replication-overview.md         # All replication strategies compared
│   │   ├── leader-follower.md              # Single-leader deep dive
│   │   ├── multi-leader.md                 # Multi-leader + conflict resolution
│   │   ├── leaderless.md                   # Dynamo-style quorum replication
│   │   ├── consistency-and-lag.md          # Replication lag + consistency models
│   │   └── failure-scenarios.md            # Failover, split-brain, data loss
│   ├── partitioning/
│   │   ├── partitioning-overview.md        # Partitioning vs sharding taxonomy
│   │   ├── hash-partitioning.md            # Hash-based key distribution
│   │   ├── range-partitioning.md           # Range-based for time-series + scans
│   │   ├── consistent-hashing.md           # Hash ring + vnodes
│   │   ├── tenant-sharding.md              # Multi-tenant isolation strategies
│   │   ├── hot-partitions-and-rebalancing.md # Skew detection + rebalancing
│   │   └── partition-key-selection.md      # How to choose partition keys
│   ├── database-selection/
│   │   ├── sql-vs-nosql.md                 # Relational vs non-relational
│   │   ├── document-db-guide.md            # MongoDB, Couchbase deep dive
│   │   ├── when-to-use-graph-db.md         # Neo4j, Neptune use cases
│   │   ├── when-to-use-vector-db.md        # Pinecone, pgvector, RAG
│   │   ├── when-to-use-timeseries-db.md    # TimescaleDB, InfluxDB
│   │   ├── when-to-use-key-value-db.md     # Redis, DynamoDB patterns
│   │   ├── when-to-use-wide-column-db.md   # Cassandra, ScyllaDB
│   │   └── decision-matrix.md              # Comprehensive comparison + flowcharts
│   ├── file-formats/
│   │   ├── csv.md                          # CSV deep dive
│   │   ├── json.md                         # JSON + JSON Lines
│   │   ├── xml.md                          # XML + XSD + SOAP
│   │   ├── avro.md                         # Avro + Schema Registry
│   │   ├── parquet.md                      # Columnar analytics format
│   │   ├── orc.md                          # Hive-native columnar format
│   │   ├── format-comparison-matrix.md     # All formats compared
│   │   └── schema-evolution.md             # Schema evolution patterns
│   ├── system-design-scenarios/
│   │   ├── ecommerce-platform.md           # Orders, catalog, search, cache
│   │   ├── iot-platform.md                 # Sensor ingestion + time-series
│   │   ├── fraud-detection.md              # Graph-based fraud ring detection
│   │   ├── observability-pipeline.md       # Metrics, logs, traces
│   │   └── rag-search-platform.md          # Vector DB + RAG architecture
│   ├── interview-revision-cheatsheet.md    # Master cheat sheet for all topics
│   └── architecture-decision-flow.md       # Decision flowcharts (Mermaid)
├── src/
│   ├── DataIntensiveLearning.Core/         # Shared domain models + abstractions
│   │   ├── Enums/                          # DatabaseType, ReplicationType, etc.
│   │   ├── Interfaces/                     # IPartitioner, IReplicationSimulator
│   │   └── Models/                         # DataRecord, ReplicaNode, Partition
│   ├── DataIntensiveLearning.Simulations/  # Runnable console simulations
│   │   ├── Program.cs                      # Entry point (run with --scenario)
│   │   └── Replication/
│   │       ├── LeaderFollowerSimulator.cs  # Sync/async replication + failover
│   │       ├── MultiLeaderSimulator.cs     # Conflict detection + resolution
│   │       └── LeaderlessSimulator.cs      # Quorum reads/writes + read repair
│   ├── DataIntensiveLearning.Partitioning/ # Partitioning strategy implementations
│   │   ├── HashPartitioner.cs              # Hash-mod with distribution analysis
│   │   ├── RangePartitioner.cs             # Boundary-based range partitioning
│   │   ├── ConsistentHashPartitioner.cs    # Hash ring with virtual nodes
│   │   ├── ListPartitioner.cs              # Explicit key-to-partition mapping
│   │   ├── TimeBasedPartitioner.cs         # Time-bucketed partitioning
│   │   ├── CompositePartitioner.cs         # Hash + time two-level partitioning
│   │   └── SkewAnalyzer.cs                 # Hot partition detection
│   ├── DataIntensiveLearning.FileFormats/  # File format examples
│   │   ├── CsvFormatExample.cs             # CSV read/write
│   │   ├── JsonFormatExample.cs            # JSON serialization
│   │   ├── XmlFormatExample.cs             # XML serialization + namespaces
│   │   ├── AvroFormatExample.cs            # Avro schema + evolution concepts
│   │   ├── ParquetFormatExample.cs         # Columnar storage concepts
│   │   ├── OrcFormatExample.cs             # ORC vs Parquet comparison
│   │   ├── ProtobufFormatExample.cs        # Binary encoding + schema evolution
│   │   └── Formats/                        # Reusable format implementations
│   ├── DataIntensiveLearning.DatabaseSelection/
│   │   ├── DatabaseSelectionAdvisor.cs     # Rule-based recommendation engine
│   │   └── Models/                         # WorkloadProfile, DatabaseRecommendation
│   └── DataIntensiveLearning.Api/          # ASP.NET Core API scaffold
├── tests/
│   └── DataIntensiveLearning.UnitTests/
│       ├── Partitioning/                   # Hash, range, consistent hash, list, time, composite tests
│       ├── Replication/                    # Quorum, leaderless simulator tests
│       ├── FileFormats/                    # CSV, JSON serialization tests
│       └── DatabaseSelection/              # Advisor recommendation tests
├── docker/
│   └── Dockerfile.api                      # .NET 10 API container
├── docker-compose.yml                      # PostgreSQL, Redis, MongoDB, Neo4j, TimescaleDB
├── DataIntensiveLearning.sln               # .NET solution file
└── README.md                               # This file
```

Study Paths

Path 1: Beginner (Foundations First)

  1. Replication — How data is copied across nodes and why consistency is hard.
  2. Partitioning — How to split data across machines for horizontal scale.
  3. Database Selection — Choosing the right database for each workload.
  4. File Formats — How serialization and storage format choices affect performance.
  5. System Design — Apply everything in realistic design exercises.

Path 2: Interview Prep (Fast Track)

  1. Interview Cheat Sheet — One-page summaries and common questions.
  2. Architecture Decision Flow — Decision flowcharts for databases, formats, partitioning.
  3. Database Decision Matrix — "Choose X when Y" tables.
  4. System Design Scenarios — Practice articulating design decisions.

Path 3: Deep Dive by Topic

| Start Here | Then Read | Then Code |
| --- | --- | --- |
| Replication Overview | Leader-Follower → Multi-Leader → Leaderless | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication` |
| Partitioning Overview | Hash → Range → Consistent Hashing | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning` |
| SQL vs NoSQL | Individual DB guides → Decision Matrix | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection` |
| Format Comparison | Individual format docs → Schema Evolution | Run `dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats` |

Topic-by-Topic Navigation

Replication

| Document | What You'll Learn |
| --- | --- |
| replication-overview.md | All three topologies compared, sync vs async, when to use each |
| leader-follower.md | Single-leader write flow, read replicas, failover process |
| multi-leader.md | Multi-datacenter replication, conflict resolution (LWW, merge, CRDTs) |
| leaderless.md | Quorum reads/writes (W + R > N), read repair, anti-entropy, sloppy quorums |
| consistency-and-lag.md | Read-after-write, monotonic reads, bounded staleness, causal consistency |
| failure-scenarios.md | Leader failure, split-brain, network partitions, data loss scenarios |
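
The quorum condition W + R > N listed for leaderless.md can be illustrated with a small sketch. With strict quorums, any set of R replicas then shares at least one replica with the W replicas that acknowledged the latest write, so a read that keeps the highest version sees that write. The helper names `QuorumsOverlap` and `QuorumRead` are hypothetical and are not taken from the repository's LeaderlessSimulator.

```csharp
using System;
using System.Linq;

// Hypothetical helper: read and write quorums are guaranteed to share
// at least one replica only when W + R > N.
bool QuorumsOverlap(int n, int w, int r) => w + r > n;

// Hypothetical helper: among the replies of the R replicas that answered,
// keep the value carrying the highest version number.
string QuorumRead((long Version, string Value)[] replies) =>
    replies.OrderByDescending(reply => reply.Version).First().Value;

Console.WriteLine(QuorumsOverlap(n: 3, w: 2, r: 2)); // True
Console.WriteLine(QuorumsOverlap(n: 3, w: 1, r: 1)); // False

// Two replicas answered the read; one of them has not yet seen version 8.
var replies = new[] { (Version: 7L, Value: "stale"), (Version: 8L, Value: "fresh") };
Console.WriteLine(QuorumRead(replies)); // fresh
```

In a Dynamo-style system the stale replica found during such a read would typically be updated afterwards, which is the read repair step named in the table.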

Partitioning & Sharding

| Document | What You'll Learn |
| --- | --- |
| partitioning-overview.md | Partitioning vs sharding, horizontal vs vertical, OLTP vs OLAP |
| hash-partitioning.md | Hash functions, distribution analysis, the rehashing problem |
| range-partitioning.md | Boundary selection, time-series advantages, hot spot risks |
| consistent-hashing.md | Hash ring, virtual nodes, minimal redistribution on scaling |
| tenant-sharding.md | Multi-tenant isolation levels, geo-sharding, GDPR compliance |
| hot-partitions-and-rebalancing.md | Skew detection, key salting, rebalancing strategies |
| partition-key-selection.md | How to choose keys, cardinality, composite keys, common mistakes |
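
The rehashing problem and the "minimal redistribution" property of consistent hashing can be compared in a short experiment. The sketch below is an illustration under stated assumptions, not the repository's HashPartitioner or ConsistentHashPartitioner: the helper names, the SHA-256-based hash, the key format, and the choice of 100 virtual nodes per node are all hypothetical. For uniformly distributed hashes, growing from 4 to 5 partitions with hash-mod routing relocates about four fifths of the keys in expectation, while a hash ring relocates roughly the share taken over by the new node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Stable 32-bit hash. string.GetHashCode is randomized per process in modern .NET,
// so it is unsuitable for routing decisions that must survive restarts.
uint StableHash(string s) =>
    BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(s)), 0);

// Hash-mod routing: partition = hash % partitionCount.
int ModPartition(string key, int partitionCount) =>
    (int)(StableHash(key) % (uint)partitionCount);

// Consistent hashing: every node owns several positions (virtual nodes) on a ring.
List<(uint Position, string Node)> BuildRing(IEnumerable<string> nodes, int vnodes) =>
    nodes.SelectMany(n => Enumerable.Range(0, vnodes)
                                    .Select(i => (Position: StableHash($"{n}#{i}"), Node: n)))
         .OrderBy(entry => entry.Position)
         .ToList();

// A key belongs to the first ring position at or after its hash, wrapping around.
string RingNode(List<(uint Position, string Node)> ring, string key)
{
    uint h = StableHash(key);
    int lo = 0, hi = ring.Count;
    while (lo < hi) // binary search for the first position >= h
    {
        int mid = (lo + hi) / 2;
        if (ring[mid].Position < h) lo = mid + 1; else hi = mid;
    }
    return ring[lo % ring.Count].Node;
}

var keys = Enumerable.Range(0, 10_000).Select(i => $"key-{i}").ToList();

// Scale from 4 to 5 partitions/nodes and count the keys whose owner changes.
int modMoved = keys.Count(k => ModPartition(k, 4) != ModPartition(k, 5));

var ring4 = BuildRing(new[] { "node-0", "node-1", "node-2", "node-3" }, vnodes: 100);
var ring5 = BuildRing(new[] { "node-0", "node-1", "node-2", "node-3", "node-4" }, vnodes: 100);
int ringMoved = keys.Count(k => RingNode(ring4, k) != RingNode(ring5, k));

Console.WriteLine($"hash-mod moved:        {modMoved} of {keys.Count} keys");
Console.WriteLine($"consistent-hash moved: {ringMoved} of {keys.Count} keys");
```

On the ring, every key that changes owner moves to the newly added node, so the existing nodes never exchange data with each other during scale-out.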

Database Selection

| Document | What You'll Learn |
| --- | --- |
| sql-vs-nosql.md | Relational vs non-relational: when to choose each |
| document-db-guide.md | MongoDB, Couchbase: flexible schema, embedding vs referencing |
| when-to-use-graph-db.md | Neo4j, Neptune: fraud detection, recommendations, knowledge graphs |
| when-to-use-vector-db.md | Pinecone, pgvector: semantic search, RAG, embeddings |
| when-to-use-timeseries-db.md | TimescaleDB, InfluxDB: IoT, metrics, downsampling |
| when-to-use-key-value-db.md | Redis, DynamoDB: caching, sessions, rate limiting |
| when-to-use-wide-column-db.md | Cassandra, ScyllaDB: massive write throughput, global distribution |
| decision-matrix.md | Decision trees, comparison tables, "choose X when Y" guides |

File Formats

| Document | What You'll Learn |
| --- | --- |
| csv.md | Universal exchange, limitations, when to use |
| json.md | APIs, config, JSONL streaming, JSON Schema |
| xml.md | Enterprise integration, namespaces, SOAP |
| avro.md | Kafka serialization, schema evolution, Schema Registry |
| parquet.md | Columnar analytics, predicate pushdown, compression |
| orc.md | Hive ecosystem, stripe indexing, ACID support |
| format-comparison-matrix.md | All formats compared, decision tree, pipeline diagrams |
| schema-evolution.md | Backward/forward/full compatibility, migration patterns |
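
Backward compatibility, as listed for schema-evolution.md, means a reader using a newer schema can still read data written with an older one. A common way to achieve this is to give every added field a default value. The sketch below models records as dictionaries to show the idea; `ReadWithNewSchema`, the field names, and the default value are hypothetical and do not reflect Avro's actual API or the repository's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reader: resolves a record written with an older schema against
// the field list of a newer reader schema.
Dictionary<string, object> ReadWithNewSchema(
    Dictionary<string, object> oldRecord,
    string[] readerFields,
    Dictionary<string, object> defaults)
{
    var result = new Dictionary<string, object>();
    foreach (string field in readerFields)
    {
        if (oldRecord.TryGetValue(field, out var value))
            result[field] = value;        // field exists in the old data
        else if (defaults.TryGetValue(field, out var fallback))
            result[field] = fallback;     // added field: fall back to its default
        else
            throw new InvalidOperationException($"No value or default for '{field}'");
    }
    return result;                        // fields unknown to the reader are dropped
}

// Written with schema v1 (id, name); read with schema v2, which adds "unit".
var writtenWithV1 = new Dictionary<string, object> { ["id"] = 42, ["name"] = "sensor-a" };
var v2Fields = new[] { "id", "name", "unit" };
var v2Defaults = new Dictionary<string, object> { ["unit"] = "celsius" };

var record = ReadWithNewSchema(writtenWithV1, v2Fields, v2Defaults);
Console.WriteLine($"id={record["id"]}, name={record["name"]}, unit={record["unit"]}");
// id=42, name=sensor-a, unit=celsius
```

Adding a field without a default would make the reader throw on old data, which is the kind of change a compatibility check is meant to reject before deployment.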

System Design Scenarios

| Document | What You'll Learn |
| --- | --- |
| ecommerce-platform.md | Orders (SQL) + catalog (MongoDB) + cache (Redis) + analytics |
| iot-platform.md | Sensor ingestion → Kafka → TimescaleDB, time partitioning |
| fraud-detection.md | Graph traversal for fraud rings, Neo4j + real-time streaming |
| observability-pipeline.md | Metrics + logs + traces, Prometheus, Parquet archival |
| rag-search-platform.md | Embeddings + vector DB + hybrid search, pgvector vs Pinecone |

Getting Started

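Prerequisites

- .NET 10 SDK, to build and run the simulations and tests
- Docker with Docker Compose, needed only for the local database services
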
Running the Simulations

```bash
# Run ALL simulations (replication, partitioning, file formats, db selection)
dotnet run --project src/DataIntensiveLearning.Simulations

# Run a specific simulation category
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario replication
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario partitioning
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario file-formats
dotnet run --project src/DataIntensiveLearning.Simulations -- --scenario db-selection
```

Running Tests

```bash
# Run all tests
dotnet test

# Run tests with detailed output
dotnet test --verbosity normal

# Run a specific test category
dotnet test --filter "FullyQualifiedName~Partitioning"
dotnet test --filter "FullyQualifiedName~Replication"
dotnet test --filter "FullyQualifiedName~DatabaseSelection"
```

Docker Compose Setup

The docker-compose.yml provides a full local environment with multiple database engines for hands-on comparison.

```bash
# Start all services
docker-compose up -d

# Verify all containers are healthy
docker-compose ps

# Stop all services
docker-compose down

# Stop and remove all data volumes
docker-compose down -v
```

Default connection strings (see docker-compose.yml for credentials):

| Service | Connection | Purpose |
| --- | --- | --- |
| PostgreSQL | `Host=localhost;Port=5432;Database=datalearning;Username=admin;Password=admin` | SQL/relational patterns |
| Redis | `localhost:6379` | Caching, sessions, key-value patterns |
| MongoDB | `mongodb://admin:admin@localhost:27017` | Document DB patterns |
| Neo4j | `bolt://localhost:7687` (neo4j/adminpassword) | Graph DB patterns |
| TimescaleDB | `Host=localhost;Port=5433;Database=timeseries;Username=admin;Password=admin` | Time-series patterns |

Key Tradeoffs at a Glance

| Tradeoff | Option A | Option B | When to Choose A | When to Choose B |
| --- | --- | --- | --- | --- |
| Consistency vs. Availability | Strong consistency | Eventual consistency | Financial transactions, inventory | Social feeds, analytics, caching |
| Normalization vs. Denormalization | Normalized (3NF) | Denormalized | Write-heavy, data integrity critical | Read-heavy, query performance critical |
| Row vs. Columnar Storage | Row-oriented | Column-oriented | OLTP, point lookups, frequent writes | OLAP, aggregations, scan-heavy queries |
| SQL vs. NoSQL | Relational DB | Document/KV/Graph DB | Complex joins, ACID needed | Flexible schema, horizontal scale |
| Replication: Sync vs. Async | Synchronous | Asynchronous | Durability guarantees required | Low latency, high throughput |
| Partitioning: Hash vs. Range | Hash partitioning | Range partitioning | Even distribution, no range queries | Range scans, time-series data |
| Binary vs. Text Formats | Avro/Protobuf/Parquet | JSON/CSV/XML | Performance, bandwidth, large data | Human readability, debugging, interchange |
| Dedicated vs. Embedded DB | Purpose-built (Neo4j, InfluxDB) | Extension (pgvector, TimescaleDB) | Max performance, specialized features | Reduce operational complexity |

What's in the Code

Replication Simulations

| Simulator | What It Demonstrates |
| --- | --- |
| `LeaderFollowerSimulator` | Sync/async replication, replication lag, stale reads, follower failure/recovery, leader failover |
| `MultiLeaderSimulator` | Cross-datacenter replication, write conflicts, LWW resolution, custom merge functions |
| `LeaderlessSimulator` | Quorum writes (W nodes), quorum reads (R nodes), read repair, anti-entropy, sloppy quorums |
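
Last-write-wins (LWW) resolution, one of the strategies listed for the multi-leader simulator, can be sketched in a few lines. The function name `ResolveLww`, the tuple shape, and the tie-break on node ID are assumptions made for illustration; the repository's MultiLeaderSimulator may resolve conflicts differently.

```csharp
using System;

// LWW: keep the write with the larger timestamp. Ties are broken by node ID
// so that every replica deterministically picks the same winner.
(string Value, long Timestamp, string NodeId) ResolveLww(
    (string Value, long Timestamp, string NodeId) a,
    (string Value, long Timestamp, string NodeId) b)
{
    if (a.Timestamp != b.Timestamp)
        return a.Timestamp > b.Timestamp ? a : b;
    return string.CompareOrdinal(a.NodeId, b.NodeId) >= 0 ? a : b;
}

// Two leaders accept concurrent writes to the same key.
var fromEurope = (Value: "cart=[book]", Timestamp: 1_000L, NodeId: "eu-1");
var fromUs = (Value: "cart=[pen]", Timestamp: 1_005L, NodeId: "us-1");

var winner = ResolveLww(fromEurope, fromUs);
Console.WriteLine($"winner: {winner.Value} from {winner.NodeId}");
// winner: cart=[pen] from us-1
```

The write from `eu-1` is discarded without any error, which is why LWW is simple but lossy, and why custom merge functions or CRDTs are used when concurrent updates must all survive.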

Partitioning Implementations

| Partitioner | What It Demonstrates |
| --- | --- |
| `HashPartitioner` | Hash-mod routing, distribution analysis, rehashing problem, hash function comparison |
| `RangePartitioner` | Boundary-based routing, range query support, partition pruning |
| `ConsistentHashPartitioner` | Hash ring, virtual nodes, minimal redistribution on node add/remove |
| `ListPartitioner<T>` | Explicit mapping, geo-routing, tenant isolation, default partition |
| `TimeBasedPartitioner` | Time-bucketed partitioning, retention policies, partition plans |
| `CompositePartitioner` | Two-level (hash + time) partitioning, similar to a Cassandra partition key plus clustering column |
| `SkewAnalyzer` | Coefficient of variation (CV) calculation, hot partition detection, distribution health check |
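
The coefficient of variation (CV) used for skew analysis is the standard deviation of the per-partition load divided by its mean: 0 means perfectly even, and larger values mean more skew. The sketch below is a minimal illustration, not the repository's SkewAnalyzer. The function names, the use of the population standard deviation, and the "hot if above 2x the mean" rule are assumptions.

```csharp
using System;
using System.Linq;

// CV = population standard deviation / mean of the per-partition counts.
double CoefficientOfVariation(int[] counts)
{
    double mean = counts.Average();
    if (mean == 0) return 0;
    double variance = counts.Sum(c => (c - mean) * (c - mean)) / counts.Length;
    return Math.Sqrt(variance) / mean;
}

// Assumed rule: a partition is "hot" when its load exceeds hotFactor times the mean.
int[] HotPartitions(int[] counts, double hotFactor)
{
    double mean = counts.Average();
    return Enumerable.Range(0, counts.Length)
                     .Where(i => counts[i] > hotFactor * mean)
                     .ToArray();
}

int[] balanced = { 100, 100, 100, 100 };
int[] skewed = { 100, 100, 100, 700 };

Console.WriteLine($"balanced CV: {CoefficientOfVariation(balanced):F2}");
Console.WriteLine($"skewed CV:   {CoefficientOfVariation(skewed):F2}");
Console.WriteLine($"hot partitions: {string.Join(", ", HotPartitions(skewed, hotFactor: 2.0))}");
```

For the skewed example the mean is 250 and the CV is about 1.04, and only partition index 3 exceeds twice the mean.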

Database Selection Advisor

The DatabaseSelectionAdvisor evaluates workload profiles (read/write ratio, query complexity, consistency needs, scale, latency, data model) and recommends the best database type with reasoning, alternatives, and warnings. Run the predefined scenarios to see recommendations for e-commerce, IoT, fraud detection, RAG, caching, and more.
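
A rule-based recommendation of this kind can be pictured as an ordered list of pattern checks. The sketch below is a deliberately tiny, hypothetical rule set derived from the tradeoffs in this README. The `Recommend` function, its three inputs, and the rule order are assumptions; the actual DatabaseSelectionAdvisor considers more dimensions and also returns reasoning, alternatives, and warnings.

```csharp
using System;

// Hypothetical rules, evaluated top to bottom; the first match wins.
string Recommend(string dataModel, bool needsAcid, bool massiveWrites) =>
    (dataModel, needsAcid, massiveWrites) switch
    {
        ("graph", _, _) => "Graph",             // relationship traversal
        ("embeddings", _, _) => "Vector",       // similarity search, RAG
        ("time-series", _, _) => "Time-Series", // timestamped metrics, IoT
        ("key-value", _, _) => "Key-Value",     // caching, sessions
        (_, true, _) => "SQL",                  // joins and ACID transactions
        (_, false, true) => "Wide-Column",      // very high write throughput
        _ => "Document",                        // flexible schema by default
    };

Console.WriteLine(Recommend("graph", needsAcid: false, massiveWrites: false));      // Graph
Console.WriteLine(Recommend("relational", needsAcid: true, massiveWrites: false));  // SQL
Console.WriteLine(Recommend("events", needsAcid: false, massiveWrites: true));      // Wide-Column
Console.WriteLine(Recommend("catalog", needsAcid: false, massiveWrites: false));    // Document
```

Because the first matching rule wins, the order of the rules encodes their priority, which is one reason a real advisor benefits from returning alternatives alongside the top recommendation.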

File Format Examples

| Example | What It Demonstrates |
| --- | --- |
| CSV/JSON/XML | Serialization, deserialization, format comparison |
| Avro | Schema evolution, compatibility modes, Schema Registry concepts |
| Parquet | Row vs columnar storage, predicate pushdown, column pruning, compression |
| ORC | Stripe indexing, ACID support, ORC vs Parquet comparison |
| Protobuf | Binary encoding (varint, tags), schema evolution, size comparison |
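
The varint and tag encoding mentioned for Protobuf can be reproduced by hand. A varint stores an integer in 7-bit groups, least significant group first, with the high bit of each byte set when more bytes follow. A field tag is the varint of `(fieldNumber << 3) | wireType`, where wire type 0 means varint. The sketch below is a hand-rolled illustration with hypothetical function names, not the repository's ProtobufFormatExample and not the official Protobuf library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Base-128 varint: 7 payload bits per byte, continuation flag in the high bit.
byte[] EncodeVarint(ulong value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        bytes.Add((byte)((value & 0x7F) | 0x80)); // low 7 bits + "more bytes follow"
        value >>= 7;
    }
    bytes.Add((byte)value);                       // final byte, high bit clear
    return bytes.ToArray();
}

ulong DecodeVarint(byte[] bytes)
{
    ulong result = 0;
    int shift = 0;
    foreach (byte b in bytes)
    {
        result |= (ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;               // last byte of this varint
        shift += 7;
    }
    return result;
}

// Field tag = (field number << 3) | wire type.
byte[] EncodeTag(int fieldNumber, int wireType) =>
    EncodeVarint((ulong)((fieldNumber << 3) | wireType));

Console.WriteLine(BitConverter.ToString(EncodeVarint(150))); // 96-01
Console.WriteLine(BitConverter.ToString(EncodeVarint(300))); // AC-02
Console.WriteLine(DecodeVarint(new byte[] { 0xAC, 0x02 }));  // 300

// A message with field 1 (varint) = 150, next to the equivalent JSON text.
byte[] message = EncodeTag(fieldNumber: 1, wireType: 0).Concat(EncodeVarint(150)).ToArray();
Console.WriteLine(BitConverter.ToString(message));                              // 08-96-01
Console.WriteLine($"protobuf: {message.Length} bytes, JSON: {Encoding.UTF8.GetByteCount("{\"id\":150}")} bytes");
```

The binary message needs 3 bytes where the JSON text needs 10, largely because the field name is replaced by a numeric tag. That substitution is also what makes field numbers part of the schema contract during evolution.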

License

This repository is for educational purposes. See LICENSE for details.