docs/Distributed System With Big Data/decision-matrix

Database Selection Decision Matrix

1-Paragraph Summary

Choosing the right database is one of the most consequential architectural decisions in system design. This guide provides a comprehensive comparison of seven database categories -- SQL (relational), Document, Graph, Vector, Time-Series, Key-Value, and Wide-Column -- across critical dimensions including consistency guarantees, scalability characteristics, query complexity support, latency profiles, data model flexibility, operational burden, and cost. Use the decision flowchart and comparison tables below to quickly narrow candidates based on your workload requirements, then dive into the individual guides for deeper analysis.

The Seven Database Categories

Category	Representative Databases	Core Data Model	Primary Strength
SQL (Relational)	PostgreSQL, MySQL, CockroachDB, Aurora	Tables with rows and columns, enforced schema	ACID transactions, complex queries, data integrity
Document	MongoDB, CouchDB, Firestore, DynamoDB	JSON/BSON documents in collections	Schema flexibility, developer productivity
Graph	Neo4j, Neptune, TigerGraph, ArangoDB	Nodes and edges with properties	Relationship traversals, pattern matching
Vector	Pinecone, Milvus, Weaviate, pgvector	High-dimensional embedding vectors	Similarity search, AI/ML applications
Time-Series	InfluxDB, TimescaleDB, QuestDB, Prometheus	Timestamped data points in series	Time-ordered ingestion, downsampling, retention
Key-Value	Redis, DynamoDB, etcd, Memcached	Simple key-to-value mappings	Sub-millisecond reads, caching, sessions
Wide-Column	Cassandra, ScyllaDB, HBase, Bigtable	Column families with flexible columns per row	Massive write throughput, linear scalability

Comprehensive Comparison Table

Consistency and Availability

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Default consistency	Strong (ACID)	Eventual / Tunable	Strong (single-node)	Eventual	Strong (single-node)	Strong or Eventual	Tunable
Transaction support	Full ACID, multi-row	Single-doc atomic; limited multi-doc	Varies by engine	None typically	Limited	Single-key atomic	Lightweight (LWT)
CAP classification	CP (most)	AP or CP (varies)	CP (most)	AP	CP (most)	CP or AP (varies)	AP (Cassandra) or CP (HBase)
Isolation levels	Serializable, SI, RC, RU	Read concern levels	Varies	N/A	Varies	Single-op atomic	Per-query tunable

Scalability

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Horizontal scaling	Hard (sharding complex)	Native (auto-sharding)	Limited	Native	Moderate	Native	Native (linear)
Max practical scale	~10 TB (single), ~PB (NewSQL)	PB scale	~100s GB typical	TB scale	PB scale (with tiering)	TB scale (in-memory)	PB scale
Write scalability	Moderate	High	Moderate	High (batch)	Very high	Very high	Very high
Read scalability	High (with replicas)	High	High (local queries)	High	High	Very high	High
Geo-distribution	Complex (CockroachDB, Spanner)	Built-in (Atlas, Cosmos)	Limited	Managed options	Limited	Redis Enterprise	Built-in (Cassandra)

Query Capabilities

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Query language	SQL (standard)	MongoDB Query, SQL-like	Cypher, Gremlin, SPARQL	Vector APIs, filters	SQL-like (InfluxQL, Flux)	GET/SET/SCAN	CQL
Join support	Full (inner, outer, cross)	$lookup (limited)	Traversals (native)	None	Limited	None	None
Aggregation	Full (GROUP BY, HAVING, window)	Aggregation pipeline	Limited	None	Built-in (downsampling)	None	Limited
Ad-hoc queries	Excellent	Good	Good (for graph patterns)	Limited (similarity only)	Good (time-based)	Poor (key-only)	Poor (partition-key required)
Secondary indexes	Unlimited	Supported	Property indexes	Metadata filtering	Tag-based	Limited	Limited (use with caution)
Full-text search	Basic (or with extensions)	Atlas Search / built-in	Limited	Hybrid (some engines)	No	No	No

Latency and Performance

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Read latency (p50)	1-10 ms	1-10 ms	1-50 ms	10-100 ms	1-10 ms	< 1 ms	1-10 ms
Write latency (p50)	1-10 ms	1-10 ms	5-50 ms	10-100 ms	< 1 ms	< 1 ms	1-5 ms
Bulk ingestion	Moderate	High	Slow	High (batch)	Very high	High	Very high
Hot-path optimization	B-tree indexes	Document indexes	Graph traversal cache	HNSW/IVF indexes	Time-partitioned storage	In-memory hash	LSM-tree + Bloom filters

Operational Characteristics

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Schema management	Strict (migrations required)	Flexible (schema-on-read)	Flexible	Dimension-fixed	Semi-structured	Schemaless	Semi-structured (CQL)
Learning curve	Low (SQL is universal)	Low-moderate	Moderate-high	Moderate	Low-moderate	Very low	Moderate-high
Ops complexity	Low-moderate	Low (managed) to moderate	Moderate-high	Low (managed)	Low-moderate	Low	High
Backup/restore	Mature tooling	Built-in (managed)	Varies	Managed	Built-in	RDB/AOF snapshots	Snapshot + repair
Monitoring	Excellent ecosystem	Good	Limited	Basic	Good	Good	Good (but complex)

Cost Considerations

Dimension	SQL	Document	Graph	Vector	Time-Series	Key-Value	Wide-Column
Infrastructure cost	Moderate	Moderate	High (memory-intensive)	High (GPU/memory)	Low-moderate	High (RAM-bound)	Moderate (disk-based)
Managed service cost	$$	$$	$$$	$$$	$$	$ - $$	$$
Developer productivity	High (familiar SQL)	Very high (JSON-native)	Moderate	Moderate	High	Very high	Moderate
Vendor lock-in risk	Low (SQL standard)	Moderate	High	High	Moderate	Low	Low-moderate

Decision Flowchart

Use this flowchart to narrow your database selection based on primary workload characteristics:

Secondary Decision: Scale and Consistency Requirements

After identifying the database category from the primary flowchart, use this matrix to select the specific technology:

Use Case Quick Reference

By Industry / Domain

Use Case	Primary DB	Secondary DB	Why
E-commerce (orders, inventory)	PostgreSQL	Redis (cache)	ACID for orders, fast cache for catalog
Social network (feeds, connections)	Neo4j	Cassandra (feed storage)	Graph for connections, wide-column for feeds
IoT platform (sensor data)	Cassandra / ScyllaDB	TimescaleDB (analytics)	Massive writes with wide-column, analytics with TSDB
Content management system	MongoDB	Elasticsearch (search)	Flexible documents, full-text search
Real-time analytics dashboard	InfluxDB / TimescaleDB	Redis (cache)	Time-series ingestion, fast dashboard reads
AI/ML recommendation engine	Pinecone / Milvus	PostgreSQL (metadata)	Vector similarity search, relational metadata
Chat / messaging platform	ScyllaDB	Redis (presence)	Write-heavy message storage, in-memory presence
Financial trading platform	CockroachDB	Redis (order book)	Strong consistency, low-latency lookups
User session management	Redis	DynamoDB (persistence)	Sub-ms reads, durable backup
Fraud detection	Neo4j	PostgreSQL (case mgmt)	Graph pattern matching, relational case data
Log aggregation	Cassandra	Elasticsearch (search)	High write throughput, full-text search
Product catalog	MongoDB	Redis (cache)	Flexible schema per category, fast reads

By Access Pattern

Access Pattern	Best Fit	Rationale
Point lookups by key	Key-Value (Redis, DynamoDB)	O(1) hash-based retrieval
Range scans over time	Time-Series or Wide-Column	Contiguous disk reads, time-partitioned
Complex joins across entities	SQL (PostgreSQL)	Native JOIN optimization, query planner
Traversal of relationships	Graph (Neo4j)	Constant-time per-hop traversal
Nearest-neighbor search	Vector (Pinecone, Milvus)	ANN indexes (HNSW, IVF)
Flexible document retrieval	Document (MongoDB)	Rich query on nested structures
High-throughput writes	Wide-Column (Cassandra)	LSM-tree, append-only, linear scale
Aggregation over time windows	Time-Series (TimescaleDB)	Continuous aggregates, downsampling
Full-text search	Elasticsearch (not in matrix)	Inverted indexes, BM25 ranking
Multi-model (several patterns)	PostgreSQL + extensions	pgvector, TimescaleDB, PostGIS, JSON

Multi-Database Architecture Patterns

Most production systems use polyglot persistence -- multiple databases, each handling what it does best.

Scoring Matrix: Rate Your Workload

Use this scoring approach to systematically evaluate database options. Rate each dimension 1-5 based on your requirements, then match to the database category with the highest alignment.

Requirement Scoring Template

Dimension	Weight (1-5)	Your Score	SQL	Doc	Graph	Vector	TSDB	KV	Wide-Col
Strong consistency	?	?	5	3	4	2	4	3	3
Horizontal scalability	?	?	2	4	2	4	3	5	5
Query complexity	?	?	5	3	4	1	3	1	2
Read latency	?	?	4	4	3	3	4	5	4
Write throughput	?	?	3	4	2	3	5	5	5
Schema flexibility	?	?	1	5	4	2	3	5	3
Operational simplicity	?	?	4	4	2	3	4	4	2
Low cost at scale	?	?	3	3	2	2	4	3	4
Relationship queries	?	?	3	2	5	1	1	1	1
Similarity search	?	?	2	1	1	5	1	1	1

How to use: Multiply your weight by the database score for each dimension, then sum across all dimensions. The highest total indicates the best fit.

Interview-Ready Talking Points

"How do you choose a database for a new system?"

"I start with the access patterns -- what queries will the application run most frequently? Then I evaluate along four axes: consistency requirements, scalability needs, query complexity, and operational budget. For example, if we need ACID transactions with complex joins, SQL is the default. If we need massive write throughput with known query patterns, a wide-column store like Cassandra is appropriate. I also consider polyglot persistence -- using multiple databases where each handles what it does best, connected through CDC or event streaming."

"When would you use multiple databases?"

"Almost every production system at scale uses polyglot persistence. A typical e-commerce platform might use PostgreSQL for orders and inventory (ACID), Redis for session management and caching (sub-ms reads), Elasticsearch for product search (full-text), and Cassandra for event logging (write throughput). The key is to have clear data ownership boundaries and use CDC or event-driven architecture to synchronize when needed."

"What is the biggest mistake teams make when choosing a database?"

"Choosing a database based on hype rather than access patterns. I have seen teams adopt MongoDB because 'it scales' when their workload was 90% complex joins that would have been trivial in PostgreSQL. Another common mistake is premature optimization -- reaching for distributed databases when a single PostgreSQL instance with read replicas would handle the load for years. Start simple, measure, and scale when you hit real bottlenecks."

"How do you handle the transition when you outgrow your database?"

"First, I distinguish between vertical scaling headroom (bigger machine, more RAM, better SSDs) and actual architectural limits. PostgreSQL on a modern server can handle more than most people think. When you truly need horizontal scaling, the strategy depends on the workload: read-heavy systems benefit from replicas, write-heavy systems may need sharding or migration to a distributed database. The key is designing your application with repository patterns and clean data access layers so the database can be swapped without rewriting business logic."

Common Pitfalls by Database Type

Database Type	Common Pitfall	Consequence	Prevention
SQL	No indexes on frequently queried columns	Full table scans, slow queries	Analyze query plans with EXPLAIN
Document	Deeply nested documents with unbounded arrays	Large document sizes, slow updates	Cap nesting depth, reference large sub-collections
Graph	Using for tabular/aggregation workloads	Poor performance, over-engineering	Reserve for genuine relationship traversals
Vector	Wrong distance metric or embedding model	Poor recall, irrelevant results	Benchmark with your actual data
Time-Series	No retention policy or downsampling	Storage costs explode	Define retention + continuous aggregates
Key-Value	Using as primary database instead of cache	Data modeling nightmare	Use as cache layer in front of primary DB
Wide-Column	Poor partition key design (hot partitions)	Single-node bottleneck	Analyze cardinality, use composite keys

Technology Radar: Maturity and Momentum

Database	Maturity	Community	Managed Options	Trend
PostgreSQL	Very High	Massive	RDS, Aurora, Supabase, Neon	Rising (becoming "default database")
MySQL	Very High	Large	RDS, Aurora, PlanetScale	Stable
MongoDB	High	Large	Atlas	Stable
Redis	High	Large	ElastiCache, Redis Cloud	Stable
Cassandra	High	Medium	Astra, Instaclustr	Stable
Neo4j	Medium	Medium	Aura	Growing
InfluxDB	Medium	Medium	InfluxDB Cloud	Stable
TimescaleDB	Medium	Growing	Timescale Cloud	Growing
ScyllaDB	Medium	Growing	ScyllaDB Cloud	Growing
Pinecone	Low-Medium	Growing	Fully managed	Growing fast
Milvus	Low-Medium	Growing	Zilliz	Growing fast
CockroachDB	Medium	Medium	Cockroach Cloud	Growing

Quick Decision Cheat Sheet

text

Need ACID + joins?                    --> PostgreSQL
Need flexible JSON documents?         --> MongoDB
Need relationship traversals?         --> Neo4j
Need similarity search on embeddings? --> Pinecone / Milvus
Need time-series with downsampling?   --> TimescaleDB / InfluxDB
Need sub-ms key-value lookups?        --> Redis
Need massive write throughput?        --> Cassandra / ScyllaDB
Need it all on one engine?            --> PostgreSQL + extensions
Not sure?                             --> Start with PostgreSQL

Previous← Interview Revision Cheatsheet NextDocument Db Guide →