docs/Distributed System With Big Data/decision-matrix
Edit on GitHub

Database Selection Decision Matrix

1-Paragraph Summary

Choosing the right database is one of the most consequential architectural decisions in system design. This guide provides a comprehensive comparison of seven database categories -- SQL (relational), Document, Graph, Vector, Time-Series, Key-Value, and Wide-Column -- across critical dimensions including consistency guarantees, scalability characteristics, query complexity support, latency profiles, data model flexibility, operational burden, and cost. Use the decision flowchart and comparison tables below to quickly narrow candidates based on your workload requirements, then dive into the individual guides for deeper analysis.


The Seven Database Categories

CategoryRepresentative DatabasesCore Data ModelPrimary Strength
SQL (Relational)PostgreSQL, MySQL, CockroachDB, AuroraTables with rows and columns, enforced schemaACID transactions, complex queries, data integrity
DocumentMongoDB, CouchDB, Firestore, DynamoDBJSON/BSON documents in collectionsSchema flexibility, developer productivity
GraphNeo4j, Neptune, TigerGraph, ArangoDBNodes and edges with propertiesRelationship traversals, pattern matching
VectorPinecone, Milvus, Weaviate, pgvectorHigh-dimensional embedding vectorsSimilarity search, AI/ML applications
Time-SeriesInfluxDB, TimescaleDB, QuestDB, PrometheusTimestamped data points in seriesTime-ordered ingestion, downsampling, retention
Key-ValueRedis, DynamoDB, etcd, MemcachedSimple key-to-value mappingsSub-millisecond reads, caching, sessions
Wide-ColumnCassandra, ScyllaDB, HBase, BigtableColumn families with flexible columns per rowMassive write throughput, linear scalability

Comprehensive Comparison Table

Consistency and Availability

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Default consistencyStrong (ACID)Eventual / TunableStrong (single-node)EventualStrong (single-node)Strong or EventualTunable
Transaction supportFull ACID, multi-rowSingle-doc atomic; limited multi-docVaries by engineNone typicallyLimitedSingle-key atomicLightweight (LWT)
CAP classificationCP (most)AP or CP (varies)CP (most)APCP (most)CP or AP (varies)AP (Cassandra) or CP (HBase)
Isolation levelsSerializable, SI, RC, RURead concern levelsVariesN/AVariesSingle-op atomicPer-query tunable

Scalability

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Horizontal scalingHard (sharding complex)Native (auto-sharding)LimitedNativeModerateNativeNative (linear)
Max practical scale~10 TB (single), ~PB (NewSQL)PB scale~100s GB typicalTB scalePB scale (with tiering)TB scale (in-memory)PB scale
Write scalabilityModerateHighModerateHigh (batch)Very highVery highVery high
Read scalabilityHigh (with replicas)HighHigh (local queries)HighHighVery highHigh
Geo-distributionComplex (CockroachDB, Spanner)Built-in (Atlas, Cosmos)LimitedManaged optionsLimitedRedis EnterpriseBuilt-in (Cassandra)

Query Capabilities

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Query languageSQL (standard)MongoDB Query, SQL-likeCypher, Gremlin, SPARQLVector APIs, filtersSQL-like (InfluxQL, Flux)GET/SET/SCANCQL
Join supportFull (inner, outer, cross)$lookup (limited)Traversals (native)NoneLimitedNoneNone
AggregationFull (GROUP BY, HAVING, window)Aggregation pipelineLimitedNoneBuilt-in (downsampling)NoneLimited
Ad-hoc queriesExcellentGoodGood (for graph patterns)Limited (similarity only)Good (time-based)Poor (key-only)Poor (partition-key required)
Secondary indexesUnlimitedSupportedProperty indexesMetadata filteringTag-basedLimitedLimited (use with caution)
Full-text searchBasic (or with extensions)Atlas Search / built-inLimitedHybrid (some engines)NoNoNo

Latency and Performance

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Read latency (p50)1-10 ms1-10 ms1-50 ms10-100 ms1-10 ms< 1 ms1-10 ms
Write latency (p50)1-10 ms1-10 ms5-50 ms10-100 ms< 1 ms< 1 ms1-5 ms
Bulk ingestionModerateHighSlowHigh (batch)Very highHighVery high
Hot-path optimizationB-tree indexesDocument indexesGraph traversal cacheHNSW/IVF indexesTime-partitioned storageIn-memory hashLSM-tree + Bloom filters

Operational Characteristics

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Schema managementStrict (migrations required)Flexible (schema-on-read)FlexibleDimension-fixedSemi-structuredSchemalessSemi-structured (CQL)
Learning curveLow (SQL is universal)Low-moderateModerate-highModerateLow-moderateVery lowModerate-high
Ops complexityLow-moderateLow (managed) to moderateModerate-highLow (managed)Low-moderateLowHigh
Backup/restoreMature toolingBuilt-in (managed)VariesManagedBuilt-inRDB/AOF snapshotsSnapshot + repair
MonitoringExcellent ecosystemGoodLimitedBasicGoodGoodGood (but complex)

Cost Considerations

DimensionSQLDocumentGraphVectorTime-SeriesKey-ValueWide-Column
Infrastructure costModerateModerateHigh (memory-intensive)High (GPU/memory)Low-moderateHigh (RAM-bound)Moderate (disk-based)
Managed service cost$$$$$$$$$$$$$ - $$$$
Developer productivityHigh (familiar SQL)Very high (JSON-native)ModerateModerateHighVery highModerate
Vendor lock-in riskLow (SQL standard)ModerateHighHighModerateLowLow-moderate

Decision Flowchart

Use this flowchart to narrow your database selection based on primary workload characteristics:


Secondary Decision: Scale and Consistency Requirements

After identifying the database category from the primary flowchart, use this matrix to select the specific technology:


Use Case Quick Reference

By Industry / Domain

Use CasePrimary DBSecondary DBWhy
E-commerce (orders, inventory)PostgreSQLRedis (cache)ACID for orders, fast cache for catalog
Social network (feeds, connections)Neo4jCassandra (feed storage)Graph for connections, wide-column for feeds
IoT platform (sensor data)Cassandra / ScyllaDBTimescaleDB (analytics)Massive writes with wide-column, analytics with TSDB
Content management systemMongoDBElasticsearch (search)Flexible documents, full-text search
Real-time analytics dashboardInfluxDB / TimescaleDBRedis (cache)Time-series ingestion, fast dashboard reads
AI/ML recommendation enginePinecone / MilvusPostgreSQL (metadata)Vector similarity search, relational metadata
Chat / messaging platformScyllaDBRedis (presence)Write-heavy message storage, in-memory presence
Financial trading platformCockroachDBRedis (order book)Strong consistency, low-latency lookups
User session managementRedisDynamoDB (persistence)Sub-ms reads, durable backup
Fraud detectionNeo4jPostgreSQL (case mgmt)Graph pattern matching, relational case data
Log aggregationCassandraElasticsearch (search)High write throughput, full-text search
Product catalogMongoDBRedis (cache)Flexible schema per category, fast reads

By Access Pattern

Access PatternBest FitRationale
Point lookups by keyKey-Value (Redis, DynamoDB)O(1) hash-based retrieval
Range scans over timeTime-Series or Wide-ColumnContiguous disk reads, time-partitioned
Complex joins across entitiesSQL (PostgreSQL)Native JOIN optimization, query planner
Traversal of relationshipsGraph (Neo4j)Constant-time per-hop traversal
Nearest-neighbor searchVector (Pinecone, Milvus)ANN indexes (HNSW, IVF)
Flexible document retrievalDocument (MongoDB)Rich query on nested structures
High-throughput writesWide-Column (Cassandra)LSM-tree, append-only, linear scale
Aggregation over time windowsTime-Series (TimescaleDB)Continuous aggregates, downsampling
Full-text searchElasticsearch (not in matrix)Inverted indexes, BM25 ranking
Multi-model (several patterns)PostgreSQL + extensionspgvector, TimescaleDB, PostGIS, JSON

Multi-Database Architecture Patterns

Most production systems use polyglot persistence -- multiple databases, each handling what it does best.


Scoring Matrix: Rate Your Workload

Use this scoring approach to systematically evaluate database options. Rate each dimension 1-5 based on your requirements, then match to the database category with the highest alignment.

Requirement Scoring Template

DimensionWeight (1-5)Your ScoreSQLDocGraphVectorTSDBKVWide-Col
Strong consistency??5342433
Horizontal scalability??2424355
Query complexity??5341312
Read latency??4433454
Write throughput??3423555
Schema flexibility??1542353
Operational simplicity??4423442
Low cost at scale??3322434
Relationship queries??3251111
Similarity search??2115111

How to use: Multiply your weight by the database score for each dimension, then sum across all dimensions. The highest total indicates the best fit.


Interview-Ready Talking Points

"How do you choose a database for a new system?"

"I start with the access patterns -- what queries will the application run most frequently? Then I evaluate along four axes: consistency requirements, scalability needs, query complexity, and operational budget. For example, if we need ACID transactions with complex joins, SQL is the default. If we need massive write throughput with known query patterns, a wide-column store like Cassandra is appropriate. I also consider polyglot persistence -- using multiple databases where each handles what it does best, connected through CDC or event streaming."

"When would you use multiple databases?"

"Almost every production system at scale uses polyglot persistence. A typical e-commerce platform might use PostgreSQL for orders and inventory (ACID), Redis for session management and caching (sub-ms reads), Elasticsearch for product search (full-text), and Cassandra for event logging (write throughput). The key is to have clear data ownership boundaries and use CDC or event-driven architecture to synchronize when needed."

"What is the biggest mistake teams make when choosing a database?"

"Choosing a database based on hype rather than access patterns. I have seen teams adopt MongoDB because 'it scales' when their workload was 90% complex joins that would have been trivial in PostgreSQL. Another common mistake is premature optimization -- reaching for distributed databases when a single PostgreSQL instance with read replicas would handle the load for years. Start simple, measure, and scale when you hit real bottlenecks."

"How do you handle the transition when you outgrow your database?"

"First, I distinguish between vertical scaling headroom (bigger machine, more RAM, better SSDs) and actual architectural limits. PostgreSQL on a modern server can handle more than most people think. When you truly need horizontal scaling, the strategy depends on the workload: read-heavy systems benefit from replicas, write-heavy systems may need sharding or migration to a distributed database. The key is designing your application with repository patterns and clean data access layers so the database can be swapped without rewriting business logic."


Common Pitfalls by Database Type

Database TypeCommon PitfallConsequencePrevention
SQLNo indexes on frequently queried columnsFull table scans, slow queriesAnalyze query plans with EXPLAIN
DocumentDeeply nested documents with unbounded arraysLarge document sizes, slow updatesCap nesting depth, reference large sub-collections
GraphUsing for tabular/aggregation workloadsPoor performance, over-engineeringReserve for genuine relationship traversals
VectorWrong distance metric or embedding modelPoor recall, irrelevant resultsBenchmark with your actual data
Time-SeriesNo retention policy or downsamplingStorage costs explodeDefine retention + continuous aggregates
Key-ValueUsing as primary database instead of cacheData modeling nightmareUse as cache layer in front of primary DB
Wide-ColumnPoor partition key design (hot partitions)Single-node bottleneckAnalyze cardinality, use composite keys

Technology Radar: Maturity and Momentum

DatabaseMaturityCommunityManaged OptionsTrend
PostgreSQLVery HighMassiveRDS, Aurora, Supabase, NeonRising (becoming "default database")
MySQLVery HighLargeRDS, Aurora, PlanetScaleStable
MongoDBHighLargeAtlasStable
RedisHighLargeElastiCache, Redis CloudStable
CassandraHighMediumAstra, InstaclustrStable
Neo4jMediumMediumAuraGrowing
InfluxDBMediumMediumInfluxDB CloudStable
TimescaleDBMediumGrowingTimescale CloudGrowing
ScyllaDBMediumGrowingScyllaDB CloudGrowing
PineconeLow-MediumGrowingFully managedGrowing fast
MilvusLow-MediumGrowingZillizGrowing fast
CockroachDBMediumMediumCockroach CloudGrowing

Quick Decision Cheat Sheet

text
Need ACID + joins?                    --> PostgreSQL
Need flexible JSON documents?         --> MongoDB
Need relationship traversals?         --> Neo4j
Need similarity search on embeddings? --> Pinecone / Milvus
Need time-series with downsampling?   --> TimescaleDB / InfluxDB
Need sub-ms key-value lookups?        --> Redis
Need massive write throughput?        --> Cassandra / ScyllaDB
Need it all on one engine?            --> PostgreSQL + extensions
Not sure?                             --> Start with PostgreSQL