
Range Partitioning

Range partitioning divides data into partitions based on contiguous ranges of the partition key, so that each partition owns all records whose keys fall within a defined boundary. This enables efficient range scans, time-based queries, and ordered data access, but it requires careful boundary selection to avoid skew and hot partitions.


Table of Contents

  1. How Range Partitioning Works
  2. Boundary Selection
  3. Range Assignment Diagrams
  4. Advantages for Time-Series and Range Queries
  5. Hot Spot Risks with Monotonically Increasing Keys
  6. Real Systems Using Range Partitioning
  7. Pros and Cons
  8. Common Mistakes
  9. Interview Framing
  10. Top 5 Use Cases
  11. Top 5 Warning Signs
  12. Tradeoff Table
  13. Revision Summary
  14. Similar Concepts to Review Next

How Range Partitioning Works

Each partition is assigned a contiguous range of the key space defined by a lower bound (inclusive) and an upper bound (exclusive). Given a key, the system finds the partition whose range contains that key.

text
Partition 0: [A, F)
Partition 1: [F, K)
Partition 2: [K, P)
Partition 3: [P, U)
Partition 4: [U, Z]

A record with key "Garcia" falls in Partition 1 because "F" <= "Garcia" < "K".

The Lookup Process

Unlike hash partitioning, where the owning partition is computed directly from the key, range partitioning requires knowing the boundaries. The system maintains a sorted list of split points and performs a binary search to find the correct partition.
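
In SQL terms, the routing metadata can be modeled as a small table of split points. The sketch below is illustrative only; the partition_map table, its columns, and the lookup query are assumptions rather than any particular system's API.

sql
-- Hypothetical routing metadata: one row per partition, keyed by its
-- inclusive lower bound (using the alphabetical ranges from above).
CREATE TABLE partition_map (
    lower_bound  TEXT PRIMARY KEY,
    partition_id INT  NOT NULL
);

INSERT INTO partition_map VALUES
    ('A', 0), ('F', 1), ('K', 2), ('P', 3), ('U', 4);

-- Route a key: the highest lower bound that is <= the key wins.
-- With the index on lower_bound this is effectively a binary search.
SELECT partition_id
FROM partition_map
WHERE lower_bound <= 'Garcia'
ORDER BY lower_bound DESC
LIMIT 1;   -- returns 1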


Boundary Selection

Choosing partition boundaries is one of the most critical decisions. Poor boundaries lead to skewed partitions.

Static Boundaries

Defined at table creation time and fixed. Simple but inflexible.

sql
-- PostgreSQL range partitioning
CREATE TABLE events (
    event_id    BIGSERIAL,
    event_date  DATE,
    payload     JSONB
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2025_q1 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE events_2025_q2 PARTITION OF events
    FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');
CREATE TABLE events_2025_q3 PARTITION OF events
    FOR VALUES FROM ('2025-07-01') TO ('2025-10-01');
CREATE TABLE events_2025_q4 PARTITION OF events
    FOR VALUES FROM ('2025-10-01') TO ('2026-01-01');

Dynamic Boundaries (Split and Merge)

The system automatically adjusts boundaries based on partition size or load.
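
PostgreSQL has no built-in auto-split, but as a rough sketch of the size-based approach, a maintenance job could detect oversized partitions of the events table with a catalog query like this (the 10 GB threshold is an arbitrary assumption):

sql
-- List child partitions of "events" larger than ~10 GB on disk;
-- a maintenance job could then create narrower partitions and move rows.
SELECT child.relname AS partition_name,
       pg_size_pretty(pg_total_relation_size(child.oid)) AS size
FROM pg_inherits
JOIN pg_class parent ON parent.oid = pg_inherits.inhparent
JOIN pg_class child  ON child.oid  = pg_inherits.inhrelid
WHERE parent.relname = 'events'
  AND pg_total_relation_size(child.oid) > pg_size_bytes('10 GB');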

Boundary Selection Strategies

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Uniform intervals | Equal-width ranges (e.g., monthly) | Simple, predictable | Skew if data is not uniformly distributed |
| Quantile-based | Analyze the data distribution and choose boundaries so each partition has equal record counts | Even partition sizes | Requires sampling; boundaries may shift |
| Size-based auto-split | Split when a partition exceeds a size threshold | Adapts to data growth | Split operations cause brief pauses |
| Load-based auto-split | Split when a partition's QPS exceeds a threshold | Adapts to access patterns | May split cold data unnecessarily |
| Manual | DBA sets boundaries based on domain knowledge | Full control | Requires ongoing tuning |
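
As a sketch of the quantile-based strategy above, PostgreSQL's ordered-set aggregates can derive split points that divide the existing events data into roughly equal-sized partitions (four partitions assumed here):

sql
-- Compute three split points so that four partitions each hold
-- roughly 25% of the current rows.
SELECT percentile_disc(ARRAY[0.25, 0.5, 0.75])
       WITHIN GROUP (ORDER BY event_date) AS split_points
FROM events;

The caveat from the table applies: these boundaries reflect today's distribution and may need to be recomputed as the data grows.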

Range Assignment Diagrams

Number Line Visualization
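
Using the alphabetical boundaries from the lookup example above:

text
A          F          K          P          U          Z
|----P0----|----P1----|----P2----|----P3----|----P4----|
   [A, F)     [F, K)     [K, P)     [P, U)     [U, Z]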

Time-Based Partitioning Layout
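
For the quarterly events partitions defined earlier:

text
events_2025_q1  [2025-01-01, 2025-04-01)   <- oldest, next to be dropped
events_2025_q2  [2025-04-01, 2025-07-01)
events_2025_q3  [2025-07-01, 2025-10-01)
events_2025_q4  [2025-10-01, 2026-01-01)   <- current quarter, receives all new writes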

Multi-Level Range Partitioning

Some systems support composite range partitioning (range on one key, then sub-range on another).
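
PostgreSQL, for example, lets a range partition itself be declared as range-partitioned. The sketch below is illustrative (table and column names are assumptions): it ranges on a region code at the top level and sub-ranges on date underneath.

sql
-- Top level: range on region_code
CREATE TABLE sales (
    region_code TEXT,
    sale_date   DATE,
    amount      NUMERIC
) PARTITION BY RANGE (region_code);

-- Second level: this region range is itself range-partitioned by date
CREATE TABLE sales_a_m PARTITION OF sales
    FOR VALUES FROM ('A') TO ('N')
    PARTITION BY RANGE (sale_date);

CREATE TABLE sales_a_m_2025_q1 PARTITION OF sales_a_m
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');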


Advantages for Time-Series and Range Queries

Range partitioning excels when queries naturally align with the partition boundaries.

Partition Pruning

The query optimizer eliminates entire partitions from the scan based on WHERE clause predicates.

sql
-- Only scans the Q1 2025 partition, skipping all others
SELECT * FROM events
WHERE event_date BETWEEN '2025-02-01' AND '2025-02-28';
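
Pruning can be confirmed with EXPLAIN: partitions eliminated at plan time simply do not appear in the plan.

sql
-- Only events_2025_q1 should show up in the plan output
EXPLAIN (COSTS OFF)
SELECT * FROM events
WHERE event_date BETWEEN '2025-02-01' AND '2025-02-28';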

TTL and Data Lifecycle

Old partitions can be dropped instantly rather than running expensive DELETE queries.

sql
-- Drop an entire quarter of data in milliseconds
DROP TABLE events_2023_q1;
-- vs. DELETE FROM events WHERE event_date < '2023-04-01';  -- hours of I/O

Efficient Ordered Scans

Within a range partition, data can be stored in sorted order, enabling efficient sequential reads.

| Query Pattern | Hash Partitioning | Range Partitioning |
| --- | --- | --- |
| WHERE date = '2025-02-15' | Scatter to all partitions | Prune to one partition |
| WHERE date BETWEEN X AND Y | Scatter to all partitions | Prune to 1-3 partitions |
| ORDER BY date LIMIT 100 | Scatter-gather + merge sort | Read from one partition |
| WHERE user_id = 42 | Route to one partition | Scatter to all partitions |

Hot Spot Risks with Monotonically Increasing Keys

The most dangerous pitfall of range partitioning is a monotonically increasing partition key (timestamps, auto-increment IDs): all new writes go to the last partition, which becomes a permanent hot spot.

Mitigation Strategies

| Strategy | How It Works | Tradeoff |
| --- | --- | --- |
| Prefix with hash | Key = hash(user_id) + timestamp | Loses pure time-range pruning |
| Bucket by time window | Pre-create future partitions (daily/hourly) | Hot partition rotates, but the "current" partition is always hot |
| Composite key | Partition by (region, timestamp) | Spreads writes across regions; requires knowing the region at query time |
| Reverse timestamp | Key = MAX_TS - timestamp | Spreads across old partitions; confusing semantics |
| Random prefix salting | Key = random(0..9) + timestamp | Requires 10x fan-out for reads |

The Correct Pattern for Time-Series
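
One common arrangement, sketched here as PostgreSQL hash-then-range sub-partitioning with illustrative table and column names (an assumption, not a prescribed schema), spreads current writes across several buckets while keeping time-range pruning within each bucket:

sql
-- Top level: hash on the series identifier spreads writes across buckets
CREATE TABLE readings (
    sensor_id  BIGINT,
    reading_ts TIMESTAMPTZ,
    value      DOUBLE PRECISION
) PARTITION BY HASH (sensor_id);

-- Each hash bucket is itself range-partitioned by time
CREATE TABLE readings_h0 PARTITION OF readings
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    PARTITION BY RANGE (reading_ts);

CREATE TABLE readings_h0_2025_07 PARTITION OF readings_h0
    FOR VALUES FROM ('2025-07-01') TO ('2025-08-01');
-- ...repeat for remainders 1-3 and for each month

Current writes now land on four "current" partitions instead of one, and a time-bounded query still prunes down to the newest range sub-partition inside each bucket.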


Real Systems Using Range Partitioning

HBase Regions

  • HBase stores data sorted by row key.
  • A region is a contiguous range of row keys.
  • Regions auto-split when they exceed a configurable size (default 10 GB).
  • Region-to-RegionServer assignment is managed by the HBase Master.
  • Best practice: Pre-split tables based on expected key distribution to avoid initial hot-spotting.

PostgreSQL Range Partitioning

  • Native declarative partitioning since PostgreSQL 10.
  • Supports range, list, and hash partitioning.
  • Partition pruning is handled by the query optimizer.
  • No automatic splitting; DBA must create new partitions.

CockroachDB Ranges

  • Data is divided into 512 MB ranges (default).
  • Ranges split automatically when they exceed the threshold.
  • Ranges are replicated via Raft consensus.
  • The system continuously rebalances ranges across nodes.

Google Bigtable / Cloud Spanner

  • Bigtable splits tablets by row key range.
  • Spanner uses range-based splits with automatic resharding.
  • Spanner additionally supports hierarchical interleaving for parent-child data locality.

Pros and Cons

Pros

  • Efficient range queries: Scans over contiguous key ranges touch only relevant partitions.
  • Partition pruning: The optimizer skips irrelevant partitions, reducing I/O dramatically.
  • Data locality: Related records (same time period, same prefix) are stored together.
  • Simple lifecycle management: Drop old partitions for TTL; attach new partitions for incoming data.
  • Ordered access: Within a partition, data maintains sort order.

Cons

  • Hot partition risk: Monotonically increasing keys concentrate all writes on one partition.
  • Skew potential: Non-uniform key distributions create unbalanced partitions.
  • Boundary management: Static boundaries require manual maintenance; dynamic splits add complexity.
  • Point lookups may scatter: If the query key does not align with the partition key, all partitions must be checked.
  • Split overhead: Auto-splitting a hot partition causes a brief pause in writes.

Common Mistakes

  1. Partitioning by auto-increment primary key: Creates a permanent hot spot on the last partition.
  2. Too few partitions for time-series data: Monthly partitions with high write throughput overload a single partition for 30 days.
  3. Not pre-creating future partitions: Inserts into a non-existent partition fail. Automate partition creation with cron jobs or pg_partman (see the sketch after this list).
  4. Choosing boundaries based on current data: Data distributions change over time. What is balanced today may be skewed in six months.
  5. Ignoring partition pruning in query design: If queries do not include the partition key in the WHERE clause, the optimizer cannot prune.
  6. Mixing hot and cold data in the same partition: Place recent (hot) data on fast storage and old (cold) data on cheap storage by using separate tablespaces.
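
A sketch of the automation mentioned in mistake 3, assuming the quarterly events layout from earlier; run it periodically (cron, pg_cron, or similar) so the next quarter's partition always exists before data arrives:

sql
-- Create the next quarterly partition of "events" if it does not exist yet.
DO $$
DECLARE
    next_q date := (date_trunc('quarter', now()) + interval '3 months')::date;
    part   text := 'events_' || to_char(next_q, 'YYYY_"q"Q');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF events
             FOR VALUES FROM (%L) TO (%L)',
        part, next_q, next_q + interval '3 months');
END $$;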

Interview Framing

When an interviewer asks: "How would you store and query time-series sensor data at scale?"

  1. Identify the access pattern: "Queries are almost always time-bounded: 'give me sensor readings from the last hour/day/week.'"
  2. Choose range partitioning: "I would range-partition by timestamp, creating daily partitions."
  3. Address the hot partition: "Today's partition receives all current writes. To spread the load, I would use a composite key of (sensor_id, timestamp) so writes for the same time window are spread across sensor buckets."
  4. Discuss lifecycle: "Partitions older than 90 days are moved to cold storage (S3), and partitions older than 1 year are dropped."
  5. Mention partition pruning: "Queries like 'last 24 hours of sensor X' only scan 1-2 partitions instead of the full dataset."
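
Against the hash-then-range readings sketch shown earlier (an assumed schema, not part of the original answer), the "last 24 hours of sensor X" query looks like this; the sensor_id predicate routes to one hash bucket and the time predicate prunes away old range sub-partitions:

sql
SELECT reading_ts, value
FROM readings
WHERE sensor_id  = 42
  AND reading_ts >= now() - interval '24 hours'
ORDER BY reading_ts;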

Top 5 Use Cases

  1. Time-series databases: IoT, monitoring, logging -- partition by day/hour for efficient time-range queries and TTL.
  2. Financial transaction history: Partition by transaction date for regulatory reporting and archival.
  3. Data warehouses: Fact tables partitioned by date for partition pruning in analytical queries (BigQuery, Redshift).
  4. Geographic data: Partition by region code for data locality and compliance.
  5. Versioned data stores: Partition by version range for efficient rollback and point-in-time queries.

Top 5 Warning Signs

  1. One partition is receiving all writes: You are using a monotonically increasing partition key.
  2. Full table scans despite having partitions: Queries are not including the partition key in WHERE clauses.
  3. Partition sizes are wildly uneven: Your boundaries do not match the actual data distribution.
  4. Insert failures on new data: You have not created partitions for future date ranges.
  5. Purging old data is slow: You are deleting rows with DELETE instead of dropping entire partitions.

Tradeoff Table

| Tradeoff | Favoring Narrow Ranges | Favoring Wide Ranges |
| --- | --- | --- |
| Partition count | Many small partitions | Few large partitions |
| Pruning granularity | More precise, less data scanned | Coarser, may scan extra data |
| Metadata overhead | More partition metadata | Less metadata |
| Lifecycle management | More partitions to manage (cron, monitoring) | Fewer but larger drops |
| Write concentration | Writes spread across more partitions | Writes concentrated in fewer partitions |
| Auto-split frequency | Less frequent (partitions are small) | More frequent (partitions grow faster) |

Revision Summary

  • Range partitioning assigns contiguous key ranges to partitions, enabling efficient range scans and partition pruning.
  • Boundaries can be static (DBA-defined) or dynamic (auto-split on size/load thresholds).
  • The biggest risk is hot partitions from monotonically increasing keys; mitigate with composite keys or time-bucketed pre-creation.
  • Time-series is the canonical use case: partition by day/hour, prune on time predicates, drop old partitions for TTL.
  • Real systems like HBase, CockroachDB, and Spanner use range partitioning with automatic splitting.

Similar Concepts to Review Next