
Problem #36: Design Uber/Lyft (Ride-Sharing Platform) — Full Deep Dive


1. Problem Statement

Design a ride-sharing platform that matches riders with nearby drivers in real-time, handles dynamic pricing, provides ETA estimation, tracks rides in progress, and processes payments — all at city-scale with millions of concurrent users.


2-3. Requirements

Core

| ID | Requirement |
|------|-------------|
| FR-1 | Rider requests a ride (pickup + destination) |
| FR-2 | Match rider with the best nearby available driver |
| FR-3 | Real-time driver location tracking (every 3-5 seconds) |
| FR-4 | ETA estimation (pickup ETA + trip ETA) |
| FR-5 | Dynamic pricing (surge during high demand) |
| FR-6 | Ride lifecycle (request → match → pickup → in-progress → complete) |
| FR-7 | Payment processing at trip end |
| FR-8 | Driver/rider ratings after trip |

Non-Functional

| Requirement | Target |
|-------------|--------|
| Matching latency | < 10 seconds |
| Location update frequency | Every 3-5 seconds per active driver |
| Location freshness | < 5 seconds stale |
| Concurrent rides | 1M+ globally |
| Active drivers | 5M+ sending location updates |
| Availability | 99.99% |

4. Capacity Estimation

text
Active drivers: 5M (sending GPS every 4 seconds)
Location updates/sec: 5M / 4 = 1.25M updates/sec
Location update size: ~100 bytes (driver_id, lat, lng, timestamp, heading, speed)
Location bandwidth: 1.25M * 100 = 125 MB/sec

Ride requests: 10M rides/day = ~115 rides/sec
Matching: each request searches nearby drivers → geospatial query
ETA calculations: per ride request + ongoing navigation updates

Storage:
  Trip history: 10M * 2 KB = 20 GB/day
  Location traces: 5M drivers * 86400/4 updates * 100 bytes = 10.8 TB/day (raw)
  → Downsample to 30-sec intervals for history: 5M * 2880 * 100 bytes ≈ 1.4 TB/day
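The back-of-envelope numbers above can be double-checked with a few lines of Python:

```python
# Capacity estimation for the ride-sharing platform (same inputs as above).
DRIVERS = 5_000_000          # active drivers
UPDATE_INTERVAL_S = 4        # GPS ping every 4 seconds
UPDATE_BYTES = 100           # driver_id, lat, lng, timestamp, heading, speed
RIDES_PER_DAY = 10_000_000
TRIP_RECORD_BYTES = 2_000    # ~2 KB per trip record

updates_per_sec = DRIVERS / UPDATE_INTERVAL_S                 # 1.25M updates/sec
bandwidth_mb_s = updates_per_sec * UPDATE_BYTES / 1e6         # 125 MB/sec
rides_per_sec = RIDES_PER_DAY / 86_400                        # ~115 rides/sec
trip_gb_day = RIDES_PER_DAY * TRIP_RECORD_BYTES / 1e9         # 20 GB/day
raw_trace_tb_day = (DRIVERS * (86_400 // UPDATE_INTERVAL_S)
                    * UPDATE_BYTES / 1e12)                    # 10.8 TB/day raw
```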

5-8. Architecture

Location Tracking: The GeoHash Approach

text
Challenge: 5M drivers sending GPS every 4s = 1.25M updates/sec.
For each ride request: "Find available drivers within 3 km of rider."
Scanning 5M drivers per request is too slow.

Solution: GeoHash-based spatial indexing

GeoHash: Encodes lat/lng into a string where shared prefix = nearby.
  (37.7749, -122.4194) → "9q8yyk"  (San Francisco)
  (37.7751, -122.4190) → "9q8yym"  (1 block away — shares "9q8yy" prefix!)
  (40.7128, -74.0060)  → "dr5reg"  (New York — completely different prefix)

GeoHash precision:
  4 chars → ~40 km × 20 km grid cell
  5 chars → ~5 km × 5 km grid cell   ← good for initial search
  6 chars → ~1.2 km × 600m grid cell ← good for nearby matching
  7 chars → ~150m × 150m grid cell

Redis implementation:
  GEOADD drivers:{city} lng lat driver_id    // Add/update driver position
  GEORADIUS drivers:{city} lng lat 3 km      // Find drivers within 3 km (GEOSEARCH in Redis 6.2+)
  → O(log(N) + M) where M = results. Fast for millions of entries.

Alternative: QuadTree or S2 Geometry (Google's approach).
  S2 uses hierarchical cell decomposition of the sphere.
  Better for planet-scale (handles poles, anti-meridian correctly).
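The prefix property can be illustrated with a minimal GeoHash encoder. This is a sketch of the standard base-32 interleaved-bisection algorithm; in production you would rely on a library or Redis's built-in geo commands rather than hand-rolling it:

```python
# Standard GeoHash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lng: float, precision: int = 6) -> str:
    """Encode lat/lng into a GeoHash string; nearby points share prefixes."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits = []
    even = True  # bits alternate: longitude first, then latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits.append(1); lng_lo = mid
            else:
                bits.append(0); lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # Pack each group of 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)
```

Running it on the coordinates from the table reproduces the shared-prefix behavior: the two San Francisco points agree on the first five characters, while New York diverges immediately.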

Matching Algorithm

Matching Scoring Function

text
Score = w1 * (1 / ETA_minutes) +       // Closer is better
        w2 * driver_rating +             // Higher rated preferred
        w3 * acceptance_rate +            // Reliable drivers preferred
        w4 * (1 / detour_factor)          // Minimize driver's detour

Weights tuned per market.

Fallback cascade:
  Round 1: Search 3 km radius, top 5 candidates
  Round 2 (no accept in 15s): Expand to 5 km, next 5 candidates
  Round 3 (still no accept): Expand to 8 km, offer to more drivers
  Round 4: "No drivers available. Try again in a few minutes."
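The scoring function above can be sketched directly in Python. The weights here are illustrative placeholders (real systems tune them per market, as noted):

```python
def match_score(eta_min: float, rating: float, acceptance_rate: float,
                detour_factor: float, w=(1.0, 0.5, 0.5, 0.3)) -> float:
    """Score a candidate driver; higher is better. Weights w are hypothetical."""
    return (w[0] * (1.0 / max(eta_min, 0.5))          # closer is better
            + w[1] * (rating / 5.0)                   # higher rated preferred
            + w[2] * acceptance_rate                  # reliable drivers preferred
            + w[3] * (1.0 / max(detour_factor, 1.0))) # minimize driver's detour

def rank_candidates(candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Rank drivers found by the geo search; offer the ride to the top N."""
    return sorted(candidates, key=lambda c: match_score(**c), reverse=True)[:top_n]
```

With everything else equal, a driver 2 minutes away outranks one 9 minutes away, which is what drives Round 1 of the fallback cascade.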

9. Deep Dive: Dynamic Pricing (Surge)

text
Why surge pricing?
  Supply: 1000 available drivers in downtown
  Demand: 3000 ride requests in downtown
  Without surge: 2000 riders get "no drivers available" → terrible UX
  With surge: price goes up → some riders wait → some drivers come to area → equilibrium

Surge calculation:
  For each geo-cell (GeoHash level 5, ~5km × 5km):
    demand = ride_requests_last_5_min / historical_average
    supply = available_drivers / historical_average
    surge_multiplier = demand / supply

    if surge < 1.0 → no surge (supply exceeds demand)
    if surge 1.0-1.5 → mild surge
    if surge 1.5-3.0 → moderate surge
    if surge > 3.0 → cap at 3.0x (or market-specific cap)

  Smoothing:
    Don't spike instantly (one burst of requests → temporary spike)
    Use exponentially weighted moving average (EWMA) over 5-minute windows
    Ramp up gradually, ramp down more slowly (hysteresis)

  Geo-smoothing:
    Adjacent cells should have similar surge (no sharp boundaries)
    Average with neighboring cells for smooth pricing gradient
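The per-cell calculation and the hysteresis rule can be sketched as follows. The clamp floor, cap, and smoothing factors are illustrative assumptions, not production values:

```python
def surge_multiplier(requests_5min: float, avail_drivers: float,
                     hist_requests: float, hist_drivers: float,
                     cap: float = 3.0) -> float:
    """Raw surge for one geo-cell: normalized demand / normalized supply, clamped."""
    demand = requests_5min / max(hist_requests, 1)    # vs historical average
    supply = avail_drivers / max(hist_drivers, 1)
    raw = demand / max(supply, 0.01)                  # avoid divide-by-zero
    return min(max(raw, 1.0), cap)                    # never below 1.0x, market cap

def ewma_smooth(prev: float, raw: float,
                alpha_up: float = 0.4, alpha_down: float = 0.1) -> float:
    """Hysteresis: ramp surge up quickly, back down more slowly."""
    alpha = alpha_up if raw > prev else alpha_down
    return prev + alpha * (raw - prev)
```

For the downtown example (3000 requests vs 1000 drivers, both with a historical average of 1000), the raw ratio is 3.0, which lands exactly at the cap; the EWMA then prevents that spike from hitting riders all at once.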

Ride Lifecycle State Machine

text
                    ┌──────────────┐
                    │  REQUESTED   │
                    └──────┬───────┘
                           │ Driver matched
                    ┌──────▼───────┐
              ┌─────│  MATCHED     │─────┐
              │     └──────┬───────┘     │
              │            │ Driver      │ Driver cancels
              │            │ arrives     │ or timeout
              │     ┌──────▼───────┐     │
              │     │  ARRIVED     │     │
              │     └──────┬───────┘     │
              │            │ Rider       │
              │            │ picked up   │
              │     ┌──────▼───────┐     │
              │     │ IN_PROGRESS  │     ├──► CANCELLED
              │     └──────┬───────┘     │
              │            │ Arrived     │
              │            │ at dest     │
              │     ┌──────▼───────┐     │
              │     │  COMPLETED   │     │
              │     └──────┬───────┘     │
              │            │             │
              │     ┌──────▼───────┐     │
              └────►│   PAYMENT    │◄────┘
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │   RATED      │
                    └──────────────┘

Each state transition:
  → Persisted to DB (strong consistency)
  → Published to Kafka (for analytics, notifications, tracking)
  → Push notification to rider and/or driver
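One reasonable encoding of the state machine above is a transition table plus a guard function (a sketch: the exact set of legal cancellation edges varies by product rules, and the persist/publish steps are stubbed as comments):

```python
VALID_TRANSITIONS = {
    "REQUESTED":   {"MATCHED"},
    "MATCHED":     {"ARRIVED", "CANCELLED"},     # driver cancels or timeout
    "ARRIVED":     {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"COMPLETED"},
    "COMPLETED":   {"PAYMENT"},
    "CANCELLED":   {"PAYMENT"},                  # cancellation fees settle here too
    "PAYMENT":     {"RATED"},
}

def transition(ride: dict, new_state: str) -> dict:
    """Validate and apply a state change; reject anything not in the table."""
    if new_state not in VALID_TRANSITIONS.get(ride["state"], set()):
        raise ValueError(f"illegal transition {ride['state']} -> {new_state}")
    ride["state"] = new_state
    # In production, per the notes above: persist to DB (strong consistency),
    # publish the event to Kafka, then push-notify rider and/or driver.
    return ride
```

Centralizing legality in one table makes skipped states (e.g. jumping straight from IN_PROGRESS to RATED) impossible to write to the database.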

10-14. Trade-offs, Failures, Interview

Key trade-offs:

| Decision | Trade-off |
|----------|-----------|
| GeoHash vs QuadTree vs S2 | GeoHash: simple, Redis-native. QuadTree: adapts to driver density. S2: best for planet-scale |
| Matching: nearest vs scored | Nearest: fastest pickup. Scored: considers rating, acceptance, fairness. Scored is better UX |
| Surge: reactive vs predictive | Reactive: responds to current demand. Predictive (ML): anticipates events/weather |
| Location in Redis vs dedicated DB | Redis: fast, in-memory. Dedicated (PostGIS): richer queries, more features. Redis for real-time, PostGIS for analytics |

Failure scenarios:

  • Driver app crashes mid-ride → rider still tracked via last known location. Ride auto-completes after timeout. Safety alert if no update for 5 min.
  • Matching service down → ride requests queue in Kafka → resume when service recovers. Rider sees "Finding driver..." longer.
  • Payment fails → trip still completes. Retry payment async. If persistent failure → flag account.

Interview tips — the 5 pillars:

  1. Location tracking at scale (1.25M updates/sec → Redis GeoHash)
  2. Matching algorithm (geo search → score → offer → fallback cascade)
  3. Dynamic pricing (supply/demand ratio per geo-cell, EWMA smoothing)
  4. Ride state machine (each transition persisted + event published)
  5. ETA estimation (graph-based routing + ML for traffic prediction)

Practice Mode

5 Questions

  1. "How do you track millions of driver locations in real-time?" → Redis GEOADD/GEORADIUS with GeoHash indexing. 1.25M updates/sec. Each city is a separate Redis key. GEORADIUS finds drivers within X km in O(log N + M).
  2. "How does the matching algorithm work?" → Search nearby drivers (3km) → rank by score (ETA, rating, acceptance rate) → offer to top driver (15s timeout) → fallback cascade expanding radius if no accept.
  3. "How does surge pricing work?" → Per geo-cell: demand/supply ratio over 5-min EWMA. Surge multiplier capped (e.g., 3x). Geo-smoothed across adjacent cells. Ramp up gradually, ramp down slowly.
  4. "What happens if a driver's phone dies mid-ride?" → Last known location cached. Ride stays in IN_PROGRESS. Safety alert if no GPS update for 5 min. Rider can report. Ride auto-completes at expected time + buffer.
  5. "How do you estimate ETA?" → Road network graph (OSM data). Dijkstra/A* for shortest path. Historical traffic data per road segment per time of day. ML model adjusts for live traffic, weather, events.
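The graph-based routing step from question 5 reduces to shortest-path search over a weighted road graph. A minimal Dijkstra sketch over a toy network (edge weights in seconds; real systems add A* heuristics, contraction hierarchies, live traffic, and ML adjustments):

```python
import heapq

def eta_seconds(graph: dict, src: str, dst: str) -> float:
    """graph: {node: [(neighbor, travel_time_sec), ...]}. Dijkstra shortest path."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d                       # first pop of dst is optimal
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry, skip
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")                    # destination unreachable
```

On a tiny graph where A→B directly takes 60s but A→C→B takes 30s + 20s, the function correctly picks the 50-second route; swapping static weights for per-segment, time-of-day travel times is where the historical traffic data plugs in.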