Failure Scenarios
A catalog of the failures the platform must survive, covering detection, blast radius, recovery, user impact, and whether data loss is possible. This is the operational contract; per-scenario runbooks live in deploy/runbooks/ (referenced here, not in this tree).
1. Failure Matrix
| # | Failure | Detection | Blast Radius | Recovery | User Impact | Data Loss Possible? |
|---|---|---|---|---|---|---|
| 1 | Single Kafka broker down | under_replicated_partitions > 0; Prom alert KafkaBrokerDown | Zero data-plane impact with RF=3 and min.insync.replicas=2; leaders rebalance to surviving brokers. | K8s restarts the pod; the PVC reattaches; the broker catches up from the ISR. If the PVC is gone: rebuild from a replica. | None (p99 produce/consume latency blip for ~30s) | No |
| 2 | Kafka quorum loss (2/3 controllers or brokers down) | KafkaControllerQuorumUnavailable; produce returns NOT_ENOUGH_REPLICAS | All produce blocked; ingestion-gateway returns 503 | Restore at least one controller PVC; the KRaft quorum recovers. Run recover_cluster_metadata if metadata is corrupted. | Ingest stalls; edge-gateway spools to disk (72h capacity) | No (edge spool absorbs) |
| 3 | Consumer crash (any consumer group instance) | Pod restart; kafka_consumergroup_lag spike | One consumer instance gone; partitions redistribute | K8s restarts pod; lag catches up. | None (consumer lag briefly rises) | No |
| 4 | All archive-writer replicas unavailable | kafka_consumergroup_lag{group="archive-writer"} > 60000 for 5m | Archive freshness degrades; COLD becomes stale | Scale up replicas; investigate root cause; lag burns down. Kafka retention (7d) bounds recovery window. | Queries to COLD show gap; banner in UI | No (as long as recovery < 7d) |
| 5 | OpenSearch node down (1 of N) | opensearch_cluster_status=yellow | Shards rebalance to survivors; query latency rises. | Node recovers from its PVC or is replaced; re-replication restores green. | Minor latency bump | No (number_of_replicas=1, so a second copy of every shard survives) |
| 6 | OpenSearch full cluster outage | opensearch_cluster_status=red or unreachable | HOT search unavailable | Restart; worst case, restore from snapshot (6h RPO). | Searches fall back to COLD via query-api; UI banner shows "HOT degraded, using COLD (slower)". Alerts for last-N-hour queries may be delayed. | No (archive is source of truth; indexer resumes from Kafka offsets) |
| 7 | MinIO node down (1 of N) | EC:4+2 quorum metric | Zero data-plane impact; object-lock intact. | Node recovers; heal runs. | None | No |
| 8 | MinIO cluster outage | All nodes unreachable | archive-writer cannot flush; buffers in memory up to a cap | archive-writer slows consumer commits (backpressure) → Kafka retains messages → edge spools if prolonged. Restore MinIO or fail over to the secondary bucket. | New events pile up but are eventually archived; COLD queries fail until MinIO is back | No (bounded by Kafka retention and edge spool) |
| 9 | Network partition (intra-cluster) | Kubernetes events, node NotReady, Kafka ISR shrinks | Tainted nodes evicted; workloads reschedule on the healthy side | Partition heals; standard reconciliation. | Latency spike; possible double-processing is handled by idempotent writes (eventId dedup; sketch after this table) | No |
| 10 | Network partition (edge ↔ cluster) | Edge heartbeat stops; edge_connected{site} == 0 | One hospital's uplink down | Edge-gateway spools locally on disk (72h buffer). On reconnect, batch replay in order per device (sketch after this table). | No current-event visibility in the central UI for that site during the outage. | No (spool) if < 72h; otherwise the oldest spooled events may be dropped |
| 11 | Edge device disconnect (single device) | device_last_seen age > threshold | One device's data gap | When the device reconnects, it resumes from its sequenceNumber. The operator sees the gap in the timeline. | Visible gap in that device's timeline | Yes, if the device loses its local buffer to a power loss (device-class dependent); the platform side is unaffected |
| 12 | Schema mismatch (payload fails validation) | DLQ topic rate up; ingestion_validation_failures_total | Specific event type | DLQ Explorer UI; operator decides: reject, or update schema + replay from DLQ | Affected event type missing from HOT until replayed; archive not affected (DLQ is pre-archive) | No (events persisted in DLQ quarantine bucket) |
| 13 | Schema mismatch detected post-ingest (a consumer finds a malformed event that was indexed anyway) | consumer_envelope_defense_failures_total > 0 | A few records in OpenSearch | Investigate the chain; replay-from-cold with a corrected mapping; delete the polluted HOT index range. | Minor, scoped to the investigation | No (archive is canonical) |
| 14 | PVC corruption (Kafka or OpenSearch) | Broker/node fails to start; FS errors in logs | The affected replica | Delete PVC; recreate; data re-replicates from peers. If all replicas corrupt, restore from snapshot / replay from COLD. | Brief capacity loss | No if replicated; Yes if all replicas corrupt AND snapshot missing |
| 15 | Disk full (Kafka broker) | kubelet_volume_stats_used_bytes / capacity > 0.9; Kafka LogDirFailure | Broker drops offline | Expand the PVC (if the storage class supports expansion); scale before usage reaches 85%. Retention-based auto-eviction should normally prevent this. | Same as #1 for one broker | No |
| 16 | Disk full (MinIO) | PV metrics; MinIO heal events | New writes fail | Expand capacity; add a drive set; tier to S3 cold storage when in cloud. archive-writer backpressures. | See #8 | No (bounded as above) |
| 17 | Disk full (OpenSearch data node) | Disk watermark breach (cluster.routing.allocation.disk.watermark.flood_stage) | Index set to read-only | Roll ILM forward; delete the oldest WARM indices once the archive is verified; expand the PVC. | Write errors on the affected indices | No (can re-index from Kafka/archive) |
| 18 | Helm upgrade with bad config | Readiness probe fails; kubectl rollout stuck | New pods unhealthy | helm rollback to the previous revision (Helmfile's atomic flag rolls back failed upgrades automatically). For StatefulSets, PDBs plus surge=0 prevent simultaneous replica loss. | Intermittent 503s during rollout (rare with surge=0) | No |
| 19 | Helm upgrade of a stateful app breaks the storage layout (e.g., Kafka log dir changed) | Pod crashloop | One StatefulSet | Roll back; if the config can't be rolled back cleanly, restore from snapshot. PR review plus helm diff in CI catch most of these. | Brief outage on that tier | No (with the snapshot policy) |
| 20 | Schema Registry down | Probe failure | Cold path only: new pods can't fetch a not-yet-cached schema | Running pods keep validating (in-proc cache). Fix/redeploy registry; pods resync. | None in steady state; new event type rollouts blocked | No |
| 21 | Postgres outage (metadata) | Probe failure, connection errors | Reads of tenant config, investigations, and manifests | ingestion-gateway keeps running on its last-known config snapshot (5-min TTL cache). Query API degrades (no manifests → COLD reads limited to today's prefix scan). UI cannot update investigations. | Degraded admin/query; ingest unaffected | No |
| 22 | Redis outage (rate limits / cursors) | Probe failure | Rate-limit enforcement and cursors lost | Ingestion falls back to a per-pod local token bucket (approximate; sketch after this table) and returns 503 with Retry-After on the safety margin. Cursors are invalidated; clients restart paging. | Pagination resets; stricter ingest limit for safety | No |
| 23 | OIDC provider outage | Token introspection failures | New sign-ins blocked | Existing JWTs still accepted until expiry (15 min). Platform-admin has a break-glass mTLS path. | Users can't log in until IdP restored | No |
| 24 | Ingress / LB failure | External probe fails | Public edge of the cluster | Traefik/NGINX scale via HPA; the LB is either cloud-managed (self-healing) or an on-prem keepalived HA pair. | Edge-gateway sees 5xx → spool engages | No (spool) |
| 25 | Cloud cutover (planned migration local → AWS) | Operator-initiated | All services migrate | See portability-matrix. Dual-write from ingestion-gateway during cutover; replay historical data from COLD; cut over the edge-gateway config. | Planned maintenance window; archive queries continue | No (planned, tested) |
| 26 | Cloud cutover (unplanned; cloud region down) | Cloud provider status | One region | Fail over to the secondary region if multi-region is enabled; otherwise wait. Edge spools. | Visibility gap for that region until failover or recovery | No, if multi-region + spool within budget |
| 27 | Chain break detected (integrity-verifier) | audit.integrity.v1 event | Forensic flag, scoped to (tenantId, deviceId, day) | Quarantine the affected day in COLD; investigate; replay from Kafka retention if within 7d, else accept the gap with a signed incident record (verification sketch after this table). | Investigation flag in the UI | Possibly (only if a genuine tamper occurred) |
| 28 | Malformed manifest signature | Manifest verifier | One day of one tenant | Treat the day as unverified; re-archive from Kafka if within retention; otherwise flag it with an incident record. | Investigator badge "unverified day" in the UI | Same as above |
| 29 | Clock skew on edge (device time far off) | eventTime - ingestionTime histograms | Per-device | ingestionTime is authoritative; queries by ingestionTime work; queries by eventTime are flagged when skew > 5m. | Timeline shows skew warning | No |
| 30 | Upgrade of OpenSearch ILM breaks transitions | ILM step: error | Index stuck in hot past 30d | Manually advance state after fixing policy; no data loss. | Disk pressure rises until fixed | No |
| 31 | Secret rotation failure (Vault / ExternalSecrets) | ExternalSecret SecretSyncError | Service may restart with stale secret | Roll back rotation, fix, re-sync. Short-lived creds mean blast radius is bounded to TTL. | Service restart storm if cert rotation cascades | No |
| 32 | Runaway query (very large aggregation) | Slow-query log; circuit breaker | Per-tenant pool saturation | The circuit breaker cancels it; the scheduler rejects the next one. The tenant sees a 429 or 499. | That query fails; others unaffected | No |
| 33 | KEDA operator down | up{job="keda-operator"} == 0 for 5m | No new scaling decisions; existing HPAs keep running on last metric values | Restart operator (2 replicas, HA); KEDA has no persistent state besides CRs | None in steady state; slow response to sudden load spikes during outage | No |
| 34 | KEDA metrics-apiserver down | HPA FailedGetExternalMetric events | HPAs can't fetch external metrics; fall back to last known values | Restart pods; redundant replicas | None in steady state; stale scaling during outage | No |
| 35 | Prometheus unreachable (KEDA prometheus trigger stale) | KEDA scaler_errors_total spike | prometheus-triggered scalers mark inactive → fall back to minReplicaCount. Kafka-triggered scalers unaffected. | Restore Prometheus; next poll recovers | HTTP services may under-scale during outage | No |
| 36 | Kafka SASL secret rotation misconfigured | KEDA kafka scaler errors; consumers also fail | Both KEDA scalers and live consumers affected | Roll back the secret; the ExternalSecrets retry loop re-syncs. Keep old and new key versions live during rotation (dual-trust window). | Consumer lag grows; ingest unaffected | No (bounded by Kafka retention) |
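Row 9's idempotency deserves one concrete illustration. A minimal sketch of eventId-based dedup on the HOT write path, assuming an opensearch-py client; the index naming and envelope fields are illustrative, not the actual mapping:

```python
# Idempotent HOT indexing: the envelope's eventId becomes the document _id
# and op_type=create refuses overwrites, so a replayed event after a
# partition heal is a 409 no-op instead of a duplicate document.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://opensearch:9200"])  # illustrative endpoint

def index_batch(events):
    actions = (
        {
            "_op_type": "create",                 # fail on an existing _id
            "_index": f"events-{e['tenantId']}",  # illustrative naming
            "_id": e["eventId"],                  # deterministic dedup key
            "_source": e,
        }
        for e in events
    )
    # Version conflicts from replays are expected, so collect them rather
    # than failing the whole batch.
    ok, errors = helpers.bulk(client, actions, raise_on_error=False)
    duplicates = [e for e in errors if e.get("create", {}).get("status") == 409]
    return ok, duplicates
```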
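Rows 2, 10, and 24 all lean on the edge spool. A minimal sketch of its replay half under assumed conventions: events spooled as JSON lines carrying deviceId and sequenceNumber, replayed per device in sequence order on reconnect (the spool path and ingest URL are illustrative):

```python
# Replay spooled events after an uplink outage: group by device, sort by
# sequenceNumber, POST in order. Ordering is per device, matching row 10.
import json
from collections import defaultdict
from pathlib import Path

import requests

SPOOL_DIR = Path("/var/spool/edge-gateway")          # illustrative path
INGEST_URL = "https://ingest.example.com/v1/events"  # illustrative endpoint

def replay_spool(batch_size=500):
    per_device = defaultdict(list)
    for spool_file in sorted(SPOOL_DIR.glob("*.jsonl")):
        for line in spool_file.read_text().splitlines():
            event = json.loads(line)
            per_device[event["deviceId"]].append(event)

    for device_id, events in per_device.items():
        events.sort(key=lambda e: e["sequenceNumber"])  # restore device order
        for i in range(0, len(events), batch_size):
            resp = requests.post(INGEST_URL, json=events[i:i + batch_size], timeout=30)
            resp.raise_for_status()  # abort on failure; the spool stays on disk
```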
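Row 22's Redis fallback is a per-pod token bucket: with no shared state, each pod enforces a conservative slice of the global limit. A minimal sketch; the rate, pod count, and safety margin are illustrative:

```python
# Local token bucket used while Redis is down. Approximate by design:
# each of N pods enforces (global rate / N) minus a safety margin, so the
# fleet as a whole stays under the true limit without coordination.
import time

class LocalTokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds 503 with Retry-After

# Illustrative sizing: global limit 1000 req/s across 4 pods, 20% margin.
fallback = LocalTokenBucket(rate_per_s=1000 / 4 * 0.8, burst=50)
```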
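Rows 27-28 assume a hash chain per (tenantId, deviceId, day). A minimal sketch of the verifier's core loop, assuming each archived record carries the SHA-256 of the previous record's canonical JSON in a prevHash field (the field name is illustrative):

```python
# Walk a day's chain: record i+1 must reference the digest of record i.
# A mismatch pins the break to an offset, which scopes the quarantine.
import hashlib
import json

def verify_chain(records, genesis="0" * 64):
    prev_digest = genesis
    for offset, record in enumerate(records):
        if record["prevHash"] != prev_digest:
            return offset  # first broken link -> quarantine this day
        canonical = json.dumps(record, sort_keys=True).encode()
        prev_digest = hashlib.sha256(canonical).hexdigest()
    return None  # chain intact
```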
2. Recovery Objectives
| Tier | RPO | RTO |
|---|---|---|
| Ingest (edge to ingestion-gateway) | 0 (spool) | 5 min |
| Kafka | 0 (replication) | 5 min (single broker); 30 min (full cluster) |
| HOT (OpenSearch) | 6h (snapshot) | 1h (restart) / 4h (snapshot restore) |
| WARM | same as HOT | same as HOT |
| COLD (MinIO/S3) | 0 (object storage durability) | 15 min for regional fail-over |
| Postgres | 5 min (WAL ship) | 30 min |
| Redis | best-effort | N/A (fallback mode) |
| Full region (multi-region, opt-in) | 15 min | 2h |
3. Blast Radius Diagram
In the diagram, COLD is the green "source of truth" anchor: any higher tier can be rebuilt from COLD as long as the archive is current (< 10 min lag). A sketch of such a rebuild follows.
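A minimal sketch of what "rebuilt from COLD" means, assuming the archive is day-partitioned Parquet and reusing the idempotent _id scheme from the dedup sketch above, so a partial rebuild can simply be re-run (paths, client setup, and field names are illustrative):

```python
# Stream a tenant's Parquet objects from COLD back into a fresh HOT index.
import pyarrow.parquet as pq
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://opensearch:9200"])  # illustrative endpoint

def rebuild_from_cold(parquet_paths, tenant_id):
    for path in parquet_paths:  # e.g. objects pulled from the archive bucket
        table = pq.read_table(path)
        actions = (
            {
                "_op_type": "create",            # idempotent: re-runs are no-ops
                "_index": f"events-{tenant_id}",
                "_id": row["eventId"],
                "_source": row,
            }
            for row in table.to_pylist()
        )
        helpers.bulk(client, actions, raise_on_error=False)
```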
4. Game Day
Recurring chaos exercises (per roadmap.md week 5):
| Exercise | Cadence | Expected outcome |
|---|---|---|
| Kill a random Kafka broker | weekly | No ingest impact; alert fires and clears within 3 min (sketch after this table) |
| Delete HOT index for tenant X | quarterly | Replay from COLD rebuilds HOT within SLA |
| Sever edge → cluster uplink for 6h | monthly | Edge spool absorbs; reconnect replays in order |
| Cloud region simulated outage | quarterly | Secondary region promoted (if opt-in) |
| Corrupt a Parquet file (inject a bit flip) | monthly | Chain verifier flags it; alert fires |
| Schema break deliberately pushed | every PR | CI blocks; merge prevented |
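The weekly broker-kill exercise needs little more than the Kubernetes API. A minimal sketch using the official Python client; the namespace and label selector are illustrative and should match the actual Kafka deployment:

```python
# Game day: delete one random Kafka broker pod, then watch for the row 1
# behavior (alert fires and clears within 3 minutes, no ingest impact).
import random

from kubernetes import client, config

def kill_random_broker(namespace="kafka", selector="app=kafka"):
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    victim = random.choice(pods)
    print(f"deleting {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
```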
Results and runbooks live in deploy/runbooks/ (not in this tree).