Failure Scenarios

A catalog of the failures the platform must survive, with detection, blast radius, recovery, user impact, and whether data loss is possible. This is the operational contract; runbook links go in deploy/runbooks/ (referenced, not in this tree).


1. Failure Matrix

| # | Failure | Detection | Blast Radius | Recovery | User Impact | Data Loss Possible? |
|---|---------|-----------|--------------|----------|-------------|---------------------|
| 1 | Single Kafka broker down | under_replicated_partitions > 0; Prom alert KafkaBrokerDown (alert sketch below) | Zero data-plane impact with RF=3 + min.ISR=2; leaders rebalance | K8s restarts the pod; PVC reattaches; broker catches up from the ISR. If the PVC is gone: rebuild from a replica. | None (brief p99 produce/consume latency blip for ~30s) | No |
| 2 | Kafka quorum loss (2/3 controllers or brokers down) | KafkaControllerQuorumUnavailable; produce returns NOT_ENOUGH_REPLICAS | All produce blocked; ingestion-gateway returns 503 | Restore at least one controller PVC; KRaft quorum recovers. Run recover_cluster_metadata if corrupted. | Ingest stalls; edge-gateway spools to disk (72h capacity) | No (edge spool absorbs) |
| 3 | Consumer crash (any consumer-group instance) | Pod restart; kafka_consumergroup_lag spike | One consumer instance gone; its partitions redistribute | K8s restarts the pod; lag catches up | None (consumer lag briefly rises) | No |
| 4 | All archive-writer replicas unavailable | kafka_consumergroup_lag{group="archive-writer"} > 60000 for 5m (alert sketch below) | Archive freshness degrades; COLD goes stale | Scale up replicas; investigate root cause; lag burns down. Kafka retention (7d) bounds the recovery window. | Queries to COLD show a gap; banner in UI | No (as long as recovery < 7d) |
| 5 | OpenSearch node down (1 of N) | opensearch_cluster_status=yellow | Shards rebalance to survivors; query latency up | Node recovers from its PVC or is replaced; re-replication | Minor latency bump | No (1 replica per shard) |
| 6 | OpenSearch full cluster outage | opensearch_cluster_status=red or unreachable | HOT search unavailable | Restart; worst case, restore from snapshot (6h RPO) | Searches fall back to COLD via query-api; UI banner shows "HOT degraded, using COLD (slower)". Alerts for last-N-hour queries may be delayed. | No (archive is source of truth; indexer resumes from Kafka offsets) |
| 7 | MinIO node down (1 of N) | EC:4+2 quorum metric | Zero data-plane impact; object-lock intact | Node recovers; heal runs | None | No |
| 8 | MinIO cluster outage | All nodes unreachable | archive-writer cannot flush; buffers in memory up to its cap | archive-writer slows consumer commits (backpressure) → Kafka retains messages → edge spools if prolonged. Restore MinIO or fail over to the secondary bucket. | New events pile up but are eventually archived; COLD queries fail until MinIO is back | No (bounded by Kafka retention and edge spool) |
| 9 | Network partition (intra-cluster) | Kubernetes events; node NotReady; Kafka ISR shrinks | Tainted nodes evicted; workloads reschedule on the healthy side | Partition heals; standard reconciliation | Latency spike; possible double-processing handled by idempotent writes (eventId dedup) | No |
| 10 | Network partition (edge ↔ cluster) | Edge heartbeat stops; edge_connected{site} == 0 | One hospital's uplink down | Edge-gateway spools locally to disk (72h buffer). On reconnect, batch replay in order per device. | No current-event visibility in the central UI for that site during the outage | No (spool) if < 72h; otherwise the oldest spooled events may be dropped |
| 11 | Edge device disconnect (single device) | device_last_seen age > threshold | One device's data gap | When the device reconnects, it resumes from its sequenceNumber. Operator sees the gap in the timeline. | Visible gap in that device's timeline | Yes, if the device loses its local buffer on power loss (depends on device class); the platform side is fine |
| 12 | Schema mismatch (payload fails validation) | DLQ topic rate up; ingestion_validation_failures_total | Specific event type | DLQ Explorer UI; operator decides: reject, or update the schema and replay from DLQ | Affected event type missing from HOT until replayed; archive unaffected (DLQ is pre-archive) | No (events persisted in the DLQ quarantine bucket) |
| 13 | Schema mismatch detected post-ingest (a consumer finds a malformed event that was somehow indexed) | consumer_envelope_defense_failures_total > 0 | A few records in OpenSearch | Investigate the chain; replay-from-cold with a corrected mapping; delete the polluted HOT index range | Minor, scoped to the investigation | No (archive canonical) |
| 14 | PVC corruption (Kafka or OpenSearch) | Broker/node fails to start; FS errors in logs | The affected replica | Delete the PVC; recreate; data re-replicates from peers. If all replicas are corrupt, restore from snapshot / replay from COLD. | Brief capacity loss | No if replicated; yes if all replicas are corrupt AND the snapshot is missing |
| 15 | Disk full (Kafka broker) | kubelet_volume_stats_used_bytes / capacity > 0.9; Kafka LogDirFailure | Broker drops offline | Expand the PVC (if the storage class supports it); scale up before hitting 85%. Retention-based auto-eviction should normally prevent this. | Same as #1 for one broker | No |
| 16 | Disk full (MinIO) | PV metrics; MinIO heal events | New writes fail | Expand; add a drive set; tier to S3 cold if in cloud. archive-writer backpressures. | See #8 | No (bounded as above) |
| 17 | Disk full (OpenSearch data node) | cluster.routing.allocation.disk.watermark.flood_stage watermark | Index set to read-only | Rotate ILM forward; delete very old WARM indices if the archive is verified; expand the PVC | Write errors on affected indices | No (can re-index from Kafka/archive) |
| 18 | Helm upgrade with bad config | Readiness probe fails; kubectl rollout stuck | New pods unhealthy | helm rollback to the previous revision (Helmfile's --atomic makes this automatic). For stateful sets, PDBs + surge=0 prevent simultaneous loss. | Intermittent 503s during rollout (rare with surge=0) | No |
| 19 | Helm upgrade of a stateful app breaks the storage layout (e.g., Kafka log dir changed) | Pod crashloop | One stateful set | Roll back; if the config can't be rolled back cleanly, restore from snapshot. PR review + helm diff in CI guard against this. | Brief outage on that tier | No (with the snapshot policy) |
| 20 | Schema Registry down | Probe failure | Cold path only: new pods can't fetch a not-yet-cached schema | Running pods keep validating (in-process cache). Fix/redeploy the registry; pods resync. | None in steady state; new event-type rollouts blocked | No |
| 21 | Postgres outage (metadata) | Probe failure; connection errors | Tenant config, investigations, and manifest reads | ingestion-gateway keeps going on its last-known config snapshot (5-min TTL cache). Query API degrades (no manifests → COLD reads limited to today's prefix scan). UI cannot update investigations. | Degraded admin/query; ingest unaffected | No |
| 22 | Redis outage (rate limits / cursors) | Probe failure | Rate-limit enforcement and cursors lost | Ingestion falls back to a per-pod local token bucket (approximate) and returns 503 with Retry-After at a safety margin. Cursors are invalidated; clients restart paging. | Pagination reset; stricter ingest limit for safety | No |
| 23 | OIDC provider outage | Token introspection failures | New sign-ins blocked | Existing JWTs remain accepted until expiry (15 min). Platform-admin has a break-glass mTLS path. | Users can't log in until the IdP is restored | No |
| 24 | Ingress / LB failure | External probe fails | Public edge of the cluster | Traefik/NGINX under HPA; the LB is cloud-managed (self-heals) or on-prem as a keepalived HA pair | Edge-gateway gets 5xx → spool engages | No (spool) |
| 25 | Cloud cutover (planned migration, local → AWS) | Operator-initiated | All services migrate | See portability-matrix. Dual-write from ingestion-gateway during cutover; replay historical data from COLD; cut over the edge-gateway config. | Planned maintenance window; archive queries continue | No (planned, tested) |
| 26 | Cloud cutover (unplanned, cloud region down) | Cloud provider status | One region | Fail over to the secondary region if multi-region is enabled; otherwise wait. Edge spools. | Region-wide outage visible until failover | No, if multi-region + spool stay within budget |
| 27 | Chain break detected (integrity-verifier) | audit.integrity.v1 event | Forensic flag, scoped to (tenantId, deviceId, day) | Quarantine the affected day in COLD; investigate; replay from Kafka retention if within 7d, else accept the gap with a signed incident record | Investigation flag in UI | Possibly (only if a genuine tamper occurred) |
| 28 | Malformed manifest signature | Verifier | One day of one tenant | Treat the day as unverified; re-archive from Kafka if within retention; otherwise flag + incident record | Investigator badge "unverified day" in UI | Same as above |
| 29 | Clock skew on edge (device time far off) | eventTime - ingestionTime histograms | Per-device | ingestionTime is authoritative; queries by ingestionTime work; queries by eventTime are flagged when skew > 5m | Timeline shows a skew warning | No |
| 30 | OpenSearch ILM upgrade breaks transitions | ILM step: error | Index stuck in hot past 30d | Manually advance the state after fixing the policy; no data loss | Disk pressure rises until fixed | No |
| 31 | Secret rotation failure (Vault / ExternalSecrets) | ExternalSecret SecretSyncError | A service may restart with a stale secret | Roll back the rotation, fix, re-sync. Short-lived creds bound the blast radius to the TTL. | Service restart storm if cert rotation cascades | No |
| 32 | Runaway query (very large aggregation) | Slow-query log; circuit breaker | Per-tenant pool saturation | Circuit breaker cancels; the scheduler rejects the next one. Tenant sees 429 or 499. | That query fails; others unaffected | No |
| 33 | KEDA operator down | up{job="keda-operator"} == 0 for 5m | No new scaling decisions; existing HPAs keep running on last metric values | Restart the operator (2 replicas, HA); KEDA has no persistent state besides CRs | None in steady state; slow response to sudden load spikes during the outage | No |
| 34 | KEDA metrics-apiserver down | HPA FailedGetExternalMetric events | HPAs can't fetch external metrics; fall back to last known values | Restart pods; redundant replicas | None in steady state; stale scaling during the outage | No |
| 35 | Prometheus unreachable (KEDA prometheus trigger stale) | KEDA scaler_errors_total spike | Prometheus-triggered scalers go inactive → fall back to minReplicaCount; Kafka-triggered scalers unaffected | Restore Prometheus; the next poll recovers | HTTP services may under-scale during the outage | No |
| 36 | Kafka SASL secret rotation misconfigured | KEDA kafka scaler errors; consumers also fail | Both KEDA scalers and live consumers affected | Roll back the secret; ExternalSecrets retry loop. Keep old + new key versions during rotation (dual-trust window). | Consumer lag grows; ingest unaffected | No (bounded by Kafka retention) |
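
Rows 1 and 4 name concrete Prometheus signals, so they translate almost mechanically into alerting rules. Below is a minimal sketch: the metric names and thresholds come straight from the matrix, while the group name, hold durations, severities, and runbook file names are illustrative assumptions.

```yaml
# Sketch of Prometheus alerting rules for matrix rows 1 and 4.
# Metric names and thresholds are taken from the matrix; the group name,
# "for" durations, severities, and runbook paths are assumptions.
groups:
  - name: audit-platform-failure-matrix
    rules:
      # Row 1: single Kafka broker down. Any under-replicated partition
      # means at least one replica has dropped out of the ISR.
      - alert: KafkaBrokerDown
        expr: 'under_replicated_partitions > 0'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Kafka broker lost; partitions under-replicated"
          runbook: "deploy/runbooks/kafka-broker-down.md"  # assumed file name
      # Row 4: archive-writer fleet unavailable. Threshold and hold duration
      # are exactly the matrix values (> 60000 for 5m).
      - alert: ArchiveWriterLagHigh
        expr: 'kafka_consumergroup_lag{group="archive-writer"} > 60000'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "archive-writer lag above 60k; COLD freshness degrading"
          runbook: "deploy/runbooks/archive-writer-lag.md"  # assumed file name
```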

2. Recovery Objectives

| Tier | RPO | RTO |
|------|-----|-----|
| Ingest (edge to ingestion-gateway) | 0 (spool) | 5 min |
| Kafka | 0 (replication) | 5 min broker; 30 min cluster |
| HOT (OpenSearch) | 6h (snapshot; CronJob sketched below) | 1h (restart) / 4h (snapshot restore) |
| WARM | same as HOT | same as HOT |
| COLD (MinIO/S3) | 0 (object storage durability) | 15 min for regional failover |
| Postgres | 5 min (WAL ship) | 30 min |
| Redis | best-effort | N/A (fallback mode) |
| Full region | 15 min (multi-region, opt-in) | 2h (multi-region, opt-in) |
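
The 6h HOT RPO implies taking an OpenSearch snapshot at least every six hours. One lightweight way to drive that cadence is a Kubernetes CronJob that calls the snapshot API, as sketched below. The repository name (hot-snapshots), the service endpoint, and plain-HTTP access are assumptions; a production setup would add TLS and credentials.

```yaml
# Sketch: snapshot OpenSearch every 6h to meet the 6h HOT RPO.
# Assumes a pre-registered snapshot repository "hot-snapshots" and an
# in-cluster service "opensearch:9200" reachable over plain HTTP.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: opensearch-snapshot
spec:
  schedule: "0 */6 * * *"   # top of every sixth hour
  concurrencyPolicy: Forbid # never run two snapshots at once
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshot
              image: curlimages/curl:8.7.1
              # wait_for_completion=true ties the Job's exit status to
              # snapshot success, so a failed snapshot shows as a failed Job.
              command: ["/bin/sh", "-c"]
              args:
                - >
                  curl -fsS -XPUT
                  "http://opensearch:9200/_snapshot/hot-snapshots/snap-$(date +%Y%m%d%H%M)?wait_for_completion=true"
```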

3. Blast Radius Diagram

In the diagram, COLD is the green "source of truth" anchor. Any higher tier can be rebuilt from COLD as long as the archive is current (< 10 min lag).
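
That freshness condition gates whether a rebuild-from-COLD is safe, so it is worth alerting on directly. A sketch follows, assuming the archive-writer exports its last successful flush as archive_last_flush_timestamp_seconds; that metric name is an assumption, since the matrix only defines archive lag in consumer-group terms.

```yaml
# Sketch: alert when the COLD archive is more than 10 minutes behind,
# i.e., when rebuilding a higher tier from COLD would miss recent events.
# "archive_last_flush_timestamp_seconds" is an assumed metric name.
groups:
  - name: audit-platform-cold-freshness
    rules:
      - alert: ColdArchiveStale
        expr: 'time() - archive_last_flush_timestamp_seconds > 600'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "COLD archive > 10 min behind; rebuild-from-COLD would miss recent events"
```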


4. Game Day

Recurring chaos exercises (per roadmap.md week 5):

| Exercise | Cadence | Expected outcome |
|----------|---------|------------------|
| Kill a random Kafka broker | weekly | No ingest impact; alert fires and clears within 3 min (automation sketched below) |
| Delete HOT index for tenant X | quarterly | Replay from COLD rebuilds HOT within SLA |
| Sever edge → cluster uplink for 6h | monthly | Edge spool absorbs; reconnect replays in order |
| Cloud region simulated outage | quarterly | Secondary region promoted (if opt-in) |
| Corrupt a Parquet file (inject bit-flip) | monthly | Chain verifier flags it; alert fires |
| Schema break deliberately pushed | every PR | CI blocks; merge prevented |
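
The weekly broker-kill exercise lends itself to automation. Below is a sketch using a Chaos Mesh Schedule; it assumes Chaos Mesh is installed in the cluster, and the namespace, label selector, and cron window are illustrative.

```yaml
# Sketch: kill one random Kafka broker pod every Tuesday at 10:00.
# Assumes Chaos Mesh is installed; namespace and labels are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-kafka-broker-kill
spec:
  schedule: "0 10 * * 2"    # Tuesdays, 10:00 cluster time
  concurrencyPolicy: Forbid # skip a run if the previous one is still active
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one               # pick exactly one matching pod at random
    selector:
      namespaces:
        - kafka
      labelSelectors:
        app.kubernetes.io/name: kafka
```

The expected-outcome row above doubles as the pass/fail criterion: the KafkaBrokerDown alert should fire and clear within 3 minutes with no ingest impact.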

Results and runbooks live in deploy/runbooks/ (not in this tree).