Failure Scenarios
A catalog of the failures the platform must survive, covering detection, blast radius, recovery, user impact, and whether data loss is possible. This is the operational contract; per-scenario runbooks live in deploy/runbooks/ (referenced here, not in this tree).
1. Failure Matrix
| # | Failure | Detection | Blast Radius | Recovery | User Impact | Data Loss Possible? |
|---|---|---|---|---|---|---|
| 1 | Single Kafka broker down | under_replicated_partitions > 0; Prom alert KafkaBrokerDown | Zero data-plane impact with RF=3 and min.insync.replicas=2; leaders rebalance to surviving brokers. | K8s restarts the pod; the PVC reattaches; the broker catches up from the ISR. If the PVC is gone: rebuild from a replica. | None (p99 produce/consume latency blip for ~30s) | No |
| 2 | Kafka quorum loss (2/3 controllers or brokers down) | KafkaControllerQuorumUnavailable; produce returns NOT_ENOUGH_REPLICAS | All produce blocked; ingestion-gateway returns 503 | Restore at least one controller PVC; the KRaft quorum recovers. Run recover_cluster_metadata if metadata is corrupted. | Ingest stalls; edge-gateway spools to disk (72h capacity) | No (edge spool absorbs) |
| 3 | Consumer crash (any consumer group instance) | Pod restart; kafka_consumergroup_lag spike | One consumer instance gone; partitions redistribute | K8s restarts pod; lag catches up. | None (consumer lag briefly rises) | No |
| 4 | All archive-writer replicas unavailable | kafka_consumergroup_lag{group="archive-writer"} > 60000 for 5m | Archive freshness degrades; COLD becomes stale | Scale up replicas; investigate root cause; lag burns down. Kafka retention (7d) bounds recovery window. | Queries to COLD show gap; banner in UI | No (as long as recovery < 7d) |
| 5 | OpenSearch node down (1 of N) | opensearch_cluster_status=yellow | Shards rebalance to survivors; query latency rises. | Node recovers from its PVC or is replaced; re-replication restores green. | Minor latency bump | No (number_of_replicas=1, so a second copy of every shard survives) |
| 6 | OpenSearch full cluster outage | opensearch_cluster_status=red or unreachable | HOT search unavailable | Restart; worst case, restore from snapshot (6h RPO). | Searches fall back to COLD via query-api; UI banner shows "HOT degraded, using COLD (slower)". Alerts for last-N-hour queries may be delayed. | No (archive is source of truth; indexer resumes from Kafka offsets) |
| 7 | MinIO node down (1 of N) | EC:4+2 quorum metric | Zero data-plane impact; object-lock intact. | Node recovers; heal runs. | None | No |
| 8 | MinIO cluster outage | All nodes unreachable | archive-writer cannot flush; buffers in memory up to a cap | archive-writer slows consumer commits (backpressure) → Kafka retains messages → edge spools if prolonged. Restore MinIO or fail over to the secondary bucket. | New events pile up but are eventually archived; COLD queries fail until MinIO is back | No (bounded by Kafka retention and edge spool) |
| 9 | Network partition (intra-cluster) | Kubernetes events, node NotReady, Kafka ISR shrinks | Tainted nodes evicted; workloads reschedule on the healthy side | Partition heals; standard reconciliation. | Latency spike; possible double-processing is handled by idempotent writes (eventId dedup; sketch after this table) | No |
| 10 | Network partition (edge ↔ cluster) | Edge heartbeat stops; edge_connected{site} == 0 | One hospital's uplink down | Edge-gateway spools locally on disk (72h buffer). On reconnect, batch replay in order per device (sketch after this table). | No current-event visibility in the central UI for that site during the outage. | No (spool) if < 72h; otherwise the oldest spooled events may be dropped |
| 11 | Edge device disconnect (single device) | device_last_seen age > threshold | One device's data gap | When the device reconnects, it resumes from its sequenceNumber. The operator sees the gap in the timeline. | Visible gap in that device's timeline | Yes, if the device loses its local buffer to a power loss (device-class dependent); the platform side is unaffected |
| 12 | Schema mismatch (payload fails validation) | DLQ topic rate up; ingestion_validation_failures_total | Specific event type | DLQ Explorer UI; operator decides: reject, or update schema + replay from DLQ | Affected event type missing from HOT until replayed; archive not affected (DLQ is pre-archive) | No (events persisted in DLQ quarantine bucket) |
| 13 | Schema mismatch detected post-ingest (a consumer finds a malformed event that was indexed anyway) | consumer_envelope_defense_failures_total > 0 | A few records in OpenSearch | Investigate the chain; replay-from-cold with a corrected mapping; delete the polluted HOT index range. | Minor, scoped to the investigation | No (archive is canonical) |
| 14 | PVC corruption (Kafka or OpenSearch) | Broker/node fails to start; FS errors in logs | The affected replica | Delete PVC; recreate; data re-replicates from peers. If all replicas corrupt, restore from snapshot / replay from COLD. | Brief capacity loss | No if replicated; Yes if all replicas corrupt AND snapshot missing |
| 15 | Disk full (Kafka broker) | kubelet_volume_stats_used_bytes / capacity > 0.9; Kafka LogDirFailure | Broker drops offline | Expand the PVC (if the storage class supports expansion); scale before usage reaches 85%. Retention-based auto-eviction should normally prevent this. | Same as #1 for one broker | No |
| 16 | Disk full (MinIO) | PV metrics; MinIO heal events | New writes fail | Expand capacity; add a drive set; tier to S3 cold storage when in cloud. archive-writer backpressures. | See #8 | No (bounded as above) |
| 17 | Disk full (OpenSearch data node) | Disk watermark breach (cluster.routing.allocation.disk.watermark.flood_stage) | Index set to read-only | Roll ILM forward; delete the oldest WARM indices once the archive is verified; expand the PVC. | Write errors on the affected indices | No (can re-index from Kafka/archive) |
| 18 | Helm upgrade with bad config | Readiness probe fails; kubectl rollout stuck | New pods unhealthy | helm rollback to the previous revision (Helmfile's atomic flag rolls back failed upgrades automatically). For StatefulSets, PDBs plus surge=0 prevent simultaneous replica loss. | Intermittent 503s during rollout (rare with surge=0) | No |
| 19 | Helm upgrade of a stateful app breaks the storage layout (e.g., Kafka log dir changed) | Pod crashloop | One StatefulSet | Roll back; if the config can't be rolled back cleanly, restore from snapshot. PR review plus helm diff in CI catch most of these. | Brief outage on that tier | No (with the snapshot policy) |
| 20 | Schema Registry down | Probe failure | Cold path only: new pods can't fetch a not-yet-cached schema | Running pods keep validating (in-proc cache). Fix/redeploy registry; pods resync. | None in steady state; new event type rollouts blocked | No |
| 21 | Postgres outage (metadata) | Probe failure, connection errors | Reads of tenant config, investigations, and manifests | ingestion-gateway keeps running on its last-known config snapshot (5-min TTL cache). Query API degrades (no manifests → COLD reads limited to today's prefix scan). UI cannot update investigations. | Degraded admin/query; ingest unaffected | No |
| 22 | Redis outage (rate limits / cursors) | Probe failure | Rate-limit enforcement and cursors lost | Ingestion falls back to a per-pod local token bucket (approximate; sketch after this table) and returns 503 with Retry-After on the safety margin. Cursors are invalidated; clients restart paging. | Pagination resets; stricter ingest limit for safety | No |
| 23 | OIDC provider outage | Token introspection failures | New sign-ins blocked | Existing JWTs still accepted until expiry (15 min). Platform-admin has a break-glass mTLS path. | Users can't log in until IdP restored | No |
| 24 | Ingress / LB failure | External probe fails | Public edge of the cluster | Traefik/NGINX scale via HPA; the LB is either cloud-managed (self-healing) or an on-prem keepalived HA pair. | Edge-gateway sees 5xx → spool engages | No (spool) |
| 25 | Cloud cutover (planned migration local → AWS) | Operator-initiated | All services migrate | See portability-matrix. Dual-write from ingestion-gateway during cutover; replay historical data from COLD; cut over the edge-gateway config. | Planned maintenance window; archive queries continue | No (planned, tested) |
| 26 | Cloud cutover (unplanned; cloud region down) | Cloud provider status | One region | Fail over to the secondary region if multi-region is enabled; otherwise wait. Edge spools. | Visibility gap for that region until failover or recovery | No, if multi-region + spool within budget |
| 27 | Chain break detected (integrity-verifier) | audit.integrity.v1 event | Forensic flag, scoped to (tenantId, deviceId, day) | Quarantine the affected day in COLD; investigate; replay from Kafka retention if within 7d, else accept the gap with a signed incident record (verification sketch after this table). | Investigation flag in the UI | Possibly (only if a genuine tamper occurred) |
| 28 | Malformed manifest signature | Manifest verifier | One day of one tenant | Treat the day as unverified; re-archive from Kafka if within retention; otherwise flag it with an incident record. | Investigator badge "unverified day" in the UI | Same as above |
| 29 | Clock skew on edge (device time far off) | eventTime - ingestionTime histograms | Per-device | ingestionTime is authoritative; queries by ingestionTime work; queries by eventTime are flagged when skew > 5m. | Timeline shows skew warning | No |
| 30 | Upgrade of OpenSearch ILM breaks transitions | ILM step: error | Index stuck in hot past 30d | Manually advance state after fixing policy; no data loss. | Disk pressure rises until fixed | No |
| 31 | Secret rotation failure (Vault / ExternalSecrets) | ExternalSecret SecretSyncError | Service may restart with stale secret | Roll back rotation, fix, re-sync. Short-lived creds mean blast radius is bounded to TTL. | Service restart storm if cert rotation cascades | No |
| 32 | Runaway query (very large aggregation) | Slow-query log; circuit breaker | Per-tenant pool saturation | The circuit breaker cancels it; the scheduler rejects the next one. The tenant sees a 429 or 499. | That query fails; others unaffected | No |
| 33 | KEDA operator down | up{job="keda-operator"} == 0 for 5m | No new scaling decisions; existing HPAs keep running on last metric values | Restart operator (2 replicas, HA); KEDA has no persistent state besides CRs | None in steady state; slow response to sudden load spikes during outage | No |
| 34 | KEDA metrics-apiserver down | HPA FailedGetExternalMetric events | HPAs can't fetch external metrics; fall back to last known values | Restart pods; redundant replicas | None in steady state; stale scaling during outage | No |
| 35 | Prometheus unreachable (KEDA prometheus trigger stale) | KEDA scaler_errors_total spike | prometheus-triggered scalers mark inactive → fall back to minReplicaCount. Kafka-triggered scalers unaffected. | Restore Prometheus; next poll recovers | HTTP services may under-scale during outage | No |
| 36 | Kafka SASL secret rotation misconfigured | KEDA kafka scaler errors; consumers also fail | Both KEDA scalers and live consumers affected | Roll back the secret; the ExternalSecrets retry loop re-syncs. Keep old and new key versions live during rotation (dual-trust window). | Consumer lag grows; ingest unaffected | No (bounded by Kafka retention) |
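Row 9's idempotency deserves one concrete illustration. A minimal sketch of eventId-based dedup on the HOT write path, assuming an opensearch-py client; the index naming and envelope fields are illustrative, not the actual mapping:

```python
# Idempotent HOT indexing: the envelope's eventId becomes the document _id
# and op_type=create refuses overwrites, so a replayed event after a
# partition heal is a 409 no-op instead of a duplicate document.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://opensearch:9200"])  # illustrative endpoint

def index_batch(events):
    actions = (
        {
            "_op_type": "create",                 # fail on an existing _id
            "_index": f"events-{e['tenantId']}",  # illustrative naming
            "_id": e["eventId"],                  # deterministic dedup key
            "_source": e,
        }
        for e in events
    )
    # Version conflicts from replays are expected, so collect them rather
    # than failing the whole batch.
    ok, errors = helpers.bulk(client, actions, raise_on_error=False)
    duplicates = [e for e in errors if e.get("create", {}).get("status") == 409]
    return ok, duplicates
```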
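Rows 2, 10, and 24 all lean on the edge spool. A minimal sketch of its replay half under assumed conventions: events spooled as JSON lines carrying deviceId and sequenceNumber, replayed per device in sequence order on reconnect (the spool path and ingest URL are illustrative):

```python
# Replay spooled events after an uplink outage: group by device, sort by
# sequenceNumber, POST in order. Ordering is per device, matching row 10.
import json
from collections import defaultdict
from pathlib import Path

import requests

SPOOL_DIR = Path("/var/spool/edge-gateway")          # illustrative path
INGEST_URL = "https://ingest.example.com/v1/events"  # illustrative endpoint

def replay_spool(batch_size=500):
    per_device = defaultdict(list)
    for spool_file in sorted(SPOOL_DIR.glob("*.jsonl")):
        for line in spool_file.read_text().splitlines():
            event = json.loads(line)
            per_device[event["deviceId"]].append(event)

    for device_id, events in per_device.items():
        events.sort(key=lambda e: e["sequenceNumber"])  # restore device order
        for i in range(0, len(events), batch_size):
            resp = requests.post(INGEST_URL, json=events[i:i + batch_size], timeout=30)
            resp.raise_for_status()  # abort on failure; the spool stays on disk
```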
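Row 22's Redis fallback is a per-pod token bucket: with no shared state, each pod enforces a conservative slice of the global limit. A minimal sketch; the rate, pod count, and safety margin are illustrative:

```python
# Local token bucket used while Redis is down. Approximate by design:
# each of N pods enforces (global rate / N) minus a safety margin, so the
# fleet as a whole stays under the true limit without coordination.
import time

class LocalTokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds 503 with Retry-After

# Illustrative sizing: global limit 1000 req/s across 4 pods, 20% margin.
fallback = LocalTokenBucket(rate_per_s=1000 / 4 * 0.8, burst=50)
```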
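Rows 27-28 assume a hash chain per (tenantId, deviceId, day). A minimal sketch of the verifier's core loop, assuming each archived record carries the SHA-256 of the previous record's canonical JSON in a prevHash field (the field name is illustrative):

```python
# Walk a day's chain: record i+1 must reference the digest of record i.
# A mismatch pins the break to an offset, which scopes the quarantine.
import hashlib
import json

def verify_chain(records, genesis="0" * 64):
    prev_digest = genesis
    for offset, record in enumerate(records):
        if record["prevHash"] != prev_digest:
            return offset  # first broken link -> quarantine this day
        canonical = json.dumps(record, sort_keys=True).encode()
        prev_digest = hashlib.sha256(canonical).hexdigest()
    return None  # chain intact
```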
2. Recovery Objectives
| Tier | RPO | RTO |
|---|---|---|
| Ingest (edge to ingestion-gateway) | 0 (spool) | 5 min |
| Kafka | 0 (replication) | 5 min (single broker); 30 min (full cluster) |
| HOT (OpenSearch) | 6h (snapshot) | 1h (restart) / 4h (snapshot restore) |
| WARM | same as HOT | same as HOT |
| COLD (MinIO/S3) | 0 (object storage durability) | 15 min for regional fail-over |
| Postgres | 5 min (WAL ship) | 30 min |
| Redis | best-effort | N/A (fallback mode) |
| Full region (multi-region, opt-in) | 15 min | 2h |
3. Blast Radius Diagram
In the diagram, COLD is the green "source of truth" anchor: any higher tier can be rebuilt from COLD as long as the archive is current (< 10 min lag). A sketch of such a rebuild follows.
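A minimal sketch of what "rebuilt from COLD" means, assuming the archive is day-partitioned Parquet and reusing the idempotent _id scheme from the dedup sketch above, so a partial rebuild can simply be re-run (paths, client setup, and field names are illustrative):

```python
# Stream a tenant's Parquet objects from COLD back into a fresh HOT index.
import pyarrow.parquet as pq
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://opensearch:9200"])  # illustrative endpoint

def rebuild_from_cold(parquet_paths, tenant_id):
    for path in parquet_paths:  # e.g. objects pulled from the archive bucket
        table = pq.read_table(path)
        actions = (
            {
                "_op_type": "create",            # idempotent: re-runs are no-ops
                "_index": f"events-{tenant_id}",
                "_id": row["eventId"],
                "_source": row,
            }
            for row in table.to_pylist()
        )
        helpers.bulk(client, actions, raise_on_error=False)
```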
4. Game Day
Recurring chaos exercises (per roadmap.md week 5):
| Exercise | Cadence | Expected outcome |
|---|---|---|
| Kill a random Kafka broker | weekly | No ingest impact; alert fires and clears within 3 min (sketch after this table) |
| Delete HOT index for tenant X | quarterly | Replay from COLD rebuilds HOT within SLA |
| Sever edge → cluster uplink for 6h | monthly | Edge spool absorbs; reconnect replays in order |
| Cloud region simulated outage | quarterly | Secondary region promoted (if opt-in) |
| Corrupt a Parquet file (inject a bit flip) | monthly | Chain verifier flags it; alert fires |
| Schema break deliberately pushed | every PR | CI blocks; merge prevented |
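The weekly broker-kill exercise needs little more than the Kubernetes API. A minimal sketch using the official Python client; the namespace and label selector are illustrative and should match the actual Kafka deployment:

```python
# Game day: delete one random Kafka broker pod, then watch for the row 1
# behavior (alert fires and clears within 3 minutes, no ingest impact).
import random

from kubernetes import client, config

def kill_random_broker(namespace="kafka", selector="app=kafka"):
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    victim = random.choice(pods)
    print(f"deleting {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
```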
Results and runbooks live in deploy/runbooks/ (not in this tree).