Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) owns all autoscaling decisions
for the platform. Native HorizontalPodAutoscaler objects are kept in the
charts only as a fallback (rendered when `keda.enabled=false`). This doc
explains the design, the scaler inventory, the scaling math, failure modes,
and the cloud-portable path.
Why KEDA, not plain HPA
Plain HPA scales on CPU and memory. That is the wrong signal for a Kafka consumer: a consumer that is falling behind may be perfectly healthy on CPU while its partition lag is growing to disaster. Plain HPA also cannot scale to zero, cannot react to external signals (queue depth, Prometheus queries, cron schedules), and cannot run multi-source decisions.
KEDA closes those gaps:
| Capability | HPA | KEDA |
|---|---|---|
| CPU / memory scaling | yes | yes |
| Kafka consumer-group lag | no | yes |
| Prometheus-query scaling | no | yes |
| Scale to zero | no | yes |
| Scale triggered by events | no | yes |
| Multi-source triggers (OR logic) | no | yes |
| ScaledJob for batch workloads | no | yes |
| ClusterTriggerAuthentication | no | yes |
KEDA itself still creates an HPA behind the scenes — it is the external metrics provider feeding that HPA. You get the kernel of HPA's reconcile loop plus KEDA's richer signals.
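To make that concrete, here is a minimal sketch of a ScaledObject for the search-indexer consumer (the bootstrap address and consumer group are illustrative, not read from the charts):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: search-indexer
spec:
  scaleTargetRef:
    name: search-indexer              # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 6                  # = partition count (local)
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092  # illustrative address
        consumerGroup: search-indexer # illustrative group id
        topic: audit.events.v1
        lagThreshold: "1000"          # matches the inventory below
      authenticationRef:
        name: kafka-auth              # see TriggerAuthentication below
```

The operator materializes this as an HPA named `keda-hpa-search-indexer` whose external metric is served by `keda-metrics-apiserver`.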
Control plane
- `keda-operator` reconciles `ScaledObject`/`ScaledJob` CRs. It creates and owns the underlying HPA for each `ScaledObject` and dispatches Jobs for `ScaledJob`s.
- `keda-metrics-apiserver` registers on the Kubernetes `external.metrics.k8s.io` API and serves the metrics that HPA queries.
- `admission-webhooks` validates CRs on create/update.
Scaler inventory
| Service | Workload | Triggers | Min | Max | Lag threshold | Scale-to-zero |
|---|---|---|---|---|---|---|
| ingestion-gateway | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 20 (5 local) | — | no |
| query-api | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 10 (3 local) | — | no |
| web (Next.js) | Deployment | cpu (70%) | 2 | 6 (2 local) | — | no |
| schema-registry | Deployment | cpu (70%) | 2 | 4 | — | no |
| archive-writer | StatefulSet | kafka (audit.events.v1) | 2 | = partitions | 2000 | no |
| search-indexer | Deployment | kafka (audit.events.v1) | 1 | = partitions | 1000 | no |
| replay-service | Deployment | kafka (audit.events.replay.v1) + prometheus (RPS) | 1 | 5 | 500 | no |
| report-service | Deployment | kafka (report.jobs.v1) | 1 | 5 | 10 | optional |
| report-service (ScaledJob mode) | Job | kafka (report.jobs.v1) | 0 | 20 jobs | 1 | yes |
The `mode` value on report-service switches the chart between Deployment + ScaledObject (steady-state) and ScaledJob (scale-to-zero between investigations), as sketched below. The local default is `deployment`; the prod default is `job`, because report generation is bursty and expensive.
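A sketch of the toggle as it might appear in the per-environment values (the exact key path is an assumption, not copied from the chart):

```yaml
# Assumed chart values layout; the real key path may differ.
keda:
  mode: job   # "deployment" = ScaledObject (local default); "job" = ScaledJob (prod default)
```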
Scaling math (ingestion @ 100K events/sec per hospital)
Assumptions:
- Average envelope: ~1 KB post-validation
- Kafka topic `audit.events.v1`: 24 partitions in prod, 6 in local
- Per-consumer throughput: ~10K msg/s sustained with batch size 500
- Lag threshold: 2000 messages per partition (archive-writer), 1000 (search-indexer)
At steady state:
- archive-writer: 100K/s ÷ 10K/s per replica ≈ 10 replicas. Ceiling is 24 partitions; we sit comfortably below.
- search-indexer: similar shape; OpenSearch bulk indexing is typically the bottleneck, not CPU. Replicas track bulk-queue saturation at ~8-12 during peak.
- Burst handling: a 3× spike pushes consumers toward the ceiling. Because KEDA's lag trigger is linear in `currentLag / threshold`, desired replicas grow proportionally until the partition ceiling caps them (worked example below).
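A worked example of that proportionality for archive-writer, assuming HPA AverageValue semantics and illustrative lag numbers:

```
desired = ceil(totalLag / lagThreshold)           # lagThreshold = 2000
steady state : ceil(20000 / 2000) = 10 replicas
3× burst     : ceil(60000 / 2000) = 30 → capped at maxReplicaCount = 24
```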
Partition ceiling is the hard limit. Adding replicas beyond the partition count gives you idle pods that will never join the consumer group. If you need more parallelism, either shard the topic (create a new topic with more partitions and migrate producers) or shard at the application level (per-tenant topic).
Cooldowns, polling, and thrash
| Knob | Default | Why |
|---|---|---|
| `pollingInterval` | 15s | Fast enough to catch growing lag; slow enough to avoid hammering the Kafka admin API |
| `cooldownPeriod` | 120s | Prevents churn during short lulls |
| `minReplicaCount` | 1 or 2 | Availability guardrail |
| `idleReplicaCount` | unset | Only report-service (ScaledObject mode) can go idle at 0 |
| Scale-up policy | +100% / 30s | Aggressive up; lag is expensive |
| Scale-down policy | -10% / 60s, 300s stabilization window | Gentle down; avoids flapping |
Rules of thumb:
- Polling too fast: lots of metric load, little benefit below 10s.
- Polling too slow: backlog accumulates before the operator notices.
- Cooldown too short: pod thrash — each restart pays the JVM/Node warm-up cost and may rebalance the Kafka consumer group.
- Cooldown too long: money left on the table during slack periods.
All defaults are tunable per service via `.Values.keda.*` in the chart values and overridable per environment via `apps.yaml` / `apps.local.yaml`. The sketch below shows roughly how they land on the rendered ScaledObject.
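A sketch of the rendered output, not a verbatim chart template:

```yaml
spec:
  pollingInterval: 15            # seconds between scaler polls
  cooldownPeriod: 120            # wait after the last active signal before scaling in
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          policies:
            - type: Percent
              value: 100         # +100% of current replicas...
              periodSeconds: 30  # ...at most every 30s
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 10          # -10% per period
              periodSeconds: 60
```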
ScaledObject vs ScaledJob
Use ScaledObject when:
- The workload is long-running (consumer, HTTP service).
- You want to scale replicas up and down but keep the process resident.
- Cold start would hurt tail latency.
Use ScaledJob when:
- The workload is event-triggered and each event takes minutes to hours.
- Scale-to-zero saves real money.
- Each invocation is independent and idempotent.
- Tail latency on cold start is acceptable.
report-service fits both shapes. Bundle generation for a legal case
reads from cold archive (MinIO/S3), queries OpenSearch, renders a PDF,
and uploads the result — it can take 30-120 seconds per investigation
and may see long gaps between requests. ScaledJob mode is ideal in
prod. ScaledObject mode (min 1 replica) is easier to debug in local.
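A hedged sketch of the ScaledJob shape for report-service (image name, consumer group, and history limits are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: report-service
spec:
  pollingInterval: 15
  maxReplicaCount: 20                  # at most 20 concurrent report Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092   # illustrative address
        consumerGroup: report-service  # illustrative group id
        topic: report.jobs.v1
        lagThreshold: "1"              # one Job per pending report
      authenticationRef:
        name: kafka-auth
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: report-service:latest   # illustrative image tag
```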
TriggerAuthentication
The keda-triggers chart installs shared TriggerAuthentication objects
in every namespace that has scaled consumers:
| Object | Purpose |
|---|---|
| `TriggerAuthentication/kafka-auth` | Kafka SASL credentials for the kafka scaler |
| `TriggerAuthentication/prometheus-auth` | Optional bearer token for authed Prometheus |
| `ClusterTriggerAuthentication/kafka-auth` | Same, cluster-scoped for multi-namespace |
In local (KRaft + SASL_PLAINTEXT + PLAIN) the chart creates the
secret itself from values. In prod, the chart expects the secret to be
pre-created by ExternalSecrets (pulling from Vault / AWS Secrets
Manager / Azure Key Vault / GCP Secret Manager). Rotate by updating the
source — the TriggerAuthentication reference is stable.
```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
spec:
  secretTargetRef:
    - parameter: sasl
      name: kafka-sasl
      key: mechanism
    - parameter: username
      name: kafka-sasl
      key: username
    - parameter: password
      name: kafka-sasl
      key: password
    - parameter: tls
      name: kafka-sasl
      key: tls
```
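In prod, the pre-created `kafka-sasl` secret would come from an ExternalSecret along these lines (a sketch: the store name and remote key paths are assumptions, and the `mechanism`/`tls` entries are elided):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: kafka-sasl
spec:
  refreshInterval: 1h               # rotation is picked up without pod restarts
  secretStoreRef:
    name: vault-backend             # assumed store name
    kind: ClusterSecretStore
  target:
    name: kafka-sasl                # the Secret the TriggerAuthentication references
  data:
    - secretKey: username
      remoteRef:
        key: kafka/sasl             # assumed Vault path
        property: username
    - secretKey: password
      remoteRef:
        key: kafka/sasl
        property: password
```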
Interaction with PDBs
Every consumer chart ships a PodDisruptionBudget with `maxUnavailable: 1` (sketch below). The PDB gates voluntary disruptions that go through the eviction API (node drains, cluster upgrades), so maintenance can never take out more than one consumer pod at a time, even mid-scale. Scale-down itself is not an eviction: when KEDA lowers the target, the controller deletes pods directly and the PDB does not delay it. For involuntary disruptions (node failure) the PDB offers no protection; Kubernetes will still lose the pod.
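A minimal sketch of the shipped PDB (the label selector is an assumption about the charts' labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: archive-writer                          # illustrative
spec:
  maxUnavailable: 1                             # at most one pod evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: archive-writer    # assumed label key
```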
Observability
KEDA exports Prometheus metrics out of the box:
| Metric | Meaning |
|---|---|
| `keda_scaler_metrics_value` | Current value returned by a scaler (e.g. Kafka lag) |
| `keda_scaler_active` | 1 if the scaler is active; 0 means min-replica mode |
| `keda_scaler_errors` | Scaler error count |
| `keda_scaled_object_errors` | ScaledObject reconcile errors |
| `keda_resource_totals` | Count of ScaledObjects / ScaledJobs |
Suggested Grafana panels:
- Lag per topic/partition (stacked) — `keda_scaler_metrics_value{scaledObject=~".*archive-writer.*"}`
- Active state per scaler — timeline of `keda_scaler_active`
- Replica count per deployment — `kube_deployment_status_replicas` overlaid with KEDA active markers
- Error rate — `rate(keda_scaler_errors_total[5m])`
Alert rules (Prometheus):
```yaml
- alert: KedaScaledObjectErrors
  expr: increase(keda_scaled_object_errors_total[5m]) > 0
  for: 5m
  labels: {severity: warning}
  annotations:
    summary: "KEDA ScaledObject errors in {{ $labels.namespace }}/{{ $labels.scaledObject }}"
- alert: KedaConsumerLagHighSustained
  expr: keda_scaler_metrics_value{scaler="kafka"} > 10000
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Kafka lag sustained >10k on {{ $labels.scaledObject }}"
- alert: KedaOperatorDown
  expr: up{job="keda-operator"} == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: "KEDA operator is down — no new scaling decisions"
```

Failure modes
| Failure | Effect | Recovery |
|---|---|---|
| KEDA operator down | No new ScaledObjects reconciled; existing HPAs keep running on last metric values | Restart operator; pods self-heal in < 1 min |
| KEDA metrics-apiserver down | HPA external.metrics.k8s.io queries fail; HPA holds last replicas | Redundant replicas (2+); rolling restart |
| Prometheus unreachable | Prometheus triggers return errors → scaler inactive. Kafka triggers unaffected. | Fix Prometheus; triggers recover next poll |
| Kafka unreachable | Kafka triggers inactive → services scale to minReplicaCount. Ingestion is also offline. | Once Kafka recovers, scalers reactivate |
| Partition reshard | maxReplicaCount may now exceed the partition count (idle pods) or fall below it (bottleneck) | Chart upgrade: update values with the new partition ceiling and redeploy |
| Secret rotation (Kafka SASL) | Scaler auths with new creds on next poll; no restart needed | ExternalSecrets refresh loop |
Cloud portability
KEDA runs identically on every managed Kubernetes because it uses only standard k8s primitives. The triggers themselves are cloud-agnostic:
| Cloud | Install | Notes |
|---|---|---|
| EKS | Helm chart kedacore/keda | No add-on; install like we do locally |
| AKS | AKS KEDA add-on or Helm | Microsoft maintains a managed add-on |
| GKE | Helm chart | Compatible with Workload Identity for Kafka auth |
| OpenShift | keda Operator via OperatorHub | OLM-managed |
Managed Kafka replacements (MSK, Event Hubs' Kafka surface, Confluent
Cloud) all work with KEDA's kafka scaler: update the
TriggerAuthentication secret and the `bootstrapServers` value. No app or
chart change.
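The only diff when pointing at a managed cluster is the trigger metadata; a sketch with an illustrative MSK endpoint:

```yaml
triggers:
  - type: kafka
    metadata:
      bootstrapServers: b-1.cluster.kafka.us-east-1.amazonaws.com:9096  # illustrative MSK endpoint
      consumerGroup: archive-writer
      topic: audit.events.v1
      lagThreshold: "2000"
    authenticationRef:
      name: kafka-auth            # unchanged; only the backing secret rotates
```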
Cost impact
- report-service scale-to-zero (ScaledJob mode) saves one pod's memory for the time between investigations. A hospital might run ~5 investigations per month; a steady state of 1 idle replica × 512 MiB × 730 h ≈ 365 GiB·h/month, and scale-to-zero reclaims ~95% of that.
- Consumer right-sizing: without KEDA, you over-provision for peak. With KEDA you provision for p95 and burst up in 30-60 s. Typical savings are 30-40% of compute cost on event-driven workloads.
Roadmap touchpoints
- Day 4 (MVP): enable KEDA on `search-indexer` first — easy to observe via the OpenSearch bulk queue.
- Week 3 (2-month plan): extend to all consumers; add `ScaledJob` for report-service.
- Week 5: evaluate the KEDA HTTP add-on for request-based scaling of ingestion-gateway (currently CPU + Prometheus RPS approximate the same signal).
- Week 7: wire `ExternalSecrets` → `TriggerAuthentication` for SASL credential rotation.