Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) owns all autoscaling decisions
for the platform. Native HorizontalPodAutoscaler objects are kept in the
charts only as a fallback (rendered when `keda.enabled=false`). This doc
explains the design, the scaler inventory, the scaling math, failure modes,
and the cloud-portable path.
Why KEDA, not plain HPA
Plain HPA scales on CPU and memory. That is the wrong signal for a Kafka consumer: a consumer that is falling behind may be perfectly healthy on CPU while its partition lag is growing to disaster. Plain HPA also cannot scale to zero, cannot react to external signals (queue depth, Prometheus queries, cron schedules), and cannot run multi-source decisions.
KEDA closes those gaps:
| Capability | HPA | KEDA |
|---|---|---|
| CPU / memory scaling | yes | yes |
| Kafka consumer-group lag | no | yes |
| Prometheus-query scaling | no | yes |
| Scale to zero | no | yes |
| Scale triggered by events | no | yes |
| Multi-source triggers (OR logic) | no | yes |
| ScaledJob for batch workloads | no | yes |
| ClusterTriggerAuthentication | no | yes |
KEDA itself still creates an HPA behind the scenes — it is the external metrics provider feeding that HPA. You get the kernel of HPA's reconcile loop plus KEDA's richer signals.
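To make that concrete, here is a minimal sketch of a ScaledObject for the search-indexer consumer (the bootstrap address and consumer group are illustrative, not read from the charts):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: search-indexer
spec:
  scaleTargetRef:
    name: search-indexer              # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 6                  # = partition count (local)
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092  # illustrative address
        consumerGroup: search-indexer # illustrative group id
        topic: audit.events.v1
        lagThreshold: "1000"          # matches the inventory below
      authenticationRef:
        name: kafka-auth              # see TriggerAuthentication below
```

The operator materializes this as an HPA named `keda-hpa-search-indexer` whose external metric is served by `keda-metrics-apiserver`.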
Control plane
- `keda-operator` reconciles `ScaledObject`/`ScaledJob` CRs. It creates and owns the underlying HPA for each `ScaledObject` and dispatches Jobs for `ScaledJob`s.
- `keda-metrics-apiserver` registers on the Kubernetes `external.metrics.k8s.io` API and serves the metrics that HPA queries.
- `admission-webhooks` validates CRs on create/update.
Scaler inventory
| Service | Workload | Triggers | Min | Max | Lag threshold | Scale-to-zero |
|---|---|---|---|---|---|---|
| ingestion-gateway | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 20 (5 local) | — | no |
| query-api | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 10 (3 local) | — | no |
| web (Next.js) | Deployment | cpu (70%) | 2 | 6 (2 local) | — | no |
| schema-registry | Deployment | cpu (70%) | 2 | 4 | — | no |
| archive-writer | StatefulSet | kafka (audit.events.v1) | 2 | = partitions | 2000 | no |
| search-indexer | Deployment | kafka (audit.events.v1) | 1 | = partitions | 1000 | no |
| replay-service | Deployment | kafka (audit.events.replay.v1) + prometheus (RPS) | 1 | 5 | 500 | no |
| report-service | Deployment | kafka (report.jobs.v1) | 1 | 5 | 10 | optional |
| report-service (ScaledJob mode) | Job | kafka (report.jobs.v1) | 0 | 20 jobs | 1 | yes |
The `mode` value on report-service switches the chart between Deployment + ScaledObject (steady-state) and ScaledJob (scale-to-zero between investigations), as sketched below. The local default is `deployment`; the prod default is `job`, because report generation is bursty and expensive.
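A sketch of the toggle as it might appear in the per-environment values (the exact key path is an assumption, not copied from the chart):

```yaml
# Assumed chart values layout; the real key path may differ.
keda:
  mode: job   # "deployment" = ScaledObject (local default); "job" = ScaledJob (prod default)
```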
Scaling math (ingestion @ 100K events/sec per hospital)
Assumptions:
- Average envelope: ~1 KB post-validation
- Kafka topic `audit.events.v1`: 24 partitions in prod, 6 in local
- Per-consumer throughput: ~10K msg/s sustained with batch size 500
- Lag threshold: 2000 messages per partition (archive-writer), 1000 (search-indexer)
At steady state:
- archive-writer: 100K/s ÷ 10K/s per replica ≈ 10 replicas. Ceiling is 24 partitions; we sit comfortably below.
- search-indexer: similar shape; OpenSearch bulk indexing is typically the bottleneck, not CPU. Replicas track bulk-queue saturation at ~8-12 during peak.
- Burst handling: a 3× spike pushes consumers toward the ceiling. Because KEDA's lag trigger is linear in `currentLag / threshold`, desired replicas grow proportionally until the partition ceiling caps them (worked example below).
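A worked example of that proportionality for archive-writer, assuming HPA AverageValue semantics and illustrative lag numbers:

```
desired = ceil(totalLag / lagThreshold)           # lagThreshold = 2000
steady state : ceil(20000 / 2000) = 10 replicas
3× burst     : ceil(60000 / 2000) = 30 → capped at maxReplicaCount = 24
```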
Partition ceiling is the hard limit. Adding replicas beyond the partition count gives you idle pods that will never join the consumer group. If you need more parallelism, either shard the topic (create a new topic with more partitions and migrate producers) or shard at the application level (per-tenant topic).
Cooldowns, polling, and thrash
| Knob | Default | Why |
|---|---|---|
| `pollingInterval` | 15s | Fast enough to catch growing lag; slow enough to avoid hammering the Kafka admin API |
| `cooldownPeriod` | 120s | Prevents churn during short lulls |
| `minReplicaCount` | 1 or 2 | Availability guardrail |
| `idleReplicaCount` | unset | Only report-service (ScaledObject mode) can go idle at 0 |
| Scale-up policy | +100% / 30s | Aggressive up; lag is expensive |
| Scale-down policy | -10% / 60s, 300s stabilization window | Gentle down; avoids flapping |
Rules of thumb:
- Polling too fast: lots of metric load, little benefit below 10s.
- Polling too slow: backlog accumulates before the operator notices.
- Cooldown too short: pod thrash — each restart pays the JVM/Node warm-up cost and may rebalance the Kafka consumer group.
- Cooldown too long: money left on the table during slack periods.
All defaults are tunable per service via `.Values.keda.*` in the chart values and overridable per environment via `apps.yaml` / `apps.local.yaml`. The sketch below shows roughly how they land on the rendered ScaledObject.
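A sketch of the rendered output, not a verbatim chart template:

```yaml
spec:
  pollingInterval: 15            # seconds between scaler polls
  cooldownPeriod: 120            # wait after the last active signal before scaling in
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          policies:
            - type: Percent
              value: 100         # +100% of current replicas...
              periodSeconds: 30  # ...at most every 30s
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 10          # -10% per period
              periodSeconds: 60
```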
ScaledObject vs ScaledJob
Use ScaledObject when:
- The workload is long-running (consumer, HTTP service).
- You want to scale replicas up and down but keep the process resident.
- Cold start would hurt tail latency.
Use ScaledJob when:
- The workload is event-triggered and each event takes minutes to hours.
- Scale-to-zero saves real money.
- Each invocation is independent and idempotent.
- Tail latency on cold start is acceptable.
report-service fits both shapes. Bundle generation for a legal case
reads from cold archive (MinIO/S3), queries OpenSearch, renders a PDF,
and uploads the result — it can take 30-120 seconds per investigation
and may see long gaps between requests. ScaledJob mode is ideal in
prod. ScaledObject mode (min 1 replica) is easier to debug in local.
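A hedged sketch of the ScaledJob shape for report-service (image name, consumer group, and history limits are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: report-service
spec:
  pollingInterval: 15
  maxReplicaCount: 20                  # at most 20 concurrent report Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092   # illustrative address
        consumerGroup: report-service  # illustrative group id
        topic: report.jobs.v1
        lagThreshold: "1"              # one Job per pending report
      authenticationRef:
        name: kafka-auth
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: report-service:latest   # illustrative image tag
```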
TriggerAuthentication
The keda-triggers chart installs shared TriggerAuthentication objects
in every namespace that has scaled consumers:
| Object | Purpose |
|---|---|
| `TriggerAuthentication/kafka-auth` | Kafka SASL credentials for the kafka scaler |
| `TriggerAuthentication/prometheus-auth` | Optional bearer token for authed Prometheus |
| `ClusterTriggerAuthentication/kafka-auth` | Same, cluster-scoped for multi-namespace |
In local (KRaft + SASL_PLAINTEXT + PLAIN) the chart creates the
secret itself from values. In prod, the chart expects the secret to be
pre-created by ExternalSecrets (pulling from Vault / AWS Secrets
Manager / Azure Key Vault / GCP Secret Manager). Rotate by updating the
source — the TriggerAuthentication reference is stable.
```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
spec:
  secretTargetRef:
    - parameter: sasl
      name: kafka-sasl
      key: mechanism
    - parameter: username
      name: kafka-sasl
      key: username
    - parameter: password
      name: kafka-sasl
      key: password
    - parameter: tls
      name: kafka-sasl
      key: tls
```
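In prod, the pre-created `kafka-sasl` secret would come from an ExternalSecret along these lines (a sketch: the store name and remote key paths are assumptions, and the `mechanism`/`tls` entries are elided):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: kafka-sasl
spec:
  refreshInterval: 1h               # rotation is picked up without pod restarts
  secretStoreRef:
    name: vault-backend             # assumed store name
    kind: ClusterSecretStore
  target:
    name: kafka-sasl                # the Secret the TriggerAuthentication references
  data:
    - secretKey: username
      remoteRef:
        key: kafka/sasl             # assumed Vault path
        property: username
    - secretKey: password
      remoteRef:
        key: kafka/sasl
        property: password
```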
Interaction with PDBs
Every consumer chart ships a PodDisruptionBudget with `maxUnavailable: 1` (sketch below). The PDB gates voluntary disruptions that go through the eviction API (node drains, cluster upgrades), so maintenance can never take out more than one consumer pod at a time, even mid-scale. Scale-down itself is not an eviction: when KEDA lowers the target, the controller deletes pods directly and the PDB does not delay it. For involuntary disruptions (node failure) the PDB offers no protection; Kubernetes will still lose the pod.
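A minimal sketch of the shipped PDB (the label selector is an assumption about the charts' labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: archive-writer                          # illustrative
spec:
  maxUnavailable: 1                             # at most one pod evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: archive-writer    # assumed label key
```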
Observability
KEDA exports Prometheus metrics out of the box:
| Metric | Meaning |
|---|---|
| `keda_scaler_metrics_value` | Current value returned by a scaler (e.g. Kafka lag) |
| `keda_scaler_active` | 1 if the scaler is active; 0 means min-replica mode |
| `keda_scaler_errors` | Scaler error count |
| `keda_scaled_object_errors` | ScaledObject reconcile errors |
| `keda_resource_totals` | Count of ScaledObjects / ScaledJobs |
Suggested Grafana panels:
- Lag per topic/partition (stacked) — `keda_scaler_metrics_value{scaledObject=~".*archive-writer.*"}`
- Active state per scaler — timeline of `keda_scaler_active`
- Replica count per deployment — `kube_deployment_status_replicas` overlaid with KEDA active markers
- Error rate — `rate(keda_scaler_errors_total[5m])`
Alert rules (Prometheus):
```yaml
- alert: KedaScaledObjectErrors
  expr: increase(keda_scaled_object_errors_total[5m]) > 0
  for: 5m
  labels: {severity: warning}
  annotations:
    summary: "KEDA ScaledObject errors in {{ $labels.namespace }}/{{ $labels.scaledObject }}"
- alert: KedaConsumerLagHighSustained
  expr: keda_scaler_metrics_value{scaler="kafka"} > 10000
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Kafka lag sustained >10k on {{ $labels.scaledObject }}"
- alert: KedaOperatorDown
  expr: up{job="keda-operator"} == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: "KEDA operator is down — no new scaling decisions"
```

Failure modes
| Failure | Effect | Recovery |
|---|---|---|
| KEDA operator down | No new ScaledObjects reconciled; existing HPAs keep running on last metric values | Restart operator; pods self-heal in < 1 min |
| KEDA metrics-apiserver down | HPA external.metrics.k8s.io queries fail; HPA holds last replicas | Redundant replicas (2+); rolling restart |
| Prometheus unreachable | Prometheus triggers return errors → scaler inactive. Kafka triggers unaffected. | Fix Prometheus; triggers recover next poll |
| Kafka unreachable | Kafka triggers inactive → services scale to minReplicaCount. Ingestion is also offline. | Once Kafka recovers, scalers reactivate |
| Partition reshard | maxReplicaCount may now exceed the partition count (idle pods) or fall below it (bottleneck) | Chart upgrade: update values with the new partition ceiling and redeploy |
| Secret rotation (Kafka SASL) | Scaler auths with new creds on next poll; no restart needed | ExternalSecrets refresh loop |
Cloud portability
KEDA runs identically on every managed Kubernetes because it uses only standard k8s primitives. The triggers themselves are cloud-agnostic:
| Cloud | Install | Notes |
|---|---|---|
| EKS | Helm chart kedacore/keda | No add-on; install like we do locally |
| AKS | AKS KEDA add-on or Helm | Microsoft maintains a managed add-on |
| GKE | Helm chart | Compatible with Workload Identity for Kafka auth |
| OpenShift | keda Operator via OperatorHub | OLM-managed |
Managed Kafka replacements (MSK, Event Hubs' Kafka surface, Confluent
Cloud) all work with KEDA's kafka scaler: update the
TriggerAuthentication secret and the `bootstrapServers` value. No app or
chart change.
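The only diff when pointing at a managed cluster is the trigger metadata; a sketch with an illustrative MSK endpoint:

```yaml
triggers:
  - type: kafka
    metadata:
      bootstrapServers: b-1.cluster.kafka.us-east-1.amazonaws.com:9096  # illustrative MSK endpoint
      consumerGroup: archive-writer
      topic: audit.events.v1
      lagThreshold: "2000"
    authenticationRef:
      name: kafka-auth            # unchanged; only the backing secret rotates
```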
Cost impact
- report-service scale-to-zero (ScaledJob mode) saves one pod's memory for the time between investigations. A hospital might run ~5 investigations per month; a steady state of 1 idle replica × 512 MiB × 730 h ≈ 365 GiB·h/month, and scale-to-zero reclaims ~95% of that.
- Consumer right-sizing: without KEDA, you over-provision for peak. With KEDA you provision for p95 and burst up in 30-60 s. Typical savings are 30-40% of compute cost on event-driven workloads.
Roadmap touchpoints
- Day 4 (MVP): enable KEDA on `search-indexer` first — easy to observe via the OpenSearch bulk queue.
- Week 3 (2-month plan): extend to all consumers; add `ScaledJob` for report-service.
- Week 5: evaluate the KEDA HTTP add-on for request-based scaling of ingestion-gateway (currently CPU + Prometheus RPS approximate the same signal).
- Week 7: wire `ExternalSecrets` → `TriggerAuthentication` for SASL credential rotation.