
Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) owns all autoscaling decisions for the platform. Native HorizontalPodAutoscaler objects are kept in the charts only as a fallback (rendered when keda.enabled=false). This doc explains the design, the scaler inventory, the scaling math, failure modes, and the cloud-portable path.


Why KEDA, not plain HPA

Plain HPA scales on CPU and memory. That is the wrong signal for a Kafka consumer: a consumer that is falling behind may be perfectly healthy on CPU while its partition lag is growing to disaster. Plain HPA also cannot scale to zero, cannot react to external signals (queue depth, Prometheus queries, cron schedules), and cannot run multi-source decisions.

KEDA:

| Capability | HPA | KEDA |
|---|---|---|
| CPU / memory scaling | yes | yes |
| Kafka consumer-group lag | no | yes |
| Prometheus-query scaling | no | yes |
| Scale to zero | no | yes |
| Scale triggered by events | no | yes |
| Multi-source triggers (OR logic) | no | yes |
| ScaledJob for batch workloads | no | yes |
| ClusterTriggerAuthentication | no | yes |

KEDA itself still creates an HPA behind the scenes — it is the external metrics provider feeding that HPA. You get the kernel of HPA's reconcile loop plus KEDA's richer signals.
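
For illustration, a minimal ScaledObject for ingestion-gateway combining the CPU and Prometheus triggers from the inventory below (the Prometheus address, query, and per-replica RPS threshold are assumptions, not values lifted from the chart):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ingestion-gateway
spec:
  scaleTargetRef:
    name: ingestion-gateway            # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                    # same 70% target a plain HPA would use
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090                        # assumed in-cluster address
        query: sum(rate(http_requests_total{service="ingestion-gateway"}[2m]))  # illustrative RPS query
        threshold: "500"                                                        # assumed target RPS per replica
```

KEDA evaluates the triggers with OR semantics: whichever trigger demands the most replicas wins.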


Control plane

  • keda-operator reconciles ScaledObject / ScaledJob CRs. It creates and owns the underlying HPA for each ScaledObject and dispatches Jobs for ScaledJobs.
  • keda-metrics-apiserver registers with the Kubernetes API aggregation layer as the external.metrics.k8s.io service and serves the metrics that the HPA queries.
  • admission-webhooks validates CRs on create/update.

Scaler inventory

| Service | Workload | Triggers | Min | Max | Lag threshold | Scale-to-zero |
|---|---|---|---|---|---|---|
| ingestion-gateway | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 20 (5 local) | n/a | no |
| query-api | Deployment | cpu (70%), prometheus (RPS/replica) | 2 | 10 (3 local) | n/a | no |
| web (Next.js) | Deployment | cpu (70%) | 2 | 6 (2 local) | n/a | no |
| schema-registry | Deployment | cpu (70%) | 2 | 4 | n/a | no |
| archive-writer | StatefulSet | kafka (audit.events.v1) | 2 | = partitions | 2000 | no |
| search-indexer | Deployment | kafka (audit.events.v1) | 1 | = partitions | 1000 | no |
| replay-service | Deployment | kafka (audit.events.replay.v1) + prometheus (RPS) | 1 | 5 | 500 | no |
| report-service | Deployment | kafka (report.jobs.v1) | 1 | 5 | 10 | optional |
| report-service (ScaledJob mode) | Job | kafka (report.jobs.v1) | 0 | 20 jobs | 1 | yes |

The mode value on report-service switches the chart between Deployment + ScaledObject (steady-state) and ScaledJob (scale-to-zero between investigations). The local default is deployment; the prod default is job because report generation is bursty and expensive.
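
A sketch of the per-environment override, assuming the chart exposes the switch under a keda.mode key (the exact key names in apps.yaml / apps.local.yaml may differ):

```yaml
# apps.yaml (prod) -- hypothetical key names
report-service:
  keda:
    mode: job              # render a ScaledJob; scale to zero between investigations
    maxReplicaCount: 20

# apps.local.yaml -- hypothetical key names
report-service:
  keda:
    mode: deployment       # render Deployment + ScaledObject; keep 1 replica for easy debugging
    minReplicaCount: 1
```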


Scaling math (ingestion @ 100K events/sec per hospital)

Assumptions:

  • Average envelope: ~1 KB post-validation
  • Kafka topic audit.events.v1: 24 partitions in prod, 6 in local
  • Per-consumer throughput: ~10K msg/s sustained with batch size 500
  • Lag threshold: 2000 messages per partition (archive-writer), 1000 (search-indexer)

At steady state:

  • archive-writer: 100K/s ÷ 10K/s per replica ≈ 10 replicas. Ceiling is 24 partitions; we sit comfortably below.
  • search-indexer: similar shape; OpenSearch bulk indexing is typically the bottleneck, not CPU. Replicas track bulk-queue saturation at ~8-12 during peak.
  • Burst handling: a 3× spike pushes consumers toward ceiling. Because KEDA's lag trigger is linear in currentLag / threshold, desired replicas grow proportionally until the partition ceiling caps them.

Partition ceiling is the hard limit. Adding replicas beyond the partition count gives you idle pods that will never join the consumer group. If you need more parallelism, either shard the topic (create a new topic with more partitions and migrate producers) or shard at the application level (per-tenant topic).
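
To make the lag math concrete, a sketch of the archive-writer ScaledObject under the assumptions above (the broker address and consumer-group name are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: archive-writer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: archive-writer
  minReplicaCount: 2
  maxReplicaCount: 24                  # = prod partition count; keep in sync after a reshard
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092   # assumed broker address
        consumerGroup: archive-writer  # assumed consumer group id
        topic: audit.events.v1
        lagThreshold: "2000"           # desired replicas ≈ total lag / 2000
        # allowIdleConsumers defaults to false, so KEDA also caps replicas at the partition count
      authenticationRef:
        name: kafka-auth
```

With lagThreshold 2000, a total backlog of 20,000 messages asks for 10 replicas; anything above 48,000 is clamped at the 24-partition ceiling.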


Cooldowns, polling, and thrash

| Knob | Default | Why |
|---|---|---|
| pollingInterval | 15s | Fast enough to catch growing lag; slow enough to avoid hammering the Kafka admin API |
| cooldownPeriod | 120s | Prevents churn during short lulls |
| minReplicaCount | 1 or 2 | Availability guardrail |
| idleReplicaCount | unset | Only report-service (ScaledObject mode) can go idle at 0 |
| Scale-up policy | +100% / 30s | Aggressive up; lag is expensive |
| Scale-down policy | -10% / 60s, 300s stabilization window | Gentle down; avoids flapping |

Rules of thumb:

  • Polling too fast: lots of metric load, little benefit below 10s.
  • Polling too slow: backlog accumulates before the operator notices.
  • Cooldown too short: pod thrash — each restart pays the JVM/Node warm-up cost and may rebalance the Kafka consumer group.
  • Cooldown too long: money left on the table during slack periods.

All defaults are tunable per service via .Values.keda.* in the chart values and can be overridden per environment via apps.yaml / apps.local.yaml.
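
These knobs map onto a ScaledObject spec roughly as follows (a fragment using the values from the table above):

```yaml
spec:
  pollingInterval: 15          # seconds between scaler polls
  cooldownPeriod: 120          # wait after the last active trigger before scaling to zero (only where scale-to-zero is enabled)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          policies:
            - type: Percent
              value: 100       # allow doubling every 30s; lag is expensive
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 10        # shed at most 10% of replicas per 60s
              periodSeconds: 60
```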


ScaledObject vs ScaledJob

Use ScaledObject when:

  • The workload is long-running (consumer, HTTP service).
  • You want to scale replicas up and down but keep the process resident.
  • Cold start would hurt tail latency.

Use ScaledJob when:

  • The workload is event-triggered and each event takes minutes to hours.
  • Scale-to-zero saves real money.
  • Each invocation is independent and idempotent.
  • Tail latency on cold start is acceptable.

report-service fits both shapes. Bundle generation for a legal case reads from cold archive (MinIO/S3), queries OpenSearch, renders a PDF, and uploads the result — it can take 30-120 seconds per investigation and may see long gaps between requests. ScaledJob mode is ideal in prod. ScaledObject mode (min 1 replica) is easier to debug in local.
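
A hedged sketch of the ScaledJob shape for report-service in prod; the image reference and resource names are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: report-service
spec:
  pollingInterval: 15
  maxReplicaCount: 20                  # at most 20 concurrent report Jobs
  successfulJobsHistoryLimit: 5
  jobTargetRef:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: report-service
            image: registry.example.com/report-service:latest   # placeholder image
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092   # assumed broker address
        consumerGroup: report-service
        topic: report.jobs.v1
        lagThreshold: "1"              # one pending job message spawns one Job
      authenticationRef:
        name: kafka-auth
```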


TriggerAuthentication

The keda-triggers chart installs shared TriggerAuthentication objects in every namespace that has scaled consumers:

| Object | Purpose |
|---|---|
| TriggerAuthentication/kafka-auth | Kafka SASL credentials for the kafka scaler |
| TriggerAuthentication/prometheus-auth | Optional bearer token for authenticated Prometheus |
| ClusterTriggerAuthentication/kafka-auth | Same, cluster-scoped for multi-namespace use |

In local (KRaft + SASL_PLAINTEXT + PLAIN) the chart creates the secret itself from values. In prod, the chart expects the secret to be pre-created by ExternalSecrets (pulling from Vault / AWS Secrets Manager / Azure Key Vault / GCP Secret Manager). Rotate by updating the source — the TriggerAuthentication reference is stable.

```yaml
kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: kafka-auth
spec:
  secretTargetRef:
    - parameter: sasl
      name: kafka-sasl
      key: mechanism
    - parameter: username
      name: kafka-sasl
      key: username
    - parameter: password
      name: kafka-sasl
      key: password
    - parameter: tls
      name: kafka-sasl
      key: tls
```
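
For prod, a sketch of the ExternalSecret that would materialize the kafka-sasl secret referenced above; the secret store name and backend path are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: kafka-sasl
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend              # assumed store name
  target:
    name: kafka-sasl                 # the Secret the TriggerAuthentication points at
  data:
    - secretKey: mechanism
      remoteRef: {key: kafka/sasl, property: mechanism}   # assumed backend path
    - secretKey: username
      remoteRef: {key: kafka/sasl, property: username}
    - secretKey: password
      remoteRef: {key: kafka/sasl, property: password}
    - secretKey: tls
      remoteRef: {key: kafka/sasl, property: tls}
```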

Interaction with PDBs

Every consumer chart ships a PodDisruptionBudget with maxUnavailable: 1. The PDB does not gate KEDA scale-down itself (replica reduction deletes pods directly rather than going through the eviction API); it protects against voluntary disruptions such as node drains, so a drain that coincides with a scale-down never removes more than one consumer at a time. For involuntary disruptions (node failure), the PDB offers no protection; the pod is lost regardless.
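
A sketch of that PDB; the selector label is an assumption about how the charts label pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: archive-writer
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: archive-writer   # assumed chart label
```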


Observability

KEDA exports Prometheus metrics out of the box:

| Metric | Meaning |
|---|---|
| keda_scaler_metrics_value | Current value returned by a scaler (e.g. Kafka lag) |
| keda_scaler_active | 1 if the scaler is active; 0 means min-replica mode |
| keda_scaler_errors | Scaler error count |
| keda_scaled_object_errors | ScaledObject reconcile errors |
| keda_resource_totals | Count of ScaledObjects / ScaledJobs |

Suggested Grafana panels:

  • Lag per topic/partition (stacked): keda_scaler_metrics_value{scaledObject=~".*archive-writer.*"}
  • Active state per scaler: timeline of keda_scaler_active
  • Replica count per deployment: kube_deployment_status_replicas overlaid with KEDA active markers
  • Error rate: rate(keda_scaler_errors_total[5m])

Alert rules (Prometheus):

```yaml
- alert: KedaScaledObjectErrors
  expr: increase(keda_scaled_object_errors_total[5m]) > 0
  for: 5m
  labels: {severity: warning}
  annotations:
    summary: "KEDA ScaledObject errors in {{ $labels.namespace }}/{{ $labels.scaledObject }}"

- alert: KedaConsumerLagHighSustained
  expr: keda_scaler_metrics_value{scaler="kafka"} > 10000
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Kafka lag sustained >10k on {{ $labels.scaledObject }}"

- alert: KedaOperatorDown
  expr: up{job="keda-operator"} == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: "KEDA operator is down — no new scaling decisions"

Failure modes

| Failure | Effect | Recovery |
|---|---|---|
| KEDA operator down | No new ScaledObjects reconciled; existing HPAs keep running on last metric values | Restart operator; pods self-heal in < 1 min |
| KEDA metrics-apiserver down | HPA external.metrics.k8s.io queries fail; HPA holds last replicas | Redundant replicas (2+); rolling restart |
| Prometheus unreachable | Prometheus triggers return errors → scaler inactive. Kafka triggers unaffected. | Fix Prometheus; triggers recover on the next poll |
| Kafka unreachable | Kafka triggers inactive → services scale to minReplicaCount. Ingestion is also offline. | Once Kafka recovers, scalers reactivate |
| Partition reshard | maxReplicaCount may now exceed partitions (waste) or fall below (bottleneck) | Update values and redeploy; chart upgrade with the new partition ceiling |
| Secret rotation (Kafka SASL) | Scaler auths with new creds on next poll; no restart needed | ExternalSecrets refresh loop |

Cloud portability

KEDA runs identically on every managed Kubernetes because it uses only standard k8s primitives. The triggers themselves are cloud-agnostic:

| Cloud | Install | Notes |
|---|---|---|
| EKS | Helm chart kedacore/keda | No add-on; install like we do locally |
| AKS | AKS KEDA add-on or Helm | Microsoft maintains a managed add-on |
| GKE | Helm chart | Compatible with Workload Identity for Kafka auth |
| OpenShift | KEDA Operator via OperatorHub | OLM-managed |

Managed Kafka replacements (MSK, Event Hubs Kafka-surface, Confluent Cloud) all work with KEDA's kafka scaler. Update the TriggerAuthentication secret and bootstrapServers value. No app or chart change.
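
For example, assuming MSK with SASL/SCRAM, the only chart-visible change is the bootstrap address in the trigger metadata (the endpoint below is a placeholder); credentials keep flowing through the kafka-sasl secret:

```yaml
triggers:
  - type: kafka
    metadata:
      bootstrapServers: b-1.audit.example.kafka.us-east-1.amazonaws.com:9096   # placeholder MSK endpoint
      consumerGroup: archive-writer
      topic: audit.events.v1
      lagThreshold: "2000"
    authenticationRef:
      name: kafka-auth              # unchanged; ExternalSecrets rotates the underlying secret
```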


Cost impact

  • report-service scale-to-zero (ScaledJob mode) saves one pod × memory × time between investigations. A hospital might run ~5 investigations per month. A steady state of 1 idle replica × 512 MiB × 730 h ≈ 365 GiB-h/month; scale-to-zero reclaims ~95% of that.
  • Consumer right-sizing: without KEDA, you over-provision for peak. With KEDA you provision for p95 and burst up in 30-60 s. Typical savings 30-40 % of compute cost on event-driven workloads.

Roadmap touchpoints

  • Day 4 (MVP): enable KEDA on search-indexer first — easy to observe via OpenSearch bulk queue.
  • Week 3 (2-month plan): extend to all consumers; add ScaledJob for report-service.
  • Week 5: KEDA HTTP add-on evaluation for request-based scaling of ingestion-gateway (currently CPU + Prometheus RPS approximate the same signal).
  • Week 7: wire ExternalSecrets into TriggerAuthentication for SASL credential rotation.