
Problem #10: Design a Database Batch Auditing Service — Full Deep Dive


1. Problem Statement

Design a database auditing system that captures every data change (INSERT, UPDATE, DELETE) across multiple databases, stores an immutable audit trail, and supports batch verification and compliance reporting.

Core business goals:

  • Regulatory compliance (SOX, GDPR, HIPAA) — know who changed what and when
  • Data integrity verification — detect unauthorized modifications
  • Change history — rollback to any point in time
  • Forensic analysis — investigate incidents with full change timeline

2. Functional Requirements

Core

| ID | Requirement |
|------|-------------|
| FR-1 | Capture all data changes (INSERT/UPDATE/DELETE) with before/after values |
| FR-2 | Record who made the change, when, and from where |
| FR-3 | Immutable audit log (append-only, tamper-evident) |
| FR-4 | Batch verification: compare current DB state against the audit log |
| FR-5 | Query audit history by entity, time range, user, or change type |
| FR-6 | Support multiple source databases |

Optional

| ID | Requirement |
|-------|-------------|
| FR-7 | Real-time change notification (stream processing) |
| FR-8 | Data lineage tracking |
| FR-9 | Automated compliance report generation |
| FR-10 | Point-in-time reconstruction of any entity |
| FR-11 | Hash chain for tamper detection (blockchain-like) |

3. Non-Functional Requirements

| Requirement | Target |
|-------------|--------|
| Completeness | 100% of changes captured (zero data loss) |
| Latency | Change captured within 5 seconds of commit |
| Immutability | Audit log cannot be modified after write |
| Retention | 7 years for financial data (regulatory) |
| Query performance | < 5 s for entity history lookup |
| Scalability | Handle 100K changes/sec across all databases |

4. Capacity Estimation

```text
Changes per second: 50,000 sustained average across all databases
                    (the 100K/sec scalability target is peak headroom)
Average change record size: 2 KB (before + after values + metadata)
Daily volume: 50,000 * 86,400 = 4.3B changes/day
Daily storage: 4.3B * 2 KB = 8.6 TB/day
Annual storage: 8.6 TB/day * 365 = ~3.1 PB/year
7-year retention: ~22 PB (use tiered storage: hot/warm/cold)

Hot (< 30 days): Elasticsearch (~260 TB)
Warm (30-365 days): PostgreSQL / ClickHouse (~3 PB)
Cold (> 1 year): S3 / Glacier (~5 PB after compression)
```

5-6. Architecture & Data Model

Change Data Capture (CDC) Approaches

| Approach | How | Pros | Cons |
|----------|-----|------|------|
| Application-level | App writes audit log alongside business data | Simple, full context | Bypass risk (direct DB edits missed), performance overhead |
| Database triggers | DB triggers capture changes | Catches all DB-level changes | DB-coupled, performance impact, complex to maintain |
| WAL-based CDC | Read the database write-ahead log | Minimal source-DB impact, catches everything | Complex setup, DB-specific |
| Debezium (chosen) | Reads PostgreSQL WAL / MySQL binlog via CDC connector | Zero app changes, minimal DB impact, real-time streaming | Infrastructure complexity (Kafka + Debezium) |
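
Registering the chosen Debezium connector happens through the Kafka Connect REST API. A minimal C# sketch of that call is below; the Connect host, connector name, and table list are illustrative, and credential settings are elided:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;

// Register a Debezium PostgreSQL connector with Kafka Connect (illustrative host/names).
using var http = new HttpClient { BaseAddress = new Uri("http://connect:8083") };

var response = await http.PostAsJsonAsync("/connectors", new
{
    name = "orders-audit-connector",                 // illustrative
    config = new Dictionary<string, string>
    {
        ["connector.class"] = "io.debezium.connector.postgresql.PostgresConnector",
        ["plugin.name"] = "pgoutput",                // PostgreSQL logical decoding plugin
        ["database.hostname"] = "orders-db",         // illustrative; user/password elided
        ["database.dbname"] = "orders",
        ["table.include.list"] = "public.orders",
        ["topic.prefix"] = "cdc"                     // events land on cdc.public.orders
    }
});
response.EnsureSuccessStatusCode();
```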

Audit Record Schema

```json
{
  "audit_id": "uuid",
  "source_database": "orders-db",
  "schema": "public",
  "table": "orders",
  "operation": "UPDATE",
  "entity_id": "ORD-123",
  "before": { "status": "pending", "total": 99.99 },
  "after": { "status": "confirmed", "total": 99.99 },
  "changed_fields": ["status"],
  "timestamp": "2024-01-15T10:04:37.123Z",
  "transaction_id": "tx-789",
  "user_id": "user-456",
  "source_ip": "10.0.1.42",
  "application": "order-service",
  "checksum": "sha256:abc123...",
  "previous_checksum": "sha256:xyz789..."
}
```
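
In .NET, this record might be modeled as the following sketch (property names mirror the JSON fields above; serialization attributes and validation are elided):

```csharp
using System;
using System.Collections.Generic;

// Minimal model of one audit record; field names mirror the JSON schema above.
public sealed record AuditRecord(
    Guid AuditId,
    string SourceDatabase,
    string Schema,
    string Table,
    string Operation,                               // INSERT | UPDATE | DELETE
    string EntityId,
    IReadOnlyDictionary<string, object?>? Before,   // null for INSERT
    IReadOnlyDictionary<string, object?>? After,    // null for DELETE
    IReadOnlyList<string> ChangedFields,
    DateTimeOffset Timestamp,
    string TransactionId,
    string UserId,
    string SourceIp,
    string Application,
    string Checksum,                                // SHA-256 over content + previous checksum
    string PreviousChecksum);
```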

Hash Chain for Tamper Detection

```text
Record N:   checksum_N   = SHA256(content_N   + checksum_N-1)
Record N+1: checksum_N+1 = SHA256(content_N+1 + checksum_N)
```

To verify integrity, recompute the checksums from the beginning. If any record was tampered with, all subsequent checksums are invalid. This is a simplified blockchain — it provides tamper evidence.
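
A sketch of the chaining and verification steps in C# follows; it assumes each record's content has a fixed canonical serialization so checksums are reproducible:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class HashChain
{
    // checksum_N = SHA256(content_N + checksum_N-1), hex-encoded.
    public static string NextChecksum(string canonicalContent, string previousChecksum)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(canonicalContent + previousChecksum);
        return Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
    }

    // Recompute the chain from a genesis checksum; false means tampering (or a bug).
    public static bool Verify(IEnumerable<(string Content, string Checksum)> records, string genesis)
    {
        string previous = genesis;
        foreach (var (content, checksum) in records)
        {
            if (NextChecksum(content, previous) != checksum) return false;
            previous = checksum;
        }
        return true;
    }
}
```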

7-8. High-Level Design & Diagrams

Batch Verification Flow
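
In outline (per the nightly reconciliation described in sections 10-14 and 19-20):

```text
Scheduler (nightly) → for each audited table:
  1. Compute a per-entity hash of the current DB state
  2. Replay the audit log to reconstruct the expected state, and hash it
  3. Compare hashes → mismatch = integrity violation → alert
```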


9. Deep Dive: Debezium CDC Pipeline

```text
PostgreSQL WAL → Debezium Connector → Kafka Topic → Audit Consumer

How Debezium captures changes:
1. Reads PostgreSQL's write-ahead log via logical replication
2. Converts WAL entries to structured change events
3. Publishes to a Kafka topic: "cdc.public.orders"

Event format (Debezium envelope):
{
  "before": { "id": 123, "status": "pending" },     // null for INSERT
  "after":  { "id": 123, "status": "confirmed" },   // null for DELETE
  "source": {
    "db": "orders-db",
    "table": "orders",
    "txId": 789,
    "lsn": 123456789
  },
  "op": "u",    // c=create, u=update, d=delete, r=read (snapshot)
  "ts_ms": 1705313077123
}

Benefits:
- Minimal impact on source DB performance (reads the WAL, issues no extra queries)
- Captures ALL changes (even direct SQL, migrations, admin fixes)
- At-least-once delivery via Kafka consumer offsets; effectively exactly-once
  when audit writes are idempotent (dedupe on LSN / transaction id)
- Schema evolution via Schema Registry
```
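
A minimal consumer sketch using Confluent.Kafka and System.Text.Json; the broker address, group id, and the audit-store call are illustrative, and it assumes the envelope arrives without the schema wrapper:

```csharp
using System;
using System.Text.Json;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "kafka:9092",             // illustrative address
    GroupId = "audit-consumer",                  // illustrative group id
    AutoOffsetReset = AutoOffsetReset.Earliest,
    EnableAutoCommit = false                     // commit only after the audit write succeeds
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("cdc.public.orders");

while (true)
{
    var result = consumer.Consume();
    using var doc = JsonDocument.Parse(result.Message.Value);
    var envelope = doc.RootElement;              // with schemas enabled, read "payload" instead

    string op = envelope.GetProperty("op").GetString()!;          // c / u / d / r
    long lsn = envelope.GetProperty("source").GetProperty("lsn").GetInt64();

    // Persist as an AuditRecord keyed by (source db, lsn) so redelivery stays idempotent.
    // auditStore.Append(...);                   // hypothetical store; see sections 15-17

    consumer.Commit(result);                     // at-least-once + idempotent write ≈ exactly-once
}
```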

10-14. Trade-offs, Failures, Observability

Key trade-offs:

  • WAL-based CDC vs application-level: CDC catches everything but is DB-specific; app-level has full context but can be bypassed
  • Elasticsearch vs PostgreSQL for audit: ES is faster for search but more expensive; PG is cheaper but slower for full-text queries
  • Hash chain: Provides tamper evidence but adds computation overhead

Failure scenarios:

  • Debezium connector lag → audit records delayed (monitor consumer lag)
  • Kafka partition loss → potential audit gap (replication factor 3 mitigates)
  • Hash chain broken → indicates tampering OR bug (alert immediately)
  • Storage tier migration failure → data stuck in hot tier (monitor tier transitions)

Key metrics:

```text
audit_changes_captured_total{source_db, table, operation}
audit_capture_lag_seconds                 # time from DB commit to audit write
audit_verification_result{table, result="pass|fail"}
audit_hash_chain_breaks_total
audit_storage_bytes{tier="hot|warm|cold"}
```
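
Declared with prometheus-net, for instance (a sketch; names and label sets match the list above):

```csharp
using Prometheus;

public static class AuditMetrics
{
    public static readonly Counter ChangesCaptured = Metrics.CreateCounter(
        "audit_changes_captured_total", "Change events captured.",
        "source_db", "table", "operation");

    public static readonly Histogram CaptureLag = Metrics.CreateHistogram(
        "audit_capture_lag_seconds", "Time from DB commit to audit write.");

    public static readonly Counter HashChainBreaks = Metrics.CreateCounter(
        "audit_hash_chain_breaks_total", "Hash chain verification failures.");
}

// Usage: AuditMetrics.ChangesCaptured.WithLabels("orders-db", "orders", "UPDATE").Inc();
```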

15-17. .NET Implementation

Key code: See AuditEventConsumer.cs and BatchVerificationWorker.cs.
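
Those files aren't reproduced here; the verification worker's core loop might look like the sketch below, where IStateHasher and IAuditLogReplayer are hypothetical abstractions over the source database and the audit store:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical abstractions over the source database and the audit store.
public interface IStateHasher
{
    Task<string> HashCurrentStateAsync(string table, CancellationToken ct);
}

public interface IAuditLogReplayer
{
    Task<string> HashReplayedStateAsync(string table, CancellationToken ct);
}

public sealed class BatchVerificationWorker : BackgroundService
{
    private static readonly string[] Tables = { "orders" };   // illustrative
    private readonly IStateHasher _hasher;
    private readonly IAuditLogReplayer _replayer;

    public BatchVerificationWorker(IStateHasher hasher, IAuditLogReplayer replayer)
        => (_hasher, _replayer) = (hasher, replayer);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            foreach (var table in Tables)
            {
                string actual = await _hasher.HashCurrentStateAsync(table, stoppingToken);
                string expected = await _replayer.HashReplayedStateAsync(table, stoppingToken);

                if (actual != expected)
                {
                    // Integrity violation: emit audit_verification_result{result="fail"} and alert.
                }
            }

            await Task.Delay(TimeSpan.FromHours(24), stoppingToken);   // nightly reconciliation
        }
    }
}
```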


18. Database Choice

| Data | Storage | Why |
|------|---------|-----|
| Hot audit records (< 30 days) | Elasticsearch | Fast full-text search, aggregations |
| Warm audit records (30-365 days) | PostgreSQL (partitioned by month) | Cost-effective, SQL queries for compliance |
| Cold audit records (> 1 year) | S3 + Parquet | Cheapest long-term storage, queryable via Athena/Spark |
| Hash chain state | PostgreSQL | ACID guarantees for checksum integrity |
| CDC events | Kafka | Durable, replayable streaming; exactly-once semantics |

19-20. Interview Strategy & Summary

Interview tips:

  1. Start with CDC (Debezium) — captures ALL changes with zero app impact
  2. Discuss tiered storage for cost optimization (hot/warm/cold)
  3. Hash chain for tamper detection (simplified blockchain)
  4. Batch verification as a safety net (nightly reconciliation)
  5. Mention compliance requirements (SOX, GDPR, HIPAA) to show business awareness

Common mistakes:

  1. Only using application-level auditing (misses direct DB changes)
  2. Storing all audit data in one tier (cost explosion)
  3. Forgetting about tamper detection
  4. Not discussing retention policies

Practice Mode

5 Questions

  1. "Why use CDC instead of application-level auditing?" → CDC captures ALL changes including direct SQL, migrations, and admin edits. Zero application code changes needed.
  2. "How do you detect if someone tampered with the audit log?" → Hash chain: each record's checksum includes the previous record's checksum. Tampering breaks the chain.
  3. "How do you handle 50K changes/sec without impacting source databases?" → WAL-based CDC reads the database's write-ahead log asynchronously. Zero additional queries on the source DB.
  4. "How do you manage storage cost for 7-year retention?" → Tiered storage: Elasticsearch (hot, 30 days) → PostgreSQL (warm, 1 year) → S3/Glacier (cold, 7 years).
  5. "How does batch verification work?" → Nightly job computes hash of current DB state, reconstructs expected state from audit log, compares. Mismatch = integrity violation.