Problem #10: Design a Database Batch Auditing Service — Full Deep Dive
1. Problem Statement
Design a database auditing system that captures every data change (INSERT, UPDATE, DELETE) across multiple databases, stores an immutable audit trail, and supports batch verification and compliance reporting.
Core business goals:
- Regulatory compliance (SOX, GDPR, HIPAA) — know who changed what and when
- Data integrity verification — detect unauthorized modifications
- Change history — reconstruct or roll back to any point in time
- Forensic analysis — investigate incidents with full change timeline
2. Functional Requirements
Core
| ID | Requirement |
|---|---|
| FR-1 | Capture all data changes (INSERT/UPDATE/DELETE) with before/after values |
| FR-2 | Record who made the change, when, and from where |
| FR-3 | Immutable audit log (append-only, tamper-evident) |
| FR-4 | Batch verification: compare current DB state against audit log |
| FR-5 | Query audit history by entity, time range, user, or change type |
| FR-6 | Support multiple source databases |
Optional
| ID | Requirement |
|---|---|
| FR-7 | Real-time change notification (stream processing) |
| FR-8 | Data lineage tracking |
| FR-9 | Automated compliance report generation |
| FR-10 | Point-in-time reconstruction of any entity |
| FR-11 | Hash chain for tamper detection (blockchain-like) |
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Completeness | 100% of changes captured (zero data loss) |
| Latency | Change captured within 5 seconds of commit |
| Immutability | Audit log cannot be modified after write |
| Retention | 7 years for financial data (regulatory) |
| Query performance | < 5s for entity history lookup |
| Scalability | Handle 100K changes/sec across all databases |
4. Capacity Estimation
```text
Changes per second: 50,000 sustained average (scalability target: 100K peak)
Average change record size: 2 KB (before + after values + metadata)
Daily volume: 50,000 * 86,400 = ~4.3B changes/day
Daily storage: 4.3B * 2 KB = ~8.6 TB/day
Annual storage: 8.6 TB * 365 = ~3.1 PB/year
7-year retention: ~22 PB (use tiered storage: hot/warm/cold)
  Hot (< 30 days):    Elasticsearch (~260 TB)
  Warm (30-365 days): PostgreSQL / ClickHouse (~3 PB)
  Cold (> 1 year):    S3 / Glacier (compressed to ~5 PB)
```

5-6. Architecture & Data Model
Change Data Capture (CDC) Approaches
| Approach | How | Pros | Cons |
|---|---|---|---|
| Application-level | App writes audit log alongside business data | Simple, full context | Bypass risk (direct DB edits missed), performance overhead |
| Database triggers | DB triggers capture changes | Catches all DB-level changes | DB-coupled, performance impact, complex to maintain |
| WAL-based CDC | Read database write-ahead log | Zero performance impact, catches everything | Complex setup, DB-specific |
| Debezium (chosen) | Reads PostgreSQL WAL / MySQL binlog via CDC connector | Zero app changes, zero DB impact, real-time streaming | Infrastructure complexity (Kafka + Debezium) |
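To make the chosen row concrete, the sketch below registers a PostgreSQL connector with Kafka Connect's REST API from .NET. The connector name, host names, credentials, and table list are placeholders, and the property names should be checked against the Debezium documentation for the connector version in use.

```csharp
// Sketch: register a Debezium PostgreSQL connector via the Kafka Connect REST API.
// Host names, credentials, topic prefix, and table list are placeholders.
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class DebeziumConnectorSetup
{
    public static async Task RegisterOrdersConnectorAsync(HttpClient http)
    {
        var request = new
        {
            name = "orders-db-audit-connector",
            config = new Dictionary<string, string>
            {
                ["connector.class"]    = "io.debezium.connector.postgresql.PostgresConnector",
                ["plugin.name"]        = "pgoutput",             // logical decoding plugin
                ["database.hostname"]  = "orders-db.internal",   // placeholder
                ["database.port"]      = "5432",
                ["database.user"]      = "debezium",
                ["database.password"]  = "<secret>",
                ["database.dbname"]    = "orders",
                ["topic.prefix"]       = "cdc",                  // topics become cdc.public.orders, ...
                ["table.include.list"] = "public.orders,public.order_items"
            }
        };

        var body = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");
        var response = await http.PostAsync("http://kafka-connect:8083/connectors", body);
        response.EnsureSuccessStatusCode();
    }
}
```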
Audit Record Schema
```json
{
"audit_id": "uuid",
"source_database": "orders-db",
"schema": "public",
"table": "orders",
"operation": "UPDATE",
"entity_id": "ORD-123",
"before": { "status": "pending", "total": 99.99 },
"after": { "status": "confirmed", "total": 99.99 },
"changed_fields": ["status"],
"timestamp": "2024-01-15T10:04:37.123Z",
"transaction_id": "tx-789",
"user_id": "user-456",
"source_ip": "10.0.1.42",
"application": "order-service",
"checksum": "sha256:abc123...",
"previous_checksum": "sha256:xyz789..."
}
```

Hash Chain for Tamper Detection
```text
Record N:   checksum_N   = SHA256(content_N   + checksum_N-1)
Record N+1: checksum_N+1 = SHA256(content_N+1 + checksum_N)
```

To verify integrity, recompute the checksums from the first record forward. If any record was tampered with, all subsequent checksums become invalid. This is effectively a simplified blockchain: it provides tamper evidence, though not tamper prevention.
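As an illustration, here is a minimal C# sketch of the chaining and verification logic; the AuditRecord shape and the "genesis" seed value are hypothetical simplifications.

```csharp
// Sketch: hash-chained audit records. AuditRecord and the "genesis" seed are hypothetical.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed record AuditRecord(string Content, string PreviousChecksum, string Checksum);

public static class HashChain
{
    // checksum_N = SHA256(content_N + checksum_N-1)
    public static string Compute(string content, string previousChecksum)
    {
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(content + previousChecksum));
        return "sha256:" + Convert.ToHexString(bytes).ToLowerInvariant();
    }

    public static AuditRecord Append(string content, AuditRecord? previous)
    {
        var prevChecksum = previous?.Checksum ?? "sha256:genesis";
        return new AuditRecord(content, prevChecksum, Compute(content, prevChecksum));
    }

    // Recompute from the start; any tampering invalidates every later checksum.
    public static bool Verify(IEnumerable<AuditRecord> chain)
    {
        string prev = "sha256:genesis";
        foreach (var record in chain)
        {
            if (record.PreviousChecksum != prev || record.Checksum != Compute(record.Content, prev))
                return false;
            prev = record.Checksum;
        }
        return true;
    }
}
```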
7-8. High-Level Design & Diagrams
Batch Verification Flow
9. Deep Dive: Debezium CDC Pipeline
```text
PostgreSQL WAL → Debezium Connector → Kafka Topic → Audit Consumer
```

How Debezium captures changes:
1. Reads PostgreSQL's Write-Ahead Log (logical replication)
2. Converts WAL entries to structured change events
3. Publishes to Kafka topic: "cdc.public.orders"
Event format (Debezium envelope):

```json
{
"before": { "id": 123, "status": "pending" }, // null for INSERT
"after": { "id": 123, "status": "confirmed" }, // null for DELETE
"source": {
"db": "orders-db",
"table": "orders",
"txId": 789,
"lsn": 123456789
},
"op": "u", // c=create, u=update, d=delete, r=read(snapshot)
"ts_ms": 1705312877123
}
```
Benefits:
- Zero impact on source DB performance (reads WAL, not queries)
- Captures ALL changes (even direct SQL, migrations, admin fixes)
- Effectively exactly-once via Kafka consumer offsets plus idempotent audit writes (e.g., keyed on source LSN)
- Schema evolution via Schema Registry

10-14. Trade-offs, Failures, Observability
Key trade-offs:
- WAL-based CDC vs application-level: CDC catches everything but is DB-specific; app-level has full context but can be bypassed
- Elasticsearch vs PostgreSQL for audit: ES is faster for search but more expensive; PG is cheaper but slower for full-text queries
- Hash chain: Provides tamper evidence but adds computation overhead
Failure scenarios:
- Debezium connector lag → audit records delayed (monitor consumer lag)
- Kafka partition loss → potential audit gap (replication factor 3 mitigates)
- Hash chain broken → indicates tampering OR bug (alert immediately)
- Storage tier migration failure → data stuck in hot tier (monitor tier transitions)
Key metrics:
```text
audit_changes_captured_total{source_db, table, operation}
audit_capture_lag_seconds # time from DB commit to audit write
audit_verification_result{table, result="pass|fail"}
audit_hash_chain_breaks_total
audit_storage_bytes{tier="hot|warm|cold"}
```

15-17. .NET Implementation
Key code: See AuditEventConsumer.cs and BatchVerificationWorker.cs.
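Those files are not reproduced here, so the following is a hedged sketch of what AuditEventConsumer.cs might look like. It assumes the Confluent.Kafka client, an unwrapped Debezium JSON payload (schemas disabled), and a hypothetical IAuditStore abstraction; the group id and envelope shape are simplifications.

```csharp
// Sketch of an audit event consumer: reads Debezium change events from Kafka,
// maps them to audit records, and appends them to the audit store.
// IAuditStore and DebeziumEnvelope are hypothetical; Kafka usage is Confluent.Kafka.
using System.Text.Json;
using System.Threading;
using Confluent.Kafka;

public sealed class DebeziumEnvelope
{
    public JsonElement Before { get; set; }
    public JsonElement After { get; set; }
    public string Op { get; set; } = "";   // c=create, u=update, d=delete, r=snapshot read
    public long Ts_ms { get; set; }
}

public interface IAuditStore
{
    void Append(string topic, string op, JsonElement before, JsonElement after, long tsMs);
}

public sealed class AuditEventConsumer
{
    private readonly IAuditStore _store;
    public AuditEventConsumer(IAuditStore store) => _store = store;

    public void Run(string bootstrapServers, string topic, CancellationToken ct)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = bootstrapServers,
            GroupId = "audit-consumer",
            EnableAutoCommit = false,            // commit only after the audit write succeeds
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe(topic);

        var jsonOptions = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        while (!ct.IsCancellationRequested)
        {
            var result = consumer.Consume(ct);
            var envelope = JsonSerializer.Deserialize<DebeziumEnvelope>(result.Message.Value, jsonOptions);
            if (envelope is null) continue;

            _store.Append(topic, envelope.Op, envelope.Before, envelope.After, envelope.Ts_ms);
            consumer.Commit(result);             // at-least-once; audit writes should be idempotent
        }
    }
}
```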
18. Database Choice
| Data | Storage | Why |
|---|---|---|
| Hot audit records (< 30 days) | Elasticsearch | Fast full-text search, aggregations |
| Warm audit records (30-365 days) | PostgreSQL (partitioned by month) | Cost-effective, SQL queries for compliance |
| Cold audit records (> 1 year) | S3 + Parquet | Cheapest long-term storage, queryable via Athena/Spark |
| Hash chain state | PostgreSQL | ACID for checksum integrity |
| CDC events | Kafka | Durable streaming, exactly-once semantics |
19-20. Interview Strategy & Summary
Interview tips:
- Start with CDC (Debezium) — captures ALL changes with zero app impact
- Discuss tiered storage for cost optimization (hot/warm/cold)
- Hash chain for tamper detection (simplified blockchain)
- Batch verification as a safety net (nightly reconciliation)
- Mention compliance requirements (SOX, GDPR, HIPAA) to show business awareness
Common mistakes:
- Only using application-level auditing (misses direct DB changes)
- Storing all audit data in one tier (cost explosion)
- Forgetting about tamper detection
- Not discussing retention policies
Practice Mode
5 Questions
- "Why use CDC instead of application-level auditing?" → CDC captures ALL changes including direct SQL, migrations, and admin edits. Zero application code changes needed.
- "How do you detect if someone tampered with the audit log?" → Hash chain: each record's checksum includes the previous record's checksum. Tampering breaks the chain.
- "How do you handle 50K changes/sec without impacting source databases?" → WAL-based CDC reads the database's write-ahead log asynchronously. Zero additional queries on the source DB.
- "How do you manage storage cost for 7-year retention?" → Tiered storage: Elasticsearch (hot, 30 days) → PostgreSQL (warm, 1 year) → S3/Glacier (cold, 7 years).
- "How does batch verification work?" → Nightly job computes hash of current DB state, reconstructs expected state from audit log, compares. Mismatch = integrity violation.