
OpenTelemetry-Based Logging & Distributed Event-Log Tracking

A Staff-Level Architecture Learning Package for Healthcare IoT Observability


What This Repository Teaches

This repository is a comprehensive, architecture-first learning package for designing OpenTelemetry-based logging and distributed event-log tracking at scale. It targets a realistic domain: a healthcare IoT platform managing ~1 million medical devices with a hybrid infrastructure spanning Kubernetes microservices and legacy on-premises systems.

This is not a toy example. Every section addresses real tradeoffs that arise at staff-engineer scale: cost, cardinality, compliance, reliability, and operational complexity.


Who This Is For

  • Staff / Principal Engineers designing observability platforms
  • Platform / SRE teams building centralized logging pipelines
  • Architects integrating legacy and modern systems under a single observability umbrella
  • Engineers preparing for system design interviews on observability topics
  • Teams in regulated industries (healthcare, fintech, government) with compliance requirements

Domain Context

A healthcare IoT platform with:

  • ~1,000,000 medical devices in the field (patient monitors, infusion pumps, wearables, diagnostic equipment)
  • Devices sending telemetry and events upstream via MQTT/HTTP/gRPC
  • Backend platform services running in Kubernetes (ingestion, validation, enrichment, persistence, notification, analytics)
  • Some modern microservices (Go, Java, Python) with structured logging
  • Some legacy services on VMs with filesystem-based plain-text logs
  • Central observability requirements for operations, debugging, compliance, and incident response
  • Strict PHI/PII handling under HIPAA and related regulations


.NET 10 Implementation

This repository includes a working .NET 10 implementation that demonstrates the architecture in practice:

Quick Start

```bash
# Prerequisites: .NET 10 SDK, Docker, kubectl (minikube/kind/k3s)

# Option 1: Deploy everything to local K8s with Elasticsearch
./scripts/deploy.sh

# Option 2: Deploy with Datadog instead
./scripts/deploy.sh --with-datadog

# Test the pipeline
kubectl -n healthcare-iot port-forward svc/ingestion-service 8080:8080 &
./scripts/test-pipeline.sh

# View logs in Kibana
kubectl -n observability port-forward svc/kibana 5601:5601
# Open http://localhost:5601, create data view for "logs-*"
```

What the Implementation Includes

| Component | Technology | Purpose |
|---|---|---|
| Shared Logging Library | .NET 10, Serilog, OTel SDK | Structured logging contract, event envelope, correlation middleware |
| Ingestion Service | ASP.NET Core Minimal API | Receives device alerts, generates correlation_id, forwards downstream |
| Validation Service | ASP.NET Core Minimal API | Validates alerts, logs event journal, forwards to notification |
| Notification Service | ASP.NET Core Minimal API | Sends notifications, logs success/failure event journal entries |
| OTel Collector DaemonSet | OTel Collector Contrib | Tails pod logs, enriches with K8s metadata, exports to ES |
| Elasticsearch + Kibana | Elastic 8.17 | Local log storage and visualization |
| Datadog Agent | Datadog Agent 7 | Alternative backend via OTLP ingestion |
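
The table lists a CorrelationMiddleware that propagates correlation headers, but its source isn't reproduced in this README. As a rough sketch of the core decision such middleware makes (the `Correlation` class, the `GetOrCreate` name, and the `X-Correlation-Id` header are illustrative assumptions, not the repo's actual API):

```csharp
using System;

// Hypothetical helper mirroring what a correlation middleware typically does:
// reuse the caller's correlation ID when one arrives on the request, otherwise
// mint a fresh one so the request is still traceable end to end.
public static class Correlation
{
    public const string HeaderName = "X-Correlation-Id"; // assumed header name

    public static string GetOrCreate(string? incoming) =>
        string.IsNullOrWhiteSpace(incoming)
            ? $"corr-{Guid.NewGuid():N}"   // matches the "corr-..." shape in the sample log line below
            : incoming;
}
```

In the actual library this logic would sit inside ASP.NET Core middleware that also echoes the ID on the response and pushes it into the Serilog `LogContext`, so every log line written in the request scope carries `correlation_id`.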

Architecture (Implementation)

```text
Device Alert → Ingestion Service → Validation Service → Notification Service
                    │                      │                      │
                    └── stdout JSON ───────┴── stdout JSON ───────┘
                              │
                    OTel Collector DaemonSet
                    (filelog + k8sattributes)
                              │
                   ┌──────────┴──────────┐
                   │                     │
            Elasticsearch           Datadog (optional)
                   │
                Kibana
```
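
The collector leg of this pipeline can be sketched as a minimal OTel Collector configuration. This is an illustrative shape only; the repo's real config lives in k8s/otel/otel-collector-daemonset.yaml and configs/otel/collector-k8s-daemonset.yaml, and the exact endpoints, index names, and operators there may differ:

```yaml
# Illustrative minimal collector shape (not the repo's actual config).
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]   # tail container log files on each node
    operators:
      - type: container                  # parse the CRI/containerd log line format

processors:
  k8sattributes: {}                      # enrich records with pod/namespace/node metadata
  batch: {}                              # batch before export to reduce request overhead

exporters:
  elasticsearch:
    endpoints: ["http://elasticsearch.observability:9200"]  # assumed in-cluster address
    logs_index: logs-healthcare-iot      # assumption: something matching the "logs-*" data view

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes, batch]
      exporters: [elasticsearch]
```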

Log Output Example

Each service emits structured JSON to stdout, which the OTel Collector DaemonSet collects:

```json
{
  "@t": "2024-01-15T10:30:45.123Z",
  "@l": "Information",
  "service.name": "ingestion-service",
  "service.version": "1.0.0",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "corr-abc123def456",
  "tenant_id": "HOSP-789",
  "device_id": "DEV-PM100-0042",
  "EventType": "device.alert.received",
  "Outcome": "success",
  "@mt": "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}"
}
```
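
The `@t`/`@l`/`@mt` keys follow Serilog's compact JSON (CLEF) conventions, where named holes in the message template become first-class JSON fields. A minimal sketch of how a service could emit such a line — the `CompactJsonFormatter` console sink shown here is one way to get CLEF output, not necessarily what the repo's `ObservabilityExtensions` wires up, and it requires the Serilog, Serilog.Sinks.Console, and Serilog.Formatting.Compact NuGet packages:

```csharp
using Serilog;
using Serilog.Context;
using Serilog.Formatting.Compact;

// Console sink with compact JSON output: this formatter produces the "@t"/"@mt" keys.
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console(new CompactJsonFormatter())
    .CreateLogger();

// Ambient context fields (in the real services these would come from middleware).
using (LogContext.PushProperty("correlation_id", "corr-abc123def456"))
using (LogContext.PushProperty("device_id", "DEV-PM100-0042"))
{
    // Structured event-journal entry; EventType/Outcome etc. become JSON fields.
    Log.Information(
        "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}",
        "device.alert.received", "corr-abc123def456", "evt-001", "success");
}

Log.CloseAndFlush();
```

Because the template string itself is preserved in `@mt`, every occurrence of the same log statement shares one queryable identity in Elasticsearch or Datadog, regardless of the per-event values.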

Repository Structure

```text
.
├── README.md                                     # This file
├── global.json                                   # .NET SDK version
├── HealthcareIoT.sln                             # .NET solution file
├── src/
│   ├── HealthcareIoT.Logging.Shared/             # Shared structured logging library
│   │   ├── ObservabilityExtensions.cs            # One-call OTel + Serilog + ES/DD setup
│   │   ├── CorrelationMiddleware.cs              # HTTP correlation header propagation
│   │   ├── CorrelationContext.cs                 # Business correlation context
│   │   ├── EventEnvelope.cs                      # Distributed event log schema
│   │   ├── EventLogger.cs                        # Event journal structured logger
│   │   ├── LoggingConstants.cs                   # Standardized field names
│   │   └── DeviceAlertRequest.cs                 # Shared DTOs
│   ├── HealthcareIoT.Ingestion/                  # Ingestion microservice
│   ├── HealthcareIoT.Validation/                 # Validation microservice
│   └── HealthcareIoT.Notification/               # Notification microservice
├── k8s/
│   ├── base/namespace.yaml                       # K8s namespaces
│   ├── otel/otel-collector-daemonset.yaml        # OTel Collector DaemonSet + RBAC
│   ├── elasticsearch/elasticsearch.yaml          # ES + Kibana StatefulSet
│   ├── datadog/datadog-agent.yaml                # Datadog Agent DaemonSet
│   └── services/                                 # Service deployments
├── scripts/
│   ├── deploy.sh                                 # One-command deployment
│   └── test-pipeline.sh                          # End-to-end test
├── docs/
│   ├── observability-overview.md                 # High-level architecture and philosophy
│   ├── opentelemetry-logging-foundations.md      # OTel logging concepts deep dive
│   ├── distributed-event-logging.md              # Distributed event-log tracking across services
│   ├── structured-logging-strategy.md            # Structured logging contract and schema design
│   ├── kubernetes-pod-log-collection.md          # K8s pod log collection patterns
│   ├── legacy-filesystem-log-collection.md       # Legacy file-based log collection
│   ├── legacy-log-transformation.md              # Transforming legacy logs to structured format
│   ├── elk-architecture.md                       # ELK pipeline architecture
│   ├── datadog-architecture.md                   # Datadog pipeline architecture
│   ├── hybrid-modern-and-legacy-logging.md       # Unified hybrid observability architecture
│   ├── log-correlation-strategy.md               # Correlation IDs, trace context, troubleshooting
│   ├── medical-device-platform-considerations.md # Healthcare IoT domain specifics
│   ├── security-and-compliance-for-logs.md       # PHI/PII, HIPAA, encryption, access control
│   ├── cost-cardinality-and-retention.md         # Cost management, cardinality, retention tiers
│   ├── deployment-patterns-for-collectors.md     # DaemonSet vs sidecar vs gateway patterns
│   ├── incident-debugging-playbook.md            # Incident response with logs
│   ├── elk-vs-datadog.md                         # Detailed comparison
│   └── staff-level-cheatsheet.md                 # Quick-reference cheat sheets
├── configs/                                      # Reference OTel/legacy/ELK configs
│   ├── otel/
│   │   ├── collector-k8s-daemonset.yaml          # OTel Collector config for K8s pod logs
│   │   ├── collector-gateway.yaml                # Gateway collector config
│   │   └── collector-legacy-filelog.yaml         # Filelog receiver for legacy systems
│   ├── legacy/
│   │   ├── filebeat-legacy.yaml                  # Filebeat config for legacy tailing
│   │   └── logstash-transform-pipeline.conf      # Logstash transformation pipeline
│   ├── elk/
│   │   └── elasticsearch-ilm-policy.json         # Index lifecycle management
│   └── datadog/
│       └── datadog-agent-otel.yaml               # Datadog agent with OTLP ingestion
└── diagrams/
    └── architecture-diagrams.md                  # All Mermaid diagrams in one reference file
```

How to Use This Repository

  1. Start with docs/observability-overview.md for the big picture
  2. Deep dive into specific topics based on your interest
  3. Study the diagrams in diagrams/architecture-diagrams.md and inline in each doc
  4. Review configs in configs/ to see realistic collector/pipeline configurations
  5. Use the cheat sheets in docs/staff-level-cheatsheet.md for revision
  6. Follow the incident playbook in docs/incident-debugging-playbook.md for practical troubleshooting patterns

A suggested reading order:

| Order | Document | Why |
|---|---|---|
| 1 | observability-overview.md | Understand the full architecture |
| 2 | opentelemetry-logging-foundations.md | Understand OTel logging primitives |
| 3 | distributed-event-logging.md | Understand event tracking across services |
| 4 | structured-logging-strategy.md | Understand the logging contract |
| 5 | kubernetes-pod-log-collection.md | Understand modern log collection |
| 6 | legacy-filesystem-log-collection.md | Understand legacy log collection |
| 7 | legacy-log-transformation.md | Understand transformation pipelines |
| 8 | log-correlation-strategy.md | Understand correlation and debugging |
| 9 | elk-architecture.md | Understand ELK pipeline |
| 10 | datadog-architecture.md | Understand Datadog pipeline |
| 11 | hybrid-modern-and-legacy-logging.md | Understand the unified architecture |
| 12 | deployment-patterns-for-collectors.md | Understand deployment tradeoffs |
| 13 | medical-device-platform-considerations.md | Understand domain specifics |
| 14 | security-and-compliance-for-logs.md | Understand compliance requirements |
| 15 | cost-cardinality-and-retention.md | Understand cost and scale |
| 16 | elk-vs-datadog.md | Compare backends |
| 17 | incident-debugging-playbook.md | Practice troubleshooting |
| 18 | staff-level-cheatsheet.md | Quick revision |

High-Level Architecture (Quick View)

See diagrams/architecture-diagrams.md for the full set of Mermaid architecture diagrams; the implementation-level flow is shown in the Architecture (Implementation) section above.

Key Principles

  1. Logs are a first-class signal — not an afterthought bolted onto metrics and traces
  2. Structured logging is non-negotiable at scale — unstructured text cannot be queried reliably across 1M devices
  3. Distributed event logs are distinct from diagnostic logs — they serve different consumers and have different retention/immutability requirements
  4. Legacy systems must be integrated, not ignored — transformation pipelines bridge the gap
  5. Correlation is the superpower — trace IDs, correlation IDs, and device event IDs make logs useful during incidents
  6. Cost is an architecture concern — at 1M devices, every field, every log level, every retention day has a dollar cost
  7. Compliance is not optional — PHI/PII redaction, audit trails, and access control are first-order design constraints in healthcare
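
Principle 6 is easy to quantify with a back-of-envelope calculation. The rates and sizes below are illustrative assumptions, not figures from this repository:

```csharp
using System;

// Back-of-envelope daily log volume at fleet scale. All inputs are assumptions:
// 1M devices, 1 structured log line per device per minute, ~500 bytes per JSON line.
const long devices = 1_000_000;
const long linesPerDevicePerMinute = 1;
const long bytesPerLine = 500;

long bytesPerDay = devices * linesPerDevicePerMinute * bytesPerLine * 60 * 24;
double gbPerDay = bytesPerDay / 1e9;

Console.WriteLine($"{gbPerDay} GB/day");   // 720 GB/day, before replication and indexing overhead
```

At that volume, adding one extra 50-byte field to every line adds roughly 10% to ingest, which is why the cost doc treats field budgets, log levels, and retention tiers as architecture decisions rather than tuning knobs.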

License

This is an educational learning package. Use freely for learning, training, and internal architecture discussions.