
OpenTelemetry-Based Logging & Distributed Event-Log Tracking

A Staff-Level Architecture Learning Package for Healthcare IoT Observability


What This Repository Teaches

This repository is a comprehensive, architecture-first learning package for designing OpenTelemetry-based logging and distributed event-log tracking at scale. It targets a realistic domain: a healthcare IoT platform managing ~1 million medical devices with a hybrid infrastructure spanning Kubernetes microservices and legacy on-premises systems.

This is not a toy example. Every section addresses real tradeoffs that arise at staff-engineer scale: cost, cardinality, compliance, reliability, and operational complexity.


Who This Is For

  • Staff / Principal Engineers designing observability platforms
  • Platform / SRE teams building centralized logging pipelines
  • Architects integrating legacy and modern systems under a single observability umbrella
  • Engineers preparing for system design interviews on observability topics
  • Teams in regulated industries (healthcare, fintech, government) with compliance requirements

Domain Context

A healthcare IoT platform with:

  • ~1,000,000 medical devices in the field (patient monitors, infusion pumps, wearables, diagnostic equipment)
  • Devices sending telemetry and events upstream via MQTT/HTTP/gRPC
  • Backend platform services running in Kubernetes (ingestion, validation, enrichment, persistence, notification, analytics)
  • Some modern microservices (Go, Java, Python) with structured logging
  • Some legacy services on VMs with filesystem-based plain-text logs
  • Central observability requirements for operations, debugging, compliance, and incident response
  • Strict PHI/PII handling under HIPAA and related regulations


.NET 10 Implementation

This repository includes a working .NET 10 implementation that demonstrates the architecture in practice:

Quick Start

```bash
# Prerequisites: .NET 10 SDK, Docker, kubectl (minikube/kind/k3s)

# Option 1: Deploy everything to local K8s with Elasticsearch
./scripts/deploy.sh

# Option 2: Deploy with Datadog instead
./scripts/deploy.sh --with-datadog

# Test the pipeline
kubectl -n healthcare-iot port-forward svc/ingestion-service 8080:8080 &
./scripts/test-pipeline.sh

# View logs in Kibana
kubectl -n observability port-forward svc/kibana 5601:5601
# Open http://localhost:5601, create data view for "logs-*"
```

What the Implementation Includes

| Component | Technology | Purpose |
|---|---|---|
| Shared Logging Library | .NET 10, Serilog, OTel SDK | Structured logging contract, event envelope, correlation middleware |
| Ingestion Service | ASP.NET Core Minimal API | Receives device alerts, generates correlation_id, forwards downstream |
| Validation Service | ASP.NET Core Minimal API | Validates alerts, logs event journal, forwards to notification |
| Notification Service | ASP.NET Core Minimal API | Sends notifications, logs success/failure event journal entries |
| OTel Collector DaemonSet | OTel Collector Contrib | Tails pod logs, enriches with K8s metadata, exports to ES |
| Elasticsearch + Kibana | Elastic 8.17 | Local log storage and visualization |
| Datadog Agent | Datadog Agent 7 | Alternative backend via OTLP ingestion |
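
The table lists a CorrelationMiddleware that propagates correlation headers, but its source isn't reproduced in this README. As a rough sketch of the core decision such middleware makes (the `Correlation` class, the `GetOrCreate` name, and the `X-Correlation-Id` header are illustrative assumptions, not the repo's actual API):

```csharp
using System;

// Hypothetical helper mirroring what a correlation middleware typically does:
// reuse the caller's correlation ID when one arrives on the request, otherwise
// mint a fresh one so the request is still traceable end to end.
public static class Correlation
{
    public const string HeaderName = "X-Correlation-Id"; // assumed header name

    public static string GetOrCreate(string? incoming) =>
        string.IsNullOrWhiteSpace(incoming)
            ? $"corr-{Guid.NewGuid():N}"   // matches the "corr-..." shape in the sample log line below
            : incoming;
}
```

In the actual library this logic would sit inside ASP.NET Core middleware that also echoes the ID on the response and pushes it into the Serilog `LogContext`, so every log line written in the request scope carries `correlation_id`.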

Architecture (Implementation)

```text
Device Alert → Ingestion Service → Validation Service → Notification Service
                    │                      │                      │
                    └── stdout JSON ───────┴── stdout JSON ───────┘
                              │
                    OTel Collector DaemonSet
                    (filelog + k8sattributes)
                              │
                   ┌──────────┴──────────┐
                   │                     │
            Elasticsearch           Datadog (optional)
                   │
                Kibana
```
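
The collector leg of this pipeline can be sketched as a minimal OTel Collector configuration. This is an illustrative shape only; the repo's real config lives in k8s/otel/otel-collector-daemonset.yaml and configs/otel/collector-k8s-daemonset.yaml, and the exact endpoints, index names, and operators there may differ:

```yaml
# Illustrative minimal collector shape (not the repo's actual config).
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]   # tail container log files on each node
    operators:
      - type: container                  # parse the CRI/containerd log line format

processors:
  k8sattributes: {}                      # enrich records with pod/namespace/node metadata
  batch: {}                              # batch before export to reduce request overhead

exporters:
  elasticsearch:
    endpoints: ["http://elasticsearch.observability:9200"]  # assumed in-cluster address
    logs_index: logs-healthcare-iot      # assumption: something matching the "logs-*" data view

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes, batch]
      exporters: [elasticsearch]
```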

Log Output Example

Each service emits structured JSON to stdout, which the OTel Collector DaemonSet collects:

```json
{
  "@t": "2024-01-15T10:30:45.123Z",
  "@l": "Information",
  "service.name": "ingestion-service",
  "service.version": "1.0.0",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "correlation_id": "corr-abc123def456",
  "tenant_id": "HOSP-789",
  "device_id": "DEV-PM100-0042",
  "EventType": "device.alert.received",
  "Outcome": "success",
  "@mt": "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}"
}
```
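
The `@t`/`@l`/`@mt` keys follow Serilog's compact JSON (CLEF) conventions, where named holes in the message template become first-class JSON fields. A minimal sketch of how a service could emit such a line — the `CompactJsonFormatter` console sink shown here is one way to get CLEF output, not necessarily what the repo's `ObservabilityExtensions` wires up, and it requires the Serilog, Serilog.Sinks.Console, and Serilog.Formatting.Compact NuGet packages:

```csharp
using Serilog;
using Serilog.Context;
using Serilog.Formatting.Compact;

// Console sink with compact JSON output: this formatter produces the "@t"/"@mt" keys.
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console(new CompactJsonFormatter())
    .CreateLogger();

// Ambient context fields (in the real services these would come from middleware).
using (LogContext.PushProperty("correlation_id", "corr-abc123def456"))
using (LogContext.PushProperty("device_id", "DEV-PM100-0042"))
{
    // Structured event-journal entry; EventType/Outcome etc. become JSON fields.
    Log.Information(
        "[EventJournal] {EventType} | correlation={CorrelationId} event={EventId} outcome={Outcome}",
        "device.alert.received", "corr-abc123def456", "evt-001", "success");
}

Log.CloseAndFlush();
```

Because the template string itself is preserved in `@mt`, every occurrence of the same log statement shares one queryable identity in Elasticsearch or Datadog, regardless of the per-event values.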

Repository Structure

```text
.
├── README.md                                     # This file
├── global.json                                   # .NET SDK version
├── HealthcareIoT.sln                             # .NET solution file
├── src/
│   ├── HealthcareIoT.Logging.Shared/             # Shared structured logging library
│   │   ├── ObservabilityExtensions.cs            # One-call OTel + Serilog + ES/DD setup
│   │   ├── CorrelationMiddleware.cs              # HTTP correlation header propagation
│   │   ├── CorrelationContext.cs                 # Business correlation context
│   │   ├── EventEnvelope.cs                      # Distributed event log schema
│   │   ├── EventLogger.cs                        # Event journal structured logger
│   │   ├── LoggingConstants.cs                   # Standardized field names
│   │   └── DeviceAlertRequest.cs                 # Shared DTOs
│   ├── HealthcareIoT.Ingestion/                  # Ingestion microservice
│   ├── HealthcareIoT.Validation/                 # Validation microservice
│   └── HealthcareIoT.Notification/               # Notification microservice
├── k8s/
│   ├── base/namespace.yaml                       # K8s namespaces
│   ├── otel/otel-collector-daemonset.yaml        # OTel Collector DaemonSet + RBAC
│   ├── elasticsearch/elasticsearch.yaml          # ES + Kibana StatefulSet
│   ├── datadog/datadog-agent.yaml                # Datadog Agent DaemonSet
│   └── services/                                 # Service deployments
├── scripts/
│   ├── deploy.sh                                 # One-command deployment
│   └── test-pipeline.sh                          # End-to-end test
├── docs/
│   ├── observability-overview.md                 # High-level architecture and philosophy
│   ├── opentelemetry-logging-foundations.md      # OTel logging concepts deep dive
│   ├── distributed-event-logging.md              # Distributed event-log tracking across services
│   ├── structured-logging-strategy.md            # Structured logging contract and schema design
│   ├── kubernetes-pod-log-collection.md          # K8s pod log collection patterns
│   ├── legacy-filesystem-log-collection.md       # Legacy file-based log collection
│   ├── legacy-log-transformation.md              # Transforming legacy logs to structured format
│   ├── elk-architecture.md                       # ELK pipeline architecture
│   ├── datadog-architecture.md                   # Datadog pipeline architecture
│   ├── hybrid-modern-and-legacy-logging.md       # Unified hybrid observability architecture
│   ├── log-correlation-strategy.md               # Correlation IDs, trace context, troubleshooting
│   ├── medical-device-platform-considerations.md # Healthcare IoT domain specifics
│   ├── security-and-compliance-for-logs.md       # PHI/PII, HIPAA, encryption, access control
│   ├── cost-cardinality-and-retention.md         # Cost management, cardinality, retention tiers
│   ├── deployment-patterns-for-collectors.md     # DaemonSet vs sidecar vs gateway patterns
│   ├── incident-debugging-playbook.md            # Incident response with logs
│   ├── elk-vs-datadog.md                         # Detailed comparison
│   └── staff-level-cheatsheet.md                 # Quick-reference cheat sheets
├── configs/                                      # Reference OTel/legacy/ELK configs
│   ├── otel/
│   │   ├── collector-k8s-daemonset.yaml          # OTel Collector config for K8s pod logs
│   │   ├── collector-gateway.yaml                # Gateway collector config
│   │   └── collector-legacy-filelog.yaml         # Filelog receiver for legacy systems
│   ├── legacy/
│   │   ├── filebeat-legacy.yaml                  # Filebeat config for legacy tailing
│   │   └── logstash-transform-pipeline.conf      # Logstash transformation pipeline
│   ├── elk/
│   │   └── elasticsearch-ilm-policy.json         # Index lifecycle management
│   └── datadog/
│       └── datadog-agent-otel.yaml               # Datadog agent with OTLP ingestion
└── diagrams/
    └── architecture-diagrams.md                  # All Mermaid diagrams in one reference file
```

How to Use This Repository

  1. Start with docs/observability-overview.md for the big picture
  2. Deep dive into specific topics based on your interest
  3. Study the diagrams in diagrams/architecture-diagrams.md and inline in each doc
  4. Review configs in configs/ to see realistic collector/pipeline configurations
  5. Use the cheat sheets in docs/staff-level-cheatsheet.md for revision
  6. Follow the incident playbook in docs/incident-debugging-playbook.md for practical troubleshooting patterns

A suggested reading order:

| Order | Document | Why |
|---|---|---|
| 1 | observability-overview.md | Understand the full architecture |
| 2 | opentelemetry-logging-foundations.md | Understand OTel logging primitives |
| 3 | distributed-event-logging.md | Understand event tracking across services |
| 4 | structured-logging-strategy.md | Understand the logging contract |
| 5 | kubernetes-pod-log-collection.md | Understand modern log collection |
| 6 | legacy-filesystem-log-collection.md | Understand legacy log collection |
| 7 | legacy-log-transformation.md | Understand transformation pipelines |
| 8 | log-correlation-strategy.md | Understand correlation and debugging |
| 9 | elk-architecture.md | Understand ELK pipeline |
| 10 | datadog-architecture.md | Understand Datadog pipeline |
| 11 | hybrid-modern-and-legacy-logging.md | Understand the unified architecture |
| 12 | deployment-patterns-for-collectors.md | Understand deployment tradeoffs |
| 13 | medical-device-platform-considerations.md | Understand domain specifics |
| 14 | security-and-compliance-for-logs.md | Understand compliance requirements |
| 15 | cost-cardinality-and-retention.md | Understand cost and scale |
| 16 | elk-vs-datadog.md | Compare backends |
| 17 | incident-debugging-playbook.md | Practice troubleshooting |
| 18 | staff-level-cheatsheet.md | Quick revision |

High-Level Architecture (Quick View)

See diagrams/architecture-diagrams.md for the full set of Mermaid architecture diagrams; the implementation-level flow is shown in the Architecture (Implementation) section above.

Key Principles

  1. Logs are a first-class signal — not an afterthought bolted onto metrics and traces
  2. Structured logging is non-negotiable at scale — unstructured text cannot be queried reliably across 1M devices
  3. Distributed event logs are distinct from diagnostic logs — they serve different consumers and have different retention/immutability requirements
  4. Legacy systems must be integrated, not ignored — transformation pipelines bridge the gap
  5. Correlation is the superpower — trace IDs, correlation IDs, and device event IDs make logs useful during incidents
  6. Cost is an architecture concern — at 1M devices, every field, every log level, every retention day has a dollar cost
  7. Compliance is not optional — PHI/PII redaction, audit trails, and access control are first-order design constraints in healthcare
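
Principle 6 is easy to quantify with a back-of-envelope calculation. The rates and sizes below are illustrative assumptions, not figures from this repository:

```csharp
using System;

// Back-of-envelope daily log volume at fleet scale. All inputs are assumptions:
// 1M devices, 1 structured log line per device per minute, ~500 bytes per JSON line.
const long devices = 1_000_000;
const long linesPerDevicePerMinute = 1;
const long bytesPerLine = 500;

long bytesPerDay = devices * linesPerDevicePerMinute * bytesPerLine * 60 * 24;
double gbPerDay = bytesPerDay / 1e9;

Console.WriteLine($"{gbPerDay} GB/day");   // 720 GB/day, before replication and indexing overhead
```

At that volume, adding one extra 50-byte field to every line adds roughly 10% to ingest, which is why the cost doc treats field budgets, log levels, and retention tiers as architecture decisions rather than tuning knobs.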

License

This is an educational learning package. Use freely for learning, training, and internal architecture discussions.