
JSON (JavaScript Object Notation) -- Deep Dive

One-Paragraph Summary

JSON is a lightweight, self-describing, human-readable data interchange format that has become the default for web APIs, configuration files, and event payloads. Its support for nested structures, arrays, and basic types (string, number, boolean, null) makes it far more expressive than CSV, but its text-based encoding, lack of a built-in schema, and row-oriented nature make it a poor choice for large-scale analytics. JSON Lines (JSONL) extends JSON's utility to streaming use cases by placing one JSON object per line.


What Is JSON?

JSON (JavaScript Object Notation) is a text-based format for representing structured data based on JavaScript object syntax. Despite its name, JSON is language-independent and supported by virtually every programming language.

json
{
  "user_id": 42,
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "signup_date": "2024-01-15T10:30:00Z",
  "active": true,
  "tags": ["premium", "early-adopter"],
  "address": {
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  }
}
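
Because every mainstream language ships a JSON parser, working with a document like the one above is typically a one-liner. A minimal Python sketch (field values taken from the example above):

python
import json

raw = '{"user_id": 42, "name": "Alice Johnson", "active": true, "tags": ["premium", "early-adopter"]}'

# Parse text into native structures: dicts, lists, str, int/float, bool, None
user = json.loads(raw)
print(user["name"], user["tags"][0])   # Alice Johnson premium

# Serialize back to text; indent=2 produces the human-readable form shown above
print(json.dumps(user, indent=2))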

JSON Data Types

Type    | Example               | Notes
string  | "Alice"               | Double-quoted Unicode text
number  | 42, 3.14, 1e9         | No separate integer/float types; most parsers decode as IEEE 754 doubles
boolean | true, false           | Lowercase literals only
null    | null                  | Explicit absence of a value
object  | {"city": "SF"}        | Collection of key/value pairs
array   | ["premium", 42]       | Ordered list; elements may mix types

JSON Lines (JSONL) for Streaming

JSON Lines (also called JSONL or newline-delimited JSON) places one valid JSON object per line. This solves JSON's biggest streaming problem: you do not need to parse the entire file to process individual records.

jsonl
{"event":"click","user_id":42,"timestamp":"2024-01-15T10:30:00Z","page":"/home"}
{"event":"purchase","user_id":42,"timestamp":"2024-01-15T10:35:00Z","amount":99.99}
{"event":"click","user_id":87,"timestamp":"2024-01-15T10:36:00Z","page":"/products"}

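Because each line is a complete document, a file of any size can be processed with constant memory. A minimal sketch over the sample records above (assuming they are saved as events.jsonl):

python
import json

# Process one record at a time -- no need to load or parse the whole file
with open("events.jsonl", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)
        if event["event"] == "purchase":
            print(event["user_id"], event["amount"])
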
Why JSONL Matters

Property           | JSON (single array) | JSON Lines
Streaming parse    | No (need full file) | Yes (line-by-line)
Append-friendly    | No (must rewrite)   | Yes (append lines)
Splittable (Spark) | No                  | Yes
File concatenation | No (invalid JSON)   | Yes (just cat)
Partial failure    | Lose entire file    | Lose one record

Schema Validation (JSON Schema)

JSON has no built-in schema, but JSON Schema provides a vocabulary for annotating and validating JSON documents.

json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["user_id", "name", "email"],
  "properties": {
    "user_id": { "type": "integer", "minimum": 1 },
    "name": { "type": "string", "minLength": 1 },
    "email": { "type": "string", "format": "email" },
    "active": { "type": "boolean", "default": true },
    "tags": {
      "type": "array",
      "items": { "type": "string" }
    }
  },
  "additionalProperties": false
}

Schema Validation Workflow

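The typical workflow validates documents against the schema at the producer and again at the ingestion boundary, rejecting records that fail. A minimal sketch using the Python jsonschema package (the package choice and the user.schema.json filename are assumptions; any JSON Schema validator follows the same pattern):

python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

with open("user.schema.json", encoding="utf-8") as f:   # the schema shown above
    schema = json.load(f)

record = {"user_id": 42, "name": "Alice Johnson", "email": "alice@example.com"}

try:
    validate(instance=record, schema=schema)   # raises if the document violates the schema
except ValidationError as err:
    print(f"Rejected record: {err.message}")
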
Key limitation: JSON Schema is opt-in and enforced at the application layer. Unlike Avro, there is no wire-level schema enforcement.


Row vs. Columnar

JSON is row-oriented. Each object contains all fields for a single entity. This means:

  • Full record retrieval is natural (one parse per object)
  • Column-level operations (e.g., SELECT AVG(amount)) require reading every object and extracting the field (see the sketch below)
  • No column pruning possible at the storage layer
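
For example, a query as simple as SELECT AVG(amount) over JSONL data still has to parse every byte of every record, because the amount values are interleaved with every other field. A minimal sketch (filename is illustrative):

python
import json

# Averaging one field still requires parsing every full record
total, count = 0.0, 0
with open("events.jsonl", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)   # whole object parsed just to read one field
        if "amount" in event:
            total += event["amount"]
            count += 1

print(total / count if count else None)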

Schema Support

Self-describing but not schema-enforced. Each JSON object carries its own field names, which means:

  • Field names are repeated in every record (storage overhead)
  • Different records in the same collection can have different fields
  • No type enforcement (a field can be a string in one record and a number in another)
  • JSON Schema provides external validation but is not part of the format

Schema Evolution

Implicit, not managed. JSON's flexibility allows:

  • Adding new fields (consumers ignore unknown fields)
  • Making fields optional (missing fields are treated as absent)
  • No mechanism for removing fields safely
  • No compatibility guarantees without external tooling

This "schema-on-read" approach works for small teams but becomes dangerous at scale without governance.
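
In practice this means every consumer must defend itself: ignore unknown fields and supply defaults for missing ones. A minimal sketch of a tolerant reader (field names are illustrative):

python
import json

def parse_user(raw: str) -> dict:
    doc = json.loads(raw)
    # Schema-on-read: pick out the fields we need, supply defaults for the rest,
    # and silently ignore anything the producer added after this code was written.
    return {
        "user_id": doc["user_id"],          # required -- fail loudly if missing
        "name": doc.get("name", ""),
        "active": doc.get("active", True),
        "tags": doc.get("tags", []),
    }

print(parse_user('{"user_id": 42, "plan": "pro"}'))  # unknown "plan" field is ignored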


Compression

No built-in compression. JSON is highly repetitive (field names, braces, quotes) and compresses well:

Format    | Typical Ratio | Notes
Gzip      | 8-15x         | Good for JSON due to repeated keys
Zstandard | 10-18x        | Better ratio, faster decompression
LZ4       | 4-7x          | Fastest, lowest ratio
Brotli    | 12-20x        | Best ratio, slower compression

Pro tip: JSON compresses better than CSV because field names (repeated in every record) create highly predictable patterns for dictionary-based compressors.
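
A quick way to see this is to compress a JSONL file and compare sizes. A minimal sketch using Python's standard gzip module (the filename is illustrative; actual ratios depend on the data):

python
import gzip
import os
import shutil

# Compress events.jsonl -> events.jsonl.gz and compare sizes
with open("events.jsonl", "rb") as src, gzip.open("events.jsonl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

raw = os.path.getsize("events.jsonl")
packed = os.path.getsize("events.jsonl.gz")
print(f"{raw} -> {packed} bytes ({raw / packed:.1f}x)")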


Human Readability

Very good. JSON is easy to read and write for humans:

  • Pretty-printing with jq . makes nested structures clear
  • Widely supported in IDEs with syntax highlighting
  • Browser developer tools render JSON natively
  • jq provides powerful command-line querying
bash
# Pretty-print
cat data.json | jq .

# Extract specific field
cat events.jsonl | jq -r '.user_id'

# Filter and transform
cat events.jsonl | jq 'select(.event == "purchase") | .amount'

Machine Efficiency

Moderate.

  • Text-based encoding is slower to parse than binary formats (Avro, Protobuf)
  • Field names repeated in every record waste storage and bandwidth
  • No native date, timestamp, or binary types (dates are strings, binary is Base64)
  • Numbers have no fixed precision (floating-point ambiguity)
  • Parsing nested structures is more expensive than flat records

Benchmark context: JSON parsing is typically 2-5x slower than Avro deserialization and 10-30x slower than reading equivalent Parquet data for analytical queries.


Streaming Suitability

Good (with JSONL).

  • JSONL is ideal for event streaming, log aggregation, and message queues
  • Each line is independently parseable
  • Natural fit for HTTP streaming APIs (Server-Sent Events)
  • Kafka commonly uses JSON for message values (though Avro is preferred at scale); see the producer sketch below
  • Append-only writes are trivial
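
A minimal sketch of a JSON-encoded Kafka producer, assuming the kafka-python client and a broker at localhost:9092 (both are assumptions, not part of this doc):

python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Serialize each message value as UTF-8 JSON before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("clickstream", {"event": "click", "user_id": 42, "page": "/home"})
producer.flush()  # block until the message is actually delivered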

Limitations:

  • No schema registry integration (unlike Avro)
  • Schema drift is silent and dangerous
  • Larger wire size than binary alternatives

Batch Analytics Suitability

Poor.

  • Full file scan required for every query
  • No column pruning or predicate pushdown
  • Repeated field names waste I/O bandwidth
  • Nested structures require recursive parsing
  • No built-in statistics (min/max/count/null_count)

For analytics, convert JSON/JSONL to Parquet at ingestion time.
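
A common pattern is to land raw JSONL and immediately rewrite it as Parquet. A minimal sketch using pandas and pyarrow (the library choice and filenames are assumptions; Spark or DuckDB work equally well):

python
import pandas as pd  # pip install pandas pyarrow

# Read newline-delimited JSON, then write a columnar, compressed Parquet file
df = pd.read_json("events.jsonl", lines=True)
df.to_parquet("events.parquet", engine="pyarrow", compression="zstd")

# Downstream engines can now prune columns instead of scanning whole records
print(pd.read_parquet("events.parquet", columns=["user_id", "amount"]).head())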


When to Use JSON

  1. REST APIs -- request/response payloads
  2. Configuration files -- application settings, feature flags
  3. Event payloads -- webhook bodies, notification data
  4. Small-to-medium datasets (< 500 MB) where human readability matters
  5. Document databases -- MongoDB, CouchDB, Elasticsearch
  6. Message queues -- Kafka, RabbitMQ, SQS (for simplicity)
  7. Logging -- structured logs in JSONL format (ELK stack)
  8. Data exchange between teams with no shared schema infrastructure

When NOT to Use JSON

  1. Large-scale analytics (> 1 GB) -- use Parquet or ORC
  2. High-throughput streaming -- use Avro with schema registry
  3. Columnar queries (aggregations on few columns) -- use Parquet
  4. Bandwidth-constrained environments -- use Protobuf or Avro (2-5x smaller)
  5. Schema evolution at scale -- use Avro with compatibility rules
  6. Binary data -- Base64 encoding adds 33% overhead
  7. Fixed-schema, high-volume pipelines -- binary formats are strictly better

Example Workloads

Workload                      | JSON Fit  | Better Alternative
REST API responses            | Excellent | --
Application config            | Excellent | --
Structured logging (JSONL)    | Good      | --
Kafka event streaming         | Moderate  | Avro + Schema Registry
Data lake storage             | Poor      | Parquet
Clickstream analytics         | Poor      | Parquet
IoT sensor data (high volume) | Poor      | Avro or Protobuf
ML feature store              | Poor      | Parquet

Tradeoffs Table

Property               | Rating | Notes
Human readability      | 5/5    | Ubiquitous, well-tooled (jq)
Schema enforcement     | 2/5    | JSON Schema exists but is opt-in
Schema evolution       | 2/5    | Flexible but ungoverned
Compression efficiency | 2/5    | Repetitive keys compress well, but still text
Query performance      | 1/5    | Full scan, no pruning
Streaming support      | 4/5    | JSONL is excellent for streaming
Ecosystem support      | 5/5    | Universal language support
Nested data            | 5/5    | First-class support
Type safety            | 2/5    | Basic types, no dates/timestamps
Write speed            | 4/5    | Simple text serialization

Common Mistakes

1. Using JSON for data lake storage

Every analytical query does a full scan. Convert to Parquet on landing.

2. No schema validation in pipelines

Without JSON Schema validation, producers can silently change field names, types, or structure. Always validate at ingestion boundaries.

3. Floating-point precision loss

The JSON grammar places no limit on numeric precision, but JavaScript and many other parsers decode numbers as IEEE 754 doubles, so integers above 2^53 silently lose precision. Use strings for large IDs:

json
{"order_id": "9007199254740993"}
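
A quick demonstration of the cutoff (Python's own json module keeps integers exact, but consumers that decode numbers as 64-bit doubles do not):

python
import json

big = 9007199254740993  # 2**53 + 1

# IEEE 754 doubles cannot distinguish 2**53 from 2**53 + 1
print(float(big) == float(big - 1))        # True -- precision silently lost

# Sending the ID as a string avoids the problem for every consumer
print(json.dumps({"order_id": str(big)}))  # {"order_id": "9007199254740993"}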

4. Date/time as unformatted strings

JSON has no date type. Without a convention, you get "2024-01-15", "01/15/2024", "1705305600", and "Jan 15, 2024" in the same dataset. Standardize on ISO 8601.
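
A minimal sketch of the convention: serialize timestamps as ISO 8601 with an explicit UTC offset, and parse them back the same way everywhere:

python
import json
from datetime import datetime, timezone

# Serialize: ISO 8601 with an explicit UTC offset
event = {"signup_date": datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc).isoformat()}
print(json.dumps(event))  # {"signup_date": "2024-01-15T10:30:00+00:00"}

# Parse: one unambiguous format, no guessing
# (note: fromisoformat only accepts the "Z" suffix on Python 3.11+)
ts = datetime.fromisoformat(event["signup_date"])
print(ts.year, ts.tzinfo)  # 2024 UTC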

5. Single large JSON array instead of JSONL

A single JSON array [{...}, {...}, ...] cannot be streamed, appended, or split. Use JSONL for record-oriented data.
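
A one-time conversion fixes this. A minimal sketch that rewrites a single-array file as JSONL (filenames are illustrative):

python
import json

# Convert a single JSON array file into newline-delimited JSON
with open("records.json", encoding="utf-8") as src:
    records = json.load(src)            # the one unavoidable full parse

with open("records.jsonl", "w", encoding="utf-8") as dst:
    for record in records:
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")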

6. Ignoring encoding

JSON must be UTF-8 (per RFC 8259). Producing JSON in other encodings causes silent corruption.

7. Deep nesting without limits

Deeply nested JSON (10+ levels) is hard to query, hard to flatten, and often indicates a design problem.


Interview Framing

"JSON is the default data format for the web -- APIs, configs, and event payloads all use it. Its self-describing nature and support for nested structures make it extremely versatile for data interchange. However, for analytical workloads, JSON's text encoding, repeated field names, and lack of columnar access make it 10-30x slower than Parquet. In my pipelines, JSON is the format data arrives in (API responses, Kafka messages), and Parquet is the format data lives in (data lake, query engine). The conversion boundary is the ETL/ELT ingestion layer."

Follow-up points:

  • JSONL vs JSON arrays for streaming (splittability, append-friendly)
  • JSON Schema for contract enforcement at API boundaries
  • Why Avro replaces JSON in high-throughput Kafka pipelines (schema registry, binary encoding, 2-5x smaller)

Top 5 Use Cases

  1. REST API payloads -- the universal request/response format
  2. Structured logging -- JSONL to ELK/Datadog/Splunk for searchable logs
  3. Configuration files -- application settings, infrastructure-as-code parameters
  4. Document databases -- MongoDB, CouchDB, Elasticsearch native format
  5. Webhook/event payloads -- GitHub webhooks, Stripe events, Slack notifications

Top 5 Warning Signs You Should Switch Away from JSON

  1. Your JSONL files exceed 1 GB -- convert to Parquet for analytics, Avro for streaming
  2. Schema drift is causing production incidents -- adopt Avro with a schema registry
  3. Kafka consumer lag is growing -- JSON serialization overhead may be the bottleneck
  4. Analytical queries scan TBs of JSON in S3 -- column pruning with Parquet could reduce I/O by 90%
  5. You are Base64-encoding binary data in JSON -- use a format with native binary support
