
JSON (JavaScript Object Notation) -- Deep Dive

One-Paragraph Summary

JSON is a lightweight, self-describing, human-readable data interchange format that has become the default for web APIs, configuration files, and event payloads. Its support for nested structures, arrays, and basic types (string, number, boolean, null) makes it far more expressive than CSV, but its text-based encoding, lack of a built-in schema, and row-oriented nature make it a poor choice for large-scale analytics. JSON Lines (JSONL) extends JSON's utility to streaming use cases by placing one JSON object per line.


What Is JSON?

JSON (JavaScript Object Notation) is a text-based format for representing structured data based on JavaScript object syntax. Despite its name, JSON is language-independent and supported by virtually every programming language.

json
{
  "user_id": 42,
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "signup_date": "2024-01-15T10:30:00Z",
  "active": true,
  "tags": ["premium", "early-adopter"],
  "address": {
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  }
}
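
Because every mainstream language ships a JSON parser, working with a document like the one above is typically a one-liner. A minimal Python sketch (field values taken from the example above):

python
import json

raw = '{"user_id": 42, "name": "Alice Johnson", "active": true, "tags": ["premium", "early-adopter"]}'

# Parse text into native structures: dicts, lists, str, int/float, bool, None
user = json.loads(raw)
print(user["name"], user["tags"][0])   # Alice Johnson premium

# Serialize back to text; indent=2 produces the human-readable form shown above
print(json.dumps(user, indent=2))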

JSON Data Types

Type    | Example               | Notes
string  | "Alice"               | Double-quoted Unicode text
number  | 42, 3.14, 1e9         | No separate integer/float types; most parsers decode as IEEE 754 doubles
boolean | true, false           | Lowercase literals only
null    | null                  | Explicit absence of a value
object  | {"city": "SF"}        | Collection of key/value pairs
array   | ["premium", 42]       | Ordered list; elements may mix types

JSON Lines (JSONL) for Streaming

JSON Lines (also called JSONL or newline-delimited JSON) places one valid JSON object per line. This solves JSON's biggest streaming problem: you do not need to parse the entire file to process individual records.

jsonl
{"event":"click","user_id":42,"timestamp":"2024-01-15T10:30:00Z","page":"/home"}
{"event":"purchase","user_id":42,"timestamp":"2024-01-15T10:35:00Z","amount":99.99}
{"event":"click","user_id":87,"timestamp":"2024-01-15T10:36:00Z","page":"/products"}

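Because each line is a complete document, a file of any size can be processed with constant memory. A minimal sketch over the sample records above (assuming they are saved as events.jsonl):

python
import json

# Process one record at a time -- no need to load or parse the whole file
with open("events.jsonl", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)
        if event["event"] == "purchase":
            print(event["user_id"], event["amount"])
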
Why JSONL Matters

Property           | JSON (single array) | JSON Lines
Streaming parse    | No (need full file) | Yes (line-by-line)
Append-friendly    | No (must rewrite)   | Yes (append lines)
Splittable (Spark) | No                  | Yes
File concatenation | No (invalid JSON)   | Yes (just cat)
Partial failure    | Lose entire file    | Lose one record

Schema Validation (JSON Schema)

JSON has no built-in schema, but JSON Schema provides a vocabulary for annotating and validating JSON documents.

json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["user_id", "name", "email"],
  "properties": {
    "user_id": { "type": "integer", "minimum": 1 },
    "name": { "type": "string", "minLength": 1 },
    "email": { "type": "string", "format": "email" },
    "active": { "type": "boolean", "default": true },
    "tags": {
      "type": "array",
      "items": { "type": "string" }
    }
  },
  "additionalProperties": false
}

Schema Validation Workflow

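The typical workflow validates documents against the schema at the producer and again at the ingestion boundary, rejecting records that fail. A minimal sketch using the Python jsonschema package (the package choice and the user.schema.json filename are assumptions; any JSON Schema validator follows the same pattern):

python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

with open("user.schema.json", encoding="utf-8") as f:   # the schema shown above
    schema = json.load(f)

record = {"user_id": 42, "name": "Alice Johnson", "email": "alice@example.com"}

try:
    validate(instance=record, schema=schema)   # raises if the document violates the schema
except ValidationError as err:
    print(f"Rejected record: {err.message}")
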
Key limitation: JSON Schema is opt-in and enforced at the application layer. Unlike Avro, there is no wire-level schema enforcement.


Row vs. Columnar

JSON is row-oriented. Each object contains all fields for a single entity. This means:

  • Full record retrieval is natural (one parse per object)
  • Column-level operations (e.g., SELECT AVG(amount)) require reading every object and extracting the field (see the sketch below)
  • No column pruning possible at the storage layer
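
For example, a query as simple as SELECT AVG(amount) over JSONL data still has to parse every byte of every record, because the amount values are interleaved with every other field. A minimal sketch (filename is illustrative):

python
import json

# Averaging one field still requires parsing every full record
total, count = 0.0, 0
with open("events.jsonl", encoding="utf-8") as f:
    for line in f:
        event = json.loads(line)   # whole object parsed just to read one field
        if "amount" in event:
            total += event["amount"]
            count += 1

print(total / count if count else None)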

Schema Support

Self-describing but not schema-enforced. Each JSON object carries its own field names, which means:

  • Field names are repeated in every record (storage overhead)
  • Different records in the same collection can have different fields
  • No type enforcement (a field can be a string in one record and a number in another)
  • JSON Schema provides external validation but is not part of the format

Schema Evolution

Implicit, not managed. JSON's flexibility allows:

  • Adding new fields (consumers ignore unknown fields)
  • Making fields optional (missing fields are treated as absent)
  • No mechanism for removing fields safely
  • No compatibility guarantees without external tooling

This "schema-on-read" approach works for small teams but becomes dangerous at scale without governance.
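
In practice this means every consumer must defend itself: ignore unknown fields and supply defaults for missing ones. A minimal sketch of a tolerant reader (field names are illustrative):

python
import json

def parse_user(raw: str) -> dict:
    doc = json.loads(raw)
    # Schema-on-read: pick out the fields we need, supply defaults for the rest,
    # and silently ignore anything the producer added after this code was written.
    return {
        "user_id": doc["user_id"],          # required -- fail loudly if missing
        "name": doc.get("name", ""),
        "active": doc.get("active", True),
        "tags": doc.get("tags", []),
    }

print(parse_user('{"user_id": 42, "plan": "pro"}'))  # unknown "plan" field is ignored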


Compression

No built-in compression. JSON is highly repetitive (field names, braces, quotes) and compresses well:

Format    | Typical Ratio | Notes
Gzip      | 8-15x         | Good for JSON due to repeated keys
Zstandard | 10-18x        | Better ratio, faster decompression
LZ4       | 4-7x          | Fastest, lowest ratio
Brotli    | 12-20x        | Best ratio, slower compression

Pro tip: JSON compresses better than CSV because field names (repeated in every record) create highly predictable patterns for dictionary-based compressors.
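
A quick way to see this is to compress a JSONL file and compare sizes. A minimal sketch using Python's standard gzip module (the filename is illustrative; actual ratios depend on the data):

python
import gzip
import os
import shutil

# Compress events.jsonl -> events.jsonl.gz and compare sizes
with open("events.jsonl", "rb") as src, gzip.open("events.jsonl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

raw = os.path.getsize("events.jsonl")
packed = os.path.getsize("events.jsonl.gz")
print(f"{raw} -> {packed} bytes ({raw / packed:.1f}x)")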


Human Readability

Very good. JSON is easy to read and write for humans:

  • Pretty-printing with jq . makes nested structures clear
  • Widely supported in IDEs with syntax highlighting
  • Browser developer tools render JSON natively
  • jq provides powerful command-line querying
bash
# Pretty-print
cat data.json | jq .

# Extract specific field
cat events.jsonl | jq -r '.user_id'

# Filter and transform
cat events.jsonl | jq 'select(.event == "purchase") | .amount'

Machine Efficiency

Moderate.

  • Text-based encoding is slower to parse than binary formats (Avro, Protobuf)
  • Field names repeated in every record waste storage and bandwidth
  • No native date, timestamp, or binary types (dates are strings, binary is Base64)
  • Numbers have no fixed precision (floating-point ambiguity)
  • Parsing nested structures is more expensive than flat records

Benchmark context: JSON parsing is typically 2-5x slower than Avro deserialization and 10-30x slower than reading equivalent Parquet data for analytical queries.


Streaming Suitability

Good (with JSONL).

  • JSONL is ideal for event streaming, log aggregation, and message queues
  • Each line is independently parseable
  • Natural fit for HTTP streaming APIs (Server-Sent Events)
  • Kafka commonly uses JSON for message values (though Avro is preferred at scale); see the producer sketch below
  • Append-only writes are trivial
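
A minimal sketch of a JSON-encoded Kafka producer, assuming the kafka-python client and a broker at localhost:9092 (both are assumptions, not part of this doc):

python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Serialize each message value as UTF-8 JSON before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("clickstream", {"event": "click", "user_id": 42, "page": "/home"})
producer.flush()  # block until the message is actually delivered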

Limitations:

  • No schema registry integration (unlike Avro)
  • Schema drift is silent and dangerous
  • Larger wire size than binary alternatives

Batch Analytics Suitability

Poor.

  • Full file scan required for every query
  • No column pruning or predicate pushdown
  • Repeated field names waste I/O bandwidth
  • Nested structures require recursive parsing
  • No built-in statistics (min/max/count/null_count)

For analytics, convert JSON/JSONL to Parquet at ingestion time.
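
A common pattern is to land raw JSONL and immediately rewrite it as Parquet. A minimal sketch using pandas and pyarrow (the library choice and filenames are assumptions; Spark or DuckDB work equally well):

python
import pandas as pd  # pip install pandas pyarrow

# Read newline-delimited JSON, then write a columnar, compressed Parquet file
df = pd.read_json("events.jsonl", lines=True)
df.to_parquet("events.parquet", engine="pyarrow", compression="zstd")

# Downstream engines can now prune columns instead of scanning whole records
print(pd.read_parquet("events.parquet", columns=["user_id", "amount"]).head())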


When to Use JSON

  1. REST APIs -- request/response payloads
  2. Configuration files -- application settings, feature flags
  3. Event payloads -- webhook bodies, notification data
  4. Small-to-medium datasets (< 500 MB) where human readability matters
  5. Document databases -- MongoDB, CouchDB, Elasticsearch
  6. Message queues -- Kafka, RabbitMQ, SQS (for simplicity)
  7. Logging -- structured logs in JSONL format (ELK stack)
  8. Data exchange between teams with no shared schema infrastructure

When NOT to Use JSON

  1. Large-scale analytics (> 1 GB) -- use Parquet or ORC
  2. High-throughput streaming -- use Avro with schema registry
  3. Columnar queries (aggregations on few columns) -- use Parquet
  4. Bandwidth-constrained environments -- use Protobuf or Avro (2-5x smaller)
  5. Schema evolution at scale -- use Avro with compatibility rules
  6. Binary data -- Base64 encoding adds 33% overhead
  7. Fixed-schema, high-volume pipelines -- binary formats are strictly better

Example Workloads

Workload                      | JSON Fit  | Better Alternative
REST API responses            | Excellent | --
Application config            | Excellent | --
Structured logging (JSONL)    | Good      | --
Kafka event streaming         | Moderate  | Avro + Schema Registry
Data lake storage             | Poor      | Parquet
Clickstream analytics         | Poor      | Parquet
IoT sensor data (high volume) | Poor      | Avro or Protobuf
ML feature store              | Poor      | Parquet

Tradeoffs Table

Property               | Rating | Notes
Human readability      | 5/5    | Ubiquitous, well-tooled (jq)
Schema enforcement     | 2/5    | JSON Schema exists but is opt-in
Schema evolution       | 2/5    | Flexible but ungoverned
Compression efficiency | 2/5    | Repetitive keys compress well, but still text
Query performance      | 1/5    | Full scan, no pruning
Streaming support      | 4/5    | JSONL is excellent for streaming
Ecosystem support      | 5/5    | Universal language support
Nested data            | 5/5    | First-class support
Type safety            | 2/5    | Basic types, no dates/timestamps
Write speed            | 4/5    | Simple text serialization

Common Mistakes

1. Using JSON for data lake storage

Every analytical query does a full scan. Convert to Parquet on landing.

2. No schema validation in pipelines

Without JSON Schema validation, producers can silently change field names, types, or structure. Always validate at ingestion boundaries.

3. Floating-point precision loss

The JSON grammar places no limit on numeric precision, but JavaScript and many other parsers decode numbers as IEEE 754 doubles, so integers above 2^53 silently lose precision. Use strings for large IDs:

json
{"order_id": "9007199254740993"}
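
A quick demonstration of the cutoff (Python's own json module keeps integers exact, but consumers that decode numbers as 64-bit doubles do not):

python
import json

big = 9007199254740993  # 2**53 + 1

# IEEE 754 doubles cannot distinguish 2**53 from 2**53 + 1
print(float(big) == float(big - 1))        # True -- precision silently lost

# Sending the ID as a string avoids the problem for every consumer
print(json.dumps({"order_id": str(big)}))  # {"order_id": "9007199254740993"}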

4. Date/time as unformatted strings

JSON has no date type. Without a convention, you get "2024-01-15", "01/15/2024", "1705305600", and "Jan 15, 2024" in the same dataset. Standardize on ISO 8601.
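
A minimal sketch of the convention: serialize timestamps as ISO 8601 with an explicit UTC offset, and parse them back the same way everywhere:

python
import json
from datetime import datetime, timezone

# Serialize: ISO 8601 with an explicit UTC offset
event = {"signup_date": datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc).isoformat()}
print(json.dumps(event))  # {"signup_date": "2024-01-15T10:30:00+00:00"}

# Parse: one unambiguous format, no guessing
# (note: fromisoformat only accepts the "Z" suffix on Python 3.11+)
ts = datetime.fromisoformat(event["signup_date"])
print(ts.year, ts.tzinfo)  # 2024 UTC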

5. Single large JSON array instead of JSONL

A single JSON array [{...}, {...}, ...] cannot be streamed, appended, or split. Use JSONL for record-oriented data.
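
A one-time conversion fixes this. A minimal sketch that rewrites a single-array file as JSONL (filenames are illustrative):

python
import json

# Convert a single JSON array file into newline-delimited JSON
with open("records.json", encoding="utf-8") as src:
    records = json.load(src)            # the one unavoidable full parse

with open("records.jsonl", "w", encoding="utf-8") as dst:
    for record in records:
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")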

6. Ignoring encoding

JSON must be UTF-8 (per RFC 8259). Producing JSON in other encodings causes silent corruption.

7. Deep nesting without limits

Deeply nested JSON (10+ levels) is hard to query, hard to flatten, and often indicates a design problem.


Interview Framing

"JSON is the default data format for the web -- APIs, configs, and event payloads all use it. Its self-describing nature and support for nested structures make it extremely versatile for data interchange. However, for analytical workloads, JSON's text encoding, repeated field names, and lack of columnar access make it 10-30x slower than Parquet. In my pipelines, JSON is the format data arrives in (API responses, Kafka messages), and Parquet is the format data lives in (data lake, query engine). The conversion boundary is the ETL/ELT ingestion layer."

Follow-up points:

  • JSONL vs JSON arrays for streaming (splittability, append-friendly)
  • JSON Schema for contract enforcement at API boundaries
  • Why Avro replaces JSON in high-throughput Kafka pipelines (schema registry, binary encoding, 2-5x smaller)

Top 5 Use Cases

  1. REST API payloads -- the universal request/response format
  2. Structured logging -- JSONL to ELK/Datadog/Splunk for searchable logs
  3. Configuration files -- application settings, infrastructure-as-code parameters
  4. Document databases -- MongoDB, CouchDB, Elasticsearch native format
  5. Webhook/event payloads -- GitHub webhooks, Stripe events, Slack notifications

Top 5 Warning Signs You Should Switch Away from JSON

  1. Your JSONL files exceed 1 GB -- convert to Parquet for analytics, Avro for streaming
  2. Schema drift is causing production incidents -- adopt Avro with a schema registry
  3. Kafka consumer lag is growing -- JSON serialization overhead may be the bottleneck
  4. Analytical queries scan TBs of JSON in S3 -- column pruning with Parquet could reduce I/O by 90%
  5. You are Base64-encoding binary data in JSON -- use a format with native binary support
