JSON (JavaScript Object Notation) -- Deep Dive
One-Paragraph Summary
JSON is a lightweight, self-describing, human-readable data interchange format that has become the default for web APIs, configuration files, and event payloads. Its support for nested structures, arrays, and basic types (string, number, boolean, null) makes it far more expressive than CSV, but its text-based encoding, lack of a built-in schema, and row-oriented nature make it a poor choice for large-scale analytics. JSON Lines (JSONL) extends JSON's utility to streaming use cases by placing one JSON object per line.
What Is JSON?
JSON (JavaScript Object Notation) is a text-based format for representing structured data based on JavaScript object syntax. Despite its name, JSON is language-independent and supported by virtually every programming language.
```json
{
  "user_id": 42,
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "signup_date": "2024-01-15T10:30:00Z",
  "active": true,
  "tags": ["premium", "early-adopter"],
  "address": {
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  }
}
```

JSON Data Types

| Type | Example | Notes |
|---|---|---|
| String | "Alice" | Double-quoted, Unicode |
| Number | 42, 99.99 | No integer/float distinction in the spec |
| Boolean | true, false | Lowercase literals |
| Null | null | Explicit absence of a value |
| Object | {"city": "SF"} | Unordered name/value pairs |
| Array | [1, "a", true] | Ordered; values may mix types |
JSON Lines (JSONL) for Streaming
JSON Lines (also called JSONL or newline-delimited JSON) places one valid JSON object per line. This solves JSON's biggest streaming problem: you do not need to parse the entire file to process individual records.
```json
{"event":"click","user_id":42,"timestamp":"2024-01-15T10:30:00Z","page":"/home"}
{"event":"purchase","user_id":42,"timestamp":"2024-01-15T10:35:00Z","amount":99.99}
{"event":"click","user_id":87,"timestamp":"2024-01-15T10:36:00Z","page":"/products"}
```

Why JSONL Matters
| Property | JSON (single array) | JSON Lines |
|---|---|---|
| Streaming parse | No (need full file) | Yes (line-by-line) |
| Append-friendly | No (must rewrite) | Yes (append lines) |
| Splittable (Spark) | No | Yes |
| File concatenation | No (invalid JSON) | Yes (just cat) |
| Partial failure | Lose entire file | Lose one record |
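The streaming-parse row can be sketched directly: a JSONL reader parses one line at a time, so individual records can be extracted without materializing the whole file. A minimal sketch (the payload is inlined for illustration):

```python
import json
import io

# A JSONL payload: one complete JSON object per line.
jsonl_data = (
    '{"event":"click","user_id":42,"page":"/home"}\n'
    '{"event":"purchase","user_id":42,"amount":99.99}\n'
    '{"event":"click","user_id":87,"page":"/products"}\n'
)

def stream_records(fp):
    """Yield one parsed record per line -- no need to load the whole file."""
    for line in fp:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)

purchases = [
    r for r in stream_records(io.StringIO(jsonl_data))
    if r["event"] == "purchase"
]
print(purchases)  # one record, parsed without touching the others
```

A malformed line would fail only that record's `json.loads` call, matching the "lose one record" row in the table above.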
Schema Validation (JSON Schema)
JSON has no built-in schema, but JSON Schema provides a vocabulary for annotating and validating JSON documents.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["user_id", "name", "email"],
  "properties": {
    "user_id": { "type": "integer", "minimum": 1 },
    "name": { "type": "string", "minLength": 1 },
    "email": { "type": "string", "format": "email" },
    "active": { "type": "boolean", "default": true },
    "tags": {
      "type": "array",
      "items": { "type": "string" }
    }
  },
  "additionalProperties": false
}
```

Schema Validation Workflow
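A common workflow is to validate at the ingestion boundary and route failures aside. Below is a minimal hand-rolled sketch of that check; real pipelines would typically use a dedicated JSON Schema validator library, and the `REQUIRED` map here is a simplified stand-in for the schema above:

```python
import json

# Simplified stand-in for a JSON Schema: required fields and expected types.
REQUIRED = {"user_id": int, "name": str, "email": str}

def validate(record):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = json.loads('{"user_id": 42, "name": "Alice", "email": "a@example.com"}')
bad = json.loads('{"user_id": "42", "name": "Alice"}')
print(validate(good))  # []
print(validate(bad))   # wrong type for user_id, missing email
```

Records with a non-empty error list would go to a dead-letter queue rather than downstream.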
Key limitation: JSON Schema is opt-in and enforced at the application layer. Unlike Avro, there is no wire-level schema enforcement.
Row vs. Columnar
JSON is row-oriented. Each object contains all fields for a single entity. This means:
- Full record retrieval is natural (one parse per object)
- Column-level operations (e.g., `SELECT AVG(amount)`) require reading every object and extracting the field
- No column pruning is possible at the storage layer
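The aggregation cost is visible in code: computing the equivalent of `SELECT AVG(amount)` forces a full parse of every record even though only one field is used. A sketch:

```python
import json

records_jsonl = (
    '{"order_id": 1, "user_id": 42, "amount": 19.99, "status": "paid"}\n'
    '{"order_id": 2, "user_id": 87, "amount": 5.00, "status": "paid"}\n'
    '{"order_id": 3, "user_id": 42, "amount": 75.01, "status": "paid"}\n'
)

# Equivalent of SELECT AVG(amount): every field of every record is parsed
# just to extract one column -- no storage-level pruning is possible.
amounts = [json.loads(line)["amount"] for line in records_jsonl.splitlines()]
avg = sum(amounts) / len(amounts)
print(avg)
```

A columnar format would read only the `amount` column; here the other three fields are parsed and discarded.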
Schema Support
Self-describing but not schema-enforced. Each JSON object carries its own field names, which means:
- Field names are repeated in every record (storage overhead)
- Different records in the same collection can have different fields
- No type enforcement (a field can be a string in one record and a number in another)
- JSON Schema provides external validation but is not part of the format
Schema Evolution
Implicit, not managed. JSON's flexibility allows:
- Adding new fields (consumers ignore unknown fields)
- Making fields optional (missing fields are treated as absent)
- No mechanism for removing fields safely
- No compatibility guarantees without external tooling
This "schema-on-read" approach works for small teams but becomes dangerous at scale without governance.
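In practice, schema-on-read consumers are written as "tolerant readers": defaults for missing fields, unknown fields ignored. A sketch:

```python
import json

def read_user(record):
    """Tolerant reader: defaults for absent fields, unknown fields ignored."""
    return {
        "user_id": record["user_id"],          # required -- KeyError if absent
        "active": record.get("active", True),  # optional, with a default
        "tags": record.get("tags", []),        # optional, absent -> empty list
    }

old = json.loads('{"user_id": 1}')
new = json.loads('{"user_id": 2, "active": false, "plan": "pro"}')  # "plan" is new
print(read_user(old))
print(read_user(new))  # the unknown "plan" field is silently dropped
```

The silence is the danger: nothing warns the consumer when "plan" appears or when "active" changes type.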
Compression
No built-in compression. JSON is highly repetitive (field names, braces, quotes) and compresses well:
| Format | Typical Ratio | Notes |
|---|---|---|
| Gzip | 8-15x | Good for JSON due to repeated keys |
| Zstandard | 10-18x | Better ratio, faster decompression |
| LZ4 | 4-7x | Fastest, lowest ratio |
| Brotli | 12-20x | Best ratio, slower compression |
Pro tip: JSON typically achieves higher compression ratios than CSV because field names (repeated in every record) create highly predictable patterns for dictionary-based compressors.
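The effect is easy to measure with the standard library's gzip module; exact ratios depend heavily on the data, so treat the numbers as illustrative:

```python
import gzip
import json

# Build a repetitive JSONL payload: the same keys appear in every record.
records = [
    {"event": "click", "user_id": i % 100, "page": "/home", "ok": True}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```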
Human Readability
Very good. JSON is easy to read and write for humans:
- Pretty-printing with `jq .` makes nested structures clear
- Widely supported in IDEs with syntax highlighting
- Browser developer tools render JSON natively
- `jq` provides powerful command-line querying
```bash
# Pretty-print
cat data.json | jq .

# Extract specific field
cat events.jsonl | jq -r '.user_id'

# Filter and transform
cat events.jsonl | jq 'select(.event == "purchase") | .amount'
```

Machine Efficiency
Moderate.
- Text-based encoding is slower to parse than binary formats (Avro, Protobuf)
- Field names repeated in every record waste storage and bandwidth
- No native date, timestamp, or binary types (dates are strings, binary is Base64)
- Numbers have no fixed precision (floating-point ambiguity)
- Parsing nested structures is more expensive than flat records
Benchmark context: JSON parsing is typically 2-5x slower than Avro deserialization and 10-30x slower than reading equivalent Parquet data for analytical queries.
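The precision point is easy to demonstrate: Python keeps JSON integers exact by default, but forcing the JavaScript-style mapping to IEEE 754 doubles (via the `parse_int` hook of `json.loads`) loses the low bit above 2^53:

```python
import json

doc = '{"order_id": 9007199254740993}'  # 2**53 + 1

# Python's default parser keeps full integer precision...
assert json.loads(doc)["order_id"] == 2**53 + 1

# ...but a JavaScript-style parser maps every number to an IEEE 754 double,
# which cannot represent 2**53 + 1.
js_style = json.loads(doc, parse_int=float)
print(js_style["order_id"])  # 9007199254740992.0 -- off by one
```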
Streaming Suitability
Good (with JSONL).
- JSONL is ideal for event streaming, log aggregation, and message queues
- Each line is independently parseable
- Natural fit for HTTP streaming APIs (Server-Sent Events)
- Kafka commonly uses JSON for message values (though Avro is preferred at scale)
- Append-only writes are trivial
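The append-only property is literally one `open(..., "a")` away; a sketch writing to a temporary file:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.jsonl")

def append_event(path, event):
    """Appending to JSONL never rewrites existing data."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

append_event(path, {"event": "click", "user_id": 42})
append_event(path, {"event": "purchase", "user_id": 42, "amount": 99.99})

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # 2
```

Appending to a single JSON array, by contrast, would require rewriting the closing bracket on every write.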
Limitations:
- No schema registry integration (unlike Avro)
- Schema drift is silent and dangerous
- Larger wire size than binary alternatives
Batch Analytics Suitability
Poor.
- Full file scan required for every query
- No column pruning or predicate pushdown
- Repeated field names waste I/O bandwidth
- Nested structures require recursive parsing
- No built-in statistics (min/max/count/null_count)
For analytics, convert JSON/JSONL to Parquet at ingestion time.
When to Use JSON
- REST APIs -- request/response payloads
- Configuration files -- application settings, feature flags
- Event payloads -- webhook bodies, notification data
- Small-to-medium datasets (< 500 MB) where human readability matters
- Document databases -- MongoDB, CouchDB, Elasticsearch
- Message queues -- Kafka, RabbitMQ, SQS (for simplicity)
- Logging -- structured logs in JSONL format (ELK stack)
- Data exchange between teams with no shared schema infrastructure
When NOT to Use JSON
- Large-scale analytics (> 1 GB) -- use Parquet or ORC
- High-throughput streaming -- use Avro with schema registry
- Columnar queries (aggregations on few columns) -- use Parquet
- Bandwidth-constrained environments -- use Protobuf or Avro (2-5x smaller)
- Schema evolution at scale -- use Avro with compatibility rules
- Binary data -- Base64 encoding adds 33% overhead
- Fixed-schema, high-volume pipelines -- binary formats are strictly better
Example Workloads
| Workload | JSON Fit | Better Alternative |
|---|---|---|
| REST API responses | Excellent | -- |
| Application config | Excellent | -- |
| Structured logging (JSONL) | Good | -- |
| Kafka event streaming | Moderate | Avro + Schema Registry |
| Data lake storage | Poor | Parquet |
| Clickstream analytics | Poor | Parquet |
| IoT sensor data (high volume) | Poor | Avro or Protobuf |
| ML feature store | Poor | Parquet |
Tradeoffs Table
| Property | Rating | Notes |
|---|---|---|
| Human readability | 5/5 | Ubiquitous, well-tooled (jq) |
| Schema enforcement | 2/5 | JSON Schema exists but is opt-in |
| Schema evolution | 2/5 | Flexible but ungoverned |
| Compression efficiency | 2/5 | Repetitive keys compress well, but still text |
| Query performance | 1/5 | Full scan, no pruning |
| Streaming support | 4/5 | JSONL is excellent for streaming |
| Ecosystem support | 5/5 | Universal language support |
| Nested data | 5/5 | First-class support |
| Type safety | 2/5 | Basic types, no dates/timestamps |
| Write speed | 4/5 | Simple text serialization |
Common Mistakes
1. Using JSON for data lake storage
Every analytical query does a full scan. Convert to Parquet on landing.
2. No schema validation in pipelines
Without JSON Schema validation, producers can silently change field names, types, or structure. Always validate at ingestion boundaries.
3. Floating-point precision loss
Most parsers map JSON numbers to IEEE 754 doubles (JavaScript's JSON.parse always does), so integers above 2^53 lose precision. Use strings for large IDs:

```json
{"order_id": "9007199254740993"}
```

4. Date/time as unformatted strings
JSON has no date type. Without a convention, you get "2024-01-15", "01/15/2024", "1705305600", and "Jan 15, 2024" in the same dataset. Standardize on ISO 8601.
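Enforcing the ISO 8601 convention takes only the standard library; a sketch of serializing and parsing timestamps consistently:

```python
import json
from datetime import datetime, timezone

# Serialize: always UTC, always ISO 8601.
ts = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc).isoformat()
payload = json.dumps({"signup_date": ts})
print(payload)  # {"signup_date": "2024-01-15T10:30:00+00:00"}

# Parse: fromisoformat round-trips the same convention.
parsed = datetime.fromisoformat(json.loads(payload)["signup_date"])
assert parsed.year == 2024 and parsed.tzinfo is not None
```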
5. Single large JSON array instead of JSONL
A single JSON array [{...}, {...}, ...] cannot be streamed, appended, or split. Use JSONL for record-oriented data.
6. Ignoring encoding
JSON must be UTF-8 (per RFC 8259). Producing JSON in other encodings causes silent corruption.
7. Deep nesting without limits
Deeply nested JSON (10+ levels) is hard to query, hard to flatten, and often indicates a design problem.
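When deep nesting is unavoidable, a common mitigation is flattening objects into dotted keys before loading them into tabular tools. A minimal sketch (arrays are left untouched here; a real flattener must decide how to explode them):

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested objects into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

doc = json.loads('{"user": {"id": 42, "address": {"city": "SF", "zip": "94105"}}}')
print(flatten(doc))
# {'user.id': 42, 'user.address.city': 'SF', 'user.address.zip': '94105'}
```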
Interview Framing
"JSON is the default data format for the web -- APIs, configs, and event payloads all use it. Its self-describing nature and support for nested structures make it extremely versatile for data interchange. However, for analytical workloads, JSON's text encoding, repeated field names, and lack of columnar access make it 10-30x slower than Parquet. In my pipelines, JSON is the format data arrives in (API responses, Kafka messages), and Parquet is the format data lives in (data lake, query engine). The conversion boundary is the ETL/ELT ingestion layer."
Follow-up points:
- JSONL vs JSON arrays for streaming (splittability, append-friendly)
- JSON Schema for contract enforcement at API boundaries
- Why Avro replaces JSON in high-throughput Kafka pipelines (schema registry, binary encoding, 2-5x smaller)
Top 5 Use Cases
- REST API payloads -- the universal request/response format
- Structured logging -- JSONL to ELK/Datadog/Splunk for searchable logs
- Configuration files -- application settings, infrastructure-as-code parameters
- Document databases -- MongoDB, CouchDB, Elasticsearch native format
- Webhook/event payloads -- GitHub webhooks, Stripe events, Slack notifications
Top 5 Warning Signs You Should Switch Away from JSON
- Your JSONL files exceed 1 GB -- convert to Parquet for analytics, Avro for streaming
- Schema drift is causing production incidents -- adopt Avro with a schema registry
- Kafka consumer lag is growing -- JSON serialization overhead may be the bottleneck
- Analytical queries scan TBs of JSON in S3 -- column pruning with Parquet could reduce I/O by 90%
- You are Base64-encoding binary data in JSON -- use a format with native binary support