
CSV (Comma-Separated Values) -- Deep Dive

One-Paragraph Summary

CSV is the oldest and most universally supported tabular data format, storing records as plain-text lines with fields separated by a delimiter (usually a comma). Its simplicity makes it the lingua franca for data exchange between systems, spreadsheets, and databases, but its complete lack of schema, type information, and built-in compression makes it a poor choice for analytics workloads, schema evolution, or any pipeline where data quality and performance matter at scale.


What Is CSV?

CSV is a plain-text format where each line represents a record and fields within a record are separated by a delimiter character. Despite its name, the delimiter can be a tab (TSV), pipe (|), semicolon (;), or any other character.

There is no single CSV standard. RFC 4180 provides guidelines, but real-world CSV files vary wildly in quoting rules, escaping, encoding, and line endings.

```csv
id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
```
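
As a quick illustration, here is a minimal sketch (using Python's standard csv module) of how a compliant parser reads the sample above, including the quoted name on line 2. Note that every value, including active, comes back as a plain string:

```python
import csv
import io

# The sample file from above, inlined so the example is self-contained.
sample = """id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
"""

# csv.DictReader applies RFC 4180-style quoting, so the doubled quotes in
# row 2 are unescaped correctly -- but every field is still just a string.
for row in csv.DictReader(io.StringIO(sample)):
    print(row["name"], row["signup_date"], row["active"])
```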

Anatomy of a CSV File


Row vs. Columnar

CSV is row-oriented. Each line contains all fields for a single record. This means:

  • Reading a full record is fast (single line scan).
  • Selecting a subset of columns requires parsing every field in every row.
  • Aggregating a single column (e.g., SUM(amount)) requires reading the entire file.
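
A minimal sketch of that last point, assuming a hypothetical sales.csv with an amount column: even though the aggregate touches a single column, every field of every row still has to be read and tokenized.

```python
import csv

def sum_column(path: str, column: str) -> float:
    """Sum one column of a CSV file -- which still means scanning every row."""
    total = 0.0
    with open(path, newline="") as f:
        # DictReader tokenizes every field on every line; there is no way to
        # skip the columns the query does not need.
        for row in csv.DictReader(f):
            total += float(row[column])
    return total

print(sum_column("sales.csv", "amount"))  # hypothetical file and column
```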

Schema Support

None. CSV has no built-in schema definition. Consumers must know:

  • The delimiter character
  • Whether a header row exists
  • The data type of each column (everything is text)
  • The encoding (UTF-8, Latin-1, etc.)
  • The quoting and escaping rules

This makes CSV inherently fragile -- a producer can change column order, add columns, or change types without any mechanism to communicate that to consumers.
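
In practice, all of that knowledge ends up hard-coded in the reader. A minimal sketch with pandas (column names, dtypes, and encoding here are assumptions based on the example file above):

```python
import pandas as pd

# Everything the consumer "must know" becomes explicit reader configuration,
# because the file itself carries none of it.
df = pd.read_csv(
    "users.csv",           # hypothetical export of the example above
    sep=",",               # the delimiter is a convention, not a guarantee
    header=0,              # does a header row exist? the file cannot say
    encoding="utf-8",      # a wrong guess corrupts accented characters silently
    quotechar='"',
    dtype={"id": "int64", "name": str, "email": str},
    parse_dates=["signup_date"],
)
```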


Schema Evolution

Not supported. There is no mechanism for:

  • Adding fields with defaults
  • Removing fields safely
  • Renaming fields
  • Changing field types

Any structural change to a CSV file is a breaking change for consumers that rely on column position rather than column name.
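
A short sketch of why position-based consumers are the most brittle (file and column names are assumed from the earlier example):

```python
import csv

# Position-based access: silently wrong the moment a producer inserts or
# reorders a column.
with open("users.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                # skip the header row
    emails = [row[2] for row in reader]         # "email" is column 2... today

# Name-based access survives reordering and added columns, but renames and
# removals still break consumers with no warning.
with open("users.csv", newline="") as f:
    emails = [row["email"] for row in csv.DictReader(f)]
```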


Compression

No built-in compression. CSV files are plain text and highly compressible. Common approaches:

| Approach | Compression Ratio | Splittable? | Notes |
|---|---|---|---|
| Gzip (.csv.gz) | 5-10x | No | Most common |
| Bzip2 (.csv.bz2) | 8-12x | Yes (block-level) | Slower |
| LZ4 (.csv.lz4) | 3-5x | No | Fastest |
| Zstandard (.csv.zst) | 6-10x | No | Best balance |
| Uncompressed | 1x | Yes | Line-splittable |

Key tradeoff: Gzip-compressed CSV is not splittable, meaning a distributed framework like Spark cannot split a single .csv.gz file across multiple workers. Use Bzip2 or leave uncompressed if splittability matters.
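
One way to see this, assuming a PySpark session and hypothetical S3 paths, is to check how many partitions Spark creates for each file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-split-check").getOrCreate()

# A gzip stream must be decompressed sequentially from the start, so Spark
# assigns the whole file to a single task no matter how large it is.
gz_df = spark.read.option("header", "true").csv("s3://bucket/events.csv.gz")
print(gz_df.rdd.getNumPartitions())     # typically 1 for a single .csv.gz file

# Uncompressed CSV (or block-compressed bzip2) can be split across tasks.
plain_df = spark.read.option("header", "true").csv("s3://bucket/events.csv")
print(plain_df.rdd.getNumPartitions())  # usually > 1 for a large file
```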


Human Readability

Excellent. This is CSV's greatest strength:

  • Openable in any text editor
  • Openable in any spreadsheet application
  • No special tooling required
  • head, tail, cut, awk, and sort work natively
  • Non-technical users can inspect and edit files

Machine Efficiency

Poor.

  • No binary encoding -- all values stored as text strings
  • Parsing requires character-by-character scanning for delimiters and quotes
  • No indexes, no metadata, no statistics
  • Type conversion (string to int/float/date) adds CPU overhead
  • No predicate pushdown or column pruning possible

Benchmark context: Reading a 1 GB CSV in Spark is typically 5-20x slower than reading the equivalent Parquet file.
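
The exact ratio depends on hardware, schema, and file size, but the gap is easy to observe even on a single machine. A rough sketch with pandas (it assumes events.csv and events.parquet hold the same data and that pyarrow is installed):

```python
import time
import pandas as pd

t0 = time.perf_counter()
pd.read_csv("events.csv")           # parses text, infers types row by row
csv_secs = time.perf_counter() - t0

t0 = time.perf_counter()
pd.read_parquet("events.parquet")   # reads binary, typed, compressed columns
parquet_secs = time.perf_counter() - t0

print(f"CSV: {csv_secs:.1f}s  Parquet: {parquet_secs:.1f}s  "
      f"ratio: {csv_secs / parquet_secs:.1f}x")
```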


Streaming Suitability

Moderate. CSV can be streamed line-by-line, making it suitable for:

  • Log file tailing
  • Simple ETL pipelines
  • Unix pipe workflows (cat data.csv | grep "ERROR" | cut -d, -f3)

However, the lack of schema makes it unsuitable for schema-enforced streaming (e.g., Kafka topics with schema registry).
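
For the simple cases, line-by-line streaming is straightforward. A minimal sketch in Python (the app_log.csv file and its level column are hypothetical):

```python
import csv
from typing import Iterator

def stream_errors(path: str) -> Iterator[dict]:
    """Yield matching rows one at a time; memory use stays flat at any file size."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("level") == "ERROR":   # assumed column name
                yield row

for row in stream_errors("app_log.csv"):
    print(row)
```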


Batch Analytics Suitability

Poor. For analytical queries:

  • Full file scan required for any query
  • No column pruning (must read all columns even if query uses one)
  • No predicate pushdown
  • No built-in statistics (min/max/count)
  • Poor compression compared to columnar formats
  • Slow parsing

When to Use CSV

  1. Data exchange between systems with no shared schema registry
  2. Legacy system integration where CSV is the only supported format
  3. Simple, one-time exports for human consumption (spreadsheets, reports)
  4. Small datasets (< 100 MB) where performance is irrelevant
  5. Unix pipeline processing with awk, cut, sort, grep
  6. Database bulk imports (e.g., COPY in PostgreSQL, LOAD DATA INFILE in MySQL) -- see the sketch after this list
  7. Seed data and fixtures for testing
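
To make the bulk-import case concrete, here is a minimal sketch of a PostgreSQL COPY load via psycopg2; the connection string, table, and column list are placeholders for your own environment:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=loader")  # placeholder credentials
with conn, conn.cursor() as cur, open("users.csv") as f:
    # COPY ... FROM STDIN streams the file over the client connection and is
    # far faster than issuing row-by-row INSERT statements.
    cur.copy_expert(
        "COPY users (id, name, email, signup_date, active) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```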

When NOT to Use CSV

  1. Analytics workloads -- use Parquet or ORC instead
  2. Schema evolution is needed -- use Avro or Protobuf
  3. Large datasets (> 1 GB) -- columnar formats compress 5-10x better
  4. Event streaming -- use Avro or Protobuf with a schema registry
  5. Nested or hierarchical data -- use JSON, Avro, or Parquet
  6. Data quality is critical -- no type enforcement, no validation
  7. Multi-language microservices -- use a format with codegen (Avro, Protobuf)

Example Workloads

| Workload | CSV Fit | Better Alternative |
|---|---|---|
| Export user list for marketing | Good | -- |
| Daily sales report to Excel | Good | -- |
| Clickstream analytics | Poor | Parquet |
| Kafka event streaming | Poor | Avro |
| ML feature store | Poor | Parquet |
| Config file | Poor | YAML/JSON |
| Database migration | Moderate | Native dump format |
| Log aggregation | Moderate | JSON Lines |

Tradeoffs Table

| Property | Rating | Notes |
|---|---|---|
| Human readability | 5/5 | Universal text format |
| Schema enforcement | 0/5 | None whatsoever |
| Schema evolution | 0/5 | Any change is breaking |
| Compression efficiency | 2/5 | Text-only, row-oriented |
| Query performance | 1/5 | Full scan always |
| Streaming support | 3/5 | Line-oriented, but no schema |
| Ecosystem support | 5/5 | Every tool supports CSV |
| Nested data | 0/5 | Flat records only |
| Type safety | 0/5 | Everything is a string |
| Write speed | 4/5 | Simple text serialization |

Common Mistakes

1. Using CSV for data lake storage

CSV in a data lake means every query does a full scan. Convert to Parquet on ingestion.
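
A minimal PySpark sketch of that convert-on-ingestion pattern (bucket paths and the event_date partition column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# One-time conversion at the ingestion boundary. inferSchema avoids landing
# every column as a string, at the cost of an extra pass over the data.
(spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://lake/raw/events/*.csv")
      .write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://lake/curated/events/"))
```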

2. Assuming column order is stable

Producers can reorder columns. Always match on header names, never on position.

3. Ignoring encoding issues

A CSV file with Latin-1 encoding opened as UTF-8 will corrupt accented characters silently.

4. Not handling quoted fields

Fields containing commas, newlines, or quotes must be properly quoted. Naive parsers break on:

```csv
description,"This field has a comma, and ""quotes"""
```

5. Using CSV for nested data

Flattening nested JSON into CSV (e.g., address.city, address.state) creates wide, sparse tables. Use a format that supports nesting.

6. Not specifying NULL representation

Is an empty field "", NULL, null, \N, or an actual empty string? CSV does not distinguish.
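
Whatever convention the producer chose has to be declared explicitly by the consumer. A minimal pandas sketch (the orders.csv file is hypothetical):

```python
import pandas as pd

df = pd.read_csv(
    "orders.csv",
    na_values=["", "NULL", "null", r"\N"],  # treat all of these as missing
    keep_default_na=False,                  # ...and nothing else (e.g. keep "NA" as literal text)
)
```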

7. Gzip-compressed CSV in Spark

A single .csv.gz file cannot be split across workers. Either use uncompressed CSV or switch to Parquet.


Interview Framing

"CSV is the universal data exchange format -- every system can read and write it. But that universality comes at a cost: no schema, no types, no compression, and terrible query performance. In a modern data pipeline, CSV is an ingestion format, not a storage format. I would accept CSV at the boundary of a system and immediately convert to Parquet or Avro for downstream processing."

Follow-up points:

  • Explain why Gzip CSV is not splittable and what that means for Spark
  • Discuss the "CSV as a contract" anti-pattern (brittle, no evolution)
  • Compare CSV import speed vs. Parquet read speed with concrete ratios

Top 5 Use Cases

  1. Spreadsheet exports -- reports, dashboards, ad-hoc analysis for business users
  2. Database bulk loading -- COPY FROM in PostgreSQL, LOAD DATA in MySQL
  3. Legacy system integration -- mainframe extracts, batch file drops
  4. Simple ETL staging -- land CSVs in S3, then transform to Parquet
  5. Open data publishing -- government datasets, Kaggle, public APIs

Top 5 Warning Signs You Should Switch Away from CSV

  1. Your CSV files exceed 1 GB -- you are paying a massive performance and storage penalty
  2. Consumers are writing custom parsers for your CSV quirks -- escaping, encoding, nulls
  3. Schema changes break downstream pipelines -- no evolution mechanism exists
  4. Analytical queries are slow -- full scan on every query, no column pruning
  5. You are storing CSV in a data lake for long-term analytics -- convert to Parquet on write

Quick Reference: CSV Parsing Gotchas


Further Reading