
CSV (Comma-Separated Values) -- Deep Dive

One-Paragraph Summary

CSV is the oldest and most universally supported tabular data format, storing records as plain-text lines with fields separated by a delimiter (usually a comma). Its simplicity makes it the lingua franca for data exchange between systems, spreadsheets, and databases, but its complete lack of schema, type information, and built-in compression makes it a poor choice for analytics workloads, schema evolution, or any pipeline where data quality and performance matter at scale.


What Is CSV?

CSV is a plain-text format where each line represents a record and fields within a record are separated by a delimiter character. Despite its name, the delimiter can be a tab (TSV), pipe (|), semicolon (;), or any other character.

There is no single CSV standard. RFC 4180 provides guidelines, but real-world CSV files vary wildly in quoting rules, escaping, encoding, and line endings.

```csv
id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
```
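
As a quick illustration, here is a minimal sketch (using Python's standard csv module) of how a compliant parser reads the sample above, including the quoted name on line 2. Note that every value, including active, comes back as a plain string:

```python
import csv
import io

# The sample file from above, inlined so the example is self-contained.
sample = """id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
"""

# csv.DictReader applies RFC 4180-style quoting, so the doubled quotes in
# row 2 are unescaped correctly -- but every field is still just a string.
for row in csv.DictReader(io.StringIO(sample)):
    print(row["name"], row["signup_date"], row["active"])
```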

Anatomy of a CSV File


Row vs. Columnar

CSV is row-oriented. Each line contains all fields for a single record. This means:

  • Reading a full record is fast (single line scan).
  • Selecting a subset of columns requires parsing every field in every row.
  • Aggregating a single column (e.g., SUM(amount)) requires reading the entire file.
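
A minimal sketch of that last point, assuming a hypothetical sales.csv with an amount column: even though the aggregate touches a single column, every field of every row still has to be read and tokenized.

```python
import csv

def sum_column(path: str, column: str) -> float:
    """Sum one column of a CSV file -- which still means scanning every row."""
    total = 0.0
    with open(path, newline="") as f:
        # DictReader tokenizes every field on every line; there is no way to
        # skip the columns the query does not need.
        for row in csv.DictReader(f):
            total += float(row[column])
    return total

print(sum_column("sales.csv", "amount"))  # hypothetical file and column
```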

Schema Support

None. CSV has no built-in schema definition. Consumers must know:

  • The delimiter character
  • Whether a header row exists
  • The data type of each column (everything is text)
  • The encoding (UTF-8, Latin-1, etc.)
  • The quoting and escaping rules

This makes CSV inherently fragile -- a producer can change column order, add columns, or change types without any mechanism to communicate that to consumers.
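
In practice, all of that knowledge ends up hard-coded in the reader. A minimal sketch with pandas (column names, dtypes, and encoding here are assumptions based on the example file above):

```python
import pandas as pd

# Everything the consumer "must know" becomes explicit reader configuration,
# because the file itself carries none of it.
df = pd.read_csv(
    "users.csv",           # hypothetical export of the example above
    sep=",",               # the delimiter is a convention, not a guarantee
    header=0,              # does a header row exist? the file cannot say
    encoding="utf-8",      # a wrong guess corrupts accented characters silently
    quotechar='"',
    dtype={"id": "int64", "name": str, "email": str},
    parse_dates=["signup_date"],
)
```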


Schema Evolution

Not supported. There is no mechanism for:

  • Adding fields with defaults
  • Removing fields safely
  • Renaming fields
  • Changing field types

Any structural change to a CSV file is a breaking change for consumers that rely on column position rather than column name.
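
A short sketch of why position-based consumers are the most brittle (file and column names are assumed from the earlier example):

```python
import csv

# Position-based access: silently wrong the moment a producer inserts or
# reorders a column.
with open("users.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                # skip the header row
    emails = [row[2] for row in reader]         # "email" is column 2... today

# Name-based access survives reordering and added columns, but renames and
# removals still break consumers with no warning.
with open("users.csv", newline="") as f:
    emails = [row["email"] for row in csv.DictReader(f)]
```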


Compression

No built-in compression. CSV files are plain text and highly compressible. Common approaches:

| Approach | Compression Ratio | Splittable? | Notes |
|---|---|---|---|
| Gzip (.csv.gz) | 5-10x | No | Most common |
| Bzip2 (.csv.bz2) | 8-12x | Yes (block-level) | Slower |
| LZ4 (.csv.lz4) | 3-5x | No | Fastest |
| Zstandard (.csv.zst) | 6-10x | No | Best balance |
| Uncompressed | 1x | Yes | Line-splittable |

Key tradeoff: Gzip-compressed CSV is not splittable, meaning a distributed framework like Spark cannot split a single .csv.gz file across multiple workers. Use Bzip2 or leave uncompressed if splittability matters.
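
One way to see this, assuming a PySpark session and hypothetical S3 paths, is to check how many partitions Spark creates for each file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-split-check").getOrCreate()

# A gzip stream must be decompressed sequentially from the start, so Spark
# assigns the whole file to a single task no matter how large it is.
gz_df = spark.read.option("header", "true").csv("s3://bucket/events.csv.gz")
print(gz_df.rdd.getNumPartitions())     # typically 1 for a single .csv.gz file

# Uncompressed CSV (or block-compressed bzip2) can be split across tasks.
plain_df = spark.read.option("header", "true").csv("s3://bucket/events.csv")
print(plain_df.rdd.getNumPartitions())  # usually > 1 for a large file
```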


Human Readability

Excellent. This is CSV's greatest strength:

  • Openable in any text editor
  • Openable in any spreadsheet application
  • No special tooling required
  • head, tail, cut, awk, and sort work natively
  • Non-technical users can inspect and edit files

Machine Efficiency

Poor.

  • No binary encoding -- all values stored as text strings
  • Parsing requires character-by-character scanning for delimiters and quotes
  • No indexes, no metadata, no statistics
  • Type conversion (string to int/float/date) adds CPU overhead
  • No predicate pushdown or column pruning possible

Benchmark context: Reading a 1 GB CSV in Spark is typically 5-20x slower than reading the equivalent Parquet file.
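
The exact ratio depends on hardware, schema, and file size, but the gap is easy to observe even on a single machine. A rough sketch with pandas (it assumes events.csv and events.parquet hold the same data and that pyarrow is installed):

```python
import time
import pandas as pd

t0 = time.perf_counter()
pd.read_csv("events.csv")           # parses text, infers types row by row
csv_secs = time.perf_counter() - t0

t0 = time.perf_counter()
pd.read_parquet("events.parquet")   # reads binary, typed, compressed columns
parquet_secs = time.perf_counter() - t0

print(f"CSV: {csv_secs:.1f}s  Parquet: {parquet_secs:.1f}s  "
      f"ratio: {csv_secs / parquet_secs:.1f}x")
```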


Streaming Suitability

Moderate. CSV can be streamed line-by-line, making it suitable for:

  • Log file tailing
  • Simple ETL pipelines
  • Unix pipe workflows (cat data.csv | grep "ERROR" | cut -d, -f3)

However, the lack of schema makes it unsuitable for schema-enforced streaming (e.g., Kafka topics with schema registry).
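
For the simple cases, line-by-line streaming is straightforward. A minimal sketch in Python (the app_log.csv file and its level column are hypothetical):

```python
import csv
from typing import Iterator

def stream_errors(path: str) -> Iterator[dict]:
    """Yield matching rows one at a time; memory use stays flat at any file size."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("level") == "ERROR":   # assumed column name
                yield row

for row in stream_errors("app_log.csv"):
    print(row)
```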


Batch Analytics Suitability

Poor. For analytical queries:

  • Full file scan required for any query
  • No column pruning (must read all columns even if query uses one)
  • No predicate pushdown
  • No built-in statistics (min/max/count)
  • Poor compression compared to columnar formats
  • Slow parsing

When to Use CSV

  1. Data exchange between systems with no shared schema registry
  2. Legacy system integration where CSV is the only supported format
  3. Simple, one-time exports for human consumption (spreadsheets, reports)
  4. Small datasets (< 100 MB) where performance is irrelevant
  5. Unix pipeline processing with awk, cut, sort, grep
  6. Database bulk imports (e.g., COPY in PostgreSQL, LOAD DATA INFILE in MySQL) -- see the sketch after this list
  7. Seed data and fixtures for testing
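
To make the bulk-import case concrete, here is a minimal sketch of a PostgreSQL COPY load via psycopg2; the connection string, table, and column list are placeholders for your own environment:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=loader")  # placeholder credentials
with conn, conn.cursor() as cur, open("users.csv") as f:
    # COPY ... FROM STDIN streams the file over the client connection and is
    # far faster than issuing row-by-row INSERT statements.
    cur.copy_expert(
        "COPY users (id, name, email, signup_date, active) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```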

When NOT to Use CSV

  1. Analytics workloads -- use Parquet or ORC instead
  2. Schema evolution is needed -- use Avro or Protobuf
  3. Large datasets (> 1 GB) -- columnar formats compress 5-10x better
  4. Event streaming -- use Avro or Protobuf with a schema registry
  5. Nested or hierarchical data -- use JSON, Avro, or Parquet
  6. Data quality is critical -- no type enforcement, no validation
  7. Multi-language microservices -- use a format with codegen (Avro, Protobuf)

Example Workloads

| Workload | CSV Fit | Better Alternative |
|---|---|---|
| Export user list for marketing | Good | -- |
| Daily sales report to Excel | Good | -- |
| Clickstream analytics | Poor | Parquet |
| Kafka event streaming | Poor | Avro |
| ML feature store | Poor | Parquet |
| Config file | Poor | YAML/JSON |
| Database migration | Moderate | Native dump format |
| Log aggregation | Moderate | JSON Lines |

Tradeoffs Table

| Property | Rating | Notes |
|---|---|---|
| Human readability | 5/5 | Universal text format |
| Schema enforcement | 0/5 | None whatsoever |
| Schema evolution | 0/5 | Any change is breaking |
| Compression efficiency | 2/5 | Text-only, row-oriented |
| Query performance | 1/5 | Full scan always |
| Streaming support | 3/5 | Line-oriented, but no schema |
| Ecosystem support | 5/5 | Every tool supports CSV |
| Nested data | 0/5 | Flat records only |
| Type safety | 0/5 | Everything is a string |
| Write speed | 4/5 | Simple text serialization |

Common Mistakes

1. Using CSV for data lake storage

CSV in a data lake means every query does a full scan. Convert to Parquet on ingestion.
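
A minimal PySpark sketch of that convert-on-ingestion pattern (bucket paths and the event_date partition column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# One-time conversion at the ingestion boundary. inferSchema avoids landing
# every column as a string, at the cost of an extra pass over the data.
(spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://lake/raw/events/*.csv")
      .write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://lake/curated/events/"))
```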

2. Assuming column order is stable

Producers can reorder columns. Always match on header names, never on position.

3. Ignoring encoding issues

A CSV file with Latin-1 encoding opened as UTF-8 will corrupt accented characters silently.

4. Not handling quoted fields

Fields containing commas, newlines, or quotes must be properly quoted. Naive parsers break on:

```csv
description,"This field has a comma, and ""quotes"""
```

5. Using CSV for nested data

Flattening nested JSON into CSV (e.g., address.city, address.state) creates wide, sparse tables. Use a format that supports nesting.

6. Not specifying NULL representation

Is an empty field "", NULL, null, \N, or an actual empty string? CSV does not distinguish.
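
Whatever convention the producer chose has to be declared explicitly by the consumer. A minimal pandas sketch (the orders.csv file is hypothetical):

```python
import pandas as pd

df = pd.read_csv(
    "orders.csv",
    na_values=["", "NULL", "null", r"\N"],  # treat all of these as missing
    keep_default_na=False,                  # ...and nothing else (e.g. keep "NA" as literal text)
)
```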

7. Gzip-compressed CSV in Spark

A single .csv.gz file cannot be split across workers. Either use uncompressed CSV or switch to Parquet.


Interview Framing

"CSV is the universal data exchange format -- every system can read and write it. But that universality comes at a cost: no schema, no types, no compression, and terrible query performance. In a modern data pipeline, CSV is an ingestion format, not a storage format. I would accept CSV at the boundary of a system and immediately convert to Parquet or Avro for downstream processing."

Follow-up points:

  • Explain why Gzip CSV is not splittable and what that means for Spark
  • Discuss the "CSV as a contract" anti-pattern (brittle, no evolution)
  • Compare CSV import speed vs. Parquet read speed with concrete ratios

Top 5 Use Cases

  1. Spreadsheet exports -- reports, dashboards, ad-hoc analysis for business users
  2. Database bulk loading -- COPY FROM in PostgreSQL, LOAD DATA in MySQL
  3. Legacy system integration -- mainframe extracts, batch file drops
  4. Simple ETL staging -- land CSVs in S3, then transform to Parquet
  5. Open data publishing -- government datasets, Kaggle, public APIs

Top 5 Warning Signs You Should Switch Away from CSV

  1. Your CSV files exceed 1 GB -- you are paying a massive performance and storage penalty
  2. Consumers are writing custom parsers for your CSV quirks -- escaping, encoding, nulls
  3. Schema changes break downstream pipelines -- no evolution mechanism exists
  4. Analytical queries are slow -- full scan on every query, no column pruning
  5. You are storing CSV in a data lake for long-term analytics -- convert to Parquet on write

Quick Reference: CSV Parsing Gotchas


Further Reading