CSV (Comma-Separated Values) -- Deep Dive
One-Paragraph Summary
CSV is the oldest and most universally supported tabular data format, storing records as plain-text lines with fields separated by a delimiter (usually a comma). Its simplicity makes it the lingua franca for data exchange between systems, spreadsheets, and databases, but its complete lack of schema, type information, and built-in compression makes it a poor choice for analytics workloads, schema evolution, or any pipeline where data quality and performance matter at scale.
What Is CSV?
CSV is a plain-text format where each line represents a record and fields within a record are separated by a delimiter character. Despite its name, the delimiter can be a tab (TSV), pipe (|), semicolon (;), or any other character.
There is no single CSV standard. RFC 4180 provides guidelines, but real-world CSV files vary wildly in quoting rules, escaping, encoding, and line endings.
id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
Anatomy of a CSV File
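To make the quoting rules concrete, here is a minimal sketch (assuming Python's standard csv module) that parses the sample rows above. Note the contrast with a naive string split, and that every parsed value comes back as a plain string.

```python
import csv
import io

# Sample rows from above, embedded as a string purely for illustration.
SAMPLE = '''id,name,email,signup_date,active
1,Alice Johnson,alice@example.com,2024-01-15,true
2,"Bob ""The Builder"" Smith",bob@example.com,2024-02-20,true
3,Charlie Brown,charlie@example.com,2024-03-10,false
'''

# A naive split leaves the quote characters and doubled quotes in the data ...
print(SAMPLE.splitlines()[2].split(","))
# ['2', '"Bob ""The Builder"" Smith"', 'bob@example.com', '2024-02-20', 'true']

# ... while csv.reader applies RFC 4180-style quoting and unescaping.
for row in csv.reader(io.StringIO(SAMPLE)):
    print(row)  # every field is a plain Python string; no types are inferred
```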
Row vs. Columnar
CSV is row-oriented. Each line contains all fields for a single record. This means:
- Reading a full record is fast (single line scan).
- Selecting a subset of columns requires parsing every field in every row.
- Aggregating a single column (e.g., `SUM(amount)`) requires reading the entire file (see the sketch below).
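A minimal sketch of that last point, assuming a hypothetical orders.csv with an amount column: to sum one column, the reader still tokenizes every field of every line.

```python
import csv

def sum_column(path: str, column: str) -> float:
    """Sum a single column -- still a full-file, full-row scan."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        idx = header.index(column)
        for row in reader:            # every line is read ...
            total += float(row[idx])  # ... and every field parsed before one is used
    return total

print(sum_column("orders.csv", "amount"))  # hypothetical file and column
```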
Schema Support
None. CSV has no built-in schema definition. Consumers must know:
- The delimiter character
- Whether a header row exists
- The data type of each column (everything is text)
- The encoding (UTF-8, Latin-1, etc.)
- The quoting and escaping rules
This makes CSV inherently fragile -- a producer can change column order, add columns, or change types without any mechanism to communicate that to consumers.
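In practice, the consumer has to supply the schema itself. A minimal sketch, assuming the column names and types of the sample file above:

```python
import csv
from datetime import date

# The "schema" lives in the consumer, not the file; these names and types
# are assumptions that the producer can break at any time.
SCHEMA = {
    "id": int,
    "name": str,
    "email": str,
    "signup_date": date.fromisoformat,
    "active": lambda s: s.lower() == "true",
}

def read_users(path: str):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Cast each text field; a renamed or removed column fails here with KeyError.
            yield {col: cast(row[col]) for col, cast in SCHEMA.items()}
```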
Schema Evolution
Not supported. There is no mechanism for:
- Adding fields with defaults
- Removing fields safely
- Renaming fields
- Changing field types
Any structural change to a CSV file is a breaking change for consumers that rely on column position rather than column name.
Compression
No built-in compression. CSV files are plain text and highly compressible. Common approaches:
| Approach | Compression Ratio | Splittable? | Notes |
|---|---|---|---|
| Gzip (.csv.gz) | 5-10x | No | Most common |
| Bzip2 (.csv.bz2) | 8-12x | Yes (block-level) | Slower |
| LZ4 (.csv.lz4) | 3-5x | No | Fastest |
| Zstandard (.csv.zst) | 6-10x | No | Best balance |
| Uncompressed | 1x | Yes | Line-splittable |
Key tradeoff: Gzip-compressed CSV is not splittable, meaning a distributed framework like Spark cannot split a single .csv.gz file across multiple workers. Use Bzip2 or leave uncompressed if splittability matters.
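A quick way to see this in Spark (a rough sketch with hypothetical paths; exact partition counts depend on file size and configuration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

gz_df = spark.read.csv("s3://bucket/events.csv.gz", header=True)
raw_df = spark.read.csv("s3://bucket/events.csv", header=True)

print(gz_df.rdd.getNumPartitions())   # 1 -- a single worker must decompress the whole file
print(raw_df.rdd.getNumPartitions())  # >1 for large files -- plain text is line-splittable
```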
Human Readability
Excellent. This is CSV's greatest strength:
- Openable in any text editor
- Openable in any spreadsheet application
- No special tooling required
- `head`, `tail`, `cut`, `awk`, and `sort` work natively
- Non-technical users can inspect and edit files
Machine Efficiency
Poor.
- No binary encoding -- all values stored as text strings
- Parsing requires character-by-character scanning for delimiters and quotes
- No indexes, no metadata, no statistics
- Type conversion (string to int/float/date) adds CPU overhead
- No predicate pushdown or column pruning possible
Benchmark context: Reading a 1 GB CSV in Spark is typically 5-20x slower than reading the equivalent Parquet file.
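A rough, illustrative way to measure this yourself with pandas (file names are hypothetical, pyarrow is assumed as the Parquet engine, and the exact ratio depends on hardware and column types):

```python
import time
import pandas as pd

t0 = time.perf_counter()
df_csv = pd.read_csv("events.csv")                                    # full text parse
t1 = time.perf_counter()
df_pq = pd.read_parquet("events.parquet", columns=["user_id", "ts"])  # column pruning
t2 = time.perf_counter()

print(f"CSV full scan:        {t1 - t0:.2f}s")
print(f"Parquet, two columns: {t2 - t1:.2f}s")
```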
Streaming Suitability
Moderate. CSV can be streamed line-by-line, making it suitable for:
- Log file tailing
- Simple ETL pipelines
- Unix pipe workflows (`cat data.csv | grep "ERROR" | cut -d, -f3`)
However, the lack of schema makes it unsuitable for schema-enforced streaming (e.g., Kafka topics with schema registry).
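Line-oriented streaming itself is straightforward; a minimal sketch (the file name and the level/message columns are hypothetical):

```python
import csv

def error_rows(path: str):
    """Yield matching records one at a time -- constant memory, no full-file load."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("level") == "ERROR":
                yield row

for row in error_rows("app_log.csv"):
    print(row["message"])
```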
Batch Analytics Suitability
Poor. For analytical queries:
- Full file scan required for any query
- No column pruning (must read all columns even if query uses one)
- No predicate pushdown
- No built-in statistics (min/max/count)
- Poor compression compared to columnar formats
- Slow parsing
When to Use CSV
- Data exchange between systems with no shared schema registry
- Legacy system integration where CSV is the only supported format
- Simple, one-time exports for human consumption (spreadsheets, reports)
- Small datasets (< 100 MB) where performance is irrelevant
- Unix pipeline processing with `awk`, `cut`, `sort`, and `grep`
- Database bulk imports (e.g., `COPY` in PostgreSQL, `LOAD DATA INFILE` in MySQL)
- Seed data and fixtures for testing
When NOT to Use CSV
- Analytics workloads -- use Parquet or ORC instead
- Schema evolution is needed -- use Avro or Protobuf
- Large datasets (> 1 GB) -- columnar formats compress 5-10x better
- Event streaming -- use Avro or Protobuf with a schema registry
- Nested or hierarchical data -- use JSON, Avro, or Parquet
- Data quality is critical -- no type enforcement, no validation
- Multi-language microservices -- use a format with codegen (Avro, Protobuf)
Example Workloads
| Workload | CSV Fit | Better Alternative |
|---|---|---|
| Export user list for marketing | Good | -- |
| Daily sales report to Excel | Good | -- |
| Clickstream analytics | Poor | Parquet |
| Kafka event streaming | Poor | Avro |
| ML feature store | Poor | Parquet |
| Config file | Poor | YAML/JSON |
| Database migration | Moderate | Native dump format |
| Log aggregation | Moderate | JSON Lines |
Tradeoffs Table
| Property | Rating | Notes |
|---|---|---|
| Human readability | 5/5 | Universal text format |
| Schema enforcement | 0/5 | None whatsoever |
| Schema evolution | 0/5 | Any change is breaking |
| Compression efficiency | 2/5 | Text-only, row-oriented |
| Query performance | 1/5 | Full scan always |
| Streaming support | 3/5 | Line-oriented, but no schema |
| Ecosystem support | 5/5 | Every tool supports CSV |
| Nested data | 0/5 | Flat records only |
| Type safety | 0/5 | Everything is a string |
| Write speed | 4/5 | Simple text serialization |
Common Mistakes
1. Using CSV for data lake storage
CSV in a data lake means every query does a full scan. Convert to Parquet on ingestion.
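A minimal convert-on-ingest sketch, assuming pyarrow and hypothetical paths:

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the landed CSV once, write it back as compressed, columnar Parquet.
table = pacsv.read_csv("landing/users_2024-06-01.csv")
pq.write_table(table, "warehouse/users/2024-06-01.parquet", compression="zstd")
```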
2. Assuming column order is stable
Producers can reorder columns. Always match on header names, never on position.
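For example, with Python's csv module (users.csv is hypothetical), DictReader keys each row by header name, so a reordered or newly added column does not silently shift your data:

```python
import csv

with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        email = row["email"]     # survives reordering; raises KeyError if the column is renamed
        # email = row_values[2]  # positional access: silently wrong after a reorder
```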
3. Ignoring encoding issues
A CSV file with Latin-1 encoding opened as UTF-8 will corrupt accented characters silently.
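Make the encoding explicit at the boundary rather than trusting the platform default (the file name below is hypothetical):

```python
# Correct: declare the encoding the producer actually used.
with open("export_latin1.csv", encoding="latin-1", newline="") as f:
    text = f.read()

# Reading the same bytes as strict UTF-8 raises UnicodeDecodeError; with
# errors="replace" it "succeeds" but substitutes the accented characters.
with open("export_latin1.csv", encoding="utf-8", errors="replace", newline="") as f:
    corrupted = f.read()
```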
4. Not handling quoted fields
Fields containing commas, newlines, or quotes must be properly quoted. Naive parsers break on:
description,"This field has a comma, and ""quotes"""
5. Using CSV for nested data
Flattening nested JSON into CSV (e.g., address.city, address.state) creates wide, sparse tables. Use a format that supports nesting.
6. Not specifying NULL representation
Does an empty field mean NULL or an empty string? Is NULL written as `NULL`, `null`, `\N`, or nothing at all? CSV does not distinguish, so producer and consumer must agree on a convention.
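With pandas, for instance, you can state that convention explicitly (the file name is hypothetical):

```python
import pandas as pd

# Only "\N" and "NULL" become missing values; empty fields stay as empty strings.
df = pd.read_csv(
    "users.csv",
    na_values=["\\N", "NULL"],
    keep_default_na=False,
)
```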
7. Gzip-compressed CSV in Spark
A single .csv.gz file cannot be split across workers. Either use uncompressed CSV or switch to Parquet.
Interview Framing
"CSV is the universal data exchange format -- every system can read and write it. But that universality comes at a cost: no schema, no types, no compression, and terrible query performance. In a modern data pipeline, CSV is an ingestion format, not a storage format. I would accept CSV at the boundary of a system and immediately convert to Parquet or Avro for downstream processing."
Follow-up points:
- Explain why Gzip CSV is not splittable and what that means for Spark
- Discuss the "CSV as a contract" anti-pattern (brittle, no evolution)
- Compare CSV parse speed vs. Parquet read speed with concrete ratios
Top 5 Use Cases
- Spreadsheet exports -- reports, dashboards, ad-hoc analysis for business users
- Database bulk loading -- `COPY FROM` in PostgreSQL, `LOAD DATA` in MySQL
- Legacy system integration -- mainframe extracts, batch file drops
- Simple ETL staging -- land CSVs in S3, then transform to Parquet
- Open data publishing -- government datasets, Kaggle, public APIs
Top 5 Warning Signs You Should Switch Away from CSV
- Your CSV files exceed 1 GB -- you are paying a massive performance and storage penalty
- Consumers are writing custom parsers for your CSV quirks -- escaping, encoding, nulls
- Schema changes break downstream pipelines -- no evolution mechanism exists
- Analytical queries are slow -- full scan on every query, no column pruning
- You are storing CSV in a data lake for long-term analytics -- convert to Parquet on write