CSV, JSON, XML, YAML — Data Format Selection Guide
Why Format Choice Matters
Switching a legacy API from XML to JSON can cut payload sizes substantially; reductions of around 40% are commonly reported. Conversely, choosing JSON for a configuration file and then discovering that it does not allow comments is a maintenance headache that surfaces later.
CSV, JSON, XML, and YAML were each designed at different times to solve different problems. Understanding their specifications will help you make decisions that hold up under production conditions.
CSV — The Minimal Solution for Tabular Data
Specification and History
CSV (Comma-Separated Values) is documented in RFC 4180 (2005), an informational RFC rather than a formal standard. Implementations vary widely in practice, which makes full interoperability difficult. Comma-separated input was already supported by IBM mainframe software in the early 1970s, and the format spread with the rise of spreadsheets.
RFC 4180's core rules are minimal:
- Records are separated by CRLF
- Trailing newline after the last record is optional
- Fields may be enclosed in double quotes
- Double quotes within a field are escaped by doubling them ("")
Structural Limitations
CSV is fundamentally a two-dimensional format (rows and columns). It cannot represent nested structures, type information, or metadata.
name,age,department,skills
Alice,28,Engineering,"JavaScript,TypeScript"
Bob,32,Design,"Figma,CSS"
Note that a field containing a comma must be quoted. Type information is entirely absent — the consumer must infer that age is an integer.
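The quoting rules are easiest to see by letting a CSV library apply them. The following is a minimal sketch using Python's standard csv module; the helper name to_csv and the sample row are illustrative assumptions, not part of the data above.

```python
import csv
import io

def to_csv(rows):
    """Serialize rows to CSV text using the csv module's default dialect."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: minimal quoting, CRLF line endings
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    ["name", "note"],
    ["Alice", 'She said "hi", twice'],  # contains both a comma and double quotes
]
print(repr(to_csv(rows)))
```

Only the field that needs it is quoted, and the embedded double quotes are doubled, which matches the RFC 4180 rules listed earlier.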
import csv

with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row['age'] is a string, so it needs an explicit int() conversion
        print(row['name'], int(row['age']))
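Because CSV carries neither types nor nesting, a consumer usually converts each column by hand. The sketch below assumes the sample data shown earlier and a hypothetical helper named load_people; the choice to split the skills column on commas is a convention of this example, not something CSV defines.

```python
import csv
import io

CSV_TEXT = """\
name,age,department,skills
Alice,28,Engineering,"JavaScript,TypeScript"
Bob,32,Design,"Figma,CSS"
"""

def load_people(text):
    """Parse CSV text and convert each column to the type the consumer expects."""
    people = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        people.append({
            "name": row["name"],
            "age": int(row["age"]),              # CSV has no types; convert explicitly
            "department": row["department"],
            "skills": row["skills"].split(","),  # list flattened into one quoted field
        })
    return people

for person in load_people(CSV_TEXT):
    print(person)
```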
Best Use Cases
- Data transfer between databases or systems (bulk import/export)
- Excel and Google Sheets integration
- High-volume log or sensor data with flat structure
- Sharing data with analysts or non-developers
JSON — The Web API Default
Specification
JSON (JavaScript Object Notation) is defined in RFC 8259 (2017) and ECMA-404. RFC 8259 obsoletes RFC 7159 and requires UTF-8 for JSON text exchanged between systems that are not part of a closed ecosystem.
JSON's six data types (string, number, boolean, null, object, array) make it structurally richer than CSV.
{
  "user": {
    "name": "Alice",
    "age": 28,
    "skills": ["JavaScript", "TypeScript"],
    "active": true,
    "metadata": null
  }
}
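To see how these types surface in a host language, the document above can be loaded with Python's standard json module. This is a small sketch; the variable names DOC and type_names are illustrative.

```python
import json

DOC = (
    '{"user": {"name": "Alice", "age": 28, '
    '"skills": ["JavaScript", "TypeScript"], "active": true, "metadata": null}}'
)

user = json.loads(DOC)["user"]  # the JSON object becomes a dict

# Map each key to the name of the Python type its value was decoded into.
type_names = {key: type(value).__name__ for key, value in user.items()}
print(type_names)
```

Unlike the CSV case, no manual conversion is needed: the number, boolean, null, and array values arrive already typed.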
Parsing Speed and Native Support
JSON's small grammar makes parsers easy to implement, and virtually every mainstream language has a mature one, often in its standard library. JavaScript engines such as V8 ship highly optimized native implementations of JSON.parse and JSON.stringify.
// Parse and serialize
const data = JSON.parse('{"name":"Alice","age":28}');
const json = JSON.stringify(data, null, 2);

// Streaming for large files (stream-json is a third-party npm package)
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

require('fs').createReadStream('large.json')
  .pipe(parser())
  .pipe(streamArray())
  .on('data', ({ value }) => processItem(value)); // processItem: your own handler
Validation with JSON Schema
JSON itself has no built-in schema. JSON Schema (draft 2020-12) fills this gap, enabling structure validation and documentation generation.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "age"],
  "properties": {
    "name": { "type": "string", "minLength": 1 },
    "age": { "type": "integer", "minimum": 0, "maximum": 150 }
  }
}
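To make the constraints concrete, here is a hand-written Python sketch that checks the same rules as the schema above. It is not a JSON Schema implementation: the function validate_user is hypothetical, and a real project would normally hand the schema to a dedicated validator library instead.

```python
import json

def validate_user(doc):
    """Return a list of error strings; an empty list means the document passed."""
    if not isinstance(doc, dict):
        return ["document must be an object"]
    errors = []
    for key in ("name", "age"):  # mirrors "required"
        if key not in doc:
            errors.append(f"missing required property: {key}")
    if "name" in doc:
        name = doc["name"]
        if not (isinstance(name, str) and len(name) >= 1):  # type + minLength
            errors.append("name must be a non-empty string")
    if "age" in doc:
        age = doc["age"]
        # bool is a subclass of int in Python, but JSON true/false are not integers.
        # Simplification: floats such as 28.0 are rejected here.
        if isinstance(age, bool) or not isinstance(age, int):
            errors.append("age must be an integer")
        elif not 0 <= age <= 150:  # minimum / maximum
            errors.append("age must be between 0 and 150")
    return errors

print(validate_user(json.loads('{"name": "Alice", "age": 28}')))
print(validate_user(json.loads('{"name": "", "age": 200}')))
```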
Best Use Cases
- REST API request and response bodies
- Configuration files where comments are not needed
- Document storage in NoSQL databases (MongoDB, Firestore)
- Frontend-to-backend data exchange
XML — The Pioneer of Structured Data
Specification
XML (Extensible Markup Language) was first published as a W3C Recommendation in 1998. Its tag-based syntax supports namespaces, attributes, and mixed content.
<?xml version="1.0" encoding="UTF-8"?>
<users xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="users.xsd">
  <user id="1">
    <name>Alice</name>
    <age>28</age>
    <skills>
      <skill>JavaScript</skill>
      <skill>TypeScript</skill>
    </skills>
  </user>
</users>
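Reading such a document from code looks roughly like the following sketch, which uses Python's standard xml.etree.ElementTree. The XML declaration and schema attributes are omitted here for brevity, and parse_users is a hypothetical helper. Note that ElementTree does not validate against a schema, and that element text and attributes always arrive as strings.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """\
<users>
  <user id="1">
    <name>Alice</name>
    <age>28</age>
    <skills>
      <skill>JavaScript</skill>
      <skill>TypeScript</skill>
    </skills>
  </user>
</users>
"""

def parse_users(xml_text):
    """Turn each <user> element into a dict with explicitly converted types."""
    root = ET.fromstring(xml_text)
    users = []
    for user in root.findall("user"):
        users.append({
            "id": int(user.get("id")),        # attribute values are strings
            "name": user.findtext("name"),
            "age": int(user.findtext("age")),  # element text is a string too
            "skills": [s.text for s in user.findall("skills/skill")],
        })
    return users

print(parse_users(XML_TEXT))
```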
Schema Validation with XSD
XML Schema Definition (XSD) is the most mature schema language in the XML ecosystem, offering precise type constraints, cardinality, and namespace-aware validation.
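<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="users">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="user" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="user">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age">
          <xs:simpleType>
            <xs:restriction base="xs:integer">
              <xs:minInclusive value="0"/>
              <xs:maxInclusive value="150"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="skills" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="skill" type="xs:string" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>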
The File Size Problem
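XML repeats every element name in a start tag and an end tag, so the same data usually takes more bytes in XML than in JSON. How much more depends on the data: the overhead is largest when values are short relative to tag names, and it shrinks considerably under compression such as gzip.
Illustrative sizes for one small record (rough figures, not a benchmark):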
CSV: 50 bytes (sacrificing structure)
YAML: 100 bytes
JSON: 120 bytes
XML: 380 bytes (tags repeated for every field)
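Rather than relying on rough figures, the overhead can be measured for your own data. The sketch below serializes the same two records with Python's standard library and compares UTF-8 byte counts; the records and the helper names (as_csv, as_json, as_xml, encoded_sizes) are assumptions made for illustration, and YAML is left out because the standard library has no YAML serializer.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

RECORDS = [
    {"name": "Alice", "age": 28, "department": "Engineering"},
    {"name": "Bob", "age": 32, "department": "Design"},
]

def as_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def as_json(records):
    return json.dumps(records, separators=(",", ":"))  # compact, no extra whitespace

def as_xml(records):
    root = ET.Element("users")
    for record in records:
        user = ET.SubElement(root, "user")
        for key, value in record.items():
            ET.SubElement(user, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def encoded_sizes(records):
    """Return the UTF-8 byte length of each serialization."""
    return {
        "csv": len(as_csv(records).encode("utf-8")),
        "json": len(as_json(records).encode("utf-8")),
        "xml": len(as_xml(records).encode("utf-8")),
    }

print(encoded_sizes(RECORDS))
```

For flat records like these, CSV comes out smallest and XML largest, but the exact ratios shift with field names, value lengths, and nesting.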
Best Use Cases
- SOAP web services (enterprise system integration)
- Document-oriented data: SVG, XHTML, Office Open XML (OOXML)
- Systems requiring strict namespace-based type contracts
- Legacy system integration where XML is already the protocol
YAML — Configuration at Human Scale
Specification
The current specification for YAML (YAML Ain't Markup Language) is YAML 1.2.2 (October 2021). YAML 1.2 was designed as a strict superset of JSON, so a valid JSON document is also intended to be valid YAML.
# YAML supports comments
user:
  name: Alice
  age: 28
  skills:
    - JavaScript
    - TypeScript
  active: true
  metadata: null  # null can also be written as ~
Anchors and Aliases
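YAML supports value reuse through anchors (&) and aliases (*), which is especially useful for keeping configuration files DRY. The merge key (<<) that is often combined with them is a YAML 1.1 extension rather than part of the YAML 1.2 core specification, although many parsers support it.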
defaults: &defaults
  timeout: 30
  retries: 3
  log_level: info

production:
  <<: *defaults  # merge key
  log_level: warning
  database: prod-db.example.com

staging:
  <<: *defaults
  database: staging-db.example.com
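Assuming a parser that supports the merge key, the mappings it produces for this file should be equivalent to the plain Python dictionaries below. This is a sketch of the resulting data, not a YAML parser: keys written explicitly in a mapping override the merged defaults.

```python
defaults = {"timeout": 30, "retries": 3, "log_level": "info"}

# Keys listed after the unpacked defaults override them, mirroring the merge key.
production = {**defaults, "log_level": "warning", "database": "prod-db.example.com"}
staging = {**defaults, "database": "staging-db.example.com"}

print(production)
print(staging)
```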
The YAML Type-Coercion Trap
YAML's expressiveness comes with a hidden cost: unexpected type coercion.
# Problematic examples
country_code: NO # Norway's country code — parsed as false (YAML 1.1)
version: 1.0 # Parsed as float, not string
date: 2024-01-01 # May be parsed as a date object depending on parser
port: 8080 # Parsed as integer (intentional)
In the YAML 1.2 core schema, only true and false (and their capitalized forms) are booleans; yes, no, on, and off are plain strings. Parsers that follow YAML 1.1 still treat NO as false, a mistake that regularly surfaces in CI/CD configuration files.
import yaml  # PyYAML, which follows YAML 1.1

yaml.safe_load('country: NO')  # {'country': False} ← TRAP

# ruamel.yaml defaults to YAML 1.2
from ruamel.yaml import YAML

yml = YAML()
yml.load('country: NO')  # mapping whose 'country' value is the string 'NO' ← correct
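Whichever parser is in use, a cheap safeguard is to check the types of values that must be strings right after loading. The helper below, require_strings, is a hypothetical sketch; the dictionary stands in for what a YAML 1.1 parser would return, so the example runs without a YAML library installed.

```python
def require_strings(config, keys):
    """Raise TypeError if any listed key did not load as a string."""
    for key in keys:
        value = config.get(key)
        if not isinstance(value, str):
            raise TypeError(
                f"{key!r} loaded as {type(value).__name__} ({value!r}); "
                "quote the value in the YAML source"
            )
    return config

# Simulates what a YAML 1.1 parser returns for `country_code: NO` and `version: 1.0`.
loaded = {"country_code": False, "version": 1.0}
try:
    require_strings(loaded, ["country_code", "version"])
except TypeError as exc:
    print(exc)
```

The longer-term fix is in the source file itself: quoting the value, as in country_code: "NO", removes the ambiguity for every parser.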
Best Use Cases
- CI/CD pipeline configuration (GitHub Actions, GitLab CI, CircleCI)
- Application settings files that humans edit frequently
- Infrastructure-as-code (Kubernetes manifests, Ansible Playbooks, Helm Charts)
- Any configuration that needs both comments and hierarchical structure
Full Comparison Matrix
| Property | CSV | JSON | XML | YAML |
|---|---|---|---|---|
| Human readability | High (flat data) | High | Moderate | Very high |
| Data types | None | 6 types | String only | Rich |
| Comments | No | No | Yes | Yes |
| Nested structures | No | Yes | Yes | Yes |
| Schema support | None | JSON Schema | XSD / DTD | JSON Schema (reused) |
| Relative file size | Smallest | Small | Largest | Small–Medium |
| Parse complexity | Low | Low | High | Medium |
| Specification | RFC 4180 | RFC 8259 | W3C Rec | YAML 1.2.2 |
| Streaming support | Yes | Yes | Yes (SAX) | Limited |
| Binary representation | No | No (Base64 workaround) | No | No |
Decision Flow by Use Case
REST API Responses
Recommendation: JSON. It is the de facto standard for REST APIs. XML is only justified when integrating with systems that mandate SOAP.
GET /api/users/1 HTTP/1.1
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json

{"id": 1, "name": "Alice", "age": 28}
Configuration Files
Recommendation: YAML (or TOML for simple flat config). The ability to add comments and the hierarchical indentation syntax make YAML far more maintainable than JSON for config.
# GitHub Actions workflow
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
Bulk Data Exchange
Recommendation: CSV (for flat data) or JSON Lines (for structured records). CSV has excellent compatibility with Excel, pandas, and most ETL tools.
# JSON Lines (JSONL): one JSON object per line, ideal for streaming
import json

# `events` is any iterable of JSON-serializable dicts
with open('events.jsonl', 'w', encoding='utf-8') as f:
    for event in events:
        f.write(json.dumps(event) + '\n')

# Read line by line (low memory usage); `process` is your own handler
with open('events.jsonl', encoding='utf-8') as f:
    for line in f:
        event = json.loads(line)
        process(event)
Structured Logging
Recommendation: JSON. Structured logs in JSON format are easily parsed, aggregated, and queried by logging infrastructure (Elasticsearch, Loki, CloudWatch).
{"timestamp":"2026-04-15T10:30:00Z","level":"ERROR","message":"Connection failed","service":"payment","user_id":42,"latency_ms":1205}
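One way to emit log lines in this shape is a custom formatter for Python's standard logging module. The sketch below is an assumption-laden example rather than a recommended library: JsonFormatter, build_logger, and the convention of passing structured fields through extra={"fields": {...}} are all hypothetical names chosen here.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured fields arrive via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger's handlers
    return logger
```

A call such as build_logger("payment", sys.stderr).error("Connection failed", extra={"fields": {"service": "payment", "latency_ms": 1205}}) would then write one JSON object per line, ready for a log aggregator to index.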
Converting Between Formats
Format conversion is a routine part of development and data workflows:
- CSV → JSON: CSV-JSON Converter
- XML → JSON: XML-JSON Converter
- YAML ↔ JSON: YAML-JSON Converter
- JSON formatting and validation: JSON Formatter
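For scripted pipelines, the same kind of conversion takes only a few lines. Here is a minimal sketch of the CSV to JSON case using only the Python standard library; the function name csv_to_json is an assumption, and every value stays a string because CSV carries no type information.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
    return json.dumps(rows, indent=2)

print(csv_to_json("name,age\r\nAlice,28\r\nBob,32\r\n"))
```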
Summary
| Use Case | Recommended Format |
|---|---|
| REST API | JSON |
| SOAP / enterprise integration | XML |
| Config file (comments needed) | YAML |
| Large tabular data transfer | CSV |
| Streaming log data | JSON Lines |
| Kubernetes / CI-CD config | YAML |
Format choices are expensive to change retroactively. The clearest guidance for 2026: JSON for APIs, YAML for CI/CD and developer-facing configuration, CSV for bulk tabular data exchange. XML remains relevant for enterprise integration and document formats, but it is rarely the right choice for greenfield projects.
References
- RFC 8259 — The JSON Data Interchange Format (2017, IETF)
- RFC 4180 — Common Format and MIME Type for CSV Files (2005, IETF)
- W3C XML 1.0 Specification
- YAML 1.2.2 Specification (October 2021)
- JSON Schema Draft 2020-12
