CSV, Comma Separated Values (RFC 4180)

Table of Contents

Identification and description
Local use
Sustainability factors
Quality and functionality factors
File type signifiers
Notes
Format specifications
Useful references

Format Description Properties

ID: fdd000323
Short name: CSV_strict
Content categories: dataset
Format Category: file-format, encoding
Other facets: unitary, text, structured, symbolic
Last significant FDD update: 2024-05-07
Draft status: Full

Identification and description

Relationship to other formats
Full name	CSV, Comma Separated Values (strict form as described in RFC 4180)
Description	CSV is a simple format for representing a rectangular array (matrix) of numeric and textual values. It is an example of a "flat file" format. It is a delimited data format that has fields/columns separated by the comma character %x2C (Hex 2C) and records/rows/lines separated by characters indicating a line break. RFC 4180 stipulates the use of CRLF pairs to denote line breaks, where CR is %x0D (Hex 0D) and LF is %x0A (Hex 0A). Each line should contain the same number of fields. Fields that contain a special character (comma, CR, LF, or double quote) must be "escaped" by enclosing them in double quotes (Hex 22). An optional header line may appear as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. CSV commonly employs US-ASCII as character set, but other character sets are permitted.
Production phase	May be used at any stage in the lifecycle of a dataset.
Has modified version	Variants of the strict form described here exist. See Notes below.
Affinity to	TSV, Tab-Separated Values
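
The field and record rules in the Description above can be illustrated with a minimal sketch. It assumes Python's standard csv module, configured explicitly with a comma delimiter, double-quote enclosure, and CRLF line breaks; the sample rows are invented for illustration. RFC 4180 additionally represents a double quote inside a quoted field by doubling it, which the module does by default.

```python
import csv
import io

# Hypothetical sample data: a header line plus two records.
rows = [
    ["name", "note"],                    # optional header line
    ["Smith, Jane", 'said "hello"'],     # contains a comma and double quotes
    ["multi", "line one\r\nline two"],   # contains an embedded CRLF
]

# newline="" stops the text layer from translating line breaks.
buf = io.StringIO(newline="")
writer = csv.writer(
    buf,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,   # quote only fields with special characters
    lineterminator="\r\n",       # CRLF between records
)
writer.writerows(rows)
text = buf.getvalue()

# Reading the text back recovers the original values.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(repr(text))
print(parsed == rows)
```

Fields without special characters are written bare; only the fields that need it are enclosed in double quotes.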

Local use

LC experience or existing holdings	The Library of Congress has many CSV files in its collections, over 840,000 as of May 2024.
LC preference	The Library of Congress Recommended Formats Statement (RFS) includes CSV as a preferred format for datasets. The RFS does not specify a type of CSV.

Sustainability factors

Disclosure	A simple de facto format, for which no single, official specification exists. The strict variant of the format described here was registered with IANA for the text/csv MIME type in RFC 4180. In RFC 4180, the required section in an RFC for MIME type registration that documents the "Published Specification" reads: "While numerous private specifications exist for various programs and systems, there is no single 'master' specification for this format. An attempt at a common definition can be found in Section 2 [of RFC 4180]." Some Useful References below provide variant specifications.
Documentation	IETF RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. 2005. Available at http://tools.ietf.org/html/rfc4180 or http://www.ietf.org/rfc/rfc4180.txt
Adoption	Widely used as an exchange format for tabular data. Although very limited in functionality, there are many data exchange or data preservation contexts for which it is adequate, particularly when the syntax and semantics of fields are described in ancillary documentation that is also exchanged or preserved. CSV files can be imported and exported by almost any software designed for storing or manipulating data, including relational database systems, spreadsheet software, and statistical analysis software. CSV is a preferred format for interchange in many contexts because it is so easy to process. Recommended Data Formats for Preservation Purposes in the Florida Digital Archive (link via Internet Archive) lists CSV as a format with a high confidence level of providing ongoing access in a usable form. CSV is a recommended format for data deposit with Library and Archives Canada, a 'high' recommended format for Cornell's implementation, and a recommended format for long-term retention by the State Archives of North Carolina. CSV was one of the primary formats into which the UK National Archives converted datasets that were selected for the National Digital Archive of Datasets between 1997 and 2010 (after which a government initiative promoting open data eliminated the need for such conversion by the National Archives). It is the preferred format for preparing tabular environmental data at the Oak Ridge National Laboratory and its use for tabular data is a best practice for the DataONE (Data Observation Network for Earth) project. Most government open data initiatives have CSV as one of the primary formats in which data can be downloaded. For example, CSV is one of the formats in which data from the U.S. Data.gov can be downloaded. Others include XML, ESRI shapefiles (ESRI_shape), and KML. The last two are for geospatial data.
Licensing and patents	None.
Transparency	A simple text-based format that is very transparent, being both human-readable and easily machine-processable. Simple tools have been developed to validate files and visualize the content of the variables/columns. See, for example, CSV Fingerprint or CSV Lint in Useful References below.
Self-documentation	Poor. There is no internal capability to represent metadata, although the optional header row may provide some clues to the semantics of the columns. For preservation, an associated codebook is desirable, listing and describing the fields, and indicating types and ranges for field data values. In some contexts, the relevant information is supplied by documentation for a larger corpus or resource, rather than for each dataset.
Accessibility Features	Accessibility features for datasets and databases typically involve conformance to W3C's guidelines for page structure, tables and forms. In practical terms, this means pages (if applicable to the dataset) should be well-structured with regions and headings identified and the content marked up or tagged on a page in a way that uses appropriate and meaningful elements; tables are organized through logical relationship in grids with labeled header cells and data cells that define their relationship; and forms (if applicable to the dataset) validate input provided by the user, provide options to undo changes and confirm data entry, notify users about successful task completion and any errors, and provide instructions to help them correct mistakes. Each of these criteria should be supported by text accessible to a screen reader. According to a 2021 discussion list thread on the W3C mailing list, "CSV is so limited in terms of what it can/can't do (can't actually define column or row headers), there's probably little to no scope in terms of remediating CSV files (other than perhaps making sure the first row and column contain header cells, even if they can't be explicitly denoted as such." A response adds "CSV files with more than one table, any table that is not at top and left, that has blank rows used for presentation, or other visual marking calling out specific cells would likely not be able to conform to accessibility requirements because the format does not provide sufficient semantic information." W3C's CSV on the Web: A Primer from 2016 states "CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column." Overall, there is limited native support but applications can add options such as described in Make your Excel documents accessible to people with disabilities. Comments welcome.
External dependencies	None.
Technical protection considerations	None.

Quality and functionality factors

Dataset
Normal functionality	An extremely simple format with limited capabilities. The format does not support strong data typing and is limited to representing a simple tabular structure.
Support for software interfaces (APIs, etc.)	The simple nature of the CSV format allows easy programming for parsing and using the data.
Data documentation (quality, provenance, etc.)	No support. Most guidelines for use of the format for archiving datasets call for data documentation in separate files in appropriate formats.
Beyond normal functionality	None.
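
As an illustration of how little code basic processing requires, the following is a minimal sketch, assuming Python's standard csv module, that checks whether every record has the same number of fields as the first line. The function name inconsistent_records and the sample text are hypothetical, and the check says nothing about data types, which the format itself does not express.

```python
import csv
import io


def inconsistent_records(text):
    """Return (record_number, field_count) for records whose field
    count differs from that of the first record (often the header)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    expected = None
    problems = []
    for number, record in enumerate(reader, start=1):
        if expected is None:
            expected = len(record)      # first record sets the expectation
        elif len(record) != expected:
            problems.append((number, len(record)))
    return problems


# Hypothetical sample: record 3 is short, record 4 has an extra field.
sample = "id,name,score\r\n1,Ada,9.5\r\n2,Grace\r\n3,Alan,8,extra\r\n"
print(inconsistent_records(sample))
```

Record numbers count parsed records rather than physical lines, so a quoted field containing a line break still counts as part of one record.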

File type signifiers and format identifiers

Tag	Value	Note
Filename extension	csv	No particular extension is specified or required, but .csv is often used.
Internet Media Type	text/csv	Registered with IANA via RFC 4180.
Other	NF00143	See https://www.archives.gov/files/lod/dpframework/id/NF00143.ttl
Pronom PUID	x-fmt/18	See http://www.nationalarchives.gov.uk/PRONOM/x-fmt/18.
Wikidata Title ID	Q935809	See https://www.wikidata.org/wiki/Q935809.

Notes

General

Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools such as those listed below as Useful References:

- In locales where the comma character is used in place of a decimal point in numbers, the separator between fields/columns is often a semicolon.
- The line break character may be CR or LF, not necessarily CRLF.
- Some Unix-based applications may use a different escape mechanism for indicating that one of the separator characters occurs within a text value. The individual character is preceded by a backslash character rather than enclosing the entire string in double quotes.
- Single quotes may be treated as equivalent to double-quotes for escaping (also known as "text-qualification").
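
The variations listed above can often be handled by adjusting parser settings. The following is a minimal sketch, assuming Python's standard csv module; the sample strings are invented for illustration, and other tools expose comparable options under different names.

```python
import csv
import io


def read_all(text, **options):
    """Parse text with csv.reader, passing dialect options through."""
    return list(csv.reader(io.StringIO(text, newline=""), **options))


# Semicolon separator with decimal commas and LF-only line breaks.
semicolon_rows = read_all("city;temp\nParis;21,5\nOslo;14,0\n", delimiter=";")

# Backslash escape in place of quoting: the comma after "fast" is data.
backslash_rows = read_all(
    "id,comment\n1,fast\\, cheap\n",
    quoting=csv.QUOTE_NONE,
    escapechar="\\",
)

# Single quotes used for text-qualification.
single_quote_rows = read_all("id,name\n1,'Doe, John'\n", quotechar="'")

print(semicolon_rows)
print(backslash_rows)
print(single_quote_rows)
```

None of these settings is detected automatically in this sketch; the reader has to be told which variant a given file uses.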

Several other caveats are worth noting:

- The last record in a file may or may not end with a line break character.
- Non-printable characters may be included in text fields by using one of several C-style character escape sequences: \### or \o### Octal; \x## Hex; \d### Decimal; and \u#### Unicode.
- The treatment of whitespace adjacent to field and record separators varies among applications. If whitespace at the beginning and end of a textual field value is significant, the text string should be text-qualified, i.e. enclosed in quotes.
- In some uses, there is an assumption of strong data typing, with unquoted fields considered to be numeric, and quoted fields considered to be text data.
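
Two of these caveats, whitespace handling and the quoted-means-text convention, can be sketched as follows. This assumes Python's standard csv module, whose QUOTE_NONNUMERIC reader option converts unquoted fields to floating-point numbers and whose skipinitialspace option ignores whitespace immediately after a delimiter; the sample data is invented for illustration.

```python
import csv
import io

# Quoted fields are kept as text; unquoted fields are read as numbers.
typed_text = '"id","label","value"\r\n"a1","north",3.5\r\n"a2"," padded ",10\r\n'
typed_rows = list(
    csv.reader(io.StringIO(typed_text, newline=""), quoting=csv.QUOTE_NONNUMERIC)
)

# Whitespace after the separator: by default it becomes part of the field,
# so the quote is no longer at the start of the field and is not honored.
spaced_text = 'a, b, "c, d"\r\n'
default_rows = list(csv.reader(io.StringIO(spaced_text, newline="")))
trimmed_rows = list(
    csv.reader(io.StringIO(spaced_text, newline=""), skipinitialspace=True)
)

print(typed_rows)
print(default_rows)
print(trimmed_rows)
```

The leading and trailing spaces in the quoted value " padded " survive parsing, which is the reason the text recommends quoting fields whose surrounding whitespace is significant.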

History

Format specifications

RFC 4180 is available online from IETF at more than one site and in more than one format:
- RFC 4180 in various formats (http://tools.ietf.org/html/rfc4180).
- RFC 4180 in plain text (http://www.ietf.org/rfc/rfc4180.txt).

Useful references

URLs

CSV at Wikipedia (http://en.wikipedia.org/wiki/Comma-separated_values).
CSV format specification (http://supercsv.sourceforge.net/csv_specification.html). From the SourceForge site for Super CSV.
The Comma Separated Value (CSV) File Format (http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm). From Creativyst Software.
CSV Files (http://www.csvreader.com/csv_format.php). From website for CSVReader and DataStreams software.
FDA File Format Details: CSV (https://web.archive.org/web/20160316101027/http://fclaweb.fcla.edu/content/csv). From Florida Digital Archive (link via Internet Archive).
Setosa blog: CSV Fingerprints (http://setosa.io/blog/2014/08/03/csv-fingerprints/). The blog post includes a box to paste a CSV file into for visualization.
All About CSV Fingerprint (https://source.opennews.org/en-US/articles/all-about-csv-fingerprint/).
CSV Lint (http://csvlint.io/). Simple online format validator for CSV files.
NARA File Format Preservation Plan ID entry for NF00143 (https://www.archives.gov/files/lod/dpframework/id/NF00143.ttl). Information in NARA File Format Preservation Plan ID about Comma Separated Value.
PRONOM entry for x-fmt/18 (http://www.nationalarchives.gov.uk/pronom/x-fmt/18). Information in PRONOM from UK National Archives about CSV. PUID: x-fmt/18.
Wikidata entry for Q935809 (https://www.wikidata.org/wiki/Q935809). Information in Wikidata about CSV. Wikidata Title ID: Q935809.

Last Updated: 05/09/2024

Sustainability of Digital Formats: Planning for Library of Congress Collections
