Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | CSV, Comma Separated Values (strict form as described in RFC 4180) |
---|---|
Description |
CSV is a simple format for representing a rectangular array (matrix) of numeric and textual values. It is an example of a "flat file" format. It is a delimited data format that has fields/columns separated by the comma character %x2C (Hex 2C) and records/rows/lines separated by characters indicating a line break. RFC 4180 stipulates the use of CRLF pairs to denote line breaks, where CR is %x0D (Hex 0D) and LF is %x0A (Hex 0A). Each line should contain the same number of fields. Fields that contain a special character (comma, CR, LF, or double quote) must be "escaped" by enclosing them in double quotes (Hex 22). An optional header line may appear as the first line of the file, with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. CSV commonly employs US-ASCII as its character set, but other character sets are permitted. |
Production phase | May be used at any stage in the lifecycle of a dataset. |
Relationship to other formats | |
Has modified version | Variants of the strict form described here exist. See Notes below. |
Affinity to | TSV, Tab-Separated Values |
LC experience or existing holdings | The Library of Congress has many CSV files in its collections, over 840,000 as of May 2024. |
---|---|
LC preference | The Library of Congress Recommended Formats Statement (RFS) includes CSV as a preferred format for datasets. The RFS does not specify a type of CSV. |
Disclosure |
A simple de facto format, for which no single, official specification exists. The strict variant of the format described here was registered with IANA for the text/csv MIME type in RFC 4180. In RFC 4180, the required section in an RFC for MIME type registration that documents the "Published Specification" reads: "While numerous private specifications exist for various programs and systems, there is no single 'master' specification for this format. An attempt at a common definition can be found in Section 2 [of RFC 4180]." Some Useful References below provide variant specifications. |
---|---|
Documentation | IETF RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. 2005. Available at http://tools.ietf.org/html/rfc4180 or http://www.ietf.org/rfc/rfc4180.txt |
Adoption |
Widely used as an exchange format for tabular data. Although very limited in functionality, there are many data exchange or data preservation contexts for which it is adequate, particularly when the syntax and semantics of fields are described in ancillary documentation that is also exchanged or preserved. CSV files can be imported and exported by almost any software designed for storing or manipulating data, including relational database systems, spreadsheet software, and statistical analysis software. CSV is a preferred format for interchange in many contexts because it is so easy to process. Recommended Data Formats for Preservation Purposes in the Florida Digital Archive (link via Internet Archive) lists CSV as a format with a high confidence level of providing ongoing access in a usable form. CSV is a recommended format for data deposit with Library and Archives Canada, a 'high' recommended format for sCornell implementation, and a recommended format for long-term retention by the State Archives of North Carolina. CSV was one of the primary formats into which the UK National Archives converted datasets that were selected for the National Digital Archive of Datasets between 1997 and 2010 (after which a government initiative promoting open data eliminated the need for such conversion by the National Archives). It is the preferred format for preparing tabular environmental data at the Oak Ridge National Laboratory and its use for tabular data is a best practice for the DataONE (Data Observation Network for Earth) project. Most government open data initiatives have CSV as one of the primary formats in which data can be downloaded. For example, CSV is one of the formats in which data from the U.S. Data.gov can be downloaded. Others include XML, ESRI shapefiles (ESRI_shape), and KML. The last two are for geospatial data. |
Licensing and patents | None. |
Transparency |
A simple text-based format that is very transparent, being both human-readable and easily machine-processable. Simple tools have been developed to validate files and visualize the content of the variables/columns. See, for example, CSV Fingerprint or CSV Lint in Useful References below. |
Self-documentation |
Poor. There is no internal capability to represent metadata, although the optional header row may provide some clues to the semantics of the columns. For preservation, an associated codebook is desirable, listing and describing the fields, and indicating types and ranges for field data values. In some contexts, the relevant information is supplied by documentation for a larger corpus or resource, rather than for each dataset. |
Accessibility Features |
Accessibility features for datasets and databases typically involve conformance to W3C's guidelines for page structure, tables and forms. In practical terms, this means that pages (if applicable to the dataset) should be well structured, with regions and headings identified and content marked up or tagged using appropriate and meaningful elements; tables should be organized through logical relationships in grids, with labeled header cells and data cells that define their relationship; and forms (if applicable to the dataset) should validate user input, provide options to undo changes and confirm data entry, notify users about successful task completion and any errors, and provide instructions to help them correct mistakes. Each of these criteria should be supported by text accessible to a screen reader. According to a 2021 thread on a W3C mailing list, "CSV is so limited in terms of what it can/can't do (can't actually define column or row headers), there's probably little to no scope in terms of remediating CSV files (other than perhaps making sure the first row and column contain header cells, even if they can't be explicitly denoted as such." A response adds "CSV files with more than one table, any table that is not at top and left, that has blank rows used for presentation, or other visual marking calling out specific cells would likely not be able to conform to accessibility requirements because the format does not provide sufficient semantic information."
W3C's CSV on the Web: A Primer from 2016 states "CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column." Overall, there is limited native support, but applications can add options such as those described in Make your Excel documents accessible to people with disabilities. Comments welcome. |
External dependencies | None. |
Technical protection considerations | None. |
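
The Transparency and Self-documentation entries above note that CSV carries no type information and is prone to missing values or mixed data types within a column. The following Python sketch illustrates the kind of check a simple validation tool might perform: it flags records whose field count differs from the header and records which coarse value types appear in each column. The function names `classify` and `profile_csv`, the type labels, and the sample data are hypothetical and chosen only for illustration; the sketch assumes a header row with unique column names and is not the method used by CSV Lint or CSV Fingerprint.

```python
import csv
import io


def classify(value):
    """Return a coarse, illustrative type label for one field value."""
    if value == "":
        return "missing"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def profile_csv(text):
    """Report ragged records and the set of value types seen per column.

    Assumes the first record is a header with unique column names.
    Record numbers are 1-based and count the header as record 1.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    ragged = []
    kinds = {name: set() for name in header}
    for record_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            # Field count differs from the header: note it and move on.
            ragged.append(record_no)
            continue
        for name, value in zip(header, row):
            kinds[name].add(classify(value))
    return ragged, kinds


# Hypothetical sample: one empty field, one short record, one textual year.
sample = (
    "station,year,temp_c\r\n"
    "A1,2020,14.5\r\n"
    "A2,,15\r\n"
    "A3,2021\r\n"
    "A4,unknown,13.9\r\n"
)
ragged, kinds = profile_csv(sample)
print("Records with the wrong number of fields:", ragged)
for name, seen in kinds.items():
    print(name, sorted(seen))
```

A column that reports more than one label (for example both "integer" and "text") is a candidate for description in the accompanying codebook, since the file itself cannot say which type was intended.
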
Dataset | |
---|---|
Normal functionality | An extremely simple format with limited capabilities. The format does not support strong data typing and is limited to representing a simple tabular structure. |
Support for software interfaces (APIs, etc.) | The simple nature of the CSV format allows easy programming for parsing and using the data. |
Data documentation (quality, provenance, etc.) | No support. Most guidelines for use of the format for archiving datasets call for data documentation in separate files in appropriate formats. |
Beyond normal functionality | None. |
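
As an illustration of how little code is needed to work with the strict form, the following Python sketch writes and re-reads a small table using the standard library `csv` module, configured for comma delimiters, CRLF line breaks, and double-quote enclosure of fields that contain special characters. It also doubles any double quote embedded in a quoted field, which is how RFC 4180 escapes them. The helper names `write_rfc4180` and `read_rfc4180` and the sample rows are hypothetical; this is a minimal sketch, not a validator for the RFC.

```python
import csv
import io


def write_rfc4180(rows):
    """Serialize rows to a CSV string with CRLF line breaks and minimal quoting."""
    buffer = io.StringIO(newline="")  # newline="" prevents newline translation
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # embedded " is written as ""
        lineterminator="\r\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_rfc4180(text):
    """Parse a CSV string back into a list of rows (lists of strings)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar='"',
        doublequote=True,
        strict=True,  # raise csv.Error on malformed input
    )
    return list(reader)


# Hypothetical sample with a comma, double quotes, and a line break in fields.
rows = [
    ["id", "title", "note"],
    ["1", "Plain value", "no special characters"],
    ["2", "Comma, inside", 'He said "hello"'],
    ["3", "Line\r\nbreak", ""],
]
text = write_rfc4180(rows)
print(repr(text))
print(read_rfc4180(text) == rows)
```

Note that every value comes back as a string: the format has no data typing, so any conversion to numbers or dates has to be driven by external documentation.
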
Tag | Value | Note |
---|---|---|
Filename extension | csv | No particular extension is specified or required, but .csv is often used. |
Internet Media Type | text/csv | Registered with IANA via RFC 4180. |
Other | NF00143 | See https://www.archives.gov/files/lod/dpframework/id/NF00143.ttl |
Pronom PUID | x-fmt/18 | See http://www.nationalarchives.gov.uk/PRONOM/x-fmt/18. |
Wikidata Title ID | Q935809 | See https://www.wikidata.org/wiki/Q935809. |
General |
Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools such as those listed below as Useful References:
Several other caveats are worth noting:
|
---|---|
History |
|