Sustainability of Digital Formats: Planning for Library of Congress Collections |
Full name | Stata Data Format (.dta), Version 118 |
---|---|
Description |
The Stata_dta format (with extension .dta) is a proprietary binary format designed for use as the native format for datasets with Stata, a system for statistics and data analysis. Stata 1.0 was released in 1985 for the IBM PC. Stata is now available for Windows, Mac OS, and Unix. Versions of the .dta format are numbered separately from the Stata application. Version 118, described in this document, and given the name "Stata_dta_118" on this site, was introduced in April 2015 with Stata 14 and is also the default file format for Stata 15, which was released on June 6, 2017. A newer version of Stata_dta (version 119) was introduced in Stata 15, but is only used for datasets with more than 32,767 variables, as supported by Stata/MP. Stata 15 help for dta states, "Stata itself can read older formats, but whenever it writes a dataset, it writes in 118 format. If a dataset has more than 32,767 variables, Stata writes in 119 format." See Notes for more on the version history for Stata_dta and which version (sometimes called "release") of the dataset format is associated with which version of the application. Basic characteristics of Stata_dta_118 apply to all versions of the format. Numbers are represented as 1-, 2-, and 4-byte integers and 4- and 8-byte floating-point numbers. ANSI/IEEE Standard 754-1985 format is used for the binary floating point values, which is equivalent to IEEE Standard 754-2008 for the floating-point numbers used in .dta files. Byte-ordering (big-endian or little-endian), which varies with operating system and processor hardware, is declared in the file header. In Stata_dta_118, strings are encoded in UTF-8, whether in data, or in variable names, etc. In earlier versions the encoding was ASCII. Stata generally places a binary zero (hex 00, written as \0 in Stata documentation) at the end of strings. However, structural details have changed significantly with some format versions, particularly between versions 115 and 117. 
Most details in this description of Stata_dta_118 will be relevant to versions 117 and 119, not described separately at this time. A Stata_dta_118 file has the following general structure:
See Representation of strings and Representation of numbers for more details on these important aspects of the Stata_dta_118 format. |
Production phase | Designed as an initial-state or middle-state format to support creation and statistical analysis of data and intermediate storage and exchange of statistical data among users of the Stata system for statistical analysis. |
Relationship to other formats | |
Has earlier version | Several earlier versions, not described separately on this site at this time. |
Has later version | One later version, 119, not described separately on this site at this time. |
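The number and string conventions given in the Description can be illustrated with a minimal Python sketch. It assumes the byte-order declaration has already been read from the header and takes the value `MSF` (big-endian) or `LSF` (little-endian); those codes are not stated in this description and are an assumption here, as are the helper names `decode_number` and `decode_string`. The sketch does not interpret missing-value codes.

```python
import struct

# Byte-order codes: ASSUMED to be "MSF" (big-endian) and "LSF" (little-endian).
# They are not given in the description above.
BYTE_ORDER_PREFIX = {"MSF": ">", "LSF": "<"}

# struct format characters for the five numeric storage widths named in the
# description: 1-, 2-, 4-byte signed integers and 4-, 8-byte IEEE 754 floats.
NUMERIC_CODES = {
    ("int", 1): "b",
    ("int", 2): "h",
    ("int", 4): "i",
    ("float", 4): "f",
    ("float", 8): "d",
}


def decode_number(raw: bytes, kind: str, byteorder: str):
    """Decode one fixed-width number; the width is taken from len(raw)."""
    code = NUMERIC_CODES[(kind, len(raw))]
    return struct.unpack(BYTE_ORDER_PREFIX[byteorder] + code, raw)[0]


def decode_string(raw: bytes) -> str:
    """Decode a UTF-8 string field, stopping at the first binary zero, if any."""
    return raw.split(b"\x00", 1)[0].decode("utf-8")
```

For example, `decode_number(b"\x01\x00", "int", "LSF")` and `decode_number(b"\x00\x01", "int", "MSF")` both return 1, which shows why the declared byte order must be known before any numeric field is read.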
LC experience or existing holdings | The Library of Congress has a small number of files in this family of formats in its collections. |
---|---|
LC preference | See the Library of Congress Recommended Formats Statement for format preferences for datasets. The RFS expresses a preference for widely adopted character-based formats rather than application-specific native formats or binary formats for datasets. |
Disclosure | Stata_dta is a family of proprietary formats developed and maintained by StataCorp LLC. Versions of the format dating from 2003 are publicly documented. |
---|---|
Documentation | The current version of the Stata_dta format is specified at http://www.stata.com/help.cgi?dta. As of June 2017, this specification is for Stata_dta_118 and provides links to documentation for Stata_dta versions between 113 and 119, covering Stata 8 (2003) through Stata 15 (2017). |
Adoption |
The Stata_dta_118 format is primarily used in association with Stata statistical software, which is widely used, particularly in academic settings. See, for example, Quantitative File Formats for Preservation, a post on the Digital Preservation Coalition blog, which indicates that the bulk of the datasets received by the Irish Social Science Data Archive are in SPSS, SAS, and Stata formats. Stata_dta files can be imported into and/or exported from other statistics software, including SPSS and SAS. readstata13 is an R package that reads Stata files into an R data.frame and writes a data.frame back out as a Stata file; it supports Stata_dta versions 102 to 118. Stat/Transfer, a popular conversion utility for statistical data, can read and write Stata_dta files.
Stata_dta is a download format for several data archives, including the Survey of Consumer Finances from the U.S. Federal Reserve. Current Population Survey Data for Social, Economic and Health Research is available for download in Stata_dta format, as is the General Social Survey from NORC at the University of Chicago. See also Stata examples and datasets. Survey Solutions, free software from the World Bank Group for collecting data from structured interviews or web surveys, includes Stata_dta among its Data Export Files. As of June 2017, Survey Solutions is generating files compatible with Stata 14, i.e., Stata_dta_118.
The Stata_dta format is accepted by most statistical archives. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists Stata_dta as acceptable in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission (link via Internet Archive) list Stata_dta among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archiving and Networked Services) lists the Stata_dta format as preferred. The Institution for Social and Policy Studies (ISPS Data Archive; link via Internet Archive) accepts Stata_dta but prefers an ASCII file such as CSV. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis supports the import of Stata_dta files in the NESSTAR Publisher module. The Colorado School of Mines includes the Stata_dta format in its list of recommended or acceptable formats. The Dataverse guidance on ingest of Stata files says, "Stata does the best job at documenting the internal format of their files, by far. ... Because of that, Stata is the best supported format for tabular data ingest." |
Licensing and patents | No issues. |
Transparency | Stata_dta_118 is not transparent, since data values are stored in binary form. However, the ASCII (XML-style) tags that delimit the file's components are visible when the file is opened in a text editor. See, for example, the Stata sample file odd1.dta. This file is in Stata_dta version 117, but a version 118 file would show the same tags, differing only in the <release> value. |
Self-documentation | Stata_dta_118 can contain names and optional labels for variables. Labels that explain values for coded variables can also be included. Missing values are supported for numeric variables. There does not seem to be any way to embed a description of the file as a whole apart from an 80-character label for the dataset. |
External dependencies | None beyond software that can import data in this format. |
Technical protection considerations | Stata_dta_118 appears to have no internal capabilities for encryption or other technical protection. However, a discussion thread from 2007 on encryption of individual variables for anonymizing data implies that individual variable values may be encrypted for this purpose. The compilers of this resource have not determined whether this approach is widely used. Comments welcome. |
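The point made under Transparency, that the XML-style tags remain readable even though the data values are binary, can be illustrated with a short Python sketch. The function name `list_tags` and the sample bytes are hypothetical, and the scan is only a heuristic: binary data could by chance contain a byte sequence that looks like a tag.

```python
import re

# Opening tags only: lowercase ASCII letters or underscores in angle brackets,
# e.g. <stata_dta> or <header>. Closing tags such as </release> do not match.
OPENING_TAG = re.compile(rb"<([a-z_]+)>")


def list_tags(data: bytes) -> list:
    """Return distinct opening-tag names in order of first appearance."""
    seen = []
    for match in OPENING_TAG.finditer(data):
        name = match.group(1).decode("ascii")
        if name not in seen:
            seen.append(name)
    return seen


# Synthetic sample (not a real dataset): tagged text around a few binary bytes.
sample = b"<stata_dta><header><release>118</release>\x00\x07\xff</header></stata_dta>"
print(list_tags(sample))
```

Run against a real file, the same function would be given the bytes returned by `open(path, "rb").read()`.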
Dataset | |
---|---|
Normal functionality | The Stata_dta format is capable of representing all the data types used in Stata, a widely used software system for statistical analysis. |
Support for software interfaces (APIs, etc.) | See Adoption section above. |
Data documentation (quality, provenance, etc.) | See Self-documentation above. For re-use or long-term preservation, additional discipline-specific metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts. |
Tag | Value | Note |
---|---|---|
Filename extension | dta | |
Magic numbers | ASCII: <stata_dta><header><release>118</release> Hex: 3C 73 74 61 74 61 5F 64 74 61 3E 3C 68 65 61 64 65 72 3E 3C 72 65 6C 65 61 73 65 3E 31 31 38 3C 2F 72 65 6C 65 61 73 65 3E | From specification. |
Pronom PUID | fmt/1037 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/1037 |
Wikidata Title ID | Q32979267 | See https://www.wikidata.org/wiki/Q32979267. |
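The magic number above lends itself to a simple signature check. The following is a minimal Python sketch, not a full identification tool: the function names `sniff_dta_release` and `sniff_file` are hypothetical, and capturing any three-digit release number rests on the assumption that sibling versions such as 117 and 119 open with the same tag sequence.

```python
import re

# Pattern built from the magic number listed above. The release number is
# captured rather than fixed at 118, on the ASSUMPTION that versions 117 and
# 119 use the same leading tags.
SIGNATURE = re.compile(rb"<stata_dta><header><release>(\d{3})</release>")


def sniff_dta_release(data: bytes):
    """Return the release number if data starts with the signature, else None."""
    match = SIGNATURE.match(data)
    return int(match.group(1)) if match else None


def sniff_file(path: str):
    """Read the first 64 bytes of a file (the signature is 41 bytes) and sniff it."""
    with open(path, "rb") as handle:
        return sniff_dta_release(handle.read(64))
```

Files in older, untagged versions of the format would return `None` from this check and need a different signature.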
General | |
---|---|
History |
Stata 1.0 was released in January 1985 for the IBM PC. It was a product of CRC, based in California. The first Unix version was released in 1988 and the first Macintosh version in 1992. CRC moved to Texas in 1993 and became StataCorp. See A brief history of Stata on its 20th anniversary in 2005. See also History of Stata. Although the .dta format has remained broadly similar over the years, significant changes have been made. A recent version history follows:
PRONOM lists signatures for several earlier versions of Stata_dta, determined by inference and observation: version 111; version 110; version 105; and version 104. |
|