VI. Datasets
NOTE: See also Geospatial and Cartographic
The Library is aware that, in some cases, the provision of datasets and databases for current research uses (including support for the U.S. Congress) may depend upon native formats and associated software, while preservation and long-term access may depend upon data migration via transport or export formats, with a concomitant risk of loss of precision and accuracy. Given that the focus of this document is preservation and long-term access, the following format preferences favor those outcomes.
i. Datasets
|
Preferred |
Acceptable |
A. Formats |
- Platform-independent, character-based formats are preferred over native or binary formats as long as the data is complete and retains full detail and precision. Preferred formats include well-developed, widely adopted, de facto marketplace standards, e.g.
- Formats using well-known schemas with a publicly available validation tool
- Line-oriented, e.g. TSV, CSV, fixed-width
- Platform-independent open formats, e.g. .db, .db3, .sqlite, .sqlite3
- Any proprietary format that is a de facto standard for a profession or supported by multiple tools (e.g. Excel .xls or .xlsx, Shapefile)
- Character Encoding, in descending order of preference:
- UTF-8, UTF-16 (with BOM)
- US-ASCII or ISO 8859-1
- Other named encoding
|
For data (in order of preference):
- Non-proprietary, publicly documented formats endorsed as standards by a professional community or government agency, e.g. CDF, HDF
- Text-based data formats with available schema
For aggregation or transfer:
- ZIP, RAR, tar, 7z with no encryption, password or other protection mechanisms.
|
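As an illustration of the format preferences above, the following is a minimal Python sketch of how a depositor might screen a file before submission. It is not a Library tool: the function names `classify_encoding` and `encrypted_members` and the returned labels are hypothetical. The first function sorts a byte string into the character-encoding tiers listed above; the second lists any members of a ZIP aggregate whose encryption flag is set. ISO 8859-1 cannot be confirmed from the bytes alone, so the sketch reports it as "other" and leaves that determination to the deposit's metadata.

```python
import codecs
import io
import zipfile


def classify_encoding(data: bytes) -> str:
    """Return a coarse label for the character encoding of `data` (hypothetical labels)."""
    # Rule out UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "other"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 (with BOM)"
    # Pure 7-bit content is US-ASCII (and therefore also valid UTF-8).
    if all(byte < 0x80 for byte in data):
        return "US-ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Python's latin-1 codec accepts every byte value, so decoding cannot
        # confirm ISO 8859-1; the encoding has to be named in the metadata.
        return "other"


def encrypted_members(zip_bytes: bytes) -> list[str]:
    """List the members of a ZIP archive whose encryption flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Bit 0 of the general-purpose flag marks an encrypted member.
        return [info.filename for info in archive.infolist() if info.flag_bits & 0x1]
```

An empty list from `encrypted_members` suggests that the aggregate carries no password protection at the ZIP level; it says nothing about protection applied inside the individual payload files.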
B. Related Materials |
Consult the appropriate sections of this document to identify the preferred formats for supplementary material
|
|
C. Delivery Method, in order of preference |
- Public download URLs
- Automated private download URLs with any necessary API keys or credentials
- Hard drive; CD-ROM; DVD-ROM
|
|
D. Metadata |
- Deposits should include all applicable metadata, data dictionaries, XML schemas, and technical specifications as appropriate. Discipline-specific metadata standards should be used whenever possible
- As supported by format:
- Title
- Creator
- Creation date
- Place of publication
- Publisher/producer/distributor
- Contact information
- A list of software used to produce, render or compress the data (if applicable)
- Character encoding
- Include if available:
- Language of work
- Other relevant identifiers (e.g., DOI, LCCN, canonical URL, etc.)
- Subject descriptors
- Abstracts
- Key or reference to each data field
- Checksums
- Permanent version specifiers (e.g., date, version number, etc.)
- Information about how the data was collected and any sampling or post-processing which has been applied
- Known copyright terms, especially for datasets which combine data from multiple sources
- For datasets serving as part of a database: proprietary database package and version
- For aggregate files: manifest or file list of payload content
|
|
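The checksum and manifest items listed under Metadata can be produced together. The following Python sketch assumes SHA-256 as the checksum algorithm and a "checksum, two spaces, relative path" line layout; neither is specified by this document, and the function names `sha256_of` and `build_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> list[str]:
    """Return one '<checksum>  <relative path>' line per file under `root`."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Use forward slashes so the manifest reads the same on every platform.
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}")
    return lines
```

Writing the returned lines to a text file alongside the payload gives the recipient both a file list and a means of verifying each file after transfer.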
E. Technological Measures |
- Files must contain no measures (such as digital rights management technologies or encryption) that control access to or prevent use of the digital work.
- Files in formats which support linking or embedding external resources (e.g. XML, JSON, Excel .xls or .xlsx) should be self-contained to remain useful in the event of external service changes.
- Files in formats which support executable code (e.g. Excel) do not contain executable code.
|
Files in formats which support executable code do not depend on embedded programs for purposes other than display (e.g. search, filtering, etc.); the raw data is available without executing code. |
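For ZIP-based Excel workbooks, one rough way to screen for embedded executable code is to look for a VBA project part inside the package. The Python sketch below assumes that macros are stored in a part named `vbaProject.bin`, as is conventional for macro-enabled Office Open XML files. It does not cover the older binary .xls format, and the function name `has_vba_project` is hypothetical.

```python
import io
import zipfile


def has_vba_project(ooxml_bytes: bytes) -> bool:
    """Report whether a ZIP-based Office package contains a VBA project part."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as package:
        # Assumption: embedded VBA macros are stored as '.../vbaProject.bin'.
        return any(
            name.lower().endswith("vbaproject.bin") for name in package.namelist()
        )
```

A result of `False` indicates only that no VBA project part was found; it is not a full guarantee that the file is free of executable content.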
ii. Databases
|
Preferred |
Acceptable |
A. Preservation |
Complete set of the content contained within the database
| |
B. Access, in order of preference |
- Publisher web interface with:
- Comprehensive and user-friendly search and discovery
- COUNTER-compliant usage statistics
- Delivered preservation content
|
Documented API |