textMD: Technical Metadata for Text

Official Web Site at The Library of Congress

The textMD v.3.0 alpha element set

This document contains a listing of elements and their related attributes in textMD version 3.0 alpha with values or value sources where applicable. It is an "outline" of the schema, organized into the root element, the top-level elements, the low-level elements, and the attributes.

All descendants and all attributes below <textMD> are optional. All elements are repeatable. Attributes are not used in a mandated sequence and are not repeatable (per XML rules). "Ordered" below means the subelements must occur in the order given.

The proposed textMD 3.0 alpha schema adds two new elements that did not exist in version 2.2: pageOrder and pageSequence.

Additionally, the 3.0 alpha schema now has a target namespace URI: info:lc/xmlns/textMD-v3. Versions up to (and including) 2.2 did not declare a default namespace URI.
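Because the 3.0 alpha schema declares a target namespace, a namespace-aware parser only finds textMD elements when the lookup is qualified with that URI. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the sample instance, the prefix "t", and the helper name find_text are illustrative assumptions, not part of the schema.

```python
import xml.etree.ElementTree as ET

# Target namespace URI stated for textMD 3.0 alpha.
TEXTMD_NS = "info:lc/xmlns/textMD-v3"

# Hypothetical sample instance, written only for illustration.
SAMPLE = """<textMD xmlns="info:lc/xmlns/textMD-v3">
  <character_info>
    <charset>UTF-8</charset>
  </character_info>
  <language>eng</language>
</textMD>"""


def find_text(root, path):
    """Return the text of the first element matching a prefixed path, or None."""
    node = root.find(path, {"t": TEXTMD_NS})
    return None if node is None else node.text


root = ET.fromstring(SAMPLE)
charset = find_text(root, "t:character_info/t:charset")
language = find_text(root, "t:language")

# An unqualified lookup misses the element because it lives in the namespace.
unqualified = root.find("language")
```

A lookup written for version 2.2 documents, which carried no namespace, would therefore need to be adjusted before it can read 3.0 alpha instances.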


Root element in textMD element set

textMD
Usage: Root Element for bundling text technical metadata.
Attributes: none.
Contains: encoding, character_info, language, alt_language, font_script, markup_basis, markup_language, processingNote, printRequirements, viewingRequirements, textNote, pageOrder, pageSequence.
Contained by: none.
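Since every child of textMD is optional and repeatable but the children follow an accepted sequence, a list of child names can be checked by confirming that their positions in the "Contains" list never decrease. The sketch below assumes that the order given in "Contains" is the order the schema enforces; the function name check_child_order is hypothetical.

```python
# Order taken from the "Contains" list of the textMD root element.
TOP_LEVEL_ORDER = [
    "encoding", "character_info", "language", "alt_language", "font_script",
    "markup_basis", "markup_language", "processingNote", "printRequirements",
    "viewingRequirements", "textNote", "pageOrder", "pageSequence",
]


def check_child_order(child_names):
    """Return a list of problems; an empty list means the order looks acceptable."""
    rank = {name: i for i, name in enumerate(TOP_LEVEL_ORDER)}
    problems = []
    highest = -1  # rank of the latest element accepted so far
    for name in child_names:
        if name not in rank:
            problems.append(f"unknown element: {name}")
        elif rank[name] < highest:
            problems.append(f"out of order: {name}")
        else:
            # Equal rank is allowed because elements are repeatable.
            highest = rank[name]
    return problems
```

For example, check_child_order(["encoding", "language", "language", "textNote"]) returns an empty list, while placing encoding after language is reported as out of order.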

Top-level <textMD> elements

These elements are direct children of the <textMD> root element. The sorting is based on the accepted sequence in which they may be used.

encoding
Usage: Technical aspects of the text generation, whether analog-to-digital or born digital.
Attributes: QUALITY.
Contains: encoding_platform, encoding_software, encoding_agent.
Contained by: textMD.
character_info
Usage: Information regarding the encoding of characters within the file, including the standardized name of the character set, the byte order, the character size, and the line break mechanism.
Attributes: none.
Contains: charset, byte_order, byte_size, character_size, linebreak.
Contained by: textMD.
language
Usage: Language(s) used in work. Use ISO 639-2 codes, which are enumerated in the schema as valid text values.
Attributes: none.
Contains: none.
Contained by: textMD.
alt_language
Usage: A language code/description for the text other than ISO 639-2. The alt_language element has a single attribute, authority, which may be used to record the source of the language code (e.g., Ethnologue).
Attributes: authority.
Contains: none.
Contained by: textMD.
font_script
Usage: The default font or script of the item.
Attributes: none.
Contains: none.
Contained by: textMD.
markup_basis
Usage: The metalanguage used to create the markup language, such as SGML, XML, GML, etc.
Attributes: version.
Contains: none.
Contained by: textMD.
markup_language
Usage: Markup language employed on the text (i.e., the specific schema or DTD). The value may be a URI for the schema or DTD, but this is not mandatory.
Attributes: version.
Contains: none.
Contained by: textMD.
processingNote
Usage: Any general note about the processing of the file not covered elsewhere.
Attributes: none.
Contains: none.
Contained by: textMD.
printRequirements
Usage: Any special requirements for printing the item.
Attributes: none.
Contains: none.
Contained by: textMD.
viewingRequirements
Usage: Any special hardware or software requirements for viewing the item.
Attributes: none.
Contains: none.
Contained by: textMD.
textNote
Usage: Any general note on material not covered elsewhere.
Attributes: none.
Contains: none.
Contained by: textMD.
pageOrder (new element)
Usage: The natural (language-specific) page turning order of the text (left-to-right for Latin-based script, right-to-left for Arabic, Hebrew, etc.) independent of how it is represented in the METS file.
Attributes: none.
Contains: none.
Contained by: textMD.
pageSequence (new element)
Usage: The arrangement of the page-level divs in the METS file. That is, does the first div contain the first page a user would naturally read based on the language-specific direction of the text (the beginning of the content) or the last page the user would naturally read (the end of the content)? Enumerated values are 'reading-order' and 'inverse-reading-order'.
Attributes: none.
Contains: none.
Contained by: textMD.
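The pageSequence value tells a consumer whether the page-level divs can be read as listed or must be reversed to reach the beginning of the content. A minimal sketch of that decision follows; the function name pages_in_reading_order and the idea of representing divs as a list of labels are assumptions made for illustration, and only the two enumerated values come from the element description.

```python
def pages_in_reading_order(div_labels, page_sequence):
    """Return page labels arranged so the first item is the first page to read.

    div_labels: labels of the page-level divs in the order they appear in METS.
    page_sequence: the pageSequence value recorded in textMD.
    """
    if page_sequence == "reading-order":
        return list(div_labels)
    if page_sequence == "inverse-reading-order":
        return list(reversed(div_labels))
    raise ValueError(f"unexpected pageSequence value: {page_sequence!r}")
```

Note that pageOrder plays no role here: it records the natural page turning direction of the text and is independent of how the divs are arranged.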

Low-level <textMD> elements

These elements are contained by the top-level child elements underneath <textMD>. The sorting is alphabetical.

byte_order
Usage: Byte order, primarily useful for cases where it is not clear just by specifying an IANA character set. Uses enumerated values of 'big', 'little', and 'middle' endian.
Attributes: none.
Contains: none.
Contained by: character_info.
byte_size
Usage: The size of an individual byte, expressed as a number of bits (as an integer). This does not necessarily equal the character size, as a character may occupy more than one byte or a variable number of bytes.
Attributes: none.
Contains: none.
Contained by: character_info.
character_size
Usage: The size of an individual character within the character set as a number of bytes of the size expressed in the byte_size. In the case of variable encodings, such as UTF-8 for Unicode, the character_size element should state "variable" and also identify the specific variable character set encoding in the encoding attribute.
Attributes: encoding.
Contains: none.
Contained by: character_info.
charset
Usage: The character set employed by the text. Controlled vocabulary using IANA names for character sets.
Attributes: none.
Contains: none.
Contained by: character_info.
encoding_agent
Usage: Person who transcribed text from the original medium to another, for example in the case of an oral history transcript or a transcription of a stone rubbing.
Attributes: role.
Contains: none.
Contained by: encoding.
encoding_platform
Usage: Hardware platform on which the document was originally produced, including specific computer type and any imaging equipment used for OCR.
Attributes: linebreak.
Contains: none.
Contained by: encoding.
encoding_software
Usage: Type of software used in producing text, including OCR, word processing, text editor, etc.
Attributes: version.
Contains: none.
Contained by: encoding.
linebreak
Usage: How line breaks are represented in the current file (which may differ from how they were originally encoded). Either carriage return, line feed, or carriage return/line feed.
Attributes: none.
Contains: none.
Contained by: character_info.
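The linebreak element describes the current file, so its value can be derived by inspecting the file's bytes. The sketch below counts the three conventions and reports the most frequent one. The function name detect_linebreak is hypothetical, and the tokens CR, LF, and CR/LF are borrowed from the linebreak attribute's enumerated values on the assumption that the element uses the same tokens.

```python
def detect_linebreak(data: bytes):
    """Return "CR", "LF", or "CR/LF" for the dominant convention, or None."""
    crlf = data.count(b"\r\n")
    # Subtract pairs so only lone carriage returns and lone line feeds remain.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    counts = {"CR/LF": crlf, "CR": cr, "LF": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

A file with mixed conventions yields only the most common one here, so a processingNote may be a better place to record the mixture.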

textMD attributes

These attributes may appear on given elements within textMD. The sorting is alphabetical.

authority
Usage: A string used to record the source of the non-ISO 639-2 language code (e.g., Ethnologue).
Contained by: alt_language.
encoding
Usage: Used to identify a specific variable character set (as a string), such as UTF-8.
Contained by: character_size.
linebreak
Usage: Used to indicate the type of linebreak that a system uses. Enumerated values are CR, LF, or CR/LF.
Contained by: encoding_platform.
QUALITY
Usage: Used to record a quality measure (as a string) for the output of the encoding process (OCR quality, transcription quality, etc.).
Contained by: encoding.
role
Usage: Used to indicate the role of an agent. Enumerated values are OCR, TRANSCRIBER, MARKUP, and EDITOR.
Contained by: encoding_agent.
version
Usage: Used to record the version number (as a string) for a given piece of software, a markup language, or a schema version.
Contained by: encoding_software, markup_basis, or markup_language.
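The attribute listing can be expressed as two lookup tables: which elements may carry each attribute, and which attributes are restricted to enumerated values. The following sketch checks a single attribute against both. The names ALLOWED_ON, ENUMERATED, and attribute_problem are hypothetical, and the linebreak attribute is placed on encoding_platform, following that element's own entry.

```python
# Which elements may carry each attribute, per the listings in this outline.
ALLOWED_ON = {
    "authority": {"alt_language"},
    "encoding": {"character_size"},
    "linebreak": {"encoding_platform"},  # per the encoding_platform entry
    "QUALITY": {"encoding"},
    "role": {"encoding_agent"},
    "version": {"encoding_software", "markup_basis", "markup_language"},
}

# Attributes whose values are enumerated; the others are free strings.
ENUMERATED = {
    "linebreak": {"CR", "LF", "CR/LF"},
    "role": {"OCR", "TRANSCRIBER", "MARKUP", "EDITOR"},
}


def attribute_problem(element, attribute, value):
    """Return a message describing the first problem found, or None."""
    if attribute not in ALLOWED_ON:
        return f"unknown attribute: {attribute}"
    if element not in ALLOWED_ON[attribute]:
        return f"{attribute} is not allowed on {element}"
    allowed_values = ENUMERATED.get(attribute)
    if allowed_values is not None and value not in allowed_values:
        return f"{value!r} is not an enumerated value for {attribute}"
    return None
```

Attribute names are matched case-sensitively, so QUALITY must be written in upper case as listed.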