The Library of Congress >> Especially for Librarians and Archivists >> Standards

MARC Standards

HOME >> MARC Development >> Discussion Paper List


MARC DISCUSSION PAPER NO. 2022-DP02

DATE: December 21, 2021
REVISED:

NAME: Enrichment of Web Archive Information in Field 856 in the MARC 21 Formats

SOURCE: ISSN International Centre, Paris, and Finnish National Library

SUMMARY: This paper considers options for adding new subfields to the existing field 856 (Electronic Location and Access) in order to establish a subfield for persistent identifiers (PIDs): ARK, DOI, Handle and URN; also to allow separation of current and past (i.e., functional and dead) URL addresses including valid and confirmed Web archive addresses for the latter. The paper also provides a place for indicating date ranges for relevant archived content. Finally, this paper explains the need for specifying file formats for archived content more precisely. This can be accomplished by making 856 $q repeatable. In this document, the term Uniform Resource Identifier (URI) has been replaced by more precise terms PID, URN, and URL.

KEYWORDS: Field 856 (All formats); Electronic location and access (All formats); Access to online information resources (All formats); Open access information (BD, HD); Persistent identifier (PID) (All formats); Archival Resource Key (ARK) (All formats); Digital Object Identifier (DOI) (All formats); Handle system (All formats); Uniform Resource Names (URNs) (All formats); Uniform Resource Locators (URLs) (All formats); Electronic file format types (All formats); Internet Media Types (MIME Types) (All formats)

RELATED: 2020-DP01; 2020-03; 93-4; 97-1; 99-06; 2019-01; DP 49; DP 54; DP 69; 2018-DP11; Guidelines for the Use of Field 856, Revised August 1999; Guidelines for the Use of Field 856, Revised March 2002

STATUS/COMMENTS:
12/21/21 – Made available to the MARC community for discussion.

01/27/22 – Results of MARC Advisory Committee discussion: The paper's goal of making provisions to accommodate URIs for archiving was generally supported.  Several members noted that information about URLs and URIs should be machine actionable and the paper authors agreed. There was considerable discussion about whether a new field should be created to contain archival URIs. A straw poll vote overwhelmingly favored the creation of a new field (tag 857 was floated as a possibility). A proposal to update subfields $g, $h, and $q in field 856 and a discussion paper concerning the creation of a new field to record electronic archival location and access will be presented at the MAC Annual meeting.


Discussion Paper No. 2022-DP02: Enrichment of Web Archive Information in Field 856

1. BACKGROUND

ISO 3297:2020 allows for the assignment of ISSNs to a broad range of online continuing resources. According to the ISSN Manual, journals' URLs are recorded in field 856. As part of its 2024 Strategy, the ISSN International Centre wishes to implement a persistent identifier (PID) resolution service based on current and past URLs including the archive URLs. The ISSN International Centre promotes long-term preservation for continuing resources by publishing on Keepers Registry the archival status of about 70,000 journals and can retrieve archive URLs managed by the archiving agencies. The ISSN International Centre and the National Library of Finland are currently testing together a URN resolver to address this problem of link instability on the web. The value of persistent identifiers for coordination with a network of digital archives and legal deposit libraries is apparent.

Some libraries may already have alternative tools to manage their URLs and the technical metadata needed for long-term preservation, but others may rely on MARC records to store this metadata. The ISSN Portal collects data from 93 ISSN National Centres and normalizes it to the MARC 21 format. Being able to differentiate several types of URLs in field 856 makes it possible to exchange and use these URLs when no other process is available in a particular system.

The authors working on this discussion paper consulted with the authors of two other papers being prepared in parallel regarding field 856, one from the Deutsche Nationalbibliothek, and another from OCLC. In the discussions, they determined both that there were no significant conflicts among the papers and that there was generally mutual support for each. In spite of the perceived compatibility of the papers, it was decided to submit them separately because each had a distinct focus. Submitting them separately also avoided a single, unwieldy, and overly complex paper.

1.1. Historical Background

The history of field 856 prior to 2020 is described in Discussion Paper No. 2020-DP01.

Changes proposed in that discussion paper led to the modernization of the field, as outlined in MARC Proposal 2020-03. Technically outdated subfields $b, $h, $i, $j, $k, $l, $n, $r and $t were removed. According to the analysis made for the proposal, these subfields had not been actively used.

1.2. Current Definition of Field 856

Field 856 is structured identically in the MARC Bibliographic, Authority, Holdings, Classification, and Community Information formats, although the definition and scope differ slightly from format to format. The Bibliographic field 856 is currently defined as follows:

Currently it is not clear if 856 needs to be repeated if there is a need to specify more than one electronic format type code in 856 $q. Providing two codes in a single occurrence of $q would however have a negative impact on machine readability of the data.

1.3. A Note on the Content Designator History

For reference, the history of the subfields discussed in this paper is described below.

$g - Electronic name - End of range [REDEFINED, 1997]

$g - Uniform Resource Name [OBSOLETE, 2000]
Because subfield $g (Electronic name - End of range) was rarely if ever used, it was redefined as Uniform Resource Name in 1997. It was subsequently made obsolete in favor of recording the URN in subfield $u. This was possible since at that point $g had not been used. URN and other PIDs became popular only much later.

$h – Processor of request [OBSOLETE, 2020]

$i – Instruction [OBSOLETE, 2020]

$j – Bits per second [OBSOLETE, 2020]

$k – Password [OBSOLETE, 2020]

$q - File transfer mode [REDEFINED, 1997]
Subfield $q was defined to contain an indication of whether the file was transferred as binary or ASCII. It was redefined to contain type of electronic format.
As a remnant of that history, the current description of the subfield still contains outdated information about the character encoding of files.

$u - Uniform Resource Identifier [RENAMED, 2000]
Prior to 1999, subfield $u was defined as repeatable. It was changed to not repeatable in favor of repeating the field due to ambiguity in determining when the subfield could be repeatable. Subfield $u was changed back to repeatable and renamed in 2000 to record URNs after subfield $g was made obsolete.

1.4. Requirements for Bibliographic Description

Separation of PIDs and URLs allows libraries to embed in MARC records both PIDs and URLs to which these PIDs should resolve. PID - URL mappings can be extracted from MARC tag 856 and loaded to the mapping tables in PID resolvers. If an organization does not have other means for delivering PID resolution-related information to the organization maintaining the resolver, it can use MARC records for this purpose. This functionality will be valuable, e.g., for members of the ISSN network to maintain the International Centre's URN resolver, but the proposed changes to the 856 field can support all PID systems and all organizations hosting PID resolvers. 

This paper suggests modifications to field 856 in order to distinguish between persistent identifiers and URLs, but also between different types of URLs. This requires separate subfields for PIDs (ARK, DOI, Handle and URN) and the current URL, past URL or URLs, and Web archive URLs. Using different subfields for PIDs and URLs allows harvesting of the PID of a resource and all current and Web archive URLs in which the resource is located. Bibliographic records containing this information in 856s can be harvested, and PID - URL mappings can be loaded into PID resolvers, in which the linking information can be thereafter maintained centrally and by automated means. Any PID resolver can in principle use this linking data in MARC records.
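The harvesting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the subfield assignments proposed in Section 3 ($g for PIDs, $u for current URLs, $i for Web archive URLs), it reads fields in the display notation used in the examples of this paper rather than binary MARC, and it assumes that "$" does not occur inside subfield values. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches "$x value" pairs in display notation, e.g. "$g https://doi.org/...".
# Assumption: "$" never occurs inside a subfield value.
SUBFIELD_RE = re.compile(r"\$([a-z0-9])\s*([^$]*)")


def parse_subfields(field_text):
    """Split a display-form field into a list of (code, value) pairs."""
    return [(code, value.strip()) for code, value in SUBFIELD_RE.findall(field_text)]


def pid_url_mappings(fields):
    """Map every PID in $g to the URLs ($u, $i) recorded in the same 856 field.

    PIDs recorded without any URL are left out of the result.
    """
    mappings = defaultdict(list)
    for field_text in fields:
        subfields = parse_subfields(field_text)
        pids = [value for code, value in subfields if code == "g"]
        urls = [value for code, value in subfields if code in ("u", "i")]
        for pid in pids:
            for url in urls:
                if url not in mappings[pid]:
                    mappings[pid].append(url)
    return dict(mappings)


# Field taken from Example 1 of this paper.
example_fields = [
    "856 40 $g https://doi.org/10.23987/sts.75323 "
    "$u https://sciencetechnologystudies.journal.fi/article/view/75323",
]
mapping_table = pid_url_mappings(example_fields)
```

The resulting dictionary has one entry per PID and could serve as input for a resolver's mapping table; how a given resolver actually ingests such data is outside the scope of this sketch.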

Subfield 856 $u provides for the current and valid URL. However, there is at present no way to specify a URL suffering from link rot (HTTP error 404) or content drift (retrieved document is no longer the one cataloged). Even outdated URLs can be useful; if the resource is not available on the Web, it may have been archived in one or more Web archives, and the old URL is the key with which the resource can be found and retrieved from these archives.

If the content of a Web page has changed completely (e.g., because the owner of the Internet domain is not the same anymore) it is important to specify the date range during which the archived content is valid for the cataloged resource. Different Web archives may have different temporal coverage of the resource, in which case it is useful to specify their date ranges individually. The same applies to the completeness of the archive; some archives may have harvested a Web site more often and more fully than others.

1.5. Requirements for long term digital preservation

There is a need to distinguish persistent identifiers (PIDs) from unstable URLs in MARC records when planning for long term digital preservation. Currently URNs and other PIDs share 856 $u with URLs. This means that two fundamentally different data elements (identifiers and locations) share the same subfield. Although it is usually possible to tell PIDs and URLs apart, parsing the character string in 856 $u in order to determine its nature is an additional burden. For instance, URN parsing would require an up-to-date list of registered URN namespaces. Without such a list it is impossible to separate URNs from URN-like URLs, such as the ones in LinkedIn which use an unregistered urn:li namespace. Providing a separate subfield for PIDs will also permit library systems and other applications to have separate processing and display options for PIDs and URLs.
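To illustrate the parsing burden, the following Python sketch shows what a system has to do today to guess the nature of a value found in 856 $u. Both lookup sets are assumptions made for the illustration: the namespace set is a small subset of the IANA URN namespace registry, and the resolver host list is a hypothetical local configuration. The urn:li value used in the usage note is made up.

```python
from urllib.parse import urlparse

# Illustrative subset only; the authoritative list is the IANA registry
# of URN namespaces and would have to be kept up to date.
REGISTERED_URN_NAMESPACES = {"isbn", "issn", "nbn", "ietf", "uuid"}

# Hypothetical local list of resolver hosts whose HTTP URIs count as PIDs.
PID_RESOLVER_HOSTS = {"doi.org", "hdl.handle.net", "n2t.net", "urn.fi", "urn.issn.org"}


def classify(value):
    """Guess whether a string from 856 $u is a URN, an actionable PID or a plain URL."""
    lowered = value.lower()
    if lowered.startswith("urn:"):
        parts = lowered.split(":", 2)
        namespace = parts[1] if len(parts) > 1 else ""
        if namespace in REGISTERED_URN_NAMESPACES:
            return "urn"
        return "unknown-urn-like"
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        if parsed.netloc.lower() in PID_RESOLVER_HOSTS:
            return "pid"
        return "url"
    return "unknown"
```

With these lists, classify("urn:isbn:978-952-12-4129-1") yields "urn", while a string such as "urn:li:activity:123" is reported as "unknown-urn-like". Every result depends on lists that must be maintained separately, which is the burden a dedicated PID subfield would remove.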

Moreover, when MARC data is migrated to other metadata formats which do separate PIDs from other HTTP URIs, metadata migration will be less challenging if such separation is done in MARC as well.

If the described resource will be preserved for future generations, it is important to describe its electronic format accurately, including the format version specification. Only precise information will enable rendering of the file with appropriate hardware and software. Identification of file formats at risk of becoming unreadable or inaccessible using an automated process requires specification of both the file format and its version. Providing this information in human and machine-readable form may require two or more complementary codes. This can be accomplished by making 856 $q repeatable. Exact file format specification can be provided in technical metadata formats, including textMD or MIX, or in PREMIS, but there are libraries which are not able to use these formats. For them it is important to be able to communicate mandatory preservation metadata elements such as file format specifications in the MARC format.  

2. DISCUSSION

This discussion paper concentrates on how to provide PIDs, "normal" and outdated URLs and Web archive URLs in 856. Other potential extensions to 856 are not within the scope, apart from the need of specifying file format and version accurately, which can be achieved by making 856 $q repeatable.

2.1. PIDs and URLs in 856

PIDs include ARKs, DOIs, Handles and URNs. These are all identifier systems which require a dedicated application (resolver) to provide services based on the Persistent ID, including (but not limited to) providing all the current URLs of the resource or metadata about it.

Appropriate 02X fields should be used for identifiers in order to support duplicate control. In 02X fields, PIDs should not be presented as HTTP URIs, since including the resolver address in the PID complicates duplicate control. For instance, an ISBN can be presented in 020 as 9789515177735, as a URN in 024 (URN:ISBN:978-951-51-7773-5) and as an HTTP URI in 856 (http://urn.fi/URN:ISBN:978-951-51-7773-5). An HTTP URI is not ideal for duplicate control purposes because the resolver URI can change, and the same URN can be resolved elsewhere.

There are two solutions to the problem of short-lived URLs in bibliographic records. One possible choice is to have bibliographic data rely on PIDs. If URLs are stored only in resolvers, link maintenance can be centralized and automated. For instance, if a DOI is provided in a bibliographic record instead of the actual URL, responsibility of link maintenance is delegated from the library to the publisher or another organization maintaining the DOI, and if the resource becomes available in multiple locations, the DOI can resolve to all of them.

Another choice is special treatment of URLs that have become outdated due to link rot or content drift. Such URLs are still important if they can be used for retrieving the cataloged resource from Web archives. Web archive URLs may be generated automatically from the original URLs. For instance, harvested PLOS ONE (https://journals.plos.org/plosone/) is available in the Internet Archive at https://web.archive.org/web/*/https://journals.plos.org/plosone/. If a manual check of the harvested content has been made, and it is certain that the archive link or links are valuable, it should be possible to provide them in a bibliographic record, in a separate 856 subfield. In the present MARC formats, 856 provides just one option, subfield $u, for PIDs and URLs. 856 $u may be repeated if both a PID and a URL are provided, but it is not possible to specify two or more URLs in the same 856. Therefore, if it were necessary to specify the current and past URLs and a Web archive URL, three 856 fields would be needed, and there would be no way to indicate the different roles of these URLs.
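As a minimal sketch of how such Web archive URLs can be derived from an original URL, the Python functions below follow the Internet Archive URL pattern shown in the example above. The function names are hypothetical, and other Web archives may use different URL patterns; a generated link would still need the manual check described above before being recorded.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_overview_url(original_url):
    """Build the Internet Archive overview URL ("*" stands for all captures)."""
    return f"{WAYBACK_PREFIX}*/{original_url}"


def wayback_capture_url(original_url, timestamp):
    """Build a URL pointing at a capture near the given timestamp.

    The timestamp is a digit string such as "2015" or "20150129".
    """
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20150129")
    return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"


overview = wayback_overview_url("https://journals.plos.org/plosone/")
```

Here overview reproduces the archive address quoted in the paragraph above.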

In addition, the usefulness of a Web archive link may be determined by information about the completeness of harvesting. Temporal coverage can be estimated by comparing publication date information of the resource to its harvesting date range.

Harvesting frequencies may vary a lot depending on the priorities Web archives have specified. For instance, major newspapers may be harvested daily.  Completeness of an archive is dependent on the frequency and depth of harvesting by a Web archive. Describing how often or, if the frequency is difficult to determine, how many times the resource has been harvested will help the users to choose the most appropriate archived copy, or to determine whether an archived copy is worth investigating. This information is valuable especially if the archived copy is not freely available and a user must travel to a legal deposit library in order to be able to access it.

We therefore suggest the following changes:

1. Re-establish 856 $g for PIDs (ARKs, DOIs, Handles and URNs).
2. Reserve 856 $u for URLs, and limit it to the current URL of the resource only.
3. Create a new, repeatable 856 subfield $h for non-functioning URLs.
4. Create a new 856 subfield $i for a validated Web archive URL.
5. Create a new 856 subfield $j for the Web archive harvesting date range.
6. Create a new 856 subfield $k for specifying the completeness of the Web archive.

2.2. Internet Media Type and other codes in 856 $q

In order to facilitate digital long-term preservation, the electronic format of the cataloged resource has to be specified accurately. Sometimes it is not enough to specify just the file format. Version information may be essential in digital preservation, since the need to migrate a file may depend solely on the version of the file format (e.g., EPUB 2 versus EPUB 3).

Internet Media Types (https://www.iana.org/assignments/media-types/media-types.xhtml; formerly known as MIME Types) are a good starting point, but they are not always specific enough. For instance, all PDF versions have the same Internet Media Type code: application/pdf. Since normal PDF files cannot be preserved in the long term, but PDF/A files are archivable, it is important to indicate if the file type is PDF/A. Internet Media Type codes must be complemented with another code, such as the more granular PRONOM PUID code (https://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm), which allows digital archivists and machines to accurately determine the hardware and software needed for rendering the document.

856 $q is currently defined as follows:

$q - Electronic format type (NR)
Identification of the electronic format type, which is the data representation of the resource, such as text/HTML, ASCII, Postscript file, executable application, or JPEG image. Electronic format type may be taken from enumerated lists such as registered Internet Media Types (MIME types).

Intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). The electronic format type also determines the file transfer mode, or how data are transferred through a network. (Usually, a text file can be transferred as character data which generally restricts the text to characters in the ASCII (American National Standard Code for Information Interchange (ANSI X3.4)) character set (i.e., the basic Latin alphabet, digits 0-9, a few special characters, and most punctuation marks) and text files with characters outside of the ASCII set, or non-textual data (e.g., computer programs, image data) must be transferred using another binary mode.)

The latter half of the second paragraph ("The electronic format type also determines the file transfer mode…") needs updating. 856 $q description should be changed so that if the electronic format type is specified, the value should be a code from a controlled vocabulary. Stronger prescriptive language is necessary to ensure the data is machine readable and understandable.

Description of the 856 tag in the MARC formats should contain a note saying that the accurate use of 856 $q is recommended for all resources which will be preserved in the long term, except if the electronic format specification is provided in media type specific technical metadata formats (TextMD, MIX, AudioMD, VideoMD) or in PREMIS format.

Note that most libraries may find it difficult to create Open Archival Information System (OAIS) submission information packages with descriptive and technical metadata provided in different metadata formats. It may be necessary to incorporate mandatory preservation metadata elements in MARC 21 Bibliographic Format. If so, file format specification in 856 $q is such an element.

Using the same 856 $q instance for both Media Type and PRONOM PUID may diminish machine readability of the data. Repeating the entire 856 for just one code is not practical. The problem can be easily solved by making 856 $q repeatable.

Since PRONOM PUIDs are machine understandable but not familiar for most library metadata users, it may be necessary to provide slightly redundant information, such as both Internet Media type code and version, even though PUID alone would be sufficient.

Information in 856 $q should support digital preservation by providing information necessary to allow people or machines to make decisions about the usability of the cataloged resource (what hardware and software might be required to display or execute it, for example).

File format specification may also be used for determining the need for migrating the resource to a more modern file format or subsequent version of the same format. Although such decisions will typically be made in dedicated digital preservation systems, library systems may be responsible for providing sufficient technical metadata to preservation systems in submission information packages, alongside the resources themselves.

2.3. Alternative fields that could be used

Web harvesting may be based on a legal mandate such as a legal deposit act. If so, creating an archived version of a resource is a preservation action, and field 583 (Action Note) may be used to provide additional information.

However, if harvesting is not supported by legal mandate, the publisher may request removal of the archived resource. Therefore 583 should be used only if harvesting is supported by legislation or agreements with the publisher. So, for instance, harvesting a Finnish continuing resource to the Web archive of the National Library of Finland is a preservation action, and could be described in 583.

Web archive harvesting date range should always be provided in 856 even if there is a legal mandate for archiving, since it would be confusing to the users if information about Web archival date ranges were available both in 583 and 856 fields, especially if such information is provided within one record.

Field 362 (Dates of publication and/or Sequential Designation) or 363 (Normalized Date and Sequential Designation) should not be used for description of the harvesting date range, since archived content may be incomplete. For instance, the National Library of Finland has had a mandate to harvest the Finnish web only since 2007, and our Web archive does not contain online serial volumes prior to that date.

3. PROPOSED CHANGES

In field 856 (Electronic Location and Access) of the MARC 21 Formats, make the following changes:

3.1. Revise the definition of subfield $q

Revise the definition of $q and change it to repeatable, as follows:

Current Definition:

$q - Electronic format type (NR)
Identification of the electronic format type, which is the data representation of the resource, such as text/HTML, ASCII, Postscript file, executable application, or JPEG image. Electronic format type may be taken from enumerated lists such as registered Internet Media Types (MIME types).

Intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). The electronic format type also determines the file transfer mode, or how data are transferred through a network. (Usually, a text file can be transferred as character data which generally restricts the text to characters in the ASCII (American National Standard Code for Information Interchange (ANSI X3.4)) character set (i.e., the basic Latin alphabet, digits 0-9, a few special characters, and most punctuation marks) and text files with characters outside of the ASCII set, or non-textual data (e.g., computer programs, image data) must be transferred using another binary mode.)

Proposed Revision:

$q - Electronic format type (R)
Identification of the electronic format type, which is the data representation of the resource, such as text/HTML, ASCII, Postscript file, executable application, or JPEG image. Electronic format type should be specified with a code taken from the list of registered Internet Media Types (MIME types). If necessary (e.g., in order to specify a file format version to support digital preservation), PRONOM Unique Identifier (PUID) codes may be used to complement the information provided by the MIME Type.

An up-to-date list of Internet Media Type codes is available at https://www.iana.org/assignments/media-types/media-types.xhtml

A description of the PRONOM PUIDs and a search interface to the code database is available at https://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm

Subfield $q may be repeated if two or more codes from different controlled lists and/or version number are provided.

If neither MIME Type codes nor PUID codes are suitable, a recommended solution is to use the Internet Media Type code of the format and the format version number, or the Internet Media Type code and free text description.
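A system consuming repeated $q subfields would have to tell the different kinds of values apart. The Python sketch below shows one possible way to do that, assuming the two code lists named in the proposed revision. It checks syntax only: the media type pattern follows the restricted-name rules of RFC 6838, and the PUID pattern assumes the "fmt/NNN" and "x-fmt/NNN" forms used by PRONOM. Neither check confirms that a code is actually registered, and the function name is hypothetical.

```python
import re

# type "/" subtype, each a restricted name as defined in RFC 6838.
MEDIA_TYPE_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$"
)

# PRONOM PUIDs such as "fmt/276" or "x-fmt/111".
PUID_RE = re.compile(r"^(x-)?fmt/\d+$")


def classify_format_code(value):
    """Classify one 856 $q value as 'puid', 'media-type' or 'other'."""
    # PUIDs are tested first because "fmt/276" is also a syntactically
    # valid media type string.
    if PUID_RE.match(value):
        return "puid"
    if MEDIA_TYPE_RE.match(value):
        return "media-type"
    return "other"


# $q values taken from Example 6 of this paper.
kinds = [classify_format_code(v) for v in ["application/pdf", "fmt/276"]]
```

A bare version number, as in Example 7, or a free text description falls into the "other" category and cannot be interpreted without the media type code in the preceding $q.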

3.2. Rename subfield $u and revise its definition

Rename subfield $u and revise its definition as follows:

Current Definition:

$u - Uniform Resource Identifier (R)
Uniform Resource Identifier (URI), which provides standard syntax for locating an object using existing Internet protocols. Field 856 is structured to allow for the creation of a URL from the concatenation of other separate 856 subfields. Subfield $u may be used instead of those separate subfields or in addition to them.

Subfield $u may be repeated only if both a URN and a URL or more than one URN are recorded.

Used for automated access to an electronic item using one of the Internet protocols or by resolution of a URN. Subfield $u may be repeated only if both a URN and a URL or more than one URN are recorded. Field 856 is repeated if more than one URL needs to be recorded.

Proposed Revision:

$u - Current Uniform Resource Locator (URL) (R)
Uniform Resource Locator, which provides standard syntax for locating and accessing a resource using existing Internet protocols. Subfield $u may be repeated.

Non-functioning URLs which no longer provide access to the described item (either due to link rot or content drift) should be transferred to 856 $h.

A redirected URL which is no longer valid but facilitates access to the current URL of the resource may be replaced by the current URL. If so, the old URL should be moved to $h. If the resource disappears later, both URLs can be used for Web archive access.

URLs based on shortening services (https://en.wikipedia.org/wiki/URL_shortening) such as https://tinyurl.com should not be used in $u. The real URL is a better option, since the shortened address cannot be used for Web archive access if the document vanishes from the Web. Moreover, URL shortening services may fail, and then the document is no longer available via the shortened URL.  

3.3. Add new subfields

Define new subfields $g, $h, $i, $j, and $k in field 856 as follows:

$g - Persistent identifier (PID) (R)
Persistent identifier (e.g., ARK, DOI, Handle, URN) assigned to the resource and used for automated access and other resolution services by a PID resolver such as Handle.Net. The PID should be provided in 856 $g in HTTP URI format (actionable hyperlink).

Subfield $g may be repeated if the resource has more than one PID. However, if field 856 also contains one or more URLs in $u to which these PIDs are intended to resolve, $g shall not be repeated unless all recorded PIDs provide the same services (that is, resolve to the same URL or URLs).

If a PID is already actionable and the resolver knows all relevant URLs, these URLs should not be provided in 856 $u. PID – URL mappings should be maintained centrally (and automatically) in resolvers, not in bibliographic records.

$h - Non-functioning Uniform Resource Locator (URL) (R)
Specific form of a Uniform Resource Locator (URL), which originally could be used for locating a resource using existing Internet protocols, but is no longer functional, either due to link rot or content drift. The former can be detected programmatically; the latter can only be detected by humans (see e.g. Example 4).

Subfield $h may be repeated, if the resource has more than one non-functioning URL.

If automatic or manual check of the URL in 856 $u has shown that the hyperlink no longer functions as intended, the URL should be moved or copied to 856 $h.

If the URL is no longer up to date and it is redirected to the current URL, the new URL may replace the old one in $u. This makes it possible to provide the old URL in 856 $h in order to support Web archive usage. 
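The move from $u to $h could be automated along the following lines. This Python sketch is an assumption-laden illustration: a field is modeled as a list of (code, value) pairs, and the link check itself is passed in as a function, because only link rot can be tested by software while content drift requires a human decision. The names retire_dead_urls and is_functional are hypothetical.

```python
def retire_dead_urls(subfields, is_functional):
    """Return a copy of the field in which each $u that fails the check becomes $h.

    subfields: list of (code, value) pairs for one 856 field.
    is_functional: callable taking a URL and returning True if it still
    gives access to the described resource.
    """
    return [
        ("h" if code == "u" and not is_functional(value) else code, value)
        for code, value in subfields
    ]


# Data from Example 4 of this paper; the check is simulated with a fixed
# set of known dead links instead of a real HTTP request.
known_dead = {"http://www.iajesm.com"}
field = [
    ("u", "http://www.iajesm.com"),
    ("z", "No longer available online as of October 29, 2021"),
]
updated = retire_dead_urls(field, lambda url: url not in known_dead)
```

In a production setting the check might be an HTTP request combined with a manual review queue; the sketch leaves that choice open.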

$i - Web archive Uniform Resource Locator (URL) (NR)
Uniform Resource Locator (URL) which enables retrieval of a resource from a Web archive using existing Internet protocols. Field 856 should be repeated if more than one Web archive URL is recorded. Archived content may represent the entire (continuing) publication or just part of it, harvesting periods may differ, and access statuses may vary between different Web archives.

$j - Web archive harvesting date range (NR)
Date range, specified according to Extended Date/Time Format (EDTF), during which a continuing resource has been harvested to a Web archive as specified in $i. The start date should always be mentioned; the end date only if harvesting has stopped, or the resource is no longer published, or if the content of the resource has drifted because the ownership of the Internet domain has changed.

Subfield $j should not be repeated even if there are gaps in harvesting and more than one date range needs to be recorded. Multiple date ranges can be provided in a single 856 by separating them with ";".

856 $j is intended primarily for continuing resources.
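A consumer of $j has to split the value into individual ranges. The Python sketch below is one possible reading, not a definitive parser: it accepts the hyphenated notation used in the examples of this paper (such as 2015-01-29- and 2014-2019) as well as the "/" interval notation of EDTF, treats a missing end date or ".." as an open range, and splits multiple ranges on ";". It handles only dates of the form YYYY, YYYY-MM and YYYY-MM-DD, and the function name is hypothetical.

```python
import re

# YYYY, YYYY-MM or YYYY-MM-DD
DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"

# start, a "-" or "/" separator, then an optional end date or "..".
RANGE_RE = re.compile(rf"^({DATE})\s*[-/]\s*({DATE}|\.\.)?$")


def parse_harvest_ranges(value):
    """Parse an 856 $j value into a list of (start, end) pairs.

    end is None when harvesting is still ongoing (open range).
    """
    ranges = []
    for part in value.split(";"):
        match = RANGE_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized date range: {part!r}")
        start, end = match.groups()
        ranges.append((start, None if end in (None, "..") else end))
    return ranges


# $j values taken from Examples 3 and 4 of this paper.
open_range = parse_harvest_ranges("2015-01-29-")
closed_range = parse_harvest_ranges("2014-2019")
```

The hyphenated form is ambiguous for some inputs (a value such as 2014-12 could be a year and month or a range), which is one reason a stricter interval syntax may be preferable.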

$k - Web archive harvesting completeness (NR)
Contains information from the Web archive specified in $i about how often or how many times the resource has been harvested during the date ranges specified in $j. This information allows the library system users to estimate the completeness of the coverage of the resource in the Web archive.

856 $k is intended primarily for continuing resources.

4. EXAMPLES

Example 1: DOI and URL

100 1# $a Lee, F.
245 00 $a Enacting the Pandemic: Analyzing Agency, Opacity, and Power in Algorithmic Assemblages
773 0# $t Science & Technology Studies : journal of the Society for Social Studies of Science. $g Vol. 34 no. 1 (2021), p. 65-90
856 40 $g https://doi.org/10.23987/sts.75323 $u https://sciencetechnologystudies.journal.fi/article/view/75323

Example 2: URN and URL

022 0# $a1819-1819
245 04 $aThe ISSN Portal.
264 #1 $aParis : $bISSN International Centre, $c2005--
856 40 $ghttps://urn.issn.org/urn:issn:1819-1819 $u https://portal.issn.org/resource/ISSN/1819-1819

Example 3: Current URL; certified Web archive URL facilitating access to additional content

022 ## $a1932-6203
245 00 $aPloS one.
260 ## $a San Francisco, CA : $b Public Library of Science
362 1# $a Began with vol. 1(1) (2006).
856 40 $i https://web.archive.org/web/*/https://journals.plos.org/plosone/ $j 2015-01-29- $u https://journals.plos.org/plosone/

Example 4: Non-Functional URL, including Web archive completeness

022 0# $a 2393-8048 $2 20 $l 2393-8048
222 ## $a International advance journal of engineering, science & management
245 ## $a International advance journal of engineering, science & management.
260 ## $a Sri Ganganagar $b Parth Computers
856 40 $h http://www.iajesm.com $i https://web.archive.org/*/http://www.iajesm.com/ $j 2014-2019 $k captured 17 times $z No longer available online as of October 29, 2021

Example 5: Current URL; certified Web archive URL facilitating access to additional content and specification of completeness via harvesting frequency

022 ## $a 1798-1557
245 00 $a HS.fi.
260 ## $a Helsinki : $bSanoma, $c2017-
856 40 $i https://web.archive.org/web/*/www.hs.fi $j2003-12-18- $kweekly $u https://www.hs.fi

Example 6: Specification of PDF/A version A-3b in 856 $q, using Internet Media Type code and PRONOM PUID

020 ## $a 9789521241291
024 7# $a urn:isbn:978-952-12-4129-1 $2 urn
100 1# $a Byanjankar, Ajay
245 10 $a Predicting risk and return in peer-to-peer lending with machine learning
347 ## $a text file $b PDF $c 1.626 MB $2 rdaft
856 40 $g https://urn.fi/URN:ISBN:978-952-12-4129-1 $u https://www.doria.fi/handle/10024/182600 $q application/pdf $q fmt/276

Example 7: Specification of FFV1 version 3 moving image file using the Internet media type code and version number

856 40 $q video/x-ffv $q 3

See https://www.iana.org/assignments/media-types/video/FFV1.

5. BIBFRAME DISCUSSION

The implications of these proposed changes on BIBFRAME will need to be considered together in order to prevent inadvertent data loss and conversion inconsistencies.

6. QUESTIONS FOR DISCUSSION

6.1. In the proposed form, 856 will enable libraries to rely more on archived Web content. Is there a general interest to do this, especially for continuing resources? And what about risks inherent in such a strategy? Although Web archive links themselves are persistent, the unlikely event of a failure of a Web archive will render a very large number of links invalid at once.

6.2. Alternatively, would it be better to create a new MARC tag to contain web archive information, e.g. Web archive name, archival period, etc.?

6.3. Should the beginning and the end of the archival period be provided separately, in different subfields, instead of as a date range?

6.4. Which codes do we need to describe harvesting frequency? Just the "normal" ones (daily, weekly, monthly, etc.) or something more? Can the catalogers estimate and describe the depth of harvesting of a particular Web site, or would it be better to just tell how many times the page has been harvested (since Web archives always provide this information)?

6.5. Are there other code lists than Internet Media Types and PRONOM PUIDs which should be mentioned in 856 $q? And if so, how would such a list or lists complement these two? 

6.6. Would the media type description still be sufficiently machine readable if Internet Media Type code and PRONOM PUID were provided in the same 856 40 $q (e.g. 856 $q application/pdf (fmt/480))?

6.7. Do separate subfields for different types of URIs have the potential for confusion or a higher error rate for manual cataloging?

6.8. Is there a need for discussion on if and when 583 can be used to describe Web harvesting as a preservation event?

6.9. If there is a need to create PID – URL pairs for resolvers from the data in 856 $g and $u, 856 $g should not be repeated unless there is a way to indicate which PID – URL pairs belong together. Order of subfields could be used to achieve this (e.g., 856 $g $u $g $u $u). Is this feasible, or is it better to just require that 856 is repeated for each PID – URL pair?

6.10. Are there other potential PID- or Web archive related issues that need to be considered when modernizing 856? And should all these issues be bundled into a single discussion paper/proposal?
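As background for question 6.9, the order-based pairing of PIDs and URLs could be interpreted as in the following Python sketch. It assumes that each $u belongs to the nearest preceding $g within the same 856 field; this rule is an illustration of the idea, not something the paper prescribes. The subfield values are placeholders and the function name is hypothetical.

```python
def pair_by_order(subfields):
    """Attach each $u to the most recent preceding $g in the same field.

    subfields: list of (code, value) pairs in the order recorded.
    A $u that appears before any $g is left unpaired and ignored.
    """
    pairs = {}
    current_pid = None
    for code, value in subfields:
        if code == "g":
            current_pid = value
            pairs.setdefault(value, [])
        elif code == "u" and current_pid is not None:
            pairs[current_pid].append(value)
    return pairs


# Placeholder values following the pattern 856 $g $u $g $u $u.
field = [
    ("g", "pid-1"), ("u", "url-a"),
    ("g", "pid-2"), ("u", "url-b"), ("u", "url-c"),
]
pairs = pair_by_order(field)
```

Such a rule only works if every system that edits or exchanges the record preserves subfield order, which is part of what the question asks the community to weigh against repeating the field.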


(05/17/2022)