The Library of Congress >> Especially for Librarians and Archivists >> Standards

MARC Standards



MARC DISCUSSION PAPER NO. 2017-DP01

DATE: January 4, 2017
REVISED:

NAME: Use of Subfields $0 and $1 to Capture Uniform Resource Identifiers (URIs) in the MARC 21 Formats

SOURCE: PCC Task Group on URIs in MARC

SUMMARY: This paper discusses the need to capture URIs in the MARC 21 Formats in a manner that clearly differentiates between URIs that refer to Records describing Things and URIs that refer directly to the Things themselves.

To that end, the paper proposes restricting the use of the $0 to URIs that refer to Records describing Things, and defining the $1 to record URIs directly referring to the Thing.

Note: Standard vocabulary terms from controlled lists, such as MARC lists, are not generally considered Authority ‘records’; however, when those terms are represented as SKOS concepts and assigned actionable/dereferenceable URIs, they do carry with them ‘record’-like data in a particular vocabulary scheme. The latter are referenced in this paper as Authority ‘records’ in conjunction with more traditional Authorities in a record format.

KEYWORDS: Authority record control number or standard number (All formats); Subfield $0 (All formats); Uniform Resource Identifier (All formats); URI (All formats); Real World Object (All formats)

RELATED: 2007-06/1; 2009-DP01/1; 2009-DP06/1; 2010-DP02; 2010-06; 2015-07; 2016-DP04; 2016-DP18; 2016-DP19

STATUS/COMMENTS:
01/04/17 – Made available to the MARC community for discussion.

01/21/17 – Results of MARC Advisory Committee discussion: There was mostly support for the paper returning as a proposal.  Concerns were raised that "$1" is the last remaining subfield code.  The committee agreed that so long as $0 and $1 were not defined in the same field, divergent usage of $0 should not be problematic. The committee's view was that PCC should develop best practice guidance which enables metadata librarians and systems developers to distinguish between Authority URIs, Thing URIs and URLs for web pages. It may be useful if PCC were to conduct a field by field analysis of MARC to establish where the definition of $1 is appropriate.  


Discussion Paper No. 2017-DP01: Use of $0 and $1 to Capture URIs

1. BACKGROUND

In the MARC 21 format, the $u contains URIs that serve as web addresses for documents (commonly called a URL or Uniform Resource Locator). The $0 contains an “Authority record control number or standard number” which may be in the form of a URI. To date, these subfields and URI distinctions (document location vs. control/standard number) have sufficed to meet library needs. As the library community moves into a linked data environment, however, new use cases arise that necessitate the refinement of existing subfield definitions and implementations, and/or the introduction of new subfields. Such evolution is exemplified by the recent refinement of $0 to remove the parenthetical prefix ‘(uri)’ in order to more easily facilitate dereferencing of HTTP format URIs [see 2016-DP18].

Experiments by the PCC Task Group on URIs in MARC and others in converting MARC 21 to linked data suggest that there are major benefits to storing URIs in MARC 21. That said, the Resource Description Framework (RDF), the recommended encoding for linked data, requires more semantic precision than MARC 21 currently provides. This paper argues that the use of different MARC 21 subfields for URIs that refer to different types of entities is an important prerequisite for the conversion to linked data, a proposal that is illustrated with a refinement for the definition of $0 and a new definition of $1.

A note on scope: the Uniform Resource Locator (URL) is another important type of URI, which provides addresses for human-readable websites, documents, or web pages. But since the focus of this paper is linked data designed for machine consumption, document URLs are out of scope. URLs and the use of $u (described above) to capture them are not part of the proposal.

2. DISCUSSION

2.1. URIs and the Semantic Web

According to linked data design principles [COOL URIs, https://www.w3.org/TR/cooluris/], the semantic web infrastructure relies on the unique identification of entities—or, in semantic web terms, ‘Real World Objects’ (RWOs); or, even more colloquially, ‘Things.’ For example, a Person and a MARC 21 Authority record about the person are different RWOs, or Things, and each needs to be uniquely identified with distinct URIs for semantic clarity.

RDF statements about a living person may include lifespan dates or a home address, which would be accessible from a URI that functions somewhat like a Social Security number. But an authority record is fundamentally different because it is an information object that may contain a description of a person, as well as a revision history and other facts about the record itself. Although this difference may seem pedantic, it is important for making precise statements about library resources. When we state in a machine-understandable form that “William Shakespeare is the author of Hamlet,” we want to ensure that the reference is to the person who lived from 1564 to 1616, and not to an authority record or similar document. In short, a person can be an author, but a record cannot.

Unfortunately, this distinction is easily lost when URIs are recorded in MARC using current conventions. For example, a common pattern on the semantic web is to say that an Authority (modeled as skos:Concept) has a focus of the Thing (using the property foaf:focus), as in the example below:

<URI for an Authority for Some Person> <foaf:focus> <URI for Some Person> .

Alternatively, Person can also be linked back to the Authority using a different property (e.g. madsrdf:isIdentifiedByAuthority):

<URI for Some Person> <madsrdf:isIdentifiedByAuthority> <URI for an Authority for Some Person> .

If the distinction between Authorities and RWOs is not made when URIs are added to MARC records, the conversion to RDF produces incorrect relationships. For example, the first statement below ideally translates into RDF as the second statement:

100 1# $a Last name, First, $e author. $0 <Some URI>
<Some Thing> <authoredBy> <Some Person>

If, however, we reference the author’s LCNAF Authority record URI in the MARC 21 100 $0 subfield, as shown in the first statement below, the RDF encoding that results from an automatic conversion process might look like the second statement below, since the URI refers to the record, not the person:

100 1# $a Last name, First, $e author. $0 <Authority record about the Person>
<Some Thing> <authoredBy> <Authority record about the Person>

Instantiated with real data, this pattern takes us back to the introductory discussion, asserting that Hamlet was authored by the Authority record about William Shakespeare.
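
The failure mode can be sketched in a few lines of Python. This is an illustrative sketch only: the `parse_subfields` and `naive_convert` functions, the `ex:authoredBy` property name, and the work URI are assumptions made for the example and are not part of MARC 21 or of any existing converter. The converter simply treats whatever URI it finds in $0 as the object of the authorship statement.

```python
import re

def parse_subfields(field_text):
    """Split a display-form MARC field into (code, value) pairs.

    Assumes subfields are written as '$x value', as in this paper's examples.
    """
    return [(code, value.strip())
            for code, value in re.findall(r"\$(\w)\s*([^$]*)", field_text)]

def naive_convert(work_uri, field_text):
    """Treat every $0 URI as if it identified the author (the error described above)."""
    return [(work_uri, "ex:authoredBy", value)
            for code, value in parse_subfields(field_text)
            if code == "0"]

# Field content taken from the Michelle Obama example used in this paper;
# the work URI is a hypothetical placeholder.
field = ("100 1# $a Obama, Michelle, $d 1964- $e author "
         "$0 http://id.loc.gov/authorities/names/n2008054754")
for triple in naive_convert("http://example.org/work/1", field):
    print(triple)
```

The resulting triple names an authority record as the author of the work, which is exactly the imprecision the paper describes.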

In RDF, if you say that two entities are the same as each other using the common RDF property owl:sameAs, then everything stated about one entity is also true of the other. This can lead to messy data if the two things are not in fact the same. For instance, two authority records from different national authority files describing the same person are not the same resource. Each authority record has unique traits: different dates of creation and/or of modification, different sources of information, different processes asserted on them, etc. Therefore, rather than asserting that the two authority records are owl:sameAs, we want to assert that the focus of each authority record is the same Person, which is identified by the URI for the Person/RWO. URIs that directly identify a Person provide a bridge between different authority records focusing on the same Person.
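
The bridging role of a Thing URI can be illustrated with a small Python sketch. The triples below are plain tuples rather than output from an RDF library, the second authority URI is a hypothetical placeholder, and pairing the LC authority with the VIAF person URI follows the examples used elsewhere in this paper. Instead of declaring the two records identical, the sketch groups authority records by the Thing they focus on.

```python
from collections import defaultdict

FOCUS = "foaf:focus"

# (subject, predicate, object) triples. The example.org URI stands in for an
# authority record from some other national file and is purely hypothetical.
triples = [
    ("http://id.loc.gov/authorities/names/n2008054754", FOCUS,
     "http://viaf.org/viaf/81404344"),
    ("http://example.org/authority/other-file/123", FOCUS,
     "http://viaf.org/viaf/81404344"),
]

def authorities_by_thing(triples):
    """Group authority record URIs by the Thing URI they have as their focus."""
    grouped = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == FOCUS:
            grouped[obj].add(subject)
    return dict(grouped)

for thing, authorities in authorities_by_thing(triples).items():
    print(thing, "is the focus of", sorted(authorities))
```

Each authority record keeps its own creation dates, sources, and revision history, while the shared Person URI connects them.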

The following diagram illustrates the semantic differences in RDF between Records about Things and Things/RWOs.

In the above example, it may appear that the VIAF URIs in upper right and center are identical, but the URI in the top right with the slash at the end refers to a foaf:Document (which serves the role of an authority record), while the center URI without the slash refers to a schema:Person.

2.2. Current Use of $0 URIs in MARC 21 and Conversion to RDF

Libraries have a strong history of creating authority records and controlled lists of terms, and of adding identifiers from international agencies to bibliographic records (e.g., ISNI). The $0 was added to the MARC 21 Bibliographic format to capture the relevant authority record control number or standard number for individual fields. In the last few years, the phrase “standard number” has been interpreted to include URIs. Current practice does not differentiate between the URI for the Record and the URI for the Thing the record is about, so either type or both types of URI may be captured in a single MARC 21 field. Within the MARC 21 context, where semantics do not need to be as exact as in RDF, this may not be a problem because either type of URI may provide access to machine-actionable data. If we try to convert the MARC 21 data to RDF, however, we run into difficulties.

In the following example, the 100 field contains two $0s. The first, from LCNAF, records the URI for the authority record of the author Michelle Obama; the second, from the Virtual International Authority File (VIAF), records a URI referring directly to Michelle Obama as a Person (without the slash):

100 1# $a Obama, Michelle, $d 1964- $e author $0 http://id.loc.gov/authorities/names/n2008054754 $0 http://viaf.org/viaf/81404344

Consider what happens when we try to convert the above MARC 21 field into the following RDF statement:

<SomeWork> <wasAuthoredBy> <Michelle Obama>

It becomes apparent immediately that there is no way for a converter to correctly write the proper statements. Semantically precise RDF output for this example should consist of two statements. The first is simply a more machine-understandable rendition of the above statement:

<SomeWork> <wasAuthoredBy> <http://viaf.org/viaf/81404344> .

The second is an assertion that the Person referenced by the VIAF URI is the focus of an authority record that is accessible on the web:

<http://viaf.org/viaf/81404344> <madsrdf:isIdentifiedByAuthority> <http://id.loc.gov/authorities/names/n2008054754> .

In other words, the URIs stored in $0 are ambiguous because they may refer either to Things, or to records or documents about them. As a result, it will be difficult for an automated conversion process to parse the semantics of the URIs.

It can be argued that converters should dereference each URI present in the MARC 21 record to understand whether the URI is for a Thing or for a Record (or Authority) about a Thing before writing any RDF triples. But this is not feasible for large-scale processes, especially considering the repeated conversions that will take place in the near future as libraries convert MARC 21 records to linked data while maintaining MARC 21 as the database of record. Neither converter tools nor data source targets such as id.loc.gov would be able to handle the added load if the dereferencing of URIs were required each time.

Another argument for omitting the distinction in MARC is that it is not necessary for all use cases; and when it is, automated converters can scan the syntax of the URI itself to identify the type of resource by parsing the URI to find self-identifying tokens such as ‘person’ or ‘authority’. But this is not advisable linked-data practice. For example, the European Molecular Biology Laboratory states that:

“[S]oftware should never try to derive any particular meaning from the URI string itself. It is important that software treats all URIs as opaque so as not to make assumptions about data (e.g. a document's content type), but it can also be helpful to consider whether a URI should also be opaque to humans, especially for Linked Data where correct semantics are important. The most common example of where this can be a problem is the conflation of concepts and their names”
(https://www.ebi.ac.uk/rdf/documentation/good-practice-uris).

Because there are countless patterns a data provider may use to construct URIs, it is considered best practice for computers to interpret all URIs as opaque or restful (http://t-code.pl/blog/2016/02/rest-misconceptions-1/). This strategy is often required anyway because a URI may be truly opaque, giving no hint of the resource type it refers to.
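
A short Python sketch shows why token scanning is unreliable. The `guess_type_from_uri` function and its token list are hypothetical, written only to illustrate the argument; the URIs are taken from the examples in this paper. Per the discussion of the VIAF URIs above, the form with a trailing slash identifies a document and the form without it identifies a person, yet the heuristic cannot tell them apart.

```python
def guess_type_from_uri(uri):
    """Naive heuristic: look for self-identifying tokens in the URI string."""
    lowered = uri.lower()
    if "authorit" in lowered:
        return "authority"
    if "person" in lowered or "rwo" in lowered:
        return "thing"
    return "unknown"

samples = [
    "http://id.loc.gov/authorities/names/n2008054754",  # authority record
    "http://id.loc.gov/rwo/agents/n2008054754",         # real-world object
    "http://viaf.org/viaf/81404344",                    # person (no slash)
    "http://viaf.org/viaf/81404344/",                   # document (slash)
    "http://www.wikidata.org/entity/Q488085",           # concept
]
for uri in samples:
    print(guess_type_from_uri(uri), uri)
```

The heuristic happens to work for the two id.loc.gov patterns but returns "unknown" for every other URI, including the two VIAF forms whose meanings differ.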

Another option would be to link only to Things in MARC 21 and not to Records about Things. But this would mean ignoring most of the linked data resources available to us that have been derived from legacy library metadata records. Perhaps in the future, libraries will privilege minting and linking to Thing URIs over Record URIs, and reconsider their record-based models. In fact, the Library of Congress has recently acknowledged the distinction and is now publishing URIs for Things. For example, the URI http://id.loc.gov/rwo/agents/n2008054754 refers to the ‘RWO’, ‘real-world object’, or ‘person’ named Michelle Obama instead of the SKOS concept representing a heading maintained in a name authority file. But until such changes are fully implemented and widely adopted, we need to be able to link to the vocabularies we use in the models available now. Thus it is important to capture both types of URI in a way that clearly distinguishes them without incurring redundant or expensive processing overhead.

Hence a critical step in the process of preparing MARC 21 for RDF conversion is to designate a subfield that can be used throughout MARC 21 that contains URIs for Things, and is separate from a subfield with a similar distribution that captures URIs for Records or Authorities about the Thing. This paper recommends that $0 be used for Record or Authority URIs, while a newly defined $1 subfield should be used for Thing URIs.

2.3. Proposed strict interpretation of $0 for “Authority record control number or standard number”

The current definition of the $0 subfield is “Authority record control number or standard number.” The historic practice for adding URIs to $0s has focused primarily on URIs for Authority records, which include the modeled RDF data discussed in this paper as well as the human-readable document URLs that are not in its scope. But to make the most of the increasingly sophisticated RDF datasets now being published in the library community, it makes strategic sense to limit the scope of $0 in the MARC 21 formats to store control numbers or standard numbers (including URIs) that refer to Records about Things, as defined above, keeping the definition as close to the traditional library authority files as possible. As a result, URIs appearing in $0 should provide access to strictly machine actionable or parseable data from Authority records, SKOS Concepts, and other Record-like entities. In other words, this paper proposes that the definition of $0 be restricted to exclude traditional document URLs (generally found in $u) and Thing URIs.

2.4. Proposed addition of $1 in parallel with $0

In parallel with the strict interpretation of $0 described above, we propose the use of $1 to hold URIs that refer directly to a Thing or RWO (Person, Place, Thing, Concept, etc.), i.e., the actual Thing that is the focus of a $0 resource. For each of the MARC fields that provision URIs for Records using a $0 subfield, we also propose defining $1 to capture URIs for the corresponding Thing. But $1 and $0 do not have to co-occur; they can appear singly or combined in a MARC 21 field. The $1 is appropriate for this function because it is currently undefined throughout MARC 21, which allows RWO URIs to be added easily anywhere in the MARC 21 formats.
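
As a rough illustration of what the $0/$1 split would allow, the following Python sketch builds the two statements discussed in section 2.2 without dereferencing or inspecting any URI. The function name, the `ex:wasAuthoredBy` property, and the work URI are assumptions made for the example; the subfield values come from the Michelle Obama example in this paper, already split into (code, value) pairs.

```python
def convert_author_field(work_uri, subfields):
    """Build triples from (code, value) subfield pairs of a 100 field.

    $1 values are read as Thing URIs and $0 values as Authority URIs,
    following the proposal; no URI is dereferenced or inspected.
    """
    things = [value for code, value in subfields if code == "1"]
    authorities = [value for code, value in subfields if code == "0"]
    triples = []
    for thing in things:
        triples.append((work_uri, "ex:wasAuthoredBy", thing))
        for authority in authorities:
            triples.append((thing, "madsrdf:isIdentifiedByAuthority", authority))
    return triples

subfields = [
    ("a", "Obama, Michelle,"),
    ("d", "1964-"),
    ("0", "http://id.loc.gov/authorities/names/n2008054754"),
    ("1", "http://viaf.org/viaf/81404344"),
]
# The work URI is a hypothetical placeholder.
for triple in convert_author_field("http://example.org/work/1", subfields):
    print(triple)
```

In this sketch a field with only a $0 yields no authorship statement at all. That is one possible design choice, not a recommendation of the paper; a real converter would need a policy for fields that carry no $1.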

The following diagram illustrates how the semantic differences in RDF between Records and Things/RWOs would map to the $0/$1 proposal:

To summarize, two arguments motivate the distinction between Record and Thing URIs:

Nevertheless, the implementation of separate Thing and Record URIs is still in flux and the arguments that motivate it are not fully settled. The distinction is especially problematic for abstract concepts such as 'kindness.'  Since the concept of 'kindness' is only realized through human language, does it have a real-world referent that is separate from the documents about it?

This is a classic philosophical problem, but one answer is that 'kindness' has a real-world referent if we can talk about it with mutual understanding. Accordingly, 'kindness' is a query that returns a Google Knowledge Card, just like obviously physical and tangible Things such as 'Michelle Obama' and 'Greenwich Village'. Moreover, library classification schemes such as the Dewey Decimal Classification assign unique identifiers that are independent of linguistic realization. And in the December 2016 document issued by the authors of the IFLA-sponsored Library Reference Model ("Explanations of recurring issues from the LRM World-Wide Review"), the authors assert that "The term 'real world' is not restricted to physical things. It includes the world of concepts and ideas, which are quite real even though they are not physical." Finally, Thing URIs for concepts that are technically distinct from Document URIs are already being published. For example, the $1 as we have defined it can be populated with DBpedia and Wikidata URIs for 'kindness':

650 #0 $a Kindness $0 http://id.loc.gov/authorities/subjects/sh85072376 $1 http://dbpedia.org/resource/Kindness $1 http://www.wikidata.org/entity/Q488085

Despite these arguments, however, the Library of Congress has not yet published Thing URIs for concepts, which can be interpreted as evidence that the reality of concepts is probably still a controversial subject that requires further discussion in the library community.

3. EXAMPLES

In each of the examples below the $0s represent what we would consider Authorities, and $1s represent Things (RWOs). Repeated $0s represent different authorities describing the same Thing (RWO). Repeated $1s represent different URIs for the same Thing (RWO).

Note: RDF URIs are intended for machine consumption rather than human consumption; however, to help humans understand what the URIs identify, RDF URIs often dereference to a human-readable web page URL. In the first example below, when a person tries to access the machine-readable RWO URI http://id.loc.gov/rwo/agents/n95004729 from a web browser, the browser is automatically redirected to a web page at http://id.loc.gov/rwo/agents/n95004729.html.
The visual difference between the machine-readable URI and the human-readable URL is often very subtle and easily overlooked. For example, in the DBpedia redirect, the word ‘resource’ in the machine-readable URI is replaced with ‘page’:
http://dbpedia.org/resource/Astronaut
http://dbpedia.org/page/Astronaut
And in Wikidata, the word ‘entity’ is replaced with ‘wiki’:
http://www.wikidata.org/entity/Q488085
https://www.wikidata.org/wiki/Q488085
Therefore, bear in mind that when you try to access the RWO URIs in these examples from a browser, you will automatically be redirected to a web page that is about the RWO URI resource (often akin to an authority record) and is augmented to facilitate human understanding; that is, it generally contains more data than a machine receives when it reads the actual RWO URI.

Authority Format

Pattern: $0 {Authority URI} $1 {Thing URI}

500 1# $a Jemison, Mae, $d 1956- $0 http://id.loc.gov/authorities/names/n95004729 $1 http://id.loc.gov/rwo/agents/n95004729

374 ## $a Astronauts $0 http://id.loc.gov/authorities/subjects/sh85008988 $1 http://dbpedia.org/resource/Astronaut

750 #7 $a Kindness $0 http://id.loc.gov/authorities/subjects/sh85072376 $1 http://dbpedia.org/resource/Kindness $1 http://www.wikidata.org/entity/Q488085 $2 lcsh

Note: Currently there is a policy not to use 7XXs in LCNAF or other MARC subject authority files (e.g. LCSH, FAST, MeSH), but they could prove useful in external Authority files wanting to link to LCSH, FAST, MeSH, etc.

Bibliographic Format

Pattern: $0 {Authority URI} $1 {Thing URI}

100 1# $a Obama, Michelle, $d 1964- $0 http://id.loc.gov/authorities/names/n2008054754 $1 http://viaf.org/viaf/81404344

100 0# $a Santa Claus $0 http://id.loc.gov/authorities/names/no2015039717 $1 http://dbpedia.org/resource/Santa_Claus

700 1# $a Stipe, Michael, $d 1960- $0 http://id.loc.gov/authorities/names/n91125827 $1 http://www.bbc.co.uk/things/3aeaa474-ad77-4eb0-a6ba-69f1af33b7f4#id

600 00 $a Zeus $c (Greek deity) $0 http://id.loc.gov/authorities/names/no2014048635 $1 http://viaf.org/viaf/308237987

650 #0 $a Kindness $0 http://id.loc.gov/authorities/subjects/sh85072376 $1 http://dbpedia.org/resource/Kindness $1 http://www.wikidata.org/entity/Q488085

651 #0 $a Greenwich Village (New York, N.Y.) $0 http://id.loc.gov/authorities/names/n97020733 $1 http://vocab.getty.edu/tgn/7015857

4. BIBFRAME DISCUSSION

The recommendations proposed in this document will facilitate the conversion of MARC 21 records to BIBFRAME. The distinction between Thing and Authority URIs is consistent with the BIBFRAME 2.0 model.

5. QUESTIONS FOR DISCUSSION

5.1. Is there agreement for the need to make a distinction between Things (RWOs) and Authorities in MARC?

5.2. Does the $1 provide the best available subfield for recording URIs for the Things that Authorities and other record-like structures are about? Is there another subfield that is preferable to the $1? Should we instead attempt to make the distinctions outside of MARC, e.g., by dereferencing the URI to find the type or by parsing the syntax of the URI, despite the known scaling issues during conversion and the load that such requests place on data providers?

5.3. Often the $0 and/or the proposed $1 refer to the Thing described in $a and/or other subfields, but there are currently cases where $0 is defined to refer to something related to, but not described in the field (e.g., fields 883/885). Are these divergent uses of the $0 something we want to maintain? If so, how do we document and highlight the distinctions, so implementations consistently follow the specifications?

5.4. What is the best way to communicate to metadata librarians and system developers about how to distinguish among 1) Authority URIs, 2) Thing URIs, and 3) URLs for human-readable Web pages?

The PCC URIs in MARC Task Group has made an early attempt through a draft document, “Formulating and Obtaining URIs: A Guide to Commonly Used Vocabularies and Reference Sources,” to describe URI implementation patterns for various data vocabulary/reference sources. Questions remain as to how to make this document more useful, what a maintenance model would look like, and which vocabularies should be included.

Other tools (e.g. the W3C RDF Validator) need to be identified for cases where a resource comes from a data source not listed in the Formulating and Obtaining URIs document or other similar documents.

5.5. An across-the-board solution implies that every MARC 21 field containing a $0 should also have a $1 subfield. Are there fields where the Thing and Authority distinction is not clearly drawn?

5.6. With a stricter interpretation of the $0 (as a subfield containing only URIs for Records about Things), what strategies would need to be taken to migrate URIs for Things currently in $0s to the newly defined $1?

We don't want the success of this proposal to have a chilling effect on the provision of data in $0 or the wholesale deletion of existing data "because it might be wrong". Tools will need to be created to validate URIs to make sure they are in their proper place.
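
One cautious migration strategy can be sketched in Python as follows. Everything here is hypothetical: the `migrate_uris` function, the `classify` callable, and the small lookup table are assumptions made for illustration, and how URIs would actually be classified (for example, from documentation of each data source) is left open. The sketch moves only URIs known to identify Things, leaves unclassified URIs in $0, and reports them for review instead of deleting them.

```python
def migrate_uris(subfields, classify):
    """Move Thing URIs from $0 to $1; leave everything else untouched.

    `classify` is a hypothetical callable returning "thing", "authority",
    or None when the kind of a URI is not known.
    """
    migrated, unresolved = [], []
    for code, value in subfields:
        # Only HTTP(S) URIs are considered; other control numbers stay in $0.
        if code == "0" and value.startswith(("http://", "https://")):
            kind = classify(value)
            if kind == "thing":
                code = "1"
            elif kind is None:
                unresolved.append(value)  # kept in $0, flagged for review
        migrated.append((code, value))
    return migrated, unresolved

# Hypothetical lookup table of URIs classified in advance.
known = {
    "http://id.loc.gov/authorities/names/n2008054754": "authority",
    "http://viaf.org/viaf/81404344": "thing",
}
before = [
    ("a", "Obama, Michelle,"),
    ("0", "http://id.loc.gov/authorities/names/n2008054754"),
    ("0", "http://viaf.org/viaf/81404344"),
    ("0", "http://example.org/unclassified/1"),  # placeholder, kind unknown
]
after, unresolved = migrate_uris(before, known.get)
print(after)
print("needs review:", unresolved)
```

Because nothing is removed, existing data survives the migration even when its classification is uncertain.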


( 03/17/2017 )