DISCUSSION PAPER NO. 86

DATE: May 5, 1995
REVISED:
NAME: Mapping the Dublin Core Metadata Elements to USMARC
SOURCE: OCLC/NCSA Metadata Workshop; Library of Congress
SUMMARY: This paper reviews the discussions held at the OCLC/NCSA Metadata Workshop in Dublin, Ohio in March about core data elements for discovery and retrieval of Internet resources ("metadata") by a diverse group of Internet users. The data elements as defined by the participants at the Workshop are listed with possible equivalents in USMARC. Problems in mapping are reviewed and options for resolutions are suggested.
KEYWORDS: OCLC/NCSA Metadata Workshop; Dublin core data elements; Internet resources
RELATED: DP87 (June 1995); DP88 (June 1995); 94-9 (June 1994)
STATUS/COMMENTS:
5/5/95 - Forwarded to USMARC Advisory Group for discussion at the June 1995 MARBI meetings.
6/26/95 - Results of USMARC Advisory Group discussion - Participants were interested in this effort and wanted to know how it would be implemented. The discussion paper will be reposted on the USMARC list, and the OCLC/NCSA Metadata Workshop Report will also be posted to continue discussion. Specific comments were as follows:
1) Date was unclear. It either should be generalized or more specifically defined.
2) It would be preferable to merge Author and OtherAgent and make them one element.
3) Attention needs to be given to the question of version; there was no specific element for it.
4) Questions arose as to why Coverage was included, since it is not generally applicable. In addition the term "Coverage" is vague and another term might be reconsidered.
5) Abstract would be a useful element; it was explained that this data could be under Subject with a scheme subelement=abstract.
6) It was suggested that 040$e be used to identify the origin of the metadata as outside the traditional cataloging record. Another option is in Leader/17 (Encoding level) as partial or preliminary. This needs to be considered further.
7) ObjectType: needs further consideration. Much of this is in 516.
DISCUSSION PAPER NO. 86:  Mapping the Dublin core 
metadata elements to USMARC

I.      INTRODUCTION

        The USMARC Advisory Group has discussed the creation of USMARC
records for Internet resources several times and has modified the
USMARC bibliographic format in several ways to accommodate them. 
Field 856 (Electronic Location and Access) was first defined in
January 1993 with Proposal No. 93-4 to provide location and access
information for electronic resources, and Internet resources in
particular, as a result of the OCLC Internet Resources Project. 
The field has been modified and enhanced in several proposals since
its initial approval.   Most recently, Proposal No. 94-9 (Changes
to the USMARC Bibliographic Format to Accommodate Online Systems
and Services) was discussed in June 1994 to make further changes in
the bibliographic format to allow for the creation of records for
online systems and services.  That paper included a list of data
elements needed for the description of these types of resources and
their mapping to USMARC fields.

        The OCLC/NCSA Metadata Workshop, held in Dublin, Ohio, March
1-3, 1995, was organized by OCLC and the National Center for
Supercomputing Applications (NCSA) to address the problem of
providing metadata for a larger proportion of network-accessible
materials.  ("Metadata" is defined as data about data; it is
roughly equivalent to a bibliographic description.)  The original
intent was to recognize various "stakeholder" communities with an
interest in the search and retrieval of Internet resources, to
understand the uses descriptive metadata would serve for these
communities, and to achieve if possible some consensus on a limited
data element set for identifying these resources. Workshop
participants included librarians and archivists, researchers,
computer and information scientists, software developers,
publishers, and members of Internet Engineering Task Force (IETF)
working groups.  Within these constituencies there was tremendous
diversity of approach.  Some participants were concerned with
electronic data resources in general while others focused on
particular types of materials, such as humanities texts or
geospatial metadata.   Some were interested in the network services
and protocols that would make use of the metadata, while others
took the point of view of the author, publisher or end-user.  

        The one thing that united all participants was a belief that
nearly any standard metadata would be better than none, since
currently there is little agreement and no standardization. 
Nonetheless, early in the course of the workshop it became evident
that no single data element set, whether limited or unlimited, would
satisfy the widely divergent and highly specific needs of the
various stakeholders.  The emphasis therefore shifted to something
that was perceived as both useful and doable: the definition of a
simple data element set that could be used by information providers
to describe their own resources.  The goal was to draft a single
sheet of instructions that an author or publisher mounting a
document on a network server would be able to follow without
excessive effort or additional knowledge.

        Such a data element set, if it could become an official or de
facto standard, would have at least four different uses.  It would
encourage authors and publishers to provide metadata simultaneously
with their data.  It would allow the developers of authoring tools
for network publishing to include templates for this information
directly in their software, making it even easier for the
information providers to supply it.  The metadata created by the
information providers would serve as a basis for more detailed
cataloging or description when warranted by specific communities. 
And it would ensure a common core set of elements that could be
understood across communities, even if more specific information
was required within a particular interest group.  

        In order to make the task more manageable, some limitations on
scope were imposed.  First, the universe of materials to be
described was limited to document-like objects, or DLOs.  The
universe of document-like objects itself was left undefined, but
intuitively this would seem to include certain things like texts
and digitized photographs, and to exclude others like human beings
and computer services.  Second, the data elements themselves were
limited to those supporting discovery and retrieval of the DLO;
that is, to ascertaining that an object exists and obtaining a copy
of it.  Enough description should be included to allow the searcher
to confirm that the DLO in question is actually the desired object,
but not necessarily all of the information one would need to
support other valid purposes like security, authentication,
purchase or use. 

        Because it was agreed that a fairly short list of data
elements would be the most useful and the simplest for naive users,
the concept of extensibility was established.  The metadata element
set concentrated on describing intrinsic properties of the
resource.  Extrinsic data, such as cost and access limitations,
were considered outside the scope of the core set.  The extension
mechanism would allow for the base set to be extended for a variety
of purposes.  Specific user communities may have additional data
elements that are of particular importance.  Local additions are
accommodated by allowing any elements to be added to the record for
a resource.  A particular user community may establish a list of
additional elements that may be incorporated for specialized
purposes.  A scheme sub-element is defined for some of the elements
in the core set and may also be used for the extensible sets.  This
allows for the specification of established schemes or sets of
rules that govern the syntax or semantics of an element.  (For
instance, the URL scheme might be specified for electronic location
information; the various subject thesauri might be specified for
subjects.)

        The metadata element set that emerged is documented in a
position paper which is available through OCLC's World Wide Web
server at the following URL:
<URL:http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html>
It includes thirteen elements described briefly with examples of
their use.  These can be grouped roughly into three categories: 
        -      access points (Title, Subject, Author, OtherAgent,
               Identifier)
        -      information to facilitate identification (Publisher,
               Date, ObjectType, Language, Form, Coverage)
        -      information to relate this object to other objects
               (Relation, Source)
All elements are optional and repeatable, since the participants
did not feel they could predict every specialized use for the
metadata.  Some elements have additional subelements defined.
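
The properties described above (every element optional and
repeatable, an optional SCHEME sub-element, and local extension
elements) can be sketched as a small data structure.  This is only
an illustration in Python.  The class names Element and Record and
the extension element "Cost" are hypothetical, and the Workshop
deliberately defined no encoding syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The thirteen core element names from the Workshop report.
CORE_ELEMENTS = frozenset({
    "Title", "Subject", "Author", "OtherAgent", "Identifier",
    "Publisher", "Date", "ObjectType", "Language", "Form", "Coverage",
    "Relation", "Source",
})


@dataclass
class Element:
    name: str
    value: str
    scheme: Optional[str] = None  # e.g. "LCSH" or "URL"; always optional

    @property
    def is_core(self) -> bool:
        return self.name in CORE_ELEMENTS


@dataclass
class Record:
    elements: List[Element] = field(default_factory=list)

    def add(self, name: str, value: str, scheme: Optional[str] = None) -> None:
        # No element is required; any element may repeat, and names
        # outside the core set are accepted as local extensions.
        self.elements.append(Element(name, value, scheme))

    def get(self, name: str) -> List[Element]:
        return [e for e in self.elements if e.name == name]


record = Record()
record.add("Title", "Moby Dick: an electronic version")
record.add("Subject", "Dogs")
record.add("Subject", "Internet (Computer network)", scheme="LCSH")
record.add("Cost", "free")  # hypothetical local extension element
```

A record with no Author is still valid under this model, and the
repeated Subject values are simply returned as a list.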

        The Dublin Core metadata element set is a core set in the
sense that it is a small number of elements, judged to have general
applicability, that will be universally understood if the standard
is followed.   It is not a core data element set in the sense of
being a minimum number of required elements.  The assumption is
that while the information provider is encouraged to supply all of
these elements, information that is not applicable or not readily
available can be omitted.  It is also not a core data element set
in the sense of being the minimum number of elements adequate to
describe an object.  As mentioned above, the extensibility
mechanism allows for additional data to support other purposes. Any
implementation will require an extensibility mechanism to include
other elements, either of local significance or pointers to other
established element sets (MARC, GILS, TEI, etc.).

        Perhaps the most important thing to note about the Core
Metadata Element set is that it is syntax-independent; that is, the
meaning and content of the elements are defined and described
with no necessary relation to any particular encoding or transport
syntax.  The intent is that the Core set can be mapped to any
desired syntax (e.g., USMARC or Standard Generalized Markup
Language).  This situation can be compared to a cataloging code
such as AACR2, which identifies standard data elements but does not
define a format in which to record them.

        It is important to consider the different issues raised if a
human being or a machine performs the mapping to MARC.  If a
cataloger uses the metadata as a basis for creating a catalog
record, appropriate decisions can be made on a case-by-case basis. 
If the mapping is done by machine, it becomes more problematic. 
For example, the SCHEME element may be helpful in machine mapping,
but only if the content of a SCHEME field is itself taken from an
authority list.    


II.     THE "DUBLIN CORE"

        Below is the list of core data elements with definitions and
examples where available that were formulated at the Dublin
Metadata Workshop.  A mapping to USMARC fields is indicated
(formulated by the Network Development and MARC Standards Office,
and not at the Workshop).  In some cases questions are posed for
resolution of the problems in mapping.  Note that a mechanism would
have to be in place to convert the data from one transport syntax
to USMARC.

Subject:       words or phrases indicative of the information content.
               If the value comes from a controlled vocabulary, the
               SCHEME sub-element is used to indicate which vocabulary.
               EXAMPLES:       English language -- style -- data processing
                               Dogs
USMARC:        653 (Index Term--Uncontrolled) or 650 (Subject Added
               Entry--Topical Term)
This element has a SCHEME subelement defined to indicate the
vocabulary.  Thus, Library of Congress Subject Heading terms or
other controlled thesaurus terms could be used as data. Field 650
can be used for the subject headings or terms, but there may be
cases of incorrect mapping, such as when a geographic name (which
should be coded as 651) or a personal name (which should be coded
as 600) is used as subject.  The Dublin core element set does not
distinguish these.  The indicator would be set according to the
SCHEME.  This is an example of the content of SCHEME needing to be
authority controlled to be useful for machine mapping.  Field 653
can be used if SCHEME is not present, but is less than optimal for
controlled subject headings, because the implication is that the
headings are uncontrolled.  
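
A machine mapping along these lines might look like the following
Python sketch.  The function name map_subject and the scheme
spellings "LCSH" and "MeSH" are assumptions made for illustration;
the second-indicator values are those defined for field 650.

```python
# Second-indicator values for field 650, keyed by SCHEME content.
# The spellings "LCSH" and "MeSH" are assumed; a real conversion
# needs an agreed authority list of scheme names.
SCHEME_TO_650_IND2 = {"LCSH": "0", "MeSH": "2"}


def map_subject(value, scheme=None):
    """Return (tag, second indicator, subfield $a) for one Subject."""
    ind2 = SCHEME_TO_650_IND2.get(scheme)
    if ind2 is not None:
        return ("650", ind2, value)
    # No SCHEME, or one the table does not recognize: treat the term
    # as uncontrolled rather than guess at a thesaurus.
    return ("653", " ", value)
```

The sketch cannot tell a topical term from a personal or geographic
name, so the 600/651 problem described above remains.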
               
Title:         the title, name, or short description of the object.
               EXAMPLES:       Moby Dick: an electronic version
                               Photograph of the Empire State Building
USMARC:        245 (Title Statement)
This could include subtitles.  Everything can be included in 245$a,
or the conversion would have to attempt to use punctuation for
parsing the data into subfields (245$a for title proper; 245$b for
subtitle).
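
As a rough sketch of punctuation-based parsing, the Python function
below splits a title at its first colon.  Treating the colon as the
title/subtitle separator is an assumption; titles that use a colon
for other reasons would be split incorrectly.  The function name
split_title is hypothetical.

```python
def split_title(title):
    """Split a Title value into 245 subfields at the first colon."""
    proper, sep, rest = title.partition(":")
    subfields = {"a": proper.strip()}
    # Only emit $b when something actually follows the colon.
    if sep and rest.strip():
        subfields["b"] = rest.strip()
    return subfields
```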

Author:        the name of the creator of the content.
               EXAMPLES:       Melville, Herman
                               Mao tse-tung
                               von Neuman Janos
                               von Neuman, John              
USMARC:        100 (Main Entry--Personal Name) or 110 (Main Entry--
               Corporate Name) or 700 (Added Entry--Personal Name) or
               710 (Added Entry--Corporate Name)
Mapping the author brings up several questions.  If using 1XX
fields, the concept of main entry is not entirely applicable, since
main entry is an AACR concept, and there is no assumption that
these materials are being described according to library cataloging
rules. For our purposes, author could be main entry, but the 1XX
fields are not repeatable, so any additional authors would go in
700.  Since all elements are repeatable in the Dublin Core, there
could be more than one "author" (i.e., person responsible for the
general content of the work without a specified role).  It could be
difficult to determine whether the data belongs in Author or
OtherAgent, although OtherAgent would always have a role defined,
so it could be distinguished from Author on that basis.  The
identification of either personal or corporate name causes
difficulty for the mapping, so whichever field is chosen would
result in a certain percentage of incorrect mappings.   However, we
may be able to assume that the majority of "Authors" will be
personal names, and that corporate names will probably have a ROLE
subelement attached.  

The USMARC Advisory Group might consider the definition of a
generic author field, i.e. an author that is undistinguished by
type.  See Discussion Paper No. 88 for a discussion of this issue. 
The problem of an author element also arose when the Network
Development and MARC Standards Office provided a USMARC mapping for
the Government Information Locator Service (GILS).  The GILS
profile used field 710, since it was not desirable to require that
a decision be made on main entry.  However, for that project, the
majority of authors would be government agencies, and under AACR2
would probably not be entered as main entry.  

The name should be given in the natural sort order of the language
being used.
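
One possible treatment of repeated Author elements, given that 1XX
fields are not repeatable, is sketched below in Python.  It is not
a recommendation of this paper: the function name map_authors and
the caller-supplied "corporate" hint are assumptions, since the
Dublin Core element carries no indication of name type.

```python
def map_authors(names, corporate=False):
    """Assign the first Author to 1XX and the rest to 7XX.

    `corporate` is a caller-supplied hint; the Dublin Core element
    itself does not say whether a name is personal or corporate.
    """
    main_tag, added_tag = ("110", "710") if corporate else ("100", "700")
    fields = []
    for position, name in enumerate(names):
        # Only one 1XX field is allowed per record.
        tag = main_tag if position == 0 else added_tag
        fields.append((tag, name))
    return fields
```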
        
Publisher:             the name of the entity responsible for making the
                       object available.
                       EXAMPLES:      Oxford University Press
                                      OCLC
                                      [Privately distributed]
USMARC:                260$b (Name of publisher, distributor, etc.)

OtherAgent:            the name of any other entity responsible for the
                       content of the object; the ROLE sub-element
                       describes the type of responsibility.
                       EXAMPLES:      otherAgent role=illustrator: Maurice
                                              Sendak
                                      otherAgent role=compiler: John Bear
USMARC:                700 or 710 (Added entry--Personal name or Added
                               entry--Corporate name)
The same problem of distinguishing personal and corporate authors
is evident here; see above under Author for discussion of the
issue.  OtherAgent includes a ROLE subelement; this would
correspond to 700 or 710$e (Relator term).  The data may or may not
be inverted; how this will be handled needs to be resolved.  A
proposal could be circulated to the Dublin metadata group to
include two "OtherAgents" so that we can distinguish between
personal and corporate authorship (OtherAgent (Person) and
OtherAgent (Organization)).

Date:                  the date of publication; specifically, the date of
                       the actual object described, not of its content.
USMARC:                260$c (Date of publication, distribution, etc.)
There are many dates defined in the USMARC bibliographic format. 
The only one considered a core element in the Dublin set is the
publication date.  Note that the date is also given in 008/07-10 in
a standardized form (the date in 260$c could include other elements
in addition to date of publication).  The extensibility mechanism
would be needed in many cases, where other types of dates are
particularly important (e.g., date of an original for digitized
texts).
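
The relationship between the free-text date in 260$c and the
standardized date in 008/07-10 can be illustrated with a short
Python sketch.  The function name date1_from_260c is hypothetical,
and the use of "?" for unknown digits is assumed from the "19??"
value shown in Attachment A; the coded form uses "u" for an unknown
digit.

```python
import re


def date1_from_260c(date_text):
    """Derive a four-character 008/07-10 value from a 260$c string.

    Returns None when no year-like pattern is found.
    """
    # A digit followed by three digits or "?" placeholders.
    match = re.search(r"[0-9][0-9?]{3}", date_text)
    if match is None:
        return None
    # Unknown digits written as "?" become "u" in the coded form.
    return match.group().replace("?", "u")
```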

Identifier:            a character string or number used to distinguish
                       this object from other objects; a SCHEME subelement
                       identifies the authority.
                       EXAMPLE:       Identifier (URL): http://www.oclc.org
USMARC:                               010 (LC Control Number)
                                      020 (ISBN)
                                      022 (ISSN)
                                      024 (Other Standard Identifier)
                                      856$u (Uniform Resource Locator)
Since all elements are repeatable and a SCHEME subelement is
defined for Identifier, it can be mapped to various USMARC fields. 
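
A table-driven version of this mapping might look like the Python
sketch below.  The scheme spellings, the function name
map_identifier, and the choice of 024 as the fallback for
unrecognized schemes are assumptions; field 024 would also need an
indicator value for the type of identifier, which is not modeled
here.

```python
# (tag, subfield code) by SCHEME content. The spellings of the
# scheme names are assumed; the Workshop report fixes no such list.
IDENTIFIER_FIELDS = {
    "LCCN": ("010", "a"),
    "ISBN": ("020", "a"),
    "ISSN": ("022", "a"),
    "URL": ("856", "u"),
}


def map_identifier(value, scheme=None):
    """Return (tag, subfield code, value) for one Identifier."""
    key = scheme.upper() if scheme else None
    # Anything unrecognized falls back to Other Standard Identifier.
    tag, code = IDENTIFIER_FIELDS.get(key, ("024", "a"))
    return (tag, code, value)
```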


ObjectType:            conceptual description of the object.
                       EXAMPLES:      book
                                      map
                                      graphic illustration                   
USMARC:                Leader/06 (Type of record)
Specific object types would convert to an equivalent value in
Leader/06.  For instance the object type "book" would convert to
code a for language material; "map" to code e for printed map (or
cartographic material if Proposal No. 95-16 is approved).  When
more than one value is available, as with sound recordings
(nonmusical sound recording and musical sound recording), it may
not be possible to make a distinction, but one unambiguous
conversion will need to be supplied.  In some cases, using
Leader/06 may not be specific enough, since something similar to a
Specific Material Designator (SMD) may be used.  In those cases, a
code in the appropriate 008 character position might be more
appropriate, but how to map these to USMARC is unclear.
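
The conversion described above, including the ambiguous case, could
be sketched in Python as follows.  The table holds only object
types mentioned in this paper; the function name map_object_type is
hypothetical, and the default chosen for "sound recording" is an
arbitrary placeholder, not an agreed conversion.

```python
# Leader/06 codes named in this paper. "online database" follows
# Attachment A, Example 2.
LEADER_06 = {
    "book": "a",             # language material
    "map": "e",              # printed map
    "online database": "m",  # computer file
}

# Object types with more than one candidate code. The first code
# listed is an arbitrary placeholder default.
AMBIGUOUS = {
    "sound recording": ("i", "j"),  # nonmusical, musical
}


def map_object_type(object_type):
    """Return (Leader/06 code or None, True if the choice was ambiguous)."""
    key = object_type.strip().lower()
    if key in LEADER_06:
        return (LEADER_06[key], False)
    if key in AMBIGUOUS:
        return (AMBIGUOUS[key][0], True)
    return (None, False)
```

Returning a flag alongside the code lets a conversion program route
ambiguous or unmapped values to a cataloger for review.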

Form:                  physical, logical, or encoding characteristics.  
                       (Information as to how it got represented in its
                       current form.)
                       EXAMPLES:      TIFF ver. 2.3.4.5.6
                                      SGML / TEI P3-1994
USMARC:                538 (System Details Note)

Relation:              Important known relationship to other objects; the
                       TYPE sub-element describes the nature of the
                       relationship; the SCHEME sub-element identifies the
                       notation used to identify the related object(s).
                       EXAMPLE:       Relation (supersedes)(url):
                                      http://www.oclc.org/cr0.9
USMARC:                772 (Parent Record Entry); 773 (Host Item Entry);
                       775 (Other Edition Entry); 776 (Additional Physical
                       Form Entry); 780 (Preceding Entry); 785 (Succeeding
                       Entry); 787 (Nonspecific Relationship Entry)
The TYPE sub-element indicates the relationship being expressed,
and thus the fields to be used.  The SCHEME sub-element might map
to a URL or a record control number.  See Discussion Paper No. 87
(Addition of Subfield $l in Linking Entry Fields 76X-78X in the
USMARC Bibliographic Format) for a discussion of defining a URL in
linking entry fields.
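
Selecting a linking entry field from the TYPE sub-element might be
sketched in Python as below.  Only "supersedes" and "additional
physical form" appear as TYPE values in this paper; the other
spellings and the function name map_relation are hypothetical.

```python
# TYPE values are hypothetical spellings, except "supersedes" and
# "additional physical form", which appear in this paper's examples.
RELATION_FIELDS = {
    "supersedes": "780",                # related object precedes this one
    "superseded by": "785",             # related object succeeds this one
    "host item": "773",
    "other edition": "775",
    "additional physical form": "776",
}


def map_relation(relation_type):
    """Return the linking entry tag for a Relation TYPE value."""
    # Normalize case and internal whitespace before the lookup.
    key = " ".join(relation_type.lower().split())
    # Unrecognized types fall back to Nonspecific Relationship Entry.
    return RELATION_FIELDS.get(key, "787")
```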

Language:              natural language of the object content; the SCHEME
                       sub-element identifies the controlled vocabulary.
USMARC:                041 (Language code) or 546 (Language Note)
The SCHEME sub-element could identify the USMARC Code List for
Languages if 041 is used.  Alternatively, field 546 could be used
for a textual note.  Language is also given in coded form in
008/35-37.
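
The choice between 041 and 546 could be sketched in Python as
follows.  The two-entry code table is only an excerpt, the scheme
name "USMARC" follows the examples in Attachment A, and the
function name map_language is hypothetical.

```python
# A two-entry excerpt; the codes follow the USMARC Code List for
# Languages, and the scheme name "USMARC" follows Attachment A.
LANGUAGE_CODES = {"english": "eng", "french": "fre"}


def map_language(value, scheme=None):
    """Return (tag, subfield code, content) for one Language value."""
    code = LANGUAGE_CODES.get(value.strip().lower())
    if scheme == "USMARC" and code is not None:
        return ("041", "a", code)
    # Without a recognized scheme and code, keep the text as a note.
    return ("546", "a", value)
```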

Source:                object from which this object was derived; contains
                       a nested object description.
USMARC:                786 (Data Source Entry) or 776 (Additional Physical
                       Form Entry)
Since this element includes a "nested object description", a
linking entry field is appropriate with the separate data elements
of the nested description in defined subfields.  Field 786 was
recently defined for Data Source Entry.  However, the following
elements are not available in the linking entry fields: subject,
object type.

Coverage:              describes the spatial and temporal characteristics
                       of the object and is the key element for supporting
                       spatial or temporal range searching on document-
                       like objects.  Coverage can be modified by the
                       qualifiers "spatial" and "temporal".
USMARC:                Spatial: 034 (Coded Cartographic Mathematical Data)
                       or 255 (Cartographic Mathematical Data)
Whether the data is recorded in a coded or textual form would
determine which USMARC field would be used.  This element was added
to the Dublin Core elements as this paper was being finalized; an
example shows bounding coordinates as spatial data to be recorded
here.  In USMARC field 034 coordinates are recorded in separate
subfields (westernmost, easternmost, etc.); in field 255 they are
recorded in a textual form in subfield $c.  Note that the USMARC
and Content Standards for Digital Geospatial Metadata (CSDGM)
crosswalk uses both fields.
                       Temporal:  045 (Time Period of Content) or 513
                       (Type of Report and Period Covered Note)
Again the data may be recorded in a formatted form as yyyymmddhh in
field 045 or in a textual form in 513$b.  The example in the Dublin
document shows the data as formatted.  Note that the USMARC and
CSDGM crosswalk uses field 045 for this data.  The Government
Information Locator Service (GILS) mapping includes both fields,
depending on the data.
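
For temporal coverage, the choice of field by data form might be
sketched in Python as below.  The function name
choose_temporal_field is hypothetical, and accepting the shortened
forms yyyy, yyyymm and yyyymmdd as "formatted" is an assumption.
The sketch only selects a field; it does not build the field 045
content.

```python
import re

# A four-digit year followed by zero to three two-digit groups:
# yyyy, yyyymm, yyyymmdd or yyyymmddhh.
FORMATTED = re.compile(r"[0-9]{4}(?:[0-9]{2}){0,3}")


def choose_temporal_field(value):
    """Return "045" for formatted data, "513" for textual data."""
    if FORMATTED.fullmatch(value.strip()):
        return "045"
    return "513"
```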


III.    CONCLUSIONS

        A series of messages distributed on the USMARC list in 1994
discussed the need for a specific identification of the record as
a non-traditional library catalog record, indicating that the
description and access points may not conform to expected standards 
(AACR2, ISBD, USMARC).  In this case, the record might conform to
USMARC in structure and tagging, but not precisely to the
definitions associated with the tags and other content designators. 
The Network Development and MARC Standards Office answered that
there were several data elements in USMARC that together could be
used to show this:  Leader/18 set to blank (non-ISBD) or u
(unknown), and a code defined in field 042 (Authentication Code)
for a particular project that would serve to identify the record as
non-standard cataloging (e.g. gils).  However, the USMARC Advisory
Group might consider the definition of a specific code (perhaps in
Leader/18) that would unambiguously indicate the nature of the
record.

        The USMARC Advisory Group might consider creating more generic
fields in MARC to accommodate this type of application that
requires flexibility in definition and use of fields.  It is
possible to map the metadata author or subject elements to MARC
fields but as with some other data, the data elements will not
always be completely accurate representations of data that is
expected in the fields as they are presently defined.  One instance
is with the "author" element.

        The Dublin core metadata elements will probably be revised and
refined as a wider discussion takes place.  There is some thought
to distributing a document summarizing the work in a Request for
Comment (RFC) to the Internet Engineering Task Force.  A drafting
committee has been formed to continue the effort, which includes
representatives from the library community, including MARBI and the
Library of Congress.  Any resultant document will probably include
the following:
        Goals and scope
        Underlying assumptions and philosophy
        List of data element set
        Extensibility framework
        Guidelines for application
        Relationship to other work
        Thesaurus for multidisciplinary vocabulary

        The USMARC Advisory Group will be kept informed of the
progress of this effort.  It is important to settle on the USMARC
mapping early in the process, so that if there are any changes
needed to USMARC to better accommodate the metadata elements, work
can begin.

        Additional information about the Dublin Metadata Workshop is
available at the following URL: 
http://www.oclc.org:5046/conferences/metadata.

        See Attachment A for examples of metadata elements for some
Internet resources.

------------------------------------------------------------------
                                            ATTACHMENT A


Example 1:  Resource description for electronic versions of a print
OCLC Research Report

Element Name           USMARC Field           Content

subject                650
   scheme=LCSH                                Internet (Computer network)
                                              Cataloging of computer files
                                              Information networks
                                              Computer networks
                                              Libraries--Communication systems
                                              Information storage and retrieval
                                              systems

title                  245$a                  Assessing Information on the
                                              Internet: Toward Providing Library
                                              Services for Computer Mediated
                                              Communication

Responsible agent              
                       700?                   
   role=author                                Martin Dillon
   role=author                                Erik Jul
   role=author                                Mark Burge
   role=author                                Carol Hickey
Note that this is a deviation from the core set, in that it does
not distinguish between author and OtherAgent.  This issue is under
discussion.

Publisher              260$b                  OCLC

Date                   260$c                  1994

Identifier             856
   Scheme=URL                                 http://ftp.rsch.oclc.org/pub/
                                              internet_resources_project/report/
                                              internet.ps

Object type:           Leader/06=a
   Scheme=USMARC                              Language material

Form:                  538                    7 postscript files
                                              1 Unix tar file
Note: Field 256 could also be used for this data.

Language:              041
   Scheme=USMARC                              English


Source:                786
                       ?                      Subject: same as above
                       $a                     Responsible agent: same as above
                       $t                     Title: same as above
                       $d                     Date: 1993
                       ?                      Object type: same as above
                       $h                     Form:
                                              Scheme=AACR2; 1 v. (various pagings)
                                              : ill. ; 29 cm.
                       $d                     Publisher: same as above
                       $b                     Edition:  NA
Note that subfield $h was defined with Proposal No. 94-17A.

Other information known about this resource not included in
Dublin Metadata record:

URL, names, and file sizes of 7 postscript files
URL, name, and file size of tar file    


Example 2: Resource description for LC MARVEL

Element Name           USMARC Field           Content

subject                       
   scheme=LCSH         650                    Library of Congress
                                              Library of Congress--Catalogs
(This points out a mapping problem, since Library of Congress should
be coded as 610.)

title                  245                    LC MARVEL: machine-assisted
                                              realization of the virtual
                                              electronic library

Responsible agent
                       700?            
   role=database provider?                    Library of Congress
                              
Publisher              260$b                  Library of Congress

Date                   260$c                  19??

Identifier             856
   scheme=URL                                 gopher://marvel.loc.gov
                                              telnet://marvel.loc.gov

Object type:           Leader/06              online database
This would map to code m for Computer file.  "Online system or
service" is available in 008/26.
  
Form:                  538                    gopher server?
                                              telnet?

Relation:                                     NA

Language:              041                    English

Coverage:              045                    ??

Source:                                       NA (This is an original work.)

Other information known about this resource not included in
Dublin Metadata record:

Contact for assistance:  LC MARVEL Design Team, Library of
Congress, Washington, DC 20541; email:[email protected]       
        
                                
Example 3:  Resource description for TEI tagged electronic text of
The Haunted Hotel by Wilkie Collins

Element Name           USMARC Field           Content

subject                650                    could apply a genre term here 
                                             
title                  245                    The haunted hotel: a mystery of
                                              modern Venice 

Responsible agent  
   role=author         700? 100?              Wilkie Collins
   role=creator        700? 710?
   of electronic text                         University of Virginia Library
                                              Electronic Text Center  
       
Publisher              260$b                  University of Virginia Library

Date                   260$c                  1993

Identifier             856                    ftp:\\etext.lib.virginia.edu\...
   Scheme=URL

Object type:           Leader/06              Language material
or                     008/26=d               electronic text?
  
Form:                  538                    1 ascii file with minimal TEI tagging

Relation:              776$w                  [LCCN of original]
   Type=additional physical                   This is a machine readable version
        form                                  of the item specified in source.
   Scheme=LCCN

Language:              041                    English

Coverage:                                     NA

Source:                786
                       ?                      Subject: same as above
   role=author         $a                     Responsible agent: same as above
                       $t                     Title: same as above
                       $d                     Date: 1992
                       ?                      Object type: same as above
                       $h                     Form:
                                              Scheme=AACR2; 327 p. ; 29 cm.
                       $d                     Publisher: Dover Publications
                       $b                     Edition:  NA

Other information known about this resource not included in
Dublin Metadata record:
Copies of the file are available to UVA faculty, staff, and
students.
Other details about the editorial principles and practices applied
during the encoding of the text.