DISCUSSION PAPER NO. 102

DATE: May 1, 1997
REVISED:

NAME: Non-filing characters

SOURCE: USMARC electronic list

SUMMARY: This discussion paper presents problems and solutions for dealing with non-filing characters associated with variable field data in USMARC records.

KEYWORDS: Non-filing characters; Field 245, 2nd indicator (Bibliographic); Field 246 (Bibliographic)

RELATED:

STATUS/COMMENTS:

5/1/97 - Forwarded to USMARC Advisory Group for discussion at the June 1997 MARBI meetings.

6/30/97 - Results of USMARC Advisory Group discussion - There were different preferences for the technique to be used -- subfields, graphic characters, and control characters. Subfields would be easy to implement; graphic characters would be difficult to identify; and control characters are theoretically desirable but have some system drawbacks. There was a preference for two distinct characters, to be used before and after the non-filing part.

Several participants pointed out that the function under question is non-filing, not non-indexing. For example, the English word "the" might not be indexed anywhere in a string, but the characters being identified here are only those that occur at the beginning of a string. There was consensus that the discussion continue at Midwinter.


DISCUSSION PAPER NO. 102: Non-filing characters

1.   BACKGROUND

A technique using one of the indicator positions is currently
provided in USMARC for dealing with non-filing characters that
appear at the beginning of certain variable fields.  Generally, the
handling of non-filing characters by an indicator value works well
in those fields for which it is defined, but such an indicator is
not defined for all fields where initial articles and other non-
filing characters might occur.  The use of both available indicator
positions in many variable fields prevents the extension of this
technique to all fields where it is needed.  Stimulated by a
December 1996 message to the USMARC list from a USMARC user in
Israel, this problem has come to the forefront, particularly with
regard to the wider use of field 246 (Variant Title) in USMARC
following Format Integration.  Titles recorded in field 246, like
those in field 245 (Title Statement), sometimes have initial
articles.  Field 246 does not have a non-filing indicator and both
indicator positions are already defined for other uses.  This
discussion paper presents the problems and issues surrounding the
handling of non-filing characters in MARC records.  It describes
techniques suggested for dealing with initial articles with the
advantages and disadvantages of each.  This paper is intended to
foster discussion and lead to a solution to the problem that
guarantees the least negative impact on USMARC systems and users.


2.   DISCUSSION

This paper deals with non-filing characters that appear at the
beginning of cataloging data in access fields.  The current USMARC
technique for identifying non-filing characters retained in records
involves the use of an indicator position that carries a digit (0
through 9) representing the number of characters to be ignored.  In
the USMARC Format for Bibliographic Data, a non-filing indicator is
defined in the following eleven fields:

     130   Main Entry--Uniform Title
     222   Key Title
     240   Uniform Title
     242   Translation of Title by Cataloging Agency
     243   Collective Uniform Title
     245   Title Statement
     440   Series Statement/Added Entry--Title
     630   Subject Added Entry--Uniform Title
     730   Added Entry--Uniform Title
     740   Added Entry--Uncontrolled Related/Analytical Title
     830   Series Added Entry--Uniform Title
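
The indicator technique described above can be sketched in a few
lines of Python.  This is only an illustration, not part of the
USMARC specification; the function name and its parameters are
hypothetical.

```python
def filing_form(field_data: str, nonfiling_indicator: str) -> str:
    """Return the part of field_data that is filed on.

    nonfiling_indicator is the single indicator character ("0"-"9")
    giving the number of leading characters to ignore.
    """
    if len(nonfiling_indicator) != 1 or nonfiling_indicator not in "0123456789":
        raise ValueError("non-filing indicator must be a single digit 0-9")
    return field_data[int(nonfiling_indicator):]


# The count covers the article and the space that follows it.
print(filing_form("The meaning of life", "4"))  # meaning of life
print(filing_form("Hamlet", "0"))               # Hamlet
```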

A similar indicator was defined for the X00 (Personal Name), X10
(Corporate Name), X11 (Meeting Name), and X30 (Uniform Title)
fields in the USMARC Format for Authority Data.  The indicator was
made obsolete in 1993 for all except the X30 fields.  The change
was made in the USMARC Authority format because the X00, X10, and
X11 fields in the bibliographic format did not have corresponding
non-filing indicators.  For library systems with integrated
authority control, authority format indicators with no
bibliographic equivalents served no practical use.  Since it was
not possible to add the indicator to the X00, X10, and X11 fields
in the Bibliographic format, it was made obsolete in the Authority
format in those fields.

MARC records are created with data elements to support the
processing of the information in a variety of ways.  MARC records
are processed to create printed output products (e.g., catalog
cards, book catalogs, and COM catalogs), and for online
applications.  Online applications center on the indexing of
certain fields to provide access to records using predetermined
search keys.  Search keys provide access through titles, named
persons and corporate bodies, subject terms, classification, and
standard numbers.  It is access points for titles and names that
sometimes include parts of speech or other character strings which
are not always significant for output and retrieval.


3.   INITIAL ARTICLES

The most common non-filing characters in MARC data are initial
definite and indefinite articles, "the" and "a"/"an" in English and
their foreign language counterparts.  Non-filing characters can
include other character strings that are to be ignored in
processing.  Articles play an important role in many languages but
are often dropped or ignored in processes such as filing.

Articles are almost universally ignored in sorting and filing when
they appear at the beginning of a name or title because they tend
to be used intermittently.  Titles and names may be found with or
without an initial article.  For example, the political leader
Anwar al-Sadat is usually listed by a surname that omits the
initial Arabic article "al-".  Likewise, titles that might be
spoken or written with an article (for example: The Meaning of
Life), are almost always listed without the definite article "the".

Not all languages possess parts of speech such as articles (all
Slavic languages except Bulgarian lack articles), or the articles
associated with the first word of a title may be enclitic (for
example, articles in Bulgarian and Romanian are appended to the end
of a word).  For languages with independent, initial articles,
their use can be very important grammatically.  German, for
example, expresses grammatical case through a variety of initial
definite and indefinite articles.  Arabic and Hebrew use initial
definite articles with both nouns and adjectives.

Articles are used less often in English but still play an important
role in grammar.  For example, English speakers only use initial
articles with personal names when applied to inanimate objects (for
example: "The Henry" would be grammatically correct if referring to
a hotel or ship).  In German it is grammatically possible to say
"der Heinrich" (that is "the Henry"), even when referring to a
living person.  It suffices to say that articles are important
enough that many cataloging rules allow them to be included in
bibliographic data in MARC records.


4.   OTHER NON-FILING CHARACTERS

Initial articles are not the only non-filing characters that might
appear at the beginning of cataloging data.  It is common for
special marks to occur at the beginning of access points,
particularly titles.  An opening quotation mark is perhaps the most
common non-filing character to be found at the beginning of titles. 
For some languages, other marks can occur.  For example, in Spanish
the inverted question mark and inverted exclamation mark occur at
the beginning of phrases that also end in the regular question mark
("?") and exclamation point ("!").  Other non-filing characters
found in MARC data include the opening square bracket ("[",
signifying a cataloger-supplied title), the opening parenthesis
("("), as well as initial periods ("...") or dashes ("--") used to
replace them.  Alphanumeric characters that are not articles can
also be ignored in some cases.  In MeSH (Medical Subject Headings),
for example, names of chemical compounds, when including prefixed
letters or numbers, are sorted and filed ignoring the prefixes.  If
not handled in some way, these characters can affect the proper
placement of names, titles, and descriptors in alphabetic indexes.

Examples of Other Non-filing Characters

...and then I said  [Book title]
¡Baile comigo!  [Song title]
¿Quien es quien en el Peru?  [Book title]
16,16-Dimethylprostaglandin E2  [Subject descriptor for a chemical
     compound]
N,N-Dimethyltryptamine  [Subject descriptor for a chemical
     compound]


5.   HANDLING OF NON-FILING CHARACTERS

Use of Indicators

The current USMARC solution for dealing with non-filing characters
has been described briefly already.  It makes use of an indicator
position to signal the number of initial characters in a field to
be ignored in processing.  This technique has these advantages:

-    Creator of the data can decide on the number of characters to
     be skipped in filing.

-    The data itself in the first indexed subfield is not polluted
     with extraneous graphic or control characters.

Disadvantages to this solution include:

-    Some variable fields do not have an available position to use
     for a non-filing indicator.  This is why field 246, which has
     no available indicator position, is used so often as an example
     of the problem.

-    As currently defined, nine (9) is the largest value possible in
     any of the non-filing indicators. (Note: higher values could be
     coded if alphabetic characters were allowed as indicator
     values, and these were assigned decimal values).

-    Use of an indicator cannot identify characters to be ignored in
     other parts of a field, for example, at the end of words.
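
The note above about alphabetic indicator values does not say how
letters would be assigned decimal values.  The sketch below assumes,
purely for illustration, that "a" through "z" stand for 10 through
35.  The mapping, the function name, and the sample heading are all
hypothetical.

```python
import string

# Assumed mapping: "0"-"9" keep their face value, "a"-"z" mean 10-35.
INDICATOR_VALUES = {
    ch: value
    for value, ch in enumerate(string.digits + string.ascii_lowercase)
}


def nonfiling_count(indicator: str) -> int:
    """Decode an indicator character into a count of characters to skip."""
    if indicator not in INDICATOR_VALUES:
        raise ValueError(f"unsupported indicator value: {indicator!r}")
    return INDICATOR_VALUES[indicator]


# A 13-character prefix cannot be expressed with the digits 0-9 alone.
heading = "1,2,5,6,9,10-Hexabromocyclododecane"
print(heading[nonfiling_count("d"):])  # Hexabromocyclododecane
```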


Use of Graphic Characters as Delimiters

Other solutions have been suggested or used in other formats.  For
example, it has been reported that the unused graphic character
SPACING UNDERSCORE ("_") is used in some German systems to set off
non-filing characters wherever they appear.  The advantages to this
are:

-    The character is available in most computer systems, and

-    All non-filing characters can be easily delimited.

Disadvantages include:

-    Regular cataloging data is polluted with additional graphic
     characters which must be omitted in printed output and
     displays.

-    The SPACING UNDERSCORE character is now found as part of
     legitimate cataloging data, particularly in Internet addresses
     and file names, which makes its exclusive use as a non-filing
     character delimiter questionable.
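
A minimal sketch of the graphic-character approach follows.  It
assumes that a pair of underscores encloses each non-filing string;
the exact convention used in the German systems is not documented
here, and the function names are hypothetical.  The last example
shows how legitimate underscores in data are misread.

```python
import re

# Assumed convention: _..._ encloses a non-filing string.
DELIMITED = re.compile(r"_([^_]*)_")


def underscore_filing_form(data: str) -> str:
    """Drop the delimited strings together with their delimiters."""
    return DELIMITED.sub("", data)


def underscore_display_form(data: str) -> str:
    """Drop only the delimiters, keeping the enclosed text."""
    return DELIMITED.sub(r"\1", data)


title = "_The _meaning of life"
print(underscore_filing_form(title))   # meaning of life
print(underscore_display_form(title))  # The meaning of life

# Underscores that are real data are misinterpreted as delimiters.
print(underscore_filing_form("my_file_name.txt"))  # myname.txt
```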


Use of Special Control Characters

A pair of special control characters, such as the NON-SORTING
CHARACTER(S), BEGIN and NON-SORTING CHARACTER(S), END characters
defined in ISO 6630 (Bibliographic control set) could be used as
delimiters for strings of non-filing characters.  Use of such
characters has the following advantages:

-    As specially-defined control characters, they are unique and do
     not conflict with graphic characters that might occur in data.

-    The control characters can be used anywhere, initially,
     medially, and finally.  This allows the demarcation of initial
     articles in subfields, which could be useful in subfield $t of
     the USMARC linking entry (76X-78X) fields and elsewhere.


Disadvantages of the control character solution include:

-    Special characters require system implementation that affects
     hardware, software, and existing data.

-    Cataloging data include special control characters which must
     be handled in printed output and displays.

-    They are not mappable to universal character set encodings.
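
The control-character approach can be sketched as follows.  The code
points 0x88 and 0x89 for the BEGIN and END characters are assumed
here and should be verified against ISO 6630; the function names are
hypothetical.

```python
import re

# Assumed code points for NON-SORTING CHARACTER(S), BEGIN and END.
NSB = "\x88"
NSE = "\x89"
NON_SORTING = re.compile(re.escape(NSB) + ".*?" + re.escape(NSE))


def control_sort_form(data: str) -> str:
    """Remove every delimited non-sorting string, wherever it occurs."""
    return NON_SORTING.sub("", data)


def control_print_form(data: str) -> str:
    """Remove only the control characters for printing and display."""
    return data.replace(NSB, "").replace(NSE, "")


title = NSB + "Der " + NSE + "Zauberberg"
print(control_sort_form(title))   # Zauberberg
print(control_print_form(title))  # Der Zauberberg
```

Because the delimiters are paired, the same functions work for
strings marked in the middle or at the end of a field, which the
indicator technique cannot express.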


6.   OTHER SOLUTIONS

System Recognition of Articles

A commonly-suggested solution to dealing with non-filing initial
articles is to program library systems to recognize grammatical
articles automatically.  In theory this solution sounds attractive. 
Machine handling of articles, it is suggested, would not be subject
to human error.  Unfortunately, in practice it is very difficult
for a computer to identify initial articles.  Character strings
such as "the" and "a"/"an", which are certainly English articles,
are also legitimate non-article words in other languages.  For
example, in French, "the" means "tea", "a" means "to", and "an"
means "year".  If these strings occurred initially in a French
title they should be filed upon.  The language coding in USMARC
records is not designed to control computer handling of initial
articles in access fields.  The variety of languages which might be
represented in a single record, both in the description and access
points, makes machine determination of articles based on language
coding impractical.
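
The difficulty can be illustrated with a small sketch.  The article
table below is hypothetical and deliberately incomplete (it ignores
elided forms such as the French "l'"), and the sample title is
invented.  The same string is stripped or kept depending entirely on
which language the system assumes.

```python
# Hypothetical, deliberately tiny article table keyed by language code.
ARTICLES = {
    "eng": ("the", "a", "an"),
    "fre": ("le", "la", "les", "un", "une"),
    "ger": ("der", "die", "das", "ein", "eine"),
}


def strip_initial_article(title: str, language: str) -> str:
    """Drop the first word if it is an article in the given language."""
    first, sep, rest = title.partition(" ")
    if sep and first.lower() in ARTICLES.get(language, ()):
        return rest
    return title


# "An" is an English article but a French noun ("year").
print(strip_initial_article("An 2000", "eng"))  # 2000
print(strip_initial_article("An 2000", "fre"))  # An 2000
```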

Many systems already deal programmatically with non-filing
characters to a limited degree.  Special marks (for example,
quotation marks) are already ignored in sorting, indexing, and
retrieval.  The case (i.e., upper or lower) of alphabetic
characters is often ignored as well, as are associated diacritical
marks, which in some cases are counted as "non-filing characters".
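
A minimal sketch of this kind of programmatic normalization, using
Python's standard unicodedata module, is given below.  Which
characters a real system ignores is a local decision; the rules
here (drop all punctuation and combining marks, fold case) are
assumptions chosen for illustration.

```python
import unicodedata


def normalize_for_sorting(text: str) -> str:
    """Fold case and drop punctuation and diacritical marks."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        # "M*" covers combining diacritics, "P*" covers punctuation
        # such as quotation marks, brackets, periods, and dashes.
        if category.startswith(("M", "P")):
            continue
        kept.append(ch)
    # Collapse any runs of spaces left behind by removed characters.
    return " ".join("".join(kept).casefold().split())


print(normalize_for_sorting('"¿Quién es quién en el Perú?"'))
# quien es quien en el peru
print(normalize_for_sorting("...and then I said"))
# and then i said
```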


Subfield for Articles

It has also been suggested that a special subfield could be defined
in USMARC for non-filing characters.  Subfield $i is often
recommended. Unfortunately, subfield $i is already defined for
other data in several variable fields.  Subfield $1 (one) is the
only subfield currently undefined in all variable fields but there 
would likely be considerable opposition to using a control subfield
at this level for initial articles.  The implementation of a
subfield-level data element for non-filing characters would also
have the disadvantage of separating pieces of titles and names that
belong together for other processing.


Omission of Articles

One of the most widely used solutions for dealing with initial
articles has been to omit them from cataloging data altogether. 
This solution has been particularly widespread in the treatment of
initial articles associated with personal names, so much so that
the non-filing indicator for personal, corporate, and meeting names
was made obsolete in USMARC in 1993.  Rules for the inclusion or
omission of initial articles in other access points vary but have
tended to favor omission in recent years.  In field 240 (Uniform
Title) and field 740 (Added Entry--Uncontrolled Title), although a
non-filing indicator is defined, generally it is not used (i.e., it
is always set to value 0) and any initial articles are omitted. 
This solution has been suggested for field 246 as well.

Omitting initial articles simply because they cannot otherwise be
handled is not fully acceptable to some USMARC users.  European and
Middle Eastern libraries have been
particularly vocal in their call for a generalizable technique,
like the UNIMARC control character technique, for indicating
non-filing characters.  Their chief argument has been that the
simple omission of articles corrupts the cataloging data
grammatically and yields title strings that the public finds
unacceptable.

7.   QUESTIONS

Whatever the ultimate solution to this problem, if it involves a
change to the USMARC formats themselves, current USMARC-based
systems, users, and data would have to accommodate the change. 
Many worry that the impact would be considerable.  Some of the
questions that have been raised are:

-    Would a new technique for dealing with initial articles replace
     or supplement the existing techniques in USMARC?

-    Would existing USMARC records have to be modified to reflect
     the new technique?  (If not, how would new systems deal with
     old records, and vice versa?)

-    How important is dealing with this problem, considering the
     increasing international use of USMARC?


Library of Congress
Library of Congress Help Desk (09/03/98)