PROPOSAL NO: 98-18

DATE: May 27, 1998
REVISED:

NAME: Unicode Identification and Encoding in USMARC records

SOURCE: MARBI Unicode Encoding and Recognition Technical Issues Task Force

SUMMARY: This proposal specifies changes to enable the encoding of records using the 16-bit Universal Character Set (UCS-2) (ISO 10646) and Unicode. It recommends initially the use of UTF-8, with further investigation for future specification of UTF-16 encoding.

KEYWORDS: Unicode; Universal Character Set; ISO 10646; Character sets

RELATED: 97-10 (June 1997)

STATUS/COMMENTS:

5/27/98 - Forwarded to the USMARC Advisory Group for discussion at the June 1998 MARBI meetings.

6/28/98 - Results of USMARC Advisory Group discussion - Approved, with Option 1 selected in proposal section 3.2 (use Leader 09 for indicating record is Unicode). The Task Force was asked to continue work in order to treat the questions and issues raised in Appendix A. Items 2 and 3 in Appendix A are to be given highest priority (and it was noted that item 6 was actually no longer needed as it is treated in 98-18).

7/29/98 - Results of LC/NLC review - Agreed with the MARBI decisions.

PROPOSAL NO. 98-18: Unicode identification and encoding

1 BACKGROUND

1.1 Task Force charge

MARBI established a Unicode Encoding and Recognition Technical Issues Task Force in November 1997 and charged it to develop specifications for:

This work complements that of two other MARBI Task Forces: the Character Set Subcommittee that provided mappings of all USMARC character sets (except CJK) to Unicode and the CJK Character Mapping Task Force which is currently providing a mapping for USMARC CJK characters. This proposal is the result of the Encoding and Recognition Task Force work.

Formation of the Task Force resulted from the existence of unresolved questions about certain recommendations of Proposal 97-10 which finalized the non-CJK mappings. Two of these recommendations concerned definition of new subfields for field 066. The other specified an encoding scheme for identifying that Unicode was used in a record. The recommendations were not sufficient for specifying the use of Unicode and were not approved. Rather this Task Force was organized to analyze the requirements and propose workable solutions.

One of the principal motivations for adopting a UCS encoding is to facilitate expansion of the USMARC character repertoire, and once this encoding has been specified there will be considerable pressure to use additional characters in USMARC records. The specifics of such expansion lie outside the charge to this Task Force, but restriction of characters to those listed in the USMARC to UCS mapping is viewed as operative only until such time as proposals concerning expansion are submitted by interested parties and adopted by MARBI.

1.2 Concepts and terminology

A short review of terminology will facilitate understanding of the issues the Task Force has considered. What follows is necessarily simplified. Appendices A and C of The Unicode Standard, Version 2.0 contain a more comprehensive and rigorous exposition of these concepts.

This report follows the terminology practice of international standards that refer to a sequence of eight bits as an octet rather than a byte, even though in today's world a byte is almost always an octet.

Unicode (trademark) is the coded character set now defined by The Unicode Standard, Version 2.0, but it should be understood to include later versions as they result from the process of maintaining the exact correspondence of character repertoire and code point assignments with ISO/IEC 10646. Though they are identical in those respects, Unicode differs from ISO/IEC 10646 in defining character semantics and properties to facilitate interoperability between conformant applications. These definitions, incorporated in The Unicode Standard, Version 2.0, are an integral part of the concept of Unicode. When this report speaks of using Unicode in USMARC, it means that those characters which have been approved through MARBI may be used.

ISO/IEC 10646, the Universal Character Set (UCS) standard, defines two forms of encoding. The more capacious requires 31 bits per character, permitting the definition of a very large repertoire. Because a 31-bit character occupies four octets, this form is known as UCS-4. The other form requires 16 bits (two octets) per character; hence it is called UCS-2. The 65,536 values that can be represented in UCS-2 are enough to encompass most of the characters used in contemporary languages, and UCS code values for them have been assigned in that range. The set of possible UCS-2 values therefore has another name, the Basic Multilingual Plane (BMP) of the UCS.

No UCS character assignments outside the BMP have been made. The character repertoires and code value assignments in the BMP and in Unicode are the same. In this sense Unicode and the UCS BMP are effectively synonymous. Unicode includes a stratagem, "surrogates," that can provide access to roughly a million non-BMP characters that may be assigned in the future. Such assignments are likely as coverage of ideographs becomes more comprehensive.

Another ISO/IEC 10646 concept is the UCS Transformation Format (UTF). UTFs are alternative representations of UCS-4 and UCS-2. They are designed to enable communication protocols to transfer UCS data without confusion or loss. A feature of UTF representation is that not all characters require the same number of bits.

UTF-16 expresses a UCS character as a sequence of one or more 16-bit units. This format is a "transformation" only for characters that cannot be represented in UCS-2. For those that can, the UCS-2 and UTF-16 encodings are identical.
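The relationship between UCS-2 and UTF-16 can be sketched in Python as follows. This is an illustration only: the helper name utf16_units is hypothetical, and the code value U+10000 is used purely as a numeric example of a value beyond the BMP, not as a character authorized for USMARC.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit units of the big-endian UTF-16 encoding of ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A BMP character is a single 16-bit unit equal to its UCS-2 code value.
print([hex(u) for u in utf16_units("\u4e2d")])

# A code value beyond the BMP is expressed as two 16-bit units (surrogates).
print([hex(u) for u in utf16_units("\U00010000")])
```

For BMP characters the list always has one element, which is why the two encodings coincide there.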

UTF-8 provides for safe transmission in 8-bit environments, such as the Internet. It expresses a character as one or more octets. ASCII characters require a single octet. Other BMP characters require two or three. An ASCII character held in a single octet and its UTF-8 encoded value are identical.
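As a brief illustration of these octet counts, the following Python sketch encodes a few characters with the language's built-in UTF-8 codec. The sample characters are chosen here for illustration and are not taken from the proposal; the helper name utf8_octets is hypothetical.

```python
# Sample characters (illustrative choices, not drawn from the proposal).
samples = {
    "A": "LATIN CAPITAL LETTER A (ASCII)",
    "\u00e6": "LATIN SMALL LETTER AE",
    "\u05d0": "HEBREW LETTER ALEF",
    "\u4e2d": "CJK ideograph U+4E2D",
}

def utf8_octets(ch: str) -> int:
    """Return the number of octets in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_octets(ch)} octet(s)")
```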

UTF-7, devised to support 7-bit transfer protocols such as MIME, also expresses characters as sequences of octets but necessarily uses a more restrictive rule than UTF-8 about what each octet can contain.

USM-94. During its deliberations, the Task Force found it convenient to have a concise way to refer to the USMARC character repertoire and encoding that are currently in use; that is, to abbreviate "the ASCII and ANSEL character sets (except for certain ANSEL characters), special escape sequences for a limited set of subscripts, superscripts, and Greek letters, and the ISO 2022 (X3.41) escape sequences for Arabic, Cyrillic, Hebrew and CJK, and the character sets to which those sequences provide access." The term USM-94 has been coined for this purpose.

2 DISCUSSION

2.1 Issue 1: Encoding methods for USMARC Unicode records

Unicode can be encoded in a USMARC record either in its "native" 16-bit form or in a UCS Transformation Format (UTF). The Task Force considered three encoding options: UTF-16, UTF-8, and UTF-7. UTF-7 was eliminated because it is obsolescent.

In considering 16-bit representations, the Task Force had to make a choice between UCS-2 and UTF-16. The difference is a subtle one. The Task Force chose to refer to untransformed Unicode as UTF-16 rather than UCS-2 for two reasons. First, it provides a better terminological counterpart to UTF-8. Second, and more important, it indicates unambiguously the possibility of including non-BMP characters in USMARC records, when such characters are defined. The ever-expanding repertoire of ideographs is likely to require use of non-BMP space in the future.

The consensus of the Task Force is that UTF-8 is the preferred specification for USMARC at this time. UTF-8 is the UCS form most widely supported by existing database and communications software, and library system development employing it is well underway. Further, UTF-8 is strategically important for Internet communications. RFC 2277 "IETF Policy on Character Sets and Languages" (January 1998) states:

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text. Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy.

UTF-8 has the useful property that USMARC implementation can be specified in such a way that only at the point of processing the data contained within the fields is it necessary to know whether a record is in UTF-8 or USM-94. Extant software that deals only with the structure of records may require little or no change to process UTF-8. Task Force discussion has assumed that a UTF-8 proposal should exploit this advantage. To do so requires making explicit certain limitations currently observed in USMARC implicitly. These measures, as well as a record capacity issue which is a small problem for UTF-8, are discussed below as part of Issue 5.

Members of the Task Force have disagreed about the appropriateness of proposing a UTF-16 encoding at this time. Those favoring UTF-16 immediately do so because they recognize it as the more straightforward approach to processing, and believe that setting guidelines now would facilitate progress toward its implementation. Those opposing believe that offering UTF-16 now, in addition to UTF-8, would in actuality require developers simultaneously to implement software for both encodings, with increased risk of failure or of bifurcation of the community. We have agreed to offer a UTF-8 proposal now, and to stress the probable need for development of a UTF-16 proposal in the future.

2.2 Issue 2: Uniformity of encoding

There is unanimous agreement that only one encoding should be used in a communicated MARC record; that is, any one record should be encoded entirely in USM-94 or entirely in Unicode. If Unicode, it should be UTF-8 throughout or, when permitted, UTF-16 throughout. This means that the Leader, the Directory and all subsequent fields will employ the same encoding.

The Task Force favors imposing a similar restriction on files of communicated records; i.e., all records in a file should use the same encoding.

2.3 Issue 3: Escape sequences

Uniformity of encoding within a record means that ISO 2022 escape sequences are not used when the encoding is Unicode, as escape sequences would have no meaning or function. Thus escape sequences should not be allowed in Unicode encoded USMARC records. When a record is converted from Unicode encoding to USM-94, escape sequences would be inserted where needed.
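A receiving program could check this rule with a test along the following lines. This is a minimal sketch, not a prescribed validation: the function name contains_escape is hypothetical, and the test relies only on the fact that ISO 2022 escape sequences begin with the ESCAPE control character (hexadecimal 1B), an octet that never occurs inside a multi-octet UTF-8 sequence.

```python
ESC = 0x1B  # the ESCAPE control character that introduces ISO 2022 sequences

def contains_escape(record: bytes) -> bool:
    """Return True if the raw record contains an ESC octet anywhere."""
    return ESC in record

# Made-up fragments for illustration.
usm94_fragment = b"abc\x1b(Ndef"                                   # contains an escape sequence
utf8_fragment = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")  # Cyrillic text in UTF-8

print(contains_escape(usm94_fragment), contains_escape(utf8_fragment))
```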

It should be noted that the code identifying a character set used in an 880 field consists of part of the escape sequence for the set. Since identifying the character set used in a field is not relevant to Unicode encoded records, it is desirable to specify that subfield $6 of field 880 in Unicode records not include the set identifier in the first segment. (Note: See DP 111 comments on the inutility of the $6 script identifier even in USM-94 records.)

2.4 Issue 4: Combining characters and precomposed forms

97-10 proposed that MARBI "establish that USMARC records [using UCS characters] use only those listed in the USMARC to UCS mapping." This precludes the use of nearly all precomposed characters in favor of sequences of base character plus combining character(s). The Task Force has accepted this recommendation as axiomatic for its work.

Unicode USMARC records must have combining characters follow the base characters in order to conform to Unicode specifications. This requirement varies from established USMARC practice and involves repositioning of combining characters when converting between USM-94 and Unicode. This is a routine matter when combining characters occur singly.

When multiple combining characters are associated with one base character a complexity arises because the Unicode rule for ordering these differs from current practices, which vary among themselves. The Task Force favors storing combining characters in the prescribed Unicode order in Unicode encoded records, but recognizes that conversion of existing USM-94 records may not result in a correct Unicode sequence in certain cases. Nor will it be possible to guarantee perfect round trip mapping when reconverting to USM-94. Two things reduce the importance of the sequencing of multiple combining characters: the infrequent occurrence of characters modified by multiple combining characters, and the variance of practice in existing data. The Task Force believes that inability to define a precise solution should not prevent adoption of a proposal lacking such specification, but further study is in order to determine whether the Task Force can recommend a "best practice."

It is important to keep in mind that this relocation of combining characters refers specifically to their sequential relationship in communicated records. Unicode conformant software will expect combining characters to follow the characters they modify. This internal change does not necessarily imply a change in what the user of a system sees or does.
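The repositioning described in this section can be sketched in Python as follows. The sketch assumes the text has already been mapped to Unicode code values but is still in the USM-94 order, with combining characters preceding the base character. The function name marks_after_base is hypothetical. Runs of several combining characters are moved as a unit with their relative order unchanged, so the sketch does not attempt the Unicode ordering of multiple marks that the Task Force leaves for further study. Python's unicodedata.combining() is used to recognize combining characters; it reports the canonical combining class, so any mark whose class is zero would not be detected.

```python
import unicodedata

def marks_after_base(text: str) -> str:
    """Move each run of combining marks so that it follows the next base character."""
    out = []
    pending = []  # combining marks seen but not yet attached to a base
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # stray marks with no following base are kept at the end
    return "".join(out)

# Acute accent stored before the letter e, as in USM-94 order.
converted = marks_after_base("\u0301ete")
print([f"U+{ord(c):04X}" for c in converted])
```

Only the stored sequence changes; how the text is displayed or keyed is a separate matter for each system.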

2.5 Issue 5: Position counts and lengths

A MARC record includes several elements specifying counts or lengths and elements that are fixed in length and positionally defined. Z39.2-1994, ISO 2709, and USMARC use "character position" as the unit of measure for these, with the implication that a character position is an octet. There is an implied equivalence of "character position" and "character" (with the exception of the multibyte characters in the CJK character set). In USMARC's use of USM-94, each octet is considered a "character position" for counting purposes; and for UTF-8 the same principle is recommended. For the ASCII characters, the UTF-8 encoded character will always be one octet by specification. (Note that it is likely that each 16-bit character would be the basic counting unit for a UTF-16 record.)

2.5.1. Subfield codes, field tags, Subfield Code Count (Leader/11), Directory Entry Map (Leader/20-23)

Z39.2 specifies that field tags (in the Directory) and the data element identifier in subfield codes (i.e., the a in $a) of variable fields must be ASCII graphic characters, thus one octet per character in UTF-8. Also, in the Leader the lengths of the Subfield Code Count (2 in USMARC) and the parts of the Directory Entry Map are themselves expressed in ASCII digits, meaning that the UTF-8 representation of any of these requires only one octet per character. Hence, for the Subfield Code Count (Leader/11) and the elements of the Directory Entry Map (Leader/20-23), the number of octets will always be the same as the number of characters.

2.5.2. Indicator Count (Leader/10), Record Status (Leader/5), Type of Record (Leader/6)

USMARC Specifications states that indicators in variable fields and Leader/5-6 shall have ASCII character values, and thus each will be one octet in length in UTF-8 encoded records. Also the Indicator Count in Leader/10 of a USMARC record, being a decimal digit, will be one octet in length in UTF-8.

2.5.3. Implementation-defined Leader positions (Leader/7-9, 17-19)

USMARC does not restrict the implementation-defined positions of the Leader (Leader/7-9, 17-19) to ASCII graphics although all values defined by the "particular implementations" -- bibliographic, holdings, authorities, classification, community information -- do meet that criterion. If a UTF-8 proposal is adopted, this restriction should be formalized in USMARC specifications to ensure that a Leader in UTF-8 will always contain exactly 24 octets.

2.5.4. Base Address of Data (Leader/12-16), Record Length (Leader/0-4), Directory length and starting position elements

The Record Length (Leader/0-4), Directory length and starting position elements, and Base Address of Data (Leader/12-16) for a UTF-8 record are all ASCII numbers, thus one octet per character in UTF-8.

Observing the restrictions cited above, the value of the Base Address of Data will be the same whether counted in characters or octets in UTF-8. However, the values for Record Length (Leader/0-4), Field length (Directory entries), and Starting character position (Directory entries) are sensitive to the choice. All of these important elements are used primarily by programs assembling, communicating, or parsing records. Processing of UTF-8 records is facilitated if these values reflect the number of octets (rather than characters) in the item being measured.
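Counting in octets can be sketched as follows for the Directory. This is a minimal illustration under stated assumptions: the function name build_directory and the sample fields are hypothetical, each entry uses the customary layout of a 3-character tag, a 4-digit field length and a 5-digit starting position, and each field length includes the one-octet field terminator (hexadecimal 1E).

```python
FIELD_TERMINATOR = "\x1e"  # ends every variable field in the record

def build_directory(fields):
    """Return the Directory entries for a list of (tag, field_text) pairs.

    Field lengths and starting positions are counted in UTF-8 octets,
    not in characters.
    """
    entries = []
    start = 0
    for tag, text in fields:
        length = len((text + FIELD_TERMINATOR).encode("utf-8"))
        entries.append(f"{tag}{length:04d}{start:05d}")
        start += length
    return "".join(entries)

example = [
    ("245", "10\x1faTitle"),         # ASCII only: 9 characters, 9 octets
    ("880", "  \x1fa\u0434\u0430"),  # two Cyrillic letters: 6 characters, 8 octets
]
print(build_directory(example))
```

A program that reads the record can then locate each field by slicing the octet stream directly, without decoding any field data first.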

Measuring lengths in octets means that the field and record capacities of UTF-8 and USM-94 records will differ. For example, a field in USM-94 can contain 9999 characters, but a field in UTF-8 can contain only 9999 octets, which may amount to fewer characters if they are not ASCII. For UTF-8, a technique specified in Z39.2-1994 4.3.1.2, which enables the use of fields longer than 9999, could be used but few systems now support it. There is no way, however, to accommodate a record with several such fields, whose total length would exceed 99,999 octets.
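The capacity difference can be made concrete with a short Python sketch. The names utf8_length and fits_field_limit are hypothetical, and the sample strings are artificial; the limit of 9999 is the figure cited above.

```python
MAX_FIELD_OCTETS = 9999  # largest field length cited above

def utf8_length(text: str) -> int:
    """Number of octets the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def fits_field_limit(field_text: str) -> bool:
    """True if the field can be recorded within the 9999-octet limit."""
    return utf8_length(field_text) <= MAX_FIELD_OCTETS

ascii_field = "a" * 9999          # 9999 characters, 9999 octets
cyrillic_field = "\u0434" * 5000  # 5000 characters, two octets each

print(fits_field_limit(ascii_field), fits_field_limit(cyrillic_field))
```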

UTF-8 record capacity limits may thus cause problems for conversion of some long records between USM-94 and Unicode. The Task Force has considered these constraints and found them not to form a critical impediment to choosing octets as the UTF-8 unit of measurement.

2.5.5. Non-filing indicator

It is humans who are responsible for the management of the non-filing indicator; so its values should be independent of encoding. It is recommended to count characters in non-filing indicators rather than octets. The Task Force discussed the effects on non-filing indicators of relocating combining characters to follow their base characters during conversion from USM-94 to Unicode. Our conclusion is that practice on setting the indicator in relation to diacritical marks has been inconsistent. Note, however, that unauthorized or inconsistent future use of combining characters could cause problems with non-filing indicators in certain languages, such as Greek.
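The difference between counting characters and counting octets for the non-filing indicator can be sketched as follows. The function names and the Greek sample title are hypothetical. The sketch treats each Unicode code value as one character; whether a combining character should count separately is a question the discussion above leaves open.

```python
def filing_form(title: str, nonfiling_characters: int) -> str:
    """Drop the leading non-filing characters, counted as characters."""
    return title[nonfiling_characters:]

def nonfiling_octets(title: str, nonfiling_characters: int) -> int:
    """Number of UTF-8 octets occupied by the non-filing characters."""
    return len(title[:nonfiling_characters].encode("utf-8"))

# Greek article, a space, then a noun; the indicator value would be 2.
title = "\u0397 \u03b6\u03c9\u03ae"

print(filing_form(title, 2))
print(nonfiling_octets(title, 2))
```

Here two non-filing characters occupy three octets, so a program that skipped two octets instead would leave the space at the head of the filing form.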

2.5.6. Fields with positionally defined elements

USMARC fields 006, 007, and 008 contain data elements that are positionally defined. This means that a program can expect to find a particular element at a certain offset from the start of the field. To preserve this certainty in UTF-8 records, it is necessary to restrict the content of these fields to ASCII characters; that is, characters that require only one octet in UTF-8. As is the case with the implementation-defined leader positions, all currently used values meet this criterion. The restriction should be made explicit.
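A conforming check and a positional lookup might be sketched as follows. The function names, the sample data and the offsets are made up for illustration and do not reproduce any actual field definition.

```python
def is_single_octet_only(field_data: str) -> bool:
    """True if every character is ASCII, i.e. one octet each in UTF-8."""
    return all(ord(ch) < 0x80 for ch in field_data)

def element_at(field_octets: bytes, offset: int, length: int) -> str:
    """Read a positionally defined element straight from the octets."""
    return field_octets[offset:offset + length].decode("ascii")

sample = "980527s1998"  # made-up fragment of a fixed-length field
assert is_single_octet_only(sample)

# With ASCII-only content, octet offsets and character offsets coincide.
print(element_at(sample.encode("utf-8"), 7, 4))
```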

2.6 Issue 6: Identifying Unicode records

Some Task Force members say that it is necessary to know the encoding of a record before beginning to try to read it; so identification outside the record is required and further identification is redundant. They assert that identification of encoding belongs in file labels, or other locations. All Task Force members favor the use of a single encoding in a transmitted file; a file label identifier has clear utility.

Opponents of the first view contend that lack of an identifier in the record unnecessarily limits processing options; that it may be a mistake not to keep track of the character set in which a record was created or received; and that some methods of communication do not provide an external place for such information. The Task Force came to agreement on recommending an identifier in the record.

With one exception, Task Force members agree strongly that if an identifier in the record is to be useful, it must come very early in the record. The dissenting opinion was that a variable field identifier would be satisfactory, such as field 066 $d proposed in 97-10. Because of the preponderance of opinion that a variable field identifier would not be early enough in the record, field 066 was rejected in favor of a Leader character position.

As a data element that provides information for the processing of the record, an identifier of the character set used to encode the record fits well the definition of a Leader element. The Task Force investigated defining one of two currently unused leader positions, Leader/09 and Leader/23, for character set information. Using Leader/23 is an attractive possibility because the datum to be supplied refers to international standards independent of implementation specifics. However, that choice is inconvenient because it would require revision of Z39.2 and ISO 2709 which currently identify 23 as a reserved position in the Entry Map. The advantage of Leader/09 is its ready availability for definition by the USMARC implementation. The disadvantage is that it is the last such Leader position available.

The Task Force recommends that a USMARC Leader position be defined as Character Coding Scheme with two defined values, "a" meaning ISO/IEC 10646 UCS-2 (Unicode), and blank (in Leader/09) or zero (in Leader/23) signifying USM-94. There is no significance in the choice of "a"; the obvious "u" does not conform to USMARC guidelines as u is intended to be reserved for "unknown" where possible. This new value is not intended to identify the UTF in which the record is transmitted or stored as this was not considered necessary.
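A program receiving a record could test the proposed Leader value along the following lines. This sketch assumes the Leader/09 alternative (the option later selected, according to the status notes at the head of this proposal); the function name character_coding_scheme and the sample Leaders are hypothetical. Because the Leader contains only single-octet characters in both USM-94 and UTF-8, the position can be examined before any decision is made about decoding the rest of the record.

```python
LEADER_LENGTH = 24

def character_coding_scheme(record: bytes) -> str:
    """Report the encoding signalled by Leader/09 of a raw record."""
    leader = record[:LEADER_LENGTH]
    if len(leader) < LEADER_LENGTH:
        raise ValueError("record is shorter than a 24-octet Leader")
    code = leader[9:10]
    if code == b"a":
        return "Unicode"
    if code == b" ":
        return "USM-94"
    raise ValueError(f"undefined Leader/09 value: {code!r}")

# Made-up Leaders; only position 09 differs.
unicode_leader = b"00000nam a2200000 a 4500"
usm94_leader = b"00000nam  2200000 a 4500"

print(character_coding_scheme(unicode_leader))
print(character_coding_scheme(usm94_leader))
```

As the paragraph above notes, the value says only that the record is Unicode; it does not say which UTF is in use.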

2.7 Issue 7: Identifying data by character block or script

USM-94 records use field 066 to denote the presence of "non-default" character sets in the record. Proposal 97-10 recommended the definition of two new 066 subfields for Unicode records. Subfield d, to identify UCS use, was discussed above. Subfield e, it was suggested, could identify pieces of the BMP expected to be encountered in the record. "A new USMARC list of repertoire codes would be established or ISO 10646 Annex A could be used."

The Task Force discussed at length various ways of extending or modifying the 066 field. Eventually these were abandoned for a wide range of reasons. It is difficult to define or represent appropriate subsets of the Unicode repertoire that would prove either workable or meaningful. Notably, there is no concise way of describing USM-94 or relating it to any other categorization. There is no consensus on what kinds of categories would be meaningful (script identifiers and character block identifiers were both suggested), and there is immense skepticism about the reliability of 066 data, especially if they have to be provided manually.

More fundamentally, no demonstrable need or concrete use for this information emerged from the discussion. Hence, even if the design questions underlying the previous paragraph could be resolved, there is little sentiment among Task Force members that the result would be useful in the Unicode environment.

The Task Force recommends that systems requiring information of this sort should provide for it outside the communication format and that a transmitted Unicode record should not contain the field 066 even if the record has been converted from a USM-94 record.

3 PROPOSED CHANGES

The following changes are presented for consideration.

3.1 Add text to USMARC Specifications as follows:

b) Reference or include the USMARC to Universal Character Set Mappings (including, when it is approved, the mapping of EACC currently under development) to define the correspondence between the two character sets and the authorized repertoire of UCS characters;

c) Require the use of one encoding throughout a single USMARC record;

d) Require the use of a single encoding in a file of transmitted USMARC records;

e) Note that ISO 2022 escape sequences will not be used in a Unicode encoded record;

f) Explicitly restrict values to be used in Leader/7-9, 17-19 (the implementation-defined Leader positions) to ASCII characters; i.e., to characters that are encoded as a single 8-bit unit in UTF-8;

g) Clarify that a "character position" is an 8-bit unit in records using either USM-94 or UTF-8;

h) Require combining characters in Unicode encoded records to follow those they modify.

3.2 In all USMARC formats, choose one of the following options:

Option 1: Define leader position 09 as "Character coding scheme," with two values defined: blank to signify USM-94, and value a to signify Unicode.

Option 2: Define leader position 23 as "Character coding scheme," with two values defined: zero to signify USM-94, and value a to signify Unicode.

3.3. In all USMARC formats:

b) Request that NDMSO identify all positionally defined subfields and make the same restriction where needed.

3.4. In the USMARC Authority, Bibliographic, Classification and Community Information formats (that include fields 066 (Character sets present) and 880 (Alternate Graphic Representation)):

b) Specify that the character set (script) identifier in field 880 $6 in Unicode encoded records is not allowed.

Appendix A
RELATED QUESTIONS FOR LATER RESOLUTION

System developers are in need of a complete specification for the use of Unicode data in USMARC records, and the foregoing proposed changes go only part way to achieving that. A number of major issues such as expansion of the USMARC character repertoire and confinement of non-roman script data to Alternate Graphic Representation fields are out of scope of the charge to the Task Force. In varying degrees the problems they pose constitute questions of policy as well as of technique. When the former are resolved, a technical issues group will have more work to do.

This report has mentioned some primarily technical matters that can and should be addressed soon. We enumerate those and some others here as possible work items for this or another Task Force.

1. Specify a UTF-16 encoding. This work can build upon the work reported here. Whether to use the Unicode byte order mark and which "endianness" is preferred for transmissions need to be part of such a specification.

2. Specify ways to indicate encoding scheme in file labels and other appropriate out-of-the-record locations.

3. Examine more extensively the problems of converting multiple combining characters so as to specify rules or best practice in these cases.

4. Recommend a convention for preserving Unicode data outside the scope of USM-94 in a USM-94 environment.

5. Determine whether changes to USMARC Specifications are necessary to ensure that USMARC handling of bidirectional scripts is unambiguously Unicode conformant, particularly with respect to character ordering in numbers.

6. Identify all subfields with position dependent content for the purpose of determining whether their content should be restricted to ASCII characters.

Appendix B
REFERENCES

The Unicode Standard, Version 2.0. Reading, MA: Addison-Wesley Developer's Press, 1996. [Unicode is a trademark of Unicode, Inc., and may be registered in some jurisdictions.]

International Organization for Standardization. Information Technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Geneva, 1993. (ISO/IEC 10646-1:1993)

Internet Engineering Task Force. RFC 2277 "IETF Policy on Character Sets and Languages" (January 1998) http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2277.txt

USMARC Specifications for Record Structure, Character Sets, and Exchange Media. Washington, DC: Cataloging Distribution Service, Library of Congress, 1994.

MARBI Discussion Paper No. 73: UCS and USMARC Mapping. December 1993.

MARBI Proposal 96-10: USMARC Character Set Issues and Mapping to Unicode. gopher://marvel.loc.gov/00/.listarch/usmarc/96-10.doc

MARBI Proposal 97-10: Use of the Universal Code Character Set in USMARC Records. gopher://marvel.loc.gov/00/.listarch/usmarc/97-10.doc

USMARC to Universal Character Set Mappings. //www.loc.gov/marc/marc2ucs.html

Library of Congress Help Desk (09/01/98)