The Library of Congress >> Especially for Librarians and Archivists >> Standards

MARC Standards



MARC PROPOSAL NO. 2025-05

DATE: May 22, 2025
REVISED:

NAME: Tagging Transliteration Schemes and BCP 47 in Data Provenance Subfields in the MARC 21 Authority and Bibliographic Formats

SOURCE: ALA-LC Romanization Tables Review Board; PCC Standing Committee on Standards; ALA Core Committee on Cataloging: Asian and African Materials

SUMMARY: This proposal adds two category codes to the MARC data provenance elements to enable the tagging of transliteration schemes at the field- and/or subfield-level, including the use of BCP 47 tags.

KEYWORDS: Data provenance (AD, BD); Data provenance category codes (AD, BD); Subfield $7 (Data provenance) (AD, BD); Subfield $e (Data provenance) (AD, BD); Subfield $l (Data provenance) (BD); Subfield $y (Data provenance) (BD); Transliteration scheme (AD, BD); Best Current Practice 47 (AD, BD); BCP 47 (AD, BD)

RELATED: 2024-DP11, 2022-05, 2021-DP06, 2016-DP26, DP109

STATUS/COMMENTS:
05/22/25 – Made available to the MARC community for discussion.

06/25/25 – Results of MARC Advisory Committee discussion: Approved, with the amendment to change the code to "bcp47".

11/26/25 – Results of MARC Steering Group review - Agreed with the MAC decision.


Proposal No. 2025-05: Tagging Transliteration Schemes and BCP 47 in Data Provenance Subfields

1. BACKGROUND

MARC Proposal 2022-05 defined a set of data provenance subfields sharing common characteristics. For most MARC fields, data provenance is recorded in $7, with exceptions for fields where $7 had already been defined. MARC Proposal 2022-05 further defined a set of data provenance category codes, which were modeled closely on concepts in original RDA. Discussion Paper 2024-DP11 outlined two scenarios in which the MARC data provenance category codes could be used to tag transliteration schemes at the field and subfield level, mirroring how language characteristics are tagged in contemporary standards such as XML, HTML, and RDF.

Informed by discussion of 2024-DP11 at the Annual 2024 MARC Advisory Committee meeting, this proposal adds two category codes to the MARC data provenance elements to enable the tagging of transliteration schemes at the field- and/or subfield-level, including the use of BCP 47 tags: dpets (data provenance element transliteration scheme) and dpebcp (data provenance element BCP 47 tag). 

A further issue that surfaced during the Annual 2024 MARC Advisory Committee meeting was a lack of awareness of how language tagging works in contemporary web and technology standards like RDF, HTML, and XML, which prefer the use of BCP 47 tags.  As such, the authors of the current proposal drafted and assembled a few resources that may help MARC catalogers better understand how BCP 47 tags are used.

2. DISCUSSION

As defined in MARC Appendix J - Data Provenance Subfields, the current MARC data provenance codes include dpesc (Data provenance element source consulted), and Appendix J includes an example of using dpesc to encode a transliteration scheme. However, the code "dpesc" seems much broader in scope than just transliteration schemes, and as such it would be beneficial to have a dedicated category code for transliteration scheme. Original RDA defines "source consulted" as "A resource used in determining the name, title, or other identifying attributes of an entity, or in determining the relationship between entities," which does not seem to apply to the ways that transliteration tables are used in cataloging and metadata creation. Furthermore, the definition of the relationship element "source consulted" in Official RDA, "a manifestation in which there is evidence for a metadata work," would not seem to apply at all to a transliteration scheme, which is not intended to provide "evidence." Finally, the authors of this paper reached out to the RDA Technical Working Group for advice, and the WG advised that transliteration is adequately covered in Official RDA guidance. Specifically, for transcribed/transliterated manifestation elements, the RDA guidance sections on Data provenance and Guidelines on normalized transcription address transliteration, and for Nomens, various RDA Nomen elements, including derivation, script of Nomen, and scheme of Nomen, account for scenarios of transliteration and the use of BCP 47 tags.

A primary use case for tagging transliteration schemes at a field level is in authority data, where it may be useful to distinguish between, for example, a variant access point romanized from Chinese script using Wade-Giles versus a variant access point romanized using Pinyin.

In addition, tools have been developed in other ecosystems like Wikidata/Wikipedia (see the Aksharamukha converter), which enable the conversion of transliterated data from one scheme to another. In current shared cataloging environments like OCLC Connexion/WorldCat, different communities use different romanization schemes: American catalogers transliterate Cyrillic script differently from German catalogers, for example. Creating or adapting tools to automate conversion between different romanization schemes in our shared cataloging ecosystem could improve cataloging efficiency and data re-use in both our authority and bibliographic databases, but this is likely not possible without tagging transliteration scheme at a field level (or even at the subfield level). As an example, if an American library finds good copy for a Mongolian-language (Cyrillic script) monograph in Connexion, but the record was created by a German library using the transliteration tables of the Deutsches Institut für Normung rather than the ALA-LC Romanization Tables, the American cataloger would have to manually re-transliterate the appropriate fields and subfields from Mongolian when deriving a new record. However, if the appropriate fields (and/or subfields) were tagged with relevant information about the language, script, and transliteration scheme used, it could in theory be possible to automate the conversion from the German romanization to the desired ALA-LC romanization. In other words, more granular tagging of language, script, and transliteration scheme could greatly promote the re-use of metadata across language communities, and reduce the amount of busywork involved in transliterating/romanizing metadata.
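As a rough illustration of why field-level tagging matters for automation, the sketch below reads the transliteration mechanism out of a BCP 47 tag and decides whether a string would need re-transliteration for a given target scheme. It is only a sketch under stated assumptions: the function names are hypothetical, the parsing covers only the "-t-" extension with an "m0" field as seen in this proposal's examples, and the sample tags are taken from the proposal's own examples rather than from any real record.

```python
def transliteration_mechanism(tag):
    """Return the m0 (mechanism) value of a BCP 47 tag, or None if absent.

    Hypothetical helper: handles only the "-t-...-m0-..." pattern used in
    this proposal's examples, not the full BCP 47 grammar.
    """
    subtags = tag.lower().split("-")  # BCP 47 tags are case-insensitive
    if "t" not in subtags or "m0" not in subtags:
        return None
    start = subtags.index("m0") + 1
    value = []
    for subtag in subtags[start:]:
        # A subtag shorter than 3 characters starts a new field or extension.
        if len(subtag) < 3:
            break
        value.append(subtag)
    return "-".join(value) or None


def needs_retransliteration(tag, wanted="alaloc"):
    """True when the string was romanized with a scheme other than `wanted`."""
    mechanism = transliteration_mechanism(tag)
    return mechanism is not None and mechanism != wanted
```

Under these assumptions, a heading tagged zh-Latn-t-zh-Hans-m0-wadegile would be flagged for conversion in an ALA-LC environment, while an original-script string tagged simply zh would be left alone.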

A second use case is conversion to and from BIBFRAME. Since BIBFRAME is an RDF standard, language, script, and transliteration scheme are tagged at the equivalent of the field or subfield level, rather than at the record level. Enabling more options in MARC to support this more granular level of tagging could help make conversion from MARC to BIBFRAME and from BIBFRAME to MARC smoother, with less data loss.

Related to this second use case is the matter of the use of BCP 47 tags. BCP 47 is the leading standard for tagging string (or textual) data in most contemporary computer markup languages and environments, including HTML, XML, and RDF (see the resources cited in the Background section). In fact, it is the only permitted standard for tagging the language characteristics of a string in RDF. BCP 47 tags are used primarily to tag only the language of a string, but subtags for other aspects of string data, including script, regional and dialectal variants, and the transliteration scheme used to construct a string, may be appended to the language tag following a specific syntax outlined in the standard. This implementation is supported in BIBFRAME-compliant tools like Sinopia, which has a widget that assists users in adding valid BCP 47 tags to string data, without having to manually enter the tag. Because of the recommended syntax for constructing more complex BCP 47 tags, it would be somewhat challenging to construct a valid tag by hand from its possible subtags for language, script, transliteration scheme, etc., although in practice the vast majority of BCP 47 tags are much less complex and only include a single language subtag.
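To make the syntax concrete, the following minimal sketch assembles a transliteration tag of the shape used in this proposal's examples: target language and script, then the "t" extension naming the source language (and optionally its script), then the "m0" field naming the transliteration mechanism. The function name and the loose shape checks are assumptions for illustration; the sketch does not consult the IANA subtag registry and is not how Sinopia or any other tool necessarily works.

```python
import re

# Loose shape checks only (assumption): 2-3 letter language, 4-letter script,
# 3-8 character mechanism. Registry membership is not verified.
_LANGUAGE = re.compile(r"[a-z]{2,3}")
_SCRIPT = re.compile(r"[A-Z][a-z]{3}")
_MECHANISM = re.compile(r"[a-z0-9]{3,8}")


def build_transliteration_tag(language, target_script, mechanism, source_script=None):
    """Assemble a tag such as 'zh-Latn-t-zh-Hans-m0-alaloc' from its parts."""
    if not _LANGUAGE.fullmatch(language):
        raise ValueError(f"malformed language subtag: {language!r}")
    for script in (target_script, source_script):
        if script is not None and not _SCRIPT.fullmatch(script):
            raise ValueError(f"malformed script subtag: {script!r}")
    if not _MECHANISM.fullmatch(mechanism):
        raise ValueError(f"malformed mechanism subtag: {mechanism!r}")

    parts = [language, target_script, "t", language]
    if source_script is not None:
        parts.append(source_script)
    parts += ["m0", mechanism]
    return "-".join(parts)
```

The length limit on the mechanism subtag is why the examples in this proposal use the shortened value "wadegile": a value like "wade-giles" would be rejected by the check above.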

The possibility of BCP 47 also brings into starker relief a third use case, affecting bibliographic records. Although bibliographic records less commonly have transcribed and transliterated data in multiple languages and scripts, these situations do come up. Consider the possibility of a bilingual Hebrew-Yiddish monograph with English as the language of cataloging, or a 19th-century Yiddish monograph published in Poland with publication information given in Russian (Cyrillic script). In such instances, legacy practices of tagging language, script, and transliteration scheme at the record level fall short, and there would be no fully automated way to determine which strings in the record should be associated with which tags for language, script, and transliteration. This would be especially true for transliterated data, as it would not simply be a matter of associating a language tag with data in a given script. While field- and subfield-level tagging would not solve all possible complications of tagging string data in bibliographic records, it is a marked improvement over record-level tagging.

Because some institutions may wish only to tag the transliteration scheme used to construct a string, whereas others, particularly those institutions moving into a hybrid MARC-BIBFRAME environment, may wish to accommodate full BCP 47 tags in MARC data to facilitate conversion to and from BIBFRAME, this paper proposes the creation of two new category code tags for the MARC data provenance elements.

Furthermore, because the data provenance elements were created relatively recently, and have not been widely implemented in most cataloging communities, adding more granular codes than dpesc (data provenance element source consulted) for transliteration schemes and BCP 47 should not present a major issue in terms of bibliographic file maintenance. In addition, if a local community chooses to use the data provenance code dpesc for recording the transliteration scheme of a string, this does not necessarily create a conflict with the more granular proposed codes.

To aid catalogers in applying BCP 47 tags in the MARC data provenance subfields, the members of the ALA-LC Romanization Tables Review Board brainstormed a few possibilities. 

  1. Adapt existing tools in place in the BIBFRAME editors Sinopia and Marva to work with MARC utilities like Connexion (see the video demos linked in the background section of this proposal).  Note that Princeton has indicated interest in working on some version of this.
  2. Include relevant BCP 47 tags in the ALA-LC Romanization Tables.  For example, the BCP 47 tag ru-Latn-t-ru-m0-alaloc could be included in the ALA-LC Romanization Table for Russian, as this would be the tag applied to strings created when properly applying this table.  Catalogers could copy and paste this tag into their MARC data.
  3. A separate table of valid BCP 47 tags, covering all the languages/scripts represented by the ALA-LC Romanization Tables, could be added to the website for the ALA-LC Romanization Tables.

3. PROPOSED CHANGES

Because of the two possible use cases for tagging transliteration schemes, the authors of this paper propose two additional category codes: 1) one that covers only transliteration schemes, and 2) one specific to BCP 47, which potentially combines codes for language (currently covered by the category code dpeloe), script (currently covered by the category code dpes), and transliteration. These codes would be the following:

dpets - Data provenance element transliteration scheme
dpebcp - Data provenance element BCP 47 tag

In preliminary discussions, an alternate option was suggested for the second category code (dpebcp): bcp47. An advantage of this code would be precision: BCP stands for "Best Current Practice," and BCP 47 is just one of several such documents. A disadvantage of this alternate option would be a break from the convention of data provenance category codes, which all begin with "dpe." As such, a possible third way could be the code dpebcp47, which is a bit lengthier than any of the current codes. Perhaps the broader constituency of the MARC Advisory Committee could indicate in their feedback whether they have a strong preference regarding the text string for this code.

The data provenance category code dpebcp requires the use of a BCP 47 tag, which is a structured value.  Likewise, with the data provenance category code dpets, the authors of this paper envision that most users and communities would prefer to use a structured value (e.g., from a controlled vocabulary), or perhaps even an IRI, to indicate the transliteration scheme used, but individual communities may decide what implementation approach works best for their descriptive needs.

4. EXAMPLES

4.1. Example 1: Authority record for the Chinese Nobel Prize winner Gao Xingjian


4.1.1. Using dpets Data provenance element transliteration scheme

100 1_ $a Gao, Xingjian $7 (dpets)ala-lc
400 1_ $a 高行健
400 1_ $a Kao, Hsing-chien $7 (dpets)wade

4.1.2. Using dpebcp Data provenance element BCP 47 tag

100 1_ $a Gao, Xingjian $7 (dpebcp)zh-Latn-t-zh-Hans-m0-alaloc
400 1_ $a 高行健 $7 (dpebcp)zh
400 1_ $a Kao, Hsing-chien $7 (dpebcp)zh-Latn-t-zh-Hans-m0-wadegile

4.2. Example 2: Authority record for a Japanese Kabuki actor


4.2.1. Using dpets Data provenance element transliteration scheme

100 1_ $a Ichikawa, Danjūrō, $c VII, $d 1791-1859 $7 (dpets)ala-lc
400 1_ $a 市川團十郎, $c VII, $d 1791-1859

4.2.2. Using dpebcp Data provenance element BCP 47 tag

100 1_ $a Ichikawa, Danjūrō, $c VII, $d 1791-1859 $7 (dpebcp)ja-Latn-t-ja-Jpan-m0-alaloc
400 1_ $a 市川團十郎, $c VII, $d 1791-1859 $7 (dpebcp)ja

4.3. Example 3: Multiscript bibliographic record

For a digitized version of the serial used in this example, please see: https://babel.hathitrust.org/cgi/pt?id=hvd.32044105349666&seq=5

4.3.1. Using dpets Data provenance element transliteration scheme

245 00 $a Folḳsshṭime = $b Di folksshtime $7 (dpets)ala-lc
246 31 $a Folksshtime $7 (dpets)ala-lc
880 00 $6 245-01 $a פאלקסשטימע = $b Ди фолксштиме
880 31 $6 246-02 $a Фолксштиме

4.3.2. Using dpebcp Data provenance element BCP 47 tag

245 00 $a Folḳsshṭime = $b Di folksshtime $7 (dpebcp/dpsfa)yi-Latn-t-yi-m0-alaloc $7 (dpebcp/dpsfb)yi-Latn-t-yi-Cyrl-m0-alaloc
246 31 $a Folksshtime $7 (dpebcp)yi-Latn-t-yi-Cyrl-m0-alaloc
880 00 $6 245-01 $a פאלקסשטימע = $b Ди фолксштиме $7 (dpebcp/dpsfa)yi $7 (dpebcp/dpsfb)yi-Cyrl
880 31 $6 246-02 $a Фолксштиме $7 (dpebcp)yi-Cyrl
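The $7 values in these examples follow a regular shape: a parenthesized category code, optionally followed by a slash and "dpsf" plus a subfield code, and then the value itself. The sketch below splits such a value into its parts. It is an illustration only: the names are hypothetical, and the reading of "dpsfa" and "dpsfb" as "applies to subfield $a" and "applies to subfield $b" is inferred from the examples in this proposal rather than quoted from Appendix J.

```python
import re
from typing import NamedTuple, Optional


class Provenance(NamedTuple):
    category: str            # e.g. "dpets" or "dpebcp"
    subfield: Optional[str]  # e.g. "a" when the value applies only to $a
    value: str               # scheme name or BCP 47 tag


# Assumed shape, based on the examples in this proposal:
#   (category)value   or   (category/dpsfX)value
_PROVENANCE = re.compile(
    r"\((?P<category>[a-z0-9]+)(?:/dpsf(?P<subfield>[a-z0-9]))?\)(?P<value>.+)"
)


def parse_provenance(raw):
    """Split the content of a data provenance subfield into its parts."""
    match = _PROVENANCE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a recognizable data provenance value: {raw!r}")
    return Provenance(match["category"], match["subfield"], match["value"])
```

Applied to the 245 field in 4.3.2, this would yield one result scoped to $a (Yiddish in Hebrew script, romanized) and one scoped to $b (Yiddish in Cyrillic script, romanized), which is the subfield-level distinction that record-level tagging cannot express.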

4.4. Example 4: Multiscript bibliographic record


4.4.1. Using dpets Data provenance element transliteration scheme

245 00 $a Epitaphiōn Hephta Thymiamata / $c Stelios Papantōniou $7 (dpets)ala-lc
880 00 $6 245-01 $a Επιταφίων Εφτά Θυμιάματα / $c Στέλιος Παπαντωνίου

4.4.2. Using dpebcp Data provenance element BCP 47 tag

245 00 $a Epitaphiōn Hephta Thymiamata / $c Stelios Papantōniou $7 (dpebcp)grc-Latn-t-grc-m0-alaloc
880 00 $6 245-01 $a Επιταφίων Εφτά Θυμιάματα / $c Στέλιος Παπαντωνίου $7 (dpebcp)grc

5. BIBFRAME DISCUSSION

BIBFRAME has not treated the $7 yet. This is in part because the $7 is relatively new and sample data are rare, and in part because each $7 code potentially needs specific handling rules, which increases the level of effort and requires more time to map. As discussed in the background section, field- and subfield-level tagging of the language properties of a string, as opposed to record-level tagging, aligns much more closely with practices in contemporary standards like HTML, XML, and RDF, and therefore with expected practices in BIBFRAME.  Indeed, BCP 47 is the only standard permitted for tagging the language properties of a given string value in RDF, and therefore the only viable standard for tagging these string characteristics in BIBFRAME.  The proposed changes, therefore, would aid in any conversion between MARC and BIBFRAME.

6. SUMMARY OF PROPOSED CHANGES

In the MARC 21 Authority and Bibliographic Formats, add two new MARC data provenance category codes:

dpets - Data provenance element transliteration scheme
dpebcp - Data provenance element BCP 47 tag


(11/26/2025)