210 Reports and Working Papers

Inclusion of Nonroman Character Sets

The following document was prepared by staff of the Library of Congress as a working paper for discussions on incorporating the techniques described into the MARC communications format. The document defines the principles for inclusion of nonroman alphabet character sets in the MARC communications format and the procedural changes needed to allow implementation of the principles. This technique was agreed upon at the MARBI Committee meeting on February 2, 1981. Any questions on the description of the inclusion of nonroman character sets in the MARC communications format should be addressed to: Library of Congress, Processing Services, Attention: Mrs. Margaret Patterson, Washington, DC 20540.

1. INTRODUCTION

The cataloging rules followed by American libraries favor recording the title page data in the original script when possible. This helps those who consult catalogs to read the most essential information about the book. (Reading his or her name in romanized form is just as difficult for someone who knows Arabic as reading your name when it's written in Arabic.) The new cataloging rules also specify that names and titles in notes be given in their original script, AACR2 1.7A3.

Technological advances have made it possible to provide many, if not all, nonroman alphabets in machine-readable cataloging records. OCLC and RLIN are in the process of enhancing their systems so they can handle some nonroman writing systems. The Library of Congress has entered into a cooperative agreement with RLIN for the development and use of an augmented RLIN system for East Asian (i.e., Chinese, Japanese, and Korean) bibliographic data. Although the Library itself will not be creating and distributing MARC records with nonroman characters in the near term, the goal of this proposal is to define how these data can be included now so others can do so soon.
The technique known as an escape sequence announces that the codes which follow will represent letters in a specific different alphabet instead of the roman letters the codes would otherwise stand for.

2. PRINCIPLES

The following principles will govern inclusion of other alphabets in MARC records. Note that these deal only with the MARC communications format record, not the details of its processing (keying, sorting, display, etc.) by any bibliographic agency or utility. These principles are a slightly revised version of ones reviewed and approved in principle by the MARBI Character Set Committee in 1976. The earlier version was also distributed that year as working paper N77 of ISO TC46/SC4/WG1.

(1) Standard character sets should be used when available.
(2) Standard escape sequences should be used when available.
(3) Escape sequences should be used only when needed.
(4) Escape sequences are locking within a subfield but revert at any delimiter or field or record terminator code.

    Example (for demonstration purposes only, EC represents escape to Cyrillic and EA escape to ASCII):

        245 10$aECRussian title proper :$bECRussian subtitle. F
    not
        245 10$aECRussian title proper :EA$bECRussian subtitle. EAF
    and not
        245 10$aECRussian title proper :$bRussian subtitle. F

(5) Records which contain an escape sequence will also contain a special field which specifies what unusual character sets are present.

3. IMPLEMENTATION

The following will be done to realize these principles.

• The ALA character set will be redefined (see table 1).
• A new character sets present field will be defined.
• Details of application such as distribution, filing indicator values, etc., will be defined.

3.1 Discussion - ALA Character Set

A character set is a list of characters with the code used to represent each one. Using this definition, the ALA character set as given in appendixes III.B and III.C of MARC Formats for Bibliographic Data actually consists of eight character sets.
(1) ASCII and ALA diacritics and special characters with their eight-bit code.
(2) Superscript zero to nine, plus, minus, open and close parentheses with their eight-bit code.
(3) Subscript zero to nine, plus, minus, open and close parentheses with their eight-bit code.
(4) Greek lowercase alpha, beta, and gamma with their eight-bit code.
(5-8) The same characters with their six-bit codes.

Table 1. Proposed Revised ALA Character Set

The six-bit character sets are used to distribute MARC records on seven-track tapes. There are very few subscribers. It is unlikely that a method can be devised for distributing records with nonroman character sets on such tapes. The present seven-track subscribers should be asked if they know of any way to do so. If they do not, the alternatives are to cease distribution of seven-track tapes entirely or to limit them to those records containing only roman alphabet characters, that is, those without a character sets present field. In the latter case, they should pay proportionately less for their subscription.

The present four eight-bit character sets and their escape sequences do not conform to present standards. The present standards did not exist when the character sets were being defined. To avoid creating and distributing records containing both standard and nonstandard character sets and stan-
Escape sequences would be given where needed in data fields. If necessary, it is permissible to embed escape sequences within a word. For example, a Latin diacritic might be needed with an extended Cyrillic letter to represent a letter in one of the non-Slavic languages of Central Asia which uses the Cyrillic alphabet.

214 Journal of Library Automation Vol. 14/3 September 1981

Table 3. Escape Sequence Character Set (GOST 13052-67 Russian; ISO DIS 5427 Extended Cyrillic)

In addition to escape sequences for nonroman alphabets described above in which one code stands for one letter, the escape standards also define escape sequence procedures for changing to multiple byte character sets. Because the ideographic writing systems of East Asia use thousands of different characters, it will be necessary to use two or three bytes/codes to identify a single specific character uniquely. The Japanese Industrial Standard character set, JIS 6226, uses two bytes per character, and it has been submitted to ISO to obtain a registered escape sequence.
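The locking rule of principle (4), together with the allowance for an escape sequence inside a word, can be illustrated with a short Python sketch. It is not part of the proposal. The control-code values, the fixed three-character escape length, the function name split_runs, and the designation ")W" are all assumptions made for demonstration; the two characters after the escape character are treated as an opaque label and are not decoded.

```python
# Illustrative sketch only; every constant below is an assumption.
ESC = "\x1b"         # escape character
DELIM = "\x1f"       # subfield delimiter (assumed value)
FIELD_END = "\x1e"   # field terminator (assumed value)
ESC_LEN = 3          # assumed: ESC plus two designating characters
DEFAULT = "default"  # label for the record's default character set


def split_runs(field_data):
    """Split one field's data into (subfield code, character set, text) runs.

    An escape sequence stays in force ("locks") until the next subfield
    delimiter or the field terminator, where the default set returns.
    """
    runs = []
    code = None
    charset = DEFAULT
    text = []

    def flush():
        # Emit the text collected so far under the current code and set.
        if text:
            runs.append((code, charset, "".join(text)))
            text.clear()

    i = 0
    while i < len(field_data):
        ch = field_data[i]
        if ch == FIELD_END:
            break
        if ch == DELIM:
            flush()
            code = field_data[i + 1:i + 2] or None
            charset = DEFAULT          # revert at every delimiter
            i += 2
        elif ch == ESC:
            flush()
            charset = field_data[i + 1:i + ESC_LEN]
            i += ESC_LEN
        else:
            text.append(ch)
            i += 1
    flush()
    return runs


CYR = ")W"  # hypothetical designation used only as a label here

# First form of the principle (4) example: the escape is repeated in $b.
preferred = (DELIM + "a" + ESC + CYR + "TITLE :" +
             DELIM + "b" + ESC + CYR + "SUBTITLE." + FIELD_END)
print(split_runs(preferred))

# Third form: no escape in $b, so $b is read in the default set.
print(split_runs(DELIM + "a" + ESC + CYR + "TITLE :" +
                 DELIM + "b" + "SUBTITLE." + FIELD_END))
```

Because the active set is reset at each delimiter, no escape back to the default set is needed before a subfield code or a field terminator, which is why the second form in the principle (4) example is rejected as redundant.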
The first volume of the Chinese Character Code for Information Interchange, CCCII, has been issued; the second is expected in December. It uses three bytes per character. In all probability the LC/RLIN East Asian cooperative project will adopt either these character sets and their escape sequences or machine reversible adaptations of them. The need to expand East Asian character sets constantly to provide for infrequently used characters poses problems whose solutions cannot be predicted at this time.

3.3 Discussion - Character Sets Present Field

As specified in principle (5), there is need for a special field which specifies what character sets are present whenever a set other than ASCII and the ALA extension of ASCII is present in a record. The proposed field will use tag 066 and be defined as follows:

066 Character Sets Present

This field specifies what character sets other than ASCII and the ALA extension of ASCII are present in the record. The field is not repeatable. Both indicators are unused and will contain blanks.

$a This subfield will contain all but the first character of the escape sequence to the default character set in columns 2-7 whenever the default character set is not ASCII. This is not likely to occur in records created in the United States. Since there can be only one default character set, the subfield is not repeatable.

$b This subfield will contain all but the first character of the escape sequence to the default character set in columns 10-15 whenever the default character set is not the ALA extension of ASCII. This is not likely to occur in records created in the United States. Since there can be only one default extension character set, this subfield is not repeatable.

$c This subfield will contain all but the first character of every escape sequence found in the record, however long the escape sequence is. If the same escape sequence occurs more than once, it will be given only once in this subfield.
The subfield is repeatable. This subfield does not identify the default character sets.

Example:

    ƀƀ$c)W        A record containing the ISO extended Cyrillic character set.
    ƀƀ$c)W$c)X    A record containing both the ISO Greek and extended Cyrillic character sets.

3.4 Discussion - Other Details

When a field has an indicator to specify the number of leading characters to be ignored in filing and the text of the field begins with an escape sequence, the length of the escape sequence will not be included in the character count.

When fields contain escape sequences to languages written from right to left, the field will still be given in its logical order. For example, the first letter of a Hebrew title would be the eighth character in a field (following the indicators, a delimiter, a subfield code, and a three-character escape sequence). The first letter would not appear just before the end of field character and proceed backwards to the beginning of the field.

A convention exists in descriptive cataloging fields that subfield content designation generally serves as a substitute for a space. An escape sequence can occur within a word, after a subfield code, or between two words not at a subfield boundary. For simplicity, the convention that an escape sequence does not replace a space should be adopted. One other convention is also advocated: when a space, subfield code, or punctuation mark (except open quote, parenthesis or bracket) is adjacent to an escape sequence, the escape sequence will come last.

Wayne Davison of RLIN raised the following issue. After the Library of Congress has prepared and distributed an entirely romanized cataloging record for a Russian book, a library with access to automated Cyrillic input and display capability will create a record for the same book with the title in the vernacular.
(Since AACR2 says to give the title in the original script "wherever practicable," the library could be said to be obligated to do so.) In such an event the local record could have all the authoritative Library of Congress access points. To keep this record current when the Library of Congress record is revised and redistributed, it would be necessary to carry the LC control number in the local record. Most automated systems are hypersensitive to the presence of two records with the same control number. The two records can be easily distinguished: in the Library of Congress record, the modified record byte in field 008 will be set to "o" and it will not have any 066, character sets present field.

A Comparison of OCLC, RLG/RLIN, and WLN

University of Oregon Library

The following comparison of three major bibliographic utilities was prepared by the University of Oregon Library's Cataloging Objectives Committee, Subcommittee on Bibliographic Utilities. Members of the subcommittee were Elaine Kemp, acting assistant university librarian for technical services; Rod Slade, coordinator of the library's computer search service; and Thomas Stave, head documents librarian.

The subcommittee attempted to produce a comparison that was concise and jargon-free for use with the university community in evaluating the bibliographic utilities under consideration. The University Faculty Library Committee was enlisted to review this document in draft form and held three meetings with the subcommittee for that purpose. The document was also shared with library faculty and staff in order to elicit suggestions for revision.