Tracking the History of Romani Publications: Challenges Presented by Flawed Data

Geoff Husic1

Abstract: Romani is a language of northern Indic origin spoken natively by an estimated 2.5 million people, primarily in Eurasia but also in North America. The history of publication patterns in Romani has not been well documented. Extracting data about this history from large bibliographic databases such as OCLC WorldCat has been hampered by the misapplication of certain language codes, which made it all but impossible to filter search results efficiently using the Romani language as a parameter. The author discusses how he was able to correct much of this inaccurate data in OCLC WorldCat.

Keywords: Romani (Romany) language publications, OCLC WorldCat, Cataloging databases, Language codes.

1 Slavic & Near East Studies Librarian. BA Russian and German (Middlebury College), MA Slavic Languages and Literatures (University of Kansas), MS Library and Information Science (University of Illinois). Room 519, Watson Library, University of Kansas, 1425 Jayhawk Blvd, Lawrence, KS 66045-7544. (husic@ku.edu).

The history of publication in the Romani language or on the topic of Romani has not been very well documented.2 This holds true especially for Internet publications, but is also true for more traditional paper-based publications. As the Internet has been embraced as a convenient means for publishing and blogging, it has become a boon for linguistic minorities, such as the Roma, that wish to make information about their cultures and languages better known to the world at large. It has served as a convenient venue for publishing news, cultural information, literature, blogs, and chat room content in lesser-known languages in a way that would have been politically or economically unfeasible in the pre-Internet print world.
One good example of this phenomenon can be found in Wikipedia, where much excellent information about lesser-known languages has been entered, curated, and edited by those who have an obvious interest in and affection for these languages and the cultures of their speakers. I have been monitoring the development of Romani-language Web publishing over the last ten years in this very context. Because Romani can be written in a variety of dialects and rather chaotic writing systems, discovering Romani sources on the Web can be challenging.

Because of the rapid movement of many aspects of publishing to the Internet, it seemed to me a good time, as a companion project, to try to create a thorough retrospective bibliography of materials published about the Romani language, or in any of the several Romani dialects, that have appeared in print in the last 300 or so years. By necessity this bibliography, when completed, will primarily index materials at the level of the whole book or journal.

Unlike the Web, in which data is, for the most part, still very unstructured, library bibliographic databases are universally based on very structured data schemes. Theoretically this should make it very easy to extract bibliographic data concerning the publication history of a particular language, in my case Romani. However, the path to even beginning an analysis of Romani publications has been somewhat complicated.

2 Romani is the Library of Congress spelling of the language name; the spelling Romany is more conventional in English.

The database on which I wished to base the analysis was OCLC WorldCat,3 the bibliographic database used by the majority of North American academic libraries for cataloging and reference purposes. Its use has also steadily expanded to libraries worldwide. Bibliographic records in this database are encoded in the MARC format, which allows for very granular encoding of information for each bibliographic entity it represents.
These entities most typically represent texts (books, journals, manuscripts, etc.) but can also be scores, maps, computer files, sound files, etc. Catalogers who use this database to catalog locally held library materials, the records of which are then uploaded to their library’s catalog, usually do their work through a technical-services interface called Connexion. Some libraries employ another method: they create records in their local library catalog client and then upload those records to OCLC WorldCat. The latter libraries, when cataloging ‘copy’ records (those that already have some kind of record in OCLC WorldCat, which can vary widely in quality and fullness), may choose to make corrections and enhancements in their local catalog only. This sometimes makes sense from a workflow perspective, but it does not benefit other libraries that subsequently need to use these bibliographic records, because the errors remain uncorrected in the WorldCat master records. As a result of this practice, some libraries will have somewhat more accurate data in their local catalogs than was originally imported from WorldCat.

3 WorldCat has a variety of public and cataloging interfaces. The client version used by libraries for cataloging purposes is called Connexion.

Among the kinds of information encoded about each book or journal are the language or languages represented in the work. Each language represented is encoded using OCLC’s three-letter language designation, on which the ISO Codes for the Representation of Names of Languages is also based.4 These codes are added to a dedicated portion of the MARC record (the fixed-field language field) and can be used in WorldCat, and in local library catalogs into which MARC records have been loaded, to limit catalog search results by language.
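
As a minimal sketch of how such a fixed-field code might be read programmatically: in MARC 21 bibliographic records the fixed-length 008 field is 40 characters long, and the language code occupies character positions 35-37 (counting from 0). That positional detail comes from the MARC 21 format rather than from the discussion here, and the function names and the sample 008 strings below are invented for illustration.

```python
from typing import Optional


def language_from_008(field_008: str) -> Optional[str]:
    """Return the three-letter language code from a MARC 21 bibliographic 008 field.

    Assumes the standard 40-character 008 layout, in which positions 35-37
    (0-based) hold the language code. Returns None if the field is too short
    or if those positions are blank or hold fill characters.
    """
    if len(field_008) < 38:
        return None
    code = field_008[35:38].strip().lower()
    if len(code) != 3 or not code.isalpha():
        return None
    return code


def build_008(language: str) -> str:
    """Build a made-up 40-character 008 string; only positions 35-37 matter here."""
    # 18 characters of invented date and place data, 17 blanks, the language
    # code, then 2 trailing characters.
    return "850101s1984    xx " + " " * 17 + language + " d"


for lang in ("rum", "rom", "roh"):
    print(lang, "->", language_from_008(build_008(lang)))
```

A catalog that limits search results by language is, in effect, comparing this three-character value against the code the user selected, which is why a single mistyped letter is enough to file a record under the wrong language.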
For items that are multilingual, there is an additional MARC field, the 041 field, in which further language information can be recorded, such as the languages of multilingual texts, the original language of translations, the languages of summaries, etc. Most of the common language codes are transparent and easy to remember, e.g. eng for English, rus for Russian, etc. Occasionally, codes for less commonly encountered languages must be looked up by catalogers, who can’t be expected to have memorized all of the hundreds of language codes available. However, due to oversight by some catalogers, a situation had developed in WorldCat that had eroded the ability to identify Romani-language materials in the database.

This is the essence of the problem: The official OCLC language code for Romanian is rum. This code was chosen because, at the time these codes were established, Rumanian was still the more common spelling in English. The spelling Romanian became more common starting in the late 1960s and is now the standard in English. Understandably, most libraries are much more likely to encounter and catalog materials in Romanian than in Romani. What has occurred over the years is that many libraries, when cataloging Romanian-language materials in the OCLC WorldCat database, have miscoded them with the language code rom instead of rum. However, rom is actually the language code assigned to Romani. These coding errors have resulted in many Romanian records being miscoded as Romani. As there are obviously several orders of magnitude more published works in Romanian than in Romani, this has made the task of extracting information about Romani publications all but impossible.

4 See http://www.loc.gov/standards/iso639-2/php/code_list.php for a full list of these language codes.

In late 2011, I contacted OCLC to alert them to this problem and to solicit their cooperation in correcting it.
I informed them that my goal was to extract information from the database about Romani materials in order to construct a thorough retrospective bibliography and chronology, and that, as a cataloger, I also wished to see these coding errors corrected. OCLC technical staff were very cooperative and eager to help. They provided me with an initial spreadsheet of all records in OCLC that had the language code rom (Romani) in either the MARC language fixed field or the MARC 041 field. This spreadsheet was helpful for getting an initial overview of the scope of the problem.

For a variety of reasons, I soon decided to abandon the spreadsheet approach to scrutinizing the data and to do my work directly through the OCLC Connexion interface. The main impediment was that the spreadsheet also included hundreds of duplicate records for many books, based on cataloging done mainly by European national libraries. In these records the language of cataloging, i.e. that of the description, notes, subject headings, etc., is not English but rather the language of the national cataloging agency. I felt that it was unmanageable for me to attempt to correct all these records. A recent enhancement to the OCLC Connexion software allowed me to limit my searches easily to bibliographic records produced by English-language cataloging agencies. These are the records that will be used by North American and British libraries, so I felt my efforts were best directed at correcting these.

Sifting through and correcting the records that needed to be reviewed (over 2600) required familiarity with Romanian, Romani, and a number of other languages. Fortunately, as a specialist in Eastern European languages who is proficient in Romanian and very knowledgeable in several Romani dialects, I was eager to lend my assistance to the task.
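
The selection described above (every record carrying rom in either the fixed field or the 041 field, then narrowed to English-language cataloging) can be sketched roughly as follows. This is an illustration only, not the procedure that OCLC or Connexion actually uses: the record dictionaries, their keys, and the sample values are hypothetical, and the language of cataloging is assumed to be available as a separate code (in MARC 21 it is recorded in field 040, subfield b).

```python
from typing import Dict, Iterable, List

# Hypothetical, simplified records: each dict stands in for one MARC record.
SAMPLE_RECORDS: List[Dict] = [
    {"id": "a1", "fixed_lang": "rom", "langs_041": [], "cataloging_lang": "eng"},
    {"id": "a2", "fixed_lang": "rum", "langs_041": ["rum", "rom"], "cataloging_lang": "eng"},
    {"id": "a3", "fixed_lang": "rom", "langs_041": [], "cataloging_lang": "ger"},
    {"id": "a4", "fixed_lang": "rum", "langs_041": [], "cataloging_lang": "eng"},
]


def coded_as(record: Dict, code: str) -> bool:
    """True if the code appears in the fixed field or among the 041 codes."""
    return record["fixed_lang"] == code or code in record["langs_041"]


def records_to_review(records: Iterable[Dict], code: str = "rom",
                      cataloging_lang: str = "eng") -> List[Dict]:
    """Select records carrying `code` that were cataloged in `cataloging_lang`."""
    return [
        r for r in records
        if coded_as(r, code) and r["cataloging_lang"] == cataloging_lang
    ]


review_ids = [r["id"] for r in records_to_review(SAMPLE_RECORDS)]
print(review_ids)
```

Selection of this kind only narrows the set. Whether rom is actually correct for a given record still has to be judged by someone who can read the language of the item.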
Scouring the records coded rom, I was able to identify those that clearly had some Romani content and thus eliminate them from the problematic set. I had to scrutinize the remainder with care. While the problem originated in the confusion of the language codes for Romanian and Romani, there are in fact many works that contain both Romanian and Romani content, so I needed to ensure that I didn’t eliminate works purely on the basis of, say, a Romanian title. Ultimately I ended up with approximately 1400 items that required further scrutiny. I then downloaded the OCLC records for these items so that I could view the full bibliographic information, such as subject headings, to help me ascertain the content.

The following is a brief overview of the kinds of errors I identified when examining the problematic set. Approximately 1200 were actually Romanian-language materials that had been miscoded as rom. Forty or so were Romansh-language materials, the correct code for which is roh. In addition, there were quite a few other oddities, such as rom being coded for Latin texts, or apparently as a result of confusion with the place of publication, as in the case of an English text published in Rome. Several dozen more were actually texts in Hungarian and other languages, incorrectly coded rom, that were printed in Romania. In these cases a cataloger unfamiliar with the languages presumably inferred the incorrect language from the place of publication, or perhaps the codes were artifacts of an automatic conversion project. A small number were coded rom because a cataloger apparently thought this was the proper way to indicate something in the roman script. Finally, in a few cases, not only was the record coded rom but it also contained textual language notes such as “In Hungarian and Romanian.” This is rather curious and was perhaps the result of notes automatically generated from the fixed-field or 041 language codes.
It is difficult to tell for certain.

In those cases where I found the language code rom to be incorrectly assigned, I was able, as a cataloger in an OCLC Enhance Program library, to correct most of the OCLC WorldCat master records to reflect the correct language.5 In many cases I also fixed incorrect language notes as appropriate. There were quite a few cases where I was unable to correct the codes. These were mainly cases in which other incorrect coding in the records made it impossible to update the master record. Some errors also appeared in records for music scores. I am not familiar with the scores format, so I was reluctant to make corrections to these records for fear of causing unforeseen problems.

5 The Enhance Program allows qualified member libraries to correct and add information to bibliographic records in OCLC WorldCat.

The bulk of this project has now been completed. Users of WorldCat will now be able to filter results using the Romani language as a search parameter much more reliably, if not yet perfectly, than before. There are a number of items for which I will need to contact OCLC or other libraries to have the master records fixed. I intend to monitor new items added to WorldCat periodically in order to catch the new incorrectly coded records that are sure to be added. I would encourage other WorldCat users also to correct these mistaken language codes when they encounter them, especially for minority languages such as Romani. A few major academic libraries have also made corrections in their local catalogs based on my personal communications with them about this issue.

Now begins the hard part! Having cleaned up much of the data in WorldCat, I have begun to examine how best to extract the data. I will likely import the data into EndNote (a citation management program), Zotero (a bibliographic tool plugin for the Mozilla Firefox browser), or both.
Certain character encoding issues occur when importing bibliographic data into these software tools, and these will also have to be addressed to minimize the amount of manual editing I must do. When the majority of the data has been imported satisfactorily, I can begin to correct any remaining errors and add further useful metadata as well as my own annotations. I can then finally begin an analysis of the actual publication patterns based on time period, dialect, place of publication, genre, and other parameters of interest. Results will be published upon completion in a venue to be determined.
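
The character encoding issues mentioned above are not specified further. One common class of problem with Latin-script text carrying diacritics, as Romanian and many Romani orthographies do, is that the same letter can be stored either as a single precomposed character or as a base letter followed by a combining mark, so that visually identical strings fail to match. A minimal sketch, assuming that this is among the issues and that the data is already in Unicode, normalizes every string to a single form (NFC) before import. The function name and the sample strings are illustrative only.

```python
import unicodedata


def normalize_for_import(text: str) -> str:
    """Return the text in Unicode Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)


# The same sample word written two ways: with a combining caron (U+030C)
# after "s", and with the precomposed letter U+0161.
decomposed = "s\u030cukar"
precomposed = "\u0161ukar"

print(decomposed == precomposed)  # False: the underlying code points differ
print(normalize_for_import(decomposed) == normalize_for_import(precomposed))  # True
```

Normalizing in one direction before import means that later sorting, searching, and de-duplication in the citation tool operate on a single spelling of each title.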