SEARCH ACROSS DIFFERENT MEDIA | BUCKLAND, CHEN, GEY, AND LARSON 181
INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006

Search across Different Media: Numeric Data Sets and Text Files

Michael Buckland, Aitao Chen, Fredric C. Gey, and Ray R. Larson

Michael Buckland (buckland@sims.berkeley.edu) is Emeritus Professor, School of Information, University of California, Berkeley; Aitao Chen (aitao@yahoo-inc.com) is a researcher at Yahoo!, Sunnyvale, California; Fredric C. Gey (gey@berkeley.edu) is an Information Scientist, UC Data Archive and Technical Assistance at the University of California, Berkeley; and Ray R. Larson (ray@sims.berkeley.edu) is a Professor, School of Information at the University of California, Berkeley.

Digital technology encourages the hope of searching across and between different media forms (text, sound, image, numeric data). Topic searches are described for two different media, text files and socioeconomic numeric databases, as is transverse searching, whereby retrieved text is used to find topically related numeric data and vice versa. Direct transverse searching across different media is impossible. Descriptive metadata provide enabling infrastructure, but usually require mappings between different vocabularies and a search-term recommender system. Statistical association techniques and natural-language processing can help. Searches in socioeconomic numeric databases ordinarily require that place and time be specified.

A hope for libraries is that new technology will support searching across an increasing range of resources in a growing digital landscape. The rise of the Internet provides a technological basis for shared access to a very wide range of resources. The reality is that network-accessible resources, like the contents of a well-stocked reference library, are quite heterogeneous, especially in the variety of indexing, classification, categorization, and other forms of metadata. However, the use of digital technology implies a degree of technical compatibility between different media, sometimes referred to as “media convergence,” and these developments encourage the prospect of being able to search across and between different media forms (notably text, images, sound, and numeric data sets) for different kinds of material relating to the same topic.

To examine the practical problems involved, the authors undertook to demonstrate searching between and across two different media forms: text files and socioeconomic numeric data sets.1

Two kinds of search are needed. First, it should be possible to do a topical search in multiple media resources, so that one can find, for example, both pertinent factual numeric data and relevant discussion. (One difficulty is that the vocabulary used to classify the numeric data is ordinarily quite different from the subject headings used for books, magazine articles, and newspaper stories about the same topic.) Second, when intriguing data values are encountered, one would like to move directly to topically relevant texts. Likewise, when a questionable statement is read, one would like to be able to find relevant statistical evidence. Therefore, there needs to be search support that facilitates such transverse searching among resources, establishing connections, transferring data, and invoking appropriate utilities in a helpful way.

Both problems were addressed through the design and demonstration of a gateway providing search support for both text and socioeconomic numeric databases. First, the gateway should help users conduct searches in databases of different media forms by accepting a query in the searcher’s own terms and then suggesting the specialized categorization terms to search for in the selected resource. Second, if something interesting was found in a socioeconomic database, the gateway would help the searcher to find documents on the same topic in a text database, and vice versa.

Selection of the best search terms in target databases is supported by the use of indexes to the categories (entries, headings, class numbers) in the system to be searched. These search-term recommender systems (also known as “entry vocabulary indexes”) resemble Dewey’s “Relativ Index,” but are created using statistical association techniques.2

Four characteristics of this investigation need to be noted:

1. Searching independent sources: The authors were not concerned with ingesting resources from different sources into a consolidated local data repository and searching within it. The interest lay, instead, in being able to search effectively in any accessible resource as and when one wants. This implies that interoperability issues in dealing with the native query languages and metadata vocabularies of remote repositories can be solved.

2. Search for independent content: Numeric data sets commonly have associated text in the form of documentation, code books, and commentary. However, the authors were interested in finding topical content that had no such formal or literary connection. Independent means, for example, a newspaper article written by someone unaware that relevant statistical data existed or had been written before the author’s article existed. In the other direction, having found statistical data of interest, could topically related text created independently of this particular data point be found?

3. Two different media forms were chosen: text and numeric data sets. They look similar because they both use arabic numerals, but the traditional practice in text retrieval of using any character string from the corpus as a query, although technically feasible, cannot be expected to be useful here.
One can copy a number expressing quantity, such as 12,941, from a numeric data cell, use it as a query in a text search engine such as Google, and retrieve a large and eclectic set, usually involving “12941” as an identifying number for a postal code, a memorandum, a part number, a software bug report, and so on, but the relationship is spurious. It requires great faith in numerology to expect anything topically related to the original data cell one started with. With other combinations of media forms, not even spurious results are feasible: one cannot submit a musical fragment or some pixels from an image as a text query.

4. The authors’ interest was in how to achieve a better return on existing investments in well-formed, edited resources with descriptive metadata. This project built directly on prior work on how to make more effective use of existing, expertly developed metadata, rather than creating or replacing metadata.

Search of multiple resources comes in two forms:

1. Parallel search is when a single query is sent to two or more resources at more or less the same time. For example, a researcher interested in the import of shrimp would like to see pertinent newspaper articles and trade statistics. Thus, one might send a query to the Census Bureau’s United States (U.S.) Imports and Exports numeric data series and look at SIC 0913 for shrimp and prawn and note a dramatic increase in imports from Vietnam through Los Angeles from 1995 onwards. One would also search newspaper indexes for articles such as “Normalizing ties to Vietnam important steps for U.S. firms; California stands to profit handsomely when barriers fall to trade with fast-growing country.”3 Different sources are likely to use different index terms or categories, so the challenge is how to express the searcher’s query in terms that will be effective for searching in the target resources, which, most likely, will use different vocabularies.
As one example, the term for “automobiles” is 3711 in the Standard Industrial Classification; TL 205 in the Library of Congress (LC) Classification; 180/280 in the U.S. Patent Classification; and, in the Census Bureau’s U.S. Imports and Exports data series, PASS MOT VEH, SPARK IGN ENG.4

2. Transverse search is when an item of interest found in one resource is used as the basis for a query to be forwarded to a different resource. The challenge here, again, is that when a query using the topical metadata in one resource needs to be expressed in the vocabulary of the target resource, the metadata vocabularies in the two resources will usually be different from each other, and, quite likely, both are unfamiliar to the searcher.

When searching within a single media form, it may be possible to use content itself directly as a query: A fragment of text in a source-text database is commonly used as a query in a target-text database. Similarly, one might start with an image and seek images that are measurably similar. However, because such direct search cannot be done when searching across different media forms, an indirect approach relying on the use of interpretive representations becomes necessary. As the network environment expands, mapping between vocabularies will be increasingly important.

■ Text and numeric resources

Text resource

A library catalog, a special case of text file, was chosen for use as the text resource rather than a corpus of “full text.” The reasons were practical: in this exploratory investigation, it was important to start with resources that had rich metadata; the resource needed to be sufficiently controllable to enable experimentation with it; a library catalog was in the spirit of the project in that it would lead to additional text resources; and a suitable resource, intended for metadata mapping, was available: a set of several million MARC records derived from MELVYL, the University of California online library catalog.
Socioeconomic numeric data set

Initially, and in prior work, the authors had worked on access to U.S. federal data series, especially import and export statistics and county business reports. Although some progress was made with interfaces to these data series, it became clear that the investment needed to craft interoperable access was high relative to the available staff. Crafting access to individual data series did not appear to be a scalable way to demonstrate variety within the authors’ limited resources, so attention was turned to a single collection comprising many diverse numeric tables, the Counting California database.5

■ Mapping topical metadata

Well-edited, high-quality databases typically have topical metadata expertly assigned from a vocabulary (thesaurus, classification, subject-heading system, or set of categories). But there is a Babel of different vocabularies. Not only do the names of topics vary, but the underlying concepts or categories may also differ. Effective searching requires expert familiarity with a system’s vocabulary; but as access to digital resources expands, the diversity of vocabularies increases and accessible resources are decreasingly likely to use vocabularies familiar to any individual searcher.

The best answer is twofold. First, it is desirable to have an index (a “mapping”) from the natural language of each group of searchers to the entries used in each metadata vocabulary. Such a mapping provides an index from a vocabulary familiar to the searcher to the vocabulary used in entries of the target system and so is called a search-term recommender system. (The authors called it an “entry-vocabulary index,” or EVI.) Dewey’s “Relativ Index” to his Decimal Classification is a familiar example. Second, when searching across databases, one also wants a mapping between pairs of system vocabularies.
Unfortunately, mappings between different vocabularies are rare, expensive, time-consuming, and hard to maintain. (The Unified Medical Language System is a notable example of such a mapping.)6 It is the authors’ impression that this problem is worse in searching across different media forms because databases in different media forms tend to be created by different communities, increasing the chances that they will use different categories, vocabularies, and ways of thinking.

Fortunately, where data containing two forms of vocabulary are available, they can be used as training sets for statistical-association techniques to generate EVIs automatically, and this is the approach that was used. (More details can be found in the appendix.)

From text words to Library Subject Headings

An EVI from ordinary English words to Library of Congress Subject Headings (LCSH) was created by taking catalog records containing at least one subject heading (6xx field in the MARC bibliographic format). From each of the 4,246,510 records used, main subject headings were extracted (subfield a from fields 600, 610, 611, 630, 650, and 651), along with the fields containing text: titles (245a), subtitles (245b), and summaries describing the scope and general content of the material (520a). The underlying assumption is that for each record, the words in the “text” fields (245a,b and 520a) tend to be characteristic of discourse on the subject (6xxa). Two examples, with identifying LCCNs in the <001> field, are:

<001>73180254 //r86
<245>A study of operant conditioning under delayed reinforcement in early infancy
<650>Infant psychology
<650>Operant conditioning

<001>73180255
<245>Reptilian disease: recognition and treatment
<650>Reptiles--Diseases

The words in the text fields (245a, 245b, and 520a) were extracted. Stop words were removed and the remainder normalized.
Then the degree to which each word is associated with each subject heading (by co-occurring in the same records) was computed using a maximum likelihood ratio-based measure. Natural-language processing can be used to identify adjective-noun phrases to support more precise searching using phrases as well as individual words. A very large matrix shows the association of each text word (or phrase) with each subject heading; so, for any given word (or combination of words), a list of the most closely associated headings, ranked by degree of association, can be derived from the matrix.

Queries

A query, which can be a single word, a phrase, a set of keywords, a book title, and so on, is normalized in the same way and looked up in the matrix to produce a ranked list of the most closely associated subject headings as candidate LCSH search terms. For example, entering the textual query words “Peanut” and “Butter” generates the following ranked list of LCSH main headings as candidates for searching:

Rank  LCSH (subfield 650a)
1.    Peanut
2.    Cookery (peanut butter)
3.    Cookery (peanuts)
4.    Peanut industry
5.    Peanut butter
6.    Butter
7.    Schulz, Charles M.

This display is an important departure from traditional fully automatic searching. The list is, in effect, a prompt, indicating probably suitable query terms in the vocabulary of the target resource. It introduces the searcher to the categories and terminology of the system and enables the searcher to use expert judgment to select the heading that seems best for the search.

From text words to the metadata vocabularies in numeric data sets

A training set of records containing both descriptive words and topical metadata is often not readily available for numeric data sets. The authors’ first effort was to create an EVI to the Standard Industrial Classification (SIC), widely used over many years in numeric data sets.
(SIC codes were associated with words by using, as a training set, the titles in a bibliographic database that used SIC codes.) But by the time the SIC EVI was completed, SIC had been discontinued and replaced by the North American Industry Classification System (NAICS), so a mapping was created from SIC codes to NAICS codes. Figures 1–3 show stages in an interface that accepts a searcher’s query “car” (figure 1), prompts with a ranked list of NAICS codes (figure 2), then extends the search with the selected NAICS code to retrieve numeric data (figure 3).

By this time, however, it had become apparent that, with the current low level of interoperability in software and in data formats, the labor required to create EVIs and interfaces to each large traditional numeric data series was enormous. Therefore, attention was turned to a collection of different numeric data sets available through a single interface, Counting California, made available by the California Digital Library at http://countingcalifornia.cdlib.org.

This resource is a collection of some three thousand numeric tables containing statistics related to a range of topics. The numeric data sets are mainly from the California Department of Health Services, the California Department of Finance, and the federal Bureau of the Census. The tables are organized under a two-level classification scheme. There are sixteen topics at the top level, which are subdivided into a total of 184 subtopics. All the numeric tables were assigned to one or more subtopics and each table has a caption. At the Counting California Web site, a searcher can browse for tables by selecting a higher-level topic, then a lower-level subtopic, and then a table.

Two additional ways were created to access the tables: probabilistic retrieval and an EVI to the topical categories.
The captions, topics, and subtopics were extracted for each of the three thousand tables, and XML records were created in the following form:

education
libraries
library statistics, statewide summary by type of library California 1992–93 to 1997–98
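
A record of this kind might be built as in the following minimal sketch. The element names `table`, `topic`, `subtopic`, and `caption` and the helper `table_record` are assumptions made for illustration; the source does not specify the markup that was used.

```python
import xml.etree.ElementTree as ET


def table_record(topic, subtopic, caption):
    """Build one XML record for a numeric table.

    The element names are hypothetical; only the three kinds of
    extracted value (topic, subtopic, caption) come from the text.
    """
    record = ET.Element("table")
    ET.SubElement(record, "topic").text = topic
    ET.SubElement(record, "subtopic").text = subtopic
    ET.SubElement(record, "caption").text = caption
    return record


record = table_record(
    "education",
    "libraries",
    "library statistics, statewide summary by type of library "
    "California 1992–93 to 1997–98",
)

# Serialize to a string so the record can be indexed or stored.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the topic, the subtopic, and the caption in separate elements lets the same records feed both access methods: the captions for full-text retrieval, and the caption words paired with subtopics as training data for an EVI.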
Figure 1. Query interface for search-term recommender system for the North American Industry Classification System
Figure 2. Display of NAICS code search-term recommendations for “car”
Figure 3. Display of numeric data retrieved using selected NAICS code

Retrieval

Two search methods were used:

Direct Probabilistic Retrieval. An in-house implementation of a probabilistic full-text retrieval algorithm developed at Berkeley was used.7 This search engine takes a free-form text query and returns a list of table captions ranked according to their relevance scores. For example, the five top-ranked captions returned for the query “Public Libraries in California” were:

1. Library statistics, Statewide summary by type of library California, 1992–93 to 1997–98 Table F6.
2. Library statistics, Statewide summary by type of library California, 1993–94 to 1998–99 Table F6YR0-0.
3. Number of California libraries, 1989 to 1999 Table F5YR00
4. Number of California libraries, 1989 to 1998, as of September Table F5.
5. California Public Schools, Grades K–12, 1989 to 1998 Table F4.

Each entry in the retrieved list is linked to a numeric table maintained at the Counting California Web site and, by clicking on the appropriate link, a user can display the table as an MS Excel file or as a PDF file.

Mediated Search. From the same extracted records, the words in the captions were used to create an EVI to the subtopics in the topic classification using the method already described. As an example, the query “personal individual income tax,” when submitted to the EVI, generated the following ranked list of subtopics:

1. Income
2. Government earnings and tax revenues
3. Personal income
4. Property tax
5. Personal income tax
6. Corporate income tax
7. Per capita income

A user can click on any subtopic to retrieve the captions of tables assigned that subtopic.
For example, clicking on the fifth subtopic, Personal income tax, retrieves:

■ Personal income tax returns: Number and amount of adjusted gross income reported by adjusted gross income class California, 1998 taxable year. Table D10YR00
■ Personal income tax returns: Number and amount of adjusted gross income reported by adjusted gross income class California, 1997 taxable year. Table D9
■ Personal income statistics by county, California 1997 taxable year. Table D10
■ Personal income statistics by county, California 1998 taxable year. Table D11YR00

■ Transverse searching between text- and numeric-data series

To demonstrate the searching capability from a bibliographic record to numeric-data sets, the first step is to retrieve and display a bibliographic record from an online catalog. A Web-based interface for searching online catalogs was implemented using an in-house implementation of the Z39.50 protocol. Besides the Z39.50 protocol, an important component that makes searching remote online catalogs feasible is the gateway between HTTP (Hypertext Transfer Protocol) and Z39.50. While HTTP is a stateless protocol in which each request stands alone, Z39.50 is a connection-oriented protocol. The gateway maintains connections to remote Z39.50 servers, and all search requests to any remote Z39.50 server go through the gateway.

Searching from catalog records to numeric data sets

Having selected some text (for the purposes of this study, a catalog record), how could one identify the facts or statistics in a numeric database that are most closely related to the topic? Clicking on a “formulate query” button placed at the end of a displayed full MARC record creates a query for searching a numeric database. The initial query contains the words extracted from the title, subtitle, and subject headings and is placed in a new window where the user can modify or expand the query before submitting it to the search engine for a numeric database.
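
The query-formulation step might be sketched roughly as follows. The record representation (a list of tag, subfield code, and text tuples), the function name `formulate_query`, and the sample record are all assumptions made for illustration; the source does not describe the gateway's internal data structures.

```python
def formulate_query(fields):
    """Build an initial query from a MARC-like record.

    `fields` is a hypothetical representation: a list of
    (tag, subfield_code, text) tuples. Title (245a), subtitle (245b),
    and subject headings (any 6xx field) contribute to the query.
    """
    parts = []
    for tag, code, text in fields:
        is_title = tag == "245" and code in ("a", "b")
        is_subject = tag.startswith("6")
        if is_title or is_subject:
            # Drop trailing cataloging punctuation such as " /" or ".".
            parts.append(text.strip(" /:;,."))
    return " ".join(parts)


# Hypothetical sample record, for illustration only.
sample_record = [
    ("245", "a", "Library laws of the State of California /"),
    ("260", "a", "Sacramento"),  # publication field, not used in the query
    ("650", "a", "Library legislation"),
    ("650", "z", "California."),
    ("650", "a", "Public libraries"),
]

query = formulate_query(sample_record)
print(query)
```

The result is deliberately just an editable string, which matches the description of a query placed in a new window for the user to modify before submission.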
So, for example, the following text extracted from a catalog record:

Library laws of the State of California, Library legislation. California. Public libraries

when submitted as a query, retrieves a ranked list of table names, of which two, covering different time periods, are entitled Library Statistics, Statewide Summary by Type of Library, California.

Searching from numeric data sets to catalog records

Transverse search in the other direction, starting from a data table, is achieved by forwarding the caption of a table to the word-to-LCSH EVI to generate a prompt list of the seven top-ranked LCSHs, any one of which can be used as a query submitted to the catalog.

■ Architecture

Figure 4 shows the structure of the implementation. The boxes shown in the figure are:

1. A search interface for accessing bibliographic/textual resources through a word-to-LCSH EVI.
2. A word-to-LCSH EVI.
3. A ranked list of LCSHs closely associated with the query.
4. An online catalog.
5. Results of searching the online catalog using an LCSH.
6. A full MARC record displayed in tagged form.
7. A new query formed by extracting the title and subject fields from the displayed full MARC record.
8. A numeric database.
9. A list of captions of numeric tables ranked by relevance score to the query.
10. Numeric table displayed in PDF or MS Excel format.
11. A search interface for numeric databases based on a probabilistic search algorithm.

A user can start a search using either interface (boxes 1 or 11) and, from either starting point, find records on the same topic of interest in a textual (here bibliographic) database and a socioeconomic database.

■ Conclusions and further work

Enhanced access to numeric data sets

The descriptive texts associated with numeric tables, such as the caption, headers, or row labels, are usually very short.
They provide a rather limited basis for locating the table in response to queries, or for describing a data cell sufficiently to form a usefully descriptive query from it. Sometimes the title (caption) of a table may be the only searchable textual description of the content of the table, and the titles are sometimes very general. For example, one of the titles, Library Statistics, Statewide Summary by Type of Library California, 1992–93 to 1997–98, is so general that neither the kinds of statistics nor the types of libraries are revealed. If a user posed the question, “What are the total operating expenditures of public libraries in California?” to a query system that indexes table titles only, the search may well be ineffective, since the only words in common between the table title and the user’s query are “California” and, if plural nouns have been normalized to the singular form, “library.”

Table column headings and row headings provide additional information about the content of a numeric table. However, the column and row headings are usually not directly searchable. For example, a table named “Language spoken at home” in the Counting California database consists of rows and columns. The column headings list the languages spoken at home, while the row headings show the county names in California. Each cell in the table gives the number of people, five years of age and older, who speak a specific language at home. A search that uses the table title alone may fail to retrieve the table that answers a question such as “How many people speak Spanish at home in Alameda County, California?”

It is recommended that the textual descriptions of numeric tables be enriched. Automatically combining the table title and its column and row headings would be a small but practical step toward improved retrieval.
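
The effect of this recommendation can be illustrated with a small sketch. The helper names `terms` and `enriched_description` and the sample column and row headings are assumptions for illustration (the source names the table and the kinds of headings, not their exact values), and plain term overlap stands in for a real retrieval score.

```python
import re


def terms(text):
    """Lowercased set of letter-and-digit tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def enriched_description(title, column_headings, row_headings):
    """Combine a table title with its column and row headings."""
    return " ".join([title, *column_headings, *row_headings])


title = "Language spoken at home"
columns = ["English", "Spanish", "Chinese"]  # illustrative language headings
rows = ["Alameda", "Alpine", "Amador"]       # illustrative county headings

query = "How many people speak Spanish at home in Alameda County, California?"

# Query terms matched by the title alone, then by the enriched description.
title_only = terms(query) & terms(title)
enriched = terms(query) & terms(enriched_description(title, columns, rows))

print(sorted(title_only))
print(sorted(enriched))
```

With the title alone, only the uninformative words "at" and "home" match; once the headings are included, the distinguishing terms "spanish" and "alameda" match as well.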
Geographic search

Socioeconomic numeric data series refer to particular areas and, in contrast to text searching, the geographical aspect ordinarily has to be specified. To match the geographical area of the numeric data, a matching text search may also have to specify the same place. The authors found that this was hard to achieve for several reasons. Place names are ambiguous and unstable: A search for data relating to Trinidad might lead to Trinidad, West Indies, instead of Trinidad, California, for example. The problem is compounded because, in numeric data series, specialized geopolitical divisions, such as census tracts and counties, are commonly used. These divisions do not match conveniently with searchers’ ordinary use of place names. Also, the granularity of geographical coverage may not match well. Data relating to Berkeley, for example, may be available only in aggregated data for Alameda County.

It was eventually concluded that reliance on the names of places could never work satisfactorily. The only effective path to reliable access to data relating to places would be to use geospatial coordinates (latitude and longitude) to establish unambiguously the identity and location of any place and the relationship between places. This means that gazetteers and map visualizations become important. Gazetteers relate named places to defined spaces, and thereby reveal spatial relationships between places, e.g., the city of Alameda is on Alameda Island within Alameda County. This problem has been addressed in a subsequent study entitled “Going Places in the Catalog: Improved Geographical Access.”8

Figure 4. Architecture of the prototype

Temporal search

Searches of text files and of socioeconomic numeric data series also differ substantially with respect to time periods: Numeric data searches ordinarily require the years of interest to be specified; text searches rarely specify the period.
An additional difficulty arises because in text, as in speech, a period is commonly referred to by a name derived metaphorically from events used as temporal markers, rather than by calendar time, as in “during Vietnam,” “under Clinton,” or “in the reign of Henry VIII.” Named time periods have some of the characteristics of place names: they are culturally based and tend to be multiple, unstable, and ambiguous. It appears that an analogous solution is indicated: directories of named time periods mapped to calendar definitions, much as a gazetteer links place names to spatial locators. This problem is being addressed in a subsequent study entitled “Support for the Learner: What, Where, When, and Who.”9

Media forms

It is paradoxical that, in an environment of digital “media convergence,” it appears impossible to search directly across different media forms. This invites closer attention to the concepts and terminology associated with media. A view that fits and explains the phenomena as the authors understand them distinguishes three aspects of media:

■ Cultural codes: All forms of expression depend on some shared understandings, on language in a broad sense. Convergence here means cultural convergence or interpretation.
■ Media types: Different types of expression have evolved: texts, images, numbers, diagrams, art. An initial classification can well start with the five senses of sight, smell, hearing, taste, and touch.
■ Physical media: Paper; film; analog magnetic tape; bits; . . . Being digital affects directly only this aspect.

Anything perceived as a meaningful document has cultural, type, and physical aspects, and genre usefully denotes specific combinations of code, type, and physical medium adopted by social convention. Genres are historically and culturally situated.

Convergence can be understood in terms of interoperability and is clearly seen in physical media technology.
The adoption of English as a language for international use in an increasingly global community promotes convergence in cultural codes. Nevertheless, the different media types are fundamentally distinct.

Metadata as infrastructure

It is the metadata and, in a very broad sense, “bibliographic” tools (thesauruses, mappings between vocabularies, place-name gazetteers, and the like) that provide the infrastructure necessary for searches across and between different media. In isolation, metadata is properly regarded as description attached to documents, but this is too narrow a view. Collectively, the metadata forms the infrastructure through which different documents can be related to each other. It is a variation on the role of citations: Individually, references amplify an individual document by validating statements made within it; collectively, as a citation index, references show the structure of scholarship to which documents are attached.

■ Summary

A project was undertaken to demonstrate simultaneous search of two different media types (socioeconomic numeric data series and text files) without ingesting these diverse resources into a shared environment. The project objective was eventually achieved, but proved harder than expected for the following reasons: access to these different media types has been developed by different communities with different practices; the systems (vocabularies) for topical categorization vary greatly and need interpretative mappings (also known as relative indexes, search-term recommender systems, and EVIs); and specification of geographical area and time period is necessary for search in socioeconomic data series and, for this, existing procedures for searching text files are inadequate.

■ Acknowledgement

This work was partially supported by the Institute of Museum and Library Services through National Library Leadership Grant No.
178 for a project entitled “Seamless Searching of Numeric and Textual Resources,” and was based on prior research partially supported by DARPA Contracts N66001-97-C-8541, AO# F477: “Search Support for Unfamiliar Metadata Vocabularies” and N66001-00-1-8911, TO# J290: “Translingual Information Management Using Domain Ontologies.”

References

1. Michael K. Buckland, Fredric C. Gey, and Ray R. Larson, Seamless Searching of Numeric and Textual Resources: Final Report on Institute of Museum and Library Services National Leadership Grant No. 178 (Berkeley, Calif.: Univ. of California, School of Information Management and Systems, 2002), http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf (accessed July 18, 2006); Michael Buckland et al., “Seamless Searching of Numeric and Textual Resources: Friday Afternoon Seminar, Feb. 14, 2003,” http://metadata.sims.berkeley.edu/papers/seamlessfri.ppt (accessed July 18, 2006).
2. Michael Buckland et al., “Mapping Entry Vocabulary to Unfamiliar Metadata Vocabularies,” D-Lib Magazine 5, no. 1 (Jan. 1999), www.dlib.org/dlib/january99/buckland/01buckland.html (accessed July 18, 2006); Michael Buckland, “The Significance of Vocabulary,” 2000, http://metadata.sims.berkeley.edu/vocabsig.ppt (accessed July 18, 2006); Fredric C. Gey et al., “Entry Vocabulary: A Technology to Enhance Digital Search,” in Proceedings of the First International Conference on Human Language Technology, San Diego, Mar. 2001 (San Francisco: Morgan Kaufmann, 2001), 91–95, http://metadata.sims.berkeley.edu/papers/hlt01-final.pdf (accessed July 18, 2006).
3. Los Angeles Times, July 12, 1995: D1.
4. Michael Buckland, “Vocabulary As a Central Concept in Library and Information Science,” in Digital Libraries: Interdisciplinary Concepts, Challenges, and Opportunities.
Proceedings of the Third International Conference on Conceptions of Library and Information Science (CoLIS3), Dubrovnik, Croatia, May 23–26, 1999, ed. T. Arpanac et al. (Lokve, Croatia: Benja Pubs., 1999), 3–12, www.sims.berkeley.edu/~buckland/colisvoc.htm (accessed July 18, 2006); Buckland et al., “Mapping Entry Vocabulary.”
5. Counting California, http://countingcalifornia.cdlib.org (accessed July 18, 2006).
6. “Factsheet: Unified Medical Language System,” www.nlm.nih.gov/pubs/factsheets/umls.html (accessed July 18, 2006).
7. William S. Cooper, Aitao Chen, and Fredric C. Gey, “Full-Text Retrieval Based on Probabilistic Equations with Coefficients Fitted by Logistic Regression,” in D. K. Harman, ed., The Second Text REtrieval Conference (TREC-2), March 1994, 57–66 (Gaithersburg, Md.: National Institute of Standards and Technology, 1994), http://trec.nist.gov/pubs/trec2/papers/txt/05.txt (accessed July 18, 2006).
8. “Going Places in the Catalog: Improved Geographical Access,” http://ecai.org/imls2002 (accessed July 18, 2006).
9. Vivien Petras, Ray Larson, and Michael Buckland, “Time Period Directories: A Metadata Infrastructure for Placing Events in Temporal and Geographic Context,” in Opening Information Horizons: Joint Conference on Digital Libraries (JCDL), Chapel Hill, N.C., June 11–15, 2006, forthcoming, http://metadata.sims.berkeley.edu/tpdJCDL06.pdf (accessed July 18, 2006); “Support for the Learner: What, Where, When, and Who,” http://ecai.org/imls2004 (accessed July 18, 2006).

Appendix: Statistical association methodology

A statistical maximum likelihood ratio weighting technique was used to construct a two-way contingency table relating each natural-language term (word or phrase) with each value in the metadata vocabulary of a resource, e.g., LCSH, LCCNs, U.S.
Patent Classification Numbers, and so on.1 An associative dictionary that will map words in natural languages into metadata terms can also, in reverse, return words in natural language that are closely associated with a metadata value. Training records containing two different metadata vocabularies can be used to create direct mappings between the values of the two metadata vocabularies. For example, U.S. patents contain both U.S. and International Patent Classification numbers and so can be used to create a mapping between these two quite different classifications. Multilingual training sets, such as catalog records for multilingual library collections, can be used to create multilingual natural language indexes to metadata vocabularies and, also, mappings between natural language vocabularies.

In addition to the maximum likelihood ratio-based association measure, there are a number of other association measures, such as the chi-square statistic, mutual information measure, and so on, that can be used in creating association dictionaries.

The training set used to create the word-to-LCSH EVI was a set of catalog records with at least one assigned LCSH (i.e., at least one 6xx field). Natural language terms were extracted from the title (field 245a), subtitle (245b), and summary note (520a). These terms were tokenized; the stopwords were removed; and the remaining words were normalized. A token here can contain only letters and digits. All tokens were then changed to lower case. The stoplist has about six hundred words considered not to be content bearing, such as pronouns, prepositions, coordinators, determiners, and the like.

The content words (those not treated as stopwords) were normalized using a table derived from an English morphological analyzer.2 The table maps plural nouns into singular ones; verbs into the infinitive form; and comparative and superlative adjectives to the positive form.
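The term-extraction steps just described (tokenize, lower-case, remove stopwords, normalize by table lookup) might be sketched as follows. This is an illustrative sketch only, not the authors' code: the stoplist and normalization table are tiny hypothetical stand-ins for the roughly six-hundred-word stoplist and the analyzer-derived table, the name extract_terms is invented here, tokens are assumed to consist of ASCII letters and digits, and lower-casing is assumed to happen before stopword matching.

```python
import re

# Hypothetical miniature stoplist; the article's stoplist has about 600 words.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are"}

# Hypothetical excerpt of a normalization table (surface form -> base form).
# The article derives its table from an English morphological analyzer.
NORMALIZATION_TABLE = {
    "printers": "printer",
    "children": "child",
    "longer": "long",
    "longest": "long",
    "printing": "print",
    "printed": "print",
    "prints": "print",
}

# Assumption: a token is a run of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def extract_terms(text):
    """Return normalized content words from a title, subtitle, or summary note."""
    # Tokenize and change every token to lower case.
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    # Drop words that are not considered content bearing.
    content_words = [token for token in tokens if token not in STOPWORDS]
    # Map each remaining word to its base form; unknown words pass through.
    return [NORMALIZATION_TABLE.get(word, word) for word in content_words]
```

Under these toy tables, extract_terms("The printers and the children") yields ["printer", "child"]. A plain dictionary lookup mirrors the table-driven approach in the text: a word absent from the table is simply left unchanged.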
For example, the plural noun printers is reduced to printer, and children to child; the comparative adjective longer and the superlative adjective longest are reduced to long; and printing, printed, and prints are all reduced to the same base form print. When a word belonging to more than one part-of-speech category can be reduced to more than one form, it is changed to the first form listed in the morphological analyzer table. As an example, the word saw, which can be a noun or the past tense of the verb to see, is not reduced to see.

Subject headings (field 6xxa) were extracted without qualifying subdivisions. The inclusion of foreign words (alcoholismo, alcoolisme, alkohol, and alcool), derived from titles in foreign languages, demonstrates that the technique is language independent and could be adopted in any country. It could also support diversity in U.S. libraries by allowing searches in Spanish or other languages, so long as the training set contains sufficient content words. EVIs are accessible at http://metadata.sims.berkeley.edu/prototypesI.html. Fuller descriptions of the project methodology can be found in the literature.3

■ References

1. Ted Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics 19 (March 1993): 61–74.
2. Daniel Karp et al., “A Freely Available Wide Coverage Morphological Analyzer for English,” in Proceedings of COLING-92, Nantes, 1992 (Morristown, N.J.: Association for Computational Linguistics, 1992), 950–55, http://acl.ldc.upenn.edu/C/C92/C92-3145.pdf (accessed July 18, 2006).
3. Michael K. Buckland, Fredric C. Gey, and Ray R. Larson, Seamless Searching of Numeric and Textual Resources: Final Report on Institute of Museum and Library Services National Leadership Grant No. 178 (Berkeley, Calif.: Univ. of California, School of Information Management and Systems, 2002), http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf (accessed July
18, 2006); Youngin Kim et al., “Using Ordinary Language to Access Metadata of Diverse Types of Information Resources: Trade Classification and Numeric Data,” in Knowledge: Creation, Organization, and Use. Proceedings of the American Society for Information Science Annual Meeting, Oct. 29–Nov. 4, 1999 (Medford, N.J.: Information Today, 1999), 172–80.
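As an illustration of the association weighting described at the start of this appendix, the sketch below scores one cell pairing (one natural-language term against one metadata value) from a two-way contingency table. It assumes the weighting follows the log-likelihood ratio (G²) statistic discussed in the Dunning paper cited above; the appendix does not give the exact formula, so the function name, argument layout, and treatment of zero counts are assumptions made for this example.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of record counts (illustrative sketch).

    k11: records containing both the term and the metadata value
    k12: records containing the term but not the metadata value
    k21: records containing the metadata value but not the term
    k22: records containing neither
    """
    total = k11 + k12 + k21 + k22
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            count = observed[i][j]
            if count > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += count * math.log(count / expected)
    return 2.0 * g2
```

A table showing no association, such as (10, 10, 10, 10), scores 0, and larger scores indicate stronger evidence that the term and the metadata value do not occur independently. The statistic by itself does not say whether the association is positive or negative, so a dictionary built this way would also need to compare the observed co-occurrence count with its expected value.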