SEARCH ACROSS DIFFERENT MEDIA | BUCKLAND, CHEN, GEY, AND LARSON 181
Digital technology encourages the hope of searching
across and between different media forms (text, sound,
image, numeric data). This paper describes topic searches in
two different media, text files and socioeconomic numeric
databases, and also transverse searching, whereby
retrieved text is used to find topically related numeric
data and vice versa. Direct transverse searching across
different media is impossible. Descriptive metadata pro-
vide enabling infrastructure, but usually require map-
pings between different vocabularies and a search-term
recommender system. Statistical association techniques
and natural-language processing can help. Searches in
socioeconomic numeric databases ordinarily require that
place and time be specified.
A hope for libraries is that new technology will
support searching across an increasing range of
resources in a growing digital landscape. The rise
of the Internet provides a technological basis for shared
access to a very wide range of resources. The reality is
that network-accessible resources, like the contents of a
well-stocked reference library, are quite heterogeneous,
especially in the variety of indexing, classification, catego-
rization, and other forms of metadata. However, the use of
digital technology implies a degree of technical compat-
ibility between different media, sometimes referred to as
“media convergence,” and these developments encourage
the prospect of being able to search across and between
different media forms—notably text, images, sound, and
numeric data sets—for different kinds of material relat-
ing to the same topic. To examine the practical problems
involved, the authors undertook to demonstrate searching
between and across two different media forms: text files
and socioeconomic numeric data sets.1
Two kinds of search are needed. First, it should be pos-
sible to do a topical search in multiple media resources,
so that one can find, for example, both pertinent factual
numeric data and relevant discussion. (One difficulty is
that the vocabulary used to classify the numeric data is
ordinarily quite different from the subject headings used
for books, magazine articles, and newspaper stories about
the same topic.) Second, when intriguing data values are
encountered, one would like to move directly to topically
relevant texts. Likewise, when a questionable statement
is read, one would like to be able to find relevant statisti-
cal evidence. Therefore, there needs to be search support
that facilitates such transverse searching among resources,
establishing connections, transferring data, and invoking
appropriate utilities in a helpful way.
Both problems were addressed through the design
and demonstration of a gateway providing search sup-
port for both text and socioeconomic numeric databases.
First, the gateway should help users conduct searches in
databases of different media forms by accepting a query
in the searcher’s own terms and then suggesting the spe-
cialized categorization terms to search for in the selected
resource. Second, if something interesting was found in
a socioeconomic database, the gateway would help the
searcher to find documents on the same topic in a text
database, and vice versa. Selection of the best search terms
in target databases is supported by the use of indexes to the
categories (entries, headings, class numbers) in the system
to be searched. These search-term recommender systems
(also known as “entry vocabulary indexes”) resemble
Dewey’s “Relativ Index,” but are created using statistical
association techniques.2
Four characteristics of this investigation need to be
noted:
1. Searching independent sources: The authors were
not concerned with ingesting resources from differ-
ent sources into a consolidated local data repository
and searching within it. The interest lay, instead, in
being able to search effectively in any accessible
resource as and when one wants. This implies that
interoperability issues in dealing with the native
query languages and metadata vocabularies of
remote repositories can be solved.
2. Search for independent content: Numeric data
sets commonly have associated text in the form
of documentation, code books, and commentary.
However, the authors were interested in finding
topical content that had no such formal or liter-
ary connection. Independent means, for example,
a newspaper article written by someone unaware
that relevant statistical data existed or had been
written before the author’s article existed. In the
other direction, having found statistical data of
interest, could topically related text created inde-
pendently of this particular data point be found?
3. Two different media forms were chosen: text and
numeric data sets. They look similar because they
both use arabic numerals, but the traditional reli-
ance on information retrieval in a text environment
Search across Different
Media: Numeric Data
Sets and Text Files
Michael Buckland, Aitao Chen,
Fredric C. Gey, and Ray R. Larson
Michael Buckland (buckland@sims.berkeley.edu) is Emeritus
Professor, School of Information, University of California,
Berkeley; Aitao Chen (aitao@yahoo-inc.com) is a researcher
at Yahoo!, Sunnyvale, California; Fredric C. Gey (gey@berkeley
.edu) is an Information Scientist, UC Data Archive and Technical
Assistance at the University of California, Berkeley; and Ray
R. Larson (ray@sims.berkeley.edu) is a Professor, School of
Information at the University of California, Berkeley.
182 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006
of using any character string from the corpus as
a query, although technically feasible, cannot be
expected to be useful here. One can copy a number
expressing quantity, such as 12,941, from a numeric
data cell, use it as a query in a text search engine
such as Google, and retrieve a large and eclectic
retrieved set, usually involving “12941” as an iden-
tifying number for a postal code, a memorandum,
a part number, software bug report, and so on, but
the relationship is spurious. It requires great faith
in numerology to expect anything topically mean-
ingful to the original data cell one started with.
With other combinations of media forms, not even
spurious results are feasible: one cannot submit a
musical fragment or some pixels from an image as
a text query.
4. The authors’ interest was in how to achieve a bet-
ter return on existing investments in well-formed,
edited resources with descriptive metadata. This
project built directly on prior work on how to make
more effective use of existing, expertly developed
metadata, rather than creating or replacing meta-
data.
Search of multiple resources comes in two forms:
1. Parallel search is when a single query is sent to two
or more resources at more or less the same time.
For example, a researcher interested in the import
of shrimp would like to see pertinent newspaper
articles and trade statistics. Thus, one might send a
query to the Census Bureau’s United States (U.S.)
Imports and Exports numeric data series and look
at SIC 0913 for shrimp and prawn and note a dra-
matic increase in imports from Vietnam through
Los Angeles from 1995 onwards. One would also
search newspaper indexes for articles such as
“Normalizing ties to Vietnam important steps for
U.S. firms; California stands to profit handsomely
when barriers fall to trade with fast-growing coun-
try.”3 Different sources are likely to use different
index terms or categories, so the challenge is how
to express the searcher’s query in terms that will
be effective for searching in the target resources,
which, most likely, will use different vocabularies.
As one example, the term for “automobiles” is
3711 in the Standard Industrial Classification; TL
205 in the Library of Congress (LC) Classification;
180/280 in the U.S. Patent Classification; and, in
the Census Bureau’s U.S. Imports and Exports data
series, PASS MOT VEH, SPARK IGN ENG.4
2. Transverse search is when an item of interest found
in one resource is used as the basis for a query to
be forwarded to a different resource. The challenge
here, again, is that when a query using the topical
metadata in one resource needs to be expressed in
the vocabulary of the target resource, the metadata
vocabularies in the two resources will usually be
different from each other, and, quite likely, both are
unfamiliar to the searcher.
When searching within a single media form, it may be
possible to use content itself directly as a query: A frag-
ment of text in a source-text database is commonly used
as a query in a target-text database. Similarly, one might
start with an image and seek images that are measur-
ably similar. However, because such direct search cannot
be done when searching across different media forms,
an indirect approach relying on the use of interpretive
representations becomes necessary. As the network envi-
ronment expands, mapping between vocabularies will be
increasingly important.
■ Text and numeric resources
Text resource
A library catalog—a special case of text file—was chosen
for use as a text file rather than a corpus of “full text.”
The reasons were practical. In this exploratory investigation,
it was important to start with resources that had rich
metadata. The resource also needed to be sufficiently
controllable to enable experimentation with it. A library
catalog was in the spirit of the project in that it would lead
to additional text resources. Finally, a suitable resource
intended for metadata mapping was available: a set
of several million MARC records, derived from MELVYL,
the University of California online library catalog.
Socioeconomic numeric data set
Initially, and in prior work, the authors had worked on
access to U.S. federal data series, especially import and
export statistics and county business reports. Although
some progress was made with interfaces to these data
series, it became clear that the investment needed to craft
interoperable access was high relative to the available staff.
Crafting access to individual data series did not appear
to be a scalable way to demonstrate variety within the
authors’ limited resources, so attention was turned to a
single collection comprising many diverse numeric tables,
the Counting California database.5
■ Mapping topical metadata
Well-edited, high-quality databases typically have topi-
cal metadata expertly assigned from a vocabulary (the-
saurus, classification, subject-heading system, or set of
categories). But there is a Babel of different vocabularies.
Not only do the names of topics vary, but the underlying
concepts or categories may also differ. Effective searching
requires expert familiarity with a system’s vocabulary;
but as access to digital resources expands, the diversity
of vocabularies increases and accessible resources are
decreasingly likely to use vocabularies familiar to any
individual searcher. The best answer is twofold: First, it is
desirable to have an index (a “mapping”) from the natural
language of each group of searchers to the entries used in
each metadata vocabulary. Such a mapping provides an
index from a vocabulary familiar to the searcher to the
vocabulary used in entries of the target system and so is
called a search-term recommender system. (The authors
called it an “entry-vocabulary index,” or EVI.) Dewey’s
“Relativ Index” to his Decimal Classification is a famil-
iar example. When searching across databases, one also
wants a second kind of mapping: between pairs of system
vocabularies. Unfortunately, mappings between different
vocabularies are rare, expensive, time-consuming, and
hard to maintain. (The Unified Medical Language System
is a notable example.)6 It is the authors’ impression that
this problem is worse in searching across different media
forms because databases in different media forms tend to
be created by different communities, increasing the chances
that they will use different categories, vocabularies, and
ways of thinking.
Fortunately, where data containing two forms of
vocabulary are available, they can be used as training sets
for statistical-association techniques to generate EVIs auto-
matically, and this is the approach that was used. (More
details can be found in the appendix.)
From text words to Library Subject Headings
An EVI from ordinary English words to Library of
Congress Subject Headings (LCSH) was created by taking
catalog records containing at least one subject heading
(6xx field in the MARC bibliographic format). From each
of the 4,246,510 records used, main subject headings were
extracted (subfield a from fields 600, 610, 611, 630, 650,
and 651) and fields containing text: titles (245a), subtitles
(245b), and summaries describing the scope and general
content of the material (520a). The underlying assump-
tion is that for each record, the words in the “text” fields
(245a,b and 520a) tend to be characteristic of discourse on
the subject (6xxa). Two examples, with identifying LCCNs
in the <001> field are:
<001>73180254 //r86</001>
<245>A study of operant conditioning under delayed
reinforcement in early infancy</245>
<650>Infant psychology</650>
<650>Operant conditioning</650>
<001>73180255</001>
<245>Reptilian disease: recognition and
treatment</245>
<650>Reptiles--Diseases</650>
The words in the text fields (245a, 245b, and 520a) were
extracted. Stop words were removed and the remainder
normalized. Then the degree to which each word is asso-
ciated with each subject heading (by co-occurring in the
same records) was computed using a maximum likelihood
ratio-based measure. Natural-language processing can be
used to identify adjective-noun phrases to support more
precise searching using phrases as well as individual
words. A very large matrix shows the association of each
text word (or phrase) with each subject heading; so, for
any given word (or combination of words), a list of the
most closely associated headings, ranked by degree of
association, can be derived from the matrix.
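The computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a Dunning-style log-likelihood ratio statistic as one plausible reading of "maximum likelihood ratio-based measure." The toy records, the function names (llr, build_associations), and the omission of stop-word removal and normalization are all assumptions made for brevity.

```python
import math
from collections import Counter
from itertools import product

# Toy training records: (words from text fields, subject headings).
# Illustrative only; a real training set would hold millions of records.
RECORDS = [
    ({"operant", "conditioning", "infancy"}, {"Infant psychology", "Operant conditioning"}),
    ({"reptilian", "disease", "treatment"}, {"Reptiles"}),
    ({"reptile", "disease", "care"}, {"Reptiles"}),
    ({"infant", "conditioning", "study"}, {"Operant conditioning"}),
]


def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 co-occurrence table.

    k11: records with word and heading   k12: word without heading
    k21: heading without word            k22: neither
    """
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    total = _xlogx(k11 + k12 + k21 + k22)
    return max(0.0, 2.0 * (cells + total - rows - cols))


def build_associations(records):
    """Return {(word, heading): score} for every pair that co-occurs."""
    n = len(records)
    word_df, head_df, pair_df = Counter(), Counter(), Counter()
    for words, heads in records:
        word_df.update(words)
        head_df.update(heads)
        pair_df.update(product(words, heads))
    table = {}
    for (word, head), k11 in pair_df.items():
        k12 = word_df[word] - k11
        k21 = head_df[head] - k11
        k22 = n - k11 - k12 - k21
        table[(word, head)] = llr(k11, k12, k21, k22)
    return table


ASSOC = build_associations(RECORDS)
```

Because this statistic measures departure from independence in either direction, a fuller version would also need to check that a high score reflects a positive association rather than a negative one.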
Queries
A query, which can be a single word, a phrase, a set of
keywords, a book title, and so on, is normalized in the
same way and looked up in the matrix to produce a ranked
list of the most closely associated subject headings as
candidate LCSH search terms. For example, entering the
textual query words “Peanut” and “Butter” generates the
following ranked list of LCSH main headings as candidates
for searching:
Rank LCSH (subfield 650a)
1. Peanut
2. Cookery (peanut butter)
3. Cookery (peanuts)
4. Peanut industry
5. Peanut butter
6. Butter
7. Schulz, Charles M.
This display is an important departure from traditional
fully automatic searching. The list is, in effect, a prompt,
indicating query terms that are probably suitable in the vocabulary
of the target resource. It introduces the searcher to the
categories and terminology of the system and enables the
searcher to use expert judgment to select the heading that
seems best for the search.
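The lookup step can be illustrated with a short sketch. The association scores below are invented, the normalization is deliberately crude, and summing scores across query words is an assumption, since the text does not state how multiple words are combined. The resulting order is therefore not meant to reproduce the ranking shown above.

```python
import re
from collections import defaultdict

# Hypothetical (word, heading) -> association strength; values are invented.
ASSOCIATIONS = {
    ("peanut", "Peanut"): 9.1,
    ("peanut", "Cookery (peanut butter)"): 6.0,
    ("butter", "Cookery (peanut butter)"): 5.2,
    ("butter", "Butter"): 4.0,
    ("peanut", "Schulz, Charles M."): 1.5,
}
STOP_WORDS = {"a", "an", "and", "of", "the", "in"}


def normalize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, strip a plural 's'."""
    words = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        words.append(token)
    return words


def recommend(query, associations, top_n=7):
    """Return up to top_n headings, ranked by summed association strength."""
    query_words = set(normalize(query))
    scores = defaultdict(float)
    for (word, heading), strength in associations.items():
        if word in query_words:
            scores[heading] += strength
    return sorted(scores, key=lambda h: (-scores[h], h))[:top_n]


suggestions = recommend("Peanut Butter", ASSOCIATIONS)
```

The function returns a list of candidate headings rather than running a search, which matches the prompting role described above: the searcher, not the system, makes the final choice.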
From text words to the metadata vocabularies
in numeric data sets
A training set of records containing both descriptive words
and topical metadata is often not readily available for
numeric data sets. The authors’ first effort was to create an
EVI to the Standard Industrial Classification (SIC), widely
used over many years in numeric data sets. (SIC codes
were associated with words by using, as a training set, the
titles in a bibliographic database that used SIC codes.) But
by the time the SIC EVI was completed, SIC had been dis-
continued and replaced by the North American Industry
Classification System (NAICS), so a mapping was created
from SIC codes to NAICS codes. Figures 1–3 show stages
in an interface that accepts a searcher’s query “car” (figure
1), prompts with a ranked list of NAICS codes (figure 2),
then extends the search with the selected NAICS code to
retrieve numeric data (figure 3).
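Chaining a word-to-SIC index with a SIC-to-NAICS mapping might look like the sketch below. All table entries and scores are illustrative assumptions rather than an official crosswalk, and the function name recommend_naics is hypothetical. An actual SIC-to-NAICS concordance is often one-to-many.

```python
# Toy word-to-SIC index: word -> [(SIC code, association score)].
# Scores are invented for illustration.
WORD_TO_SIC = {"car": [("3711", 8.0), ("5511", 5.0)]}

# Toy SIC-to-NAICS crosswalk; illustrative entries only.
SIC_TO_NAICS = {"3711": ["336111"], "5511": ["441110"]}


def recommend_naics(word, word_to_sic, sic_to_naics):
    """Map a query word to ranked NAICS codes by way of SIC codes."""
    best = {}
    for sic, score in word_to_sic.get(word.lower(), []):
        for naics in sic_to_naics.get(sic, []):
            # Keep the strongest score when several SIC codes reach one NAICS code.
            best[naics] = max(score, best.get(naics, 0.0))
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


ranked = recommend_naics("Car", WORD_TO_SIC, SIC_TO_NAICS)
```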
By this time, however, it had become apparent that,
with the current low level of interoperability in software
and in data formats, the labor required to create EVIs and
interfaces to each large traditional numeric data series was
enormous. Therefore, attention was turned to a collection
of different numeric data sets available through a single
interface, Counting California, made available by California
Digital Library at http://countingcalifornia.cdlib.org. This
resource is a collection of some three thousand numeric
tables containing statistics related to a range of topics.
The numeric data sets are mainly from the California
Department of Health Services, the California Department
of Finance, and the federal Bureau of the Census. The tables
are organized under a two-level classification scheme.
There are sixteen topics at the top level, which are subdi-
vided into a total of 184 subtopics. All the numeric tables
were assigned to one or more subtopics and each table
has a caption.
At the Counting California Web site, a searcher can
browse for tables by selecting a higher-level topic, then
a lower-level subtopic, and then a table. Two additional
ways were created to access the tables: Probabilistic
retrieval, and an EVI to the topical categories. The cap-
tions, topics, and subtopics were extracted for each of the
three thousand tables, and XML records were created in
the following form:
education
libraries
library statistics, statewide summary by
type of library California 1992–93 to 1997–98
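A record of this kind could be generated as in the following sketch. The element names (record, topic, subtopic, caption) are assumptions made for illustration, since the original markup is not reproduced here.

```python
import xml.etree.ElementTree as ET


def make_record(topic, subtopic, caption):
    """Serialize one table description as a small XML record."""
    record = ET.Element("record")  # element names are assumed, not from the source
    ET.SubElement(record, "topic").text = topic
    ET.SubElement(record, "subtopic").text = subtopic
    ET.SubElement(record, "caption").text = caption
    return ET.tostring(record, encoding="unicode")


xml_text = make_record(
    "education",
    "libraries",
    "library statistics, statewide summary by type of library "
    "California 1992-93 to 1997-98",
)
```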
Retrieval
Two search methods were used:
Direct Probabilistic Retrieval. An in-house implementa-
tion was used of a probabilistic full-text retrieval algo-
rithm developed at Berkeley.7 This search engine takes a
free-form text query and returns a ranked list of captions
of tables ranked according to their relevance scores. For
example, the five top-ranked captions returned to the
query “Public Libraries in California” were:
Figure 1. Query interface for search-term recommender system for the North American Industry Classification System
Figure 2. Display of NAICS code search-term recommendations for “car”
Figure 3. Display of numeric data retrieved using selected NAICS code
1. Library statistics, Statewide summary by type of
library California, 1992–93 to 1997–98 Table F6.
2. Library statistics, Statewide summary by type of
library California, 1993–94 to 1998–99 Table
F6YR0-0.
3. Number of California libraries, 1989 to 1999 Table
F5YR00
4. Number of California libraries, 1989 to 1998, as of
September Table F5.
5. California Public Schools, Grades K–12, 1989 to
1998 Table F4.
Each entry in the retrieved set list is linked to a numeric
table maintained at the Counting California Web site and,
by clicking on the appropriate link, a user can display the
table as an MS Excel file or as a PDF file.
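Ranked retrieval over captions can be illustrated with a much simpler stand-in. The sketch below uses a plain TF-IDF style score, not the logistic-regression algorithm cited above, and the captions are shortened versions of those in the text. On so small a corpus the ordering is not expected to match the results listed above.

```python
import math
import re
from collections import Counter

# Shortened captions, for illustration only.
CAPTIONS = [
    "Library statistics, statewide summary by type of library, California",
    "Number of California libraries",
    "California public schools, grades K-12",
    "Personal income statistics by county, California",
]


def terms(text):
    """Lowercase, split on non-alphanumerics, and crudely reduce plurals."""
    out = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token.endswith("ies") and len(token) > 4:
            token = token[:-3] + "y"
        elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            token = token[:-1]
        out.append(token)
    return out


def rank(query, captions):
    """Return captions with a nonzero score, best first (TF-IDF stand-in)."""
    docs = [Counter(terms(c)) for c in captions]
    n = len(docs)
    scored = []
    for caption, doc in zip(captions, docs):
        score = 0.0
        for term in set(terms(query)):
            if term in doc:
                df = sum(1 for d in docs if term in d)
                score += (1 + math.log(doc[term])) * math.log((n + 1) / df)
        if score > 0:
            scored.append((score, caption))
    return [caption for score, caption in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]


results = rank("Public Libraries in California", CAPTIONS)
```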
Mediated Search. From the same extracted records the
words in the captions were used to create an EVI to the sub-
topics in the topic classification using the method already
described. As an example, the query “personal individual
income tax,” when submitted to the EVI, generated the
following ranked list of subtopics:
1. Income
2. Government earnings and tax revenues
3. Personal income
4. Property tax
5. Personal income tax
6. Corporate income tax
7. Per capita income
A user can click on any selected subtopic to retrieve the cap-
tions of tables assigned that subtopic. For example, clicking
on the fifth subtopic, Personal income tax, retrieves:
■ Personal income tax returns: Number and amount
of adjusted gross income reported by adjusted gross
income class California, 1998 taxable year. Table
D10YR00
■ Personal income tax returns: Number and amount
of adjusted gross income reported by adjusted gross
income class California, 1997 taxable year. Table D9
■ Personal income statistics by county, California 1997
taxable year. Table D10
■ Personal income statistics by county, California 1998
taxable year. Table D11YR00
■ Transverse searching between text- and numeric-data series
To demonstrate the searching capability from a bib-
liographic record to numeric-data sets, the first step is to
retrieve and display a bibliographic record from an online
catalog. A Web-based interface for searching online catalogs
was implemented using an in-house implementation of the
Z39.50 protocol. Besides the Z39.50 protocol, an important
component that makes searching remote online catalogs
feasible is the gateway between the HTTP (Hypertext
Transfer Protocol) and the Z39.50 protocol. While HTTP is
a connectionless protocol, Z39.50 is a connection-oriented
protocol. The gateway maintains connections
to remote Z39.50 servers. All search requests to any remote
Z39.50 server go through the gateway.
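The gateway's role can be sketched as a small session cache. This is a schematic illustration only: FakeZ3950Connection is a hypothetical stand-in that merely counts searches, not a real Z39.50 client, and the class and method names are assumptions.

```python
class FakeZ3950Connection:
    """Stand-in for a stateful Z39.50 session; not a real protocol client."""

    def __init__(self, host):
        self.host = host
        self.searches = 0

    def search(self, query):
        self.searches += 1
        return f"{self.host}: results for {query!r} (search #{self.searches})"


class Gateway:
    """Keeps one open session per remote server across independent requests."""

    def __init__(self, connect=FakeZ3950Connection):
        self._connect = connect
        self._sessions = {}

    def handle_http_request(self, host, query):
        # Each HTTP request arrives independently; reuse the open session if any.
        session = self._sessions.get(host)
        if session is None:
            session = self._sessions[host] = self._connect(host)
        return session.search(query)


gateway = Gateway()
first = gateway.handle_http_request("catalog.example.org", "public libraries")
second = gateway.handle_http_request("catalog.example.org", "library legislation")
```

Both requests are served by the same session object, which is the property that lets short-lived Web requests use a connection-oriented service.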
Searching from catalog records
to numeric data sets
Having selected some text (for the purposes of this study, a
catalog record), how could one identify the facts or statis-
tics in a numeric database that are most closely related to
the topic? Clicking on a “formulate query” button placed
at the end of a displayed full MARC record creates a query
for searching a numeric database. The initial query will
contain the words extracted from the title, subtitle, and the
subject headings and is placed in a new window where the
user can modify or expand the query before submitting it to
the search engine for a numeric database. So, for example,
the following text extracted from a catalog record:
Library laws of the State of California,
Library legislation. California.
Public libraries
when submitted as a query, retrieves a ranked list of
table names, of which two, covering different time periods,
are entitled Library Statistics, Statewide Summary by Type of
Library, California.
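Forming the initial query from a displayed record might be sketched as follows. The dictionary layout, the function name formulate_query, and the grouping of the example strings under particular field tags are simplifying assumptions. Only the choice of title, subtitle, and subject heading fields follows the description above.

```python
import re

# Title, subtitle, and main subject heading subfields.
QUERY_FIELDS = ("245a", "245b", "600a", "610a", "611a", "630a", "650a", "651a")

# Hypothetical in-memory form of the example record; tag assignment is a guess.
RECORD = {
    "245a": ["Library laws of the State of California,"],
    "650a": ["Library legislation. California.", "Public libraries"],
}


def formulate_query(record, fields=QUERY_FIELDS):
    """Collect the words of the selected fields into one editable query string."""
    words = []
    for tag in fields:
        for value in record.get(tag, []):
            words.extend(re.findall(r"\w+", value))
    return " ".join(words)


query = formulate_query(RECORD)
```

The result is a plain string so that, as in the interface described, the user can modify or expand it before it is submitted.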
Searching from numeric data sets
to catalog records
Transverse search in the other direction, starting from a
data table, is achieved by forwarding the caption of a table
to the word-to-LCSH EVI to generate a prompt list of the
seven top-ranked LCSHs, any one of which can be used
as a query submitted to the catalog.
■ Architecture
Figure 4 shows the structure of the implementation. The
boxes shown in the figure are:
1. A search interface for accessing bibliographic/tex-
tual resources through a word-to-LCSH EVI.
2. A word-to-LCSH EVI.
3. A ranked list of LCSHs closely associated with the
query.
4. An online catalog.
5. Results of searching the online catalog using an
LCSH.
6. A full MARC record displayed in tagged form.
7. A new query formed by extracting the title and sub-
ject fields from the displayed full MARC record.
8. A numeric database.
9. A list of captions of numeric tables ranked by rel-
evance score to the query.
10. Numeric table displayed in PDF or MS Excel format.
11. A search interface for numeric databases based on
a probabilistic search algorithm.
A user can start a search using either interface (boxes
1 or 11) and, from either starting point, find records on
the same topic of interest in a textual (here bibliographic)
database and a socioeconomic database.
■ Conclusions and further work
Enhanced access to numeric data sets
The descriptive texts associated with numeric tables, such
as the caption, headers, or row labels, are usually very short.
They provide a rather limited basis for locating the table in
response to queries, or describing a data cell sufficiently
to form a usefully descriptive query from it. Sometimes
the title (caption) of a table may be the only searchable
textual description about the content of the table, and the
titles are sometimes very general. For example, one of the
titles, Library Statistics, Statewide Summary by Type of Library
California, 1992–93 to 1997–98, is so general that neither the
kinds of statistics nor the types of libraries are revealed. If
a user posed the question, “What are the total operating
expenditures of public libraries in California?” to a query
system that indexes table titles only, the search may well
be ineffective, since the only words in common between the
table title and the user’s query are “California” and, if
plurals have been normalized to the singular form,
“library.”
Table column headings and row headings provide
additional information about the content of a numeric
table. However, the column and row headings are usu-
ally not directly searchable. For example, a table named
“Language spoken at home” in Counting California
databases consists of rows and columns. The column
headings list the languages spoken at home, while the
row headings show the county names in California. Each
cell in the table gives the number of people, five years of
age and older, who speak a specific language at home.
For a question such as “How many people speak
Spanish at home in Alameda County, California?” a search on
the table title alone may not retrieve the table that contains
the answer. It is recommended
that the textual descriptions of numeric tables be enriched.
Automatically combining the table title and its column and
row headings would be a small but practical step toward
improved retrieval.
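The recommended enrichment step can be sketched in a few lines. The function name and the sample headings are assumptions chosen to match the "Language spoken at home" example. The actual table's headings are not reproduced here.

```python
def enriched_description(title, column_headings, row_headings):
    """Merge a table title with its headings into one searchable string."""
    seen, words = set(), []
    for text in [title, *column_headings, *row_headings]:
        for token in text.lower().split():
            token = token.strip(".,;:()")
            if token and token not in seen:
                seen.add(token)
                words.append(token)
    return " ".join(words)


description = enriched_description(
    "Language spoken at home",
    ["Spanish", "Chinese"],   # hypothetical column headings (languages)
    ["Alameda", "Alpine"],    # hypothetical row headings (counties)
)
```

With the headings folded in, a query containing "Spanish" and "Alameda" now has words in common with the table's description, which the title alone did not offer.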
Geographic search
Socioeconomic numeric data series refer to particular areas
and, in contrast to text searching, the geographical aspect
ordinarily has to be specified. To match the geographical
area of the numeric data, a matching text search may also
have to specify the same place. The authors found that
this was hard to achieve for several reasons. Place names
are ambiguous and unstable: A search for data relating
to Trinidad might lead to Trinidad, West Indies, instead
of Trinidad, California, for example. The problem is
compounded because, in numeric data series, specialized
geopolitical divisions, such as census tracts and counties,
are commonly used. These divisions do not match conve-
niently with searchers’ ordinary use of place names. Also,
the granularity of geographical coverage may not match
well. Data relating to Berkeley, for example, may be avail-
able only in aggregated data for Alameda County.
It was eventually concluded that reliance on the names
of places could never work satisfactorily. The only effective
path to reliable access to data relating to places would be
to use geospatial coordinates (latitude and longitude) to
establish unambiguously the identity and location of any
place and the relationship between places. This means
that gazetteers and map visualizations become important.
Gazetteers relate named places to defined spaces, and
thereby reveal spatial relationships between places, e.g.,
the city of Alameda is on Alameda Island within Alameda
County. This problem has been addressed in a subsequent
Figure 4. Architecture of the prototype
study entitled “Going Places in the Catalog: Improved
Geographical Access.”8
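The disambiguating role of coordinates can be shown with a toy gazetteer. The coordinates are approximate and rounded, the bounding box for California is a rough rectangle, and all names in the code are illustrative assumptions rather than part of any system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    qualifier: str
    lat: float  # decimal degrees, north positive
    lon: float  # decimal degrees, east positive


# Approximate coordinates, for illustration only.
GAZETTEER = [
    Place("Trinidad", "California", 41.06, -124.14),
    Place("Trinidad", "West Indies", 10.45, -61.25),
    Place("Berkeley", "California", 37.87, -122.27),
]

# Rough bounding box for California: (south, west, north, east).
CALIFORNIA_BOX = (32.5, -124.5, 42.0, -114.1)


def lookup(name, box=None):
    """Return gazetteer entries matching a name, optionally limited to a box."""
    matches = [p for p in GAZETTEER if p.name.lower() == name.lower()]
    if box is not None:
        south, west, north, east = box
        matches = [
            p for p in matches
            if south <= p.lat <= north and west <= p.lon <= east
        ]
    return matches


ambiguous = lookup("Trinidad")
in_california = lookup("Trinidad", CALIFORNIA_BOX)
```

The name alone yields two candidates; adding the spatial constraint leaves one. A rectangle is a crude test, since it also covers parts of neighboring states, so a fuller treatment would use actual boundary geometry.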
Temporal search
Searches of text files and of socioeconomic numeric data
series also differ substantially with respect to time periods:
Numeric data searches ordinarily require the years of inter-
est to be specified; text searches rarely specify the period.
An additional difficulty arises because in text, as in speech,
a period is commonly referred to by a name derived meta-
phorically from events used as temporal markers, rather
than by calendar time, as in “during Vietnam,” “under
Clinton,” or “in the reign of Henry VIII.”
Named time periods have some of the characteristics
of place names: they are culturally based and tend to be
multiple, unstable, and ambiguous. It appears that an
analogous solution is indicated: directories of named time
periods mapped to calendar definitions, much as a gazet-
teer links place names to spatial locators. This problem is
being addressed in a subsequent study entitled “Support
for the Learner: What, Where, When, and Who.”9
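A directory of named time periods could, in its simplest form, be a lookup from phrases to calendar years. The sketch below is an assumption about what such a directory might look like, not a description of the subsequent study. "During Vietnam" is deliberately left out because its boundaries depend on the definition adopted.

```python
# Named period -> (first year, last year); a tiny illustrative directory.
PERIODS = {
    "under clinton": (1993, 2001),
    "in the reign of henry viii": (1509, 1547),
}


def years_for(phrase):
    """Return the (start, end) years for a named period, or None if unknown."""
    return PERIODS.get(phrase.strip().lower())


def overlaps(period, start, end):
    """True if a period shares at least one year with the range start..end."""
    return period[0] <= end and start <= period[1]


clinton = years_for("under Clinton")
```

Once a phrase is resolved to years, it can be matched against the year ranges that numeric data series require, for example a table covering 1995 to 1998.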
Media forms
The paradox, in an environment of digital “media conver-
gence,” that it appears impossible to search directly across
different media forms invites closer attention to concepts
and terminology associated with media. A view that fits
and explains the phenomena as the authors understand
them, distinguishes three aspects of media:
■ Cultural codes: All forms of expression depend on
some shared understandings, on language in a broad
sense. Convergence here means cultural convergence
or interpretation.
■ Media types: Different types of expression have
evolved: Texts, images, numbers, diagrams, art. An
initial classification can well start with the five senses
of sight, smell, hearing, taste, and touch.
■ Physical media: Paper; film; analog magnetic tape; bits;
. . . Being digital affects directly only this aspect.
Anything perceived as a meaningful document has cul-
tural, type, and physical aspects, and genre usefully denotes
specific combinations of code, type, and physical medium
adopted by social convention. Genres are historically and
culturally situated.
Convergence can be understood in terms of interoper-
ability and is clearly seen in physical media technology.
The adoption of English as a language for international
use in an increasingly global community promotes conver-
gence in cultural codes. Nevertheless, the different media
types are fundamentally distinct.
Metadata as infrastructure
It is the metadata and, in a very broad sense, “biblio-
graphic” tools that provide the infrastructure necessary for
searches across and between different media—thesauruses,
mappings between vocabularies, place-name gazetteers,
and the like. In isolation, metadata is properly regarded as
description attached to documents, but this is too narrow
a view. Collectively, the metadata forms the infrastructure
through which different documents can be related to each
other. It is a variation on the role of citations: Individually,
references amplify an individual document by validating
statements made within it; collectively, as a citation index,
references show the structure of scholarship to which docu-
ments are attached.
■ Summary
A project was undertaken to demonstrate simultane-
ous search of two different media types (socioeconomic
numeric data series and text files) without ingesting these
diverse resources into a shared environment. The project
objective was eventually achieved, but proved harder than
expected for the following reasons: Access to these differ-
ent media types has been developed by different commu-
nities with different practices; the systems (vocabularies)
for topical categorization vary greatly and need interpre-
tative mappings (also known as relative indexes, search-
term recommender systems, and EVIs); specification of
geographical area and time period is necessary for
search in socioeconomic data series and, for this, existing
procedures for searching text files are inadequate.
■ Acknowledgement
This work was partially supported by the Institute of
Museum and Library Services through National Library
Leadership Grant No. 178 for a project entitled “Seamless
Searching of Numeric and Textual Resources,” and was
based on prior research partially supported by DARPA
Contracts N66001-97-C-8541; AO# F477: “Search Support
for Unfamiliar Metadata Vocabularies” and N66001-00-1-
8911, TO# J290: “Translingual Information Management
Using Domain Ontologies.”
References
1. Michael K. Buckland, Fredric C. Gey, and Ray R. Larson,
Seamless Searching of Numeric and Textual Resources: Final Report
on Institute of Museum and Library Services National Leadership
Grant No. 178 (Berkeley, Calif.: Univ. of California, School
of Information Management and Systems, 2002), http://
metadata.sims.berkeley.edu/papers/SeamlessSearchFinal
Report.pdf (accessed July 18, 2006); Michael Buckland et al.,
“Seamless Searching of Numeric and Textual Resources: Fri-
day Afternoon Seminar, Feb. 14, 2003,” http://metadata.sims
.berkeley.edu/papers/seamlessfri.ppt (accessed July 18, 2006).
2. Michael Buckland et al., “Mapping Entry Vocabulary to
Unfamiliar Metadata Vocabularies,” D-Lib Magazine 5, no. 1 (Jan.
1999), www.dlib.org/dlib/january99/buckland/01buckland
.html (accessed July 18, 2006); Michael Buckland, “The Sig-
nificance of Vocabulary,” 2000, http://metadata.sims.berkeley
.edu/vocabsig.ppt (accessed July 18, 2006); Fredric C. Gey et al.,
“Entry Vocabulary: A Technology to Enhance Digital Search,”
in Proceedings of the First International Conference on Human Lan-
guage Technology, San Diego, Mar. 2001 (San Francisco: Morgan
Kaufmann, 2001), 91–95, http://metadata.sims.berkeley.edu/
papers/hlt01-final.pdf (accessed July 18, 2006).
3. Los Angeles Times, July 12, 1995: D1.
4. Michael Buckland, “Vocabulary As a Central Concept in
Library and Information Science,” in Digital Libraries: Interdisci-
plinary Concepts, Challenges, and Opportunities. Proceedings of the
Third International Conference on Conceptions of Library and Infor-
mation Science (CoLIS3), Dubrovnik, Croatia, May 23–26, 1999, ed.
T. Aparac et al. (Lokve, Croatia: Benja Pubs., 1999), 3–12, www
.sims.berkeley.edu/~buckland/colisvoc.htm (accessed July 18,
2006); Buckland et al., “Mapping Entry Vocabulary.”
5. Counting California, http://countingcalifornia.cdlib.org
(accessed July 18, 2006).
6. “Factsheet: Unified Medical Language System,” www
.nlm.nih.gov/pubs/factsheets/umls.html (accessed July 18,
2006).
7. William S. Cooper, Aitao Chen, and Fredric C. Gey, “Full-
Text Retrieval Based on Probabilistic Equations with Coefficients
Fitted by Logistic Regression,” in D. K. Harman, ed., The Second
Text REtrieval Conference (TREC-2), March 1994, 57–66 (Gaith-
ersburg, Md.: National Institute of Standards and Technol-
ogy, 1994), http://trec.nist.gov/pubs/trec2/papers/txt/05.txt
(accessed July 18, 2006).
8. “Going Places in the Catalog: Improved Geographical
Access,” http://ecai.org/imls2002 (accessed July 18, 2006).
9. Vivien Petras, Ray Larson, and Michael Buckland, “Time
Period Directories: A Metadata Infrastructure for Placing Events
in Temporal and Geographic Context,” in Opening Information
Horizons: Joint Conference on Digital Libraries (JCDL), Chapel
Hill, N.C., June 11–15, 2006, forthcoming, http://metadata.sims
.berkeley.edu/tpdJCDL06.pdf (accessed July 18, 2006); “Support
for the Learner: What, Where, When, and Who,” http://ecai
.org/imls2004 (accessed July 18, 2006).
Appendix: Statistical association methodology
A statistical maximum likelihood ratio weighting tech-
nique was used to construct a two-way contingency table
relating each natural-language term (word or phrase) with
each value in the metadata vocabulary of a resource, e.g.,
LCSH, LCCNs, U.S. Patent Classification Numbers, and
so on.1 An associative dictionary that will map words in
natural languages into metadata terms can also, in reverse,
return words in natural language that are closely associated
with a metadata value.
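The weighting step can be sketched in Python as follows. This is a minimal illustration, assuming the statistic is the log-likelihood ratio (G²) computed from record-level co-occurrence counts in a 2 x 2 table; the function names, the record layout, and the toy records are hypothetical and are not taken from the project's software.

```python
import math
from collections import Counter


def xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table.

    k11: records with both the term and the metadata value
    k12: records with the term but not the value
    k21: records with the value but not the term
    k22: records with neither
    """
    n = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    return 2.0 * (cells + xlogx(n) - rows - cols)


def build_associations(records):
    """records: iterable of (set_of_terms, set_of_metadata_values)."""
    term_n, value_n, pair_n = Counter(), Counter(), Counter()
    total = 0
    for terms, values in records:
        total += 1
        term_n.update(terms)
        value_n.update(values)
        pair_n.update((t, v) for t in terms for v in values)
    scores = {}
    for (t, v), k11 in pair_n.items():
        k12 = term_n[t] - k11
        k21 = value_n[v] - k11
        k22 = total - k11 - k12 - k21
        scores[(t, v)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores


def values_for(term, scores, top=5):
    # Forward lookup: natural-language term -> metadata values
    ranked = sorted(((s, v) for (t, v), s in scores.items() if t == term),
                    reverse=True)
    return [v for _, v in ranked[:top]]


def terms_for(value, scores, top=5):
    # Reverse lookup: metadata value -> natural-language terms
    ranked = sorted(((s, t) for (t, v), s in scores.items() if v == value),
                    reverse=True)
    return [t for _, t in ranked[:top]]


# Toy training records (invented for illustration only)
records = [
    ({"alcohol", "abuse"}, {"Alcoholism"}),
    ({"alcohol", "drinking"}, {"Alcoholism"}),
    ({"printer", "ink"}, {"Printing"}),
    ({"printer", "press"}, {"Printing"}),
]
scores = build_associations(records)
print(values_for("alcohol", scores))
print(terms_for("Printing", scores))
```

The same score table serves both directions of lookup. G² is large for any strong departure from independence, including pairs that co-occur less often than expected, so a working system would probably also check that the observed count exceeds the expected one.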
Training records containing two different metadata
vocabularies can be used to create direct mappings
between the values of the two metadata vocabularies. For
example, U.S. patents contain both U.S. and International
Patent Classification numbers and so can be used to create
a mapping between these two quite different classifica-
tions. Multilingual training sets, such as catalog records
for multilingual library collections, can be used to create
multilingual natural language indexes to metadata vocabu-
laries and, also, mappings between natural language
vocabularies.
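A direct mapping of this kind can be illustrated with a short Python sketch. It assumes each training record carries codes from both classifications, and it ranks candidates by the simple proportion of co-occurrences rather than by the likelihood ratio weighting used in the project; the class codes, function names, and data are invented for illustration.

```python
from collections import Counter, defaultdict


def build_crosswalk(records):
    """records: iterable of (source_codes, target_codes), one pair per record."""
    co = defaultdict(Counter)
    for source_codes, target_codes in records:
        for src in set(source_codes):
            # Count each target code at most once per record
            co[src].update(set(target_codes))
    return co


def map_code(source_code, co, top=3):
    """Return up to `top` target codes with their share of co-occurrences."""
    counts = co.get(source_code)
    if not counts:
        return []
    total = sum(counts.values())
    return [(code, n / total) for code, n in counts.most_common(top)]


# Hypothetical records; "US-A" and "IPC-X" are placeholders, not real classes
records = [
    (["US-A"], ["IPC-X"]),
    (["US-A"], ["IPC-X", "IPC-Y"]),
    (["US-B"], ["IPC-Y"]),
]
crosswalk = build_crosswalk(records)
print(map_code("US-A", crosswalk))
```

Swapping the two fields of each record yields the mapping in the opposite direction.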
In addition to the maximum likelihood ratio-based
association measure, there are a number of other asso-
ciation measures, such as the Chi-square statistic, mutual
information measure, and so on, that can be used in creat-
ing association dictionaries.
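For comparison, two of these alternative measures can be computed from the same 2 x 2 table of counts. The sketch below is illustrative only: it assumes the Pearson chi-square statistic without continuity correction and the pointwise form of mutual information, neither of which is specified in the text, and the function names are hypothetical.

```python
import math


def chi_square(k11, k12, k21, k22):
    """Pearson chi-square for a 2x2 table (no continuity correction)."""
    n = k11 + k12 + k21 + k22
    denom = (k11 + k12) * (k21 + k22) * (k11 + k21) * (k12 + k22)
    if denom == 0:
        return 0.0  # a row or column is empty; no evidence of association
    return n * (k11 * k22 - k12 * k21) ** 2 / denom


def pointwise_mutual_information(k11, k12, k21, k22):
    """log2 of observed over expected co-occurrence of term and value."""
    n = k11 + k12 + k21 + k22
    if k11 == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(k11 * n / ((k11 + k12) * (k11 + k21)))
```

Both functions take the cell counts in the order used for a contingency table: both present, term only, value only, neither.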
The training set used to create the word-to-LCSH EVI
was a set of catalog records with at least one assigned
LCSH (i.e., at least one 6xx field). Natural language terms
were extracted from the title (field 245a), subtitle (245b),
and summary note (520a). These terms were tokenized;
the stopwords were removed; and the remaining words
were normalized. A token here can contain only letters
and digits. All tokens were then changed to lower case.
The stoplist has about six hundred words considered not
to be content bearing, such as pronouns, prepositions,
coordinators, determiners, and the like.
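The extraction steps described above might be sketched as follows. The record is represented as a plain dictionary keyed by field and subfield, which is an assumption made for illustration rather than the project's actual data format; the stoplist shown is a small stand-in for the list of about six hundred words, and lowercasing is applied before the stoplist check so that capitalized stopwords are caught.

```python
import re

# Illustrative subset only; the project's stoplist had about 600 entries
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# A token is a run of letters and digits (underscore excluded)
TOKEN = re.compile(r"[^\W_]+")


def has_subject_heading(record):
    # Keep only records with at least one assigned heading (a 6xx field)
    return bool(record.get("6xx"))


def extract_terms(record):
    """Tokenize title, subtitle, and summary note; drop stopwords."""
    text = " ".join(record.get(f) or "" for f in ("245a", "245b", "520a"))
    tokens = [t.lower() for t in TOKEN.findall(text)]
    return [t for t in tokens if t not in STOPWORDS]


# Hypothetical catalog record
record = {
    "245a": "The Printers of Paris",
    "245b": "a history",
    "520a": "",
    "6xx": ["Printing"],
}
if has_subject_heading(record):
    print(extract_terms(record))
```

Normalization of the surviving content words is a separate step applied to the output of `extract_terms`.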
The content words (those not treated as stopwords)
were normalized using a table derived from an English
morphological analyzer.2 The table maps plural nouns
into singular ones; verbs into the infinitive form; and
comparative and superlative adjectives to the positive
form. For example, the plural noun printers is reduced
to printer, and children to child; the comparative adjective
longer and the superlative adjective longest are reduced
to long; and printing, printed, and prints are all reduced to
the same base form print. When a word belonging to more
than one part-of-speech category can be reduced to more
than one form, it is changed to the first form listed in the
morphological analyzer table. As an example, the word
saw, which can be a noun or the past tense of the verb to
see, is not reduced to see. Subject headings (field 6xxa) were
extracted without qualifying subdivisions. The inclusion
of foreign words (alcoholismo, alcoolisme, alkohol, and alcool),
derived from titles in foreign languages, demonstrates
that the technique is language independent and could be
adopted in any country. It could also support diversity
in U.S. libraries by allowing searches in Spanish or other
languages, so long as the training set contains sufficient
content words. EVIs are accessible at http://metadata.
sims.berkeley.edu/prototypesI.html.
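The table-driven normalization can be illustrated with a small lookup. The entries below are taken from the examples in the text; the table structure, the function name, and the assumption that the noun reading of saw is listed first are hypothetical.

```python
# Each surface form maps to its candidate base forms, in table order.
# Assumption: for "saw" the noun reading precedes the verb reading.
NORMALIZATION_TABLE = {
    "printers": ["printer"],
    "children": ["child"],
    "longer": ["long"],
    "longest": ["long"],
    "printing": ["print"],
    "printed": ["print"],
    "prints": ["print"],
    "saw": ["saw", "see"],
}


def normalize(token):
    """Return the first listed base form, or the token itself if unlisted."""
    forms = NORMALIZATION_TABLE.get(token)
    return forms[0] if forms else token


print([normalize(t) for t in ["printers", "saw", "longest", "alcoholismo"]])
```

Words absent from the English table, such as the foreign-language title words mentioned above, pass through unchanged and can still be associated with subject headings.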
Fuller descriptions of the project methodology can be
found in the literature.3 ■
References
1. Ted Dunning, “Accurate Methods for the Statistics of
Surprise and Coincidence,” Computational Linguistics 19 (March
1993): 61–74.
2. Daniel Karp et al., “A Freely Available Wide Cover-
age Morphological Analyzer for English,” in Proceedings of
COLING-92, Nantes, 1992 (Morristown, N.J.: Association for
Computational Linguistics, 1992), 950–55, http://acl.ldc.upenn
.edu/C/C92/C92-3145.pdf (accessed July 18, 2006).
3. Michael K. Buckland, Fredric C. Gey, and Ray R. Larson,
Seamless Searching of Numeric and Textual Resources: Final Report on
Institute of Museum and Library Services National Leadership Grant
No. 178 (Berkeley, Calif.: Univ. of California, School of Informa-
tion Management and Systems, 2002), http://metadata.sims
.berkeley.edu/papers/SeamlessSearchFinalReport.pdf (accessed
July 18, 2006); Youngin Kim et al., “Using Ordinary Language to
Access Metadata of Diverse Types of Information Resources:
Trade Classification and Numeric Data,” in Knowledge: Creation,
Organization, and Use. Proceedings of the American Society for Infor-
mation Science Annual Meeting, Oct. 29–Nov. 4, 1999 (Medford,
N.J.: Information Today, 1999), 172–80.