190 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 Digital tool making offers many challenges, involving much trial and error. Developing machine learning and assistance in automated and semi-automated Internet resource discovery, metadata generation, and rich-text identification provides opportunities for great discov- ery, innovation, and the potential for transformation of the library community. The areas of computer science involved, as applied to the library applications addressed, are among that discipline’s leading edges. Making applied research practical and applicable, through placement within library/collection-management systems and ser- vices, involves equal parts computer scientist, research librarian, and legacy-systems archaeologist. Still, the early harvest is there for us now, with a large harvest pending. Data Fountains and iVia, the projects discussed, dem- onstrate this. Clearly, then, the present would be a good time for the library community to more proactively and significantly engage with this technology and research, to better plan for its impacts, to more proactively take up the challenges involved in its exploration, and to better and more comprehensively guide effort in this new territory. The alternative to doing this is that others will develop this territory for us, do it not as well, and sell it back to us at a premium. Awareness of this technology and its current capabilities, promises, limitations, and probable major impacts needs to be generalized throughout the library management, metadata, and systems communi- ties. This article charts recent work, promising avenues for new research and development, and issues the library community needs to understand. T his article is intended to discuss Data Fountains (http://datafountains.ucr.edu) project work and thinking (and its foundation in the iVia system, http://ivia.ucr.edu) regarding tools and services, for use in collection creation and augmentation. Both systems emphasize automated and semi-automated Internet resource discovery, metadata generation, and rich-text harvest. These areas of work and research occur within the larger realms of machine assistance and machine learning. They are of critical value to libraries as they currently or potentially concern: significant resource savings; ampli- fication and re-tasking of expert effort to better match librarian expertise with tasks that truly require it (through the automation of routine tasks); and better scaling of col- lections by providing them the technological wherewithal to grow, as appropriate, and better match the explosion of significant available knowledge and information that the Internet has accelerated. This article is organized into three major sections: ■ Part I details machine assistance work to date in the Data Fountains and iVia systems project. ■ Part II describes current and upcoming promising research directions in machine assistance. ■ Part III delves into planning and organizational issues that may arise for the library community as a result of these technologies. ■ Part I: Recent work in Data Fountains and iVia Part I covers work to date on Data Fountains and iVia. Section 1, “A new service and open source software,” describes concrete project work with Data Fountains, a new open service and suite of open-source software tools for the educational and library communities, in developing practical machine learning to provide machine assistance in collection building. Data Fountains is an expansion of work based upon the iVia systems foundation.1 It is an effort that has been ongoing and evolving since 1994.2 Section 2, “Role and niche definition for machine assis- tance in collection building,” covers recent developments in our ongoing effort to better research and define roles and niches for machine assistance of the types offered by Data Fountains. The spectrum—ranging from collection building with an emphasis on expertise that receives small assists from machine tools to an emphasis on machine tools that are configured and thereafter assisted through small refinements by expertise—is examined. Results from an initial exploratory survey in these areas are summarized. ■ A new service and open-source software—Data Fountains Description Data Fountains is an Internet resource discovery, meta- data-generation, and selected, full-text harvesting service as well as the open source (Lesser General Public License Machine Assistance in Collection Building: New Tools, Research, Issues, and Reflections Steve Mitchell Steve Mitchell (smitch@ucr.edu) is the Science Librarian for iVia/NSDL Data Fountains/Data Fountains Projects, Science Library, University of California, Riverside. MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 191 (LGPL) and General Public License (GPL) licensed) soft- ware that makes the services possible. It is a set of tools for use by organizations and institutions serving the greater learning community that create and maintain Internet por- tals, subject directories, digital libraries, virtual libraries, or library catalogs with portal-like capabilities (IPDVLCs) containing significant collections of Internet resources. It is an evolved variant of the iVia system, with which it shares many components. The Data Fountains/iVia code base rep- resents more than 250,000 lines of primarily C++ code. On the systems level, Data Fountains operates as an array of independent systems containing crawler, text classifier, text extraction, portal, and database software components customized to the needs of participat- ing projects. Each cooperator and subject community works with, fine tunes, and benefits from its own set of crawler(s), classifier(s), and database manager(s), i.e., its own specific Data Fountain. Note that in this article, Data Fountains’ portal/metadata repository/database man- agement, content management, import-export, or content search/browse capabilities, which are substantial, will not be discussed.3 Instead, the article will focus on its machine assistance and machine-learning components. The Data Fountains system and service has been developed through a research partnership among com- puter scientists and academic librarians that is beginning to provide technological solutions to some of the major overall problems associated with the scalability and effi- cient running of IPDVLCs. Much project effort is based on applying machine-learning techniques to partially automate and provide help in a number of laborious and costly IPDVLC activities. Included here, more specifically, are the following needs/scaling challenges: reducing to some degree the high costs of manually created metadata; better coverage of the ever-increasing number of important Internet resources (relatedly, the relatively small size of most library Internet collections, where searches yielding very few or no results are common); reducing or making more efficient expert-involved tasks requiring little exper- tise; and reducing redundant efforts among IPDVLCs (both in content and systems building). By providing inexpensive, universally needed raw materials (i.e., metadata and rich full text represent- ing important resources), the Data Fountains service is intended to offer major support and resource savings to cooperating IPDVLC participants that otherwise have strong ongoing commitments to their established institu- tional identity or “brand,” interface or look, system, and, more generally, “established way of doing things.” Data Fountains viability and sustainability is keyed to providing universally needed service and very generic information products that do not require IPDVLCs to change—this often being seen as prohibitively expensive in time and resources. Data Fountains is intended to lower barriers for substantive cooperation in collection building and resource savings on the part of large numbers of IPDVLCs by developing, sharing, and distributing the benefits of machine learning in its areas of application. The Data Fountains service will be useful to a large spectrum of academic and library-based finding tools including metadata repositories and catalogs with Internet portal-like capabilities.4 Increasingly, library-catalog soft- ware is developing more flexibility, including, hopefully, the means by which full MARC (MAchine-Readable Cataloging) records coexist with more streamlined (and less expensive) records, e.g., Dublin Core (DC) and other types, and, moreover, metadata records that include or can be closely associated with selected rich full-text, among many other catalog need areas.5 Data Fountains offers mul- tiple levels of products and services geared to fit the needs of IPDVLCs of differing sizes, subject needs, and desired data “completeness” or depth (this being the amount and type of metadata and full-text needed to properly represent each resource). Uses, products, and services Overall, Data Fountains automatically or semi-automati- cally supplies varying levels of what represents the basic “ore” required by IPDVLCs for Internet resource and article collection building: access to significant, previ- ously undiscovered resources as well as the metadata and selected full-text that describe or represent them. This ore is available in both raw (relatively unprocessed) and more refined products depending on the needs of the partici- pating IPDVLC including, perhaps most importantly, the degree to which expertise is available to provide further refinement and how and for whom the material is intended to be used. Data Fountains multiple product and usage models supports the building of a wide array of IPDVLC collections. A number of usage or service models are supported by Data Fountains, including: Collection development support for single hybrid record type collections The first usage model, based on full automation, involves the utilization of Data Fountains metadata and rich, full- text “as is,” without review, to populate a collection. These records can be used by themselves or mixed with other types of records. They can also be used as part of a hybrid collection to undergird another, more primary, or fully expert-created, collection.6 While more accurate, expert- created collections are not only comparatively more labor intensive and expensive to create and maintain, but often smaller, with narrower and more limited coverage. This has been the INFOMINE (http://infomine.ucr.edu) model that features two distinct collections, with the automati- cally generated collection supporting, as a second tier of 192 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 data, the expert-built content in the primary collection. Users can search one or both. Internet resource discovery service A second model uses Data Fountains primarily as an Internet resource discovery service where links and titles and other minimal metadata are supplied but where the user’s intent is to identify new resources and build metadata records emphasizing a considerable amount of metadata not generated by Data Fountains (e.g., different subject schema). This is done by utilizing the Targeted Link Crawler, Expert Guided Crawler, or Focused Crawler. Because little to no metadata/rich-text generation/extrac- tion occurs, this is the least complex of the usage models. Crème de la Data Fountains A third approach, a variation of the second, utilizes only those Data Fountains records that have been automatically determined, through a user-set threshold, to represent the most highly significant resources (e.g., the top 20 percent). These can be flagged for expert review or automatically harvested without review. The Data Fountains metadata retained for expert review, post-processing, and improve- ment can be minimal or full. Metadata records intended for expert refinement A fourth approach, which is semi-automated, involves using Data Fountains as both a discovery service and as a metadata record-building service where employment of records from the Data Fountains data stream is selective but the machine-created record is routinely retained as a foundation record to be refined or augmented by the expert. Metadata records plus full-text A fifth approach is to use the rich full-text selectively iden- tified and harvested from the Internet resource, either in addition to the metadata generated or by itself, to populate a collection and greatly boost retrieval. That is, some col- lections may want to utilize metadata differing from that produced by Data Fountains but have Data Fountains perform the service of augmenting their metadata with rich full-text. All or parts of the object and full-text can be harvested. ■ Standards, metadata, and full-text Data Fountains’ record format is Dublin Core (DC) and features standard research library subject schemas includ- ing slightly modified Library of Congress Subject Headings (LCSH) and Library of Congress Classification (LCC). As part of upcoming work, development of additional classifiers to apply other subject/classification schemas/ vocabularies will occur, notably DDC and those that can be automatically invoked from the terminology found in the collection objects. Cooperators may choose to help develop new formats, subject schemas, and metadata to meet custom needs in collecting and classification. Other important metadata generated include: Title, Creators, Description (an annotation-like construct), Keyphrases, Capitalized Terms, and Resource Language, among a total of thirty-plus fields. In addition to fielded metadata, Data Fountains delivers selected rich text harvested from the resource. This is important for enhancing IPDVLC retrieval capabilities and user-searching success. The rich text can be harvested verbatim and offered as-is for search or, if this is problematical, further processed into keyphrases. Data post-processing, transfer, and product relevance assurance Participants determine and download resources of rel- evance automatically in batch mode via subject-profiled, custom Internet crawls and editable results sets created by and for each IPDVLC to reflect its particular interests. These profiled crawls and metadata generation routines are stored and can be re-executed at selected intervals. Results are transferred using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) or SDF (Standard Delimited Format) in DC, MARC, and eXtensible hypertext markup language (XHTML) formats. In addition to batch transfers, participants can manually and interactively identify individual records or groupings of records that suit their needs for harvest. Selective, interactive, sort- ing/browsing of results, followed often by evaluation and editing of metadata and full-text fields (as individual records or globally in patterns), is enabled prior to export. These capabilities allow precisely targeted, custom record identification, modification, and downloading. This in turn enables the most general, as well as the most subject-spe- cialized, IPDVLCs certainty in identifying and receiving only records that meet their need criteria. Open-source software The software making the above possible is available to all for free through the LGPL/GPL open-source licenses and model. The open-source model should work well for tool development as fundamental as that described. Open source of this type generally means that users freely use and perhaps participate in further development of the functionality of the software and, at intervals, contribute their innovations back to the code base for all to use. LGPL/GPL supports a wide diversity of forms of com- MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 193 mercial service development. Open source has worked well for large applications such as many forms of the Linux operating system (a number of variants of this are supported), Apache server software, and MySQL database management software (all of which are used by the Data Fountains system). Using this model has the intent of cooperatively benefiting the community as a whole. It is the author’s belief that tools of the Data Fountains type will have wide enough usage within and are crucial enough to the library community to support the development of an open-source community around them. Data Fountains software is of use to thousands of institutions that build IPDVLC collections. Open source also means that the development and evolution of a core tool or system for a community can potentially occur faster and more flexibly, with the proper community support, than many types of proprietary effort. This is needed given the continuing and increasingly greater revolutions in computing power and software potential. The community needs to be able to evolve faster in response to changing conditions, and free, community- based, open-source software development is one strategy for achieving this. ■ Current systems design, development, and features To date, most of the work has emphasized research and development leading to innovations in preferential focused crawling, subject classification using Logistic Regression, kNearest Neighbor (kNN) and other classifiers, and rich full-text identification and extraction. A major emphasis in systems development has been identifying points of intervention in crawling, classification, and extraction, whereby initial, periodic or ongoing interactive expert input can be employed to improve machine processes and results. That is, the work has emphasized usage not only of fully automated machine processes but semi-automated machine processes intended to interactively augment, amplify, and improve the efforts of experts. Experts assist machine processes, and machine processes assist expert judgment/labor. The programming has also been done with an eye toward modularity among different systems components. ■ Internet resource discovery/ identification—expert guided and focused crawling A number of crawling systems have been used; cur- rently, for Data Fountains, three are used that represent two approaches to crawling: expert guided and focused. Expert-guided crawling is accomplished by a Targeted Link Crawler (TLC) and an Expert Guided Crawler (EGC). TLC is concerned with crawling a user-specified link or list of links. EGC differs from TLC in that the single “Start URL” link given is only the beginning point from which the crawler will either drill down (find onsite links at multiple depths in a site) or drill out (find external links not on the Start URL site). The result is that, compared with TLC, many more links than just those given the EGC crawler initially are crawled. With all crawlers, a metadata record with accompanying rich full-text is generated for each resource crawled. A preferential focused crawler, called the Nalanda iVia Focused Crawler (NiFC) after the name of the ancient seat of learning in India, continues to be developed. Focused crawling makes possible focused identification of signifi- cant Internet resources by identifying specific, interlinked, and semantically similar communities of sites of shared subject interest. Generally, NiFC traverses subject expert- targeted regions of the Internet to find resources that are strongly interlinked and thereby represent coherent subject-interest communities and sites of shared interest and mutual use (i.e., are often concerned with and contain content similar to one another). Communities sharing interests often identify and cite one another through link- ages on their Internet resources. Through this mechanism, these communities and their sites/resources can be identi- fied, mapped, and harvested. Preferential focused crawl- ing makes focused crawling more efficient by employing algorithms that can respond to clues in Web resource page layout and structure (e.g., using document object models, visual cues, and text windows adjacent to anchor text, among others) that indicate the more “promising” links to crawl. The result is more efficient focused crawling (figure 1).7 The focused crawling process starts with exemplary sites/pages/URLs being supplied by participating IPDVLC experts. These highly on-topic exemplars are used to form a seed set of model pages used for training/guid- ing the crawler. As the crawling progresses, an interlinkage graph is developed of which resources link to one another (i.e., cite and co-cite). Highly interlinked resources are evaluated, differentiated, and rated as to the degree to which they are linked to/from as well as for their capaci- ties as authoritative resources (e.g., a primary resource such as an important technical report that receives many in-links to it from other resources) or hubs (e.g., secondary sources such as expert virtual library collections that pro- vide out-links to other, authoritative resources). As hubs, expert-created, high-quality IPDVLC collections of links (e.g., INFOMINE) play an important role as milestones and navigation aids in the guidance of many types of crawling. Another automated process works to rate resources, as a second indirect measure of resource quality, by comparing for similarity of content (e.g., similarities among key-word 194 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 vocabularies) between the potential new resources and model resources. The most linked to/from authorities and hubs, with terminology most similar to the exemplars, are thus identified and become prime candidates for adding to the collection and for indicating other resources to add. The overall architecture of Data Fountains involves multiple concurrent crawls and an array of multiple crawlers and associated classifiers on multiple machines (i.e., there are one or more Data Fountains for each major subject area or major cooperator). Areas of expert interaction in focused crawling Expert interactive and semi-automated approaches to improve crawling are employed in and constitute special design areas of Data Fountains since many participating projects and communities have access to considerable subject expertise. There is much promise in amplifying the role of this expertise in the crawling process. Experts can create and refine crawls by: ■ determining the most appropriate seeds (exemplary resources) to use (whether found in their own collec- tions or generated from other sources); ■ choosing degree of “on-topic-ness” desired (a preci- sion versus recall setting); ■ determining the total number of resources to be crawled; ■ editing initial crawl results (e.g., de-selecting or blacklisting resources found) with an eye toward generally refining and developing a super seed set of very large numbers of increasingly on-target seeds that are then crawled anew. (This process of refine- ment and enlargement can be reiterated as desired in achieving increasing accuracy in and numbers of exemplars and therefore accuracy in the final crawl.) ■ In addition, expert truing of crawler Web graph weightings (i.e., manually “lifting” the values of selected hubs and authorities) either during or after a crawling run is being explored to improve crawling accuracy. This lifting process can be aided through tools to visualize the crawl so that the expert can quickly identify, among the masses of results, the most promising areas of a Web graph for the crawler to emphasize. ■ Expert-created blacklists of URLs for types of sites or pages that are not valuable can be stored to save future crawling and expert time. There is such a blacklist for each participating Data Fountains com- munity group and individual. ■ Metadata generation— automated and semi-automated subject classification Data Fountains and iVia embody innovations in automated metadata generation, including identifying and applying controlled subject terms (using academic library-stan- dard subject schema), keyphrases, and annotation-like constructs (figure 2). Automated classifier programs apply these and other metadata and are part of a suite of programs known as the record builder. Controlled subject terminology applied currently includes LCSH, LCC, DDC, and Medical Subject Headings (MeSH). In assigning these, the system generally first looks for HTML and DC metatags and then extracts these data. With some fields, when these data are not present (which is common), original metadata are then generated automatically. In the case of LCSH, LCC, and DDC, if not present in metatags, or if users choose to override metatag extraction (in cases where metatags are not accurate, such as when they are spammy or when top-page boilerplate metadata is carried onto all pages regardless of subject relevance), then classification processes are invoked. These derive a set of keywords and key phrases from the resource that serve as a surrogate in representing and summarizing its content. Then, using a model that encapsulates the rela- tionships between these natural-language terms and the set of controlled-subject terms, the closest corresponding set of controlled terms is assigned. The model is learned from training data sets that consist of large sets of records (more than thirty million in corpora loaned for research purposes by the Cornell University Library, Library of Congress, California Digital Library [CDL], and OCLC) Figure 1. Focused and preferential crawling (courtesy of S. Chakrabarti) MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 195 from library catalogs and virtual libraries. With LCC, the aim has been to assign one or more LCCs to a resource based on the set of LCSHs associated with that resource. SVM, kNN, and Logistic Regression classifiers have been used. Generally, performance has been acceptable in cases where there were two hundred examples of the usage of a particular LCSH (in a record with a URL). Unfortunately, as large as the training data sets have been, there simply haven’t been enough records for classification purposes with URLs and associated text. This problem will more than likely be resolved shortly as catalogs increasingly incorporate Web resources. Metadata generation—Automated extraction of known, named entities Named-entity (e.g., data elements that can be expected to be in a resource and that are placed by authors/publish- ers within a known textual/markup pattern) extraction is primarily practiced through the simple means of identify- ing and extracting data elements indicated by HTML/DC metatags, when present on a page. Data for more than thirty Dublin Core common (and not so common) fields are extracted. With some fields, extraction can be guided, as needed, in the interests of original metadata creation through pattern recognition and profiling, or through classification (e.g., title, subjects, description). ■ Rich-text identification and harvest Refinement of our “aboutness” measure for identifying the most relevant pages or sections in a resource or document (i.e., those intended by the author to be rich in descrip- tive information about the topics within and the type of resource) from which to extract text is a continuing pursuit. Involved in this quest has been better determination of author-created structures and conventions in document or resource layout (e.g., locating introductions, summaries, etc., and determining/proportioning the amount of text to be extracted from each). More accurate rich-text identification in turn yields more accurate identification, extraction, and application of key phrases and, from these, more accurate controlled subject term and other metadata application. This is at the foundation of many metadata generation processes. Crucially, rich full-text is also important from an end-user information-retrieval perspective because the natural-lan- guage terminology contained partially corrects for the limi- tations inherent in many controlled metadata and subject vocabulary/schema approaches (e.g., new or specialized subject terminology is often slow to appear or weakly represented in the often generalist library-standard subject schemas). Refinement of the “aboutness” measure in identi- fying terms indicating that rich text follows is an important and ongoing task that involves formulating fairly intricate text-extraction rules in reflecting conventions in rich-text placement in resources and documents of differing types (e.g., Web sites, articles, database interfaces), formats (e.g., HTML, PDF, postscript), and languages. ■ A modular architecture that supports a federated array of subject-specific focused crawlers and classifiers The architecture that Data Fountains is based upon is shown in figures 3 and 4. Data Fountains operates on the systems level as an array of separate sets of bundled crawl- ers (both guided and focused), classifiers, and extractors; this bundled array of crawlers approach provides greater flexibility and efficiency, as compared with using a more monolithic, single-crawler, multiple-subject approach. A bundle can occupy a whole machine or several can exist independently, as virtual Data Fountains, on a single machine. Instead of one broad, multiple-subject, multiple- audience Data Fountain that follows a broad shotgun approach to Internet resource discovery and classification, there are several vertical, subject- and audience-focused Data Fountains. A Data Fountain is intended to exist for each distinct, major subject area and the subject-specific IPDVLC collections (e.g., visual arts, business, horticulture) associated with them. Figure 2. Metadata choices in Data Fountains 196 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 Data Fountains systems architecture emphasizes modularity. It has been enabled and assumed that sepa- rate components of the system (e.g., the crawlers, classi- fiers, database management systems) could be developed further for other uses independent of the Data Fountains system. In addition, as technologies that the system is dependent upon advance, users will be able to more easily swap out and replace older modules. These capabilities contribute to system sustainability. ■ Service design and sustainability Data Fountains was conceived to be a cooperative, non- profit, low-overhead, cost-recovery-based service intended to sustain itself after start-up. Access will be provided to IPDVLC cooperators who demonstrate interest and sup- port for the work and service. By so doing, cooperators share in supporting the continuing evolution and improve- ment of Data Fountains. As an additional sustainability consideration, the software has been released as open source so that it can develop and evolve in many directions (to directly fit unique needs) as well as benefit through distributed effort. ■ “Small is beautiful”: Roles for and advantages of appropriate small- to medium-scaled tools Approaches like those Data Fountains has taken may be among the few ways that Internet finding tools can con- tinue to be relevant to the learning/library community and offer the accuracy and significant content needed by that community. The technical challenges faced by the large engines in their quest to cover an infinitude of audiences and Internet resources do not need to be grappled with by the community of research libraries and are not faced by focused crawlers and classifiers of the type Data Fountains relies upon. The latter are better able to develop targeted, more accurate approaches to their subjects because they enable machine assistance for, as well as amplification of, authoritative subject expertise (e.g., librarians) as a core interactive component in the process of finding and describing new resources. The processes involved target more narrowly defined, distinct, and finite subject uni- verses and intellectual communities. This, in turn, allows them to scale appropriately for their tasks and to apply more complex and varied types of metadata for faculty, researchers, graduate students, and librarians, who gener- ally require more precision (and authority) in their finding tools but still need to move beyond collections (even allied) that are essentially catalogs moved forward a notch. The smaller scale of this work also potentially enables inno- vations in effective linkage and similarity (i.e., semantic) analysis. Some experts note that the future of Internet searching as a whole may lie in searching federated finding tools based in these techniques.8 Such a federation could be an academic’s or librarian’s Web of high-quality find- ing tools. Data Fountains may offer part of the foundation needed to support such a Web. From a related perspective, these tools represent an appropriate approach for library and library-community- scaled resource identification and description tasks that emphasizes perhaps the great advantage the library com- Figure 3. Interaction of fully and semi-automated and manual collection building processes in Data Fountains Figure 4. Overall Data Fountains architecture MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 197 munity can bring to bear in creating useful finding and metadata generation tools, which no others have. That is, the community’s unparalleled subject and description expertise in finding and ordering significant resources into coherent rich collections might be amplifiable shortly, through machine assistance. If such an effort was sensi- bly coordinated and focused, and minor modifications in approach and established standards made to enable best use of these new tools, then the best Internet find- ing tools/collections could be made possible yielding high-quality and significant coverage. These collections would benefit by having the capability to catalyze, out of the mass of the Web, the resources that constitute much of its intelligent fraction and make this coherently visible and available to learners and researchers. Moreover, this could be done in such a way that digital and print record and object collections could seamlessly interact as one, rendering what would be the best information-finding tools/collections without regard to type of resource. This effort in fact has been unfurling for a long time, though, to date, in small and somewhat sporadic, uncoordinated ways. For example, INFOMINE and similar collections have provided credible links to and for the academic com- munity for well over a decade. ■ Role and niche definition for machine assistance in collection- building exploratory survey An exploratory survey conducted in fall 2005 illuminated new perspectives, desired products and services, and research opportunities as perceived by a sampling of digital library and library leaders in regard to a number of areas involving machine assistance in collection building. Generally, areas explored concerned, among others: new roles projected for machine learning/machine assistance in libraries for metadata generation, resource discovery, and rich full-text identification and extraction; new finding-tool niches and opportunities existing in the service spectrum between Google and OPAC; acceptance of streamlined, more minimal, and cost-saving approaches to metadata creation or augmentation; the role of cost-recovery-based service and cooperative, participatory business models in digital libraries. More specifically, the purposes of the survey were to: 1. Elicit leading library attitudes in relation to the types of services, software development, and research that generally will constitute Data Fountains; 2. Test the waters in regard to attitudes toward implementing machine-learning/machine-assis- tance-based services for semi-automated collection building within the general context of libraries; 3. Probe for new avenues or niches for these services and tools in distinction to both traditional library services/tools and large Web search engines; 4. Concretely define Data Foundations’ initial set of automatically and semi-automatically generated metadata/resource-discovery products, formats, and services; 5. Examine attitudes toward the value and roles of rich, full-text in library-related finding tools; 6. Examine attitudes toward hybrid databases contain- ing heterogeneous records (e.g., multiple formats, types, and amounts of metadata); 7. Gather ideas on cooperatively organizing such ser- vices; and 8. Generally define new ideas in all interest areas for development of products and services. The survey, comprised of fifty-nine questions, was sent to thirty-five managers of leading digital libraries/librar- ies/information projects.9 There was roughly a 40 percent return from those targeted (fourteen out of thirty-five). Responding institutions and individuals were guaranteed anonymity of response. ■ Survey result summary There was considerable agreement on most answers. As such, this initial definitional survey has proven helpful in design and product definition. Though the survey sample set/number of respondents was limited and while results need to be seen as tentative, the views expressed are from well-regarded experts in the fields of digital library and library technology, development, and services. In addi- tion to helping define current Data Fountains services, the survey results also indicated the need for further explora- tion in the areas of services, tools, overall niche definition, and publicity. While conclusions remain tentative, barring future, larger surveys, some of the more relevant results are as follows: ■ There appear to be significant niches for an auto- mated/semi-automated collection-building/aug- mentation service given inadequacies in serving research-library users found in Google (and presum- ably other large commercial search engines) and commercial-library OPAC/catalog systems. Survey results indicate a need for services of the types char- acterized by Data Fountains. ■ Generally, academic libraries get a slightly above middle-value (neutral) grade in terms of meeting shifting researcher and student information needs over the last decade. This indicates that, above and beyond specific library and commercial-finding tools, 198 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 there are information needs not being met by librar- ies in regard to information discovery and retrieval that new services may be able to help provide. ■ There is support, above and beyond creating machine-assistance-based collection-building ser- vices, for developing and distributing the free, open- source software tools supporting these services. Tools that make possible machine assistance in resource description and collection development are seen as potentially providing very useful services. ■ Automated metadata creation and automated resource discovery/identification, specifically, are perceived as potentially important services of signifi- cant value to libraries/digital libraries. ■ There is support for the notion of automated iden- tification and extraction of rich, full-text data (e.g., abstracts, introductions) as an important service and augmentation to metadata in improving user retrieval. ■ The notion of hybrid databases/collections (such as INFOMINE) containing heterogeneous metadata records (referring to differing amounts, types, and origins of metadata) representing heterogeneous information objects/resources, of different types and levels of core importance, was supported in most regards. ■ Many notions that were foreign to library and even leading-edge digital library managers/leaders (the respondents) two to three years ago appear to be acknowledged research and service issues now. Included among these are: machine assistance in collection building; crawling, extraction, and clas- sification tools; more streamlined types of metadata; open-source software for libraries; limitations of Google for academic or research uses; limitations of commercial-library OPAC/catalog systems; and the value of rich full-text as a complement to metadata for improved retrieval. ■ There is strong support, given the resource savings and collection growth made possible, for the notion of machine-created metadata: both that which is cre- ated fully automatically and, with even more sup- port, that which is automatically created and then expert reviewed and refined. ■ Amounts, types, and formats of desired metadata (very streamlined DC metadata was supported for most uses and contexts) and means of data transfer (OAI-PMH was preferred) were specified by respondents. ■ Summary of Part I Data Fountains is a unique service and system for inex- pensively supporting aspects of collection building among IPDVLCs. Developing and utilizing advances in focused crawling and classification, this service automatically and semi-automatically identifies useful Internet resources (both open-Web as well as closed-collection resources including articles and reports, etc.) and generates metadata (and selected rich text) to accompany them. Data Fountains is a cooperative service, a free open-source software sys- tem, and a research-and-development project exploring machine assistance as well as machine-expert interfaces and synergies in collection building. Several useful service niches and roles for the work have been identified and have been or are being developed. ■ Part II: New directions in research This section discusses important new directions in research for machine assistance in collection building as they relate to upcoming and expanding research, development, and prototyping within Data Fountains and iVia. Among focus areas are promising means of: automated classification for applying library standard controlled subject vocabular- ies/schema, including hybrid and ensemble classification; smarter and more accurate named-entity extraction (e.g., capturing object/article metadata “facts” such as publisher and publishing date); improvements in rich-text identifi- cation and harvesting; article/report collection level co- citation and subject gisting functionality; and generally improved expert-guided and focused Web crawling. ■ New research in machine assistance for collection building The iVia and Data Fountains projects have recently received a fourth National Leadership Grant from the United States Institute of Museum and Library Services that supports three years of research and development in machine assistance in collection building. In addition, the National Science Digital Library is continuing funding. The areas that will be worked in are discussed below. These have been determined through experience gained over the last eight years of work in machine-assistance-oriented systems development and dialogue with computer scientists and collection coordinators. These areas of technology work and application, though complex and challenging, are very important. That is, assuming it is important that the learn- ing/library community not be dis-intermediated by such technologies but instead becomes more fully empowered by them. This can only occur through developing a much larger role in actively defining, guiding, and putting the technologies to best possible use. Looking into the future, it is clear that libraries cannot simply continue to wait for or rely on good companies like MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 199 Google, OCLC, or OPAC creators to deliver them, much like a cargo cult, as they have in the past. To the degree that this is done, there is the risk of becoming vendor vectors blinded by the limitations of these companies and their product lines. These products are often incorrectly assumed to be the known technical and organizational universe of what is possible or doable. The revolutions coming in computing power together with the low cost of this power—which will be almost ubiquitously distributed among users of library collections and services—promise much more change than libraries have seen in the last decade. Among the changes underway are those in machine learning and machine assistance in libraries. As the changes take place, organization size may not guarantee much as, over the last decade, librarians and researches have witnessed large academic and other research libraries, with some exception, demonstrate a profound organizational entropy in almost direct propor- tion to the magnitude of what are essentially paradigm shifts in scholarly communications, information provi- sion, and research. To some degree, these simply reflect larger blockages within the universities and institutions in which libraries are embedded. As these changes play out, it should be noted that history in information or library- related public or scholarly information provision/access probably will not end with Google or OCLC—wonderful and fairly open companies—just as history in automobile manufacture has not ended with GM, computer manufac- ture with IBM, or Web finding with Alta Vista. With this as background and in the vein of open planning (as well as open services and open software) and given the size of the work areas addressed and their challenges, much of the projects’ technical planning and direction are being presented in this paper. These areas of computer and information-science research and develop- ment, which will affect libraries in many ways, are evolving rapidly into practical application. The current major research areas are: ■ Named-entity identification and extraction, and unified models of information extraction and data mining Named-entity identification and extraction is concerned with finding and harvesting generally concise factual data—often common bibliographic metadata—present in the targeted resource such as publisher, title, and publish- ing date. This type of metadata usually is associated with particular collections containing information objects that are often homogeneous (e.g., scientific article collections) and in which author-intended placement of metadata (or data) elements follows an established pattern and location in the object (e.g., an abstract is typically present and indi- cated in a pattern following presentation of title/author). While making extraction easy is one of the functions of metatagged metadata in Internet resources, generally few authors or collection coordinators in academia, or else- where, use metatags or applicable naming schema in any significant or uniform way (often, in fact, it is used very sparingly or not at all). Extractors therefore must be able not only to identify and harvest metatag metadata, but must discern and then extract specific metadata elements interspersed in bodies of text, as made identifiable by detecting the patterns of occurrence unique to the type of element as it occurs in the object or collection. Among the many advances planned for Data Foundations is the usage of conditional random fields.10 Important as well are user interfaces or dash boards that allow configuration of extractors whereby, as patterns of placement for desired data for extraction change in differing collections and types of objects, the tool can be configured appropriately to match the context and task. Also under development consideration are more hybrid, unified approaches to and models for data extraction and mining (as applies to text classification), using each to inform and improve the other.11 That is, a family of models is being developed for improving data mining of informa- tion in largely unstructured text by using methods that “have such tight integration that the boundaries between them disappear, and they can be accurately described as a unified framework for extraction and mining.”12 Much of this work is concerned with generating metadata for article/report-level collections. ■ Document-scale learning and classification A strong emphasis in the new work will be on document- scale machine learning, classification, and named-entity extraction in regard to collections of research papers, reports, theses, and monographs. Internet-object boundary detection is another impor- tant concern. Detecting and properly defining compound documents (e.g., Web hyper-books on multiple pages or sites) is a goal, as is identifying compound-document points of author-intended entry and intended-user paths (i.e., author-intended main connective threads in distrib- uted or compound documents).13 Relatedly, improved internal-document structure identification for better docu- ment-level classification and extraction is critical. Involved are standard-document internal-structure identification (e.g., abstract, introduction, summary text, captions for tables/figures) including units of rich text and micro- information units of text organized via subtopic.14 Methods of document-level word-and-phrase graphing as per 200 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 TextRank and other means of identifying small-world and micro-information units are currently being pursued.15 A strong emphasis as well will be on examining and implementing new means of co-referencing among docu- ments in collections and new means of identifying latent topics in a well-defined collection. By way of explanation, another term for co-referencing is co-citation. An example of such co-referencing is referencing work, described in papers, that has been funded through the same agency and program or that shares principal investigators in addition to standard bibliographic citation. This will improve on work done in CiteSeer.IST (ResearchIndex) and similar projects through integrating and advancing the promising approaches of Rexa open-source collection-management software.16 The focus of this effort will be on integrating article-level named-entity extraction as well as co-citation and bibliometric-refined subject identification within col- lections of papers/reports. ■ Individual text-classification algorithm and training method improvement New research on individual text-classification algorithms will be examined and applied. The emphasis here will be on prototyping and measuring how applicable recent promising scholarly work might be to library-related meta- data-generation challenges. The major focus continues to be in the area of applying controlled, library standard sub- ject vocabularies (e.g., LCSH, LCC, and DDC). Many of the improvements relate to advances in individual text-clas- sification algorithms, classifier training and fine-tuning, training-corpora cleanup and normalization techniques, and creating the ability for the individual classifiers to hybridize with other classifiers. Of special interest are classifiers that perform well with very large numbers of classes, both small and large amounts of text, and that yield probabilistic estimates in class assignment (e.g., of a particular LCSH). The latter allows both provision of multiple class assignments for resources that have mul- tiple subjects as well as greater accuracy and knowledge of the confidence level of the assignments (thresholds of confidence level in accuracy can be set in applying, or not, a particular classification). More specifically, this work will examine, test, and— depending on test results—refine recently improved variants of the most promising of several classification algorithms.17 Among those are: ■ Support Vector Machines (SVMs)18 ■ Logistic Regression (LR)19 ■ Naïve Bayes (NB)20 ■ Hidden Markov Models (HMMs)21 ■ kNN/kNN Model22 A number of metrics to measure performance of these and other text classifiers in regard to controlled subject assignment, in both fully-automated and user-interactive (semi-automated) modes, will also be employed.23 ■ Hybrid classifiers An important effort will be to test and develop new hybrid classifiers that incorporate the best capabilities of two or more in one classifier. Much of the current research has involved developing and improving new hybrids that combine the best of discriminative (e.g., LR, SVMs, decision trees) and generative (e.g., NB and Expectation Maximization) techniques in classification. For example, NB is fast but lacking in accuracy, while SVMs are accurate but can be slow to train. Hybrid models can produce better accuracy/coverage than either their purely generative or purely discriminative counterparts.24 Various combina- tions, among others, of LR, HMM, and SVM are among the most promising.25 ■ Ensemble classification or classifier fusion This constitutes one of the main current directions in clas- sification research and has been applied to a wide range of real-world challenges. Classification ensembles are reputed to be more accurate than any individual classifier making them up.26 An important focus is on experimenting with new approaches to automated and semi-automated ensemble classification that involves creating frameworks that support metaclassifiers or classifier-recommender systems to apply multiple classifiers, as appropriate, to the classification task.27 Developing classifier ensembles, including the metaclassifiers to guide them, is a major element in making possible the self-service aspect of an open, automated metadata-generation service, given that the metaclassifier is intended to determine the nature of the collection and classification task and assign the appro- priate classifier(s) to the job.28 It is probable that expert interaction at suitable points in this process will improve performance. ■ Distributed classification Classifier ensembles are often used for distributed data mining in order to discover knowledge from inherently distributed and heterogeneous information sources and to scale-up learning to very large databases (often the context for library-related tasks). However, standard methods MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 201 of combining multiple classifiers, such as stacking, can have high performance costs. New classifier combination strategies and methods of distributed interaction will be examined to better handle very large classification needs.29 Distributed classification, by nature, would be focused on improving large-scale self-service classification. ■ Semi-automated, expert-interactive classification Means of enabling semi-automated, expert-interactive classification will be presented.30 There is much scope for building interactive classifiers that engage the tool user or collection coordinator in an active dialogue (e.g., multiple iterations of machine/expert actions and feedback loops) that leads to incorporation of expert knowledge about specific classification tasks, metadata, and collections into the classifier, thus improving performance. That is, an active learning model can be extended significantly for these processes to include both feature-selection and docu- ment-labeling conversations, exploiting rapidly increasing computing power to give the user immediate feedback on choices to improve the classification process.31 Several different models featuring domain expert- interactive classification and extraction will be evaluated. These vary from being extremely interactive, emphasizing frequent machine assists, to less interactive, where experts profile, launch, and only occasionally refine a primarily machine process. The initial focus will be on the latter models. Note that iVia and Data Fountains have included a metadata generator with semi-automated record builder for years. OLLIE and HIClass are examples of systems that are more intensively expert-interactive.32 Classification tasks and collection types will be characterized as to which lend themselves to frequent expert interactions, occasional interactions, or more fully-automated modes (i.e., little interaction or initial profiling/definition only). ■ Classifier training and evaluation techniques As important as direct work on the classifiers is work emphasizing assessment, cleaning, and testing of clas- sifier-training data and classifier-evaluation techniques. Involved are training data/corpora-normalization tech- niques, document-clustering techniques, and classifier bias/variance-reduction techniques. Also involved on the classifier side are tuning issues in regard to the data at hand, including improved feature-selection techniques and determining and using confidence estimates in apply- ing/not applying classifications. Different approaches to these will be examined, tested, and refined with a range of training corpora. Diverse training and test data from assorted collection “types” will include standardized corpora as well as data from participating library or educational community proj- ects. That is, the techniques will be assessed with regard to how they perform with: (1) open Web resources, (2) col- lections of research papers, reports, theses, or monographs (working with Rexa), (3) typical campus Web-site pages, and (4) mixes of the above.33 Each collection-type focus will require differing approaches, algorithms, training, and fine-tuning techniques and will be evaluated through a number of measures.34 ■ Improved rich-text identification and extraction for improved classification and user search/ browse Rich text is text that has the role of conveying through traditional or new document structures or conventions (e.g., introductions, tables of contents, FAQs, and captions for figures) the author-intended subject(s) and intent of the information object. Being able to accurately identify and extract this material greatly aids in classifier performance by improving significant keyphrase identification as well as in user retrieval by enabling full-text retrieval. The avail- ability of natural-language text for searching is one means of helping to resolve problems encountered in searching controlled, library standard subject vocabularies (which in turn counteract problems searchers have when only natural-language retrieval is available). Both approaches are inherently complementary. Improvements in rich-text identification and harvest through improved means of document-structure learning (e.g., identifying text windows around links or captions for figures and tables) will be sought. The lightweight semantic (e.g., use of terms that indicate “aboutness” such as “about”, FAQs, introduction, and abstract; rating the frequency and uniformity of application of these terms in a given collection; and proportioning source of harvest) and markup clues will be refined as well. Identifying aboutness text, which can be seen as micro-information units of text organized via topic and subtopic, is being pursued through work with Rexa and others.35 ■ Improved focused crawling Focused crawling is an appropriately scaled method of crawling for many library collections (see Part I). It is used to discover new Internet resources by defined topic termi- nology and topic Web-link neighborhood. Topic similarity 202 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 and semantic analysis are key measures of significance that are combined with linkage co-citation measures to indicate significance or relevance of a new resource. Topic similarity among resources will be increasingly modeled through a topic-linkage matrix (i.e., semantic similarity map).367 New means of evaluating, fine-tuning, and improving basic crawling will be examined.37 Rules reflecting the specific semantics of each major subject area are to be developed by participants for crawls/classification. ■ Combined mining and extraction that support improved focused crawling in regard to best link pursuit and expert interaction The development of hybrid, unified approaches to extrac- tion and mining can be applied to focused crawling. The processes of data mining, rich-text identification and extraction, and the newest forms of focused crawling are starting to overlap and depend upon one another in important ways (as discussed in the section on preferential focused crawling). Another focus for development efforts will therefore be work to more systematically refine best- link pursuit with an eye toward combining advances in mining, extraction, and rich-text identification in focused crawling. This work will be undertaken to improve the work on NiFC. Focused crawling will improve in many situations, as well, through use of user-interactive compo- nents and data-visualization interfaces (e.g., control boards that visualize an interactive graph to aid in expert “lifting” of the values of specific sites/subtopic neighborhoods to better reflect their significance to the expert). This in turn that will help users guide and tune the crawling, in semi- automated fashion, to better fit the goals and context of a particular crawl. ■ Modeling different approaches for a self-service, openly accessible metadata-generation service(s) The Data Fountains and iVia efforts have some experience with modeling metadata-collection related services, having provided collaborative, scholarly virtual-library service successfully for more than a decade. The Data Fountains project has improved upon earlier work and represents an automated and semi-automated resource discovery, metadata generation, and rich-text identification and harvest service for cooperating collections. The intent is that Data Fountains be a self-service operation. In related effort, with co-operators at the National Science Digital Library (NSDL) and Library of Congress (LC), the Data Fountains project has been striving to develop self-service dash boards that collection managers can use to configure, profile, and satisfy their needs. By complementing initial profiling with ongoing interactive dialogue, guidance, and refinement, more precise task definition and tool utilization can be achieved. The goal is to have a service that can, through advanced interfaces, engage users in dialogue to help them better determine their options, the tasks involved in achieving them, the capabilities and limitations of the tools available, and therefore, the best choice of tools and practices given their specific service needs and the nature of their collections. ■ Summary of Part II There are many fronts of research in machine learning as applied to text processing and new-resource discovery in regard to collection building of various types, relevant to libraries, which have opened over the last few years. The Data Fountains/iVia research described is looking into just a few of these. For libraries, the borders between computer science, information science, and library science are dis- solving rapidly. It would be hard to devise or project for- ward a five-year plan for a large working library without some understanding of current and oncoming machine- learning and machine-assistance work in each of these disciplines, the many inter-connected organizational/com- munity/technical issues, and without an understanding that goes beyond the domain of current or developing products and services from existing vendors. ■ Part III: Issues and reflections Part III is intended to define and address some of the many challenges and issues that are arising or may arise as a result of work on machine-assistance tools in the areas of automated and semi-automated resource discovery, metadata generation, and rich, full-text identification and harvest. Included here are reflections on and questions about some of the probable implications and impacts of, as well as roadblocks to, machine-learning technologies applied to collection building. Addressed are probable impacts leading to changing roles for libraries, librarian expertise, library standard vocabularies/schema, and the organizations that are the stewards of library standards. These include: ■ What might be the effect of these technologies on library operations, including changes in the areas and nature of expenditure of expertise required, shifts in amount of expertise required, and changes in divisions of labor (both human/human and human/ machine)? MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 203 ■ What are the effects on libraries and end users when the coverage of finding-tool content can be greatly and inexpensively broadened and deepened? ■ How do current or traditional approaches to library- based practices and standards help foster or hinder these technologies? ■ How will best practices develop in regard to machine- assisted activities? ■ How do these technologies amplify and enable or simply prematurely dislodge librarian expertise? ■ Who will own these technologies and tools? ■ How open to evolution are library metadata stan- dards and the organizations entrusted with their stewardship? ■ How will these technologies impact these standards? Unfortunately, most of these questions will remain as questions unanswered. The few answers offered here must remain as tentative, contradictory, and flawed as those of most who dabble in the cottage industry of imagining library futures. Still, in the effort to help map some of the new information landscape that is becoming apparent, these reflections, developed over the course of the last few years, may be small contributions toward defining and understanding what is coming. ■ Licensing for automatic agents of libraries It will become increasingly important for libraries to develop licenses with commercial-resource vendors/pub- lishers that allow crawlers/classifiers and other automated programs, to be seen as agents of and for these libraries. It is important that automated agents be allowed to work with (e.g., create or enrich metadata and therefore increase end-user success in finding) both free and fee-based materials in much the same way that an expert bibliogra- pher, cataloger, or public-services librarian would when selecting, creating original metadata for, and providing access to a new commercially vended book intended to become part of a library or other well-defined collection. Automated agents accessing and processing fee-based, Internet-delivered information objects do so with the goal of improving the finding tools of the institution paying the fee to provide access for users to these objects (i.e., “library users”). Thus, they are engaged in a bona fide, fair use of the material by and for the purchasing/subscribing institu- tion. The metadata and descriptive information these tools develop help make the materials they process more visible in collection/finding-tool contexts, a goal which should be desirable by all parties (i.e., end user, subscribing library, and owning author/publisher). ■ New medium, new organization, and an over-proliferation of electronic toll booths and borders Another challenge is that Internet access to library-col- lection contents and library catalog-described data, both free and fee-based, is becoming increasingly restricted as libraries, library service organizations, and publishers grope to create special aggregations, with exclusive access for their clienteles. Countering this in their adherence to open access, have been, among others, services devel- oped by, for example, arXiv, the Institute of Museum and Library Services open archive, CDL eScholarship, OAIster, CiteSeer, and NSDL.38 Differences in the two approaches may increasingly become an issue. On the one hand there is the broad, long- term community ethic favoring open access to an Internet with few walls or borders, and authors enabled to publish directly via the Internet through open eprint collections or dual commercial/personal-site publishing/copyrighting of their work. On the other hand there is the fairly nar- row definition of an Internet information niche in which electronic/virtual services and collection access remain mapped restrictively to the sponsoring physical librar- ies/collections/institutions/publishers. Libraries face a contradiction or tension between these two approaches. The latter mode is a natural effort to retain a tightly held clientele and access model that has characterized physi- cal libraries, reflecting narrowly conceived and decades- old organizational/budget/certification/user models of physical-library services and publisher controls. Much of this practice is necessitated by commercial publishers (for whom libraries often have no alternative but to act as vectors), together with the lack of vision for and outdated stereotypes held of libraries by the larger organizations in which they find themselves. At the same time, much of the problem is also due to the inability of libraries to develop new cooperative organizational modes, models, and services that map better to the new medium, map better to new author and user benefits enabled by this medium, and that are better able to exploit fully and fluidly the new medium’s capabilities. The types of compartmen- talization of collections, access, and services needed for physical libraries and print, or necessitated by publisher restrictions, are increasingly an obstacle when projected onto Internet access and service capabilities. Thorough rethinking is needed, just as the educational and scholarly missions of the university as a whole must be thoroughly rethought in the light of Internet-associated technologies and capabilities.39 While the information highway must be paid for, over-compartmentalization based on dated organizational and service models is yielding an over-multiplication of 204 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 toll booths and border crossings among aggregations and collections. An example has been the emphasis at many University of California campus libraries on the single campus OPAC rather than the pooling of resources across UC libraries for the strengthening and refinement of CDL’s Melvyl Union Catalog. It is likely that with systemwide, multicampus shared resources, Melvyl could improve in all respects vastly beyond the single campus OPAC. This is noted in the Final Report of the Bibliographic Services Task Force of the University of California Libraries.40 Overall, institutional parochialism can and has greatly lessened the value and fluidity of the Internet as a medium for information provision. The booths and borders of tightly held collections make material harder to find, less visible, and less useful than would be true of more open, expansive collections and archives. As Dempsey stated, libraries need to find “better ways to match supply and demand in the open network. . . . We need new services that operate at the network level, above the level of individual libraries.”41 For crawlers and classifiers, the booths and bor- ders that are proliferating in libraries can act disjunctively as barriers, reducing their performance. There are few answers to the challenges that over-pro- liferation of booths and borders represent. They are often practical solutions to immediate needs. Still, projects that are exploring new avenues in organization and open, shar- able collections (and the standards they are based upon) should be further encouraged and supported community- wide. These include the open archives already mentioned and systems such as those iVia/Data Fountains work upon that to provide services for such collections in an open, inclusive, cooperative, participatory manner. While the answer will probably remain a mix of open (reflecting capabilities of media) and closed (reflecting organizational and vendor restraints) collections, it would be progress to move the balance point more toward the middle and away from so many booths and borders. ■ Note on the related issue of meta-search Libraries often respond to some of these open/closed/ multiple-collection aggregator and “brand” challenges and issues with meta-search services. Meta-search can serve to mask the fundamental, growing problem of increasing booths and borders. Meta-search, unlike the Internet-borne conceptions of open service, collections, access, systems, software, and standards, does not really ask us to change our fundamental assumptions, organiza- tions, or data architectures to match the capabilities of the new information medium. It does not ask us to cooperate more fully and share at the level of collection and data; it also doesn’t encourage uniform-standards adoption and development. While meta-search is a fine answer to certain needs, sometimes it is used as a technical means to attempt to avoid these more fundamental issues. In addition, meta-search can be constraining for user search/access—i.e., it frequently disallows use of signifi- cant or unique search and metadata capabilities of each individual database to which it is applied. Meta-search in libraries is becoming increasingly central, though it has many current operational flaws. Among these flaws are: ■ simplification or dumbing-down of search in order to access lowest-common-denominator fields; ■ clumsy cross-walking among fields, or metadata ter- minologies that really are not equivalents; ■ difficulty in collating results/eliminating duplicates; and ■ difficulty of matching differing results ranking weightings/systems held by different bases. Libraries emphasizing this approach may be increas- ingly themselves perceived as dumbed down by academ- ics, grad students, or serious researchers, who must reach beyond Google, the OPAC, and meta-search search and display. Instead of, or in addition to, meta-search, it might be wise to pursue more fully the hybrid database approach of combining heterogeneous records for multiple collec- tions (and multiple retrieval languages as needed) in one database.42 As computing power increases geometrically and price decreases drastically every couple of years, the challenge that the hybrid-database approach poses in regard to searching and maintenance of very large hybrid databases may soon become less of a problem. This power also implies that meta-search become more useful. ■ Library standard controlled subject schema/vocabularies As the promise of automated and semi-automated meta- data generation and related tools becomes better known, it may be important for the community as a whole to urge our major subject vocabulary standards organizations, i.e., LC and OCLC, to open more fully their standards and input in standard making for wider participation on the part of new communities of researchers, developers, and end users. Both organizations maintain important library standard subject vocabularies/schema, LCSH/LCC, and DDC, and related large bibliographic databases and clas- sifier-training data embodying these standards. In this work, both organizations need to more actively seek out and encourage a wider variety of open innovation and development, both within and outside of the library community. This means involving more researchers, end users, and other perspectives in the effort of contribut- ing to the more rapid evolution of these standards in an attempt both to better meet end-user finding needs and MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 205 to facilitate application of the standards through machine assistance. While OCLC and LC have been generous in providing their data and standards for iVia research (others that have been generous with training data have been the Cornell University Library and CDL), most known work on these standards is funneled through their organizations, allies, and organizational filters. This is, of course, critical to a point for coordination; however, if overdone it may unnecessarily inhibit wider pollinations, new perspectives (e.g., a wider variety of linguists, computer scientists, and subject vocabulary/schema experts from other disciplines such as medicine and the sciences), decision making, and faster movement forward. Informing the perspective here is that, while there are major costs involved in maintaining and coordinating these vocabularies/schemas, such costs are being borne directly or indirectly by the community in fees paid, mon- ies applied (often public monies through the large par- ticipating public university/land-grant libraries, among others), or labor volunteered/provided. LC is a public agency and OCLC a corporate cooperative. In many ways then, libraries, through their metadata expert/cataloger community, should be seen as “owning,” as both co-author and funding agent, more of a share in these vocabularies (and other standards in library metadata) than their stew- arding organizations. A significant portion of the success of thousands of individual libraries is dependent on the successful evolution (replacement?) of these standards through the facilitation efforts and new roles adopted by these two organizations. Ultimately, it must be recognized that in many ways, OCLC and LC metadata schema and vocabularies (as well as conventions, styles, and customs in practical applica- tion) represent the codified wisdom, in the form of very large knowledge bases, of decades of resource descrip- tion practice on the part of information professionals in thousands of institutions. The library community is the co-author of these, and OCLC and LC are their stewards. When viewing the community as owner, and when taking into account that the community needs to evolve more rapidly with its users to survive, then periodic clarifica- tion and renewal of the origin, intent, and understanding of the stewarding organizations and the standards they coordinate might help encourage more rapid, far-sighted change. Libraries may or may not sink to the degree that this is realized. In this light, it should be noted that some communities, including path-breaking projects within NSDL, have made well-reasoned decisions not to use these library subject vocabulary standards (Carl Lagoze, pers. comm.). These are just recent examples, given that abstracting and indexing services/databases, for the jour- nal literature, have in most cases long ago chosen to use their own specialist vocabularies, often supplementing these by enabling key-word or natural-language searching of abstracts or complete full-text. Among other core practical concerns here are that the library community’s standards may not be seen as useful and as widely applicable as other information communi- ties may desire. That is, if an important goal is to evolve and expand standards long associated with and emanating from the library community into becoming the standards of new, larger communities outside of libraries, then a more- guarded-than-not approach, which is slow to respond to early adaptors or innovators and slows sensible change, may not be the best path. Here it should be said that there are significant ongo- ing efforts to overcome some of the challenges and better evolve LCSH/LCC. OCLC’s Faceted Application of Subject Terminology (FAST) may represent a step in the right direc- tion.43 Having an entry-level vocabulary to translate end- user terminology to appropriate library subject standard vocabulary terms would be of great importance to most types of end user.44 OCLC has also been working with the Resource Description Network (RDN) to streamline DDC application.45 There just need to be more of these efforts moving at a more rapid clip. As MacEwan concluded in 1998, “if LCSH does not change it will sooner or later be abandoned. . . .”46 The same might be said of library subject vocabulary/classification standards. However, in the worst-case scenario, assuming the existing subject standards cannot evolve more rapidly to meet new user needs in information access, collec- tion building, and metadata creation, now may even be an appropriate juncture for a large-scale rethinking and rebuilding, from the ground up.47 The architecture, intent, end-user audience, form, and substance of these standards would need to be rebuilt and expanded. A capability for organizationally responding more quickly to what has amounted over the last few years to far-reaching paradigm shifts would be enabled. Now may be the time because, in addition to the questions of the openness/innova- tion/evolutionary adaptability of these standards, they exhibit significant, long-noted, functional flaws in terms of a non-librarian end user finding success. Among others often noted are: ■ Misuse/lack of understanding on the part of end users (and, rarely, poor learning materials and guidance sup- plied by librarians) due to real or perceived complex- ity, often associated with the use of subheadings and arcane terms that are far from intuitive for users).48 ■ Typically sparse application that doesn’t fully repre- sent the number or depth of topics addressed by a work. Despite the time needed to create the MARC record manually, very few LCSHs are applied (often three or less in the University of California’s Melvyl Union Catalog). ■ The arcane and overly general nature of many terms that sometimes do not accord with terminology used by practitioners in the field.49 206 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 ■ The lack of currency of terms describing new or recent phenomenon (see discussion of entry vocabu- lary.50 ■ The lack of uniformity of subject granularity in their application across multiple cataloging institutions for the same/similar works. ■ The significant amounts of expensive expert labor involved in their application. ■ Their complexity often at least partially assumes some expert mediation (that may not be available, given that access is increasingly from outside the library) or long-term experience with the vocabu- lary. ■ Overdone detail/complexity, some of it either not extremely useful to researchers and nonlibrarian end users or already instantly verifiable by users. ■ Their arcane-ness and complexity, which limits capa- bilities for machine assistance in application and, thus thwarts a major, inexpensive means for future collection growth, increased coverage, and more use- ful collections. Fortunately, and this is crucial, it turns out that much of the tonic needed for improvement may reside in the areas of inexpensively augmenting, as opposed to changing, the LCSH/LCC/DDC schema/vocabularies. For example, it is probable that most significant objects, when not digitized themselves, will be accompanied increasingly by digitized, representative cores of searchable natural-language rich text, as LC is doing with its table of contents digitization.51 Automated and semi-automated tools for rich-text iden- tification, extraction, and end-user searching are showing applicability now (see part I). Similarly, keyphrase identifi- cation and application can be accomplished automatically with a good degree of reliability; these processes play a role similar to rich text in providing useful retrieval terms and in augmenting subject searching with/without these controlled vocabularies. Finally, reasonably good overall subject gisting is occurring in the creation of annotation- like constructs. All of these—rich text, keyphrases, and annotation-like constructs alike—are of great potential value in addressing controlled subject vocabulary/schema inadequacies and in complementing LCSH/LCC/DDC in end-user finding. It is also probable that use of machine means to aug- ment overarching standard subject vocabularies with complementary and much more granular/detailed spe- cialist vocabularies (both expert created and controlled as well as those that are automatically invoked) will shortly be practical and prove very useful. Streamlined LCSH/ LCC/DDC could be made perhaps to function as linguistic “switching yards” with specialist vocabularies oriented to them and acting as extensions via the spine provided by the generalist vocabularies (similar to work being explored by Vizine-Goetz). All of this could be hinged on the syn- onomy and other term/concept relationships supplied by WordNet or other whole natural-language corpora.52 In such a manner, reconceived LCSH/LCC/DDC can basi- cally work as multi-vocabulary integration and translation tools in cases where the granularity of the subject becomes very fine-grained or specialized.53 Such synonymy, lin- guistic linkages, and switching capabilities would make possible more meaningful and accurate interrelations and more fluid user movement among the vocabularies and concepts of multiple disciplines and multiple-controlled vocabularies/schema. This would also better enable the end user when employing terms actually used by practi- tioners/researchers/students in their disciplines.54 These and other efforts are crucial because, despite their problems, LCSH/LCC/DDC are comprehensive, overarch- ing vocabularies and schema that, though complex (as are the subject vocabularies of BIOSIS and Pubmed/Medline, which successfully represent very large subject universes of their own), have done a generally useful job of repre- senting and coherently organizing finding terminology for most known worldly (and unworldly) phenomena. This, on any basis, is no easy task. These library standard vocabularies might best be seen as both essential connective tissue and as spines that could coherently thread many disciplines and interests, and many of the more specific vocabularies, together. Without such a spine, interdisciplinarians, researchers/students new to an area, and generalists—whose focus requires wide knowledge often across among many disciplines (and therefore subject vocabularies)—may find themselves handicapped. Each sub- and then sub-sub-specialization might develop its own mutually exclusive and contra- dictory terminology in a manner that natural-language substitutions such as keyphrase and rich-text availability can only partially fix. Many end users and librarians noted the downsides of natural-language-text-only searching two decades ago while using newspaper and other full-text databases offered by Dialog or BRS. Finally, one cannot ignore that LCSH/LCC/DDC have huge established bases of practitioners and metadata records employing them. Therefore, their value is large. To summarzie, the solutions to the problems inherent in using library standard subject vocabulary/schema and other controlled metadata will involve the following: ■ openness to extensive hybridization of approaches to rethinking subject vocabularies/schema and other metadata; ■ awareness of, design for, guidance of, and incorpora- tion of new machine-assisted technologies to boost collection coverage and reduce costs of application; ■ embracing machine assistance, as appropriate, as a means of amplifying and extending expertise and application; ■ applying existent technologies for generation of key- MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 207 phrases, description-like constructs, and rich text in order to augment controlled subject vocabularies; ■ developing a better conception of end-user metadata expectations and needs against the backdrop and expectations generated by the Web, such as instant end-user access/verification; and ■ making use of specialist vocabularies that might be dovetailed well with and coordinated through stan- dard vocabularies. ■ Invoked subject vocabularies—hierarchical and otherwise It is important to track recent research into automated and semi-automated means for creating (often referred to in the computer-science literature as “inducing” or extracting) hierarchical and other subject vocabularies/ontologies from natural-language corpora (see part II). The intent of this work is to have the natural-language terms used by practitioners directly populate and structure the sub- ject-finding approach. Automated induction of subject vocabularies will be useful to augment and increase the capabilities, flexibility, and interactivity of standard subject vocabularies/schema.55 At the very least, and this is important, they could func- tion to automatically suggest synonyms or new terminology for ongoing vocabularies/schema. And these approaches could be put to use in building entry-level vocabularies that front the vocabularies of the standards.56 They could also be used to aid in the semi-automated or automated repopulation/reworking of the standards, if large-scale, from-the-ground-up reworking is deemed necessary at some point. This would be done on a disci- pline-by-discipline, subject-by-subject basis. ■ Resource discovery, search engines, and your library’s subject portal Library collections, virtual libraries, portals, and Internet- enabled catalogs of openly accessible, significant Internet resources all function as “hubs” (see part I). Along with other types of expert-created hubs, they have played a role in providing most large, sophisticated, commercial search engines with a significant means for modeling and determining high-quality resources and, when accurate, a considerable portion of their accuracy. Though Google and others do not detail how their search algorithms work, most advanced crawlers highly weight (give authority to) sites that contain large numbers of links to research and other significant resources, especially when expert created. Similarly, resources from specific domains such as .edu, .org, and .gov, and institutions such as libraries, universi- ties, and scholarly societies can be identified and more highly weighted. This is another case of the community’s expertise/authority functioning as a knowledge base that, when offered as a public good (as library-created hubs often are), helps better enable directional tools for these commercial and noncommercial crawlers. There is nothing wrong with this as long as the community is aware of its contribution and as long as its efforts are recognized by these businesses. Expert library-based subject portals often reciprocate usage by using commercial engines for resource discovery, though this usually represents a minor way of collecting because other expert sources are preferred. ■ Enumeration of catalysts for, impacts of, and issues in machine assistance in the library community Related to these research and technical developments, the library community needs to think through a great many interrelated and diverse issues and questions regarding (1) impacts of the machine assistance we have been dis- cussing; (2) the possible massive automation of metadata generation and resource discovery in libraries, (3) who will “own” these technologies and ideas, and (4) changes in expectations/roles of metadata practitioners and standards and their stewards, in the following areas: ■ When will machine learning/machine assistance yield reliable, inexpensive, and therefore massive application of metadata on an Internet scale, that meets librarian, and more importantly, end-user expectations in terms of usefulness? Machine assis- tance should begin to be factored into long-term planning. ■ What will be the effects of this machine amplification in changing the importance/roles/content of subject standards? That is, how and to what degree will a new means and scale of application change these standards generally, and how they’re perceived and used by end users and librarians and, therefore, be applied by the library community? How might these standards themselves change both in terms of changes in and approaches to vocabulary and schema? That is to say, how would massive, machine-assisted appli- cation in and of itself change the makeup of the vocabulary, schema, and the styles/conventions with which they are applied? ■ How might the roles of the stewards of these stan- dards change, given massive application as well as possible interest on the part of other communi- ties? Can library standards penetrate and be effec- tively used by other information communities? What changes in the standards would be required to achieve this? 208 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 ■ What are the trade-offs between highly manual or craftsman/guild approaches and highly automated or more industrial approaches to applying meta- data? Within which contexts, collections, resources, and budgets are these approaches to be best used, either singly or combined in various proportions, in building/expanding a collection? How does each approach best complement the other in library collec- tions? ■ To what degree will changing end-user information usage and access patterns change approaches in regard to collection design and access assumptions, the metadata standards the collections are based upon, and the stewarding organizations of the stan- dards? ■ To what degree may labor and resource savings, as well as the ability to provide for more comprehen- sive collections, as offered by this technology, dictate changes within the library community in regard to expectations for metadata quality and specificity? In which information-seeking contexts and collections and to what degree will the Google-type record or minimal, streamlined DC become, if not a necessity themselves, then a pole toward which library biblio- graphic metadata evolves? ■ A question self-evident to most but not to all is: to what degree will the nature of the Internet itself continue to change our approach to supplying meta- data? Again, researchers in academic departments no longer need walk across campus to the library by virtue of having many bibliographic details of an object present in a metadata record. Increasingly, they can go to the object on the Internet and instantly verify the detail for themselves. Should libraries de- emphasize data elements/fields that are dependably and quickly end-user verifiable in favor of expend- ing more expertise, time, and resources in gisting/ describing the subject, intent, and perhaps even esti- mated quality or significance of the work? ■ In which specific ways will labor be saved and machines be capable of assisting in resource discov- ery and metadata generation? That is, what level of automation/semi-automation is acceptable to the community and reliably deployable in production over horizons of one to five years? What level of qual- ity/depth will users accept in metadata designed to occupy the continuum existing between the MARC record and the Google “record” (this being a large and significant service area; see part I)? How will this technology change old and enable new roles, tasks, and production routines for library subject experts and other staff? How will libraries ramp up and tran- sition into this? ■ Will the substantial potential economic advantages of automated or semi-automated generation of library standard metadata such as LCSH/LCC/DDC vocab- ularies/schema drive a rethinking toward greater uniformity/simplicity/streamlining of these stan- dards and conventions in their application, explicitly with machine application in mind? For example, per- haps only a subset of a whole vocabulary will be used and those that are used will become less detailed and less rich for experts but also—for most end users— less complex and arcane, and more intuitive.57 ■ In some ways, the existence of DC is a recognition that this kind of rethinking and streamlining of library description standards, in the interest of repre- senting and providing access to a much larger scale of communities and resources, is already well under way. What are the obstacles to greater usage of DC? ■ What should the balance be in streamlining metadata for automated application, in relation to its cur- rent complexity/depth while augmenting with rich text? From another perspective, what is the balance when considering the oversimplification and loss of descriptive power when using machine methods as compared with that otherwise achievable through use of subject expertise? How will libraries deter- mine best balances of expert and machine in regard to different tasks? How will this be quantified and determined through examination of user retrieval success/satisfaction—with this, in turn, factored against the backdrop of metadata creation costs, full- text data harvesting and retrieval, and the need for collections with much greater reach? ■ As accurate means of metadata and rich-text gen- eration for/from text objects improve, machine assis- tance will allow a shifting of expertise to provide better collection coverage and expression of subject- domain expertise (e.g., in abstracts). How will this new capability for breadth and depth be defined and used in library collections? For example, will new visual, multimedia, and data objects—which the Web has made possible on a mass basis and which librar- ies generally do not cover well—become a major goal in repurposing expertise since these do not eas- ily lend themselves to machine processing (Karen Calhoun, pers. comm.)? ■ Might streamlining and the usage of multiple depths/types of metadata application first require the acceptance within the community of the concept of the multitiered collection/database that supports multiple levels and types of heterogeneous resources representing differing levels of importance to users?58 Or, can this need be met through more fully evolved meta-search approaches? ■ Helping to structure this metadata heterogeneity might be the sliding-scale application of varying levels of metadata-generation labor expenditures and amounts/type of metadata, with the lower- MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 209 and middle-value resources receiving application of streamlined standard vocabularies/schema and rich text, automatically or semi-automatically, at low cost. High-value resources would continue to receive expert-applied, expensively created, com- plex, and high-quality metadata as well as rich text. Libraries already make such distinctions in qual- ity/significance to some degree through purchasing (e.g., departmental collecting profiles/weightings by subject and object type and cost) and order-of- cataloging priority decisions, as well as by student/ faculty input on specific items. More specifically, we would need to discuss and develop criteria in deter- mining the core or peripheral value of a resource for its subjects and user communities and then, based on the judgments derived, appropriately apportion amount and type of metadata and expert labor or machine assistance, on a sliding scale. Again, while it should be noted that the library community has gen- erally avoided rendering judgments on the possible use/relevance of a resource to a subject community, libraries nevertheless do routinely make general calls that effectively function this way to some degree. In making this judgement, it would be critical to involve resource users. Reviewer-researcher, library user, and librarian evaluations for purchases as well as find- ing tool/collection-usage statistics for the specific subject or author and item all could be woven into the means by which the core weighting of a resource could be assigned and be refined over time via usage. Developing this value is important from a library standpoint. It is a key that may help unlock solu- tions for some of the community’s bigger challenges, including those revolving around the best marriage of machine assistance with librarian expertise. How do libraries go about making these sliding-scale evaluations with some uniformity, among different collection types and interests, with an eye toward tasking expert and machine? ■ Can some of the general end-user search deficiencies commonly acknowledged for LCSH/LCC/DDC be rectified to some extent by automatically/semi-auto- matically providing rich full-text accompaniment for each record/resource, either in the form of “selected” excerpts verbatim or as processed into significant key- phrases representing this text? How could the pres- ence of this rich text not so much change as augment these standards? For example, rich full-text might be relied upon to contain detail that obviated the need to use certain LCSH subdivisions or other types of MARC metadata. Could inadequacies/inaccuracies in expert-applied and machine-applied metadata be partially countered, for end-user retrieval purposes, through the presence of rich full-text? Rich text, as well as keyphrases/terms and descriptions that serve the same purpose in this context, can now be reliably generated in many cases automatically. What would be the right mix of subject-vocabulary standard meta- data and accompanying, selected natural-language text for best end-user success? How might rich-text extraction and searching improve upon searching of whole-object full-text? How much rich text is needed and how distilled should it be? Large, whole-object full-text searching can often be a searcher’s quag- mire, clouding results rankings and weightings. ■ Could a new scale of application and interest on the part of new communities be better catalyzed through the incentive offered by opening up the LCSH/LCC/ DDC subject vocabularies/schema on an open-stan- dards/open-source, free-software model? ■ If development of these technologies is constrained with regard to action/inaction on the part of the community and its stewards, will the standards be replaced—or become obsolete—for major existing or prospective sectors of users? If so, what does this mean for the library community? ■ By and for whom is such standard subject vocab- ulary/schema application technology developed within the community? Classifiers are actually trained through great amounts of what, in many cases, is really community-created knowledge in order to apply community-developed schema/vocabularies. Smart crawlers and extractors similarly use (have “learned”) collectively created information patterns, derived from open-knowledge bases of various sorts. Who should own these tools/models and how open/ closed should the programming code/ideas be, con- sidering they could not be built without using the collective wisdom embodied in these knowledge bases? These tools exploit decades of labor by thou- sands of institutions, whose assumption has gener- ally been that the knowledge base and, by extension, the tools that are built on and benefit from it, are and should remain directly or indirectly, public goods. ■ For whom is machine learning/assistance in collec- tion building patented? The ideas, training corpora, algorithms, and data models discussed need to be observed and protected for the public domain to encourage their widespread and inexpensive avail- ability, as well as their evolution. The U.S. Patent and Trademark Office is now more commonly supporting the patenting of whole, generic processes that have heretofore had one or both feet in the commons, as compared with solely granting patent rights in more discrete areas of original invention. It would be unfortunate to find one day that machine assistance in collection building had been patented. This is especially an issue, given that there is little machine learning of interest to libraries that does not mine, apply, and extend the stored wisdom and knowledge that the community has built for decades. 210 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 ■ Summary of Part III It is important to think through and anticipate a great number of issues and concerns—including those of open models and open development—regarding machine-assis- tance tools (e.g., classifiers, extractors, and related algo- rithms/models) that generate library standard metadata, and identify and extract useful natural-language data. It is important because these tools could become central activi- ties in libraries over the next one to five years. Reflection here is especially appropriate, given the degree that these tools are trained on exemplars from library collections and come to distill and embody models of library metadata, standards, and expertise that represent the knowledge created over decades through the effort of a whole com- munity. It is important to think through what machine- assistance technologies in collection building imply for the future role of the librarian’s expertise. Specifically, libraries need to reconceptualize machine-assistance software not as fully automated “AI” but rather, as enabling expert driven, strongly interactive, “servo-mechanisms” that semi-automate some work to increase the reach, quality, and user-finding success within library collections. While it will probably start out with ten or fifteen minutes of expert time saved per record by such tools, this is a lot of time saved when aggregated across the entire community and will only increase. And the community needs to think through what this implies for the evolution of library- standard metadata, given that machine assistance will increasingly allow for massive and economic application, if a convergence of machine capabilities and machine- friendly metadata standards is architected. This large-scale amplification of usage will quite likely involve changing the value/roles of these standards for the community, as well as for the larger communities that may come to use them at the cost of simplification, streamlin- ing, and a greater reliance on end users to verify some of their own metadata details (often interacting directly with the digital resource). The tools also imply a restructuring of expertise and its application in metadata creation in libraries to reflect a division of labor, with semi-automated machine description processes spent on the mass of useful but mid- to lower-value materials; with and expert time being spent on high-value resources; and with both types of records residing in the same multitiered, heterogeneous collection.58 Finally, needing examination will be the roles of the stewardship organizations in: ■ shepherding the community’s metadata standards during a period of great change; ■ openly evolving the application of metadata stan- dards within the context of machine assignment for the greatest possible good; ■ rapidly evolving the application of metadata stan- dards to retain guidance of and to keep pace with open and proprietary developments in these areas; ■ distilling the metadata knowledge base and wisdom created by the community as this is transformed into the programmatic knowledge (rule bases and mod- els) used by new tools. This knowledge base is a priceless asset for the library community in sustaining service roles in an age of the large-scale advent of commercial-information access, delivery, and ownership. ■ Conclusion This article discusses work over the last several years in machine-learning software and services relevant to collection building in libraries. A number of promising avenues for exploration and research are detailed. Deeper understanding of and more direct involvement in areas of machine learning are urged for libraries in order to reflect advances in the computer sciences and other disciplines as well as to meet changing end-user needs among infor- mation seekers. ■ Acknowledgements The author would like to thank the U.S. Institute of Museum and Library Services; the Library of the University of California at Riverside; the National Science Foundation’s National Science Digital Library; the Fund for the Improvement of Post-Secondary Education of the U.S Department of Education; the Librarians Association of the University of California; and the Computing and Communications Group of the University of California at Riverside for current or past funding support. The author would also like to thank the Library of Congress; Cornell University Library; OCLC; and the California Digital Library for providing training data and other assistance for the research. Thanks to Karen Calhoun (Cornell University Library) and two anonymous readers for some excellent comments and suggestions. Finally, the author would like to commend iVia lead programmer Johannes Ruscheinski, primary author of the Data Fountains and iVia code bases, for his excellent work over the years, as well as Gordon Paynter, Walt Howard, Jason Scheirer, Keith Humphries, Anthony Moralez, Paul Vander Griend, Artur Kedzierski, Margaret Mooney, John Saylor, Laura Bartolo, Carlos Rodriguez, Jan Herd, Carolyn Larson, Diane Hillmann, and Ruth Jackson for their invaluable contributions to the MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 211 projects. The views expressed here are solely those of the author and not intended to represent those of the Library of the University of California, Riverside, our funding agencies, or cooperators. ■ References and notes 1. S. Mitchell et al., “iVia: Open Source Virtual Library Software,” D-Lib Magazine (January 2003). http://www.dlib .org/dlib/january03/mitchell/01mitchell.html (accessed Oct. 20, 2006); G. Paynter, “Developing Practical Automatic Meta- data Assignment and Evaluation Tools for Internet Resources,” in Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries (Denver: ACM Pr., 2005), 291–300 (Winner of the JCDL 2005 Vannevar Bush Best Paper Award), http://ivia.ucr.edu/ projects/publications/Paynter-2005-JCDL-Metadata-Assign- ment.pdf, (accessed Oct. 20, 2006); S. Mitchell, “Collaboration Enabling Internet Resource Collection-Building Software and Technologies,” Library Trends 53, no. 4 (May 2005): 604–19; J. Mason et al., “INFOMINE: Promising Directions in Virtual Library Development,” First Monday (2000), http://www.first monday.dk/issues/issue5_6/mason/ (accessed Oct. 20, 2006). 2. S. Mitchell, “INFOMINE: The First Three Years of a Vir- tual Library for the Biological, Agricultural, and Medical Sci- ences,” in Proceedings of the Contributed Papers Session, Biological Sciences Division, Special Libraries Association Annual Conference (Seattle: Special Libraries Assocation, 1997). 3. Mitchell, “Collaboration Enabling Internet Resource Col- lection-Building Software and Technologies.” 4. J. Phipps et al., “Orchestrating Metadata Enhancement Services: Introducing Lenny,” in Proceedings of DC-2005: Inter- national Conference on Dublin Core and Metadata Applications (Madrid, Spain: Universidad Carlos III de Madrid, 2005), http://arxiv.org/pdf/cs.DL/0501083, (accessed Oct. 20, 2006). 5. Mason et al., “INFOMINE: Promising Directions in Vir- tual Library Development.” 6. Ibid. 7. S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext (San Francisco: Morgan Kauffman, 2003); S. Chakrabarti et al., Accelerated Focused Crawling through Online Relevance Feed- back, http://www2002.org/CDROM/ refereed/336/ (accessed Oct. 20, 2006); S. Chakrabarti, The Structure of Broad Topics on the Web, http://www2002.org/CDROM/refereed/338/index.html (accessed Oct. 20, 2006); S. Chakrabarti, Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction, http://www10.org/cdrom/papers/489 (accessed Oct. 20, 2006). 8. Chakrabarti et al., Accelerated Focused Crawling; F. Menc- zer, “Mapping the Semantics of Web Text and Links” IEE Internet Computing, 9, no. 3 (May/June 2005): 27–36; F. Menczer, G. Pant, and P. Srinivasan, “Topical Web Crawlers: Evaluating Adaptive Algorithms” Transactions on Internet Technology, 4, no 4 (2004): 378–; F. Menczer, “Correlated Topologies in Citation Networks and the Web” European Physical Journal B, 38 no. 2 (March 2004): 211–21. 9. S. Mitchell, “Data Fountains Survey,” 2005, http:// datafountains.ucr.edu/ datafountainssurvey.doc, (accessed Oct. 20, 2006). 10. A. Culotta and A. McCallum, “Confidence Estimation for Information Extraction,” in Proceedings of Human Language Technology Conference and North American Chapter of the Asso- ciation for Computational Linguistics (Boston: Association for Computational Linguistics, 2004), http://www.cs.umass.edu/ ~mccallum/papers/crfcp-hlt04.pdf, (accessed Oct. 20, 2006); F. Peng and A. McCallum, “Accurate Information Extraction from Research Papers Using Conditional Random Fields,” in Pro- ceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (2004). http://ciir.cs.umass.edu/pubfiles/ir-329.pdf, (accessed Oct. 20, 2006); C. Sutton and A. McCallum, “An Introduction to Conditional Random Fields for Relational Learning,” in Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, eds. (Cambridge, Mass.: MIT Pr., 2006). http://www .cs.umass.edu/~mccallum/papers/crf-tutorial.pdf, (accessed Oct. 20, 2006). 11. A. McCallum and D. Jensen, “A Note on the Unification of Information Extraction and Data Mining Using Conditional- Probability, Relational Models,” in Proceedings of the IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Aca- pulco, Mexico: IJCAI, http://www.cs.umass.edu/~mccallum/ papers/iedatamining-ijcaiws03.pdf, (accessed Oct. 20, 2006); U. Nahm and R. Mooney, “A Mutually Beneficial Integration of Data Mining and Information Extraction,” in Proceedings of the American Association for Artificial Intelligence/Innovative Applica- tions of Artificial Intelligence (Austin, Texas: American Asso- ciation for Artificial Intelligence, 2000). http://www.cs.utexas .edu/users/ ml/papers/discotex-aaai-00.pdf, (accessed Oct. 20, 2006); R. Raina et al., “Classification with Hybrid Generative/ Discriminative Models,” in Proceedings of Neural Information Pro- cessing Systems (2003). http://www.cs.umass.edu/~mccallum/ papers/hybrid-nips03.pdf, (accessed Oct. 20, 2006); G. Bouchard and B. Triggs, “The Trade-Off Between Generative and Discrimi- native Classifiers,” COMPSTAT 2004. (Prague: Springer, 2004) http://lear.inrialpes.fr/pubs/2004/BT04/Bouchard-comp stat04.pdf, (accessed Oct. 20, 2006). 12. McCallum and Jensen, “A Note on the Unification of Information Extraction.” 13. N. Eiron and K. McCurley, “Untangling Compound Docu- ments on the Web,” in Conference on Hypertext (Nottingham, UK: ACM Conference on Hypertext and Hypermedia, 2003), http://citeseer .ist.psu.edu/eiron03untangling.html, (accessed Oct. 20, 2006). http://www.almaden.ibm.com/cs/people/mccurley/pdfs/ pdf.pdf, (accessed Oct. 20, 2006); P. Dimitriev et al., “As We May Perceive: Inferring Logical Documents from Hypertext,” presented at HT 2005, 16th ACM Conference on Hypertext and Hypermedia (Salzburg: ACM, 2005); K. Tajima, “Finding Context Paths for Web Pages,” in Proceedings of ACM Hypertext (Darm- stad, Germany: ACM, 1999), http://www.jaist.ac.jp/~tajima/ 212 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 papers/ht99www.pdf, (accessed Oct. 20, 2006); K. Tajima et al., “Discovery and Retrieval of Logical Information Units in Web,” in Proceedings of the Workshop of Organizing Web Space (in conjunc- tion with ACM Conference on Digital Libraries) (Berkeley, Calif.: ACM, 1999), 13–23, http://www.jaist.ac.jp/~tajima/ papers/ wows99www.pdf, (accessed Oct. 20, 2006); E. de Lara et al., “A Characterization of Compound Documents on the Web,” TR99-351, University of Toronto (1999), http://www.cs.toronto .edu/~delara/papers/compdoc.pdf, (accessed Oct. 20, 2006), http://www.cs.toronto.edu/~delara/ papers/compdoc_html/, (accessed Oct. 20, 2006); L. Xiaoli et al., “Web Search Based on Micro Information Units,” (Honolulu, Hawaii: Eleventh Inter- national World Wide Web Conference, 2002), http://www2002 .org/CDROM/poster/78.pdf, (accessed Oct. 20, 2006); W. Lee et al., Retrieval and Organizing Web Pages by Information Unit, http://www10.org/cdrom/papers/466/, (accessed Oct. 20, 2006). 14. Tajima et al., “Discovery and Retrieval of Logical Informa- tion Units in Web”; Xiaoli et al., “Web Search Based on Micro Information Units”; Lee et al., Retrieval and Organizing Web Pages. 15. R. Mihalcea, “Graph-Based Ranking Algorithms for Sen- tence Extraction, Applied to Text Summarization,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, companion volume (Barcelona, Spain: Associa- tion for Computational Linguistics, 2004), http://www.cs.unt .edu/~rada/papers/mihalcea.acl2004.pdf, (accessed Oct. 20, 2006); R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (Barcelona, Spain: Empirical Meth- ods in Natural Language Processing, 2004), http://www.cs.unt .edu/~rada/papers/mihalcea.emnlp04.pdf, (accessed Oct. 20, 2006); R. Mihalcea, P. Tarau, and E. Figa, “PageRank on Semantic Networks, with Application to Word Sense Disambiguation,” in Proceedings of the 20th International Conference on Computational Linguistics (Geneva, Switzerland: COLING 2004). http://www .cs.unt.edu/~rada/papers/ mihalcea.coling04.pdf, (accessed Oct. 20, 2006); Y. Matsuo et al., “KeyWorld: Extracting Keywords in a Document as a Small World,” in Proceedings of Discovery Sci- ence (Berlin, New York: Springer, 2001), 271–81 (Lecture Notes in Computer Science, v. 2226), http://www.miv.t.u-tokyo.ac.jp/ papers/matsuoDS01.pdf, (accessed Oct. 20, 2006); Y. Matsuo and M. Ishizuka, “Keyword Extraction from a Single Document Using Word Co-Occurrence Statistical Information,” Interna- tional Journal on Artificial Intelligence Tools 13, no.1 (2004): 157–69, http://www.miv.t.u-tokyo.ac.jp/papers/matsuoIJAIT04.pdf, (accessed Oct. 20, 2006); Xiaoli et al., “Web Search Based on Micro Information Units”; Lee et al., Retrieval and Organizing Web Pages; G. Forman and Ira Cohen, “Learning from Little: Com- parison of Classifiers Given Little Training,” Tech Report: HPL- 2004-19R1 20040719 (Palo Alto, Calif.: Hewlett-Packard Research Labs., 2004), http://www.hpl.hp.com/techreports/2004/HPL -2004-19R1.pdf, (accessed Oct. 20, 2006). 16. G. Mann et al., “Bibliometric Impact Measures Leveraging Topic Analysis,” (in press), in Proceedings of the Joint Conference on Digital Libraries (2006). http://www.cs.umass.edu/~mccallum/ papers/impact-jcdl06s.pdf, (accessed Oct. 20, 2006). 17. R. Bouckaert and E. Frank, “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms,” in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. (Berlin, New York: Springer-Verlag, 2004), 3–12 (Lecture Notes in Computer Science, v. 3056), http://www .cs.waikato.ac.nz/~ml/publications/2004/bouckaert-frank.pdf, (accessed Oct. 20, 2006); R. Bouckaert, “Estimating Replicabil- ity of Classifier Learning Experiments,” in Proceedings of the International Conference on Machine Learning (2004), http://www. cs.waikato.ac.nz/~ml/publications/2004/bouckaert-estimat- ing.pdf, (accessed Oct. 20, 2006); R. Caruana and A. Niculescu- Mizil, “Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria,” in KDD-2004: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York: ACM Press, 2004), http://perfs.rocai04.revised.rev1.ps, (accessed Oct. 20, 2006). 18. J. Zhang et al., “Modified Logistic Regression: An Approx- imation to SVM and Its Application in Large-Scale Text Cat- egorization,” in Proceedings: Twentieth International Conference on Machine Learning (Menlo Park Calif.: AAAI Press, 2003), 888–97, http://www.informedia.cs.cmu.edu/documents/icml03zhang .pdf, (accessed Oct. 20, 2006); Y-C. Chang, “Boosting SVM Classifiers with Logistic Regression,” Technical Report. (Tai- pei: Institute of Statistical Science, Academia Sinica, 2003), http://www.stat.sinica.edu.tw/library/c_tec_rep/2003-03.pdf, (accessed Oct. 20, 2006); T. Zhang and F. Oles, “Text Categori- zation Based on Regularized Linear Classification Methods,” Information Retrieval 4, no. 1 (2001): 5–31, http://www.research .ibm.com/people/t/tzhang/pubs.html, (accessed Oct. 20, 2006); T. Joachims, “SVMlight,” (including SVMmulticlass, SVMstruct, SVMHMM) (software, 2005), http://svmlight.joachims.org/, (accessed Oct. 20, 2006); C. Chang and C-J. Lin, “LIBSVM,” (software, 2005), http://www.csie.ntu.edu.tw/~cjlin/libsvm/, (accessed Oct. 20, 2006); C-W Hsu and C-J Lin, “BSVM,” (soft- ware, 2003), http://www.csie.ntu.edu.tw/~cjlin/bsvm/index .html, (accessed Oct. 20, 2006); T. Finley and T. Joachims, “Supervised Clustering with Support Vector Machines,” in Proceedings of the International Conference on Machine Learning (New York: ACM Press, 2005), http://www.cs.cornell.edu/ People/tj/publications/finley_joachims_05a.pdf, (accessed Oct. 20, 2006); I. Tsochantaridis et al., “Support Vector Machine Learning for Interdependent and Structured Output Spaces,” in Proceedings of the International Conference on Machine Learning (New York: ACM Press, 2004), http://www.cs.cornell.edu/ People/tj/publications/tsochantaridis_etal_04a.pdf, (accessed Oct. 20, 2006) ; S. Godbole and S. Sarawagi, “Discriminative Methods for Multi-Labeled Classification,” in Proceedings of the Pacific-Asia Conferences on Knowledge Discovery and Data Min- ing (2004), http://www.it.iitb.ac.in/~shantanu/work/pakdd04 .pdf, (accessed Oct. 20, 2006); L. Cai and T. Hofmann, “Hierarchi- cal Document Categorization with Support Vector Machines,” in Proceedings of the ACM 13th Conference on Information and Knowl- MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 213 edge Management (2004), http://www.cs.brown.edu/people/ th/publications.html, (accessed Oct. 20, 2006); T. Hofmann et al., “Learning with Taxonomies: Classifying Documents and Words,” in Proceedings of the Workshop on Syntax, Seman- tics, and Statistics, Neural Information Processing (2003), http:// www.cs.brown.edu/people/ th/publications.html, (accessed Oct. 20, 2006); A. Tveit, “Empirical Comparison of Accuracy and Performance for the MIPSVM Classifier with Existing Classifiers,” Technical Report, Division of Intelligent Systems, Department of Computer and Information Science, Norwegian University of Science and Technology. (Trondheim, Norway, 2003), http://www.idi.ntnu.no/~amundt/publications/2003/ MIPSVMClassificationComparison.pdf, (accessed Oct. 20, 2006); C-W Hsu and C-J Lin, “A Comparison of Methods for Multi- Class Support Vector Machines,” IEEE Transactions on Neu- ral Networks 13, no. 2 (2002): 415–25, http://www.csie.ntu .edu.tw/~cjlin/papers/multisvm.pdf, (accessed Oct. 20, 2006). 19. P. Komarek, “Logistic Regression for Data Mining and High-Dimensional Classification” (Ph.D. thesis, Carnegie Mellon University, 2004), 138; P. Komarek and A. Moore, “Fast Robust Logistic Regression for Large Sparse Data Sets with Binary Outputs,” Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. January 3–6, 2003, Hyatt Hotel, Key West, Florida, ed. By Christopher M. Bishop and Brendan J. Frey. http://research.Microsoft.com/ conferences/AIStats2003/proceedings/174.pdf (accessed Nov. 23, 2006); A. Popescul et al., “Towards Structural Logistic Regression: Combining Relational and Statistical Learning,” in MRDM 2002: Workshop on Multi-Relational Data Mining, http://www-ai.ijs.si/sasodzeroski/MRDM2002/proceed ings/popesul.pdf (accessed Nov. 23, 2006); J. Zhang and Y. Yang, “Probabilistic Score Estimation with Piecewise Logistic Regression,” in Proceedings: Twenty-first International Conference on Machine Learning (Menlo Park, Calif.: AAAI Press, 2004), http://www-2.cs.cmu.edu/~jianzhan/papers/icml04zhang .pdf, (accessed Oct. 20, 2006); Zhang et al., “Modified Logistic Regression”; Zhang and Oles, “Text Categorization”; Multi-class LR is discussed in Zhang et al., 2003, and Chang, 2003 (reference 18). 20. Some recent work on NB can be seen in J. Rennie, “Tack- ling the Poor Assumptions of Naive Bayes Text Classifiers,” in T. Fawcett and N. Mishra, eds., Proceedings of the 20th Interna- tional Conference on Machine Learning (Washington, D.C.: AAAI Pr., 2003), 616–23, http://haystack.lcs.mit.edu/papers/rennie .icml03.pdf, (accessed Oct. 20, 2006); K. Schneider, “Tech- niques for Improving the Performance of Naive Bayes for Text Classification,” in Computational Linguistics and Intelli- gent text processing: Sixth International Conference, CICLing2005, Mexico City, Mexico, February 13–19, 2005: Proceedings (New York: Springer, 2005). (Lecture Notes in Computer Science, 3406). 682–93, http://www.phil.uni-passau.de/linguistik/ schneider/pub/cicling2005.html, (accessed Oct. 20, 2006); E. Frank et al., “Locally Weighted Naive Bayes,” in Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence (Acapulco: Morgan Kaufmann, 2003), 249–56, http://www .cs.waikato.ac.nz/~eibe/pubs/UAI_200.ps.gz, (accessed Oct. 20, 2006); G. Webb et al., “Not so Naive Bayes: Aggregating One-Dependence Estimators,” Machine Learning 58, no. 1 (Jan. 2005): 5–24, http://www.csse.monash.edu.au/~webb/Files/ WebbBoughtonWang05.pdf, (accessed Oct. 20, 2006); E. Keogh and M. Pazzani, “Learning the Structure of Augmented Bayes- ian Classifiers,” International Journal on Artificial Intelligence Tools 11, no. 4 (2002): 587–601, http://www.ics.uci.edu/~pazzani/ Publications/tools.pdf (accessed Oct. 20, 2006). 21. McCallum and Jensen, “A Note on the Unification of Information Extraction and Data Mining”; Joachims, “SVM- light”; Y. Altun et al., “Hidden Markov Support Vector Machines,” in Proceedings of the 20th International Conference on Machine Learning (Menlo Park, Calif.: AAAI Press, 2003), http://www.cs.brown.edu/people/th/publications.html (accessed Oct. 20, 2006); A. Ganapathiraju et al., “Hybrid SVM/HMM Architectures for Speech Recognition,” in Advances in Neural Information Processing Systems 13: Proceed- ings of the 2000 Conference (Cambridge, Mass.: MIT Press, 2001), http://www.nist.gov/speech/publications/tw00/pdf/cp210 .pdf (accessed Oct. 20, 2006); D. Freitag and A. McCallum, “Information Extraction with HMM Structures Learned by Stochastic Optimization,” in Proceedings of the 18th Conference on Artificial Intelligence (Austin, TX.: AAAI Press, 2000) http:// www.cs.umass.edu/~mccallum/papers/iehill-aaai2000s .ps (accessed Oct. 20, 2006); S. Basu et al., “A Probabilistic Framework for Semi-Supervised Clustering,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining (Seattle, Wash.: 2004), 59– 68, http://www.cs.utexas.edu/users/ml/papers/semi-kdd -04.pdf, (accessed Oct. 20, 2006). 22. T. Liu et al., “Efficient Exact kNN and Nonparametric Classification in High Dimensions,” in Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Con- ference (Cambridge, Mass.: MIT Press, 2001). http://www .autonlab.org/autonweb/showPaper.jsp?ID=Liu-knn, (accessed Oct. 20, 2006); G. Guo et al., “KNN Model-Based Approach in Classification,” in Lecture Notes in Computer Science, vol. 2888 (Heidelberg: Springer Berlin, 2003), 986–96, http://www .icons.rodan.pl/publications/%5BGuo2003%5D.pdf (accessed Oct. 20, 2006) 23. Bouckaert and Frank, “Evaluating the Replicability of Significance Tests”; Bouckaert, “Estimating Replicability of Clas- sifier Learning Experiments”; Caruana and Niculescu-Mizil, “Data Mining in Metric Space”; R. Caruana and T. Joachims, “PERF (Data Mining Evaluation Software),” in Proceedings of the Conference on Knowledge Discovery and Data Mining (2004). http://kodiak.cs.cornell.edu/kddcup/software.html (accessed Oct. 20, 2006); Paynter, “Developing Practical Automatic Meta- data.” 24. Raina et al., “Classification with Hybrid Generative/ Discriminative Models”; Bouchard and Triggs, “The Trade-Off Between Generative and Discriminative Classifiers.” 25. Ibid; Zhang et al., “Modified Logistic Regression”; Chang, “Boosting SVM Classifiers with Logistic Regression”; Joachims, 214 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 “SVMlight”; L. Shih et al., “Not Too Hot, Not Too Cold: The Bundled SVM Is Just Right!” in Proceedings of the ICML-2002 Workshop on Text Learning (2002). http://people.csail.mit.edu/ u/j/jrennie/public_html/papers/icml02-bundled.pdf (accessed Oct. 20, 2006); F. Fukumoto and Y. Suzuki, “Manipulating Large Corpora for Text Classification,” in Proceedings of the Conference on Empirical Methods in Natural-Language Processing (Philadel- phia: Association for Computational Linguistics, 2002), 196–203, http://acl.ldc.upenn.edu/W/W02/W02-1026.pdf (accessed Oct. 20, 2006); Altun et al., “Hidden Markov Support Vector Machines; Ganapathiraju et al., “Hybrid SVM/HMM Architec- tures”; Liu et al., “Efficient Exact k-NN”; A. Ng and M. Jordan, “On Discriminative versus Generative Classifiers: A Com- parison of Logistic Regression and Naive Bayes,” in Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference (Cambridge, Mass.: MIT Press, 2002), http:// www.robotics.stanford.edu/~ang/ papers/nips01-discriminati- vegenerative.ps (accessed Oct. 20, 2006); K. Nigam et al., “Text Classification from Labeled and Unlabeled Documents Using EM,” Machine Learning 39, nos. 2/3 (2000): 103–34, http://www .kamalnigam.com/papers/emcat-mlj99.pdf (accessed Oct. 20, 2006). 26. G. Valentini and F. Masulli, “Ensembles of Learning Machines,” in Neural Nets WIRN Vietri-02, Series Lecture Notes in Computer Sciences, M. Marinaro and R. Tagliaferri, eds. (Heidelberg: Springer-Verlag, 2002), http://www.disi.unige.it/ person/MasulliF/papers/masulli-wirn02.pdf (accessed Oct. 20, 2006). 27. Ibid.; R. Caruana et al., “Ensemble Selection from Librar- ies of Models” in Proceedings: Twenty-first International Conference on Machine Learning (Menlo Park, Calif.: AAAI Press, 2004). http://www.cs.cornell.edu/~alexn/shotgun.icml04.revised. rev2.pdf (accessed Oct. 20, 2006); G. Tsoumakas, “Effective Vot- ing of Heterogeneous Classifiers,” in Machine Learning ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20–24, 2004: Proceedings. (Berlin, New York: Springer, 2004), http://users.auth.gr/~greg/Publications/tsoumakas -ecml2004.pdf (accessed Oct. 20, 2006); J. Fürnkranz, “On the Use of Fast Sub-Sampling Estimates for Algorithm Recommen- dation,” Technical Report TR-2002-36 (Wien: Österreichisches Forschungsinstitut für Artificial Intelligence, 2002), http://www .ofai.at/cgi-bin/get-tr?paper=oefai-tr-2002-36.pdf (accessed Oct. 20, 2006); A. Seewald, 2002. “Meta-Learning for Stacked Classifi- cation,” (extended version) in Proceedings of the 2nd International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support, and Meta-Learning (University of Helsinki, Department of Computer Science, Report B-2002-3, 2002), http:// www.ofai.at/cgi-bin/get-tr?download=1&paper=oefai-tr-2002 -05.pdf (accessed Oct. 20, 2006); A. Seewald and J. Fürnkranz, “An Evaluation of Grading Classifiers,” in Advances in Intelli- gent Data Analysis: Proceedings of the 4th International Symposium (Lisbon, Portugal: Springer-Verlag, 2001), http://www.ofai.at/ cgi-bin/get-tr?paper=oefai-tr-2001-01.pdf (accessed Oct. 20, 2006); P. Bennett et al., “The Combination of Text Classifiers Using Reliability Indicators,” Technical Report. Microsoft and Information Retrieval 8, no. 1 (2005): 67–100, http://research .microsoft.com/~horvitz/tclass_combine.pdf (accessed Oct. 20, 2006); Y. Kim et al., “Optimal Ensemble Construction via Meta- Evolutionary Ensembles,” Expert Systems With Applications 30, no. 4 (in press 2006), http://www.informatics.indiana.edu/fil/ Papers/mee-eswa.pdf (accessed Oct. 20, 2006). 28. S. Godbole, “Document Classification as an Internet Ser- vice: Choosing the Best Classifier” (masters thesis, IIT Bombay, 2001). http://www.it.iitb.ac.in/~shantanu/work/mtpsg.pdf (accessed Oct. 20, 2006). 29. K. Liu and H. Kargupta, “Distributed Data Mining Bibli- ography: Release 1.7,” (Baltimore: University of Maryland, Com- puter Science Department, 2006), http://www.csee.umbc.edu/ ~hillol/DDMBIB/ (accessed Oct. 20, 2006); A. Prodromidis and P. Chan, “Meta-Learning in Distributed Data Mining Systems: Issues and Approaches,” in Advances of Distributed Data Mining, Hillol Kargupta and Philip Chan, eds. (Menlo Park, Calif. : AAAI/ MIT Press, 2000). http://www1.cs.columbia.edu/~andreas/ publications/DDMBOOK.ps.gz (accessed Oct. 20, 2006); G. Tsoumakas and I. Vlahavas, “Distributed Data Mining of Large Classifier Ensembles,” in Methods and Applications of Artificial Intelligence: Second Hellenic Conference on AI, SETN 2002, Thes- saloniki, Greece, April 11–12, 2002: Proceedings, (Berlin, New York: Springer, 2002), 249–56, http://users.auth.gr/~greg/Pub- lications/ddmlce.pdf (accessed Oct. 20, 2006); R. Khoussainov et al., “Grid-Enabled Weka: A Toolkit for Machine Learn- ing on the Grid,” ERCIM News no. 59, (Oct. 2004), http:// www.ercim.org/publication/Ercim_News/enw59/khussainov .html (accessed Oct. 20, 2006). 30. S. Godbole et al., “Document Classification through Inter- active Supervision of Document and Term Labels,” in Knowledge Discovery in Databases: PKDD 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, September 20–24, 2004: Proceedings (Berlin; New York: Springer, 2004), http://www.it.iitb.ac.in/~shantanu/work/ pkdd04.pdf (accessed Oct. 20, 2006).; H. Yu et al., “PEBL: Posi- tive Example Based Learning for Web Page Classification Using SVM,” in KDD-2002: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (New York: ACM Pr., 2002), 239–48, http://eagle.cs.uiuc.edu/ pubs/2002/pebl-kdd02.pdf (accessed Oct. 20, 2006); T. Krist- jannson et al., “Interactive Information Extraction with Con- strained Conditional Random Fields,” in Proceedings: Nineteenth National Conference on Artificial Intelligence (AAI-04) (Menlo Park, Calif.: AAAI Press; Cambridge, Mass.: MIT Press, 2004), http://www.cs.umass.edu/~mccallum/papers/addrie-aaai04. pdf (accessed Oct. 20, 2006); V. Tablan et al., “OLLIE: On-Line Learning for Information Extraction,” in Proceedings of the HLT-NAACL Workshop on Software Engineering and Architecture of Language Technology Systems: Edmonton, Canada: 2003. (New York: ACM, 2003), http://gate.ac.uk/sale/hlt03/ollie-sealts.pdf (accessed Oct. 20, 2006). 31. Godbole et al., “Document Classification.” 32. Ibid.; Tablan et al., “OLLIE: On-Line Learning for Infor- mation Extraction.” MACHINE ASSISTANCE IN COLLECTION BUILDING | MITCHELL 215 33. G. Mann et al., “Bibliometric Impact Measures,” (in press). 34. Bouckaert and Frank, “Evaluating the Replicability of Significance Tests”; Caruana and Niculescu-Mizil, “Data Mining in Metric Space”; Caruana and Joachims, “PERF (Data Mining Evaluation Software).” 35. Mann et al., “Bibliometric Impact Measures”; Matsuo et al., “KeyWorld”; Matsuo and Ishizuka, “Keyword Extraction from a Single Document”; Lee et al., Retrieval and Organizing Web Pages; Tajima et al., “Discovery and Retrieval of Logical Information.” (See also the sections on Hybrid, Unified Models, and Document Scale Learning and Classification, above.) 36. Menczer, “Mapping the Semantics of Web Text and Links.” 37. P. Srinivasan et al., “A General Evaluation Framework for Topical Crawlers,” Information Retrieval 8, no. 3 (2005): 417–47, http://www.informatics.indiana.edu/fil/Papers/ crawl_framework.pdf (accessed Oct. 20, 2006); A. Maguit- man et al., “Algorithmic Computation and Approximation of Semantic Similarity,” (in press, 2006). To appear in World Wide Web Journal. http://www.informatics.indiana.edu/fil/Papers/ semsim_extended.pdf (accessed Oct. 20, 2006). 38. ArXiv. Cornell University Library, http://arxiv.org/ (accessed Oct. 20, 2006); CiteSeer.IST (formerly ResearchIndex), http://citeseer.ist.psu.edu/ (accessed Oct. 20, 2006); eScholarship Repository, California Digital Library, http://repositories.cdlib .org/escholarship/, (accessed Oct. 20, 2006); National Science Foundation, National Science Digital Library, http://nsdl.org/ (accessed Oct. 20, 2006); OAIster. Digital library production ser- vice (University of Michigan), http://oaister.umdl.umich.edu/ o/oaister/ (accessed Oct. 20, 2006); U.S. Institute of Museum and Library Services. Digital collections and content, http://imlsdcc .grainger.uiuc.edu/ (accessed Oct. 20, 2006). 39. K. Calhoun, “The Changing Nature of the Catalog and Its Integration into Other Discovery Tools,” (report to the Library of Congress, Mar. 17, 2006), http://www.loc.gov/catdir/calhoun -report-final.pdf (accessed Oct. 20, 2006); Mitchell, “Collabora- tion Enabling Internet Resource Collection-Building Software and Technologies”; W. Wulf, “Higher Education Alert: The Railroad is Coming,” in EDUCAUSE, Publications from the Forum for the Future of Higher Education (2002), http://www.educause. edu/ir/library/pdf/FFPIU022.pdf (accessed Oct. 20, 2006). 40. University of California Libraries, “Rethinking How We Provide Bibliographic Services at the University of Cali- fornia,” final report of the Bibliographic Services Task Force of the University of California Libraries, 2005, http://libraries .universityofcalifor nia.edu/sopag/BSTF/Final.pdf (accessed Oct. 20, 2006). 41. L. Dempsey, “Libraries and the Long Tail: Some Thoughts About Libraries in a Network Age,” D-Lib Magazine 12, no. 4 (2006), http://www.dlib.org/dlib/april06/dempsey/ 04dempsey.html (accessed Oct. 20, 2006). 42. Mason, J. et al., “INFOMINE: Promising Directions in Virtual Library Development,” First Monday 5, no. 6 (June 5, 2000), http://www.firstmonday.dk/issues/issue5_6/mason/ (accessed Oct. 20, 2006). 43. E. O’Neill and L. M. Chan, “FAST: Faceted Application of Subject Terminology,” in Proceedings of the World Information Con- gress, IFLA General Conference and Council (Berlin: IFLA, 2003). http://www.ifla.org/IV/ifla69/papers/010e-ONeill_Mai- Chan.pdf (accessed Oct. 20, 2006); See also: OCLC 2003–2006, “FAST: Faceted Application of Subject Terminology,” http:// www.oclc.org/research/projects/fast/default.htm) (accessed Oct. 20, 2006). 44. M. Bates, 2003, “Improving User Access to Library Cata- log and Portal Information,” Task Force Recommendation 2.3, Final Report (Washington, D.C.:Library of Congress, 2003), 30, http:// www.loc.gov/catdir/bibcontrol/2.3BatesReport6-03.doc.pdf (accessed Oct. 20, 2006). 45. RDN (Resource Description Network), http://www.rdn .ac.uk/projects/eprints-uk/, (accessed Oct. 20, 2006); OCLC “ePrints-UK” (2005), http://www.oclc.org/research/projects/ mswitch/epuk.htm, (accessed Oct. 20, 2006). 46. A. MacEwan, “Working with LCSH: The Cost of Coop- eration and the Achievement of Access: A Perspective from the British Library,” presented at the IFLA General Conference, 1998, http://www.ifla.org/IV/ifla64/033-99e.htm (accessed Oct. 20, 2006). 47. Ibid.; R. Larson, “The Decline of Subject Searching: Long- Term Trends and Patterns of Index Use in an Online Catalog,” Journal of the American Society for Information Science 42, no. 3 (1991): 197–215. 48. K. Drabenstott et al., “End-User Understanding of Subject Headings in Library Catalogs,” Library Resources & Technical Services 43, no. 3 (Jul. 1999): 140–60; Bates, “Improving User Access.” 49. Bates, “Improving User Access,” (see discussion of entry vocabulary). 50. Ibid. 51. BEAT (Bibliographic Enrichment Advisory Team, Library of Congress), “Digital Tables of Contents,” (2005), http://www .loc.gov/catdir/beat/digitoc.html (accessed Oct. 20, 2006). 52. D. Vizine-Goetz, “Terminology Services, OCLC,” (2004), http://www.oclc.org/research/projects/termservices/default .htm (accessed Oct. 20, 2006). 53. C. Fellbaum, Wordnet: An Electronic Lexical Database (Cam- bridge, Mass.: MIT Pr., 1998), http://wordnet.princeton.edu/ (accessed Oct. 20, 2006); A. Csomai, “Wordnet Bibliography,” (2006). http://lit.csci.unt.edu/~wordnet/ (accessed Oct. 20, 2006). 54. Bates, “Improving User Access.” 55. A. Maedche and R. Volz, “The Ontology Extraction and Maintenance Framework: Text-to-Onto,” in Proceedings of the ICDM 2001 Workshop (San Jose, Calif.: IEEE Computer Society (2001), http://cui.unige.ch/~hilario/icdm-01/DM-KM-Final/Volz .pdf (accessed Oct. 20, 2006); V. Parekh, J. Gwo, and T. Finin, “Mining Domain Specific Texts and Glossaries to Evaluate and Enrich Domain Ontologies,” in Proceedings of the 2004 International Conference on Information and Knowledge Engineer- ing: IKE ‘04 (Las Vegas: CSREA Press, 2004), http://ebiquity. umbc.edu/v2.1/paper/html/id/171/ (accessed Oct. 20, 2006); 216 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2006 D. Sleeman et al., “Enabling Services for Distributed Environ- ments: Ontology Extraction and Knowledge Base Characteriza- tion,” in Proceedings of Workshop on Knowledge Transformation for the Semantic Web/Fifteenth European Conference on Artificial Intelligence (Lyon, France: ECAI, 2002), http://www.csd.abdn .ac.uk/~sleeman/published-papers/p129-final-ontomine.pdf (accessed Oct. 20, 2006). ; B. Omelayenko, “Learning of Ontol- ogies for the Web: The Analysis of Existent Approaches,” in Proceedings of the International Workshop on Web Dynam- ics (London: WebDyn, 2001), http://dcs.bbk.ac.uk/webdyn/ webDynPapers/omelayenko.pdf (accessed Oct. 20, 2006); R. Dhamankar et al., “Imap: Discovering Complex Seman- tic Matches Between Database Schemas,” in SIGMOD 2004: Proceedings of the ACM SIGMOD International Conference on Management of Data, June 13–18, 2004, Paris, France (New York: Association for Computing Machinery, 2004), http:// www.cs.washington.edu/homes/pedrod/papers/sigmod04 .pdf (accessed Oct. 20, 2006); P. Cassin et al., “Ontology Extraction for Educational Knowledge Bases,” Lecture Notes in Computer Science, vol. 2926 (Heidelberg: Springer-Verlag, 2004), 297–309; Revised and Invited Papers from Agent-Medi- ated Knowledge Management: International Symposium (Stanford, Calif., Mar. 24–26, 2003), ftp://mas.cs.umass.edu/pub/Cassin _Ontology-AMKM03.pdf (accessed Oct. 20, 2006); T. Wang et al., “Extracting a Domain Ontology from Linguistic Resource Based on Relatedness Measurements,” in The 2005 IEEE/WIC/ ACM International Conference on Web Intelligence: Proceedings: September 19–22, Compiègne University of Technology, France (Los Alamitos, Calif.: IEEE Computer Society, 2005), 345–51, http:// csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/ dl/proceedings/&toc=comp/proceedings/wi/2005/2415/00/ 2415toc.xml&DOI=10.1109/WI.2005.63 (accessed Oct. 20, 2006). 56. Bates, “Improving User Access to Library Catalog and Portal Information.” 57. O’Neill and Chan, “FAST: Faceted Application of Subject Terminology.” 58. Mason, et al., “INFOMINE: Promising Directions in Vir- tual Library Development.”