172 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2009 Information Discovery Insights Gained from MultiPAC, a Prototype Library Discovery System Alex A. Dolski At the University of Nevada Las Vegas Libraries, as in most libraries, resources are dispersed into a number of closed “silos” with an organization-centric, rather than patron-centric, layout. Patrons frequently have trouble navigating and discovering the dozens of disparate interfaces, and any attempt at a global overview of our information offerings is at the same time incomplete and highly complex. While consolidation of interfaces is widely considered to be desirable, certain challenges have made it elusive in practice. M ultiPAC is an experimental “discovery,” or meta- search, system developed to explore issues sur- rounding heterogeneous physical and networked resource access in an academic library environment. This article discusses some of the reasons for, and outcomes of, its development at the University of Nevada Las Vegas (UNLV). n The case for MultiPAC Fragmentation of library resources and their interfaces is a growing problem in libraries, and UNLV Libraries is no exception. Electronic information here is scattered across our Innovative WebPAC; our main website, our three branch library websites; remote article databases, local custom databases, local digital collections, special collections, other remotely hosted resources (such as LibGuides), and others. The number of these resources, as well as the total volume of content offered by the Libraries, has grown over time (figure 1), while access provisions have not kept pace in terms of usability. In light of this dilemma, the Libraries and various units within have deployed finding and search tools that provide browsing and searching access to certain subsets of these resources, depending on criteria such as n the type of resource; n its place within the libraries’ organizational structure; n its place within some arbitrarily defined topical categorization of library resources; n the perceived quality of its content; and n its uniqueness relative to other resources. These tools tend to be organization-centric rather than patron-centric, as they are generally provisioned in relative isolation from each other without placing as much emphasis on the big picture (figure 2). The result is, from the patron’s perspective, a disaggregated mass of information and scattered finding tools that, to varying degrees, each accomplishes its own specific goals at the expense of macro-level findability. Currently, a compre- hensive search for a given subject across as many library resources as possible might involve visiting a half-dozen interfaces or more—each one predicated upon awareness of each individual interface, its relation to the others, and Figure 1. “Silos” in the library Figure 2. Organization-centric resource provisioning Alex A. Dolski (alex.dolski@unlv.edu) is Web & Digitization Application Developer at the University of Nevada Las Vegas Libraries. INFORMATION DISCOVERY INSIGHTS GAINED FROM MULTIPAC | DOLSkI 173 the characteristics of its spe- cific coverage of the corpus of library content. Our library website serves as the de facto gate- way to our electronic, net- worked content offerings. Yet usability studies have shown that findability, when given our website as a starting point, is poor. Undoubtedly this is due, at least in part, to interface fragmentation. Test sub- jects, when given a task to find something and asked to use the library website as a starting point, fail out- right in a clear majority of cases.1 MultiPAC is a technical prototype that serves as an exploration of these issues. While the system itself breaks no new technical ground, it brings to the forefront critical issues of metadata quality, organizational structure, and long-term planning that can inform future actions regard- ing strategy and implemen- tation of potential solutions at UNLV and elsewhere. Yet it is only one of numerous ways that these issues could be addressed.2 In an abstract sense, MultiPAC is biased toward principles of simplification, consolidation, and unifica- tion. In theory, usability can be improved by eliminating redundant interfaces, con- solidating search tools, and bringing together resource-specific features (e.g., OPAC holdings status) in one interface to the maximum extent possible (figure 3). Taken to an extreme, this means being able to support searching all of our resources, regardless of type or location, from a single interface; abstracting each resource from whatever native or built-in user interface it might offer; and relying instead on its data interface for querying and result-set gathering. Thus MultiPAC is as much a proof-of-concept as it is a concrete implementation. n Background: How MultiPAC became what it is MultiPAC came about from a unique set of circumstances. From the beginning, it was intended as an exploratory project, with no serious expectation of it ever being deployed. Our desire to have a working prototype ready for our Discovery Mini-Conference meant that we had just six weeks of development time, which was hardly sufficient for anything more than the most agile of Table 1. Some popular existing library discovery systems Name Company/Institution Commercial Status Aquabrowser Serials Solutions Commercial Blacklight University of Virginia Open-source (Apache) Encore Innovative Interfaces Commercial eXtensible Catalog University of Rochester Open-source (MIT/GPL) LibraryFind Oregon State University Open-source (GPL) MetaLib Ex Libris Commercial Primo Ex Libris Commercial Summon Serials Solutions Commercial VuFind Villanova University Open-source (GPL) WorldCat Local OCLC Commercial Table 2. Some existing back-end search servers Name Company/Institution Commercial Status Endeca Endeca Technologies Commercial IDOL Autonomy Commercial Lucene Apache Foundation Open-source (Apache) Search Server Microsoft Commercial Search Server Express Microsoft Free Solr (superset of Lucene) Apache Foundation Open-source (Apache) Sphinx Sphinx Technologies Open-source (GPL) Xapian Community Open-source (GPL) Zebra Index Data Open-source (GPL) 174 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2009 development models. The resulting design, while foun- dationally solid, was limited in scope and depth because of time constraints. Another option, instead of developing MultiPAC, would have been to demonstrate an existing open-source discovery system. The advantage of this approach is that the final product would have been considerably more advanced than anything we could have developed our- selves in six weeks. On the other hand, it might not have provided a comparable learning opportunity. n Survey of similar systems Were its development to continue, MultiPAC would find itself among an increasingly crowded field of competitors (table 1). A number of library discovery systems already exist, most backed by open-source or commercially available back-end search engines (table 2), which handle the nitty-gritty, low-level ingestion, indexing, and retrieval. These lists of systems are by no means comprehensive and do not include notable experimental or research systems, which would make them much longer. n Architecture In terms of how they carry out a search, meta-search applications can be divided into two main groups: dis- tributed (or federated search), in which searches are “broadcast” to individual resources that return results in real time (figure 4); and harvested search, in which searches are carried out against a local index of resource contents (figure 5).3 Both have advantages and disadvan- tages beyond the scope of this article. MultiPAC takes the latter approach. It consists of three primary components: the search server, the user interface, and the metadata harvesting system (figure 6). Figure 4. The federated search process Figure 5. The harvested search process Figure 6. The three main components of MultiPAC Figure 3. Patron-centric resource provisioning INFORMATION DISCOVERY INSIGHTS GAINED FROM MULTIPAC | DOLSkI 175 n Search server After some research, Solr was chosen as the search server because of its ease of use, proven library track record, and HTTP–based representational state transfer (REST) application programming interface (API), which improves network-topological flexibility, allowing it to be deployed on a different server than the front-end Web application—an important consideration in our server environment.4 Jetty—a Java Web application server bundled with Solr—proved adequate and convenient for our needs. The metadata schema used by Solr can be customized. We derived ours from the unqualified Dublin Core meta- data element set (DCMES),5 with a few fields removed and some fields added, such as “library” and “depart- ment,” as well as fields that support various MultiPAC features, such as thumbnail images, and primary record URLs. DCMES was chosen for its combination of general- ity, simplicity, and familiarity. In practice, the Solr schema is for finding purposes only, so whether it uses a standard schema is of little importance. n User interface The front-end MultiPAC system is written in PHP 5.2 in a model-view-controller design based on classical object design principles. To support modularity, new resources can be added as classes that implement a resource-class interface. The MultiPAC HTML user interface is composed of five views: search, browse, results, item, and list, which exist to accommodate the finding process illustrated in figure 7. Each view uses a custom HTML template that can be easily styled by nonprogrammer Web designers. (Needless to say, judging by figures 8–12, they haven’t been.) Most dynamic code is encapsulated within dedi- cated “helper” methods in an attempt to decouple the templates from the rest of the system. Output formats, like resources, are modular and decoupled from the core of the system. The HTML user interface is one of several interfaces available to the MultiPAC system; others include XML and JSON, which effectively add Web services support to all encompassed resources—a feature missing from many of the resources’ own built-in interfaces.6 n Search view Search view (figure 8) is the simplest view, serving as the “front page.” It currently includes little more than a brief introduction and search field. The search field is not complicated; it is, in fact, possible to include search forms on any webpage and scope them to any subset of resources on the basis of facet queries. For example, a search form could be scoped to Las Vegas–related resources in Special Collections, which would satisfy the demand of some library departments for custom search engines tailored to their resources without contribut- ing to the “interface fragmentation” effect discussed in the introduction. (This would require a higher level of metadata quality than we currently have, which will be discussed in depth later.) Because search forms can be added to any page, this view is not essential to the MultiPAC system. To improve simplification, it could be easily removed and replaced with, for example, a search form on the library homepage. n Browse view Browse view (figure 9) is an alternative to search view, intended for situations in which the user lacks a “concrete target” (figure 7). As should be evident by its appearance, Figure 7. The information-finding process supported by MultiPAC Figure 8. The MultiPAC search view page 176 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2009 this is the least-developed view, simply displaying facet terms in an HTML unordered list. Notice the facet terms in the format field; this is malprocessed, MARC– encoded information resulting from a quick-and-dirty Extensible Stylesheet Language (XSL) transformation from MARCXML to Solr XML. n Results view The results page (figure 10) is composed of three columns: 1. The left column displays a facet list—a feature gen- erally found to be highly useful for results-gathering purposes.7 The data in the list is generated by Solr and transformed to an HTML unordered list using PHP. The facets are configurable; fields can be made “facetable” in the Solr schema configuration file. 2. The center column displays results for the current search query that have been provided by Solr. Thumbnails are available for resources that have them; generic icons are provided for those that do not. Currently, the results list displays item title and description fields. Some items have very rich descriptions; others have minimal descriptions or no descriptions at all. This happens to be one of several significant metadata quality issues that will be discussed later. 3. The right column displays results from nonin- dexed resources, including any that it would not be feasible to index locally, such as Google, our article databases, and so on. MultiPAC displays these resources as collapsed panes that expand when their titles are clicked and initiate an AJAX request for the current search query. In a situation in which there might be twenty or more “panes” to load, performance would obviously suffer greatly if each one had to be queried each time the results page loaded. The on-demand loading process greatly speeds up the page load time. Currently, the right column includes only a handful of resource panes—as many as could be developed in six weeks alongside the rest of the prototype. It is anticipated that further development would entail the addition of any number of panes—perhaps several dozen. The ease of developing a resource pane can vary greatly depending on the resource. For developer- friendly resources that offer a useful JavaScript Object Notation (JSON) API, it can take less than half an hour. For article databases, which vendors generally take great pains to “lock down,” the task can entail a two-day marathon involving trial-and-error HTTP-request-token authentication and screen-scraping of complex invalid HTML. In some cases, vendor license agreements may prohibit this kind of use altogether. There is little we can do about this; clearly, one of MultiPAC’s severest limita- tions is its lack of adeptness at searching these types of “closed” remote resources. n Item view Item view (figure 11) provides greater detail about an individual item, including a display of more metadata fields, an image, and a link to the item in its primary con- text, if available. It is expected that this view also would include holdings status information for OPAC resources, although this has not been implemented yet. The availability of various page features is dependent on values encoded in the item’s Solr metadata record. For example, if an image URL is available, it will be displayed; if not, it won’t. An effort was made to keep the view logic separate from the underlying resource to improve code and resource maintainability. The page template itself does not contain any resource-dependent conditionals. n List view List view (figure 12), essentially a “favorites” or “cart” view, is so named because it is intended to duplicate the list feature of UNLV Libraries’ Innovative Millennium Figure 9. The MultiPAC browse view page INFORMATION DISCOVERY INSIGHTS GAINED FROM MULTIPAC | DOLSkI 177 OPAC. The user can click a button in either results view or item view to add items to the list, which is stored in a cookie. Although currently not feature-rich, it would be reasonable to expect the ability to send the list as an e-mail or text message, as well as other features. n Metadata harvesting system For metadata to be imported into Solr, it must first be harvested. In the harvesting process, a custom script checks source data and com- pares it with local data. It downloads new records, updates stale records, and deletes missing records. Not all resources support the ability to easily check for changed records, meaning that the full record set must be down- loaded and converted during every harvest. In most cases, this is not a problem; most of our resources (the library catalog excluded) can be fully dumped in a matter of a few seconds each. In a production environment, the harvest scripts would be run automatically every day or so. In practice, every resource is different, necessitating a different harvest script. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is the proto- col that first jumps to mind as being ideal for metadata harvesting, but most of our resources do not support it. Ideally, we would modify as many of them as possible to be OAI–compliant, but that would still leave many that are out of our hands. Either way, a substantial number of custom harvest scripts would still be required. For demonstration purposes, the MultiPAC prototype was seeded with sample data from a handful of diverse resources: 1. A set of 16,000 MARC records from our library catalog, which we converted to MARCXML and then to Solr XML using XSL transformations 2. Our locally built Las Vegas Architects and Buildings Database, a MySQL database containing more than 10,000 rows across 27 tables, which we queried and dumped into XML using a PHP script 3. Our locally built Special Collections Database, a smaller MySQL database, which we dealt with the same way 4. Our CONTENTdm digital collections, which we downloaded via OAI-PMH and transformed using another custom XSL stylesheet There are typically a variety of conversion options for each resource. Because of time constraints, we simply chose what we expected would be the quickest route for each, and did not pay much attention to the quality of the conversion. n How MultiPAC answers UNLV Libraries’ discovery questions MultiPAC has essentially proven its capability of solv- ing interface multiplication and fragmentation issues. Figure 10. The MultiPAC results view page 178 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2009 By adding a layer of abstraction between resource and patron, it enables us to reference abstract resources instead of their specific implementations—for example, “the library catalog” instead of “the INNOPAC catalog.” This creates flexibility gains with regard to resource pro- vision and deployment. This kind of “pervasive decoupling” can carry with it a number of advantages. First, it can allow us to provide custom-developed services that vendors cannot or do not offer. Second, it can prevent service interruptions caused by maintenance, upgrades, or replacement of individual back-end resources. Third, by making us less dependent on specific implementations of vendor products—in other words, reducing vendor “lock-in”—it can potentially give us leverage in vendor contract negotiations. Because of the breadth of information we offer from our website gateway, we as a library are particularly sensitive about the continued availability of access to our resources at stable URLs. When resources are not persistent, patrons and staff need to be retrained, expec- tations need to be adjusted, and hyperlinks—scattered all over the place—need to be updated. By decoupling abstract resources from their implementations, MultiPAC becomes, in effect, its own persistent URI system, unify- ing many library resources under one stable URI schema. In conjunction with a URL rewriting system on the Web server, a resource-based URI schema (figure 13) would be both powerful and desirable.8 n Lessons learned in the development of MultiPAC The lessons learned in the development of MultiPAC fall into three main categories, listed here in order of importance. Metadata quality considerations Quality metadata—characterized by unified schemas; useful crosswalking; and consistent, thorough descrip- tion—facilitates finding and gathering. In practice, a sur- rogate record is as important as the resource it describes. Below a certain quality threshold, its accompanying resource may never be found, in which case it may as well not exist. Surrogate record quality influences relevance ranking and can mean the difference between the most relevant result appearing on page 1 or page 50 (relevance, of course, being a somewhat disputed term). Solr and similar systems will search all surrogates, including those that are of poor quality, but the resulting relevancy rank- ing will be that much less meaningful. Figure 13. Example of an implementation-based vs. resource-based URI Implementation-based http://www.library.unlv.edu/arch/archdb2/index.php/projects/view/1509 Resource-based (hypothetical) http://www.library.unlv.edu/item/483742 Figure 11. The MultiPAC item view page Figure 12. The MultiPAC list view page INFORMATION DISCOVERY INSIGHTS GAINED FROM MULTIPAC | DOLSkI 179 Metadata quality can be evaluated on several lev- els, from extremely specific to extremely broad (figure 14). That which may appear to be adequate at one level may fail at a higher level. Using this figure as an example, MultiPAC requires strong adherence to level 5, whereas most of our metadata fails to reach level 4. A “level 4 failure” is illustrated in table 3, which compares sample metadata records from four different MultiPAC resources. Empty cells are not necessarily “bad”— not all metadata elements apply to all resources—but this type of inconsistency multiplies as the number of resources grows, which can have negative implications for retrieval. Suggestions for improving metadata quality The results from the MultiPAC project suggest that meta- data rules should be applied strictly and comprehensively according to library-wide standards that, at our libraries, have yet to be enacted. Surrogate records must be treated as must-have (rather than nice-to-have) features of all resources. Resources that are not yet described in a system that supports search- able surrogate records should be transitioned to one that does; for example, HTML web- pages should be tran- sitioned to a content management system with metadata ascrip- tion and searchability features (at UNLV, this is planned). However, it is not enough for resources to have high-quality meta- data if not all schemas are in sync. There exist a number of resources in our library that are well-described but whose schemas do not mesh well with other resources. Different formats are used; different descriptive elements Figure 14. Example scopes of metadata application and evalua- tion, from broad (top) to specific Table 3. Comparing sample crosswalked metadata from four different UNLV Libraries resources Library Catalog Digital Collections Special Collections Database Las Vegas Architects & Buildings Database Title Goldfield: boom town of Nevada Map of Tonopah Mining District, Nye County, Nevada 0361 : Mines and Mining Collection Flamingo Hilton Las Vegas Creator Paher, Stanley W. Booker & Bradford Call Number F849.G6P34 Contents (Item-level description of contents) Format Digital Object Photo Collections Database Record Language eng Eng eng Coverage Tonopah Mining District (Nev.) ; Ray Mining District (Nev.) Description (Omitted for brevity) Publisher Nevada Publications University of Nevada Las Vegas Libraries UNLV Architecture Studies Library Subject (LCSH omitted for brevity) (LCSH omitted for brevity) 180 INFORMATION TECHNOLOGY AND LIBRARIES | DECEMBER 2009 are used; and different interpretations, however subtle, are made of element meanings. Despite the best intentions of everyone involved with its creation and maintenance, and despite the high quality of many of our metadata records when examined in isola- tion, in the big picture, MultiPAC has demonstrated—per- haps for the first time—how much work will be needed to upgrade our metadata for a discovery system. Would the benefits make the effort worthwhile? Would the effort be implementable and sustainable given the limitations of the present generation of “silo” systems? What kind of adjustments would need to be made to accommodate effective workflows, and what might those workflows look like? These questions still await answers. Of note, all other open-source and vendor systems suffer from the same issues, which is a key reason that these types of systems are not yet ascendant in libraries.9 There is much promise in the ability of infrastructural standards like FRBR, SKOS, RDA, and the many other esoteric information acronyms to pave the way for the next generation of library discovery systems. Organizational considerations Electronic information has so far proved relatively elusive to manage; some of it is ephemeral in existence, most of it is constantly changing, and all of it is from diverse sources. Attempts to deal with electronic resources—representing them using catalog surrogate records, streamlining web- site portals, farming out the problem to vendors—have not been as successful as they have needed to be and suf- fer from a number of inherent limitations. MultiPAC would constitute a major change in library resource provision. Our library, like many, is for the most part organized around a core 1970s–80s ILS–support model that is not well adapted to a modern unified discovery environment. Next-generation discovery is trending away from assembly-line-style acquisition and processing of primarily physical resources and toward agglomerating interspersed networked and physical resource clouds from on- and offsite.10 In this model, increasing responsibilities are placed on all content pro- viders to ensure that their metadata conforms to site-wide protocols that, at our library, have yet to be developed. n Conclusion In deciding how to best deal with discovery issues, we found that a traditional product matrix comparison does not address the entire scope of the problem, which is that some of the discoverability inadequacies in our libraries are caused by factors that cannot be purchased. Sound metadata is essential for proper functioning of a unified discovery system, and descriptive uniformity must be ensured on multiple levels, from the element level to the institution level. Technical facilitators of improved discoverability already exist; the responsibility falls on us to adapt to the demands of future discovery systems. The specific discovery tool itself is only a facilitator, the specific implementation of which is likely to change over time. What will not change are library-wide metadata quality issues that will serve any tool we happen to deploy. The MultiPAC project brought to light important library-wide discoverability issues that may not have been as obvious before, exposing a number of limitations in our exist- ing metadata as well as giving us a glimpse of what it might take to improve our metadata to accommodate a next-generation discovery system, in whatever form that might take. References 1. UNLV Libraries Usability Committee, internal library website usability testing, Las Vegas, 2008. 2. Karen Calhoun, “The Changing Nature of the Catalog and Its Integration with Other Discovery Tools.” Report prepared for the Library of Congress, 2006. 3. Xiaoming Liu et al., “Federated Searching Interface Tech- niques for Heterogeneous OAI Repositories,” Journal of Digital Information 4, no. 2 (2002). 4. Apache Software Foundation, Apache Solr, http://lucene .apache.org/solr/ (accessed June 11, 2009). 5. Dublin Core Metadata Initiative, “Dublin Core Metadata Element Set, Version 1.1,” Jan. 14, 2008, http://dublincore.org/ documents/dces/ (accessed June 25, 2009). 6. Lorcan Dempsey, “A Palindromic ILS Service Layer,” Lorcan Dempsey’s Weblog, Jan. 20, 2006, http://orweblog.oclc .org/archives/000927.html (accessed July 15, 2009). 7. Tod A. Olson, “Utility of a Faceted Catalog for Scholarly Research,” Library Hi Tech 4, no. 25 (2007): 550–61. 8. Tim Berners-Lee, “Hypertext Style: Cool URIs Don’t Change,” 1998, http://www.w3.org/Provider/Style/URI (accessed June 23, 2009). 9. Bowen, Jennifer, “Metadata to Support Next-Generation Library Resource Discovery: Lessons from the eXtensible Cata- log, Phase 1,” Information Technology and Libraries 2, no. 27 (June 2008): 6–19. 10. Calhoun, “The Changing Nature of the Catalog.”