Design and Implementation of the E-Referencer

Danny C. C. Poo, Christopher S. G. Khoo, Teck-Kang Toh

School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260 (dpoo@comp.nus.edu.sg)
Centre for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798 (assgkhoo@ntu.edu.sg)

Abstract. An expert system Web interface to online catalogs called E-Referencer is being developed. An initial prototype has been implemented. The interface has a repertoire of initial search strategies and reformulation strategies that it selects and implements to help users retrieve relevant records. It uses the Z39.50 protocol to access library systems on the Internet. This paper describes the design and implementation of the E-Referencer. A preliminary evaluation of the strategies is also presented.

1. Introduction

E-Referencer is a Web-based interface to online catalogs that is being developed as a tool for testing and experimenting with search strategies aimed at helping online catalog users in their searches. The system uses expert system technology to select search strategies. Through experiments, modifications and fine-tuning, it is hoped that the E-Referencer can become an effective search tool for all online catalog users.

The E-Referencer uses the Z39.50 Information Retrieval protocol [1] to communicate with the various library systems, and the Java Expert System Shell (JESS) [2] to implement the knowledge base of the system. At present, the knowledge base of the E-Referencer consists of:

1. a conceptual knowledge base that maps free-text keywords to concepts represented by the Library of Congress (LC) subject headings
2. search strategies coded in the system, including
   • initial search strategies, used to convert the user’s natural language query to an appropriate Boolean search statement
   • reformulation strategies, used for refining a search based on the results of the previous search statement
3. rules for selecting an appropriate search strategy.

The E-Referencer processes a user’s natural language query, selects a suitable search strategy and formulates an appropriate search statement for the library system. Based on the user’s relevance feedback on the search result, it then selects a strategy for reformulating the search. The process goes on until the user is satisfied with the final result.

This paper begins by discussing the state of current search systems as implemented in Online Public Access Catalogue Systems (OPACS). The discussion then illustrates the improvement in search results obtained by using an initial prototype of the E-Referencer. The remaining sections of the paper cover the design and implementation of the E-Referencer.

2. Experiences in Online Catalog Searches

2.1 User Difficulties in Searching OPACS

Traditional text-based boolean search systems as incorporated in the design of OPACS are difficult to use. In the 1984 Online Catalog Evaluation Projects sponsored by the Council of Library Resources, Markey found that users have the following problems when performing subject searches of online catalogues [6]:

• Users have problems matching their terms with those indexed in the online catalogue.
• They have difficulty identifying terms broader or narrower than their topic of interest.
• They do not know how to increase the search results when too little or nothing is retrieved.
• They do not know how to reduce the search results when too much is retrieved.
• They lack understanding of the printed LCSH (Library of Congress Subject Headings).

Hildreth also pointed out that “conventional informational retrieval systems place the burden on the user to reformulate and re-enter searches until satisfactory results are obtained.” [7] Indeed the boolean search system, which was the conventional information retrieval system used then, requires too much knowledge from its users.

The new Web-based library OPAC search systems today are not much different from those of the eighties in terms of functionality. Borgman indicated that most of the improvements to online catalogues in recent years were in surface features rather than in the core functionality. Also, online catalogues “were designed for highly skilled searchers, usually librarians, who used them frequently, not for novices or for end-users doing their own searching.” [8].

Other studies conducted by Cousins [3], Dalrymple [4], Ensor [5] and Lancaster [6] have found deficiencies in present-day online catalogue systems and identified the problems users have. For example, Cousins [3] analyzed the types of subject queries that users brought to two online catalog systems. She found that many subject queries were not expressed at a level of specificity suitable for searching the system. She concluded that online catalogue systems should provide more information about the document content (e.g. content pages), facilities for browsing the thesaurus and the classification scheme, facilities for browsing records arranged by class number, ranked display of search results, help with query formulation, and relevance feedback.

2.2 Recent Developments in Improving the Search System

To help users search more effectively, recent research efforts have concentrated on improving the search system.
The more notable developments are:

• Introduction of ‘best match’ or statistically-based search systems that incorporate numerically-based algorithms for estimating the relevance of documents to the user’s query.
• Use of knowledge-based search systems that encode expert knowledge to provide advice or assistance to users in searching.
• Development of the Z39.50 Standard for Information Retrieval, which provides common access to multiple online catalogues.

2.2.1 ‘Best Match’ Search Systems

There are many variations of ‘best match’ or statistically-based search systems. They support a mixture of features such as document ranking, relevance feedback and query expansion. These systems use either the vector-based information retrieval model or the probabilistic model to re-index the collection of documents to support more effective retrieval.

The user is allowed to enter the search query in natural language. The search query is then treated as a list of unstructured keywords. Stop-words such as ‘is’, ‘a’, and ‘the’, which are not very useful in the search process, are removed. The remaining content-bearing words are stemmed to a common form. Inverse frequency weights are then calculated for each of the remaining query stems, so that the greatest weights are assigned to the terms that occur least frequently in the document collection. The weight of each document in the collection is then calculated by summing the weights of the terms that occur in both the query and the document. Subsequently, the documents are sorted based on their calculated weights. This process is known as document ranking.

In order to determine the relevance of the documents, a certain number of top-ranking documents are displayed to the user, who judges them. This process is known as relevance feedback. Based on the user’s feedback, terms are extracted from the relevant documents; the selected terms are then weighted and ranked.
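The document-ranking step described above can be illustrated with a short Java sketch. It is only an illustration under stated assumptions: the weight formula log(N/n) is one common form of inverse frequency weighting and is not taken from any of the systems cited here, the documents and query are assumed to be already stemmed with stop-words removed, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of inverse-frequency weighting and document ranking. */
final class BestMatchSketch {

    /**
     * Weight of a term: log(N / n), where N is the collection size and n is the
     * number of documents containing the term (assumed formula). Rare terms
     * receive the greatest weights; terms absent from the collection weigh 0.
     */
    static double idf(String term, List<Set<String>> docs) {
        long n = docs.stream().filter(d -> d.contains(term)).count();
        return n == 0 ? 0.0 : Math.log((double) docs.size() / n);
    }

    /** Weight of a document: sum of the weights of the query terms it contains. */
    static double score(Set<String> query, Set<String> doc, List<Set<String>> docs) {
        double sum = 0.0;
        for (String term : query) {
            if (doc.contains(term)) {
                sum += idf(term, docs);
            }
        }
        return sum;
    }

    /** Returns document indices sorted by descending weight (ties keep input order). */
    static List<Integer> rank(Set<String> query, List<Set<String>> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(
                score(query, docs.get(b), docs),
                score(query, docs.get(a), docs)));
        return order;
    }
}
```

In this sketch a document that contains a rare query stem is ranked above one that contains only a common stem, which is the behaviour the inverse frequency weights are meant to produce.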
A certain number of the top-ranking terms could be automatically added to the original query. Alternatively, they can be displayed to the user, who can decide whether the new terms are to be included in the original query. This process is known as query expansion.

Examples of best match systems include the Interactive System for Teaching Retrieval (INSTRUCT) developed at the University of Sheffield [11], the Okapi System by the City University in London [12] and almost all the current Web search engines.

There are advantages in using such systems. First, users do not need to compose their queries using logical operators. Second, the ranking of documents means that the user is more likely to find the required document at the top of the list. Third, the use of relevance feedback information to refine the query, by automatic or semi-automatic query expansion, provides a useful mechanism for users to clarify their query and locate the required information quickly.

However, there are also problems with such systems. Firstly, there is a need to re-index the existing database, and this can be very costly and time-consuming. This may explain why very few library OPAC search systems are ‘best match’ systems. Secondly, a large number of records are usually returned in each search. This is evident in searches carried out on the Web. Users are unlikely to browse through all the returned records, so only a small number of the records are actually seen. Lastly, such numerically-based systems do not utilize any extra information or domain knowledge that may be used to help users in their search.

2.2.2 Knowledge-based Search Systems

Another approach is to make use of techniques from research in expert systems. An intermediary system can be provided that removes the need for users to have knowledge of boolean search. Expert systems encode the knowledge of a human expert in the domain of interest and the strategies the expert uses for reasoning.
In the field of information retrieval, the expert is probably the reference librarian, whose expert knowledge lies in the ability to convert natural language queries into boolean queries, and to refine and reformulate the query according to his/her knowledge of the subject area being searched.

A number of such systems based on expert system techniques like production rules, semantic nets and frames have been developed over the years; for example:

• Gauch’s Query Reformulation system uses production rules to reformulate the query by manipulating the boolean operators, by adding related terms, or by replacing terms with broader or narrower terms from a thesaurus [13].
• PLEXUS is an expert referral system that uses a frame-based representation of topics in the domain of gardening to map words to semantic primitives. It incorporates some search strategies expressed as production rules, and builds a temporary user profile based on the user’s gardening experience, knowledge and location. This profile is used to set the level of help for the user [14].
• RUBRIC (Rule-Based Retrieval Information by Computer) uses production rules to define a hierarchy of retrieval concepts. Domain knowledge is modeled as a collection of concepts. Each concept contains a description, the relationships between the concept and other concepts, and the rules that describe the patterns of text that should be present to support the use of the retrieval concept. In this system, the user is required to provide domain-specific concepts to the system [15].
• Drabenstott and her colleagues developed a prototype online catalogue that uses search trees or decision trees to represent how experienced librarians select a search strategy and formulate a search statement. The decision tree is represented as a flowchart [16-18].

The main advantage of the knowledge-based approach is the ability to reuse the existing database index.
Because the knowledge-based solution builds on top of existing infrastructure, libraries are more likely to adopt it to augment their existing OPAC boolean search systems. Furthermore, the classification systems in library catalogues, such as the Dewey Classification System and the Library of Congress Subject Classification System, are already well-defined. Such structured domain information and knowledge can thus be incorporated in any system to help users in their searches.

2.2.3 Z39.50 Information Retrieval Protocol

For many years, users have been limited to performing searches at the library premises. Users also have to use the OPAC search system provided by each library when searching its catalog. Although most systems are boolean search systems offering similar search capabilities, users still need to learn each of the different search interfaces in order to be proficient in accessing information from the online catalogs. To overcome this compatibility issue, the Z39.50 Information Retrieval Protocol was developed to standardize the information retrieval process.

Commercial products and freeware search systems that implement the Z39.50 protocol are now available. For a list of Z39.50 products and freeware, refer to http://lcweb.loc.gov/z3950/agency/projects/software.html. We have evaluated a few of these Z39.50 compliant search systems, namely:

• Sitesearch WebZ software from Online Computer Library Center (OCLC)
  http://www.oclc.org/oclc/sitesearch/components.htm
• Chalmers from Chalmers Library, Chalmers University of Technology
  http://www.lib.chalmers.se/prov/Z3950/gateway.html
• Willow from University of Washington
  http://www.washington.edu/willow
• WWW/Z39.50 Gateway from Library of Congress
  http://lcweb.loc.gov/z3950/gateway.html

Most of these Z39.50 compliant search systems provide access to multiple databases, but they only support boolean logic queries.
Users are still burdened with the responsibility of query formulation and reformulation. Susannah noted that “Almost all literature on Z39.50 and its implementation focuses on the issues related to the implementor and the development of the standard. In the ten years since work on Z39.50 began, little attention has been given to the end user, the one who is supposed to ultimately benefit from the implementation of the standard.” [19].

2.3 Our Approach

A study conducted by Robertson and his colleagues found that there is very little overall difference in performance between the ‘best match’ search system, INSTRUCT, and the knowledge-based system, TomeSearcher [20]. To help current users of library online catalogs search more effectively, we consider the knowledge-based approach to be the most suitable and feasible at present.

The approach taken in this paper is to build an intermediary system on top of the existing facilities provided by OPACS. In this way, we can reuse the existing infrastructure to reduce cost, and take advantage of both the domain knowledge present in the structure of online catalog classification systems and the expertise of librarians in searching catalogs to enhance the users’ search sessions. The task of interfacing is made simpler by the widespread acceptance and use of the Z39.50 Information Retrieval Protocol.

Our expert intermediary system, the E-Referencer, encompasses the expert knowledge of reference librarians. Librarians are trained in formulating queries in boolean logic and reformulating them to obtain sufficient relevant records. Such formulation and reformulation strategies are domain-independent and can be applied to any boolean search system. The E-Referencer also makes use of the domain knowledge present in the structure of the classification scheme of the online catalogs to assist users.
In a library OPAC system, the documents and records are classified based on information such as subject heading and call number. This information is fully utilized in helping users get the most out of their searches.

2.4 Evaluation of E-Referencer and Results

The system, E-Referencer version 2.0, has been implemented and is accessible at the URL http://revelation.comp.nus.edu.sg/ERef2.0/.

An evaluation was carried out using an earlier version of the E-Referencer on 12 queries that were selected from among those submitted by university staff and students for literature searches. The queries were selected to cover a wide range of subjects. A complete list and description of the 12 queries is given below:

Query No.   Topic
A96-7       Digital library projects
A96-14      Cognitive models of categorization
A97-16      Internet commerce
B97-1       Making a framework for surveying service quality in academic libraries
B97-3       Looking to do a comprehensive literature review of the Sapir-Whorf Hypothesis
D96-2       Software project management
D96-16      Decision under uncertainty
D97-1       Thermal conductivity of I.C. Packaging
D97-2       Fault-Tolerant Multiprocessor Topologies
D97-4       Face recognition
D97-7       A study on computer literacy and use by teachers in primary schools
N97-13      Expert systems in library reference service

The Nanyang Technological University (NTU) library system in Singapore was used in the evaluation. Searches were performed by entering the query topics into the traditional search system provided by NTU’s library at http://www.ntu.edu.sg/library/opacs.html, and also into the E-Referencer prototype that we developed. The same set of queries was also given to an expert librarian, who performed her own search and reformulation using the NTU library search system.

To illustrate the process of searching by the three systems, we shall use one of the queries above: A96-7 “Digital library projects”.

a. NTU Library Search System

No record was returned when the query string was entered in this search system.

b. E-Referencer

The same query string was then entered into the E-Referencer. Initial Search Strategy 1 was carried out as follows:

• Stop-words were replaced with ANDs: “Digital library projects”
• The words in between the AND and OR operators were assumed to be adjacent: “Digital library projects”
• Individual words were stemmed and truncation signs added: “Digit? librar? project?”
• The formulated query string “Digit? librar? project?” was sent to the NTU library search system.

No record was retrieved, and Broadening Strategy 1 was activated:

• The operator AND was inserted between all adjacent words: “Digit? AND librar? AND project?”
• Once again, the query “Digit? AND librar? AND project?” was sent to the NTU library search system.

The search yielded four records:

1. Conversion of the microfilm to digital imagery: a demonstration project: performance report on the production conversion phase of Project Open Book / by Paul Conway, principal investigator.
2. Digital library visualization tool / by Yee Mun Sung.
3. Library development for mixed analog-digital circuit simulation / submitted by Ng Kian Ching, Ng Meng Hui.
4. Electronic services in academic libraries : ALA survey report / by Mary Jo Lynch, project director.

c. Expert Librarian

The expert librarian was able to retrieve 6 records from the given query string.

Searches using the other query strings from the above topics were also carried out. The results of the searches were given to two judges, who were asked to indicate whether the records retrieved were relevant, marginally relevant or not relevant. For this evaluation, records that were judged to be marginally relevant were considered to be non-relevant. The precision measure (proportion of records retrieved that are relevant) was calculated for the first 20 records displayed.
(The expert system currently displays only 20 records to the user for relevance judgment.) The mean precision for the 2 sets of relevance judgments was then calculated for each query. The consolidated results are given in Table 1.

             NTU Search System                 E-Referencer                      Librarian
Query No.    Displayed  Relevant  Precision    Displayed  Relevant  Precision    Displayed  Relevant  Precision
A96-7        0          0         0.00         4          0.5       0.13         17         6         0.35
A96-14       0          0         0.00         20         1         0.05         3          2         0.67
A97-16       4          1         0.25         4          1         0.25         20         7.5       0.38
B97-1        0          0         0.00         20         3         0.15         13         7         0.54
B97-3        0          0         0.00         2          0         0.00         4          3.5       0.88
D96-2        11         10.5      0.95         11         10.5      0.95         20         14        0.70
D96-16       0          0         0.00         20         9         0.45         20         9         0.45
D97-1        0          0         0.00         3          1.5       0.50         20         5         0.25
D97-2        0          0         0.00         5          3.5       0.70         6          6         1.00
D97-4        0          0         0.00         8          2         0.25         15         5         0.33
D97-7        0          0         0.00         19         0         0.00         14         2.5       0.18
N97-13       0          0         0.00         4          3.5       0.88         5          4         0.80
Average      1.25       1.0       0.10         10         3.0       0.36         13.1       6.0       0.54

Table 1: Comparison of Search Results

Note:
1. The figures given for No. Relevant and Precision are the average for 2 sets of relevance judgments by 2 persons.
2. The evaluation is based on the first 20 records retrieved. E-Referencer currently displays only 20 records for relevance judgment.

Table 1 shows that the E-Referencer performed much better than the traditional library search system: mean precision rose from 0.10 to 0.36. The reformulation strategies used by the E-Referencer helped the user retrieve more relevant records. However, the results also show that the E-Referencer did not perform as well as the expert librarian, whose mean precision was 0.54. It must be pointed out that the evaluation is skewed towards the expert librarian, because the librarian was able to execute several search statements and continually refine the search after examining the records retrieved in earlier search formulations. While the results show the final search set obtained by the librarian, only the first non-null set retrieved is shown for the E-Referencer.
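The precision figures in Table 1 follow from the procedure described above, which can be sketched in Java as shown below. This is a minimal illustration, not the code used in the evaluation: the class, enum and method names are hypothetical, and a precision of zero is assumed when no record is displayed, as in the NTU rows of Table 1.

```java
import java.util.List;

/** Illustrative sketch of the precision measure used in the evaluation. */
final class PrecisionSketch {

    enum Judgment { RELEVANT, MARGINALLY_RELEVANT, NOT_RELEVANT }

    /** Only the first 20 displayed records are judged. */
    static final int CUTOFF = 20;

    /** Precision for one judge: relevant records / records displayed (0 if none). */
    static double precision(List<Judgment> judgments) {
        int displayed = Math.min(judgments.size(), CUTOFF);
        if (displayed == 0) {
            return 0.0;
        }
        int relevant = 0;
        for (Judgment j : judgments.subList(0, displayed)) {
            // Marginally relevant records are counted as non-relevant.
            if (j == Judgment.RELEVANT) {
                relevant++;
            }
        }
        return (double) relevant / displayed;
    }

    /** Mean precision over the judges' sets of relevance judgments for one query. */
    static double meanPrecision(List<List<Judgment>> judgmentSets) {
        if (judgmentSets.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<Judgment> set : judgmentSets) {
            sum += precision(set);
        }
        return sum / judgmentSets.size();
    }
}
```

For query A96-7 on the E-Referencer, four records were displayed and the average number of relevant records was 0.5, which means one judge marked one record as relevant and the other marked none. The two precisions are 1/4 = 0.25 and 0/4 = 0, giving a mean of 0.125, reported as 0.13 in Table 1.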
The evaluation suggests that the E-Referencer can be an efficient tool for helping online catalog users to search better. The strategies used in the prototype still need to be refined further. Current work is aimed at realizing this. A complete description of the search strategies can be found in [16]. In the rest of the paper, we shall focus on the design and implementation of the E-Referencer.

3. Conceptual Design

For many years, library users who have problems finding what they need using the search systems in the libraries have approached the reference librarians, who are more knowledgeable and proficient with the catalog search system. In general, librarians ask what the user is looking for, clarifying the subject or topic when necessary, before constructing a query in boolean logic to search the catalog. If too few records are retrieved, the librarian either changes some of the boolean operators in the query, for example changing the AND operator to OR to get more records, or tries using similar keywords or broader subject headings in the reformulation. Similarly, if too many records are retrieved, the librarian may try modifying the boolean operators or using narrower subject headings to reduce the search results. This process goes on until the user is satisfied with the search result.

Basically, there are two types of knowledge present in a search session:

1. Domain knowledge: classification information of records in OPACS, such as Subject Headings, and Dewey and Library of Congress Call Numbers. This information can be used to help users clarify their search topic and locate the relevant documents. The hierarchy present in the various classification schemes can also be used to broaden or narrow a search.
2. Domain-independent knowledge: search strategies that librarians use to convert the user’s original query to the format expected by the OPAC (boolean logic), and to reformulate the query by modifying the boolean logic operators to get more or fewer records.

In the design of the E-Referencer, we have decided to incorporate these two types of knowledge into the system. A conceptual knowledge base of domain knowledge has been incorporated into the E-Referencer to map keywords to concepts represented in the subject headings. The domain-independent knowledge of formulation and reformulation rules has been implemented as search strategies in the E-Referencer. For a complete description of the search strategies used in the E-Referencer, the reader is referred to an earlier published paper [21].

4. System Design and Implementation

Having illustrated the potential of the E-Referencer, we will now describe its design and implementation.

4.1 Design Approach and Considerations

The approach we have adopted in developing our system is that of rapid prototyping and incremental development. We first implemented an initial prototype using simple-minded strategies specified by an experienced librarian. We then carry out experiments to evaluate the system and compare its performance with that of experienced librarians. From these, we identify the areas in which the system is deficient and how it can be improved. The prototype system was designed to study why experienced librarians are superior to ordinary users in their searches, and how expert search systems can be designed to match what experienced librarians can do. This approach of rapid prototyping is necessary because, at the start of the project, we did not know which strategy would be the most effective one to use. A detailed planning and design approach is thus not feasible.
This cycle of incremental development, testing and redevelopment has allowed us to add new features and refine the system gradually.

The increasingly popular Z39.50 Information Retrieval protocol was used to provide a common interface to multiple online catalogs. Search strategies used by librarians have been incorporated into the knowledge base of the prototype system using JESS.

E-Referencer uses a three-tier design architecture consisting of a client, proxy and server. The client handles user interaction, the server (a Z39.50 server) contains the data and search strategies, and a proxy sits between the client and server. This approach is necessary because the subject heading database required by the conceptual knowledge base is huge, around 300 megabytes. It is not feasible to send this large database across the network to every client in order to extract subject headings for what is usually a few keywords (at most ten). In our design, the proxy houses the database and handles all the processing required on the conceptual knowledge base. Keywords are submitted by clients to the proxy, which then searches the subject heading database and returns the relevant subject headings to the clients. The traffic generated is greatly reduced, since only the keywords and the resulting subject headings are transmitted. Another advantage of this approach is that we can log all the activities carried out by the different clients; the log information can be useful for further analysis.

4.2 Design of the E-Referencer

The design of the E-Referencer is shown in Figure 1. It consists of the following modules:

Client Modules

a. The Graphical User Interface (GUI) Module handles the interaction between the user and E-Referencer.
b. The Network Interface Module communicates with the E-Referencer proxy.
c. The Expression Module provides functions for manipulating a search expression.
d. The Control Module is the heart of the expert system.
It controls and calls the various functions of the system and has the following components:
   • A Knowledge Base of search strategies.
   • A Fact Base which stores the intermediate search results and information needed to select the next search strategy.
   • An Explanation Facility for explaining why and how certain strategies were chosen.
e. The Knowledge Module contains wrapper functions for integrating the expert system script of the Control Module with the other modules of E-Referencer.

Proxy Modules

a. The Proxy Controller Module accepts new connections from clients and activates the appropriate modules to handle the various clients’ requests.
b. The Keyword-Subject Association Module provides a list of subject headings that are associated with the keywords users specify in their query. The subject headings are used to augment the user’s original query to perform a more accurate search.
c. The OCLC Z39.50 Client API provides functionality for connecting to, searching and retrieving information from the various library systems that support the Z39.50 protocol.
d. The Z39.50 Interface Module provides a clean interface to the OCLC Z39.50 Client API. It isolates the rest of the system from changes to the OCLC Z39.50 Client API.

4.3 Client Modules Design and Implementation

a. Graphical User Interface Module

The widespread use of graphical operating systems like Windows, OS/2 and X-Windows has greatly increased the demand for programs written with graphical user interfaces. A well-designed graphical user interface provides a very simple and friendly way for the user to interact with the system. The mouse pointer allows for easy manipulation of the system, and the use of graphical items like buttons and scroll windows allows the system to present its information to the user clearly and effectively. Thus, the E-Referencer, which will eventually be used by ordinary users, has to support a graphical user interface.
In addition, we hope to make the E-Referencer easily accessible to all online catalog users, and thus a Web-based graphical interface is required in the design of the E-Referencer.

We have also designed the user interface to be simple, so that it is easy to use. The interface contains only one keyboard input area for the user to enter the query string, so that users will not need to spend too much time learning how to use the system. Limited information on the search results is displayed: title, author and publisher. The records retrieved are also arranged in a list for easy browsing.

Since the E-Referencer is also used as an experimental tool to help us refine and test our search strategies, we have included a server option and a strategy option in the design. The server option allows us to search different Z39.50 servers, while the strategy option allows us to use different strategies for reformulation purposes.

[Figure 1: Design of E-Referencer, showing the client modules, the proxy modules with the Subject Heading Database, and the external library servers (Library A and Library B online catalogs).]

Screenshots of the main user interface and the feedback window are shown in Figure 2 and Figure 3.

[Figure 2: E-Referencer Frame (main GUI)]

b. Expression Module

This module provides functions to manipulate the search expression that a user keys into the E-Referencer. The functions implemented are:

• Removing stop-words like a, an, is, the, which do not help in the search session.
• Stemming words to remove suffixes to get a common form for retrieving more records.
Porter’s algorithm [22], available at the Glasgow IDOM – IR Resources Web site (http://www.dcs.gla.ac.uk/idom/ir_resources/liguistic_utils/), was selected because of its simplicity and fast implementation.
• Conversion between AND, OR and Adjacent operators in the search string. Example:
     AND operators      -> “expert” AND “systems”
     OR operators       -> “expert” OR “systems”
     Adjacent operators -> “expert systems”
• Creating combinations of two or three keywords. Example, for the keywords Expert Systems Internet Intranet:
     Two keyword combination ->
       “Expert Systems” OR “Expert Internet” OR “Expert Intranet” OR “Systems Internet” OR “Systems Intranet” OR “Internet Intranet”
     Three keyword combination ->
       “Expert Systems Internet” OR “Expert Systems Intranet” OR “Expert Internet Intranet” OR “Systems Internet Intranet”

[Figure 3: Feedback Window]

c. Control Module

JESS, developed by Sandia Labs, is used to represent the expert knowledge in the E-Referencer. JESS is written entirely in Java and implements a subset of the CLIPS [23] language, which uses production rules to represent knowledge. A production rule consists of a list of facts followed by a list of actions. When all the facts in the list are asserted, the rule is said to have fired and the list of actions is executed. Actions could include asserting more facts, which could fire more rules.

In JESS, all production rules are specified in a script. JESS comes with a set of standard functions and provides a mechanism to wrap functions written in Java as JESS functions. All these functions can then be called from within the JESS script, which allows for greater flexibility in manipulating the system. JESS was chosen because its syntax is simple and search strategies can be specified as production rules easily. Furthermore, the ability to integrate Java functions into JESS makes it very suitable for our use, since E-Referencer is written in Java.
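The way production rules fire and chain can be illustrated with a small Java sketch. It is a deliberately naive forward-chaining loop written for illustration only: it is not how JESS is implemented, and the class name, rule names and fact strings are hypothetical. Each rule is allowed to fire at most once, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Naive forward-chaining sketch of production rules (not the JESS engine). */
final class ProductionRuleSketch {

    /** A rule lists the facts it requires and the facts its actions assert. */
    static final class Rule {
        final String name;
        final List<String> conditions;
        final List<String> assertions;

        Rule(String name, List<String> conditions, List<String> assertions) {
            this.name = name;
            this.conditions = conditions;
            this.assertions = assertions;
        }
    }

    /**
     * Fires every rule whose condition facts are all present, adding the facts
     * it asserts to the fact base, and repeats until no further rule can fire.
     * Returns the names of the fired rules in firing order.
     */
    static List<String> run(Set<String> facts, List<Rule> rules) {
        List<String> fired = new ArrayList<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                if (!fired.contains(rule.name) && facts.containsAll(rule.conditions)) {
                    facts.addAll(rule.assertions); // may enable other rules
                    fired.add(rule.name);
                    changed = true;
                }
            }
        }
        return fired;
    }
}
```

With a fact base holding "records-retrieved=0" and "query-has-adjacent-words", a rule requiring both facts can assert "BroadeningStrategy 1", and a second rule that requires "BroadeningStrategy 1" then fires on the next pass. This mirrors how asserting a fact in one rule leads to the firing of another.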
JESS uses an inference engine, based on the Rete (Latin for net) algorithm [24], to process the production rules. The Control Module creates this JESS inference engine, which processes the production rules contained in the JESS script. The components of the Control Module are:
• A Knowledge Base of search strategies. The strategies are represented as production rules in the JESS script EReferencer.clp. Example of a search strategy:
    If “No. of records retrieved = 0” and “No. of words in query = 1”
    then assert “Broadening Strategy 6” (prompt user to enter synonyms)
• A Fact Base, which stores the intermediate search results and the information needed to select the next search strategy. These are represented as asserted facts in JESS, which can lead to the firing of other production rules and, in some cases, of a rule that contains another search strategy. Example of facts:
    “No. of records retrieved = 10”
    “No. of words in query = 5”
• An Explanation Facility, which explains why and how a certain strategy is chosen. The explanations are embedded in the production rules.

Below is a sample piece of code for a search strategy implemented as a production rule in the JESS script. For a more thorough understanding of the JESS syntax and rules, refer to the JESS README (http://herzberg.ca.sandia.gov/jess/README.html).

  1   (defrule BROADENING_STRATEGY1
        ; Expression has adjacency operators.
        ; Convert them into ands.
        ?expr <- (Expr SearchExpr ?str)
  5     ?strategy <- (BroadeningStrategy 1)
        =>
        (retract ?strategy)
        (miscPrintout "Broadening Strategy 1: convert adjacent operators to and")
        (miscPrintout "Checking if adjacency operators are present.")
 10     (if (exprHasAdjWords ?str)
         then
          (retract ?expr)
          (miscPrintout "Adjacency operators found.")
          (bind ?newstr (exprAdjToAnd ?str))
 15       (assert (Expr SearchExpr ?newstr))
          (assert (KeyWordSearch))
         else
          (miscPrintout "No adjacency operators found.")
          (assert (BroadeningStrategy))
 20     )
      )

The sample code is the production rule for “broadening strategy 1”. The lines before the => give the list of facts, while the lines after the => give the list of actions. The Fact Base is implemented as asserted facts. The fact (Expr SearchExpr ?str) is always asserted; it stores the current search query in the variable ?str. Thus, in this case, the rule fires when the fact (BroadeningStrategy 1) is asserted in the Fact Base. Upon firing, the rule retracts the fact (BroadeningStrategy 1) from the Fact Base to prevent the same rule from firing again.

The Explanation Facility is implemented as code embedded in the different rules in the script. The code at lines 8, 9, 13 and 18 is part of the Explanation Facility; it tells the user which strategy is being used and which actions are executed. In this sample code, Broadening Strategy 1 is used, and a check is performed to determine whether there are any adjacent words in the query.

Line 14 shows how the JESS script calls the JESS function exprAdjToAnd from the Knowledge Module. This function is a Java function wrapped as a JESS function, and the method Call of the private class ExprAdjToAnd is invoked. Other facts are then asserted into the Fact Base (lines 15 and 16) so as to fire other rules, which may in turn select another strategy or perform some other function.

d. Knowledge Module

In order to integrate the rest of the modules written in Java (e.g.
the Expression, Subject, Z39.50 Interface and GUI Modules) with JESS, we need to create wrappers for them. These wrapper functions are grouped by module: all the wrapper functions for the Expression Module are grouped under Expression Functions, all those for the Z39.50 Interface Module under Z3950 Functions, and so on. A prefix is added to the name of each wrapper function to denote the module that it belongs to. The Knowledge Module thus contains all these groups of wrapper functions. Example of wrapper functions:
    Expression Functions { exprStem, exprRemoveStopWord, exprAndToOr ... }
    Z3950 Functions { z39Connect, z39Search, z39Display … }
    GUI Functions { GUIFeedbackDialog, GUIFrame … }

When facts in the Fact Base of the Control Module are asserted, certain rules in the Knowledge Base fire to select an appropriate strategy. The actions in the strategy call the wrapper functions in the Knowledge Module to perform the required operations. For example, given these rules:

    Rule 1
      If “No. of records retrieved = 0” and “No. of words in query > 1”
      Then Assert fact “Broadening Strategy 1”
      :
    Rule 2
      If “Broadening Strategy 1”
      Then ExprAdjToAnd    <- wrapper function call: Convert Adjacent to And
      :

When a user enters the query “Expert Systems”, the fact “No. of words in query = 2” is asserted in the Fact Base. The search is performed but no record is retrieved, so the fact “No. of records retrieved = 0” is then asserted. These two facts fire Rule 1 in the Knowledge Base, which asserts the fact “Broadening Strategy 1”; this in turn fires Rule 2. Rule 2 calls the wrapper function ExprAdjToAnd in the Knowledge Module, which activates the appropriate Java function in the Expression Module to convert the Adjacent operators in the original query to AND operators. A new query, “Expert AND Systems”, is thus created and used to perform another search.
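The Rule 1 and Rule 2 walk-through above can be imitated in plain Java to make the chaining visible. This is only a sketch under stated assumptions: it does not use JESS, the class RuleChainSketch and the fact strings are hypothetical, and the query is assumed to contain only keywords whose adjacency is implied by the spaces between them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical plain-Java imitation of the Rule 1 / Rule 2 chain; JESS itself is not used. */
final class RuleChainSketch {
    final Set<String> facts = new LinkedHashSet<>();  // stand-in for the Fact Base
    final List<String> trace = new ArrayList<>();     // stand-in for the Explanation Facility
    String query;

    RuleChainSketch(String query) {
        this.query = query;
    }

    /** Stand-in for the wrapped function exprAdjToAnd: adjacent keywords become ANDed keywords. */
    static String adjToAnd(String q) {
        return String.join(" AND ", q.trim().split("\\s+"));
    }

    /** Asserts the facts produced by a search, then fires rules until none applies. */
    void afterSearch(int recordsRetrieved) {
        int words = query.trim().split("\\s+").length;
        if (recordsRetrieved == 0) facts.add("records=0");
        if (words > 1) facts.add("words>1");

        boolean fired = true;
        while (fired) {
            fired = false;
            // Rule 1: nothing retrieved and a multi-word query -> select Broadening Strategy 1.
            if (facts.contains("records=0") && facts.contains("words>1")) {
                facts.remove("records=0");            // retract so the rule cannot fire again
                facts.add("BroadeningStrategy 1");
                trace.add("Rule 1 fired");
                fired = true;
            }
            // Rule 2: Broadening Strategy 1 -> call the wrapper and form the new query.
            if (facts.contains("BroadeningStrategy 1")) {
                facts.remove("BroadeningStrategy 1");
                query = adjToAnd(query);
                trace.add("Rule 2 fired");
                fired = true;
            }
        }
    }
}
```

Creating new RuleChainSketch("Expert Systems") and calling afterSearch(0) leaves the query as Expert AND Systems with both rules recorded in the trace, while a search that retrieves records fires neither rule. In the E-Referencer itself this matching is done by the JESS inference engine rather than by a hand-written loop.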
Each function group of wrappers is implemented as a public Java class that implements the JESS Userpackage interface. The individual wrapper functions are created as private classes and added to the JESS inference engine. Sample code:

    import jess.*;   // JESS classes; the package name jess is an assumption here

    // Groups the wrapper functions of the Expression Module.
    public class ExprFunctions implements Userpackage {
        // Registers each wrapper function with the inference engine.
        public void Add(Rete engine) {
            engine.AddUserfunction(new exprRemoveStop());
            engine.AddUserfunction(new exprStem());
            // :
            // :
        }
    }

The above code shows how the class ExprFunctions implements the group of wrapper functions for the Expression Module. Other groups are implemented similarly.

Each wrapper function is implemented as a private class that implements the JESS Userfunction interface. The class holds an attribute _name, which stores the name the programmer uses to invoke the function in the JESS script. The public method Call defines the operations to be performed when the function is called in the JESS script. Sample code:

    class exprRemoveStop implements Userfunction {
        Expression ex = new Expression();
        // Name under which the function is invoked in the JESS script.
        int _name = RU.putAtom("exprRemoveStop");

        public int name() { return _name; }

        // The single argument, the search expression, is read from index 1.
        public Value Call(ValueVector vv, Context context) throws ReteException {
            String expr = "";
            if ((vv.size() == 2) && (vv.get(1).type() == RU.STRING)) {
                expr = vv.get(1).StringValue();
                expr = ex.removeStop(expr);
            }
            return new Value(expr, RU.STRING);
        }
    }

This code shows how the class exprRemoveStop wraps the Java method removeStop of the Expression Module as the JESS function exprRemoveStop. All other Java functions are wrapped in the same way.

e. Network Interface Module

This module sets up the connection between the client and the proxy. Search requests issued by the client are submitted to the proxy through this module. The search result returned by the proxy is collected by this module before it is displayed to the user. All networking functions are implemented via the Socket class in Java.

3.4 Proxy Modules Design and Implementation

a. Proxy Controller Module

As the name implies, this module controls all the activities and transactions that are carried out in the proxy. It accepts connections from E-Referencer client applets and establishes a connection with the selected Z39.50 server on behalf of each client. The Z39.50 Interface Module is invoked for the connection. If a client applet requests the subject headings associated with the keywords it submits, the controller invokes the Keyword-Subject Association Module to retrieve the relevant subject headings. If logging is required at a later stage, it can be implemented in this module, since this module handles all transactions.

The Proxy Controller Module is implemented using Java Threads. When the proxy starts up, a controller thread is created to listen for client requests. Each time a new client's connection request is received, the controller thread instantiates two new threads to handle all future requests from that client. One thread handles all activities between the client and the proxy, while the other handles all activities between the proxy and the Z39.50 server that the client is connecting to. With two separate threads, communication is continuous, since a blockage in one communication channel does not affect the other. When the connection to the client dies, the two threads are also killed and reclaimed.

b. OCLC Z39.50 Client API

The E-Referencer is developed as a system capable of searching the many different online library catalogs available. Thus, a standard System Interface Module to all these online catalogs is needed. By basing the System Interface Module on the Z39.50 protocol, we allow users to access all existing and newly created Z39.50 compliant online catalogs. A search on the Web for existing resources that could be used in our development effort revealed the following class libraries and toolkits:
• OCLC Z39.50 Client API.
  The API is written entirely in Java and implements the latest version of the Z39.50 protocol, version 3, released in 1995. The API is provided free and comes with source code.
• Index Data's YAZ (Yet Another Z39.50 Toolkit). YAZ is a toolkit for implementing the Z39.50v3 protocol. The toolkit supports both the Z39.50 and the ISO10163 SR protocols. Both the Origin (client) and Target (server) roles of the protocol are supported. The toolkit is written in C. YAZ is also provided free.
• ZedKit for Unix. The Z39.50 Application Development Libraries were developed for the German Library Project DBV OSI II and for the ONE project co-funded by the European Commission Libraries Programme, and are written in C/C++.

We tested the OCLC Z39.50 Java API by using the sample client application, zclient, that comes with the API to connect to a few Z39.50 compliant online catalogs. Since there are many different Z39.50 server implementers, such as INNOPAC and DRA, it was necessary to test the OCLC API against a representative few of the servers implemented by the different vendors. The API was tested with the DRA server at the NTU Library, the INNOPAC server at the National University of Singapore (NUS) and the Ameritech HORIZON server at Clarke College, Dubuque, Iowa. Although some fine-tuning was needed because of differences in implementation between vendors, we managed to connect to all the servers, specify the database to search, send some sample search queries and retrieve the records.

From the tests, we found the OCLC Z39.50 Client Java API the most suitable for our use. The API is written entirely in Java, which makes it easy to integrate into our design framework, and it supports many functions of the latest version of the Z39.50 protocol.

c. The Z39.50 Interface Module

Having decided to adopt the OCLC Z39.50 Client API for development, we then designed the Z39.50 Interface Module, which serves as a “wrapper” module on top of the OCLC Z39.50 Client API. The module provides simple Z39.50 functions such as connect, search, retrieve and close for use by the E-Referencer; these functions are implemented using the OCLC Z39.50 Client API.

This design has two advantages. Firstly, the Z39.50 Interface Module isolates the rest of the program from the OCLC Z39.50 Client API code. If the OCLC Z39.50 Client API changes, or if we need to switch to a different API, we only need to modify the code in the Z39.50 Interface Module, while the rest of the program code stays intact. Secondly, by grouping the primitive OCLC API procedures into higher level procedures such as search and retrieve, we have simplified the coding of the other modules that make use of Z39.50 functions.

A new z39client class is created to provide Z39.50 functions such as connect, search and retrieve for the proxy. Each function is implemented as a method and makes calls to the OCLC Z39.50 Client API to perform its operation. During implementation, we found that the zclient application implemented most of the Z39.50 functions supported by the API, so we modified the zclient source code to create our own z39client class. Most of the existing methods in zclient were modified and reused, but some of the functions provided by zclient were too simple and we had to create new methods to suit our needs. For example, the display method in zclient does not guarantee that the requested number of records is retrieved and displayed. Therefore, we created a retrieve method in our z39client class, which calls the display method in a loop until the specified number of records has been retrieved.

d. Keyword-Subject Association Module

The conceptual knowledge base that maps free-text keywords to concepts represented as LC (Library of Congress) subject headings is a very useful tool for helping users clarify their search topic. This conceptual knowledge base is implemented as the Keyword-Subject Association Module in the proxy. The module accesses a subject heading database that contains the keyword-subject heading mappings for all the keywords found in the LC bibliographic catalogue from 1980 to 1998. As mentioned earlier, this large database is one of the main reasons for the three-tier design architecture.

The major task in implementing this module is the creation of good data structures for storing the keyword-subject heading mappings so that they can be retrieved efficiently. Since the database is located at a centralized proxy, disk space is not a constraint. An inverted file is thus used to store the keyword-subject heading mappings, to allow for fast and efficient retrieval.

A Subject Heading Map is also created to map each subject heading to a unique number. The keyword-subject heading inverted file uses these numbers to represent the subject headings. This greatly reduces the size of the inverted file, because the numbers require much less storage than the subject heading strings. The Subject Heading Map is implemented as a sorted text file so that a binary search algorithm can quickly map subject headings to their unique numbers and vice versa.

5. Conclusion

The current online catalog search systems do not provide enough assistance for users in their searches. There have been various attempts to develop new systems that use different approaches to help users in this area. Such systems can be broadly categorized as ‘best match’ systems or knowledge-based systems. Both kinds of system have their merits and problems.
However, we felt that the knowledge-based approach is more suitable for the domain of online catalog searching, and have thus adopted it to solve the above-mentioned problem.

A Web-based search interface known as E-Referencer has been developed to provide an accessible and useful search tool for online catalog users. At present, the E-Referencer is also used as a tool for experimenting with search strategies, with the aim of creating a system that helps users search effectively. Early test results are encouraging, and the E-Referencer is still being refined. We hope that, with more testing and further iterations of refinement, it can eventually be deployed for widespread use as an effective searching tool for online catalog users.

References

1. Z39.50 Maintenance Agency. URL http://lcweb.loc.gov/z3950/agency/.
2. JESS, the Java Expert System Shell. URL http://herzberg.ca.sandia.gov/jess/.
3. Cousins, S.A.: In their own words: An examination of catalogue users’ subject queries. J. Amer. Soc. Inf. Sci. 46 (1992) 329-341.
4. Dalrymple, P.W.: Retrieval by Reformulation in Two Library Catalogs: Toward a Cognitive Model of Searching Behavior. J. Amer. Soc. Inf. Sci. 41 (1990) 272-281.
5. Ensor, P.: User Practices in Keyword and Boolean Searching on an Online Public Access Catalog. Inf. Tech. Libr. 11 (1992) 210-219.
6. Lancaster, F.W., Connell, T.H., Bishop, N., McCowan, S.: Identifying Barriers to Effective Subject Access in Library Catalogs. Libr. Reso. Tech. Serv. 35 (1991) 377-391.
7. Markey, K.: Subject Searching in Library Catalogs: Before and After the Introduction of Online Catalogs. OCLC Online Computer Library Center, Dublin, OH (1984).
8. Hildreth, C.: Beyond Boolean: Designing the Next Generation of Online Catalogs. Libr. Trends 35 (1987) 647-667.
9. Borgman, C.L.: Why are Online Catalogs Still Hard to Use? J. Amer. Soc. Inf. Sci. 47 (1996) 493-503.
10. Khoo, C., Poo, C.C.D.: An Expert System Front-End as a Solution to the Problems of Online Catalogue Searching. In: Information Services in the 90s: Congress Papers. Library Association of Singapore, Singapore (1991) 6-13.
11. Al-Hawamdeh, S., Ellis, D., Mohan, K.C., Wade, S.J., Willet, P.: Best match of document retrieval: development and use of INSTRUCT. Proceedings of the Twelfth International Online Information Meeting (1998) 761-767.
12. Robertson, S.E.: Overview of the Okapi Projects. J. of Doc. 53, no. 1 (1997) 3-7.
13. Gauch, S.: Search improvement via automatic query reformulation. ACM Transactions on Information Systems 9 (1991).
14. Vickery, A., Brooks, H.M.: PLEXUS – The expert system for referral. Information Processing & Management 23 (1987) 99-117.
15. Tong, R.M., Applebaum, L.A., Askmann, V.N., Cunningham, J.F.: Conceptual information retrieval using RUBRIC. In: Yu, C.T., Van Rijsbergen, C.J. (eds.): Proceedings of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1987) 247-253.
16. Drabenstott, K.M.: Enhancing a New Design for Subject Access to Online Catalogs. Libr. Hi Tech 14 (1996) 87-109.
17. Drabenstott, K.M., Weller, M.S.: Failure Analysis of Subject Searches in a Test of a New Design for Subject Access to Online Catalogs. J. Amer. Soc. Inf. Sci. 47 (1996) 519-537.
18. Drabenstott, K.M., Weller, M.S.: The Exact-Display Approach for Online Catalog Subject Searching. Inf. Proc. Manag. 32 (1996) 719-745.
19. Z39.50: An Overview of Development and the Future. URL http://www.cqs.washington.edu/~camel/z/z.html.
20. Robertson, A.M., Willet, P., Vickery, A., Thompson, W.: Comparison of Statistically-based and knowledge-based approaches to information retrieval. Inf. 90 (1990) 282-286.
21. Khoo, C.S.G., Poo, D.C.C., Liew, S.-K., Hong, G., Toh, T.-K.: Development of Search Strategies for E-Referencer, an Expert System Web Interface to Online Catalogs. In: Toms, E., Campbell, D.G., Dunn, J. (eds.): Information Science at the Dawn of the Millennium: Proceedings of the 26th Annual Conference of the Canadian Association for Information Science. CAIS, Toronto (1998).
22. Porter, M.F.: An Algorithm for Suffix Stripping. Program 14 (1980) 130-137.
23. CLIPS: A Tool for Building Expert Systems. URL http://www.ghg.net/clips/CLIPS.html (1997).
24. Forgy, C.L.: Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19 (1982) 17-37.

          Search by NTU Search System             Search by E-Referencer                  Search by Librarian
Query No. No. Displayed  No. Relevant  Precision  No. Displayed  No. Relevant  Precision  No. Displayed  No. Relevant  Precision
A96-7     0              0             0.00       4              0.5           0.13       17             6             0.35
A96-14    0              0             0.00       20             1             0.05       3              2             0.67
A97-16    4              1             0.25       4              1             0.25       20             7.5           0.38
B97-1     0              0             0.00       20             3             0.15       13             7             0.54
B97-3     0              0             0.00       2              0             0.00       4              3.5           0.88
D96-2     11             10.5          0.95       11             10.5          0.95       20             14            0.70
D96-16    0              0             0.00       20             9             0.45       20             9             0.45
D97-1     0              0             0.00       3              1.5           0.50       20             5             0.25
D97-2     0              0             0.00       5              3.5           0.70       6              6             1.00
D97-4     0              0             0.00       8              2             0.25       15             5             0.33
D97-7     0              0             0.00       19             0             0.00       14             2.5           0.18
N97-13    0              0             0.00       4              3.5           0.88       5              4             0.80
Average   1.25           1.0           0.10       10             3.0           0.36       13.1           6.0           0.54

Table 1: Comparison of Search Results

Note:
1. The figures given for No. Relevant and Precision are the average of 2 sets of relevance judgments made by 2 persons.
2. The evaluation is based on the first 20 records retrieved. E-Referencer currently displays only 20 records for relevance judgment.
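The arithmetic behind Table 1 can be checked with a short Java sketch. It rests on two assumptions that are consistent with the tabulated figures but are not stated in the paper: precision is taken as No. Relevant divided by No. Displayed (0 when nothing is displayed), and the Average row is the mean of the per-query values. The class name PrecisionSketch is hypothetical.

```java
/** Hypothetical helper for re-deriving the figures in Table 1; not part of the E-Referencer. */
final class PrecisionSketch {

    /** Precision of one search: relevant records over displayed records, 0 if none are displayed. */
    static double precision(double relevant, int displayed) {
        return displayed == 0 ? 0.0 : relevant / displayed;
    }

    /** Mean of the per-query precision values (each query carries equal weight). */
    static double meanPrecision(double[] relevant, int[] displayed) {
        if (relevant.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i < relevant.length; i++) {
            sum += precision(relevant[i], displayed[i]);
        }
        return sum / relevant.length;
    }
}
```

Applied to the E-Referencer columns of the twelve queries, meanPrecision gives a value that rounds to the 0.36 shown in the Average row.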