Domain-dependent Information Gathering Agent

Aleksander Pivk*, Matjaz Gams
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Abstract

A universal agent should be capable of gathering information from arbitrary heterogeneous sites and of offering intelligent information services on its own, based on the information so gathered. We present a domain-dependent agent for information gathering. It can visit an arbitrary domain-related site by observing a user perform the first query. By understanding the key concepts of the first query, the agent performs subsequent queries autonomously. When a user asks the agent about a particular item, the agent gathers relevant information from various sites. The major advantage of the agent is the semi-automatic creation of a wrapper around a particular site with few human interventions. We have implemented two versions of such information-gathering agents: ShinA (SHoppINg Assistant) for e-trading tasks, and EMA (EMployment Agent), which performs employment and job-related functions over the Internet.

Keywords: Intelligent agent; Comparison shopping; Information extraction; Inductive learning; Wrapper

1. Introduction

Internet services offer several advantages over the classical ones, and are already changing the way companies, customers, and users perform their activities. But the rapid growth of the Internet has resulted in an explosion of available information and online services. To make things worse, the Internet is poorly structured and disorganized (Lewis, 1999). There are millions of heterogeneous Internet sites offering various services, which makes the Internet unintelligible for any single user conducting a non-trivial task. Therefore, the serviceability of the Internet depends to a large degree on the amount of automated, uniform, user-friendly services, systems, and tools (Levy & Weld, 2000; Pazienza, 1997).

The negative side of the Internet's success is evident when one wants to gather relevant information. One of the first solutions to cope with information overload was the search engine. Typically, a user provides a couple of keywords, and a search engine finds links to relevant sites. This simple approach has several drawbacks. A search engine proposes only simple links to hopefully relevant sites. A user must still follow the links manually, and further browse for the desired information (Balabanovic & Shoham, 1995). Some of the links point to vast databases with specialized interfaces. This forces the user to get acquainted with new interfaces, and slows the user down even more. In addition, this approach favors the largest information providers due to name-branding and time optimization. As a result, it largely reduces the basic advantage of the Internet. To further complicate matters, a user is unaware of changes in already visited sites unless he or she revisits the sites frequently.

A rather simple solution is to provide alert services when a relevant site changes. For example, several e-commerce sites already allow shoppers to order price alerts that notify a shopper when the price of a product changes, or falls below a specified amount. But some of these services require lengthy surveys to be filled out before they can be used, while at the same time they may provide little or no personalization. It is also undesirable that such services weaken user privacy (Jakobsson & Yung, 1998). Another approach involves voluntary ratings and reviews of sites by users, resulting in personalization and recommendations.
Again, this takes a lot of user time and introduces problems with privacy. In addition, recommendation systems effectively reduce the size of the marketplace and introduce bias, as it is difficult to obtain a sufficient number of ratings for every existing vendor, and to control the reliability of the sources (Pivk, 2001; Basu, Hirsh & Cohen, 1998; Schafer, Konstan & Riedl, 1999).

In this paper we present solutions to the problem of information gathering. We propose a domain-dependent information-gathering agent that extends the existing approaches by providing automatic uniformity, user autonomy, and privacy. We implemented two systems: a semi-uniform intelligent agent for employment tasks (Gams, 2001) and one for e-commerce (Pivk & Gams, 2000). Our principal goal was to improve accessibility and expand the benefits of existing information-gathering approaches.

The paper is organized as follows. In Section 2 we describe agents, in Section 3 wrappers, in Section 4 the algorithm, in Section 5 the implementation for employment tasks, and in Section 6 the implementation for e-trading. Related work is described in Section 7. We summarize our ideas in the Conclusion.

* Corresponding author. Tel.: +386-1-447-3380; fax: +386-1-425-1038. E-mail address: aleksander.pivk@ijs.si

2. Agents

In information gathering, one of the most attractive new approaches is the use of software agents. They help automate a variety of activities, most of which are time consuming. Software agents differ from "traditional" software in the sense that they are personalized, social, continuously running and semi-autonomous (Guttman, Moukas & Maes, 1998). In this way, the Internet is becoming more user-friendly, semi-intelligent and human-like.

There are many definitions of what the term "agent" denotes, based on different approaches, expectations and visions (Bradshaw, 1997). Shoham (1997) describes a software agent as a software entity which functions continuously and autonomously in a particular environment often inhabited by other agents and processes. The requirement for continuity and autonomy derives from the human desire that an agent be able to perform activities in a flexible and intelligent manner, responsive to changes in the environment, and without constant human supervision. An agent that functions over a long period of time should be able to adapt on the basis of its experience (Liebowitz, 1999).

There are two types of agents closely related to our research: information/Internet agents and shopping agents. Information/Internet agents perform the role of managing, manipulating, or collating information from many distributed sources (Maes, 1994; Maes, Guttman & Moukas, 1999). The motivation for developing information agents is at least twofold. Firstly, there is an increasing need for tools that manage the information explosion. Secondly, there are also vast financial opportunities to be gained.

Shopping agents find products (the terms product/object/item/entity are used as synonyms in this paper) under the best terms among different e-commerce sites (Dastani, Jacobs, Jonker & Treur, 1999; Guttman, Moukas & Maes, 1998). A shopping agent queries multiple sites on behalf of a shopper to gather pricing and other information on products and services. However, basic comparison-shopping agents (Greenwald & Kephart, 1999) still lack several desired properties. They introduce a marketplace that is biased in favor of those e-commerce sites that collaborate with the shopping agents.
In addition to a limited number of e-commerce sites to choose from, those participating sites often do not offer the best prices. Typically, e-commerce sites that are included into an agent's repertoire follow specific protocols or include hand-coded wrappers to transform data into agent-readable forms (Hammer et al., 1997; Muslea, Minton & Knoblock, 1998). An ideal agent would extract the contents of heterogeneous Internet databases online, and present the data to a user in a uniform way (Cowie & Lehnert, 1996). While humans can perform this task, current computer systems are not intelligent enough to perform it successfully, as the type, amount, and organization of the information provided by databases differ from site to site (Baek, Liebowitz, Prasad & Granger, 1999).

For an autonomous agent, the desired properties are uniformity, user autonomy and privacy. Uniformity refers to the ability of an agent to automatically enter various sites and adapt to the type, amount, and organization of the information provided by various companies. Since this property demands human-level intelligence, the task is beyond current computer systems. In reality, we accept limited uniformity, i.e. semi-uniformity. Autonomy refers to the idea that an agent provides the best possible service by remaining as independent as possible from both customers and vendors. Autonomy from vendors implies that the service is to remain unbiased by performing wide searches (as opposed to only searching the databases of a few "preferred" vendors). This can be achieved by progress in making interfaces more uniform, and by improved methods for interpreting potential hits. Autonomy from the customer means that users can be relieved of the tedious task of searching for information and of needing to adjust to different sites. Additionally, our research addresses the privacy of the shopper by concealing the identity and behavior of the user. However, we note that the privacy provided is conditional, and should be selectively revoked if abuse is suspected. These aspects of our proposed agent provide for an unbiased marketplace where the user benefits in many respects, and safely stays hidden behind the agent, which performs anonymous searches on the user's behalf.

3. Wrappers

There is a classical solution for dealing with heterogeneous information sources (in our case domain-specific databases), where each source has a unique way of providing information to the user. Namely, a program may be able to integrate information from the sources if it has at its disposal a communicator-translator, generally referred to as a wrapper (Yang, Seo & Choi, 2000), which produces a translation rule for each site. Formally, a wrapper is a program or a rule that understands information provided by a specific source and translates it into a regular form that can be further reused. A wrapper is an essential component of a mediator system (Fig. 2), which accepts queries from users, translates each one into the appropriate query for the individual source, fetches the relevant pages from that source, extracts the requested information from the retrieved pages, and returns the result to the user. Essentially, wrappers make Web sources look like databases that can be queried through the mediator's query language (Kushmerick, Weld & Doorenbos, 1997; Cowie & Lehnert, 1996).
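As an illustration of this architecture, a wrapper can be pictured as a small, uniform interface that the mediator calls for every source; the sketch below is only indicative, and all names in it (Wrapper, translateQuery, extract) are our own rather than taken from a concrete system.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of a wrapper as seen by a mediator system.
    interface Wrapper {
        // Translate a uniform mediator query into the source-specific query (e.g. a URL).
        String translateQuery(Map<String, String> mediatorQuery);

        // Parse the HTML reply of the source into uniform records (attribute -> value).
        List<Map<String, String>> extract(String htmlReply);
    }

The mediator then only iterates over its registered wrappers, so adding a new source amounts to supplying one more such translator.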
Wrappers can be either hand-coded or created automatically by software agents (Muslea, Minton & Knoblock, 1999). Hand-coding wrappers is a tedious and time-consuming task. In addition, a comparison agent with such manually written wrappers does not scale, because new stores are not automatically integrated. Automatic construction of wrappers by agents for online stores is a challenging issue, mainly because HTML documents on the Web are not agent-friendly (Liu, Pu & Han, 2000). Another difficulty for automatic wrapper generation comes from the heterogeneity of stores, in the sense that different stores employ different mechanisms for manipulating customer queries, and different styles for displaying product information. Fortunately, domain-dependent sites such as online stores are mostly semi-structured (Atzeni, Mecca & Merialdo, 1997), and we can extract at least some meaningful information without the help of semantic-based modules.

4. Information-gathering algorithm

We present our algorithm for an automatic information-gathering domain-dependent agent.

Task: Gather information from heterogeneous Internet databases with as little human intervention as possible.

Domain-dependent constraint: Heterogeneous Internet databases have to contain common concepts, like a job definition in employment tasks, or a product description in e-commerce.

Before presenting the algorithm, we describe the basic data structures used.

4.1. Data structures

A record represents a product/object/item/entity, e.g., a job offer or a product offer. Here is an example of a product offer from Amazon, in this case a book: Special Edition Using Java Server Pages and Servlets, by Mark Wutka, Our Price: $27.99, You Save: $12.00 (30%), Used Price: $15.06, Availability: Usually ships within 24 hours, Paperback - 754 pages 1st edition (January 15, 2000), Que; ISBN: 0789724413; Dimensions: 1.70 x 9.09 x 7.36, Amazon.com Sales Rank: 69,177.

A reply usually consists of records of items in HTML. The items are stored as lists of records of text, or as records in XML format.

A form corresponds to an HTML form. There are two variations of this data structure: empty, corresponding to the HTML source, and filled, with data filled in. An example of a form is presented in Fig. 1. The relevant part of the HTML source is, in outline:
    <form name="shinaForm" method="post" action="shinaServlet">
      <!-- additional parameters omitted -->
      <a href="...">Need help?</a>

      <input type="text" name="searchStr" onChange="validateQuery()">

      <input type="submit" value="Search">

      Number of items to be displayed:
      <select name="numItems">
        <option>10</option> <option>20</option> <option>50</option>
      </select>
    </form>
Fig. 1. ShinA's main interface.

The first paragraph of the HTML code introduces the form name (shinaForm), the method (post), the action (shinaServlet) that is performed upon query submission, a couple of parameters, and a link to Help. The second paragraph presents an input field for a text query, which gets validated by the validateQuery method. The third paragraph describes ShinA's submit button. When it is pressed, the entered text (the query) is sent to the shinaServlet program. The last paragraph lets a user choose the number of items per output page.

A query is a string of text submitted to a server through an HTML form. An example of a query searching for "Java Servlets" is:

    http://ai.ijs.si/shinaServlet?searchStr=Java+Servlets (3/1)

The query consists of the address of the server (http://ai.ijs.si), the name of the program (shinaServlet), the parameter name (searchStr), and the entered value. Our system also attaches statistics to each query in the form of a success rate (here 3 attempts / 1 failure).

HTML commands are parts of a form. The agent uses them as patterns. An example of an HTML command as a form field is the input field shown above:

    <input type="text" name="searchStr" onChange="validateQuery()">

A query-reply consists of a query and a corresponding reply. With it the system records the questions sent by the user together with the replies generated.

Fig. 2. An information integration system (mediator) accesses databases through wrappers.

Extraction rules enable parsing of the output of a particular database in order to get a uniform output from the mediator system. Extraction rules represent the syntactic structure of database output (Ashish & Knoblock, 1997). Our implementations are rather straightforward, e.g. a rule searches for a text description inside HTML documents, separated by specified delimiters. Examples of extraction rules are:

    text := string of characters
    number := string of digits 0 .. 9
    ...
    price := html.body.td[i].number WHERE html.body.td[i].b.text = "EUR"
    ...
    jobDescription := { text1 jobDefinition text2 delimiter }
    productDescription := { ID name price text }

The first two lines are examples of basic syntactic structures like text and number. The middle part is an example of an extraction rule for price. It consists of a number, which is followed by a currency type, in this case EUR. The last two lines represent top-level extraction rules, describing major concepts like jobDescription or productDescription. A jobDescription consists of some text before an actual jobDefinition and some text afterwards, where a delimiter separates each jobDescription. Schematically, the system communicates with heterogeneous databases using extraction rules for wrapping (Fig. 2).
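To make the delimiter-based rules concrete, the fragment below sketches one possible way to realize the price rule above in code, using a regular expression over the raw HTML; this encoding and the class and method names are ours, not part of EMA or ShinA.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: match a number followed by a bold "EUR" label inside a table cell,
    // mirroring  price := html.body.td[i].number WHERE html.body.td[i].b.text = "EUR".
    class PriceRule {
        private static final Pattern PRICE =
            Pattern.compile("<td[^>]*>\\s*([0-9]+(?:[.,][0-9]+)?)\\s*<b>\\s*EUR\\s*</b>");

        static String extractPrice(String html) {
            Matcher m = PRICE.matcher(html);
            return m.find() ? m.group(1) : null;   // null means the rule did not trigger
        }

        public static void main(String[] args) {
            System.out.println(extractPrice("<tr><td>27,99 <b>EUR</b></td></tr>"));  // prints 27,99
        }
    }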
4.2. Algorithm

The algorithm consists of two major phases:

1) The initial query to an unknown database. The agent observes a user enter and query a new Internet database (ObserveUser; names of procedures are in italics). The agent then records and parses the communication between the human and the database, and extracts sufficient content (RuleExtraction over the HTML, concept matching). If necessary, the agent consults the user to reach the desired level of understanding of the communication. This enables nearly error-free performance of the system.

2) Subsequent queries (MappAndMatch, Retrieve). The agent performs a new query to the Internet database by simulating what the user did. If the query is successful, the agent updates the statistics for the query and presents the information to the user through a uniform interface. If the query is not successful, the system updates the statistics and checks whether the query has become invalid. If the query is found to be invalid, the system demands that new rules be extracted for the particular site. The user is notified accordingly.

When connecting to a new database for the first time, the ObserveUser procedure is applied:

    procedure ObserveUser(query, commands, reply);
    {A user starts communicating with a database through the agent interface. The agent
     remembers user commands and HTML commands without active participation. The user is
     in principle unaware of the agent's presence.}
    begin
      start observing the communication between the user and the database interface;
      repeat
        remember query and commands;
      until session is finished;
      if session finished with error then
        forget the whole session;
      else
        RuleExtraction(query, commands, extractRules, reply);
    end;

    procedure RuleExtraction(query, commands, extractRules, reply)
    {The agent stores three basic types of information:
     - database description: name, URL address, title, country and other available information;
     - query and extraction rules: to perform another query in the future; the query is either
       stored for the first time or modified accordingly;
     - a list of HTML commands: the agent parses HTML commands, especially those that perform
       some action with the input/output fields; other commands are most often skipped.}
    begin
      store query into the appropriate data structure;
      find a delimiter in the reply;
      repeat  {parse commands and create extraction rules}
        find the next HTML command in the reply;
        if this command fits a rule then
          add the rule to the list of extractRules;
        else
          ignore the command;
      until all commands are parsed;
      set pointers to establish query-reply structures;
      {end parsing commands, start parsing the reply in order to extract concepts}
      repeat
        find the first concept/pattern/item (e.g., a job, a product, ...);
        add the concept/pattern into the corresponding list of entities  {mapping};
      until the document is parsed and transformed into a list of entities for the mediator system;
    end;

The procedure for creating extraction rules establishes connections between the parsed HTML document and the domain-dependent concepts. In this way, background knowledge in the form of ontologies is used to capture the meaning of the essential parts of the HTML document. For example, in the domain of employment tasks, ontologies can be used to identify professions.

If the RuleExtraction procedure fails at any stage, it calls for supervisor intervention. On demand, it shows all intermediate results, e.g. the list of extracted rules, which the supervisor can modify at will. If the system decomposes the process well, no human help is needed. In any case, at the end of the session the system assumes that the parsing and mapping procedures are flawless. This does not mean that no mistake can occur in the future, because a page can change.
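The information that RuleExtraction stores for each site can be pictured as one record per database; the sketch below is merely illustrative, and the class and field names are our own.

    import java.util.List;

    // Illustrative record of what is stored per site after RuleExtraction:
    // database description, the observed query, and the extraction rules.
    class SiteWrapperEntry {
        String name;                  // database description: name, title, ...
        String country;
        String serverUrl;             // e.g. "http://ai.ijs.si"
        String queryTemplate;         // e.g. "shinaServlet?searchStr={value}" with a placeholder
        List<String> extractionRules; // textual rules in the style of Section 4.1
        List<String> htmlCommands;    // parsed HTML commands that act on input/output fields
    }

Retrieve later instantiates the stored query with the user's new keywords and applies the stored rules to the reply.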
The Retrieve procedure connects to the databases of previous queries, where the queries are modified to reflect the user's question. The Retrieve module connects to a database in two ways:

• if an identical query was previously performed successfully, the agent simply repeats it;
• otherwise the system tries to create a query based on a similar query.

The best way to do this is to use ontologies to "understand" which queries qualify as promising candidates. For example, if a user asks for a specific type of shoes, all shoe-related shops and all shoe-related previous queries seem to be a good basis for a new query. The new query is constructed by replacing previous entities with the specific type of shoes.

    procedure Retrieve(query, replies);
    {get information from the query database and write it into replies, consisting of a list
     of entities with corresponding parameters and their values}
    begin
      find all queries with input fields identical to the one typed in by the new user;
      if no queries found then
        CreateQuery(queries);
      for all queries do
        extract the database address from the query, connect to the Internet address and obtain lastReply;
        if an error is obtained then
          if the query is invalid then
          begin
            RuleExtraction(query, null, extractRules, null);
            notify the user;
            if RuleExtraction not successful then notify the supervisor;
          end;
          else
            notify the supervisor;
        else
        begin
          MappAndMatch(lastReply, replies);
          update statistics;
        end;
      show extracted replies to the user through the mediator system;
    end;

    procedure CreateQuery(queries)
    {create a new list of queries based on previous similar ones}
    begin
      apply an ontology to create a list of new queries;
      for all previous queries that correspond to similar items do
      begin
        replace items with new ones;
        set statistics for new queries;
      end;
    end;

There are different types of ontologies, some of them hard-coded into the system and others consisting of simple or complex data structures. For example, if a query corresponds to a form with just one input field for a database search, and the user enters just one input field in the mediator form, then this is a perfect match. Another ontology is based on clustering databases by the entities they provide (i.e., the kinds of products). Other ontologies consist of a list of entities that can be entered into a specific input field. For example, practically all job definitions are defined in a national catalog.

Ontologies play a substantial role in constructing modifications of a list of queries and commands. For example, a full "looking-for-jobs" query can be described in a form with a flexible frame consisting of many subfields. The query contains information about the job profession and additional specifications of area, income, type of work, etc. The routine checks that all common major items, such as the profession, are filled in and that the modified commands are (hopefully) related to the same job specification.

    procedure MappAndMatch(lastReply, replies)
    {based on the extracted rules, parse lastReply and add each extracted entity with
     corresponding values into replies, i.e. into the mediator system}
    begin
      repeat
        match each extraction rule to lastReply;
        when the first rule triggers, delete the matched top of lastReply;
        if the extracted, transformed text corresponds to a part of the entity then
          add it into replies;
      until end-of-lastReply;
    end;

Advanced performance of the system is based not only on ontologies but also on memory-based learning. The ontologies are more or less static and sometimes even hand-coded. They represent basic knowledge about a specific problem domain. Some ontologies are dynamically created or instantiated, e.g. those describing shops, which makes it possible to search similar shops. Namely, each shop typically presents its product catalogue, basic orientation, and structure. While the ontologies are domain-specific, the memory-based learning is a plain, uniform implementation of the basic technique published elsewhere (Kitano, 1993). The algorithm was only slightly modified according to the characteristics of the problem domain. It performs memory-based learning by storing queries and statistics: each time a query pattern succeeds or fails, the system updates its statistical success rate. In this way, the system adapts to the desires of the users and to changes in the domain-related sites on the Internet.
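A minimal sketch of this per-query bookkeeping, assuming a simple attempts/failures counter as in the "(3/1)" notation of Section 4.1; the class and method names are illustrative.

    // Sketch: memory-based success statistics attached to a stored query.
    class QueryStats {
        int attempts;   // e.g. 3
        int failures;   // e.g. 1, written "(3/1)" in the text

        void record(boolean success) {
            attempts++;
            if (!success) failures++;
        }

        double successRate() {
            return attempts == 0 ? 0.0 : (attempts - failures) / (double) attempts;
        }
    }

Queries that start failing are the natural candidates for re-running RuleExtraction on their site, as in the Retrieve procedure above.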
Our algorithm is described in (Gams, 2001). Additional learning in ShinA is described in Section 6.

The system was implemented in two problem domains: employment and e-trading. While the two implementations vary a lot in terms of programming languages and technical details, the algorithm is practically the same. Both implementations offer information about desired objects, which differ in properties, benefits and costs. Users browse through a large number of potential candidates, and compare several similar possibilities to finally choose specific objects.

5. Employment information gathering

EMA (EMployment Agent) is an intelligent employment agent that provides employment information in a way similar to human employment agents (Gams, 2001). An Internet-based intelligent employment agent should perform like a human agent: provide basic information about available jobs and available workers, and try to match these two databases as efficiently as possible (Fig. 3). In employment tasks, users typically want to get a job - any job, a good job, or a better job than the current one. A typical task of the agent is thus to find relevant information about a specific job offer (the analogue of a specific e-commerce item) from various information sources.

EMA is an intelligent agent that provides user-friendly information on demand, or when it notices relevant information on the Internet. EMA can store search patterns, perform searches on its own or following direct commands, and it can send online replies or offline e-mails regarding vacant jobs or available workers. EMA can store and observe interesting Web sites chosen by users, and match available jobs and workers.

Fig. 3. The basic task of the EMA Internet-based intelligent employment agent is to provide employment information as human agents do.

EMA was implemented a couple of years ago, with several modules added subsequently. EMA also incorporates some language and translation capabilities, such as a Slovene text-to-speech interface. It also includes ontologies for profession definitions, and the ability to connect to arbitrary employment sites on its own.

Natural language processing. A common input to EMA is partially constrained, domain-dependent Slovene text, often consisting of phrases and form inputs. On bulletin boards, the language is a matter of choice, but titles and information are either in Slovene or in English. Consequently, the majority of the text is either Slovene or English, with a couple of exceptions. No censorship is performed regarding language or specific details, as long as the input is dedicated to the desired employment task and within minimal decency requirements.

The system is capable of translating Slovene text into English. The translation is based on a dictionary of phrases of up to four words observed previously in the employment data. In the worst case, new combinations are translated word by word and stored for further overview by humans. Stored combinations are sorted by frequency and translated by humans where reasonable. In addition, the translation system looks into a morphology dictionary to capture different forms of the same words; Slovene has a rich morphology, so this is essential for good performance. Finally, a spell-checker module corrects spelling errors. The translation is currently not yet at the level of systems translating between the larger European languages; however, it is good enough to enable a basic understanding of the translated text, since the employment syntax is quite limited.
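A rough sketch of this phrase-first, word-by-word-fallback lookup, assuming the phrase and word dictionaries are simple maps; morphology handling and spell-checking are omitted, and all names are ours.

    import java.util.*;

    // Sketch: translate a Slovene phrase of up to four words via the phrase dictionary;
    // otherwise fall back to word-by-word translation and store the phrase for human review.
    class PhraseTranslator {
        Map<String, String> phraseDict = new HashMap<>(); // previously observed phrases
        Map<String, String> wordDict = new HashMap<>();
        List<String> pendingReview = new ArrayList<>();   // sorted by frequency in the real system

        String translate(String phrase) {
            String hit = phraseDict.get(phrase.toLowerCase());
            if (hit != null) return hit;
            StringBuilder out = new StringBuilder();
            for (String w : phrase.toLowerCase().split("\\s+")) {
                out.append(wordDict.getOrDefault(w, w)).append(' ');
            }
            pendingReview.add(phrase); // stored for later overview by humans
            return out.toString().trim();
        }
    }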
The Internet site of the EMA system is presented in Fig. 4 (http://www.ess.gov.si).

Fig. 4. Home site of the intelligent employment agent.

Ontologies. EMA performs some of its advanced functions based on ontologies. In this way, EMA understands the basic employment tasks. One of the most common employment tasks (besides basic queries) is to find the relation between different job descriptions and job definitions. This is in fact one of the basic concepts of all employment tasks. For example, from a basic search query consisting of an applicant's properties and desires, one should extract relevant job offers. Conversely, from a job definition one should look for available workers. With machine-learning and statistical text-learning methods (Freitag, 1998; Mladenic, 1999) we have designed ontologies, i.e. meta-knowledge about job definitions derived from interesting words (Bezek & Gams, 2001).

Semi-uniformity. During the first years of the EMA system, several competing Internet-based employment sites were designed. Since EMA was a national employment site sponsored by the Employment Service of Slovenia (ESS), the idea emerged to present also the information from other employment sites. Instead of hand-writing wrappers for each particular employment site, we designed an algorithm that semi-automatically (Ashish & Knoblock, 1997; Kushmerick, Weld & Doorenbos, 1997) attaches to other employment sites and then automatically gathers information from those sites. Although the algorithm published in (Gams, 2001) was not fully implemented, we designed two slightly modified modules of this kind and implemented them in EMA. Another version was independently designed by a student group (JobProvider) on the basis of public lectures and presentations. All versions were functional for at least some time, and on average for several years, thus showing that the idea is sound. On the basis of these experiences we decided to generalize the algorithm and design another system, ShinA, for e-commerce tasks (Pivk & Gams, 2000).

EMA is a large software system, especially for our R&D group of intelligent systems at the Jozef Stefan Institute. It consists of several modules and submodules, written in different programming languages, and runs on several platforms and operating systems. The system is a 30-thousand-line program written mainly in C, and also in other languages and with Internet programming tools. Together with text and data it occupies 30 MB of space.

The EMA agent was, and still is, among the most successful applications of intelligent agents in Slovenia and in Central Europe. In the first year of its implementation, our country was the third in Europe to offer national employment information through the Internet. At that time, we were the first country in the world to provide over 90% of all nationally available jobs on the Internet. On the other hand, in absolute terms there are employment systems in big countries, or employment systems connected to major Internet information providers such as Yahoo or AltaVista, that provide orders of magnitude more employment information. EMA was among the most often visited non-entertainment sites in Slovenia.
At its peak, there were over 200,000 visits monthly, which corresponds to one tenth of the country's population. Most visitors were driven by unemployment, or by the desire to find a better job.

6. E-trading information gathering

Another implementation of our system is ShinA (SHoppINg Assistant), a mediator system that automatically collects product descriptions from a number of online stores on the user's behalf. ShinA performs a different task than EMA, that of a shopping assistant. ShinA was reprogrammed from scratch, but is still based on the experiences and the common algorithm of EMA. ShinA's crucial ability to semi-automatically gather information from various sites replaces the hand-coded wrappers of EMA (Yang, Lee & Choi, 2000).

Fig. 5. Workflow diagram of the ShinA system.

6.1. Workflow of the agent

The mediator system lets a user enter two types of data. The first type is a new e-store URL address, and the second is a request for multiple-store product information (Fig. 5).

In the first case, where a user enters a new e-store URL address, the system observes the user's communication with the e-commerce site. Since e-commerce sites have various input mechanisms, the system must locate the position of the query form that accepts the user's search-for-item requests, and must be able to intelligently parse at least its most relevant parts. The user's communication with the e-commerce site finishes either by entering a search query into an input field of the form or by selecting an ontology (i.e., a product catalogue link). This suffices to generate a query template. In this way, ShinA generates an appropriate query template for each online store.

Afterwards, the mediator system applies a learning algorithm, described in detail in Section 6.3. In the learning phase, appropriate extraction rules are created that correspond to the most representative pattern (i.e., the description of an item). In this phase, ontologies are also recognized and extracted. Evaluation and testing of the extracted rules is performed by extracting product information from a randomly retrieved product page of the e-store; if the extraction of product descriptions is unsuccessful, a new pattern and new extraction rules have to be discovered (see the sketch below). The system must be able to ignore redundant and unnecessary fragments of a page. Furthermore, it has to delineate a product description and recognize the attributes of a product, such as the price, the manufacturer, the availability, the special offer, the size, etc. Next, the e-store is classified into one or more corresponding clusters (used for the later search of items), using the learned ontologies and the basic information about the e-store. Finally, the basic information, query template, rules and ontologies are stored into a knowledge database.

Fig. 6. Overall architecture of ShinA: on the learning path, ObserveUser(), RuleExtraction() (wrapper generation), OntologyExtraction(), testing of the extracted rules, and StoreClustering() lead to the knowledge database; on the query path, Retrieve(), CreateQuery() and MappAndMatch() produce the multiple-store, item-related output.
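The rule-testing step on the learning path of Fig. 6 can be pictured roughly as follows; the sketch assumes a wrapper is a list of textual rules applied by some matching routine, and all names (WrapperTester, testWrapper, applyRules) are illustrative rather than ShinA's actual code.

    import java.util.List;
    import java.util.Map;

    // Sketch: test freshly extracted rules on a randomly retrieved product page of the store.
    // If no product description can be extracted, the wrapper is rejected and new rules
    // have to be learned ("wrong rules" vs. "rules OK" in Fig. 6).
    class WrapperTester {
        static boolean testWrapper(List<String> rules, String randomProductPageHtml) {
            List<Map<String, String>> products = applyRules(rules, randomProductPageHtml);
            return !products.isEmpty();
        }

        static List<Map<String, String>> applyRules(List<String> rules, String html) {
            // Placeholder for the rule-matching routine of Section 4 (MappAndMatch).
            return java.util.Collections.emptyList();
        }
    }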
In the second case, where a user enters a keyword search for an item, the system first checks whether an equal or similar search has ever been requested. If so, each stored query (linked with the user's item search) is modified accordingly and added to the list of potential queries. Otherwise, the system takes the most promising cluster of e-stores and creates queries based on their query templates. The created queries are then added to the list of potential queries. Next, each potential query is performed on the corresponding e-store, which responds with a page, hopefully containing information about the sought-after items and their descriptions. The system tries to extract product information from each retrieved page with the appropriate extraction rules. The successful queries and the related product information are then stored into the knowledge database. Finally, the product information (from various sites and in various formats) is presented to the user in a user-friendly, uniform way.

6.2. System architecture

The common data structures, the algorithm and the ontologies have already been described in Section 4. Here we describe ShinA in this framework as a particular implementation of the algorithm. Fig. 6 shows the architecture of our ShinA comparison-shopping agent.

The task of the observing module ObserveUser() is to transform requested pages into a form that enables the system to control and supervise the user's communication with e-store sites. The wrapper generator RuleExtraction() is the main learning module that constructs a wrapper for each particular store. The wrapper generator learns two things: it recognizes a store's query scheme and ontologies, and it learns how to extract a store's content. The wrapper interpreter Retrieve() is a module that executes the generated wrappers to get product information. This module forms several actual queries by combining each store's query template with the keywords that the user actually typed in, and sends them to the corresponding shopping sites. The search results from the stores are then collected and fed to the uniform output generator module MappAndMatch(). The output generator integrates all the search results and generates a uniform output. Information such as extracted rules, query templates, ontologies, item-related information, e-store-related information, etc., is stored in the knowledge database.

6.3. Learning algorithm

The main learning task specific to ShinA is to build a rule from one or several result pages. Such a rule is used to learn the format of product descriptions from successful searches at a new shopping site. We use an inductive learning mechanism to accomplish this task. Here, the examples correspond to the pages of a search result, and the concept to be learned corresponds to the extraction rule (Yang, Seo & Choi, 2001). Each page contains one or more product descriptions that matched the query. A product description is composed of a sequence of product attributes. For example, a music store displays search results in which the attributes are the CD title, the artist name, the price, etc. The wrapper-learning module has to find the starting and ending positions of the list of product attributes within the result page, and recognize the pattern of a product description.

To do this, our method is divided into three phases. In the first phase, the HTML source of the page is partitioned into three parts. The first and the last part are redundant and irrelevant fragments of the page (header, advertisements, scripts) that must be ignored. The useful middle part must be extracted; it consists of the set of attributes that describe the product. If there is more than one product description on the result page, several different patterns of product representation may occur. In the next phase, the algorithm recognizes product attributes by examining HTML tags (delimiters) and categorizes them accordingly. The product description is thus viewed as a sequence of categories, and the algorithm finds repeating patterns in it. The most frequent pattern becomes the representative product description.
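The pattern-counting step can be sketched as follows, assuming the page has already been reduced to a sequence of tag categories; the fixed candidate lengths, the tie-breaking, and all names are our own simplification rather than ShinA's actual implementation.

    import java.util.*;

    // Sketch: count contiguous subsequences of tag categories of a few fixed lengths and
    // return the most frequent one as the candidate product-description pattern.
    class PatternFinder {
        static List<String> mostFrequentPattern(List<String> categories, int minLen, int maxLen) {
            Map<List<String>, Integer> counts = new HashMap<>();
            for (int len = minLen; len <= maxLen; len++) {
                for (int i = 0; i + len <= categories.size(); i++) {
                    counts.merge(new ArrayList<>(categories.subList(i, i + len)), 1, Integer::sum);
                }
            }
            return counts.entrySet().stream()
                    .max((a, b) -> Integer.compare(a.getValue(), b.getValue()))
                    .map(Map.Entry::getKey)
                    .orElse(Collections.emptyList());
        }

        public static void main(String[] args) {
            List<String> page = Arrays.asList("IMG", "TITLE", "PRICE", "IMG", "TITLE", "PRICE", "AD");
            // Prints one of the patterns that occur twice, e.g. [IMG, TITLE, PRICE];
            // a real system would prefer longer patterns when counts tie.
            System.out.println(mostFrequentPattern(page, 2, 3));
        }
    }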
6.4. Implementation

ShinA's Web interface is shown in Fig. 1. Unlike EMA, ShinA is a dedicated system without modules for translation or speech and without large amounts of locally stored data. However, the core task of parsing and understanding heterogeneous sites is more difficult in e-commerce than in employment, because of the higher diversity of e-stores.

The ShinA system has been implemented on the Windows 2000 operating system, running the Apache HTTP Server, version 1.3. As development tools, JDK 1.3.1 and Jakarta Tomcat 3.2 were used. The system is written in Java and is based on servlets/JSP. It consists of approximately 10 thousand lines of code. Tomcat is a servlet container that is used in the official reference implementation for the Java Servlet and JavaServer Pages technologies. Both the Apache Server and Tomcat are developed in an open and participatory environment and are therefore freely available.

7. Related work

BargainFinder (Krulwich, 1995) and Jango (Jango) are first-stage comparison shoppers. They specify functions that agents must have in order to be applied to electronic commerce. Both systems employ the manual rule-extraction method.

ShopBot (Doorenbos, Etzioni & Weld, 1997), like our system, suggests an automatic rule-extraction technique that analyzes and learns the shopping sites. In order to integrate specific product information, ShopBot first removes irrelevant information such as advertisements by using an inductive learning mechanism and then extracts the necessary product information. ShopBot, however, uses several strong biases about the structure of HTML and the display format of product search results in order to learn. Therefore, ShopBot is unable to learn a wrapper for a shopping site that does not conform to these strong biases. By contrast, the only bias in our method is that the result of a product search should be displayed in a semi-structured way, that is, each product-description unit has the same output format, which holds for almost all shopping sites. We therefore expect our method to function more robustly than ShopBot. This claim can be verified by empirically testing the success rate of correct wrapper constructions; we plan to measure the performance of our agents in the near future.

PersonaLogic is a comparison-shopping system that compares specific products themselves rather than shopping sites. Kasbah, AuctionBot, and Tete-a-Tete (see References) are negotiation mediators with which users can buy and sell products based on negotiation strategies between agents in the virtual marketplace. They do not enter e-commerce sites in an automatic or semi-automatic way as our agents do. To our knowledge, today's commercial comparison shoppers, including MySimon, PriceWatch, and BottomDollar, all employ manual rule-extraction methods, and consequently suffer from reduced scalability and a reduced ability to incorporate new e-stores and to adapt to changes in the incorporated stores.
8. Conclusion

We have proposed a general domain-dependent information-gathering agent. The system has been implemented in several versions of the two basic applications: EMA for employment, and ShinA for e-commerce. The agent successfully constructs correct wrappers for "reasonable" domain-dependent databases without assuming many structural constraints. The agent contains a quick, simple and robust inductive learning algorithm that automatically generates wrappers. The extension of the learning method is memory-based learning of the success rates of particular queries. This makes it possible to adapt to user habits and to changes at visited sites.

There are some limitations in our current system. Firstly, we have assumed that users give a proper URL and path for the test query. Secondly, each major concept must have a uniform structure. We think that this is not a severe restriction, since most domain-related stores that produce semi-structured entity information contain the same attributes. Thirdly, we only extract the major information from an entity description that may also contain several other attributes. Lastly, the system relies heavily on HTML. If an Internet site provides information exclusively by embedding it in graphics or by using Java, the system will be unable to handle the site.

Overall, the idea of semi-automatic information gathering from heterogeneous Internet-based databases is becoming technologically mature. The problems we have been facing during the implementations were not of a fundamental nature. Technically, similar systems seem to be close to full-scale implementation. One can imagine a uniform e-shopping agent visiting and gathering information from all e-shopping databases on the Internet. We believe that this technique can also be applied to other information integration systems for heterogeneous information sources.

Acknowledgement: Financial support was provided by the international project INCO-Copernicus 960154, Cooperative Research in Information Infrastructure (CRII), by the Ministry of Education, Science and Sport of Slovenia, and by the ESS. We would like to thank the CEO of the Employment Service of Slovenia, Mr. J. Glazer.

References

Ashish, N., & Knoblock, C. (1997). Semi-Automatic Wrapper Generation for Internet Information Sources. In Proceedings of the Second IFCIS Conference on Cooperative Information Systems (CoopIS).
Atzeni, P., Mecca, G., & Merialdo, P. (1997). Semi-structured and Structured Data in the Web: Going Back and Forth. ACM SIGMOD Workshop on Management of Semistructured Data, pp. 1-9.
AuctionBot. http://auction.eecs.umich.edu/
Baek, S., Liebowitz, J., Prasad, S. Y., & Granger, M. J. (1999). Intelligent Agents for Knowledge Management - Toward Intelligent Web-Based Collaboration within Virtual Teams. In Knowledge Management Handbook (ed. J. Liebowitz), CRC Press, Washington, DC, pp. 11-1 - 11-23.
Balabanovic, M., & Shoham, Y. (1995). Learning Information Retrieval Agents: Experiments with Automated Web Browsing. In Proc. of the AAAI Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments.
Basu, C., Hirsh, H., & Cohen, W. (1998). Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the 1998 Workshop on Recommender Systems, AAAI Press, pp. 11-15.
Bezek, A., & Gams, M. (2001). An agent that understands job description. Informatica (Ljubljana), 25(1), pp. 99-105.
BottomDollar. http://www.bottomdollar.com/
Bradshaw, J. M. (1997). Software Agents, AAAI Press, Menlo Park, California.
Cowie, J., & Lehnert, W. (1996). Information Extraction. Comm. of the ACM, 39(1), pp. 80-101.
Dastani, M., Jacobs, N., Jonker, C. M., & Treur, J. (1999). Modelling User Preferences and Mediating Agents in Electronic Commerce. In Proc. of the Agent-Mediated Electronic Commerce (AMEC) SIG-Meeting of AGENTLINK.
Doorenbos, R., Etzioni, O., & Weld, D. (1997). A Scalable Comparison-Shopping Agent for the World Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pp. 39-48.
Freitag, D. (1998). Information Extraction from HTML: Application of a General Machine Learning Approach. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98).
Gams, M. (2001). A Uniform Internet-Communicative Agent. Electronic Commerce Research, 1, Kluwer Academic Publishers, pp. 69-84.
Greenwald, A. R., & Kephart, J. O. (1999). Shopbots and pricebots. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 506-511.
Guttman, R. H., Moukas, A. G., & Maes, P. (1998). Agents as Mediators in Electronic Commerce. Electronic Markets, 8(1), pp. 22-27.
Hammer, J., Garcia-Molina, H., Nestorov, S., Yerneni, R., Breunig, M., & Vassalos, V. (1997). Template-based wrappers in the TSIMMIS system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 532-535.
Jakobsson, M., & Yung, M. (1998). On assurance structures for WWW commerce. In Financial Cryptography '98.
Jango. http://www.jango.com/
Kasbah. http://kasbah.media.mit.edu/
Kitano, H. (1993). Challenges of Massive Parallelism. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 813-834.
Krulwich, B. (1995). BargainFinder agent prototype. Technical Report, Andersen Consulting. http://bf.cstar.ac.com/bf/
Kushmerick, N., Weld, D. S., & Doorenbos, R. (1997). Wrapper Induction for Information Extraction. In International Joint Conference on Artificial Intelligence, pp. 729-735.
Levy, A. Y., & Weld, D. S. (2000). Intelligent Internet Systems. Artificial Intelligence, 118, pp. 1-14.
Lewis, T. (1999). Microsoft Rising: ...and Other Tales of Silicon Valley, ACM.
Liebowitz, J. (ed.) (1999). Knowledge Management Handbook, CRC Press, Washington, DC.
Liu, L., Pu, C., & Han, W. (2000). XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources. In Proceedings of the Sixteenth International Conference on Data Engineering, pp. 611-621.
Maes, P. (1994). Agents that reduce work and information overload. Comm. of the ACM, 37(7), pp. 31-40.
Maes, P., Guttman, R., & Moukas, A. (1999). Agents that buy and sell. Comm. of the ACM, 42(3), pp. 81-91.
Mladenic, D. (1999). Text-learning and related intelligent agents. IEEE EXPERT, Special Issue on Applications of Intelligent Information Retrieval.
Muslea, I., Minton, S., & Knoblock, C. (1998). Wrapper Induction of Semistructured, Web-based Information Sources. In Proceedings of the Conference on Automatic Learning and Discovery.
Muslea, I., Minton, S., & Knoblock, C. (1999). A Hierarchical Approach to Wrapper Induction. In Proceedings of the Third International Conference on Autonomous Agents, pp. 190-197.
MySimon. http://www.mysimon.com/
Pazienza, M. T. (ed.) (1997). Information Extraction, Springer.
PersonaLogic. http://www.personalogic.com/
Pivk, A. (2001). Item-based Collaborative Filtering Algorithms. In Proc. of the Fourth International Conference on Information Society, pp. 65-68.
Pivk, A., & Gams, M. (2000). E-Commerce Intelligent Agents. In Proceedings of ICTEC'00, pp. 418-429.
PriceWatch. http://www.pricewatch.com/
Schafer, J., Konstan, J., & Riedl, J. (1999). Recommender systems in e-commerce. In Proceedings of the ACM Conference on E-Commerce, pp. 158-166.
Shoham, Y. (1997). An Overview of Agent-oriented Programming. In Software Agents (ed. Bradshaw, J. M.), AAAI Press, Menlo Park, California, pp. 271-290.
Tete-a-Tete. http://ecommerce.media.mit.edu/tete-a-tete/
Yang, J., Seo, H., & Choi, J. (2001). MORPHEUS: A Customized Comparison Shopping Agent. In Proceedings of the 5th International Conference on Autonomous Agents (Agents-2001), pp. 63-64.
Yang, J., Lee, E., & Choi, J. (2000). A Shopping Agent that Automatically Constructs Wrappers for Semi-Structured Online Vendors. Lecture Notes in Computer Science, Vol. 1983, pp. 368-373.