A Semantic Model of Selective Dissemination of Information for Digital Libraries

J. M. Morales-del-Castillo, R. Pedraza-Jiménez, A. A. Ruíz, E. Peis, and E. Herrera-Viedma

In this paper we present the theoretical and methodological foundations for the development of a multi-agent Selective Dissemination of Information (SDI) service model that applies Semantic Web technologies for specialized digital libraries. These technologies make it possible to achieve more efficient information management, improve agent–user communication processes, and facilitate accurate access to relevant resources. Other tools used are fuzzy linguistic modelling techniques (which ease the interaction between users and the system) and natural language processing (NLP) techniques for semiautomatic thesaurus generation. Also, RSS feeds are used as "current awareness bulletins" to generate personalized bibliographic alerts.

J. M. Morales-del-Castillo (josemdc@ugr.es) is Assistant Professor of Information Science, Library and Information Science Department, University of Granada, Spain. R. Pedraza-Jiménez (rafael.pedraza@upf.edu) is Assistant Professor of Information Science, Journalism and Audiovisual Communication Department, Pompeu Fabra University, Barcelona, Spain. A. A. Ruíz (aangel@ugr.es) is Full Professor of Information Science, Library and Information Science Department, University of Granada. E. Peis (epeis@ugr.es) is Full Professor of Information Science, Library and Information Science Department, University of Granada. E. Herrera-Viedma (viedma@decsai.ugr.es) is Senior Lecturer in Computer Science, Computer Science and Artificial Intelligence Department, University of Granada.

Nowadays, one of the main challenges faced by information systems at libraries or on the Web is to efficiently manage the large number of documents they hold. Information systems make it easier to give users access to relevant resources that satisfy their information needs, but a problem emerges when the user has a high degree of specialization and requires very specific resources, as in the case of researchers.1 In "traditional" physical libraries, several procedures have been proposed to try to mitigate this issue, including the selective dissemination of information (SDI) service model, which makes it possible to offer users potentially interesting documents by accessing users' personal profiles kept by the library. Nevertheless, the progressive incorporation of new information and communication technologies (ICTs) into information services, the widespread use of the Internet, and the diversification of resources that can be accessed through the Web have led libraries through a process of reinvention and transformation to become "digital" libraries.2 This reengineering process requires a deep revision of work techniques and methods so librarians can adapt to the new work environment and improve the services provided.

In this paper we present a recommendation and SDI model, implemented as a service of a specialized digital library (in this case, specialized in library and information science), that can increase the accuracy of accessing information and the satisfaction of users' information needs on the Web. This model is built on a multi-agent framework, similar to the one proposed by Herrera-Viedma, Peis, and Morales-del-Castillo,3 that applies Semantic Web technologies within the specific domain of specialized digital libraries in order to achieve more efficient information management (by semantically enriching different elements of the system) and improved agent–agent and user–agent communication processes.

Furthermore, the model uses fuzzy linguistic modelling techniques to facilitate the user–system interaction and to allow a higher degree of automation in certain procedures. To further increase automation, some natural language processing (NLP) techniques are used to create a system thesaurus and other auxiliary tools for the definition of formal representations of information resources.
In the next section, "Instrumental basis," we briefly analyze SDI services and several techniques involved in the Semantic Web project, and we describe the preliminary methodological and instrumental bases that we used for developing the model, such as fuzzy linguistic modelling techniques and tools for NLP. In "The Semantic SDI service model for digital libraries," the bulk of this work, we present the application model that we propose. Finally, to sum up, some conclusions are highlighted.

Instrumental basis

Filtering techniques for SDI services

Filtering and recommendation services are based on the application of different process-management techniques that are oriented toward providing users exactly the information that meets their needs or may be of interest to them. In textual domains, these services are usually developed using multi-agent systems, whose main aims are

- to evaluate and filter resources, normally represented in XML or HTML format; and
- to assist people in the process of searching for and retrieving resources.4

Traditionally, these systems are classified as either content-based recommendation systems or collaborative recommendation systems.5 Content-based recommendation systems filter information and generate recommendations by comparing a set of keywords defined by the user with the terms used to represent the content of documents, ignoring any information given by other users. By contrast, collaborative filtering systems use the information provided by several users to recommend documents to a given user, ignoring the representation of a document's content. It is common to group users into different categories or stereotypes that are characterized by a series of rules and preferences, defined by default, that represent the information needs and common behavioural habits of a group of related users. The current trend is to develop hybrids that make the most of both approaches, as sketched below.
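To make the distinction concrete, the following Python sketch contrasts a content-based score (keyword overlap) with a collaborative score (ratings from other users) and combines them into a simple weighted hybrid. The profile structure, ratings, and weighting are illustrative assumptions, not the mechanism of the model presented in this paper.

```python
# Illustrative sketch (hypothetical profiles and ratings): a content-based
# score, a collaborative score, and a simple weighted hybrid of the two.

def content_score(profile_keywords: set, doc_terms: set) -> float:
    """Content-based: overlap between user keywords and document terms."""
    if not profile_keywords:
        return 0.0
    return len(profile_keywords & doc_terms) / len(profile_keywords)

def collaborative_score(ratings: dict, doc_id: str) -> float:
    """Collaborative: mean rating that other users gave the document."""
    scores = [user_ratings[doc_id]
              for user_ratings in ratings.values()
              if doc_id in user_ratings]
    return sum(scores) / len(scores) if scores else 0.0

def hybrid_score(profile_keywords, doc_terms, ratings, doc_id, alpha=0.5):
    """Hybrid: weighted combination of both signals."""
    return (alpha * content_score(profile_keywords, doc_terms)
            + (1 - alpha) * collaborative_score(ratings, doc_id))

profile = {"thesaurus", "semantic", "library"}
doc_terms = {"semantic", "web", "library", "ontology"}
ratings = {"user_a": {"doc1": 0.8}, "user_b": {"doc1": 0.6}}
print(hybrid_score(profile, doc_terms, ratings, "doc1"))  # about 0.68
```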
In the field of libraries, these services usually adopt the form of SDI services that, depending on the profiles of subscribed users, periodically (or when required by the user) generate a series of information alerts that describe the resources in the library that fit a user's interests.6 SDI services have been studied in different research areas, such as the multi-agent systems development domain7 and, of course, the digital libraries domain.8

Presently, many SDI services are implemented on Web platforms based on a multi-agent architecture where there is a set of intermediate agents that compare users' profiles with the documents, and there are input-output agents that deal with subscriptions to the service and display generated alerts to users.9 Usually, the information is structured according to a certain data model, and users' profiles are defined using a series of keywords that are compared to descriptors or the full text of the documents. Despite their usefulness, these services have some deficiencies:

- The communication processes between agents, and between agents and users, are hindered by the different ways in which information is represented.
- This heterogeneity in the representation of information makes it impossible to reuse such information in other processes or applications.

A possible solution to these deficiencies consists of enriching the information representation using a common vocabulary and data model that are understandable by humans as well as by software agents. The Semantic Web project takes this idea and provides the means to develop a universal platform for the exchange of information.10

Semantic Web technologies

The Semantic Web project tries to extend the model of the present Web by using a series of standard languages that enable enriching the description of Web resources and making them semantically accessible.11 To do that, the project bases itself on two fundamental ideas: (1) resources should be tagged semantically so that information can be understood both by humans and computers, and (2) intelligent agents should be developed that are capable of operating at a semantic level with those resources and that infer new knowledge from them (shifting from the search of keywords in a text to the retrieval of concepts).12

The semantic backbone of the project is the Resource Description Framework (RDF) vocabulary, which provides a data model to represent, exchange, link, add, and reuse structured metadata of distributed information sources, thereby making them directly understandable by software agents.13 RDF structures the information into individual assertions (i.e., triples of the form "resource, property, property value") and uniquely characterizes resources by means of Uniform Resource Identifiers (URIs), allowing agents to make inferences about them using Web ontologies or other, simpler semantic structures, such as conceptual schemes or thesauri.14

Even though the adoption of the Semantic Web and its application to systems like digital libraries is not free from trouble (because of the nature of the technologies involved in the project and because of the project's ambitious objectives,15 among other reasons), the way these technologies represent the information is a significant improvement over the quality of the resources retrieved by search engines, and it also allows the preservation of platform independence, thus favouring the exchange and reuse of contents.16

As we can see, the Semantic Web works with information written in natural language that is structured in a way that can be interpreted by machines.
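The triple model just described can be illustrated with a minimal sketch using the Python rdflib library; the namespace, resource, and property values below are invented for the example.

```python
# Sketch of RDF's (resource, property, property value) triple model
# with rdflib; all URIs and literal values are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

EX = Namespace("http://example.org/library/")  # hypothetical namespace

g = Graph()
doc = URIRef(EX["doc/42"])  # the resource, uniquely identified by a URI
# One assertion = one triple; here we state a title and a creator.
g.add((doc, DC.title, Literal("Thesauri and the Semantic Web")))
g.add((doc, DC.creator, Literal("J. Doe")))

# Because the data model is standard, any agent can re-serialize
# and exchange the same statements.
print(g.serialize(format="turtle"))
```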
For this reason, it is usually difficult to deal with problems that require operating with linguistic information that has a certain degree of uncertainty (e.g., when quantifying the user's satisfaction in relation to a product or service). A possible solution could be the use of fuzzy linguistic modelling techniques as a tool for improving system–user communication.

Fuzzy linguistic modelling

Fuzzy linguistic modelling supplies a set of approximate techniques appropriate for dealing with the qualitative aspects of problems.17 The ordinal linguistic approach is defined according to a finite set of tags (S), completely ordered and with odd cardinality (seven or nine tags):

S = {si | i ∈ H = {0, ..., T}}

The central term has a value of approximately 0.5, and the rest of the terms are arranged symmetrically around it. The semantics of each linguistic term is given by the ordered structure of the set of terms, considering that each linguistic term of the pair (si, sT-i) is equally informative. Each label si is assigned a fuzzy value defined in the interval [0,1] that is described by a linear trapezoidal membership function represented by the 4-tuple (ai, bi, αi, βi). (The first two parameters show the interval where the membership value is 1.0; the third and fourth parameters show the left and right limits of the distribution.) Additionally, we need to define the following properties:

1. The set is ordered: si ≥ sj if i ≥ j.
2. There is a negation operator: Neg(si) = sj, with j = T - i.
3. Maximization operator: MAX(si, sj) = si if si ≥ sj.
4. Minimization operator: MIN(si, sj) = si if si ≤ sj.

It also is necessary to define aggregation operators, such as Linguistic Weighted Averaging (LWA),18 capable of operating with and combining linguistic information.
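A minimal Python sketch of such an ordinal label set and its basic operators follows; the seven labels are illustrative, and the trapezoidal membership functions and the LWA operator are omitted.

```python
# Sketch of an ordinal linguistic label set and its basic operators
# (labels are illustrative; membership functions and LWA are omitted).
LABELS = ["never", "almost_never", "rarely", "occasionally",
          "often", "almost_always", "always"]  # s0 .. sT, with T = 6
T = len(LABELS) - 1

def neg(i: int) -> int:
    """Negation: Neg(si) = sj with j = T - i."""
    return T - i

def lmax(i: int, j: int) -> int:
    """Maximization: MAX(si, sj) = si if si >= sj."""
    return max(i, j)

def lmin(i: int, j: int) -> int:
    """Minimization: MIN(si, sj) = si if si <= sj."""
    return min(i, j)

# Neg(often) = rarely; MAX and MIN follow the order of the indices.
print(LABELS[neg(4)], LABELS[lmax(1, 5)], LABELS[lmin(1, 5)])
```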
Besides facilitating the interaction between users and the system, the other starting objective is to develop and implement the proposed model in the most automated way possible. To do this, we use a basic auxiliary tool (a thesaurus) that, among other tasks, assists users in the creation of their profiles and enables automating the generation of alerts. That is why it is critical to define the way in which we create this tool, and in this work we propose a specific method for the semiautomatic development of thesauri using NLP techniques.

NLP techniques and other automating tools

NLP consists of a series of linguistic techniques, statistical approaches, and machine learning algorithms (mainly clustering techniques) that can be used, for example, to summarize texts in an automatic way, to develop automatic translators, and to create voice recognition software. Another possible application of NLP is the semiautomatic construction of thesauri using different techniques. One of them consists of determining the lexical relations between the terms of a text (mainly synonymy, hyponymy, and hyperonymy)19 and extracting the terms that are most representative of the text's specific domain.20 It is possible to elicit these relations by using linguistic tools, like Princeton's WordNet (http://wordnet.princeton.edu), and clustering techniques.

WordNet is a powerful multilanguage lexical database where each entry is defined, among other elements, by its synonyms (synsets), hyponyms, and hyperonyms.21 As a consequence, once given the most important terms of a domain, WordNet can be used to create a thesaurus from them (after leaving out all terms that have not been identified as belonging or related to the domain of interest).22

This tool can also be used with clustering techniques, for example, to group the documents of a collection into a set of nodes or clusters depending on their similarity. Each of these clusters is described by the most representative terms of its documents. These terms make up the most specific level of a thesaurus and are used to search WordNet for their synonyms and more general terms, contributing (with the repetition of this procedure) to the bottom-up development process of the thesaurus.23
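As a concrete illustration of this bottom-up, WordNet-based expansion, the following sketch looks up synonyms and direct hyperonyms for a cluster's seed terms with NLTK's WordNet interface. The seed terms are invented, the first-sense choice is a naive simplification, and expert supervision of the result is still assumed.

```python
# Sketch: expand cluster terms upward through WordNet hyperonyms
# (requires: pip install nltk, then nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def expand_term(term: str):
    """Return the synonyms and direct hyperonyms of a term's first sense."""
    synsets = wn.synsets(term)
    if not synsets:
        return set(), set()
    sense = synsets[0]  # naive sense choice; real use needs disambiguation
    synonyms = {lemma.name() for lemma in sense.lemmas()}
    hyperonyms = {lemma.name()
                  for hyper in sense.hypernyms()
                  for lemma in hyper.lemmas()}
    return synonyms, hyperonyms

# Hypothetical seed terms taken from one document cluster
for term in ["library", "catalog"]:
    syn, hyper = expand_term(term)
    print(term, "->", syn, "|", hyper)
```

Repeating the lookup on the hyperonyms obtained at each step yields the increasingly general upper levels of the thesaurus, which is the bottom-up process described above.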
Although there are many others, these are some of the most well-known techniques of semiautomatic thesaurus generation (semiautomatic because, needless to say, the supervision of experts is necessary to determine the validity of the final result).

For specialized digital libraries, we propose developing, on a multi-agent platform and using all these tools, SDI services capable of generating alerts and recommendations for users according to their personal profiles. In particular, the model presented here is the result of merging several previous models, and its service is based on the definition of "current-awareness bulletins," where users can find a basic description of the resources recently acquired by the library or those that might be of interest to them.24

The Semantic SDI service model for digital libraries

The SDI service includes two agents (an interface agent and a task agent) distributed in a four-level hierarchical architecture: user level, interface level, task level, and resource level. Its main components are a repository of full-text documents (which make up the stock of the digital library) and a series of elements described using different RDF-based vocabularies: one or several RSS feeds that play a role similar to that of current-awareness bulletins in traditional libraries; a repository of recommendation log files that store the recommendations made by users about the resources; and a thesaurus that lists and hierarchically relates the most relevant terms of the specialization domain of the library.25 Also, the semantics of each element (that is, its characteristics and the relations the element establishes with other elements in the system) are defined in a Web ontology developed in Web Ontology Language (OWL).26 Next, we describe these main elements as well as the different functional modules that the system uses to carry out its activity.

Elements of the model

There are four basic elements that make up the system: the thesaurus, user profiles, RSS feeds, and recommendation log files.

Thesaurus

An essential element of this SDI service is the thesaurus, an extensible tool used in traditional libraries that enables organizing the most relevant concepts in a specific domain, defining the semantic relations established between them, such as equivalence, hierarchical, and associative relations. The functions defined for the thesaurus in our system include helping in the indexing of RSS feed items and in the generation of information alerts and recommendations.

To create the thesaurus, we followed the method suggested by Pedraza-Jiménez, Valverde-Albacete, and Navia-Vázquez.27 The learning technique used for the creation of the thesaurus includes four phases: preprocessing the documents, parameterizing the selected terms, conceptualizing their lexical stems, and generating a lattice or graph that shows the relations between the identified concepts.

Essentially, the aim of the preprocessing phase is to prepare the documents' parameterization by removing elements regarded as superfluous. We have developed this phase in three stages: eliminating tags (stripping), standardizing, and stemming. In the first stage, all the tags (HTML, XML, etc.) that can appear in the collection of documents are eliminated. The second stage is the standardization of the words in the documents in order to facilitate and improve the parameterization process. At this stage, the acronyms and N-grams (bigrams and trigrams) that appear in the documents are identified using lists that were created for that purpose. Once we have detected the acronyms and N-grams, the rest of the text is standardized: dates and numerical quantities are standardized by substituting them with a placeholder that identifies them; all terms (except acronyms) are converted to lowercase; and punctuation marks are removed. Finally, a list of function words is used to eliminate articles, determiners, auxiliary verbs, conjunctions, prepositions, pronouns, interjections, contractions, and degree adverbs from the texts.

All the terms are stemmed to facilitate the lookup of the final terms and to improve the computation of their weights during parameterization. To carry out this task, we have used Morphy, the stemming algorithm used by WordNet. This algorithm implements a group of functions that check whether a term is an exception that does not need to be stemmed and then convert words that are not exceptions to their basic lexical form. Those terms that appear in the documents but are not identified by Morphy are eliminated from our experiment.

The parameterization phase is straightforward. Once identified, the final terms (roots or bases) are quantified by being assigned a weight. Such weight is obtained by applying the term frequency–inverse document frequency (tf-idf) scheme, a statistical measure that quantifies the importance of a term or N-gram in a document according to its frequency of appearance both in the document and in the collection the document belongs to.

Finally, once the documents have been parameterized, the associated meanings of each term (lemma) are extracted by searching for them in WordNet (specifically, we use WordNet 2.1 for UNIX-like systems). Thus we get the group of synsets associated with each word. The groups of hyperonyms and hyponyms are also extracted from the vocabulary of the analyzed collection of documents.

The generation of our thesaurus, that is, the identification of the descriptors that best represent the content of the documents and of the underlying relations between them, is achieved using formal concept analysis techniques. This categorization technique uses the theory of lattices and ordered sets to find abstraction relations from the groups it generates. Furthermore, this technique enables clustering the documents depending on the terms (and synonyms) they contain.
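The preprocessing and parameterization phases described above can be summed up in a short sketch, with NLTK's wordnet.morphy standing in for Morphy, a placeholder stopword list, and acronym and N-gram detection omitted; dates and numbers are simply dropped here rather than replaced with a placeholder.

```python
# Sketch of the preprocessing (stripping, standardizing, stemming) and
# tf-idf parameterization phases (stopword list and corpus are placeholders;
# requires NLTK with the wordnet corpus downloaded).
import math
import re
from collections import Counter
from nltk.corpus import wordnet as wn

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "for"}  # tiny placeholder

def preprocess(text: str) -> list:
    text = re.sub(r"<[^>]+>", " ", text)          # stripping: remove tags
    tokens = re.findall(r"[a-z]+", text.lower())  # standardize: lowercase
    stems = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue                              # drop function words
        stem = wn.morphy(tok)                     # Morphy-style stemming
        if stem:                                  # unidentified terms dropped
            stems.append(stem)
    return stems

def tfidf(docs: list) -> list:
    """Weight each stem with tf-idf over the tokenized collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [preprocess("<p>Digital libraries and thesauri</p>"),
        preprocess("<p>Building a thesaurus for a library</p>")]
print(tfidf(docs))
```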
Also, a lattice graph is generated according to the underlying relations between the terms of the collection, taking into account the hyperonyms and hyponyms extracted. In that graph, each node represents a descriptor (namely, a group of synonym terms) and clusters the set of documents that contain it, linking them to those with which it has any relation (of hyponymy or hyperonymy). Once the thesaurus is obtained by identifying its terms and the underlying relations between them, it is automatically represented using the Simple Knowledge Organization System (SKOS) vocabulary (see figure 1).28

Figure 1. Sample entry of a SKOS Core thesaurus

User profiles

User profiles can be defined as structured representations that contain the personal data, interests, and preferences of users, with which agents can operate to customize the SDI service. In the model proposed here, these profiles are basically defined with Friend of a Friend (FOAF), a specific RDF/XML vocabulary for describing people (which favours profile interoperability, since this is a widespread vocabulary supported by an OWL ontology), and another nonstandard vocabulary of our own to define fields not included in FOAF (see figure 2).29

Figure 2. User profile sample

Profiles are generated the moment the user is registered in the system, and they are structured in two parts: a public profile that includes data related to the user's identity and affiliation, and a private profile that includes the user's interests and preferences about the topic of the alerts he or she wishes to receive.

To define their preferences, users must specify keywords and concepts that best define their information needs. Later, the system compares those concepts with the terms in the thesaurus, using the edit tree algorithm as a similarity measure.30 This function matches character strings, then returns the term introduced (if there is an exact match) or the lexically most similar term (if not); a sketch of this matching step is given below. Consequently, if the suggested term satisfies user expectations, it will be added to the user's profile together with its synonyms (if any). In those cases where the suggested term is not satisfactory, the system must provide some tool or application that enables users to browse the thesaurus and select the terms that better describe their needs. An example of this type of application is ThManager (http://thmanager.sourceforge.net), a project of the Universidad de Zaragoza, Spain, that enables editing, visualizing, and going through structures defined in SKOS.

Each of the terms selected by the user to define his or her areas of interest has an associated linguistic frequency value that we call "satisfaction frequency." It represents the regularity with which a particular preference value has been used in alerts positively evaluated by the user. This frequency measures the relative importance of the preferences stated by the user and allows the interface agent to generate a ranking list of results. The range of possible values for these frequencies is defined by a group of seven labels that we get from the fuzzy linguistic variable "Frequency," whose expression domain is defined by the linguistic term set S = {always, almost_always, often, occasionally, rarely, almost_never, never}, one of which serves as the default value, with "occasionally" being the central value.
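The matching step referred to above can be sketched with Python's difflib standing in for the string-similarity function cited in the text; the thesaurus terms are invented for the example.

```python
# Sketch: match a user-supplied keyword to the closest thesaurus term
# (difflib stands in for the string-matching function described above).
import difflib

THESAURUS = ["cataloguing", "classification", "indexing",
             "information retrieval", "metadata"]  # hypothetical terms

def match_term(keyword: str, thesaurus=THESAURUS):
    """Return the keyword on an exact match, else the most similar term."""
    if keyword in thesaurus:
        return keyword
    close = difflib.get_close_matches(keyword, thesaurus, n=1, cutoff=0.0)
    return close[0] if close else None

print(match_term("indexing"))        # exact match -> "indexing"
print(match_term("clasification"))   # misspelling -> "classification"
```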
RSS feeds

Thanks to the popularization of blogs, there has been widespread use of several vocabularies specifically designed for the syndication of contents (that is, for making the content of a website accessible to other Internet users by means of hyperlink lists called "feeds"). To create our current-awareness bulletin we use RSS 1.0, a vocabulary that enables managing hyperlink lists in an easy and flexible way. It utilizes the RDF/XML syntax and data model and is easily extensible because of its use of modules, which enable extending the vocabulary without modifying its core each time new describing elements are added. In this model several modules are used: the Dublin Core (DC) module, to define the basic bibliographic information of the items using the elements established by the Dublin Core Metadata Initiative (http://dublincore.org); the syndication module, to help software agents synchronize and update RSS feeds; and the taxonomy module, to assign topics to feed items.

The structure of the feeds comprises two areas: one where the channel itself is described by a series of basic metadata, like a title, a brief description of the content, and the updating frequency; and another where the descriptions of the items that make up the feed are defined, including elements such as title, author, summary, hyperlink to the primary resource, date of creation, and subjects (see figure 3).

Figure 3. RSS feed item sample

Recommendation log file

Each document in the repository has an associated recommendation log file in RDF that lists the evaluations assigned to that resource by different users since the resource was added to the system. Each entry of a recommendation log file consists of a recommendation value, a URI that identifies the user who made the recommendation, and the date of the record (see figure 4). The expression domain of the recommendations is defined by the following set of five fuzzy linguistic labels, extracted from the linguistic variable "Quality of the resource": Q = {Very_low, Low, Medium, High, Very_high}.

Figure 4. Recommendation log file sample
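Continuing the rdflib sketches above, a recommendation log entry of this shape might be expressed as follows; the REC namespace and its property names are hypothetical, not the vocabulary actually used by the system.

```python
# Sketch: one recommendation log entry as RDF triples with rdflib
# (the REC namespace and property names are invented for illustration).
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

REC = Namespace("http://example.org/recommendation#")  # hypothetical schema

g = Graph()
entry = BNode()
g.add((entry, REC.value, Literal("High")))  # label from the variable Q
g.add((entry, REC.user, URIRef("http://example.org/user/1234")))
g.add((entry, DC.date, Literal("2008-11-03")))  # date of the record
print(g.serialize(format="turtle"))
```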
These elements represent the raw materials for the SDI service, enabling it to develop its activity through four processes or functional modules: the profiles updating process, the RSS feeds generation process, the alert generation process, and the collaborative recommendation process.

System processes

Profiles updating process

Since the SDI service's functions are based on generating passive searches over the RSS feeds from the preferences stored in a user's profile, updating the profiles becomes a critical task. User profiles are meant to store long-term preferences, but the system must be able to detect any subtle change in these preferences over time to offer accurate recommendations. In our model, user profiles are updated using a simple mechanism that enables finding users' implicit preferences by applying fuzzy linguistic techniques and taking into account the feedback users provide.

Users are asked about their satisfaction degree (ej) in relation to the information alert generated by the system (i.e., whether the items retrieved are interesting or not). This satisfaction degree is obtained from the linguistic variable "Satisfaction," whose expression domain is the set of seven linguistic labels S' = {Total, Very_high, High, Medium, Low, Very_low, Null}.

This mechanism updates the satisfaction frequency associated with each user preference according to the satisfaction degree ej. It requires the use of a matching function similar to those used to model threshold weights in weighted search queries.31 The function proposed here rewards the frequencies associated with the preference values present when the resources assessed are satisfactory, and it penalizes them when this assessment is negative. Let ej ∈ S' be the degree of satisfaction, and let f_i,l ∈ S be the frequency of property i (in this case, i = "Preference") with value l; then we define the updating function g as a mapping g: S' × S → S.
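A minimal sketch of one possible realization of g follows, assuming a simple reward/penalize shift of one position along S depending on whether the satisfaction degree reaches a threshold; the threshold and step size are illustrative assumptions, not the paper's exact definition.

```python
# Sketch of a reward/penalize updating function g: S' x S -> S
# (the "Medium" threshold and the single-step shift are assumptions).
FREQ = ["never", "almost_never", "rarely", "occasionally",
        "often", "almost_always", "always"]            # the set S
SAT = ["Null", "Very_low", "Low", "Medium",
       "High", "Very_high", "Total"]                   # the set S'

def g(satisfaction: str, frequency: str, threshold: str = "Medium") -> str:
    """Shift the frequency label up on a positive assessment, down otherwise."""
    i = FREQ.index(frequency)
    if SAT.index(satisfaction) >= SAT.index(threshold):
        i = min(i + 1, len(FREQ) - 1)   # reward the preference value
    else:
        i = max(i - 1, 0)               # penalize it
    return FREQ[i]

print(g("High", "occasionally"))  # -> "often"
print(g("Low", "occasionally"))   # -> "rarely"
```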