On the Use of the Dempster Shafer Model in Information Indexing and Retrieval Applications

Shimon Schocken, Stern School of Business, New York University
Robert A. Hummel, Courant Institute of Mathematical Sciences, New York University

Working Paper Series STERN IS-92-27

Department of Information Systems, Management Education Center, Stern School of Business, New York University, 44 W. 4th Street, New York, NY 10003.
Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012.

On the Use of the Dempster Shafer Model in Information Indexing and Retrieval Applications

The Dempster Shafer theory of evidence concerns the elicitation and manipulation of degrees of belief rendered by multiple sources of evidence to a common set of propositions. Information indexing and retrieval applications use a variety of quantitative means - both probabilistic and quasi-probabilistic - to represent and manipulate relevance numbers and index vectors. Recently, several proposals were made to use the Dempster Shafer model as a relevance calculus in such applications. The paper provides a critical review of these proposals, pointing at several theoretical caveats and suggesting ways to resolve them. The methodology is based on expounding a canonical indexing model whose relevance measures and combination mechanisms are shown to be isomorphic to Shafer's belief functions and to Dempster's rule, respectively. Hence, the paper has two objectives: (i) to describe and resolve some caveats in the way the Dempster Shafer theory is applied to information indexing and retrieval, and (ii) to provide an intuitive interpretation of the Dempster Shafer theory, as it unfolds in the simple context of a canonical indexing model.

Keywords: Theory of evidence, Dempster Shafer model, relevance measures, information indexing and retrieval.

1 Introduction

Consider a finite and exhaustive set of mutually-exclusive propositions and a body of evidence that supports some subsets of propositions and discounts others. Many theories were put forward to describe how one should represent and update one's degrees of belief in such propositions when new or additional evidence is brought to bear. The classical approach is to cast degrees of belief as probabilities - a set of numbers between 0 and 1 that obeys the axioms of subjective probability - and use Bayesian inference rules to revise them in light of new evidence. One problem with this approach is that it doesn't offer a clear way to model the various degrees of 'uncommitted belief,' or 'second order uncertainty,' that characterize most realistic inference problems. For example, consider the extreme case of 'insufficient reason,' in which one knows absolutely nothing about a given set of n propositions. The common solution, which goes back to Laplace, is to assign a degree of belief of 1/n to each of the propositions under consideration. Incidentally, this is also the solution that emerges from maximizing the unconstrained entropy function associated with the n unknown probabilities.

Over the years, many students of belief revision theories have objected to this crude quantification of insufficient reason. Why, the argument goes, should ignorance be translated to the strong statement that every proposition (or state of nature) is equally likely?
This criticism has led to several alternative models that attempt to capture the elusive notion of uncommitted belief by modifying the axiomatic framework of probability theory. Perhaps the best known model in this category is the 'theory of evidence,' originated by Dempster's (1967, 1967a) work on upper and lower probabilities. Dempster's ideas, which were based on a frequentist view of inference, were refined and extended by Shafer (1976), who also gave them a subjective interpretation. This led to the Dempster Shafer model - an elaborate formalism for representing and revising degrees of support rendered by multiple sources of evidence to a common set of propositions¹.

When the work of Dempster and Shafer was 'discovered' by the artificial intelligence community, it immediately stirred considerable interest in two application areas in which normative models of belief formation play a key role: expert systems, and information indexing and retrieval systems. For expert systems, the Dempster Shafer (DS) model provides a mathematically-sound model for representing and manipulating rule-based degrees of belief, an area that was traditionally dominated by ad-hoc belief revision calculi whose relationship to probability theory was at best murky. For information indexing and retrieval systems, the DS model can be used as a relevance calculus, designed to quantify and revise the degrees of relevance between documents, keywords, and user-supplied queries.

This line of thought has led to the development of several DS-based information indexing and retrieval applications. For example, Biswas, Bezdek, Marques, and Subramanian (1987) built a document retrieval system in which the relevance of documents to taxonomical classes was measured and manipulated, respectively, by belief functions and Dempster's rule: "We choose to define similarity functions based on the Dempster Shafer theory of evidence ... one of the advantages of this approach is that it reflects the process of belief revision and updating just as in human reasoning processes." (Biswas et al, 1987). Coming from a different direction, Turtle and Croft (1991) describe a canonical representation in which relevance is handled through inference networks that are structured as directed acyclic graphs. The nodes in the networks correspond to keywords, documents, and queries, and "the arcs joining the nodes are interpreted as assertions that the parent node provides support for the child node." Turtle and Croft proposed to operationalize these degrees of support through either subjective probabilities, or DS belief functions. A similar approach was undertaken in RUBRIC, a full-text information retrieval system described by Tong and Shapiro (1985). RUBRIC can be instantiated to operate with several alternative relevance calculi, the DS model being a prime example.

The importance of such applications is obvious, as they attempt to take the DS model 'out of the lab' and implement it in realistic settings. In doing so, however, many adopters of the DS model have taken the model's validity for granted, either explicitly or implicitly.

[Footnote 1: In this paper, the terms 'the Dempster Shafer theory of evidence' and 'the Dempster Shafer model' are used interchangeably.]
With that in mind, it is important to point out that both the cognitive and the normative roots of the DS model are still a matter of intense controversy: whereas Shafer (1987) argues that the theory of evidence is a natural extension of probability theory, many critics, e.g. Lindley (1987), view it as a reformulated version of a specialized, albeit interesting, case of classical probabilistic inference. The debate is not helped by the somewhat forbidding notation of the DS model, which prevents an intuitive understanding of its underlying structure and philosophy.

In fact, the gap between the theory and practice of the DS model seems to be two-directional. On the one hand, many practitioners believe that the normative correctness of the DS model is a 'closed case,' proceeding to implement it without questioning its underlying rationale. On the other hand, many researchers try to defend the DS model on complex philosophical and mathematical grounds, without realizing that simpler justifications can be found in the field, i.e. in the way the model is actually used in certain canonical settings. The latter point is worth emphasizing: a close examination of certain applications of the DS model can provide not only a better understanding of the model, but, furthermore, a compelling normative justification.

The plan of the paper is depicted in figure 1 and described as follows. §2 presents the notion of index vectors and the challenge of eliciting and measuring relevance in a normative, rather than ad-hoc, fashion. §3 gives an overview of the DS model, as it unfolds in the context of a typical information indexing and retrieval (IR) application. This sets the stage for four critical questions regarding the theoretical fit between the general features of the DS model and the specific requirements of IR applications. In order to answer these questions, §4 presents a canonical indexing model in which the notions of lexicons, taxonomies, and relevance are treated formally and unambiguously. It is then shown that the canonical model is completely isomorphic to the DS model, leading to a new intuitive understanding of the latter. §5 offers concluding remarks about the implications of the research on the DS model and on IR applications.

Put figure 1 around here

2 The Problem

Models of bibliographical indexing concern the construction of data structures that enable rapid content-based access to collections of documents. Given a document, on the one hand, and a keyword lexicon, on the other, the goal of the indexing model is to select a subset of keywords that 'best' describes the document to its prospective users. Since some keywords are more relevant to the document than others, a numeric scale is often used to express the strength of association between the document and the selected keywords. The result is an index vector, consisting of pairs of keywords and their respective relevance weights. Several models exist for representing and manipulating such relevance vectors, and the reader is referred to Salton and McGill (1983) and to Salton and Buckley (1988) for comprehensive treatments of the general approach to the subject.

Formally, let D be a set of documents about a certain domain of interest, and let K = {k_1, ..., k_m} be a lexicon, or a set of domain keywords.
The index of each document d ∈ D is a set of pairs of the form:

    S_d = {(K_1, r_1), ..., (K_n, r_n)}                                  (1)

where K_i ⊆ K and 0 ≤ r_i ≤ 1, i = 1, ..., n. The K_i's are lexical subsets, representing different groupings of keywords, and the r_i's are called relevance numbers. Taken together, the pair (K_i, r_i) ∈ S_d says that the degree of relevance between the document d and the lexical subset K_i is r_i. Had we restricted the K_i's to be singletons only, (1) would become the familiar 'term-weight vectors' that are normally used in information indexing and retrieval applications. Further, had we required that all the r_i be 0 or 1 only, (1) would be reduced to the familiar keyword list (also called 'subject headings') that is normally used to classify articles in professional journals. Given the obvious simplicity of a Boolean indexing scheme, why bother about developing formalisms for weighted indexing?

The answer is that relevance is a subjective and composite relation, based on an aggregation of several indexing opinions. Specifically, each document has many classifiers, or discerning characteristics, that determine its relevance. For example, the title of a document can suggest one index, whereas the abstract can suggest another. Other aspects of the document, obtained through lexical, linguistic, and citation analyses, will yield additional indexing opinions that must be taken into consideration. Hence, even if the individual opinions were forced to be binary, their aggregation would probably induce a continuous index. In addition, the indexing opinions are not cast automatically; rather, they are elicited from human catalogers who inject yet another level of uncertainty and subjectivity into the indexing process. That is, when two catalogers are given access to the same classifier as background information, they may well supply two different (but hopefully similar) indexing opinions.

Different IR applications use different models to handle this pluralism in a formal way. From a theoretical perspective, the credibility of these models hinges on their capacity to elicit, represent, and synthesize relevance opinions in a normative, rather than ad-hoc, fashion. In order to do so, the relevance numbers and the rules that combine them must be given a compelling interpretation. So far, the leading interpretation in the study of relevance has been probabilistic, beginning with the seminal work of Maron and Kuhns (1960). Recently, however, several attempts were made to handle relevance in IR applications using the Dempster Shafer model, which is widely considered to be a less restricted extension of probability theory. The strengths and weaknesses of the latter approach are discussed in the next section.

3 A Dempster-Shafer Indexing Model

The DS theory of evidence concerns the representation and manipulation of degrees of support rendered by different sources of evidence to a common set of propositions, denoted Θ and called the frame of discernment. In contrast to a standard Bayesian design, in which degrees of belief are normally assigned to elements of Θ directly, the DS model assigns degrees of belief to subsets of propositions, i.e. to members of the power set 2^Θ, also called 'possibilities.' The DS model offers several complementary ways to express evidential support in possibilities.
In particular, the model defines three mappings from 2^Θ to [0, 1], termed mass, belief, and plausibility functions. The three mappings are mathematically equivalent in the sense that knowledge of any one of them (for every possibility) can be used to compute the other two. Therefore, we view them as alternative means to keep score of the same primitive set of degrees of support. In the standard model, when several sources of evidence support a common set of possibilities (the support can be cast in terms of either mass, belief, or plausibility functions), the overall support in the possibilities is computed through Dempster's rule of combination.

What is the nexus of the DS model and information indexing and retrieval applications? In one way or another, all DS-based IR applications are based on the following premises: (i) the DS notion of degrees of support can be used to operationalize the IR notion of relevance numbers; and (ii) when two or more classification criteria supply different sets of relevance numbers concerning the same document, Dempster's rule provides a plausible mechanism to combine them into a composite index (said otherwise: revise the relevance of the document to certain keywords in light of new evidence). The goal of this section is to motivate a critical analysis of these premises. Specifically, we intend to:

- First, provide a rigorous but accessible overview of the DS model, as it unfolds in the familiar context of an IR application;
- Second, present a series of questions regarding the theoretical fit between the general features of the DS model and the specific characteristics of IR applications.

The Frame of Discernment: The frame of discernment Θ is an exhaustive set of mutually exclusive elements that can be interpreted as hypotheses, propositions, or simply 'labels.' The power set that contains all the subsets of Θ (including Θ itself and the empty set) is denoted 2^Θ. In general, the semantics of the labels depends on the context in which the DS model is applied. In information indexing and retrieval applications, the frame of discernment is normally taken to be a keyword lexicon K = {k_1, ..., k_n}. To illustrate, a lexicon that supports a collection of documents about modern art might be K = {Arp, Braque, Cezanne, ..., Zorn}, enumerating all the major artists of the Twentieth Century. The power set in this case is 2^K = {{Arp}, {Braque}, {Cezanne}, ..., {Arp, Braque}, {Arp, Cezanne}, {Braque, Cezanne}, ..., {Arp, Braque, Cezanne}, ..., ∅, K}, the last two elements being the empty set and K itself. Each element in 2^K represents a disjunction of keywords, denoted hereafter a lexical subset. The act of indexing a document using K amounts to choosing, among all the indexing possibilities in 2^K, the one or more lexical subsets that best describe the document to its potential users.

For example, suppose that an art scholar is asked to index the document "The Influence of Cezanne on early Cubism" using K, based on partial information such as the document's title or abstract. Without loss of generality, assume that (i) the main focus of the document is Cezanne; and (ii) the only Cubist artists in the lexicon are Braque and Picasso.
Under these assumptions, the scholar will probably supply an index of the form S = {({Cezanne}, r_1), ({Braque, Picasso}, r_2)}, with r_1 > r_2. This would entail the following information: (i) the document is relevant to Cezanne; (ii) it is also relevant, to a lesser extent, to either Braque or to Picasso. This is quite different from the indexing opinion S' = {({Cezanne}, r_1), ({Braque}, r_2), ({Picasso}, r_2)}, which would be more appropriate if the document's title were, say, "The Influence of Cezanne on the early work of Braque and Picasso". We arrive at our first question:

Question Q1: When the DS model is applied to information indexing and retrieval applications, the keyword lexicon K is taken to be the frame of discernment, and indexing possibilities are taken to be elements of the lexical power set 2^K. What are the taxonomical implications and limitations of this representation?

To motivate this question, consider again the document "The Influence of Cezanne on early Cubism". Note that the most reasonable index of this document would be S'' = {({Cezanne}, r_1), ({Cubism}, r_2)}, especially if the document's abstract makes no references to specific artists other than Cezanne. However, Cubism is not an element of the original lexicon K, so it doesn't entail an indexing possibility. To solve the problem, we may want to extend the original frame of discernment, creating a lexicon of the form K' = K ∪ {Cubism}. However, the keywords Braque, Picasso, and Cubism have a great deal in common from a bibliographical standpoint. Therefore, K' is not a valid frame of discernment, because some of its elements are no longer mutually exclusive. Before we present a solution to this problem, we have to be very specific about the proper relationship among frames of discernment, keyword lexicons, and taxonomies of classes. We'll return to this issue in section 4, where an answer to Q1 is given.

Mass Functions: A mapping m : 2^Θ → [0, 1] with the properties

    m(∅) = 0   and   Σ_{X ⊆ Θ} m(X) = 1                                  (3)

is called a mass function². In the DS model, the mass m(X) represents the degree to which a certain source of evidence supports the possibility X, where X ⊆ Θ. As a convention, the mass which is 'left over' after all the proper subsets of Θ have been assigned masses is allocated to Θ itself and denoted the uncommitted belief displayed by m, or m(Θ). In DS-based IR applications, where Θ is taken to be a keyword lexicon K, the mass m(X) is taken to represent (to a first approximation that will be discussed shortly) a degree of relevance, or, more accurately, the degree of belief that the document is relevant to the lexical subset X ⊆ K, according to a certain classifier. Hence, if a classifier (say, classifier number 1) supplies the relevance opinion S_1 = {({Cezanne}, 0.6), ({Braque, Picasso}, 0.3)}, then the mass function that is induced by this opinion is defined as follows:

    m_1({Cezanne}) = 0.6
    m_1({Braque, Picasso}) = 0.3
    m_1(K) = 0.1                                                         (4)
    m_1(X) = 0 for all other proper subsets of K

Note that the uncommitted belief induced by the opinion is assigned by default to the frame of discernment by means of m_1(K) = 1 - 0.6 - 0.3 = 0.1.

[Footnote 2: Throughout the paper, upper case variables refer to sets and lower case variables refer to scalars.]
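The construction of a mass function from an indexing opinion is easy to make concrete. The following sketch (hypothetical Python; the names mass_from_opinion and LEXICON are ours, not part of the DS formalism) builds the mass function (4) from classifier 1's opinion, assigning whatever relevance is 'left over' to the lexicon K as uncommitted belief:

    # A minimal sketch, assuming lexical subsets are represented as frozensets of keywords.
    LEXICON = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})   # abridged lexicon K

    def mass_from_opinion(opinion, lexicon):
        """Turn an indexing opinion {subset: relevance} into a DS mass function.

        Whatever relevance is left over after the named subsets is assigned to the
        whole lexicon, i.e. it becomes the uncommitted belief m(K).
        """
        total = sum(opinion.values())
        assert 0.0 <= total <= 1.0, "relevance numbers may not exceed 1 in total"
        mass = dict(opinion)
        mass[lexicon] = mass.get(lexicon, 0.0) + (1.0 - total)
        return mass

    # Classifier 1's opinion S_1 = {({Cezanne}, 0.6), ({Braque, Picasso}, 0.3)}:
    m1 = mass_from_opinion({frozenset({"Cezanne"}): 0.6,
                            frozenset({"Braque", "Picasso"}): 0.3}, LEXICON)
    # m1[LEXICON] == 0.1, i.e. m1(K) = 1 - 0.6 - 0.3 = 0.1, as in (4).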
The rationale for this assignment is as follows. If a certain classifier provides no information whatsoever about indexing possibilities, the classifier's 'ignorance' can be represented by the index S = {(K, 1)}. This implies the mass function m({Arp, Braque, Cezanne, ..., Zorn}) = 1 and m(X) = 0 elsewhere, reflecting the (not very useful) opinion that the document is relevant to Arp, or to Braque, or to Cezanne, or to any other artist in the lexicon. Other classifiers can provide more focused relevance opinions, resulting in lower levels of m(K). Hence, unlike a standard probabilistic design, where the notion of uncommitted belief is not well-defined, the DS model provides explicit means to quantify and manipulate it via m(K). Although uncommitted beliefs, or 'second-order uncertainties,' can and have been treated in the standard framework of subjective probability (e.g. Baron, 1987), there is no simple way to do it. The theory of evidence is unique in that it treats the notion of uncommitted belief explicitly, at the axiomatic level.

It's important to observe that mass functions represent indivisible, or atomic, degrees of belief. For example, the magnitudes of m({Braque, Picasso}), m({Braque}), and m({Picasso}) are unrelated, and a mass function like m({Braque, Picasso}) = 0.9, m({Braque}) = 0, and m({Picasso}) = 0 is not inconsistent with the theory. This particular function represents a cataloger who strongly believes that the document is relevant to either Braque or to Picasso, although he is not willing to say anything more specific beyond this assessment. But what does this notion of relevance mean? We arrive at our next question:

Question Q2: A mass function is a formal, domain-independent, component of the DS model. Relevance is an informal, but highly intuitive, concept that plays a key role in information indexing and retrieval applications. If a mass function is taken to represent relevance, then what is the exact semantics of this representation? Said otherwise, what type of relevance do mass functions represent?

Question Q2 suggests the premise that mass functions are not necessarily a natural representation of the intuitive notion of relevance, as it is typically construed in information indexing and retrieval applications. For example, if mass functions are used to represent relevance, then the relevance numbers in each index must sum up to 1. That is, the set of allowable indexing opinions {(K_1, r_1), ..., (K_n, r_n)} is constrained by Σ_i r_i = 1. Many would argue that this constraint doesn't make sense, and that an indexing opinion like, say, {({Albers}, 0.8), ({Kandinsky}, 0.4), ({Klee}, 0.4)} is perfectly reasonable. The only 'wrong' thing about this opinion is that it is inconsistent with the DS notion of a mass function, but this seems to be a limitation of the model's application, not of the opinion. One pragmatic solution is to treat the relevance numbers not as absolute, but rather as relative, measures of subjective relevance. According to this position, the two indexes S = {(A, 0.8), (B, 0.4), (C, 0.4)} and S' = {(A, 0.4), (B, 0.2), (C, 0.2)} are equally informative, as both imply exactly the same relative information: the document is twice as relevant to A as it is to B, and it is as relevant to B as it is to C.
However, this immediately leads to another snag: according to the same principle, the index is also equivalent to S'' = {(A, 0.2), (B, 0.1), (C, 0.1)}. Yet S' and S'' reflect two different states of uncommitted belief (0.2 and 0.6, respectively), and thus they don't induce the same mass function. To get around the problem, we can elicit uncommitted beliefs directly from the catalogers³. For example, having specified a relevance opinion, say {(A, 0.8), (B, 0.4), (C, 0.4)}, the cataloger can be asked to rate his confidence in the opinion on a scale of 0 to 1. If the confidence level is 1, the index is normalized to {(A, 0.5), (B, 0.25), (C, 0.25)}, reflecting an uncommitted belief of 0. If the confidence level is 0.8, the index is normalized to {(A, 0.4), (B, 0.2), (C, 0.2)}, reflecting an uncommitted belief of 0.2. In general, for any unconstrained indexing opinion {(K_1, r_1), ..., (K_n, r_n)} and a confidence level c, we can find a unique mass function {m(K_1), ..., m(K_n), m(K)} such that (i) the m(K_i)'s preserve the relative proportions of the unconstrained r_i's; and (ii) m(K) = 1 - c.

The shift from an absolute to a relative scale of relevance has several justifications. First, a significant body of psychological and cognitive evidence indicates that relevance is indeed a relative property (Saracevic, 1975). Second, we are motivated by the observation that ultimately, an IR application must satisfy the information needs of library patrons, and that relevance numbers should be used pragmatically to that end. For example, according to Maron's (1982) 'Ranking Principle,' the chief objective of relevance numbers is to present to the patron a set of documents, sorted in decreasing order of perceived relevance to his or her query. A similar principle is used in diagnostic expert systems, where ordinal, rather than cardinal, degrees of belief are often used to guide the inference engine to promising directions and to explain the system's reasoning to the people who consult it. If we accept Maron's Ranking Principle as a working assumption, then normalization is not an issue, since rankings are invariant under normalization. However, when multiple indexing opinions are aggregated into a pooled index (something that we haven't done yet), normalization becomes a tricky manipulation. Specifically, let S_1 and S_2 be two indexing opinions, ⊕ an aggregation operator, and N a normalization operator. In many cases (depending on the specific definitions of ⊕ and N), it can be shown that N(S_1 ⊕ S_2) ≠ N(S_1) ⊕ N(S_2), i.e. that N is not homomorphic.

In conclusion, we see that even though relevance numbers can be represented by mass functions, the representation has some theoretical caveats. Clearly, these limitations are related to the fact that we are still lacking explicit domain semantics. That is, we don't know yet what is the exact meaning of relevance numbers. This analysis is taken up in section 4, where a complete answer to question Q2 is given.

[Footnote 3: In this section, the terms classifier and cataloger are used interchangeably. The distinction between the two terms is made explicit in the next section.]
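The confidence-based normalization described above is a simple rescaling. The following sketch (hypothetical Python; normalize_opinion is our name, not a construct of the DS model) preserves the relative proportions of the relevance numbers and sets the uncommitted belief to 1 - c:

    def normalize_opinion(opinion, confidence, lexicon):
        """Rescale an unconstrained opinion {subset: weight} into a mass function
        whose uncommitted belief m(K) equals 1 - confidence."""
        total = sum(opinion.values())
        mass = {subset: confidence * weight / total for subset, weight in opinion.items()}
        mass[lexicon] = 1.0 - confidence            # uncommitted belief
        return mass

    K = frozenset({"Albers", "Kandinsky", "Klee"})
    raw = {frozenset({"Albers"}): 0.8, frozenset({"Kandinsky"}): 0.4, frozenset({"Klee"}): 0.4}

    normalize_opinion(raw, 1.0, K)   # {Albers}: 0.5, {Kandinsky}: 0.25, {Klee}: 0.25, m(K) = 0
    normalize_opinion(raw, 0.8, K)   # {Albers}: 0.4, {Kandinsky}: 0.2,  {Klee}: 0.2,  m(K) = 0.2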
The Core: The core of a mass function m : 2^Θ → [0, 1] is the set of possibilities X ∈ 2^Θ for which m(X) > 0. When the frame Θ is taken to be a keyword lexicon K, the core becomes a list of indexing possibilities, in the view of one particular classifier. For example, the core of the mass function induced by classifier 1 (Eqn. 4) is C_1 = {{Cezanne}, {Braque, Picasso}, K}. Suppose now that the same document is indexed by another classifier (classifier no. 2), whose indexing opinion is captured by the following mass function:

    m_2({Picasso}) = 0.8
    m_2(K) = 0.2                                                         (5)
    m_2(X) = 0 for all other proper subsets of K

The core of this mass function is C_2 = {{Picasso}, K}. Is there a credible way to combine the two indexing opinions (4), (5) into an aggregate index? As a first approximation, one can focus on all the lexical subsets that both classifiers agree are relevant to the document. In particular, if classifier 1 thinks that X is relevant and classifier 2 thinks that Y is relevant, then both classifiers agree that X ∩ Y is relevant (recall that both X and Y are interpreted as disjunctions of keywords). This leads to the following definition of a pooled core: Let m_1, m_2 : 2^Θ → [0, 1] be two mass functions with cores C_1 and C_2. The pooled core C = C_1 ⊕ C_2 will be:

    C_1 ⊕ C_2 = {X ∩ Y : X ∈ C_1, Y ∈ C_2, X ∩ Y ≠ ∅}                    (6)

For example, the pooled core of C_1 = {{Cezanne}, {Braque, Picasso}, K} and C_2 = {{Picasso}, K} is C_1 ⊕ C_2 = {{Cezanne}, {Picasso}, {Braque, Picasso}, K}⁴. In general, then, the pooled core can be viewed as a first approximation of the degree of consensus or disagreement displayed by two independent indexing opinions. If C_1 ⊕ C_2 = C_1 = C_2, we have a consensus regarding which possibilities are likely. If C_1 ⊕ C_2 = ∅, the classifiers agree on nothing. If C_1 ⊕ C_2 is not empty, we have an overlap of some opinions. Of course the problem of (6) is that it merely identifies areas of mutual agreement (or lack thereof) between two classifiers. In order to compute the intensity of such agreements, a more sensitive pooling mechanism is required. Dempster's rule provides one such mechanism.

Dempster's Rule: The most fundamental (and debatable) pillar of the DS model is the convention that once degrees of support are cast in terms of mass functions, Dempster's rule provides a proper mechanism to combine them. Let m_1 and m_2 be two mass functions defined over the same frame of discernment, m_1, m_2 : 2^Θ → [0, 1], with cores C_1 = {A_1, ..., A_{n_1}} and C_2 = {B_1, ..., B_{n_2}}, respectively. Dempster's rule computes the pooled mass function m = m_1 ⊕ m_2 : 2^Θ → [0, 1] as follows:

    m'(X) = Σ_{A_i ∩ B_j = X} m_1(A_i) · m_2(B_j)                        (7)

    m(X) = m'(X) / (1 - m'(∅))  for X ≠ ∅,  and  m(∅) = 0                (8)

The rationale behind (7-8) can be explicated through an 'intersection table.' In our two-classifiers scenario (4-5), the table has the following form:

                          m_1(Cezanne) = 0.6   m_1(Braque, Picasso) = 0.3   m_1(K) = 0.1
    m_2(Picasso) = 0.8    m'(∅) = 0.48         m'(Picasso) = 0.24           m'(Picasso) = 0.08
    m_2(K) = 0.2          m'(Cezanne) = 0.12   m'(Braque, Picasso) = 0.06   m'(K) = 0.02

The top row of the table records the mass function of the first classifier excluding its zero elements, i.e. the set of values m_1(A_1), ..., m_1(A_{n_1}) for elements A_i in the core C_1. The left column of the table records the mass values of the second classifier for its core elements, i.e. the set of values m_2(B_1), ..., m_2(B_{n_2}). (The curly brackets are dropped for the sake of brevity, e.g. m(Picasso, Braque) stands for m({Picasso, Braque}), etc.) Inside the table, the (i, j)'th cell records the pooled mass contributed to A_i ∩ B_j by the pair A_i and B_j, which is taken to be the product m_1(A_i) · m_2(B_j).

[Footnote 4: Note that K acts as an attractor, in that A ∩ K = A for all A ⊆ K.]
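A short sketch (hypothetical Python; dempster_combine is our name) makes the mechanics of (7-8) concrete. Mass functions are represented as dictionaries mapping frozensets to masses; applied to the two classifier opinions (4) and (5), the sketch reproduces the intersection table and the pooled masses derived below:

    def dempster_combine(m1, m2):
        """Combine two mass functions over the same frame via Dempster's rule (7-8)."""
        pooled = {}
        conflict = 0.0
        for a, mass_a in m1.items():
            for b, mass_b in m2.items():
                x = a & b                        # intersection of the two possibilities
                if x:
                    pooled[x] = pooled.get(x, 0.0) + mass_a * mass_b
                else:
                    conflict += mass_a * mass_b  # mass that falls on the empty set
        # Renormalize so that the pooled masses again sum to 1 (eq. 8).
        return {x: mass / (1.0 - conflict) for x, mass in pooled.items()}

    K = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})        # abridged lexicon
    m1 = {frozenset({"Cezanne"}): 0.6, frozenset({"Braque", "Picasso"}): 0.3, K: 0.1}
    m2 = {frozenset({"Picasso"}): 0.8, K: 0.2}

    m = dempster_combine(m1, m2)
    # m[{Cezanne}] = 0.23, m[{Picasso}] = 0.62, m[{Braque, Picasso}] = 0.11, m[K] = 0.04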
Using these entries and combining cells with equivalent intersections following (7-8), one obtains:

    m'(Cezanne) = 0.12
    m'(Picasso) = 0.24 + 0.08 = 0.32
    m'(Picasso, Braque) = 0.06
    m'(K) = 0.02
    m'(∅) = 0.48

After dividing by 1 - m'(∅) = 0.52, one obtains:

    m(Cezanne) = 0.23
    m(Picasso) = 0.62
    m(Picasso, Braque) = 0.11                                            (10)
    m(K) = 0.04
    m(∅) = 0

Since the m(·)'s sum up to 1 and m(∅) = 0, the mapping m = m_1 ⊕ m_2 that emerges from Dempster's rule is also a mass function, consistent with (3).

In words, Dempster's rule computes a measure of agreement between two sources of evidence concerning various possibilities drawn from a common frame of discernment. The rule is conservative in the sense that it focuses only on those possibilities that both sources support. The magnitude of the pooled support that a possibility X collects is computed by summing the products of the two masses m_1(X) and m_2(X), which explains the product operator in (7). Because the sources of evidence express their opinions over 2^Θ rather than over Θ, a joint agreement on a possibility can occur in more than one way, i.e. whenever the two sources support supersets of X. This explains the summation operator in (7), which runs over all the possible supersets of X. Finally, when a pairing of two opinions results in a null possibility (the empty set), the multiplication of their masses may still be positive. This is an anomaly, since the definition of a mass function (3) requires that the mass of the null possibility be zero. This explains the role of (8), in which m'(∅) is deducted from the total mass and the remaining mass is divided by (1 - m'(∅)) to ensure that the pooled mass will sum up to 1.

Dempster's rule is often compared to and contrasted with Bayes rule, because both rules concern the combination of probabilistic opinions into an aggregate (posterior) opinion. It is crucial to observe, however, that unlike Bayes rule, which is a trivial consequence of the axioms of probability theory, Dempster's rule is a prescriptive pooling mechanism which is neither right nor wrong, and thus it is less of a 'rule,' and more of a 'recipe.' Therefore, we take the position that the ultimate justification of Dempster's rule should be sought in the field, i.e., in the various applications in which the rule is supposed to have a certain sense of domain validity. This leads to the following question:

Question Q3: What is the intuitive justification of Dempster's rule in the context of information indexing and retrieval applications? If one wishes to aggregate indexing opinions via a certain pooling mechanism, then why use (7-8) and not another set of formulae?

A typical way to avoid this question is to invoke the argument: "If one uses mass functions to represent relevance numbers, then one should combine them using Dempster's rule, because that's how mass functions are combined in the DS model." This argument could have been valid if Dempster's rule had a normative, domain-independent, and non-controversial justification. But this is not the case.
In fact, many researchers have struggled to make sense of Dempster's rule, and the debate is still going strong: "Shafer's theory has been strongly criticized for its failure to give a meaning to the measures of belief and plausibility, or to show how someone might arrive at a particular numerical assessment. In the absence of a definite interpretation, it is difficult to see how the rules of the theory, and in particular Dempster's rule, can be justified" (Buxton, 1989). Given this controversy, the importance of question Q3 is obvious. Hence, our goal is to clarify, and to a certain extent defend, the meaning of Dempster's rule in the specific bibliographical context of an information indexing and retrieval application. This analysis is carried out in section 4, where a complete answer to Q3 is presented.

Belief Functions: Building on the elementary notion of a mass function m : 2^Θ → [0, 1], the function Bel : 2^Θ → [0, 1], denoted a belief function, can be defined as follows:

    Bel(X) = Σ_{A ⊆ X} m(A)                                              (11)

Whereas m(X) measures the support rendered to X (a subset of propositions) directly, Bel(X) measures the total support rendered to X and to all its subsets (each being a more specific proposition). This relationship is depicted in figure 3, which illustrates how a Bel(·) function can be derived from the m(·) function given by (10). Note that (3) and (11) imply that Bel(∅) = 0 and Bel(Θ) = 1 always. In fact, (11) implies that the Bel function is completely determined by the mass function m, and, likewise, that m can be recovered from Bel's definition (Shafer, 1976, p. 39).

Plausibility Functions: Whereas Bel(X) measures the total support rendered to a possibility X, the plausibility of X, denoted Pl(X), measures the maximal support that X can possibly attain under a given mass function m. Specifically:

    Pl(X) = Σ_{A ∩ X ≠ ∅} m(A)                                           (12)

In words, Pl(X) records the total mass allocated to all the possibilities with which X intersects. For a pictorial description of this relationship, refer again to figure 3.

Put figure 3 around here

The intuitive relationship between the three functions m(·), Bel(·), and Pl(·) can be described as follows. Beginning with the definition of Bel, consider the two possibilities X, A ⊆ Θ. Since both X and A are disjunctions of propositions, the set-theoretic statement A ⊆ X is equivalent to the logical rule A → X, which we will interpret as: 'If the truth lies in A, it must also lie in X.' Therefore, the sum of all the masses associated with premises A that imply X can be viewed as a measure of the total support rendered to X. As regards Pl's definition, suppose now that A ∩ X ≠ ∅ (but A is not necessarily a subset of X). Since the possibility A is a disjunction of propositions, the mass m(A) rendered to it can 'float' freely to any one of its subsets, including those that intersect X. In the extreme case, the intersection A ∩ X may inherit the entire mass of A. It follows that Pl(X) is the upper bound of Bel(X).
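Definitions (11) and (12) translate directly into code. The following sketch (hypothetical Python; belief and plausibility are our names) derives Bel and Pl from the mass function (4) of classifier 1:

    def belief(m, x):
        """Bel(X): total mass of all possibilities contained in X (eq. 11)."""
        return sum(mass for a, mass in m.items() if a <= x)

    def plausibility(m, x):
        """Pl(X): total mass of all possibilities that intersect X (eq. 12)."""
        return sum(mass for a, mass in m.items() if a & x)

    K = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})
    m1 = {frozenset({"Cezanne"}): 0.6, frozenset({"Braque", "Picasso"}): 0.3, K: 0.1}

    x = frozenset({"Braque", "Picasso"})
    belief(m1, x)          # 0.3  -- only m1({Braque, Picasso}) lies inside X
    plausibility(m1, x)    # 0.4  -- m1({Braque, Picasso}) plus m1(K), which intersects X
    # Pl(X) - Bel(X) = 0.1: the extent to which the support for X may still 'float' upward.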
To do justice to the theory of evidence, it should be noted that the construction of Bel and Pl using m is only one way to define these functions. Shafer provided direct definitions of mass, belief and plausibility functions in terms of each other. He has also emphasized the key role that subadditivity plays in the theory of evidence, a point which we now turn to discuss in the specific context of information indexing and retrieval.

Subadditivity: The complement of a set X ⊆ Θ, i.e. the set of all propositions that are in Θ and not in X, is denoted hereafter X̄. Definitions (11) and (12) imply the following important relationships:

    Pl(X) = 1 - Bel(X̄)                                                  (13)

    Bel(X) ≤ Pl(X)                                                       (14)

If a certain Bel_b were a Bayesian representation of degrees of belief, the additivity axiom of probability theory (X ∩ Y = ∅ implies Bel_b(X ∪ Y) = Bel_b(X) + Bel_b(Y)) would mean that

    Bel_b(X) = 1 - Bel_b(X̄)

yet (13) and (14) imply that in the general case Bel(X) ≤ 1 - Bel(X̄), leading to the famous subadditivity property of the theory of evidence:

    Bel(X) + Bel(X̄) ≤ 1

In other words, the belief that one holds in a possibility does not automatically imply one's disbelief in the negation of that possibility. In information indexing and retrieval applications, where Θ is taken to be a keyword lexicon K, this tenet has important implications. For example, if the admittance of new evidence causes a cataloger to increase his belief in the document's relevance to a lexical subset X, the same evidence should not necessarily decrease his belief in the document's relevance to lexical subsets in X̄, especially if the cataloger is not confident in his relevance opinion. In particular, the difference 1 - Bel(X) - Bel(X̄) is called the uncommitted belief with respect to X. If Bel were a Bayesian representation of degrees of belief, the uncommitted belief would be zero by definition. This is best illustrated in the 'state of insufficient reason,' in which one knows absolutely nothing about a set of propositions Θ = {q_1, ..., q_n}. Whereas the common solution is to set Bel(q_i) = 1/n for all q_i ∈ Θ, the theory of evidence would set Bel(Θ) = 1 and Bel(X) = 0 for all the other proper subsets of Θ. This is the case when the uncommitted belief is at maximum.

The interpretation of Bel(·) and Pl(·) as lower and upper probabilities has led many to view the theory of evidence as a novel calculus for eliciting and manipulating interval-valued, rather than point-valued, degrees of belief. Indeed, the theory allows one to express the belief in every hypothesis X by means of the interval [Bel(X), Pl(X)], which may be updated as new evidence about X is admitted. Further, the width of the interval, Pl(X) - Bel(X), is by definition 1 - Bel(X) - Bel(X̄), or the uncommitted belief with respect to X. If the uncommitted beliefs induced by a certain mass function m were zero for all the hypotheses under consideration, the intervals would degenerate to zero widths and Bel would be a standard probability function. Yet in the more general case in which the mass reflects some 'second-order uncertainty,' or 'ambiguity,' the degree of belief in possibilities X drawn from Θ is allowed to 'float' between Bel(X) and Pl(X). One benefit of such a model is that it is more robust and less prone to human errors in assessing subjective degrees of support. We arrive at our last question:

Question Q4: The designer of a DS-based IR application can choose to elicit and represent relevance through three alternative languages: mass functions, belief functions, and belief intervals. What is the relationship among these three representations in the specific context of information indexing and retrieval applications?
Recall that the three functions m, Bel, and Pl are mathematically equivalent, in the sense that knowledge of any one of them (for every possibility) can be used to compute the other two. This equivalence might lead one to conclude that the question of whether to use m, Bel, or [Bel, Pl] to elicit and manipulate degrees of support depends on cognitive and efficiency considerations. As it turns out, this conclusion is quite naive. For example, belief intervals are not as flexible a representation as we would like them to be. That is, when one elicits [Bel, Pl] intervals from a source of evidence, it is not true that the only restriction is that 0 ≤ Bel ≤ Pl ≤ 1. Again, a full understanding of these constraints requires a semantic interpretation, which we now proceed to present.

4 A Canonical Indexing Model

As figure 1 illustrates, the key theme of this paper is the interplay of the theory and practice of the Dempster Shafer model, as viewed through the 'lens' of a particular application. The previous section was structured around the key constructs of the theory: the frame of discernment, mass and belief functions, and Dempster's rule. Coming from the other extreme, this section is structured around the key constructs of the application: taxonomies, relevance functions, and index aggregation operators. This leads to the development of a canonical indexing model, around which the remainder of the paper evolves. In building this model, our intention is to articulate an indexing mechanism which is simple, intuitive, and, most importantly, probabilistic. The main result that we are aiming at is this: notwithstanding its domain-specific origin and its strict probabilistic nature, the canonical model that is expounded here is completely isomorphic to the DS model. This has several important implications. First, the canonical model provides concrete answers to all the questions that were raised about the theoretical fit between the DS theory and information indexing and retrieval applications. Second, because the limitations of the former will be explicit, implicit limitations of the latter will become apparent. Third, because the canonical model makes no use of extra-probabilistic arguments, it also provides a simple probabilistic interpretation of the DS theory, which is often claimed to be an extension of probability theory.

4.1 The Taxonomy

In most IR applications, documents are indexed and sought within a data structure that is called a taxonomy. The taxonomy is a finite set of classes, or categories, designed to organize documents in a particular subject of interest. For example, consider the following set of classes, taken from an art-related taxonomy: C = {Art, Braque, Cubism, Dada, Impressionist, Janco, Modern, Picasso}. Taxonomies are constructed by domain experts - in this case art scholars - who provide two types of information: (i) a set of classes; and (ii) a taxonomical data structure, expressed as ordered pairs of classes. Specifically, if we let (x, y) code the assertion 'y is a direct sub-class of x', then the expert might specify a relation like H = {(Art, Modern), (Art, Impressionist), (Modern, Cubism), (Modern, Dada), (Cubism, Braque), (Cubism, Picasso), (Dada, Picasso), (Dada, Janco)}, resulting in the taxonomy depicted in figure 2.
Put figure 2 around here

Formally, a taxonomy is a rooted directed acyclic graph <C, H>. The node set C represents taxonomical classes, and the edge set H represents a relation on C (i.e., a subset of C × C, giving directed pairs) with two restrictions: (i) no cycles exist in the digraph, and (ii) the digraph contains exactly one root, i.e. a class r ∈ C such that no edge (x, r) exists in H. The descendants of a class x are the subclasses of x, and the predecessors are the generalizations of x. The root of the taxonomy is the only class that (i) has no predecessors, and (ii) generalizes all other classes, e.g. the class Art in figure 2. Since the taxonomy is a finite and acyclical graph, it contains a 'boundary,' or a set of terminal classes. A class k ∈ C is said to be terminal if it has no descendants, i.e. if no edges of the form (k, x) exist in H. In the example of figure 2, the terminal classes are {Braque, Picasso, Janco}. We will denote the set of terminal classes by K. This notation is not coincidental, as K is precisely the keyword lexicon discussed in the previous section.

In a taxonomy, each class c ∈ C is characterized by two sets that we denote LIB(c) and VOL(c) and call the library rooted in c and the libraries that intersect c, respectively. LIB(c) contains all the classes that can be reached by paths beginning at c and following edges all the way 'down' to the terminal classes. This set of classes, which includes c itself, represents the entire set of classes into which c may be decomposed. Conversely, VOL(c) contains all the classes that can be reached by following 'upward' paths beginning at c and ending at the taxonomy's root. This set, which also includes c itself, represents all the classes to which c can be generalized. For example, LIB(Cubism) = {Cubism, Braque, Picasso} and VOL(Cubism) = {Cubism, Modern, Art}. These sets can be given a recursive definition, as follows:

    LIB(c) = {c} ∪ {x ∈ C | ∃y ∈ LIB(c) with (y, x) ∈ H}                 (17)

    VOL(c) = {c} ∪ {x ∈ C | ∃y ∈ VOL(c) with (x, y) ∈ H}                 (18)

A taxonomy is similar to a hierarchical tree structure, with a difference. Unlike a common tree, each class in the taxonomy can have as many parents as we desire. For example, in figure 2 Picasso is a subclass of both Cubism and Dada. Also note in passing that definitions (17-18) imply that (i) LIB(root) = C, so that the root library contains all the classes in the taxonomy; (ii) VOL(root) = {root}, i.e. the root library is not contained by any other library; and (iii) LIB(k) = {k} if and only if k ∈ K, so that terminal classes are characterized by libraries that contain singleton classes only.
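A sketch of the taxonomy as a data structure may help fix ideas (hypothetical Python; the edge set below reproduces the Cubism and Dada fragment of figure 2, with the Impressionist branch omitted for brevity). The functions lib and vol implement the recursive definitions (17-18):

    H = {("Art", "Modern"), ("Modern", "Cubism"), ("Modern", "Dada"),
         ("Cubism", "Braque"), ("Cubism", "Picasso"),
         ("Dada", "Picasso"), ("Dada", "Janco")}
    C = {x for edge in H for x in edge}

    def children(c):
        return {y for (x, y) in H if x == c}

    def parents(c):
        return {x for (x, y) in H if y == c}

    def lib(c):
        """LIB(c): c together with every class reachable 'downward' from c (eq. 17)."""
        result = {c}
        for child in children(c):
            result |= lib(child)
        return result

    def vol(c):
        """VOL(c): c together with every class reachable 'upward' from c (eq. 18)."""
        result = {c}
        for parent in parents(c):
            result |= vol(parent)
        return result

    lexicon = {c for c in C if not children(c)}   # terminal classes: Braque, Picasso, Janco
    lib("Cubism")    # {'Cubism', 'Braque', 'Picasso'}
    vol("Cubism")    # {'Cubism', 'Modern', 'Art'}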
The indexing process: The act of indexing a document within a taxonomy can be described as a top-down, depth-first search process. To illustrate, suppose that an art-related document is to be indexed within the art taxonomy from figure 2. Without loss of generality, assume that the document is relevant to modern art. Beginning at the first level under Modern and proceeding left to right, we test if the document is relevant to Cubism. If the answer is 'yes,' we step down one level and test if it's relevant to Braque. If the answer is 'yes,' we index the document in Braque. If the answer is either 'no' or 'unsure,' we test if it's relevant to Picasso. If the answer is either 'no' or 'unsure,' and assuming that Picasso is the last class below Cubism, we backtrack one level, index the document in Cubism, and proceed to explore Dada. If the document is deemed irrelevant to any one of the classes thus visited, we backtrack one level and index the document under Modern. This would reflect the notion that even though the document is related to modern art, the existing taxonomy fails to discern the exact category to which it belongs. Thus the indexing process involves a depth-first search which is cut off at any class that is deemed to be irrelevant to the indexed document.

We see that the notion of relevance that is consistent with this process is defined over subsets of classes, not over individual classes in K. That is, if a document is indexed under, say, Cubism, it implies that the document belongs to the library LIB(Cubism), i.e., to the collection of documents about Cubism, Braque, or Picasso. This definition of relevance is convenient because it allows us to be as specific as we wish in our relevance statements. If we're sure that a document is relevant to a certain class, we index it under that class. If we're not sure, we can step back and index the document in a library that contains the class. We can do this all the way up to the root of the taxonomy, at which point the indexing decision root would express the opinion that the document belongs somewhere in the library, without specifying exactly where.

Relationship to the theory of evidence: The relationship between a taxonomy <C, H> and a lexical frame of discernment K is simple, but not trivial. From a mathematical perspective, the taxonomy can be viewed as a subset of a graph G whose vertices are indexed by 2^K. In the graph G, there is an edge from the vertex indexed by the subset A ∈ 2^K to the vertex indexed by the subset B ∈ 2^K if and only if A ⊇ B and no other subset C satisfies A ⊃ C ⊃ B. In the taxonomy, which is a subset of G, each vertex c corresponds to the vertex of G that is indexed by the subset of K obtained from the terminal elements in LIB(c). Thus, each class in the taxonomy can be associated with an element in 2^K, namely the subset obtained from the keywords of the terminal classes that can be reached by looking 'downward' from the class in the taxonomy.

While there is this mathematical association, there are important differences between the notion of a taxonomy and the power set of K as used in the DS model. First, we may distinguish between two types of taxonomies: static and adaptive. A static taxonomy consists of a fixed and unmodifiable set of classes, like the Dewey decimal system or the Library of Congress index. An adaptive taxonomy is a dynamic data structure that evolves from the indexing process itself. Such a taxonomy consists of a fixed set of keywords, denoted K, and an 'open-ended' set of classes, each class being a different grouping of keywords from K. That is, when a new document is deemed relevant to a subset of keywords that doesn't make up an existing category, we simply announce this subset a new class and add it to the taxonomy.
Hence, a document titled "A letter from Braque to Janco" may well be indexed in the class {Braque, Janco}, something that would have been impossible in a static taxonomy that doesn't contain such a predefined category. The only restriction that is placed on an adaptive taxonomy is that it must contain at least all the elements in K (as singletons, or classes that are made up of single keywords), as well as K itself. Hence, we begin with the initial set of classes C = {{k_1}, ..., {k_n}, K}, and add more classes to it as we go along.

Thus, the precise relationship between the IR notion of a taxonomy and the theoretical DS notion of a lexical frame of discernment K can be described in two steps. First, any static taxonomy is conceptually a 'frozen' and 'named' version of some adaptive taxonomy. Second, any adaptive taxonomy, in turn, is a subset of the lexical power set 2^K. An example is illustrated in figure 4, using the simple lexicon K = {Braque, Picasso, Janco}. Figure 4-a depicts the lexical power set 2^K (excluding ∅). In practice, dealing with the power set of keywords is unrealistic, since the set of all possible classes becomes prohibitively large even with only a few dozen keywords. However, once the semantics of the lexicon is taken into consideration, many if not most of the classes in 2^K become irrelevant, since they represent arbitrary groupings of keywords that can be excluded from the taxonomy for all practical purposes. If we choose to focus on tree taxonomies only, the power set can be restricted further by disregarding all its non-hierarchical subsets⁵. Figure 4-b depicts a specific adaptive taxonomy that might have emerged from a hypothetical indexing process. By definition, this taxonomy is a subset of the exhaustive 4-a taxonomy. Finally, figure 4-c depicts a 'frozen' and 'named' version of the 4-b taxonomy. The naming procedure is domain-dependent: if certain classes 'make sense' on semantic grounds, they can be given descriptive names that reflect their contents. For example, the class {Braque, Picasso} can be named Cubism, the class {Braque, Picasso, Janco} Modern, etc.

Put figure 4 around here

We now turn to question Q1, which asked whether the DS concept of a lexical power set provides an adequate 'skeleton' for indexing documents in IR applications. The answer to this question is 'yes,' but there is a caveat. Note that there is a subtle difference between a bibliographical taxonomy and a subset of the DS power set: in the former, the classes have names; in the latter, the classes correspond to anonymous lexical subsets. That is, in the logical context of the DS model, to say that a document is relevant to {k_1, k_2} is tantamount to saying that the document is relevant to either k_1 or to k_2. Yet in the context of a bibliographical taxonomy, most lexical subsets have meaningful names, like Cubism and Dada, just like the elementary keywords that make up their contents. Therefore, indexing a document in a named class might mean something quite different than the implication that the document should be indexed in one or more of the class's constituent keywords.

[Footnote 5: Using the notation |X| to represent the cardinality of a set X, characterize each class X ∈ C by the set L(X) = {Y ∈ C : |X| = |Y|}. A taxonomy <C, H> will be a tree taxonomy if and only if for every class X ∈ C, L(X) contains only disjoint sets.]
For example, suppose that a cataloger decided to index the title "Cubist Landscapes" directly in the class Cubism. In the standard DS model, this indexing opinion would imply that "the document is relevant either to Picasso, or to Braque, or to another Cubist artist." Although this interpretation is logically correct, it clearly entails a loss of concrete information about the document's direct relevance to Cubism at large. Also, it leads to a situation in which the set of documents relevant to a class is larger than or equal to the union of the sets of documents relevant to all of its children classes, which is inconsistent with the disjunctive interpretation of a standard DS power set.

How can we augment the power set representation of the DS model so as not to force a cataloger to disregard information about a document's direct relevance to non-singleton classes? By viewing the power set (or the portion of the power set that is in use for indexing) as a taxonomy <C, H>, the problem may be solved by adding to the taxonomy a new set of net classes, as follows. For each non-terminal class c ∈ C, add (i) a new class named net-c to C, and (ii) a new link (c, net-c) to H. The new class net-c, which is a direct terminal descendant of c, can now serve as the index of the documents that are relevant specifically and directly to c. With this modification, each class c becomes a mere tag, or a pointer, and the proposition 'the document is relevant to the class c' is once again equivalent to the proposition 'the document is relevant to the library rooted at c.' Since the net classes are terminal classes, they become elements of the lexicon. Therefore, in a taxonomy which is augmented with a set of net classes, every indexing decision can be interpreted as selecting subsets of relevant keywords (which may include net classes) from the lexicon, so we are back in the familiar disjunctive stance of the DS frame of discernment. Purists may find this solution crude, but the adjustment is necessary if one wants to apply the DS model to information indexing and retrieval applications without violating, or misinterpreting, the set-theoretic premise of the model.
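The net-class augmentation is a mechanical transformation of the taxonomy. The sketch below (hypothetical Python, reusing the edge-set representation of the earlier taxonomy sketch; add_net_classes is our name) adds one terminal net class under every non-terminal class:

    def add_net_classes(C, H):
        """Return an augmented taxonomy in which every non-terminal class c
        has a terminal child net-c for documents relevant directly to c."""
        non_terminal = {x for (x, y) in H}          # classes with at least one child
        C2, H2 = set(C), set(H)
        for c in non_terminal:
            net = "net-" + c
            C2.add(net)
            H2.add((c, net))
        return C2, H2

    H = {("Modern", "Cubism"), ("Modern", "Dada"), ("Cubism", "Braque"),
         ("Cubism", "Picasso"), ("Dada", "Picasso"), ("Dada", "Janco")}
    C = {x for edge in H for x in edge}

    C2, H2 = add_net_classes(C, H)
    # 'net-Cubism' is now a terminal class, so "Cubist Landscapes" can be indexed
    # there directly instead of being read as 'relevant to Braque or to Picasso.'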
4.2 Relevance Functions

The fundamental rule of indexing is that a document should be indexed using certain keywords if prospective users of the document would find it relevant to these keywords. In its most primitive form, then, relevance is a Boolean and subjective relation, indicating categorically that a document d ∈ D is relevant to a lexical subset X = {k_1, ..., k_m} in the view of a particular library patron. However, because bibliographical classes don't have crisp boundaries, and because of the multitude of relevance opinions expressed by different catalogers and library patrons, the more reasonable question is not whether d is relevant to X, but rather what the intensity of this relation is. In other words, we seek to represent relevance in terms of a mapping r : 2^K × D → [0, 1], rather than in terms of a characteristic function r : 2^K × D → {0, 1}. There have been many efforts to interpret relevance on probabilistic grounds, Maron and Kuhns (1960) being the defining article. One of the fundamental problems in this area has been the proper definition of the sample space from which relevance propositions are drawn. This point was alluded to by Maron, as follows:

"The notion of probability of relevance can be interpreted in two different perspectives: of the document, as the proportion of patrons of a given type who would judge that document relevant, and of the patron himself, as the proportion of documents of a given type which he would judge relevant. The first model leads to a theory of probabilistic indexing; the second model leads to a theory of probabilistic query formulation." (Maron, 1982)

In what follows we will focus on Maron's first perspective, in which multiple patrons form relevance opinions about a fixed document. Consistent with Maron's observation, this perspective yields a model of inexact indexing. Unlike Maron, though, the uncertainty associated with the indexes in our model will lead not to probability functions, but rather to Dempster Shafer mass functions, i.e. functions that conform to definition (3).

Let U = {u_1, ..., u_n} be a set of catalogers, and let K be a keyword lexicon. Suppose that each cataloger in U is asked to index the same document using K, i.e. to specify one or more keywords from K that are relevant to the document. Suppose that cataloger u_i supplies the opinion that the document is relevant to the lexical subset X ⊆ K; we then record this opinion by means of the following Boolean function:

    v_i(X) = 1 if u_i indexed the document using X, and 0 otherwise.

Since each cataloger u_i supplies one set of relevant keywords, there will be exactly one subset X ∈ 2^K such that v_i(X) = 1. Also, the empty set is not allowed as a valid relevance opinion. If a cataloger is unwilling to give an opinion, or is unsure about the proper classification of the document, the document is indexed by default in the root class K ∈ C, which is also an element of 2^K. This convention makes sense because the root class represents the entire library, and is therefore the natural place to store documents whose specific class membership is indiscernible.

After all n catalogers have cast their indexing opinions regarding the same document d, we compute for each lexical subset X ∈ 2^K three 'relevance counters,' as follows:

    r(X, d)     = sum over i = 1..n of v_i(X)                    (20)
    r_LIB(X, d) = sum over Y ∈ LIB(X) of r(Y, d)                 (21)
    r_VOL(X, d) = sum over Y ∈ VOL(X) of r(Y, d)                 (22)

In words, r(X), r_LIB(X), and r_VOL(X) count the number of catalogers who classified the document in X, in the library rooted in X, and in libraries that intersect (or, in a hierarchical taxonomy, contain) X, respectively. (When d is fixed in our analysis, we will suppress the explicit dependence and write r(X) instead of r(X, d).)

Relationship to the theory of evidence: Suppose now that the Boolean relevance opinions of the catalogers are averaged over the space of catalogers U through the following computation:

    m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X)    (23)

The resulting function m(X) is a DS mass function over the lexical space K. Formally, we have the following proposition (the proofs are given in a separate appendix):

Proposition 1: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. The real function m : 2^K → [0, 1] defined by m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X) is a mass function, satisfying definition (3).
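As a quick illustration of the construction behind Proposition 1, the sketch below computes m(X) = (1/n)·r(X) from a list of Boolean indexing opinions. The data and helper name are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def mass_from_opinions(opinions):
    """opinions: one non-empty frozenset of keywords per cataloger.
    Returns the induced mass function m(X) = r(X) / n."""
    n = len(opinions)
    r = Counter(opinions)                     # r(X): catalogers who chose exactly X
    return {X: Fraction(c, n) for X, c in r.items()}

# Four hypothetical catalogers indexing the same document over K = {A, B, C}:
opinions = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
m = mass_from_opinions(opinions)
assert sum(m.values()) == 1                   # the masses sum to 1; m(empty set) = 0
# m == {frozenset({'A','B'}): 1/2, frozenset({'A'}): 1/4, frozenset({'B'}): 1/4}
```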
The consequence of the proposition is that DS mass functions arise naturally when we view relevance functions as averages of multiple Boolean indexing opinions. We begin with a space U of n catalogers who are asked to index the same document using the same lexicon K. Each cataloger supplies an individual opinion that specifies which keywords are relevant to the document. Note that the catalogers' indexes are not restricted: they are free to choose any keyword, or combination of keywords, that in their opinion is relevant to the document. Next, shifting our attention from the catalogers space U to the lexical space K, we compute for each lexical subset X ∈ 2^K a measure of 'average relevance,' (1/n)·r(X), which represents the fraction of catalogers who thought that the document was relevant to X. Disregarding the lexical subsets that no cataloger has chosen, we obtain a set of pairs of the form {(Γ_1, r_1), ..., (Γ_l, r_l)} in which Γ_i ∈ 2^K and 0 < r_i = (1/n)·r(Γ_i) ≤ 1.

We are now in a position to answer question Q2, regarding the 'type' of relevance that DS mass functions represent, given the context of multiple relevance opinions. First, the canonical model has yielded the type of relevance numbers that are at the center of any probabilistic indexing model. Second, according to Proposition 1, these numbers form a mass function, consistent with the standard DS model. Finally, the meaning of the mass m(X) is simply the fraction of catalogers who thought that the document was relevant to the set of keywords X.

Following the same line of reasoning, we can also provide an answer to question Q4, which sought an IR interpretation of the meaning and relationship of mass functions, belief functions, and belief intervals. Given the IR context in which Θ = K, it is easily seen that the relevance counters (20-22) are proportional to the mappings that represent degrees of belief in the DS model. Specifically, dividing each counter by n - the number of catalogers - yields the mass, belief, and plausibility functions defined in (3), (11), and (12), respectively:

    m(X)   = (1/n) · r(X)
    Bel(X) = (1/n) · r_LIB(X)
    Pl(X)  = (1/n) · r_VOL(X)

If we combine these observations with the interpretation of the power set of the lexicon as a taxonomy, we see that the mass on a lexical subset X is given by the fraction of catalogers who indexed the document using X directly. Similarly, the belief in X is the fraction of catalogers who indexed the document in libraries within X, and the plausibility of X is the fraction of catalogers who indexed the document in libraries that intersect (in a hierarchical taxonomy, contain) X.

The key component of the canonical model that enables this interpretation of the DS functions is the assumption of multiple patrons and the v_i(·) functions that keep track of their individual indexing opinions. In the canonical model, the assumption of multiple patrons is explicit and is the foundation on which the entire analysis rests. In the DS model, the assumption of multiple Boolean opinions and their respective v_i(·) functions are implicit.
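The same correspondence can be sketched computationally: dividing the three relevance counters by n gives the mass, belief, and plausibility of a lexical subset. The helper below is an illustrative assumption, with LIB and VOL rendered in their power-set readings ('subset of X' and 'intersects X').

```python
from fractions import Fraction

def mass_bel_pl(opinions, X):
    """opinions: non-empty frozensets of keywords, one per cataloger; X: a lexical subset.
    Returns (m(X), Bel(X), Pl(X)) as fractions of the cataloger population."""
    n = len(opinions)
    r     = sum(1 for Y in opinions if Y == X)     # indexed the document in X directly
    r_lib = sum(1 for Y in opinions if Y <= X)     # indexed it in the library rooted at X
    r_vol = sum(1 for Y in opinions if Y & X)      # indexed it in a library intersecting X
    return Fraction(r, n), Fraction(r_lib, n), Fraction(r_vol, n)

opinions = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
m, bel, pl = mass_bel_pl(opinions, frozenset("AB"))
# m = 1/2, Bel = 1, Pl = 1: half the catalogers chose exactly {A, B},
# and every cataloger chose a subset of (hence also a set intersecting) {A, B}.
```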
4.3 Aggregating Relevance

So far, we have assumed that (i) relevance is a two-place function r(X, d) between a document d and a lexical subset X, and that (ii) all the catalogers from whom r(X, d) was elicited were of the same 'type,' using Maron's terminology (see the quote in Section 4.2). In this section we retract both assumptions. Specifically, we argue that relevance, in its most elementary form, is a three-place relation r(X, d, q) in which q is the classifier dimension, or context, in which d is judged to be relevant to X. With that in mind, r(X, d) can be viewed as a measure of aggregate relevance that runs over all the possible contexts in which d's relevance to X is judged. We now turn to describe a pooling mechanism that implements such an aggregation.

Let U1 = {u_1, ..., u_n1} be a group of n1 catalogers who are asked to index a document d using a keyword lexicon K based on a certain classifier, denoted q1. Similarly, let U2 = {u'_1, ..., u'_n2} be a group of n2 catalogers who are asked to index the same document based on another classifier, denoted q2. The semantics of the classifiers depends on the indexing scenario. For example, q1 might be the document's title, whereas q2 might be the document's abstract. Alternatively, in a dynamic model in which the relevance indexes of documents are continuously revised to reflect actual use, q1 and q2 can represent two different information needs, or queries, in the context of which the relevance of d to X was judged, either explicitly or through an automatic keyword extraction algorithm.

For example, let K = {A, B, C} and let U1 and U2 consist of 4 and 3 catalogers, respectively. Assume that within the U1 group, two catalogers index the document in {A, B}, one in {A}, and one in {B}. Within the U2 group, one cataloger indexes the document in {B, C}, one in {A, B}, and one in {B}. These indexing opinions are tabulated in the two tables on the left side of figure 5. The columns of each table represent the common lexicon K = {A, B, C}. The i-th tuple in each table represents, as a binary vector, the relevance opinion elicited from the i-th cataloger in the respective group. To be precise, a 1 in the (i, j)-th table entry indicates that cataloger i has included the j-th keyword in his indexing opinion, and a 0 indicates that he didn't.

Put figure 5 around here

In what follows, we denote the binary vector that represents the relevance opinion of cataloger u_i by w_i. Similarly, the set of all relevance opinions of a group of catalogers will be denoted W = {w_i | u_i ∈ U}. Finally, the group of catalogers U together with their relevance opinions W will be denoted T = <U, W> and referred to as a model. With this notation, consider two groups of catalogers U1 and U2 together with their relevance opinions W1 and W2. If all the catalogers in both groups are considered equally qualified to cast relevance opinions, then a variety of different pooling mechanisms may be used to compute the aggregate index induced by all the catalogers. Symbolically, we seek an operator ⊗ to compute the model <U, W> = <U1, W1> ⊗ <U2, W2>.

The pooling mechanism depicted in figure 5, denoted hereafter by ⊗, implements an operator that was described by Hummel and Landy (1988) as "a consensus opinion formed by the committees of two."
Here, the set of all possible committees is U = U1 × U2, consisting of all the n1 · n2 unique pairs of catalogers that can be drawn from U1 and from U2. The combined relevance opinion associated with the pair (u_i, u'_j) ∈ U is defined to be the binary conjunction of the individual opinions of u_i ∈ U1 and u'_j ∈ U2, which we denote w_{i,j} = w_i · w'_j. For example, consider the first tuple in the U table in figure 5. This tuple gives the opinion of the committee (u_1, u'_1), i.e. w_{1,1} = (0, 1, 0). This opinion is the binary conjunction of the individual opinions w_1 = (1, 1, 0) and w'_1 = (0, 1, 1), as given by catalogers u_1 and u'_1, respectively.

The pooling operation ⊗ is completed by treating U as a new group of catalogers and using (23) to compute the mass function that it induces:

    m'(X) = (1/(n1 · n2)) · |{(u_i, u'_j) ∈ U : the committee (u_i, u'_j) indexed the document in X}|

Note that m' is not necessarily a mass function, since ⊗ can yield a result like m'(∅) > 0. This happens when there is a pair of opinions (e.g. u_2 and u'_1 in our example) such that the conjunction of the opinions gives the empty set, even though neither opinion gives the empty set individually. To resolve the problem, we normalize m'(·) as follows:

    m(X) = m'(X) / (1 - m'(∅))  for X ≠ ∅,  and m(∅) = 0.

In words, for each lexical subset X ⊆ K, m'(X) is the fraction of the (paired) catalogers who classified the document in that subset. Next, the fraction of the catalogers who agreed on nothing - m'(∅) - is redistributed, in proportion, among the fractions of catalogers who agreed on something, yielding a new mass that sums to unity. This function is now taken to be the 'aggregate index' of the document d, implying the taxonomy depicted at the top right of the figure. We may also view m(X) as the fraction of (paired) catalogers who index the document in X among those paired catalogers who do not index the document in the empty set ∅. That is, if we discard the pairs that agree on no relevant keywords, the pooled relevance of the remaining pairs yields a proper mass function m.
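A short computational sketch of the committee construction, reproducing the figure 5 numbers: every pair of catalogers forms the component-wise conjunction of its two binary opinions, and the mass that lands on the all-zero vector (the empty set) is then normalized away. The vector encoding and variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Binary relevance opinions over K = (A, B, C), one vector per cataloger.
W1 = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 1, 0)]   # U1: {A,B}, {A,B}, {A}, {B}
W2 = [(0, 1, 1), (1, 1, 0), (0, 1, 0)]               # U2: {B,C}, {A,B}, {B}

# Committees of two: the pooled opinion is the component-wise conjunction.
committees = [tuple(a & b for a, b in zip(w1, w2)) for w1, w2 in product(W1, W2)]

n = len(committees)                                   # n1 * n2 = 12 pairs
m_prime = {X: Fraction(c, n) for X, c in Counter(committees).items()}

# m' may put mass on the all-zero vector (the empty set); normalize it away.
conflict = m_prime.pop((0, 0, 0), Fraction(0))
m = {X: v / (1 - conflict) for X, v in m_prime.items()}
# m == {(0,1,0): 7/10, (1,1,0): 1/5, (1,0,0): 1/10}
```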
Relationship to the theory of evidence: In order to explore the relationship of the multiple catalogers/multiple classifiers scenario to the DS model, we first have to step back and say a few words about the role of 'sources of evidence' in the latter. Basically, the DS theory models a situation in which a finite set of 'pieces' or 'sources' of evidence E = {e_1, ..., e_n} is used to discern the likelihoods of various possibilities X drawn from a common frame of discernment. Yet the identity of the sources of evidence is rather implicit in the model's language. That is, the common notation m_i(X) and Bel_i(X) is meant to be shorthand for the mass and belief functions m(X|e_i) and Bel(X|e_i), where e_i is the source of evidence whose 'support' of the possibility X we are trying to capture. The total support that the body of evidence E lends to X is computed through Dempster's rule (7-8), which yields a new function of the form m(X|e_1, ..., e_n) = m(X|e_1) ⊕ ... ⊕ m(X|e_n) (Dempster's rule (7-8) is commutative and associative, so its extension from 2 to n operands is straightforward). For simplicity's sake, we denote the latter function m(X), which reads 'the mass that the possibility X attains after all the available evidence has been taken into consideration.' With that, the relationship between the canonical model and the DS model is as follows: possibilities correspond to lexical subsets, and sources of evidence correspond to classifiers, i.e. to different aspects of the document (title, abstract, author, etc.) that help discern the document's proper classification. The missing piece in the analogy is the set of catalogers who inspect each classifier individually and cast Boolean relevance opinions based on that information. In the DS model, these catalogers are implicit. In the canonical model, they are the driving force of the entire analysis. Another way to interpret the group of catalogers is to view them as a group of library patrons who approach the same document with different information needs (or queries) in mind, each corresponding to a piece of evidence that highlights one facet of the composite relation that we call 'relevance.'

How should we combine this multitude of relevance opinions into an aggregate index? In the canonical indexing model, the opinions are combined at the catalogers level, through the cartesian consensus operator ⊗. In the DS model, where the catalogers space is implicit, the opinions are combined at the classifiers level, via Dempster's rule ⊕. The key point, as illustrated in figure 5, is that both combination methods lead to precisely the same result. Formally, we have the following proposition:

Proposition 2: Let T1 = <U1, W1> and T2 = <U2, W2> be two sets of catalogers together with their Boolean relevance opinions, and let T = <U, W> be the outcome of T = T1 ⊗ T2, as follows: (i) U = U1 × U2; and (ii) W = {w_{i,j} = w_i · w'_j | w_i ∈ W1 and w'_j ∈ W2}. Let ⊕ be Dempster's rule as it is applied to mass functions. Let m_{T1}, m_{T2}, and m_{T1⊗T2} be the mass functions induced by the models T1, T2, and T1 ⊗ T2. Then we have the following: m_{T1⊗T2} = m_{T1} ⊕ m_{T2}.

Proposition 2 serves to shed light on the prescriptive nature of Dempster's rule. That is, once we accept the fact that Dempster's rule ⊕ is isomorphic to the cartesian consensus operator ⊗, a whole set of questions emerges: (1) why are the individual catalogers forced to specify only Boolean, and not probabilistic, relevance opinions? (2) why are the groups of catalogers joined using a set product operator, as opposed to other set combination operators, e.g. union? (3) why committees of two, and not, say, committees of three? (4) why are the individual relevance opinions combined using a binary conjunction rule? (5) why are all cataloger opinions given the same weight, when in practice some opinions may be more informed or worthy than others?

A proper answer to these questions requires an elaborate research program, involving both theoretical and empirical work. Also, the exact nature of the combination rule can vary from one situation to another. In the specific context of information indexing and retrieval, one can think of a family of indexing models, designed to operate under different sets of assumptions. For example, if the catalogers prefer to express binary relevance opinions, we can use Dempster's rule (or the equivalent ⊗) to combine them. If they wish to express relevance by selecting a number between 0 and 1, we can modify the combination rule to accommodate this language as well (this will be similar to the way Yen (1989) extended Dempster's rule in the GERTIS system). If the catalogers wish to use a discrete language such as 'remotely relevant,' 'somewhat relevant,' etc., we can develop a fuzzy version of the rule. The key point here is that the precise definition of ⊗, along with Proposition 2, provides clear guidelines as to (i) which aspect of the combination rule has to be modified, and (ii) what the normative relationship will be between the modified rule, Dempster's rule, and probability theory.
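Before turning to the conclusion, Proposition 2 can be checked numerically on the figure 5 data: pooling the opinions with the cartesian consensus operator at the catalogers level, and combining the two induced mass functions with Dempster's rule at the classifiers level, give the same result. The implementation below is an illustrative sketch under these assumptions, not the paper's own software.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def induced_mass(opinions):
    """Mass function induced by a group of catalogers: the fraction, among
    those who express a non-empty opinion, whose opinion is exactly X."""
    nonempty = [X for X in opinions if X]
    return {X: Fraction(c, len(nonempty)) for X, c in Counter(nonempty).items()}

def consensus_pool(W1, W2):
    """Cartesian consensus: committees of two, opinions combined by intersection."""
    return [w1 & w2 for w1, w2 in product(W1, W2)]

def dempster(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    joint = Counter()
    for A, a in m1.items():
        for B, b in m2.items():
            joint[A & B] += a * b
    conflict = joint.pop(frozenset(), Fraction(0))
    return {X: v / (1 - conflict) for X, v in joint.items()}

# Figure 5 data: U1 has 4 catalogers, U2 has 3, over the lexicon K = {A, B, C}.
W1 = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
W2 = [frozenset("BC"), frozenset("AB"), frozenset("B")]

pooled   = induced_mass(consensus_pool(W1, W2))           # catalogers level
combined = dempster(induced_mass(W1), induced_mass(W2))   # classifiers level
assert pooled == combined
# Both routes give m({B}) = 7/10, m({A,B}) = 1/5, m({A}) = 1/10.
```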
5 Conclusion

All the major implications of the research were already discussed in the body of the paper. We conclude with several comments regarding (i) efforts to apply the DS model to information indexing and retrieval applications; and (ii) efforts to interpret the theory of evidence on logical or probabilistic grounds.

Information Indexing and Retrieval: One objective of the paper was to articulate a concrete relationship between the Dempster Shafer model and information indexing and retrieval applications. The relationship that we expounded can be summarized as follows:

    IR application                               Dempster-Shafer model
    -------------------------------------------  -------------------------------
    keyword lexicon (K)                          frame of discernment (Θ)
    taxonomy (<C, H>)                            named subset of 2^Θ
    classification criteria (q_i)                sources of evidence (e_i)
    groups of catalogers (U_j)                   implicit
    individual indexing opinions (v_i)           implicit
    relevance measure to class (r(X))            mass function (m)
    relevance measure to library (r_LIB(X))      belief function (Bel)
    relevance measure to volume (r_VOL(X))       plausibility function (Pl)
    relevance aggregation operator (⊗)           Dempster's rule (⊕)

We hope that the details of this 'mapping,' as discussed in the paper, will promote a better understanding of the proper way to apply the DS model to IR applications. In addition, the mapping provides a practical foundation for building a variety of different indexing algorithms. These algorithms can use the ⊗ combination rule, or versions thereof, as called for by the application. Ultimately, the success of one relevance calculus or another will depend on face validity and on field performance considerations.

The Dempster Shafer theory of evidence: Several authors have provided canonical examples that explain the rationale of the DS model by way of analogy. Zadeh (1986) illustrated how mass functions and Dempster's rule can be mapped onto fuzzy queries about interval-valued, rather than point-valued, attributes in a relational database. Gordon and Shortliffe (1985) gave a compelling interpretation of how a DS calculus can be used to represent and combine the degrees of belief that clinical symptoms (pieces of evidence) render to classes of bacterial organisms (disjunctions of hypotheses), whose set relationships form a hierarchy. Coming from a different, domain-independent, direction, Hummel and Landy (1988) analyzed the probabilistic foundation of the theory of evidence in general, without making any assumptions about the underlying domain or the logical structure of the hypotheses. In contrast to other researchers who attempted to interpret high-level constructs of the DS model directly (e.g. Baron, 1987, Kyburg, 1987, and Schocken and Kleindorfer, 1989), Hummel and Landy took a more fundamental viewpoint that showed how the theory's constructs are implicitly linked to statistics of the opinions of hypothetical experts.
However, their abstract mathematical analysis made no use of canonical examples, and it is difficult to interpret its implications for practical domains of application. With that in mind, one objective of this paper was to illustrate how constructs of the DS model that have so far defied straightforward interpretation yield to a plausible interpretation in the practical context of a multi-classifier/multi-cataloger model. We have seen, in Propositions 1 and 2, that the canonical model leads to exactly the same set of functions and formulae as the DS model. Hence, from a mathematical perspective, the canonical model is isomorphic to the DS model. Yet from a semantic perspective, it invokes the notion of multiple catalogers. Although the notion of multiple patrons appears in several major interpretations of bibliographical relevance (Maron and Kuhns, 1960, Maron, 1982), it may or may not exist in other applications.

To what extent, then, are we forced to accept the canonical interpretation of multiple catalogers in principle? One can simply reject the notion, avoiding the isomorphism by denying the possibility of multiple opinions, and relying simply on the DS theory as presented in Section 2. In that case, however, one is left with philosophical questions like Q1 through Q4. There could, of course, be other interpretations. However, in a real sense, all valid interpretations must be accepted or explained. That is, either the interpretation is accepted as is, or one must show how another set of semantic constructs provides a plausible interpretation of the theory. One advantage of our approach is that new calculi can be developed, different from the DS combination rule, that might better suit particular applications, based on modifications of the canonical model. It is precisely the unsatisfactory elements of this canonical model that permit us to systematically seek improved methods for managing uncertainty.

Since our analysis was strictly probabilistic, it seems to be consistent with Lindley's observation that "Anything that can be done with belief functions can better be done with probability theory" (Lindley, 1987, p. 38). However, we believe that this argument misses an important point. To use a crude but useful analogy, it would be unreasonable to write off a programming language like Pascal simply because every Pascal program can be rewritten in machine language. Just as high-level languages provide complex structures for dealing with specialized problems, the DS model provides non-elementary functions and operators that lend themselves nicely to certain domains of application, information indexing and retrieval being one such example.

We conclude that the Dempster Shafer theory of evidence provides an attractive framework for supporting information indexing and retrieval applications, and that these applications, in turn, serve to highlight the internal validity and limitations of the theory. Dempster's rule remains a controversial operator for combining degrees of belief, but this paper has illustrated that it is just one member in a family of parametric combination rules, and that the question of whether to use this rule or another is more a matter of reasoned choice than a matter of adhering to a fixed set of formulae.

Appendix: Proofs

Proposition 1: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. The real function m : 2^K → [0, 1] defined by m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X) is a mass function, satisfying definition (3).
Proof: For each class X ∈ 2^K, either all, some, or none of the catalogers indexed the document in X. Hence, r(X) = n, or 0 < r(X) < n, or r(X) = 0, respectively, implying that 0 ≤ m(X) ≤ 1. Hence, m(·) is a mapping from 2^K to [0, 1], satisfying the first requirement of being a mass function. The second requirement is that the function must sum to 1 over all the subsets of K. This is proved as follows. For each cataloger u_i, exactly one of the subsets X ⊆ K is such that v_i(X) = 1. For all other subsets Y, v_i(Y) = 0. Thus the sum over X ⊆ K of v_i(X) equals 1. We thus have the following:

    sum over X ⊆ K of m(X) = sum over X ⊆ K of (1/n) · sum over i = 1..n of v_i(X)
                           = (1/n) · sum over i = 1..n of ( sum over X ⊆ K of v_i(X) )
                           = (1/n) · n = 1

Further, since no cataloger gives ∅ as his opinion, it is always true that v_i(∅) = 0. Therefore, the third requirement of definition (3) is satisfied. Thus m is a mass function.

Definition of the ⊗ combination rule: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. To denote the fact that the keyword k ∈ K was included in the indexing opinion of the i-th cataloger, we use the following notation:

    w_i(k) = 1 if v_i(X) = 1 and k ∈ X, and 0 otherwise.                               (29)

If K = {k_1, ..., k_m}, the binary vector obtained from w_i(k_1), ..., w_i(k_m) is denoted w_i and called the Boolean relevance opinion of u_i. The collection of all such opinions of members of U is denoted W = {w_i | u_i ∈ U}. To combine the relevance opinions of two sets of catalogers <U1, W1> and <U2, W2>, we use the following formulae:

    U = U1 × U2,  and for each pair (u_i, u'_j) ∈ U and each keyword k ∈ K:
    w_{i,j}(k) = w_i(k) · w'_j(k)

where w_i(k) and w'_j(k) are as defined in (29) for u_i ∈ U1 and u'_j ∈ U2.

In Section 4.2 we have shown how a mass function can be constructed from a set of catalogers (Proposition 1). Specifically, recall that the mass function induced by the model T = <U, W>, denoted hereafter m_T(X), gives the fraction of catalogers in U, among those catalogers who express an opinion (i.e. w_i ≠ 0), whose relevance opinion exactly matches X. These are the catalogers for whom w_i(k_j) = 1 if and only if k_j ∈ X. For T = <U, W>, this fraction can be written down exactly:

    m_T(X) = |{u_i ∈ U : w_i(k) = 1 if and only if k ∈ X}| / |{u_i ∈ U : w_i ≠ 0}|   for X ≠ ∅.

Of course, m_T(∅) = 0. We are now in a position to prove the following.

Proposition 2: Let T1 = <U1, W1> and T2 = <U2, W2> be two sets of catalogers together with their Boolean relevance opinions, and let T = <U, W> be the outcome of T = T1 ⊗ T2, as follows: (i) U = U1 × U2; and (ii) W = {w_{i,j} = w_i · w'_j | w_i ∈ W1 and w'_j ∈ W2}. Let ⊕ be Dempster's rule as it is applied to mass functions. Let m_{T1}, m_{T2}, and m_{T1⊗T2} be the mass functions induced by the models T1, T2, and T1 ⊗ T2. Then we have the following: m_{T1⊗T2} = m_{T1} ⊕ m_{T2}.

Proof: This proposition asserts a relationship between the general Dempster Shafer model and the canonical indexing model presented in Section 4. The fact that the mapping from one model to the other is homomorphic follows from Hummel and Landy (1988), but we will supply an independent argument here in the context of the indexing model. Let us assume that there are n1 catalogers in U1 and n2 catalogers in U2, and let us fix a particular non-empty lexical subset X of the lexicon K.
We wish to show that

    m_{T1⊗T2}(X) = (m_{T1} ⊕ m_{T2})(X).                                              (34)

Beginning with the right-hand side of (34) and using the definition of Dempster's rule ⊕, (m_{T1} ⊕ m_{T2})(X) is equivalent to

    [ sum over A ∩ B = X of m_{T1}(A) · m_{T2}(B) ] / [ sum over A ∩ B ≠ ∅ of m_{T1}(A) · m_{T2}(B) ]

Multiplying top and bottom by n1 · n2 and distributing, we obtain

    [ sum over A ∩ B = X of n1·m_{T1}(A) · n2·m_{T2}(B) ] / [ sum over A ∩ B ≠ ∅ of n1·m_{T1}(A) · n2·m_{T2}(B) ]   (36)

Recalling how mass functions are induced from the opinions of groups of catalogers (Eqn. 23 in Section 4.2), we may interpret this expression as follows. The value n1·m_{T1}(A) counts the number of catalogers in U1 who have indexed the document in the lexical subset A. Likewise, n2·m_{T2}(B) counts the number of catalogers in U2 who have indexed the document in the lexical subset B. Hence, the product n1·m_{T1}(A) · n2·m_{T2}(B) counts the number of distinct pairs of catalogers (u_i, u'_j) in U1 × U2 where u_i ∈ U1 has indexed the document in A and u'_j ∈ U2 has indexed the document in B. Now, according to the way ⊗ is defined, if u_i has indexed in A and u'_j has indexed in B, then the pair of catalogers (u_i, u'_j) ends up indexing the document in A ∩ B = X. Thus, the numerator of expression (36) counts all the cataloger pairs that end up indexing the document in X. Precisely the same argument can be used to show that the denominator of (36) counts all the pairs of catalogers who do not index the document in ∅. Thus (36) gives the fraction of cataloger pairs in U1 × U2 that have indexed the document in X, out of the pairs of catalogers in U1 × U2 who have indexed the document in some non-empty lexical subset, which is exactly the definition of m_{T1⊗T2}(X), the left-hand side of (34).

References

[1] J. Baron. (1987). Second-order probabilities and belief functions. Theory and Decision.

[2] G. Biswas, J.C. Bezdek, M. Marques, and V. Subramanian. (1987). Knowledge assisted document retrieval (I and II). Journal of the American Society for Information Science, 38(2):83-110.

[3] R. Buxton. (1989). Modeling uncertainty in expert systems. International Journal of Man-Machine Studies, 31:415-476.

[4] A.P. Dempster. (1967). Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38:325-339.

[5] A.P. Dempster. (1967). Upper and lower probability inferences based on a sample from a finite univariate population. Biometrika, 54:515-528.

[6] J. Gordon and E.H. Shortliffe. (1985). A method for managing evidential reasoning in a hierarchical hypothesis space. Artificial Intelligence, 26:323-357.

[7] R.A. Hummel and M.S. Landy. (1988). A statistical viewpoint on the theory of evidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):235-247.

[8] H.E. Kyburg. (1987). Bayesian and non-Bayesian evidential updating. Artificial Intelligence, 31:271-293.

[9] D.V. Lindley. (1987). The probability approach to the treatment of uncertainty in artificial intelligence and expert systems. Statistical Science, 2(1):17-24.

[10] M.E. Maron. (1982). Associative search techniques versus probabilistic retrieval models. Journal of the American Society for Information Science, 308-310.

[11] M.E. Maron and J.L. Kuhns. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7(3):216-244.

[12] G. Salton and C. Buckley. (1988). Term weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.
[13] G. Salton and M.J. McGill. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.

[14] T. Saracevic. (1975). Relevance: a review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, Nov-Dec:321-342.

[15] S. Schocken and P.R. Kleindorfer. (1989). Artificial intelligence dialects of the Bayesian belief revision language. IEEE Transactions on Systems, Man, and Cybernetics, 19:1106-1121.

[16] G. Shafer. (1976). A Mathematical Theory of Evidence. Princeton University Press.

[17] G. Shafer. (1987). Probability judgement in artificial intelligence and expert systems. Statistical Science, 2(1):3-44.

[18] R.M. Tong and D.G. Shapiro. (1985). Experimental investigations of uncertainty in a rule-based system for information retrieval. International Journal of Man-Machine Studies, 22:265-282.

[19] H. Turtle and W.B. Croft. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222.

[20] J. Yen. (1989). GERTIS: a Dempster-Shafer approach to diagnosing hierarchical hypotheses. Communications of the ACM, 32(5):573-585.

[21] L.A. Zadeh. (1986). A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination. The AI Magazine, 85-90.

Figure 1: A pictorial description of the paper's methodology. Section 3 uses the terminology and rationale of the Dempster Shafer theory to derive a DS indexing model for IR applications (top arrow). Taking the opposite direction, Section 4 builds a canonical indexing model that is based on the domain-specific requirements of IR applications. As it turns out, the canonical model provides a probabilistic and domain-independent interpretation of the Dempster Shafer theory of evidence.

Figure 2: An excerpt from an art-related taxonomy designed to classify documents on major artists and artistic movements.

Figure 3: An illustration of the relationship among the mass (top), belief (left), and plausibility (right) functions that represent the same set of primitive degrees of support.
Figure 4: The evolution of a taxonomy from a lexical power set. Panel 4-a shows the lexical power set of K = {Braque, Picasso, Janco} (excluding the empty set); panel 4-b shows an adaptive taxonomy drawn from it; panel 4-c shows a 'frozen' and 'named' version of the same taxonomy, with the classes modern and cubism.

Figure 5: The individual indexing opinions of two groups of catalogers (U1 and U2) are recorded at the bottom of the figure. These opinions induce two different taxonomies and two different relevance functions, m(X, d, q1) and m(X, d, q2), depicted at the top of the figure. The combination of the relevance taxonomies via Dempster's rule ⊕ at the classifiers level and the combination of the opinions via the cartesian consensus rule ⊗ at the catalogers level lead to the same pooled index, depicted at the top right of the figure.