On the Use of the Dempster Shafer Model in Information Indexing and Retrieval Applications

Shimon Schocken, Stern School of Business, New York University
Robert A. Hummel, Courant Institute of Mathematical Sciences, New York University

Working Paper Series STERN IS-92-27

Department of Information Systems, Management Education Center, Stern School of Business, New York University, 44 W. 4th Street, New York, NY 10003.
Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012.

On the Use of the Dempster Shafer Model in Information Indexing and Retrieval Applications

The Dempster Shafer theory of evidence concerns the elicitation and manipulation of degrees of belief rendered by multiple sources of evidence to a common set of propositions. Information indexing and retrieval applications use a variety of quantitative means - both probabilistic and quasi-probabilistic - to represent and manipulate relevance numbers and index vectors. Recently, several proposals were made to use the Dempster Shafer model as a relevance calculus in such applications. The paper provides a critical review of these proposals, pointing at several theoretical caveats and suggesting ways to resolve them. The methodology is based on expounding a canonical indexing model whose relevance measures and combination mechanisms are shown to be isomorphic to Shafer's belief functions and to Dempster's rule, respectively. Hence, the paper has two objectives: (i) to describe and resolve some caveats in the way the Dempster Shafer theory is applied to information indexing and retrieval, and (ii) to provide an intuitive interpretation of the Dempster Shafer theory, as it unfolds in the simple context of a canonical indexing model.

Keywords: Theory of evidence, Dempster Shafer model, relevance measures, information indexing and retrieval.

1 Introduction

Consider a finite and exhaustive set of mutually-exclusive propositions and a body of evidence that supports some subsets of propositions and discounts others. Many theories were put forward to describe how one should represent and update one's degrees of belief in such propositions when new or additional evidence is brought to bear. The classical approach is to cast degrees of belief as probabilities - a set of numbers between 0 and 1 that obeys the axioms of subjective probability - and use Bayesian inference rules to revise them in light of new evidence. One problem with this approach is that it doesn't offer a clear way to model the various degrees of 'uncommitted belief,' or 'second order uncertainty,' that characterize most realistic inference problems. For example, consider the extreme case of 'insufficient reason,' in which one knows absolutely nothing about a given set of n propositions. The common solution, which goes back to Laplace, is to assign a degree of belief of 1/n to each of the propositions under consideration. Incidentally, this is also the solution that emerges from maximizing the unconstrained entropy function associated with the n unknown probabilities.

Over the years, many students of belief revision theories have objected to this crude quantification of insufficient reason. Why, the argument goes, should ignorance be translated to the strong statement that every proposition (or state of nature) is equally likely?
This criticism has led to several alternative models that attempt to capture the elusive notion of uncommitted belief by modifying the axiomatic framework of probability theory. Perhaps the best known model in this category is the 'theory of evidence,' originated by Dempster's (1967, 1967a) work on upper and lower probabilities. Dempster's ideas, which were based on a frequentist view of inference, were refined and extended by Shafer (1976), who also gave them a subjective interpretation. This led to the Dempster Shafer model - an elaborate formalism for representing and revising degrees of support rendered by multiple sources of evidence to a common set of propositions¹.

When the work of Dempster and Shafer was 'discovered' by the artificial intelligence community, it immediately stirred considerable interest in two application areas in which normative models of belief formation play a key role: expert systems, and information indexing and retrieval systems. For expert systems, the Dempster Shafer (DS) model provides a mathematically-sound model for representing and manipulating rule-based degrees of belief, an area that was traditionally dominated by ad-hoc belief revision calculi whose relationship to probability theory was at best murky. For information indexing and retrieval systems, the DS model can be used as a relevance calculus, designed to quantify and revise the degrees of relevance between documents, keywords, and user-supplied queries.

This line of thought has led to the development of several DS-based information indexing and retrieval applications. For example, Biswas, Bezdek, Marques, and Subramanian (1987) built a document retrieval system in which the relevance of documents to taxonomical classes was measured and manipulated, respectively, by belief functions and Dempster's rule: "We choose to define similarity functions based on the Dempster Shafer theory of evidence ... one of the advantages of this approach is that it reflects the process of belief revision and updating just as in human reasoning processes." (Biswas et al, 1987). Coming from a different direction, Turtle and Croft (1991) describe a canonical representation in which relevance is handled through inference networks that are structured as directed acyclic graphs. The nodes in the networks correspond to keywords, documents, and queries, and "the arcs joining the nodes are interpreted as assertions that the parent node provides support for the child node." Turtle and Croft proposed to operationalize these degrees of support through either subjective probabilities, or DS belief functions. A similar approach was undertaken in RUBRIC, a full-text information retrieval system described by Tong and Shapiro (1985). RUBRIC can be instantiated to operate with several alternative relevance calculi, the DS model being a prime example.

The importance of such applications is obvious, as they attempt to take the DS model 'out of the lab' and implement it in realistic settings. In doing so, however, many adopters of the DS model have taken the model's validity for granted, either explicitly or implicitly.

[Footnote 1: In this paper, the terms 'the Dempster Shafer theory of evidence' and 'the Dempster Shafer model' are used interchangeably.]
With that in mind, it is important to point out that both the cognitive and the normative roots of the DS model are still a matter of intense controversy: whereas Shafer (1987) argues that the theory of evidence is a natural extension of probability theory, many critics, e.g. Lindley (1987), view it as a reformulated version of a specialized, albeit interesting, case of classical probabilistic inference. The debate is not helped by the somewhat forbidding notation of the DS model, which prevents an intuitive understanding of its underlying structure and philosophy.

In fact, the gap between the theory and practice of the DS model seems to be two-directional. On the one hand, many practitioners believe that the normative correctness of the DS model is a 'closed case,' proceeding to implement it without questioning its underlying rationale. On the other hand, many researchers try to defend the DS model on complex philosophical and mathematical grounds, without realizing that simpler justifications can be found in the field, i.e. in the way the model is actually used in certain canonical settings. The latter point is worth emphasizing: a close examination of certain applications of the DS model can provide not only a better understanding of the model, but, furthermore, a compelling normative justification.

The plan of the paper is depicted in figure 1 and described as follows. §2 presents the notion of index vectors and the challenge of eliciting and measuring relevance in a normative, rather than ad-hoc, fashion. §3 gives an overview of the DS model, as it unfolds in the context of a typical information indexing and retrieval (IR) application. This sets the stage for four critical questions regarding the theoretical fit between the general features of the DS model and the specific requirements of IR applications. In order to answer these questions, §4 presents a canonical indexing model in which the notions of lexicons, taxonomies, and relevance are treated formally and unambiguously. It is then shown that the canonical model is completely isomorphic to the DS model, leading to a new intuitive understanding of the latter. §5 offers concluding remarks about the implications of the research on the DS model and on IR applications.

Put figure 1 around here

2 The Problem

Models of bibliographical indexing concern the construction of data structures that enable rapid content-based access to collections of documents. Given a document, on the one hand, and a keyword lexicon, on the other, the goal of the indexing model is to select a subset of keywords that 'best' describes the document to its prospective users. Since some keywords are more relevant to the document than others, a numeric scale is often used to express the strength of association between the document and the selected keywords. The result is an index vector, consisting of pairs of keywords and their respective relevance weights. Several models exist for representing and manipulating such relevance vectors, and the reader is referred to Salton and McGill (1983) and to Salton and Buckley (1988) for comprehensive treatments of the general approach to the subject.

Formally, let D be a set of documents about a certain domain of interest, and let K = {k_1, ..., k_m} be a lexicon, or a set of domain keywords.
The index of each document d ∈ D is a set of pairs of the form:

    S_d = {(K_1, r_1), ..., (K_n, r_n)}                                  (1)

where K_i ⊆ K and 0 ≤ r_i ≤ 1, i = 1, ..., n. The K_i's are lexical subsets, representing different groupings of keywords, and the r_i's are called relevance numbers. Taken together, the pair (K_i, r_i) ∈ S_d says that the degree of relevance between the document d and the lexical subset K_i is r_i. Had we restricted the K_i's to be singletons only, (1) would become the familiar 'term-weight vectors' that are normally used in information indexing and retrieval applications. Further, had we required that all the r_i be 0 or 1 only, (1) would be reduced to the familiar keyword list (also called 'subject headings') that is normally used to classify articles in professional journals. Given the obvious simplicity of a Boolean indexing scheme, why bother about developing formalisms for weighted indexing?

The answer is that relevance is a subjective and composite relation, based on an aggregation of several indexing opinions. Specifically, each document has many classifiers, or discerning characteristics, that determine its relevance. For example, the title of a document can suggest one index, whereas the abstract can suggest another. Other aspects of the document, obtained through lexical, linguistic, and citation analyses, will yield additional indexing opinions that must be taken into consideration. Hence, even if the individual opinions were forced to be binary, their aggregation would probably induce a continuous index. In addition, the indexing opinions are not cast automatically; rather, they are elicited from human catalogers who inject yet another level of uncertainty and subjectivity into the indexing process. That is, when two catalogers are given access to the same classifier as background information, they may well supply two different (but hopefully similar) indexing opinions.

Different IR applications use different models to handle this pluralism in a formal way. From a theoretical perspective, the credibility of these models hinges on their capacity to elicit, represent, and synthesize relevance opinions in a normative, rather than ad-hoc, fashion. In order to do so, the relevance numbers and the rules that combine them must be given a compelling interpretation. So far, the leading interpretation in the study of relevance has been probabilistic, beginning with the seminal work of Maron and Kuhns (1960). Recently, however, several attempts were made to handle relevance in IR applications using the Dempster Shafer model, which is widely considered to be a less restricted extension of probability theory. The strengths and weaknesses of the latter approach are discussed in the next section.

3 A Dempster-Shafer Indexing Model

The DS theory of evidence concerns the representation and manipulation of degrees of support rendered by different sources of evidence to a common set of propositions, denoted Θ and called the frame of discernment. In contrast to a standard Bayesian design, in which degrees of belief are normally assigned to elements of Θ directly, the DS model assigns degrees of belief to subsets of propositions, i.e. to members of the power set 2^Θ, also called 'possibilities.' The DS model offers several complementary ways to express evidential support in possibilities.
In particular, the model defines three mappings from 2^Θ to [0, 1], termed mass, belief, and plausibility functions. The three mappings are mathematically equivalent in the sense that knowledge of any one of them (for every possibility) can be used to compute the other two. Therefore, we view them as alternative means to keep score of the same primitive set of degrees of support. In the standard model, when several sources of evidence support a common set of possibilities (the support can be cast in terms of either mass, belief, or plausibility functions), the overall support in the possibilities is computed through Dempster's rule of combination.

What is the nexus of the DS model and information indexing and retrieval applications? In one way or another, all DS-based IR applications are based on the following premises: (i) the DS notion of degrees of support can be used to operationalize the IR notion of relevance numbers; and (ii) when two or more classification criteria supply different sets of relevance numbers concerning the same document, Dempster's rule provides a plausible mechanism to combine them into a composite index (said otherwise: revise the relevance of the document to certain keywords in light of new evidence). The goal of this section is to motivate a critical analysis of these premises. Specifically, we intend to:

- First, provide a rigorous but accessible overview of the DS model, as it unfolds in the familiar context of an IR application;
- Second, present a series of questions regarding the theoretical fit between the general features of the DS model and the specific characteristics of IR applications.

The Frame of Discernment: The frame of discernment Θ is an exhaustive set of mutually exclusive elements that can be interpreted as hypotheses, propositions, or simply 'labels.' The power set that contains all the subsets of Θ (including Θ itself and the empty set) is denoted 2^Θ. In general, the semantics of the labels depends on the context in which the DS model is applied. In information indexing and retrieval applications, the frame of discernment is normally taken to be a keyword lexicon K = {k_1, ..., k_n}. To illustrate, a lexicon that supports a collection of documents about modern art might be K = {Arp, Braque, Cezanne, ..., Zorn}, enumerating all the major artists of the Twentieth Century. The power set in this case is 2^K = {{Arp}, {Braque}, {Cezanne}, ..., {Arp, Braque}, {Arp, Cezanne}, {Braque, Cezanne}, ..., {Arp, Braque, Cezanne}, ..., ∅, K}, the last two elements being the empty set and K itself. Each element in 2^K represents a disjunction of keywords, denoted hereafter a lexical subset. The act of indexing a document using K amounts to choosing, among all the indexing possibilities in 2^K, the one or more lexical subsets that best describe the document to its potential users.

For example, suppose that an art scholar is asked to index the document "The Influence of Cezanne on early Cubism" using K, based on partial information such as the document's title or abstract. Without loss of generality, assume that (i) the main focus of the document is Cezanne; and (ii) the only Cubist artists in the lexicon are Braque and Picasso.
Under these assumptions, the scholar will probably supply an index of the form S = {({Cezanne}, r_1), ({Braque, Picasso}, r_2)}, with r_1 > r_2. This would entail the following information: (i) the document is relevant to Cezanne; (ii) it is also relevant, to a lesser extent, to either Braque or to Picasso. This is quite different from the indexing opinion S' = {({Cezanne}, r_1), ({Braque}, r_2), ({Picasso}, r_2)}, which would be more appropriate if the document's title were, say, "The Influence of Cezanne on the early work of Braque and Picasso". We arrive at our first question:

Question Q1: When the DS model is applied to information indexing and retrieval applications, the keyword lexicon K is taken to be the frame of discernment, and indexing possibilities are taken to be elements of the lexical power set 2^K. What are the taxonomical implications and limitations of this representation?

To motivate this question, consider again the document "The Influence of Cezanne on early Cubism". Note that the most reasonable index of this document would be S'' = {({Cezanne}, r_1), ({Cubism}, r_2)}, especially if the document's abstract makes no references to specific artists other than Cezanne. However, Cubism is not an element of the original lexicon K, so it doesn't entail an indexing possibility. To solve the problem, we may want to extend the original frame of discernment, creating a lexicon of the form K' = K ∪ {Cubism}. However, the keywords Braque, Picasso, and Cubism have a great deal in common from a bibliographical standpoint. Therefore, K' is not a valid frame of discernment, because some of its elements are no longer mutually exclusive. Before we present a solution to this problem, we have to be very specific about the proper relationship among frames of discernment, keyword lexicons, and taxonomies of classes. We'll return to this issue in section 4, where an answer to Q1 is given.

Mass Functions: A mapping m : 2^Θ → [0, 1] with the properties

    m(∅) = 0   and   Σ_{X ⊆ Θ} m(X) = 1                                  (3)

is called a mass function². In the DS model, the mass m(X) represents the degree to which a certain source of evidence supports the possibility X, where X ⊆ Θ. As a convention, the mass which is 'left over' after all the proper subsets of Θ have been assigned masses is allocated to Θ itself and denoted the uncommitted belief displayed by m, or m(Θ). In DS-based IR applications, where Θ is taken to be a keyword lexicon K, the mass m(X) is taken to represent (to a first approximation that will be discussed shortly) a degree of relevance, or, more accurately, the degree of belief that the document is relevant to the lexical subset X ⊆ K, according to a certain classifier. Hence, if a classifier (say, classifier number 1) supplies the relevance opinion S_1 = {({Cezanne}, 0.6), ({Braque, Picasso}, 0.3)}, then the mass function that is induced by this opinion is defined as follows:

    m_1({Cezanne}) = 0.6
    m_1({Braque, Picasso}) = 0.3
    m_1(K) = 0.1                                                         (4)
    m_1(X) = 0 for all other proper subsets of K

Note that the uncommitted belief induced by the opinion is assigned by default to the frame of discernment by means of m_1(K) = 1 - 0.6 - 0.3 = 0.1.

[Footnote 2: Throughout the paper, upper case variables refer to sets and lower case variables refer to scalars.]
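The construction of a mass function from an indexing opinion is easy to make concrete. The following sketch (hypothetical Python; the names mass_from_opinion and LEXICON are ours, not part of the DS formalism) builds the mass function (4) from classifier 1's opinion, assigning whatever relevance is 'left over' to the lexicon K as uncommitted belief:

    # A minimal sketch, assuming lexical subsets are represented as frozensets of keywords.
    LEXICON = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})   # abridged lexicon K

    def mass_from_opinion(opinion, lexicon):
        """Turn an indexing opinion {subset: relevance} into a DS mass function.

        Whatever relevance is left over after the named subsets is assigned to the
        whole lexicon, i.e. it becomes the uncommitted belief m(K).
        """
        total = sum(opinion.values())
        assert 0.0 <= total <= 1.0, "relevance numbers may not exceed 1 in total"
        mass = dict(opinion)
        mass[lexicon] = mass.get(lexicon, 0.0) + (1.0 - total)
        return mass

    # Classifier 1's opinion S_1 = {({Cezanne}, 0.6), ({Braque, Picasso}, 0.3)}:
    m1 = mass_from_opinion({frozenset({"Cezanne"}): 0.6,
                            frozenset({"Braque", "Picasso"}): 0.3}, LEXICON)
    # m1[LEXICON] == 0.1, i.e. m1(K) = 1 - 0.6 - 0.3 = 0.1, as in (4).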
The rationale for this assignment is as follows. If a certain classifier provides no information whatsoever about indexing possibilities, the classifier's 'ignorance' can be represented by the index S = {(K, 1)}. This implies the mass function m({Arp, Braque, Cezanne, ..., Zorn}) = 1 and m(X) = 0 elsewhere, reflecting the (not very useful) opinion that the document is relevant to Arp, or to Braque, or to Cezanne, or to any other artist in the lexicon. Other classifiers can provide more focused relevance opinions, resulting in lower levels of m(K). Hence, unlike a standard probabilistic design, where the notion of uncommitted belief is not well-defined, the DS model provides explicit means to quantify and manipulate it via m(K). Although uncommitted beliefs, or 'second-order uncertainties,' can and have been treated in the standard framework of subjective probability (e.g. Baron, 1987), there is no simple way to do it. The theory of evidence is unique in that it treats the notion of uncommitted belief explicitly, at the axiomatic level.

It's important to observe that mass functions represent indivisible, or atomic, degrees of belief. For example, the magnitudes of m({Braque, Picasso}), m({Braque}), and m({Picasso}) are unrelated, and a mass function like m({Braque, Picasso}) = 0.9, m({Braque}) = 0, and m({Picasso}) = 0 is not inconsistent with the theory. This particular function represents a cataloger who strongly believes that the document is relevant to either Braque or to Picasso, although he is not willing to say anything more specific beyond this assessment. But what does this notion of relevance mean? We arrive at our next question:

Question Q2: A mass function is a formal, domain-independent, component of the DS model. Relevance is an informal, but highly intuitive, concept that plays a key role in information indexing and retrieval applications. If a mass function is taken to represent relevance, then what is the exact semantics of this representation? Said otherwise, what type of relevance do mass functions represent?

Question Q2 suggests the premise that mass functions are not necessarily a natural representation of the intuitive notion of relevance, as it is typically construed in information indexing and retrieval applications. For example, if mass functions are used to represent relevance, then the relevance numbers in each index must sum up to 1. That is, the set of allowable indexing opinions {(K_1, r_1), ..., (K_n, r_n)} is constrained by Σ_i r_i = 1. Many would argue that this constraint doesn't make sense, and that an indexing opinion like, say, {({Albers}, 0.8), ({Kandinsky}, 0.4), ({Klee}, 0.4)} is perfectly reasonable. The only 'wrong' thing about this opinion is that it is inconsistent with the DS notion of a mass function, but this seems to be a limitation of the model's application, not of the opinion. One pragmatic solution is to treat the relevance numbers not as absolute, but rather as relative, measures of subjective relevance. According to this position, the two indexes S = {(A, 0.8), (B, 0.4), (C, 0.4)} and S' = {(A, 0.4), (B, 0.2), (C, 0.2)} are equally informative, as both imply exactly the same relative information: the document is twice as relevant to A as it is to B, and it is as relevant to B as it is to C.
However, this immediately leads to another snag: according to the same principle, the index is also equivalent to S'' = {(A, 0.2), (B, 0.1), (C, 0.1)}. Yet S' and S'' reflect two different states of uncommitted belief (0.2 and 0.6, respectively), and thus they don't induce the same mass function. To get around the problem, we can elicit uncommitted beliefs directly from the catalogers³. For example, having specified a relevance opinion, say {(A, 0.8), (B, 0.4), (C, 0.4)}, the cataloger can be asked to rate his confidence in the opinion on a scale of 0 to 1. If the confidence level is 1, the index is normalized to {(A, 0.5), (B, 0.25), (C, 0.25)}, reflecting an uncommitted belief of 0. If the confidence level is 0.8, the index is normalized to {(A, 0.4), (B, 0.2), (C, 0.2)}, reflecting an uncommitted belief of 0.2. In general, for any unconstrained indexing opinion {(K_1, r_1), ..., (K_n, r_n)} and a confidence level c, we can find a unique mass function {m(K_1), ..., m(K_n), m(K)} such that (i) the m(K_i)'s preserve the relative proportions of the unconstrained r_i's; and (ii) m(K) = 1 - c.

The shift from an absolute to a relative scale of relevance has several justifications. First, a significant body of psychological and cognitive evidence indicates that relevance is indeed a relative property (Saracevic, 1975). Second, we are motivated by the observation that ultimately, an IR application must satisfy the information needs of library patrons, and that relevance numbers should be used pragmatically to that end. For example, according to Maron's (1982) 'Ranking Principle,' the chief objective of relevance numbers is to present to the patron a set of documents, sorted in decreasing order of perceived relevance to his or her query. A similar principle is used in diagnostic expert systems, where ordinal, rather than cardinal, degrees of belief are often used to guide the inference engine to promising directions and to explain the system's reasoning to the people who consult it. If we accept Maron's Ranking Principle as a working assumption, then normalization is not an issue, since rankings are invariant under normalization. However, when multiple indexing opinions are aggregated into a pooled index (something that we haven't done yet), normalization becomes a tricky manipulation. Specifically, let S_1 and S_2 be two indexing opinions, ⊕ an aggregation operator, and N a normalization operator. In many cases (depending on the specific definitions of ⊕ and N), it can be shown that N(S_1 ⊕ S_2) ≠ N(S_1) ⊕ N(S_2), i.e. that N is not homomorphic.

In conclusion, we see that even though relevance numbers can be represented by mass functions, the representation has some theoretical caveats. Clearly, these limitations are related to the fact that we are still lacking explicit domain semantics. That is, we don't know yet what is the exact meaning of relevance numbers. This analysis is taken up in section 4, where a complete answer to question Q2 is given.

[Footnote 3: In this section, the terms classifier and cataloger are used interchangeably. The distinction between the two terms is made explicit in the next section.]
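The confidence-based normalization described above is a simple rescaling. The following sketch (hypothetical Python; normalize_opinion is our name, not a construct of the DS model) preserves the relative proportions of the relevance numbers and sets the uncommitted belief to 1 - c:

    def normalize_opinion(opinion, confidence, lexicon):
        """Rescale an unconstrained opinion {subset: weight} into a mass function
        whose uncommitted belief m(K) equals 1 - confidence."""
        total = sum(opinion.values())
        mass = {subset: confidence * weight / total for subset, weight in opinion.items()}
        mass[lexicon] = 1.0 - confidence            # uncommitted belief
        return mass

    K = frozenset({"Albers", "Kandinsky", "Klee"})
    raw = {frozenset({"Albers"}): 0.8, frozenset({"Kandinsky"}): 0.4, frozenset({"Klee"}): 0.4}

    normalize_opinion(raw, 1.0, K)   # {Albers}: 0.5, {Kandinsky}: 0.25, {Klee}: 0.25, m(K) = 0
    normalize_opinion(raw, 0.8, K)   # {Albers}: 0.4, {Kandinsky}: 0.2,  {Klee}: 0.2,  m(K) = 0.2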
The Core: The core of a mass function m : 2^Θ → [0, 1] is the set of possibilities X ∈ 2^Θ for which m(X) > 0. When the frame Θ is taken to be a keyword lexicon K, the core becomes a list of indexing possibilities, in the view of one particular classifier. For example, the core of the mass function induced by classifier 1 (Eqn. 4) is C_1 = {{Cezanne}, {Braque, Picasso}, K}. Suppose now that the same document is indexed by another classifier (classifier no. 2), whose indexing opinion is captured by the following mass function:

    m_2({Picasso}) = 0.8
    m_2(K) = 0.2                                                         (5)
    m_2(X) = 0 for all other proper subsets of K

The core of this mass function is C_2 = {{Picasso}, K}. Is there a credible way to combine the two indexing opinions (4), (5) into an aggregate index? As a first approximation, one can focus on all the lexical subsets that both classifiers agree are relevant to the document. In particular, if classifier 1 thinks that X is relevant and classifier 2 thinks that Y is relevant, then both classifiers agree that X ∩ Y is relevant (recall that both X and Y are interpreted as disjunctions of keywords). This leads to the following definition of a pooled core: Let m_1, m_2 : 2^Θ → [0, 1] be two mass functions with cores C_1 and C_2. The pooled core C = C_1 ⊕ C_2 will be:

    C_1 ⊕ C_2 = {X ∩ Y : X ∈ C_1, Y ∈ C_2, X ∩ Y ≠ ∅}                    (6)

For example, the pooled core of C_1 = {{Cezanne}, {Braque, Picasso}, K} and C_2 = {{Picasso}, K} is C_1 ⊕ C_2 = {{Cezanne}, {Picasso}, {Braque, Picasso}, K}⁴. In general, then, the pooled core can be viewed as a first approximation of the degree of consensus or disagreement displayed by two independent indexing opinions. If C_1 ⊕ C_2 = C_1 = C_2, we have a consensus regarding which possibilities are likely. If C_1 ⊕ C_2 = ∅, the classifiers agree on nothing. If C_1 ⊕ C_2 is not empty, we have an overlap of some opinions. Of course the problem of (6) is that it merely identifies areas of mutual agreement (or lack thereof) between two classifiers. In order to compute the intensity of such agreements, a more sensitive pooling mechanism is required. Dempster's rule provides one such mechanism.

Dempster's Rule: The most fundamental (and debatable) pillar of the DS model is the convention that once degrees of support are cast in terms of mass functions, Dempster's rule provides a proper mechanism to combine them. Let m_1 and m_2 be two mass functions defined over the same frame of discernment, m_1, m_2 : 2^Θ → [0, 1], with cores C_1 = {A_1, ..., A_{n_1}} and C_2 = {B_1, ..., B_{n_2}}, respectively. Dempster's rule computes the pooled mass function m = m_1 ⊕ m_2 : 2^Θ → [0, 1] as follows:

    m'(X) = Σ_{A_i ∩ B_j = X} m_1(A_i) · m_2(B_j)                        (7)

    m(X) = m'(X) / (1 - m'(∅))  for X ≠ ∅,  and  m(∅) = 0                (8)

The rationale behind (7-8) can be explicated through an 'intersection table.' In our two-classifiers scenario (4-5), the table has the following form:

                          m_1(Cezanne) = 0.6   m_1(Braque, Picasso) = 0.3   m_1(K) = 0.1
    m_2(Picasso) = 0.8    m'(∅) = 0.48         m'(Picasso) = 0.24           m'(Picasso) = 0.08
    m_2(K) = 0.2          m'(Cezanne) = 0.12   m'(Braque, Picasso) = 0.06   m'(K) = 0.02

The top row of the table records the mass function of the first classifier excluding its zero elements, i.e. the set of values m_1(A_1), ..., m_1(A_{n_1}) for elements A_i in the core C_1. The left column of the table records the mass values of the second classifier for its core elements, i.e. the set of values m_2(B_1), ..., m_2(B_{n_2}). (The curly brackets are dropped for the sake of brevity, e.g. m(Picasso, Braque) stands for m({Picasso, Braque}), etc.) Inside the table, the (i, j)'th cell records the pooled mass contributed to A_i ∩ B_j by the pair A_i and B_j, which is taken to be the product m_1(A_i) · m_2(B_j).

[Footnote 4: Note that K acts as an attractor, in that A ∩ K = A for all A ⊆ K.]
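A short sketch (hypothetical Python; dempster_combine is our name) makes the mechanics of (7-8) concrete. Mass functions are represented as dictionaries mapping frozensets to masses; applied to the two classifier opinions (4) and (5), the sketch reproduces the intersection table and the pooled masses derived below:

    def dempster_combine(m1, m2):
        """Combine two mass functions over the same frame via Dempster's rule (7-8)."""
        pooled = {}
        conflict = 0.0
        for a, mass_a in m1.items():
            for b, mass_b in m2.items():
                x = a & b                        # intersection of the two possibilities
                if x:
                    pooled[x] = pooled.get(x, 0.0) + mass_a * mass_b
                else:
                    conflict += mass_a * mass_b  # mass that falls on the empty set
        # Renormalize so that the pooled masses again sum to 1 (eq. 8).
        return {x: mass / (1.0 - conflict) for x, mass in pooled.items()}

    K = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})        # abridged lexicon
    m1 = {frozenset({"Cezanne"}): 0.6, frozenset({"Braque", "Picasso"}): 0.3, K: 0.1}
    m2 = {frozenset({"Picasso"}): 0.8, K: 0.2}

    m = dempster_combine(m1, m2)
    # m[{Cezanne}] = 0.23, m[{Picasso}] = 0.62, m[{Braque, Picasso}] = 0.11, m[K] = 0.04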
Using these entries and combining cells with equivalent intersections following (7-8), one obtains:

    m'(Cezanne) = 0.12
    m'(Picasso) = 0.24 + 0.08 = 0.32
    m'(Picasso, Braque) = 0.06
    m'(K) = 0.02
    m'(∅) = 0.48

After dividing by 1 - m'(∅) = 0.52, one obtains:

    m(Cezanne) = 0.23
    m(Picasso) = 0.62
    m(Picasso, Braque) = 0.11                                            (10)
    m(K) = 0.04
    m(∅) = 0

Since the m(·)'s sum up to 1 and m(∅) = 0, the mapping m = m_1 ⊕ m_2 that emerges from Dempster's rule is also a mass function, consistent with (3).

In words, Dempster's rule computes a measure of agreement between two sources of evidence concerning various possibilities drawn from a common frame of discernment. The rule is conservative in the sense that it focuses only on those possibilities that both sources support. The magnitude of the pooled support that a possibility X collects is computed by summing the products of the two masses m_1(X) and m_2(X), which explains the product operator in (7). Because the sources of evidence express their opinions over 2^Θ rather than over Θ, a joint agreement on a possibility can occur in more than one way, i.e. whenever the two sources support supersets of X. This explains the summation operator in (7), which runs over all the possible supersets of X. Finally, when a pairing of two opinions results in a null possibility (the empty set), the multiplication of their masses may still be positive. This is an anomaly, since the definition of a mass function (3) requires that the mass of the null possibility be zero. This explains the role of (8), in which m'(∅) is deducted from the total mass and the remaining mass is divided by (1 - m'(∅)) to ensure that the pooled mass will sum up to 1.

Dempster's rule is often compared to and contrasted with Bayes rule, because both rules concern the combination of probabilistic opinions into an aggregate (posterior) opinion. It is crucial to observe, however, that unlike Bayes rule, which is a trivial consequence of the axioms of probability theory, Dempster's rule is a prescriptive pooling mechanism which is neither right nor wrong, and thus it is less of a 'rule,' and more of a 'recipe.' Therefore, we take the position that the ultimate justification of Dempster's rule should be sought in the field, i.e., in the various applications in which the rule is supposed to have a certain sense of domain validity. This leads to the following question:

Question Q3: What is the intuitive justification of Dempster's rule in the context of information indexing and retrieval applications? If one wishes to aggregate indexing opinions via a certain pooling mechanism, then why use (7-8) and not another set of formulae?

A typical way to avoid this question is to invoke the argument: "If one uses mass functions to represent relevance numbers, then one should combine them using Dempster's rule, because that's how mass functions are combined in the DS model." This argument could have been valid if Dempster's rule had a normative, domain-independent, and non-controversial justification. But this is not the case.
In fact, many researchers have struggled to make sense of Dempster's rule, and the debate is still going strong: "Shafer's theory has been strongly criticized for its failure to give a meaning to the measures of belief and plausibility, or to show how someone might arrive at a particular numerical assessment. In the absence of a definite interpretation, it is difficult to see how the rules of the theory, and in particular Dempster's rule, can be justified" (Buxton, 1989). Given this controversy, the importance of question Q3 is obvious. Hence, our goal is to clarify, and to a certain extent defend, the meaning of Dempster's rule in the specific bibliographical context of an information indexing and retrieval application. This analysis is carried out in section 4, where a complete answer to Q3 is presented.

Belief Functions: Building on the elementary notion of a mass function m : 2^Θ → [0, 1], the function Bel : 2^Θ → [0, 1], denoted a belief function, can be defined as follows:

    Bel(X) = Σ_{A ⊆ X} m(A)                                              (11)

Whereas m(X) measures the support rendered to X (a subset of propositions) directly, Bel(X) measures the total support rendered to X and to all its subsets (each being a more specific proposition). This relationship is depicted in figure 3, which illustrates how a Bel(·) function can be derived from the m(·) function given by (10). Note that (3) and (11) imply that Bel(∅) = 0 and Bel(Θ) = 1 always. In fact, (11) implies that the Bel function is completely determined by the mass function m, and, likewise, that m can be recovered from Bel's definition (Shafer, 1976, p. 39).

Plausibility Functions: Whereas Bel(X) measures the total support rendered to a possibility X, the plausibility of X, denoted Pl(X), measures the maximal support that X can possibly attain under a given mass function m. Specifically:

    Pl(X) = Σ_{A ∩ X ≠ ∅} m(A)                                           (12)

In words, Pl(X) records the total mass allocated to all the possibilities with which X intersects. For a pictorial description of this relationship, refer again to figure 3.

Put figure 3 around here

The intuitive relationship between the three functions m(·), Bel(·), and Pl(·) can be described as follows. Beginning with the definition of Bel, consider the two possibilities X, A ⊆ Θ. Since both X and A are disjunctions of propositions, the set-theoretic statement A ⊆ X is equivalent to the logical rule A → X, which we will interpret as: 'If the truth lies in A, it must also lie in X.' Therefore, the sum of all the masses associated with premises A that imply X can be viewed as a measure of the total support rendered to X. As regards Pl's definition, suppose now that A ∩ X ≠ ∅ (but A is not necessarily a subset of X). Since the possibility A is a disjunction of propositions, the mass m(A) rendered to it can 'float' freely to any one of its subsets, including those that intersect X. In the extreme case, the intersection A ∩ X may inherit the entire mass of A. It follows that Pl(X) is the upper bound of Bel(X).
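Definitions (11) and (12) translate directly into code. The following sketch (hypothetical Python; belief and plausibility are our names) derives Bel and Pl from the mass function (4) of classifier 1:

    def belief(m, x):
        """Bel(X): total mass of all possibilities contained in X (eq. 11)."""
        return sum(mass for a, mass in m.items() if a <= x)

    def plausibility(m, x):
        """Pl(X): total mass of all possibilities that intersect X (eq. 12)."""
        return sum(mass for a, mass in m.items() if a & x)

    K = frozenset({"Arp", "Braque", "Cezanne", "Picasso"})
    m1 = {frozenset({"Cezanne"}): 0.6, frozenset({"Braque", "Picasso"}): 0.3, K: 0.1}

    x = frozenset({"Braque", "Picasso"})
    belief(m1, x)          # 0.3  -- only m1({Braque, Picasso}) lies inside X
    plausibility(m1, x)    # 0.4  -- m1({Braque, Picasso}) plus m1(K), which intersects X
    # Pl(X) - Bel(X) = 0.1: the extent to which the support for X may still 'float' upward.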
To do justice to the theory of evidence, it should be noted that the construction of Bel and Pl using m is only one way to define these functions. Shafer provided direct definitions of mass, belief and plausibility functions in terms of each other. He has also emphasized the key role that subadditivity plays in the theory of evidence, a point which we now turn to discuss in the specific context of information indexing and retrieval.

Subadditivity: The complement of a set X ⊆ Θ, i.e. the set of all propositions that are in Θ and not in X, is denoted hereafter X̄. Definitions (11) and (12) imply the following important relationships:

    Pl(X) = 1 - Bel(X̄)                                                  (13)

    Bel(X) ≤ Pl(X)                                                       (14)

If a certain Bel_b were a Bayesian representation of degrees of belief, the additivity axiom of probability theory (X ∩ Y = ∅ implies Bel_b(X ∪ Y) = Bel_b(X) + Bel_b(Y)) would mean that

    Bel_b(X) = 1 - Bel_b(X̄)

yet (13) and (14) imply that in the general case Bel(X) ≤ 1 - Bel(X̄), leading to the famous subadditivity property of the theory of evidence:

    Bel(X) + Bel(X̄) ≤ 1

In other words, the belief that one holds in a possibility does not automatically imply one's disbelief in the negation of that possibility. In information indexing and retrieval applications, where Θ is taken to be a keyword lexicon K, this tenet has important implications. For example, if the admittance of new evidence causes a cataloger to increase his belief in the document's relevance to a lexical subset X, the same evidence should not necessarily decrease his belief in the document's relevance to lexical subsets in X̄, especially if the cataloger is not confident in his relevance opinion. In particular, the difference 1 - Bel(X) - Bel(X̄) is called the uncommitted belief with respect to X. If Bel were a Bayesian representation of degrees of belief, the uncommitted belief would be zero by definition. This is best illustrated in the 'state of insufficient reason,' in which one knows absolutely nothing about a set of propositions Θ = {q_1, ..., q_n}. Whereas the common solution is to set Bel(q_i) = 1/n for all q_i ∈ Θ, the theory of evidence would set Bel(Θ) = 1 and Bel(X) = 0 for all the other proper subsets of Θ. This is the case when the uncommitted belief is at maximum.

The interpretation of Bel(·) and Pl(·) as lower and upper probabilities has led many to view the theory of evidence as a novel calculus for eliciting and manipulating interval-valued, rather than point-valued, degrees of belief. Indeed, the theory allows one to express the belief in every hypothesis X by means of the interval [Bel(X), Pl(X)], which may be updated as new evidence about X is admitted. Further, the width of the interval, Pl(X) - Bel(X), is by definition 1 - Bel(X) - Bel(X̄), or the uncommitted belief with respect to X. If the uncommitted beliefs induced by a certain mass function m were zero for all the hypotheses under consideration, the intervals would degenerate to zero widths and Bel would be a standard probability function. Yet in the more general case in which the mass reflects some 'second-order uncertainty,' or 'ambiguity,' the degree of belief in possibilities X drawn from Θ is allowed to 'float' between Bel(X) and Pl(X). One benefit of such a model is that it is more robust and less prone to human errors in assessing subjective degrees of support. We arrive at our last question:

Question Q4: The designer of a DS-based IR application can choose to elicit and represent relevance through three alternative languages: mass functions, belief functions, and belief intervals. What is the relationship among these three representations in the specific context of information indexing and retrieval applications?
Recall that the three functions m, Bel, and Pl are mathematically equivalent, in the sense that knowledge of any one of them (for every possibility) can be used to compute the other two. This equivalence might lead one to conclude that the question of whether to use m, Bel, or [Bel, Pl] to elicit and manipulate degrees of support depends on cognitive and efficiency considerations. As it turns out, this conclusion is quite naive. For example, belief intervals are not as flexible a representation as we would like them to be. That is, when one elicits [Bel, Pl] intervals from a source of evidence, it is not true that the only restriction is that 0 ≤ Bel ≤ Pl ≤ 1. Again, a full understanding of these constraints requires a semantic interpretation, which we now proceed to present.

4 A Canonical Indexing Model

As figure 1 illustrates, the key theme of this paper is the interplay of the theory and practice of the Dempster Shafer model, as viewed through the 'lens' of a particular application. The previous section was structured around the key constructs of the theory: the frame of discernment, mass and belief functions, and Dempster's rule. Coming from the other extreme, this section is structured around the key constructs of the application: taxonomies, relevance functions, and index aggregation operators. This leads to the development of a canonical indexing model, around which the remainder of the paper evolves. In building this model, our intention is to articulate an indexing mechanism which is simple, intuitive, and, most importantly, probabilistic. The main result that we are aiming at is this: notwithstanding its domain-specific origin and its strict probabilistic nature, the canonical model that is expounded here is completely isomorphic to the DS model. This has several important implications. First, the canonical model provides concrete answers to all the questions that were raised about the theoretical fit between the DS theory and information indexing and retrieval applications. Second, because the limitations of the former will be explicit, implicit limitations of the latter will become apparent. Third, because the canonical model makes no use of extra-probabilistic arguments, it also provides a simple probabilistic interpretation of the DS theory, which is often claimed to be an extension of probability theory.

4.1 The Taxonomy

In most IR applications, documents are indexed and sought within a data structure that is called a taxonomy. The taxonomy is a finite set of classes, or categories, designed to organize documents in a particular subject of interest. For example, consider the following set of classes, taken from an art-related taxonomy: C = {Art, Braque, Cubism, Dada, Impressionist, Janco, Modern, Picasso}. Taxonomies are constructed by domain experts - in this case art scholars - who provide two types of information: (i) a set of classes; and (ii) a taxonomical data structure, expressed as ordered pairs of classes. Specifically, if we let (x, y) code the assertion 'y is a direct sub-class of x', then the expert might specify a relation like H = {(Art, Modern), (Art, Impressionist), (Modern, Cubism), (Modern, Dada), (Cubism, Braque), (Cubism, Picasso), (Dada, Picasso), (Dada, Janco)}, resulting in the taxonomy depicted in figure 2.
Put figure 2 around here

Formally, a taxonomy is a rooted directed acyclic graph <C, H>. The node set C represents taxonomical classes, and the edge set H represents a relation on C (i.e., a subset of C × C, giving directed pairs) with two restrictions: (i) no cycles exist in the digraph, and (ii) the digraph contains exactly one root, i.e. a class r ∈ C such that no edge (x, r) exists in H. The descendants of a class x are the subclasses of x, and the predecessors are the generalizations of x. The root of the taxonomy is the only class that (i) has no predecessors, and (ii) generalizes all other classes, e.g. the class Art in figure 2. Since the taxonomy is a finite and acyclical graph, it contains a 'boundary,' or a set of terminal classes. A class k ∈ C is said to be terminal if it has no descendants, i.e. if no edges of the form (k, x) exist in H. In the example of figure 2, the terminal classes are {Braque, Picasso, Janco}. We will denote the set of terminal classes by K. This notation is not coincidental, as K is precisely the keyword lexicon discussed in the previous section.

In a taxonomy, each class c ∈ C is characterized by two sets that we denote LIB(c) and VOL(c) and call the library rooted in c and the libraries that intersect c, respectively. LIB(c) contains all the classes that can be reached by paths beginning at c and following edges all the way 'down' to the terminal classes. This set of classes, which includes c itself, represents the entire set of classes into which c may be decomposed. Conversely, VOL(c) contains all the classes that can be reached by following 'upward' paths beginning at c and ending at the taxonomy's root. This set, which also includes c itself, represents all the classes to which c can be generalized. For example, LIB(Cubism) = {Cubism, Braque, Picasso} and VOL(Cubism) = {Cubism, Modern, Art}. These sets can be given a recursive definition, as follows:

    LIB(c) = {c} ∪ {x ∈ C | ∃y ∈ LIB(c) with (y, x) ∈ H}                 (17)

    VOL(c) = {c} ∪ {x ∈ C | ∃y ∈ VOL(c) with (x, y) ∈ H}                 (18)

A taxonomy is similar to a hierarchical tree structure, with a difference. Unlike a common tree, each class in the taxonomy can have as many parents as we desire. For example, in figure 2 Picasso is a subclass of both Cubism and Dada. Also note in passing that definitions (17-18) imply that (i) LIB(root) = C, so that the root library contains all the classes in the taxonomy; (ii) VOL(root) = {root}, i.e. the root library is not contained by any other library; and (iii) LIB(k) = {k} if and only if k ∈ K, so that terminal classes are characterized by libraries that contain singleton classes only.
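A sketch of the taxonomy as a data structure may help fix ideas (hypothetical Python; the edge set below reproduces the Cubism and Dada fragment of figure 2, with the Impressionist branch omitted for brevity). The functions lib and vol implement the recursive definitions (17-18):

    H = {("Art", "Modern"), ("Modern", "Cubism"), ("Modern", "Dada"),
         ("Cubism", "Braque"), ("Cubism", "Picasso"),
         ("Dada", "Picasso"), ("Dada", "Janco")}
    C = {x for edge in H for x in edge}

    def children(c):
        return {y for (x, y) in H if x == c}

    def parents(c):
        return {x for (x, y) in H if y == c}

    def lib(c):
        """LIB(c): c together with every class reachable 'downward' from c (eq. 17)."""
        result = {c}
        for child in children(c):
            result |= lib(child)
        return result

    def vol(c):
        """VOL(c): c together with every class reachable 'upward' from c (eq. 18)."""
        result = {c}
        for parent in parents(c):
            result |= vol(parent)
        return result

    lexicon = {c for c in C if not children(c)}   # terminal classes: Braque, Picasso, Janco
    lib("Cubism")    # {'Cubism', 'Braque', 'Picasso'}
    vol("Cubism")    # {'Cubism', 'Modern', 'Art'}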
The indexing process: The act of indexing a document within a taxonomy can be described as a top-down, depth-first search process. To illustrate, suppose that an art-related document is to be indexed within the art taxonomy from figure 2. Without loss of generality, assume that the document is relevant to modern art. Beginning at the first level under Modern and proceeding left to right, we test if the document is relevant to Cubism. If the answer is 'yes,' we step down one level and test if it's relevant to Braque. If the answer is 'yes,' we index the document in Braque. If the answer is either 'no' or 'unsure,' we test if it's relevant to Picasso. If the answer is either 'no' or 'unsure,' and assuming that Picasso is the last class below Cubism, we backtrack one level, index the document in Cubism, and proceed to explore Dada. If the document is deemed irrelevant to any one of the classes thus visited, we backtrack one level and index the document under Modern. This would reflect the notion that even though the document is related to modern art, the existing taxonomy fails to discern the exact category to which it belongs. Thus the indexing process involves a depth-first search which is cut off at any class that is deemed to be irrelevant to the indexed document.

We see that the notion of relevance that is consistent with this process is defined over subsets of classes, not over individual classes in K. That is, if a document is indexed under, say, Cubism, it implies that the document belongs to the library LIB(Cubism), i.e., to the collection of documents about Cubism, Braque, or Picasso. This definition of relevance is convenient because it allows us to be as specific as we wish in our relevance statements. If we're sure that a document is relevant to a certain class, we index it under that class. If we're not sure, we can step back and index the document in a library that contains the class. We can do this all the way up to the root of the taxonomy, at which point the indexing decision root would express the opinion that the document belongs somewhere in the library, without specifying exactly where.

Relationship to the theory of evidence: The relationship between a taxonomy <C, H> and a lexical frame of discernment K is simple, but not trivial. From a mathematical perspective, the taxonomy can be viewed as a subset of a graph G whose vertices are indexed by 2^K. In the graph G, there is an edge from the vertex indexed by the subset A ∈ 2^K to the vertex indexed by the subset B ∈ 2^K if and only if A ⊇ B and no other subset C satisfies A ⊃ C ⊃ B. In the taxonomy, which is a subset of G, each vertex c corresponds to the vertex of G that is indexed by the subset of K obtained from the terminal elements in LIB(c). Thus, each class in the taxonomy can be associated with an element in 2^K, namely the subset obtained from the keywords of the terminal classes that can be reached by looking 'downward' from the class in the taxonomy.

While there is this mathematical association, there are important differences between the notion of a taxonomy and the power set of K as used in the DS model. First, we may distinguish between two types of taxonomies: static and adaptive. A static taxonomy consists of a fixed and unmodifiable set of classes, like the Dewey decimal system or the Library of Congress index. An adaptive taxonomy is a dynamic data structure that evolves from the indexing process itself. Such a taxonomy consists of a fixed set of keywords, denoted K, and an 'open-ended' set of classes, each class being a different grouping of keywords from K. That is, when a new document is deemed relevant to a subset of keywords that doesn't make up an existing category, we simply announce this subset a new class and add it to the taxonomy.
Hence, a document titled "A letter from Braque to Janco" may well be indexed in the class {Braque, Janco}, something that would have been impossible in a static taxonomy that doesn't contain such a predefined category. The only restriction that is placed on an adaptive taxonomy is that it must contain at least all the elements in K (as singletons, or classes that are made up of single keywords), as well as K itself. Hence, we begin with the initial set of classes C = {{k_1}, ..., {k_n}, K}, and add more classes to it as we go along.

Thus, the precise relationship between the IR notion of a taxonomy and the theoretical DS notion of a lexical frame of discernment K can be described in two steps. First, any static taxonomy is conceptually a 'frozen' and 'named' version of some adaptive taxonomy. Second, any adaptive taxonomy, in turn, is a subset of the lexical power set 2^K. An example is illustrated in figure 4, using the simple lexicon K = {Braque, Picasso, Janco}. Figure 4-a depicts the lexical power set 2^K (excluding ∅). In practice, dealing with the power set of keywords is unrealistic, since the set of all possible classes becomes prohibitively large even with only a few dozen keywords. However, once the semantics of the lexicon is taken into consideration, many if not most of the classes in 2^K become irrelevant, since they represent arbitrary groupings of keywords that can be excluded from the taxonomy for all practical purposes. If we choose to focus on tree taxonomies only, the power set can be restricted further by disregarding all its non-hierarchical subsets⁵. Figure 4-b depicts a specific adaptive taxonomy that might have emerged from a hypothetical indexing process. By definition, this taxonomy is a subset of the exhaustive 4-a taxonomy. Finally, figure 4-c depicts a 'frozen' and 'named' version of the 4-b taxonomy. The naming procedure is domain-dependent: if certain classes 'make sense' on semantic grounds, they can be given descriptive names that reflect their contents. For example, the class {Braque, Picasso} can be named Cubism, the class {Braque, Picasso, Janco} Modern, etc.

Put figure 4 around here

We now turn to question Q1, which asked whether the DS concept of a lexical power set provides an adequate 'skeleton' for indexing documents in IR applications. The answer to this question is 'yes,' but there is a caveat. Note that there is a subtle difference between a bibliographical taxonomy and a subset of the DS power set: in the former, the classes have names; in the latter, the classes correspond to anonymous lexical subsets. That is, in the logical context of the DS model, to say that a document is relevant to {k_1, k_2} is tantamount to saying that the document is relevant to either k_1 or to k_2. Yet in the context of a bibliographical taxonomy, most lexical subsets have meaningful names, like Cubism and Dada, just like the elementary keywords that make up their contents. Therefore, indexing a document in a named class might mean something quite different than the implication that the document should be indexed in one or more of the class's constituent keywords.

[Footnote 5: Using the notation |X| to represent the cardinality of a set X, characterize each class X ∈ C by the set L(X) = {Y ∈ C : |X| = |Y|}. A taxonomy <C, H> will be a tree taxonomy if and only if for every class X ∈ C, L(X) contains only disjoint sets.]
For example, suppose that a cataloger decided to index the title "Cubist Landscapes" directly in the class Cubism. In the standard DS model, this indexing opinion would imply that "the document is relevant either to Picasso, or to Braque, or to another Cubist artist." Although this interpretation is logically correct, it clearly entails a loss of concrete information about the document's direct relevance to Cubism at large. Also, it leads to a situation in which the set of documents relevant to a class is larger than or equal to the union of the sets of documents relevant to all of its children classes, which is inconsistent with the disjunctive interpretation of a standard DS power set.

How can we augment the power set representation of the DS model so as not to force a cataloger to disregard information about a document's direct relevance to non-singleton classes? By viewing the power set (or the portion of the power set that is in use for indexing) as a taxonomy <C, H>, the problem may be solved by adding to the taxonomy a new set of net classes, as follows. For each non-terminal class c ∈ C, add (i) a new class named net-c to C, and (ii) a new link (c, net-c) to H. The new class net-c, which is a direct terminal descendant of c, can now serve as the index of the documents that are relevant specifically and directly to c. With this modification, each class c becomes a mere tag, or a pointer, and the proposition 'the document is relevant to the class c' is once again equivalent to the proposition 'the document is relevant to the library rooted at c.' Since the net classes are terminal classes, they become elements of the lexicon. Therefore, in a taxonomy which is augmented with a set of net classes, every indexing decision can be interpreted as selecting subsets of relevant keywords (which may include net classes) from the lexicon, so we are back in the familiar disjunctive stance of the DS frame of discernment. Purists may find this solution crude, but the adjustment is necessary if one wants to apply the DS model to information indexing and retrieval applications without violating, or misinterpreting, the set-theoretic premise of the model.
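The net-class augmentation is a mechanical transformation of the taxonomy. The sketch below (hypothetical Python, reusing the edge-set representation of the earlier taxonomy sketch; add_net_classes is our name) adds one terminal net class under every non-terminal class:

    def add_net_classes(C, H):
        """Return an augmented taxonomy in which every non-terminal class c
        has a terminal child net-c for documents relevant directly to c."""
        non_terminal = {x for (x, y) in H}          # classes with at least one child
        C2, H2 = set(C), set(H)
        for c in non_terminal:
            net = "net-" + c
            C2.add(net)
            H2.add((c, net))
        return C2, H2

    H = {("Modern", "Cubism"), ("Modern", "Dada"), ("Cubism", "Braque"),
         ("Cubism", "Picasso"), ("Dada", "Picasso"), ("Dada", "Janco")}
    C = {x for edge in H for x in edge}

    C2, H2 = add_net_classes(C, H)
    # 'net-Cubism' is now a terminal class, so "Cubist Landscapes" can be indexed
    # there directly instead of being read as 'relevant to Braque or to Picasso.'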
4.2 Relevance Functions

The fundamental rule of indexing is that a document should be indexed using certain keywords if prospective users of the document would find it relevant to these keywords. In its most primitive form, then, relevance is a Boolean and subjective relation, indicating categorically that a document d ∈ D is relevant to a lexical subset X = {k_1, ..., k_m} in the view of a particular library patron. However, because bibliographical classes don't have crisp boundaries, and because of the multitude of relevance opinions expressed by different catalogers and library patrons, the more reasonable question is not whether d is relevant to X, but rather what the intensity of this relation is. In other words, we seek to represent relevance in terms of a mapping r : 2^K × D → [0, 1], rather than in terms of a characteristic function r : 2^K × D → {0, 1}. There have been many efforts to interpret relevance on probabilistic grounds, Maron and Kuhns (1960) being the defining article. One of the fundamental problems in this area has been the proper definition of the sample space from which relevance propositions are drawn. This point was alluded to by Maron, as follows:

"The notion of probability of relevance can be interpreted in two different perspectives: of the document, as the proportion of patrons of a given type who would judge that document relevant, and of the patron himself, as the proportion of documents of a given type which he would judge relevant. The first model leads to a theory of probabilistic indexing; the second model leads to a theory of probabilistic query formulation." (Maron, 1982)

In what follows we will focus on Maron's first perspective, in which multiple patrons form relevance opinions about a fixed document. Consistent with Maron's observation, this perspective yields a model of inexact indexing. Unlike Maron, though, the uncertainty associated with the indexes in our model will lead not to probability functions, but rather to Dempster Shafer mass functions, i.e. functions that conform to definition (3).

Let U = {u_1, ..., u_n} be a set of catalogers, and let K be a keyword lexicon. Suppose that each cataloger in U is asked to index the same document using K, i.e. to specify one or more keywords from K that are relevant to the document. Suppose that cataloger u_i supplies the opinion that the document is relevant to the lexical subset X ⊆ K; we then record this opinion by means of the following Boolean function:

    v_i(X) = 1 if u_i indexed the document using X, and 0 otherwise.

Since each cataloger u_i supplies one set of relevant keywords, there will be exactly one subset X ∈ 2^K such that v_i(X) = 1. Also, the empty set is not allowed as a valid relevance opinion. If a cataloger is unwilling to give an opinion, or is unsure about the proper classification of the document, the document is indexed by default in the root class K ∈ C, which is also an element of 2^K. This convention makes sense because the root class represents the entire library, and is therefore the natural place to store documents whose specific class membership is indiscernible.

After all n catalogers have cast their indexing opinions regarding the same document d, we compute for each lexical subset X ∈ 2^K three 'relevance counters,' as follows:

    r(X, d)     = sum over i = 1..n of v_i(X)                    (20)
    r_LIB(X, d) = sum over Y ∈ LIB(X) of r(Y, d)                 (21)
    r_VOL(X, d) = sum over Y ∈ VOL(X) of r(Y, d)                 (22)

In words, r(X), r_LIB(X), and r_VOL(X) count the number of catalogers who classified the document in X, in the library rooted in X, and in libraries that intersect (or, in a hierarchical taxonomy, contain) X, respectively. (When d is fixed in our analysis, we will suppress the explicit dependence and write r(X) instead of r(X, d).)

Relationship to the theory of evidence: Suppose now that the Boolean relevance opinions of the catalogers are averaged over the space of catalogers U through the following computation:

    m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X)    (23)

The resulting function m(X) is a DS mass function over the lexical space K. Formally, we have the following proposition (the proofs are given in a separate appendix):

Proposition 1: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. The real function m : 2^K → [0, 1] defined by m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X) is a mass function, satisfying definition (3).
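As a quick illustration of the construction behind Proposition 1, the sketch below computes m(X) = (1/n)·r(X) from a list of Boolean indexing opinions. The data and helper name are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def mass_from_opinions(opinions):
    """opinions: one non-empty frozenset of keywords per cataloger.
    Returns the induced mass function m(X) = r(X) / n."""
    n = len(opinions)
    r = Counter(opinions)                     # r(X): catalogers who chose exactly X
    return {X: Fraction(c, n) for X, c in r.items()}

# Four hypothetical catalogers indexing the same document over K = {A, B, C}:
opinions = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
m = mass_from_opinions(opinions)
assert sum(m.values()) == 1                   # the masses sum to 1; m(empty set) = 0
# m == {frozenset({'A','B'}): 1/2, frozenset({'A'}): 1/4, frozenset({'B'}): 1/4}
```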
The consequence of the proposition is that DS mass functions arise naturally when we view relevance functions as averages of multiple Boolean indexing opinions. We begin with a space U of n catalogers who are asked to index the same document using the same lexicon K. Each cataloger supplies an individual opinion that specifies which keywords are relevant to the document. Note that the catalogers' indexes are not restricted: they are free to choose any keyword, or combination of keywords, that in their opinion is relevant to the document. Next, shifting our attention from the catalogers space U to the lexical space K, we compute for each lexical subset X ∈ 2^K a measure of 'average relevance,' (1/n)·r(X), which represents the fraction of catalogers who thought that the document was relevant to X. Disregarding the lexical subsets that no cataloger has chosen, we obtain a set of pairs of the form {(Γ_1, r_1), ..., (Γ_l, r_l)} in which Γ_i ∈ 2^K and 0 < r_i = (1/n)·r(Γ_i) ≤ 1.

We are now in a position to answer question Q2, regarding the 'type' of relevance that DS mass functions represent, given the context of multiple relevance opinions. First, the canonical model has yielded the type of relevance numbers that are at the center of any probabilistic indexing model. Second, according to Proposition 1, these numbers form a mass function, consistent with the standard DS model. Finally, the meaning of the mass m(X) is simply the fraction of catalogers who thought that the document was relevant to the set of keywords X.

Following the same line of reasoning, we can also provide an answer to question Q4, which sought an IR interpretation of the meaning and relationship of mass functions, belief functions, and belief intervals. Given the IR context in which Θ = K, it is easily seen that the relevance counters (20-22) are proportional to the mappings that represent degrees of belief in the DS model. Specifically, dividing each counter by n - the number of catalogers - yields the mass, belief, and plausibility functions defined in (3), (11), and (12), respectively:

    m(X)   = (1/n) · r(X)
    Bel(X) = (1/n) · r_LIB(X)
    Pl(X)  = (1/n) · r_VOL(X)

If we combine these observations with the interpretation of the power set of the lexicon as a taxonomy, we see that the mass on a lexical subset X is given by the fraction of catalogers who indexed the document using X directly. Similarly, the belief in X is the fraction of catalogers who indexed the document in libraries within X, and the plausibility of X is the fraction of catalogers who indexed the document in libraries that intersect (in a hierarchical taxonomy, contain) X.

The key component of the canonical model that enables this interpretation of the DS functions is the assumption of multiple patrons and the v_i(·) functions that keep track of their individual indexing opinions. In the canonical model, the assumption of multiple patrons is explicit and is the foundation on which the entire analysis rests. In the DS model, the assumption of multiple Boolean opinions and their respective v_i(·) functions are implicit.
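The same correspondence can be sketched computationally: dividing the three relevance counters by n gives the mass, belief, and plausibility of a lexical subset. The helper below is an illustrative assumption, with LIB and VOL rendered in their power-set readings ('subset of X' and 'intersects X').

```python
from fractions import Fraction

def mass_bel_pl(opinions, X):
    """opinions: non-empty frozensets of keywords, one per cataloger; X: a lexical subset.
    Returns (m(X), Bel(X), Pl(X)) as fractions of the cataloger population."""
    n = len(opinions)
    r     = sum(1 for Y in opinions if Y == X)     # indexed the document in X directly
    r_lib = sum(1 for Y in opinions if Y <= X)     # indexed it in the library rooted at X
    r_vol = sum(1 for Y in opinions if Y & X)      # indexed it in a library intersecting X
    return Fraction(r, n), Fraction(r_lib, n), Fraction(r_vol, n)

opinions = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
m, bel, pl = mass_bel_pl(opinions, frozenset("AB"))
# m = 1/2, Bel = 1, Pl = 1: half the catalogers chose exactly {A, B},
# and every cataloger chose a subset of (hence also a set intersecting) {A, B}.
```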
4.3 Aggregating Relevance

So far, we have assumed that (i) relevance is a two-place function r(X, d) between a document d and a lexical subset X, and that (ii) all the catalogers from whom r(X, d) was elicited were of the same 'type,' using Maron's terminology (see the quote in Section 4.2). In this section we retract both assumptions. Specifically, we argue that relevance, in its most elementary form, is a three-place relation r(X, d, q) in which q is the classifier dimension, or context, in which d is judged to be relevant to X. With that in mind, r(X, d) can be viewed as a measure of aggregate relevance that runs over all the possible contexts in which d's relevance to X is judged. We now turn to describe a pooling mechanism that implements such an aggregation.

Let U1 = {u_1, ..., u_n1} be a group of n1 catalogers who are asked to index a document d using a keyword lexicon K based on a certain classifier, denoted q1. Similarly, let U2 = {u'_1, ..., u'_n2} be a group of n2 catalogers who are asked to index the same document based on another classifier, denoted q2. The semantics of the classifiers depends on the indexing scenario. For example, q1 might be the document's title, whereas q2 might be the document's abstract. Alternatively, in a dynamic model in which the relevance indexes of documents are continuously revised to reflect actual use, q1 and q2 can represent two different information needs, or queries, in the context of which the relevance of d to X was judged, either explicitly or through an automatic keyword extraction algorithm.

For example, let K = {A, B, C} and let U1 and U2 consist of 4 and 3 catalogers, respectively. Assume that within the U1 group, two catalogers index the document in {A, B}, one in {A}, and one in {B}. Within the U2 group, one cataloger indexes the document in {B, C}, one in {A, B}, and one in {B}. These indexing opinions are tabulated in the two tables on the left side of figure 5. The columns of each table represent the common lexicon K = {A, B, C}. The i-th tuple in each table represents, as a binary vector, the relevance opinion elicited from the i-th cataloger in the respective group. To be precise, a 1 in the (i, j)-th table entry indicates that cataloger i has included the j-th keyword in his indexing opinion, and a 0 indicates that he didn't.

Put figure 5 around here

In what follows, we denote the binary vector that represents the relevance opinion of cataloger u_i by w_i. Similarly, the set of all relevance opinions of a group of catalogers will be denoted W = {w_i | u_i ∈ U}. Finally, the group of catalogers U together with their relevance opinions W will be denoted T = <U, W> and referred to as a model. With this notation, consider two groups of catalogers U1 and U2 together with their relevance opinions W1 and W2. If all the catalogers in both groups are considered equally qualified to cast relevance opinions, then a variety of different pooling mechanisms may be used to compute the aggregate index induced by all the catalogers. Symbolically, we seek an operator ⊗ to compute the model <U, W> = <U1, W1> ⊗ <U2, W2>.

The pooling mechanism depicted in figure 5, denoted hereafter by ⊗, implements an operator that was described by Hummel and Landy (1988) as "a consensus opinion formed by the committees of two."
Here, the set of all possible committees is U = U1 × U2, consisting of all the n1 · n2 unique pairs of catalogers that can be drawn from U1 and from U2. The combined relevance opinion associated with the pair (u_i, u'_j) ∈ U is defined to be the binary conjunction of the individual opinions of u_i ∈ U1 and u'_j ∈ U2, which we denote w_{i,j} = w_i · w'_j. For example, consider the first tuple in the U table in figure 5. This tuple gives the opinion of the committee (u_1, u'_1), i.e. w_{1,1} = (0, 1, 0). This opinion is the binary conjunction of the individual opinions w_1 = (1, 1, 0) and w'_1 = (0, 1, 1), as given by catalogers u_1 and u'_1, respectively.

The pooling operation ⊗ is completed by treating U as a new group of catalogers and using (23) to compute the mass function that it induces:

    m'(X) = (1/(n1 · n2)) · |{(u_i, u'_j) ∈ U : the committee (u_i, u'_j) indexed the document in X}|

Note that m' is not necessarily a mass function, since ⊗ can yield a result like m'(∅) > 0. This happens when there is a pair of opinions (e.g. u_2 and u'_1 in our example) such that the conjunction of the opinions gives the empty set, even though neither opinion gives the empty set individually. To resolve the problem, we normalize m'(·) as follows:

    m(X) = m'(X) / (1 - m'(∅))  for X ≠ ∅,  and m(∅) = 0.

In words, for each lexical subset X ⊆ K, m'(X) is the fraction of the (paired) catalogers who classified the document in that subset. Next, the fraction of the catalogers who agreed on nothing - m'(∅) - is redistributed, in proportion, among the fractions of catalogers who agreed on something, yielding a new mass that sums to unity. This function is now taken to be the 'aggregate index' of the document d, implying the taxonomy depicted at the top right of the figure. We may also view m(X) as the fraction of (paired) catalogers who index the document in X among those paired catalogers who do not index the document in the empty set ∅. That is, if we discard the pairs that agree on no relevant keywords, the pooled relevance of the remaining pairs yields a proper mass function m.
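A short computational sketch of the committee construction, reproducing the figure 5 numbers: every pair of catalogers forms the component-wise conjunction of its two binary opinions, and the mass that lands on the all-zero vector (the empty set) is then normalized away. The vector encoding and variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Binary relevance opinions over K = (A, B, C), one vector per cataloger.
W1 = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 1, 0)]   # U1: {A,B}, {A,B}, {A}, {B}
W2 = [(0, 1, 1), (1, 1, 0), (0, 1, 0)]               # U2: {B,C}, {A,B}, {B}

# Committees of two: the pooled opinion is the component-wise conjunction.
committees = [tuple(a & b for a, b in zip(w1, w2)) for w1, w2 in product(W1, W2)]

n = len(committees)                                   # n1 * n2 = 12 pairs
m_prime = {X: Fraction(c, n) for X, c in Counter(committees).items()}

# m' may put mass on the all-zero vector (the empty set); normalize it away.
conflict = m_prime.pop((0, 0, 0), Fraction(0))
m = {X: v / (1 - conflict) for X, v in m_prime.items()}
# m == {(0,1,0): 7/10, (1,1,0): 1/5, (1,0,0): 1/10}
```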
Relationship to the theory of evidence: In order to explore the relationship of the multiple catalogers/multiple classifiers scenario to the DS model, we first have to step back and say a few words about the role of 'sources of evidence' in the latter. Basically, the DS theory models a situation in which a finite set of 'pieces' or 'sources' of evidence E = {e_1, ..., e_n} is used to discern the likelihoods of various possibilities X drawn from a common frame of discernment. Yet the identity of the sources of evidence is rather implicit in the model's language. That is, the common notation m_i(X) and Bel_i(X) is meant to be shorthand for the mass and belief functions m(X|e_i) and Bel(X|e_i), where e_i is the source of evidence whose 'support' of the possibility X we are trying to capture. The total support that the body of evidence E lends to X is computed through Dempster's rule (7-8), which yields a new function of the form m(X|e_1, ..., e_n) = m(X|e_1) ⊕ ... ⊕ m(X|e_n) (Dempster's rule (7-8) is commutative and associative, so its extension from 2 to n operands is straightforward). For simplicity's sake, we denote the latter function m(X), which reads 'the mass that the possibility X attains after all the available evidence has been taken into consideration.' With that, the relationship between the canonical model and the DS model is as follows: possibilities correspond to lexical subsets, and sources of evidence correspond to classifiers, i.e. to different aspects of the document (title, abstract, author, etc.) that help discern the document's proper classification. The missing piece in the analogy is the set of catalogers who inspect each classifier individually and cast Boolean relevance opinions based on that information. In the DS model, these catalogers are implicit. In the canonical model, they are the driving force of the entire analysis. Another way to interpret the group of catalogers is to view them as a group of library patrons who approach the same document with different information needs (or queries) in mind, each corresponding to a piece of evidence that highlights one facet of the composite relation that we call 'relevance.'

How should we combine this multitude of relevance opinions into an aggregate index? In the canonical indexing model, the opinions are combined at the catalogers level, through the cartesian consensus operator ⊗. In the DS model, where the catalogers space is implicit, the opinions are combined at the classifiers level, via Dempster's rule ⊕. The key point, as illustrated in figure 5, is that both combination methods lead to precisely the same result. Formally, we have the following proposition:

Proposition 2: Let T1 = <U1, W1> and T2 = <U2, W2> be two sets of catalogers together with their Boolean relevance opinions, and let T = <U, W> be the outcome of T = T1 ⊗ T2, as follows: (i) U = U1 × U2; and (ii) W = {w_{i,j} = w_i · w'_j | w_i ∈ W1 and w'_j ∈ W2}. Let ⊕ be Dempster's rule as it is applied to mass functions. Let m_{T1}, m_{T2}, and m_{T1⊗T2} be the mass functions induced by the models T1, T2, and T1 ⊗ T2. Then we have the following: m_{T1⊗T2} = m_{T1} ⊕ m_{T2}.

Proposition 2 serves to shed light on the prescriptive nature of Dempster's rule. That is, once we accept the fact that Dempster's rule ⊕ is isomorphic to the cartesian consensus operator ⊗, a whole set of questions emerges: (1) why are the individual catalogers forced to specify only Boolean, and not probabilistic, relevance opinions? (2) why are the groups of catalogers joined using a set product operator, as opposed to other set combination operators, e.g. union? (3) why committees of two, and not, say, committees of three? (4) why are the individual relevance opinions combined using a binary conjunction rule? (5) why are all cataloger opinions given the same weight, when in practice some opinions may be more informed or worthy than others?

A proper answer to these questions requires an elaborate research program, involving both theoretical and empirical work. Also, the exact nature of the combination rule can vary from one situation to another. In the specific context of information indexing and retrieval, one can think of a family of indexing models, designed to operate under different sets of assumptions. For example, if the catalogers prefer to express binary relevance opinions, we can use Dempster's rule (or the equivalent ⊗) to combine them. If they wish to express relevance by selecting a number between 0 and 1, we can modify the combination rule to accommodate this language as well (this will be similar to the way Yen (1989) extended Dempster's rule in the GERTIS system). If the catalogers wish to use a discrete language such as 'remotely relevant,' 'somewhat relevant,' etc., we can develop a fuzzy version of the rule. The key point here is that the precise definition of ⊗, along with Proposition 2, provides clear guidelines as to (i) which aspect of the combination rule has to be modified, and (ii) what the normative relationship will be between the modified rule, Dempster's rule, and probability theory.
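Before turning to the conclusion, Proposition 2 can be checked numerically on the figure 5 data: pooling the opinions with the cartesian consensus operator at the catalogers level, and combining the two induced mass functions with Dempster's rule at the classifiers level, give the same result. The implementation below is an illustrative sketch under these assumptions, not the paper's own software.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def induced_mass(opinions):
    """Mass function induced by a group of catalogers: the fraction, among
    those who express a non-empty opinion, whose opinion is exactly X."""
    nonempty = [X for X in opinions if X]
    return {X: Fraction(c, len(nonempty)) for X, c in Counter(nonempty).items()}

def consensus_pool(W1, W2):
    """Cartesian consensus: committees of two, opinions combined by intersection."""
    return [w1 & w2 for w1, w2 in product(W1, W2)]

def dempster(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    joint = Counter()
    for A, a in m1.items():
        for B, b in m2.items():
            joint[A & B] += a * b
    conflict = joint.pop(frozenset(), Fraction(0))
    return {X: v / (1 - conflict) for X, v in joint.items()}

# Figure 5 data: U1 has 4 catalogers, U2 has 3, over the lexicon K = {A, B, C}.
W1 = [frozenset("AB"), frozenset("AB"), frozenset("A"), frozenset("B")]
W2 = [frozenset("BC"), frozenset("AB"), frozenset("B")]

pooled   = induced_mass(consensus_pool(W1, W2))           # catalogers level
combined = dempster(induced_mass(W1), induced_mass(W2))   # classifiers level
assert pooled == combined
# Both routes give m({B}) = 7/10, m({A,B}) = 1/5, m({A}) = 1/10.
```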
5 Conclusion

All the major implications of the research were already discussed in the body of the paper. We conclude with several comments regarding (i) efforts to apply the DS model to information indexing and retrieval applications; and (ii) efforts to interpret the theory of evidence on logical or probabilistic grounds.

Information Indexing and Retrieval: One objective of the paper was to articulate a concrete relationship between the Dempster Shafer model and information indexing and retrieval applications. The relationship that we expounded can be summarized as follows:

    IR application                               Dempster-Shafer model
    -------------------------------------------  -------------------------------
    keyword lexicon (K)                          frame of discernment (Θ)
    taxonomy (<C, H>)                            named subset of 2^Θ
    classification criteria (q_i)                sources of evidence (e_i)
    groups of catalogers (U_j)                   implicit
    individual indexing opinions (v_i)           implicit
    relevance measure to class (r(X))            mass function (m)
    relevance measure to library (r_LIB(X))      belief function (Bel)
    relevance measure to volume (r_VOL(X))       plausibility function (Pl)
    relevance aggregation operator (⊗)           Dempster's rule (⊕)

We hope that the details of this 'mapping,' as discussed in the paper, will promote a better understanding of the proper way to apply the DS model to IR applications. In addition, the mapping provides a practical foundation for building a variety of different indexing algorithms. These algorithms can use the ⊗ combination rule, or versions thereof, as called for by the application. Ultimately, the success of one relevance calculus or another will depend on face validity and on field performance considerations.

The Dempster Shafer theory of evidence: Several authors have provided canonical examples that explain the rationale of the DS model by way of analogy. Zadeh (1986) illustrated how mass functions and Dempster's rule can be mapped onto fuzzy queries about interval-valued, rather than point-valued, attributes in a relational database. Gordon and Shortliffe (1985) gave a compelling interpretation of how a DS calculus can be used to represent and combine the degrees of belief that clinical symptoms (pieces of evidence) render to classes of bacterial organisms (disjunctions of hypotheses), whose set relationships form a hierarchy. Coming from a different, domain-independent, direction, Hummel and Landy (1988) analyzed the probabilistic foundation of the theory of evidence in general, without making any assumptions about the underlying domain or the logical structure of the hypotheses. In contrast to other researchers who attempted to interpret high-level constructs of the DS model directly (e.g. Baron, 1987, Kyburg, 1987, and Schocken and Kleindorfer, 1989), Hummel and Landy took a more fundamental viewpoint that showed how the theory's constructs are implicitly linked to statistics of the opinions of hypothetical experts.
However, their abstract mathematical analysis made no use of canonical examples, and it is difficult to interpret its implications for practical domains of application. With that in mind, one objective of this paper was to illustrate how constructs of the DS model that have so far defied straightforward interpretation yield to a plausible interpretation in the practical context of a multi-classifier/multi-cataloger model. We have seen, in Propositions 1 and 2, that the canonical model leads to exactly the same set of functions and formulae as the DS model. Hence, from a mathematical perspective, the canonical model is isomorphic to the DS model. Yet from a semantic perspective, it invokes the notion of multiple catalogers. Although the notion of multiple patrons appears in several major interpretations of bibliographical relevance (Maron and Kuhns, 1960, Maron, 1982), it may or may not exist in other applications.

To what extent, then, are we forced to accept the canonical interpretation of multiple catalogers in principle? One can simply reject the notion, avoiding the isomorphism by denying the possibility of multiple opinions, and relying simply on the DS theory as presented in Section 2. In that case, however, one is left with philosophical questions like Q1 through Q4. There could, of course, be other interpretations. However, in a real sense, all valid interpretations must be accepted or explained. That is, either the interpretation is accepted as is, or one must show how another set of semantic constructs provides a plausible interpretation of the theory. One advantage of our approach is that new calculi can be developed, different from the DS combination rule, that might better suit particular applications, based on modifications of the canonical model. It is precisely the unsatisfactory elements of this canonical model that permit us to systematically seek improved methods for managing uncertainty.

Since our analysis was strictly probabilistic, it seems to be consistent with Lindley's observation that "Anything that can be done with belief functions can better be done with probability theory" (Lindley, 1987, p. 38). However, we believe that this argument misses an important point. To use a crude but useful analogy, it would be unreasonable to write off a programming language like Pascal simply because every Pascal program can be rewritten in machine language. Just as high-level languages provide complex structures for dealing with specialized problems, the DS model provides non-elementary functions and operators that lend themselves nicely to certain domains of application, information indexing and retrieval being one such example.

We conclude that the Dempster Shafer theory of evidence provides an attractive framework for supporting information indexing and retrieval applications, and that these applications, in turn, serve to highlight the internal validity and limitations of the theory. Dempster's rule remains a controversial operator for combining degrees of belief, but this paper has illustrated that it is just one member in a family of parametric combination rules, and that the question of whether to use this rule or another is more a matter of reasoned choice than a matter of adhering to a fixed set of formulae.

Appendix: Proofs

Proposition 1: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. The real function m : 2^K → [0, 1] defined by m(X) = (1/n) · r(X) = (1/n) · sum over i = 1..n of v_i(X) is a mass function, satisfying definition (3).
Proof: For each class X ∈ 2^K, either all, some, or none of the catalogers indexed the document in X. Hence, r(X) = n, or 0 < r(X) < n, or r(X) = 0, respectively, implying that 0 ≤ m(X) ≤ 1. Hence, m(·) is a mapping from 2^K to [0, 1], satisfying the first requirement of being a mass function. The second requirement is that the function must sum to 1 over all the subsets of K. This is proved as follows. For each cataloger u_i, exactly one of the subsets X ⊆ K is such that v_i(X) = 1. For all other subsets Y, v_i(Y) = 0. Thus the sum over X ⊆ K of v_i(X) equals 1. We thus have the following:

    sum over X ⊆ K of m(X) = sum over X ⊆ K of (1/n) · sum over i = 1..n of v_i(X)
                           = (1/n) · sum over i = 1..n of ( sum over X ⊆ K of v_i(X) )
                           = (1/n) · n = 1

Further, since no cataloger gives ∅ as his opinion, it is always true that v_i(∅) = 0. Therefore, the third requirement of definition (3) is satisfied. Thus m is a mass function.

Definition of the ⊗ combination rule: Let U = {u_1, ..., u_n} be a set of catalogers with their Boolean relevance opinions v_1, ..., v_n : 2^K → {0, 1}. To denote the fact that the keyword k ∈ K was included in the indexing opinion of the i-th cataloger, we use the following notation:

    w_i(k) = 1 if v_i(X) = 1 and k ∈ X, and 0 otherwise.                               (29)

If K = {k_1, ..., k_m}, the binary vector obtained from w_i(k_1), ..., w_i(k_m) is denoted w_i and called the Boolean relevance opinion of u_i. The collection of all such opinions of members of U is denoted W = {w_i | u_i ∈ U}. To combine the relevance opinions of two sets of catalogers <U1, W1> and <U2, W2>, we use the following formulae:

    U = U1 × U2,  and for each pair (u_i, u'_j) ∈ U and each keyword k ∈ K:
    w_{i,j}(k) = w_i(k) · w'_j(k)

where w_i(k) and w'_j(k) are as defined in (29) for u_i ∈ U1 and u'_j ∈ U2.

In Section 4.2 we have shown how a mass function can be constructed from a set of catalogers (Proposition 1). Specifically, recall that the mass function induced by the model T = <U, W>, denoted hereafter m_T(X), gives the fraction of catalogers in U, among those catalogers who express an opinion (i.e. w_i ≠ 0), whose relevance opinion exactly matches X. These are the catalogers for whom w_i(k_j) = 1 if and only if k_j ∈ X. For T = <U, W>, this fraction can be written down exactly:

    m_T(X) = |{u_i ∈ U : w_i(k) = 1 if and only if k ∈ X}| / |{u_i ∈ U : w_i ≠ 0}|   for X ≠ ∅.

Of course, m_T(∅) = 0. We are now in a position to prove the following.

Proposition 2: Let T1 = <U1, W1> and T2 = <U2, W2> be two sets of catalogers together with their Boolean relevance opinions, and let T = <U, W> be the outcome of T = T1 ⊗ T2, as follows: (i) U = U1 × U2; and (ii) W = {w_{i,j} = w_i · w'_j | w_i ∈ W1 and w'_j ∈ W2}. Let ⊕ be Dempster's rule as it is applied to mass functions. Let m_{T1}, m_{T2}, and m_{T1⊗T2} be the mass functions induced by the models T1, T2, and T1 ⊗ T2. Then we have the following: m_{T1⊗T2} = m_{T1} ⊕ m_{T2}.

Proof: This proposition asserts a relationship between the general Dempster Shafer model and the canonical indexing model presented in Section 4. The fact that the mapping from one model to the other is homomorphic follows from Hummel and Landy (1988), but we will supply an independent argument here in the context of the indexing model. Let us assume that there are n1 catalogers in U1 and n2 catalogers in U2, and let us fix a particular non-empty lexical subset X of the lexicon K.
We wish to show that

    m_{T1⊗T2}(X) = (m_{T1} ⊕ m_{T2})(X).                                              (34)

Beginning with the right-hand side of (34) and using the definition of Dempster's rule ⊕, (m_{T1} ⊕ m_{T2})(X) is equivalent to

    [ sum over A ∩ B = X of m_{T1}(A) · m_{T2}(B) ] / [ sum over A ∩ B ≠ ∅ of m_{T1}(A) · m_{T2}(B) ]

Multiplying top and bottom by n1 · n2 and distributing, we obtain

    [ sum over A ∩ B = X of n1·m_{T1}(A) · n2·m_{T2}(B) ] / [ sum over A ∩ B ≠ ∅ of n1·m_{T1}(A) · n2·m_{T2}(B) ]   (36)

Recalling how mass functions are induced from the opinions of groups of catalogers (Eqn. 23 in Section 4.2), we may interpret this expression as follows. The value n1·m_{T1}(A) counts the number of catalogers in U1 who have indexed the document in the lexical subset A. Likewise, n2·m_{T2}(B) counts the number of catalogers in U2 who have indexed the document in the lexical subset B. Hence, the product n1·m_{T1}(A) · n2·m_{T2}(B) counts the number of distinct pairs of catalogers (u_i, u'_j) in U1 × U2 where u_i ∈ U1 has indexed the document in A and u'_j ∈ U2 has indexed the document in B. Now, according to the way ⊗ is defined, if u_i has indexed in A and u'_j has indexed in B, then the pair of catalogers (u_i, u'_j) ends up indexing the document in A ∩ B = X. Thus, the numerator of expression (36) counts all the cataloger pairs that end up indexing the document in X. Precisely the same argument can be used to show that the denominator of (36) counts all the pairs of catalogers who do not index the document in ∅. Thus (36) gives the fraction of cataloger pairs in U1 × U2 that have indexed the document in X, out of the pairs of catalogers in U1 × U2 who have indexed the document in some non-empty lexical subset, which is exactly the definition of m_{T1⊗T2}(X), the left-hand side of (34).

References

[1] J. Baron. (1987). Second-order probabilities and belief functions. Theory and Decision.

[2] G. Biswas, J.C. Bezdek, M. Marques, and V. Subramanian. (1987). Knowledge assisted document retrieval (I and II). Journal of the American Society for Information Science, 38(2):83-110.

[3] R. Buxton. (1989). Modeling uncertainty in expert systems. International Journal of Man-Machine Studies, 31:415-476.

[4] A.P. Dempster. (1967). Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38:325-339.

[5] A.P. Dempster. (1967). Upper and lower probability inferences based on a sample from a finite univariate population. Biometrika, 54:515-528.

[6] J. Gordon and E.H. Shortliffe. (1985). A method for managing evidential reasoning in a hierarchical hypothesis space. Artificial Intelligence, 26:323-357.

[7] R.A. Hummel and M.S. Landy. (1988). A statistical viewpoint on the theory of evidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):235-247.

[8] H.E. Kyburg. (1987). Bayesian and non-Bayesian evidential updating. Artificial Intelligence, 31:271-293.

[9] D.V. Lindley. (1987). The probability approach to the treatment of uncertainty in artificial intelligence and expert systems. Statistical Science, 2(1):17-24.

[10] M.E. Maron. (1982). Associative search techniques versus probabilistic retrieval models. Journal of the American Society for Information Science, 308-310.

[11] M.E. Maron and J.L. Kuhns. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7(3):216-244.

[12] G. Salton and C. Buckley. (1988). Term weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523.
[13] G. Salton and M.J. McGill. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.

[14] T. Saracevic. (1975). Relevance: a review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, Nov-Dec:321-342.

[15] S. Schocken and P.R. Kleindorfer. (1989). Artificial intelligence dialects of the Bayesian belief revision language. IEEE Transactions on Systems, Man, and Cybernetics, 19:1106-1121.

[16] G. Shafer. (1976). A Mathematical Theory of Evidence. Princeton University Press.

[17] G. Shafer. (1987). Probability judgement in artificial intelligence and expert systems. Statistical Science, 2(1):3-44.

[18] R.M. Tong and D.G. Shapiro. (1985). Experimental investigations of uncertainty in a rule-based system for information retrieval. International Journal of Man-Machine Studies, 22:265-282.

[19] H. Turtle and W.B. Croft. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222.

[20] J. Yen. (1989). GERTIS: a Dempster-Shafer approach to diagnosing hierarchical hypotheses. Communications of the ACM, 32(5):573-585.

[21] L.A. Zadeh. (1986). A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination. The AI Magazine, 85-90.

Figure 1: A pictorial description of the paper's methodology. Section 3 uses the terminology and rationale of the Dempster Shafer theory to derive a DS indexing model for IR applications (top arrow). Taking the opposite direction, Section 4 builds a canonical indexing model that is based on the domain-specific requirements of IR applications. As it turns out, the canonical model provides a probabilistic and domain-independent interpretation of the Dempster Shafer theory of evidence.

Figure 2: An excerpt from an art-related taxonomy designed to classify documents on major artists and artistic movements.

Figure 3: An illustration of the relationship among the mass (top), belief (left), and plausibility (right) functions that represent the same set of primitive degrees of support.
Figure 4: The evolution of a taxonomy from a lexical power set. Panel 4-a shows the lexical power set of K = {Braque, Picasso, Janco} (excluding the empty set); panel 4-b shows an adaptive taxonomy drawn from it; panel 4-c shows a 'frozen' and 'named' version of the same taxonomy, with the classes modern and cubism.

Figure 5: The individual indexing opinions of two groups of catalogers (U1 and U2) are recorded at the bottom of the figure. These opinions induce two different taxonomies and two different relevance functions, m(X, d, q1) and m(X, d, q2), depicted at the top of the figure. The combination of the relevance taxonomies via Dempster's rule ⊕ at the classifiers level and the combination of the opinions via the cartesian consensus rule ⊗ at the catalogers level lead to the same pooled index, depicted at the top right of the figure.