Probabilistic Expert Systems for Forensic
Inference from Genetic Markers

A. P. DAWID

University College London

J. MORTERA

Università Roma Tre

V. L. PASCALI

Università Cattolica del ‘‘Sacro Cuore’’, UCSC, Rome

D. VAN BOXEL

Carnegie Mellon University

ABSTRACT. We present a number of real and fictitious examples in illustration of a new

approach to analysing complex cases of forensic identification inference. This is effected by careful

restructuring of the relevant pedigrees as a Probabilistic Expert System. Existing software can

then be used to perform the required inferential calculations. Specific complications which are

readily handled by this approach include missing data on one or more relevant individuals, and

genetic mutation. The method is particularly valuable for disputed paternity cases, but applies also

to certain criminal cases.

Key words: Bayesian networks, DNA profile, forensic identification, incomplete evidence,

mutation, paternity testing

1. Introduction and background

In a simple problem of forensic DNA identification, we have a ‘‘trace’’ biological sample of

unknown provenance, and further ‘‘reference’’ samples obtained from known individuals. All

the samples are typed by DNA profiling, and it is desired to use the resulting data to shed light

on the origin of the trace sample. More complex problems also arise: thus in a case of disputed

paternity, DNA information on the child can be regarded as supplying partial information

about its father’s DNA (Dawid & Mortera, 1998).

DNA profiles as currently used consist of measurements on a number of genetic markers,

typically ‘‘short tandem repeat’’ (STR) markers (Weber & May, 1989), chosen by forensic

geneticists for their usefulness. For each marker we can observe its genotype, comprising two

genes (or bands), one inherited from the mother and the other from the father (although it is

not possible to observe which is which). Each marker has a finite number (up to around 20) of

possible values (alleles), generally taking positive integral values, for each of its two

constituent bands. The markers used for forensic identification are chosen to be located on

different chromosomes, and hence segregate independently. It is often reasonable to assume

random mating within an appropriate population, which induces both Hardy–Weinberg and

linkage equilibrium, so that different markers, as well as different bands of the same marker,

behave entirely independently. We shall assume this throughout.

Databases have been gathered from which the frequency distributions of the different

markers, in various populations, can be estimated. In this paper we use estimates based on

data collected by the Forensic Genetics Laboratory, Catholic University of the Sacred Heart

� Board of the Foundation of the Scandinavian Journal of Statistics 2002. Published by Blackwell Publishers Ltd, 108 Cowley Road,
Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA Vol 29: 577–595, 2002


(UCSC), Rome. However, as more data are collected, so these estimates may be refined (up-to-

date information on the UCSC database may be obtained from: http://www.mclink.it/

personal/MD1696/data/freqask.htm). Values used in this paper should not be regarded as

definitive.

We are here concerned with the general statistical problem of inferring identification on the

basis of the DNA and other evidence at hand (Dawid & Mortera, 1996). This can in principle

be solved by determining the relative likelihoods, induced by the full data, for the various

competing hypotheses. However, in many cases samples are not available for one or more

individuals of interest, and instead we only have indirectly relevant information, perhaps

through genetic typing of their relatives. Calculation of the desired likelihoods, on the basis of

such partial or incomplete data, can then become both conceptually and computationally

demanding, particularly when we allow for the possibility of mutation during gene

transmission.

Here we present a new approach to solving such a problem, by reformulating it as a

Probabilistic Expert System (PES): a joint graphical and numerical representation that can be

implemented and processed in general-purpose computer software. In a PES, complex inter-

relationships are broken down into simple modular units, out of which the entire graphical

representation is constructed. The resulting representation then forms a framework for the

application of fast and efficient computational algorithms.

A complex genetic pedigree fits particularly smoothly into the PES model. The nuclear

family relationships constitute natural modular building blocks of the representation, to such

an extent that terms such as ‘‘parent’’ and ‘‘child’’ have become part of the general

terminology of PES, even for entirely non-genetic applications. The conditional probability

tables required are simple and uncontroversial, being given by Mendelian laws of inheritance

and logical relationships between genes and genotypes. And the conditional independence

relations forming the backbone of a PES representation are likewise natural in the setting of

genetic inheritance, since, conditional on its parent’s genes, a child’s genes are entirely

independent of those of all other individuals (more precisely: of those of its non-descendants).

In this paper we exhibit how to pass from an initial pedigree representation of a forensic

identification problem to an appropriate graphical PES representation, and how to use

information on gene frequencies, Mendelian inheritance and mutation processes to set up the

numerical part of the PES specification. We describe how one can apply existing software to

calculate the desired likelihoods. The method is illustrated on a number of DNA identification

problems of varying degrees of complexity, including real paternity testing cases and an

artificial criminal identification example.

2. Disputed paternity

2.1. Basic set-up

In the simplest case of disputed paternity a man is alleged to be the father of a child, but

disputes this. DNA profiles are available on the mother m, the child c, and the putative father

pf. The disputed pedigree can be represented as in Fig. 1, where a square indicates a male and

a circle a female, and tf denotes ‘‘true father’’; grey indicates that a DNA profile is available

for that individual. On the basis of these data, we need to assess the likelihood function over

possible hypotheses as to the true father. Often these hypotheses are reduced to two: the true

father either is the putative father, or else is drawn randomly from the population, being

unrelated to the mother or putative father. Throughout this paper we shall make the latter

simplifying assumption whenever the true father is not otherwise identified.

578 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Because the markers used are on different chromosomes, and we are assuming random

mating, we have complete independence across markers. It follows that we can consider the

markers one at a time: we simply have to obtain the likelihood ratio on the basis of the data

for each marker separately, and finally multiply these values together to obtain the overall

likelihood ratio based on all the data on the full collection of markers.

So consider now the measured genotypes, from all three parties, for some fixed marker. To

find the associated likelihood function we need to calculate the joint probability of this triplet

of observed genotypes, under either hypothesis as to paternity. Making the reasonable

assumption that, before we have any data on the child, the identity of the true father is

independent of the profiles of the mother and the putative father, we can transfer attention to

the conditional probability of the child’s genotype, given the other two. Under either

hypothesis, this is calculated simply: under paternity, we just apply Mendel’s laws of

segregation; under non-paternity, we require (estimates of) the frequencies of relevant marker

alleles among the population at large. From these likelihoods we readily obtain the desired

likelihood ratio. Using Bayes’s Theorem, this can then be combined with the prior odds of

paternity, based on external evidence, in order to obtain the posterior odds for paternity.

As an illustrative example, suppose that the data, for marker FES, are: child’s genotype ¼
f12; 12g, mother’s genotype ¼ f10; 12g, putative father’s genotype ¼ f10; 12g. The population
frequencies of alleles 10 and 12 are, respectively, 0.28425 and 0.25942. In this case,

conditioning on the genotypes of mother and putative father, we see that the child’s genotype

will be as observed if and only if both the mother and the true father contributed allele 12 to

the child. This event has probability 0:5 � 0:5 if the true father is the putative father, and
probability 0:5 � 0:25942 if the true father is, instead, some unrelated individual from the
population. Thus the likelihood ratio in favour of paternity (based on these data for marker

FES alone) is 0:5=0:25942 ¼ 1:9274.

2.2. From pedigree to expert system

In the above simple problem the calculations are trivial, and have long been widely

implemented much as we have described (Essen-Möller, 1938): we certainly do not need to

develop any clever new methods of solution. However, we wish to extend our analysis to more

complex problems, particularly those with missing data, for which the calculations are by no

means trivial. Purely as a gentle lead-in to this extension, it will be valuable to reformulate the

simple paternity problem of section 2.1 above as a ‘‘Probabilistic Expert System’’, or PES

(Cowell et al., 1999).

A PES is a representation of a complex probability structure by means of a directed acyclic

graph, having a node for each variable, and directed links describing probabilistic causal

relationships between variables. The overall probability structure is completely determined by

specifying, as desired, the conditional probability tables for each variable given its ‘‘parents’’

in the graph. On the basis of probabilistic conditional independence properties embodied in

the graph, the complex global model then decomposes into simpler localized submodels,

providing a framework for the application of fast and efficient computational algorithms for

Fig. 1. Simple paternity pedigree.

Scand J Statist 29 Forensic inference from genetic markers 579

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


exact calculation of marginal and conditional probabilities, and much else beside (Dawid,

1992; Spiegelhalter et al., 1993). The calculations can be described as effected by ‘‘propaga-

tion’’ of information through the network: this involves efficient organization of simple

calculations affecting one local cluster of variables at a time, but spreading throughout the

whole network to yield the correct overall answers. These propagation algorithms have now

been implemented in widely available software, such as HUGIN (http://www.hugin.dk), GENIE

(http://www2.sis.pitt.edu/�genie) or XBAIES (http://www.staff.city.ac.uk/�rgc), thus enabling
many otherwise intractable complex problems to be solved.

Because we are at liberty to choose which unobserved variables to include in a PES

representation, there can be many such representations, some more manageable than others.

Finding an appropriate representation is crucial, as the efficiency, or even the viability, of the

computational routines is highly sensitive to the topology of the graphical structure. PES

construction is to some extent an art-form, but can be guided by scientific and logical

considerations. The main contribution of this paper—beyond advertising the simplicity and

benefits of general PES ideas, methods and technology—is to present and analyse what we

consider to be good representations for the type of paternity and identification problems we

address. However, for other problems, other kinds of representation may be better. The search

for good representations for specific problems is an important task for continuing research in

this area.

Our graphical PES representation of the simple disputed paternity problem (for a single

marker) is displayed in Fig. 2. Purely for presentational purposes we colour-code the nodes to

distinguish different types: grey again marks a node which is observed, while black represents a

disputed hypothesis. The arrows represent (sometimes degenerate) probabilistic influences.

For a detailed description of the semantics of such diagrams, see Cowell et al. (1999).

To attain the most efficient and simplified representation, we have aimed to represent the

problem at as deep and disaggregated a level as possible. First, we need to identify, and create

nodes for, all the interesting variables in the problem. These do not necessarily have to

correspond to observables, although all relevant observables must be represented. In our

model, in order to maximize the efficiency of the calculations as well as the logical clarity of the

representation we choose to disaggregate each individual’s genotype into its constituent,

unobserved, paternally and maternally inherited genes. We thus have a node pfmg

representing ‘‘putative father’s maternal gene’’, etc., as well as nodes for the observed

genotypes: mgt represents ‘‘mother’s genotype’’, etc. Table 16 provides a summary of all the

notation used in this article.

A related representation (see e.g. Thompson, 2000) has nodes for genotypes, but not for

their constituent bands: information as to whether a gene was inherited from the mother or

father is represented by means of additional (unobserved) binary meiosis or segregation

Fig. 2. Simple paternity network.

580 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


indicators. Our own representation is perhaps a little more transparent and straightforward, at

least for present purposes.

An important feature of Fig. 2 is the explicit introduction into the graph of the ‘‘target’’ (or

‘‘hypothesis’’ or ‘‘query’’) node, tf ¼ pf? (‘‘true father ¼ putative father?’’), representing
the hypotheses of interest. We are thus considering this query on exactly the same footing as

observable or latent genetic information. This is a natural step within the general PES

approach, although perhaps less natural from the standpoint of genetics, where graphical

modelling is usually restricted to representing pedigree structure. There is a large number of

powerful specialist pedigree analysis programs available that work with such pedigree

representations (e.g. LINKAGE (http://linkage.rockefeller.edu/soft/linkage/), MENDEL

(http://www.biomath.medsch.ucla.edu/faculty/klange/software.html), among others (see list

at http://linkage.rockefeller.edu/soft/list.html)); and any single one of the problems we

consider in this paper could probably be handled better by some one of those programs.

However, none of these has the degree of generality and extendibility of the more broadly-

based PES technology.

There are alternative ways of building and analysing PES representations, without explicitly

representing the hypothesis node in the graph. Indeed, for some problems, as in section 4.1

below, this may be the only efficient way to proceed. We prefer to include an explicit

hypothesis node wherever possible, since this is simpler to interpret, and allows one to read off

directly, simply by querying the ‘‘target’’ node, the quantity of most interest: the likelihood

ratio (for the relevant marker) in favour of paternity, on the basis of the observed evidence.

To complete our PES representation, we need to supply the numerical part of the

specification. This requires that we give, for each node in the network, the table of

probabilities for its various values, conditional on each configuration of values at its ‘‘parent’’

nodes (if any). We specify these tables as follows. At nodes corresponding to ‘‘founder’’ genes

we use population gene frequencies. A child’s maternal gene is obtained by drawing, at

random, one of it’s mother’s two (paternal and maternal) genes, mmg and mpg, and similarly

for its paternal gene (drawing from its father’s genes). The table for a genotype node is

degenerate, encoding the simple deterministic relationship between an unordered pair of

values and its two constituents. The true father’s paternal (maternal) gene is either identical

with the corresponding gene of the putative father, or else generated from the relevant

population gene-frequency distribution, depending on the value of the hypothesis node tf ¼
pf?.

At the hypothesis node itself, we could set an initial distribution to represent actual prior

beliefs about paternity, on the basis of other evidence in the case. However, we shall instead

insert a purely formal uniform prior at this node (prior probability of paternity ¼ 0.5, as in
Table 2)—for then the formally calculated ‘‘posterior odds’’ on paternity will in fact be

numerically identical with the likelihood ratio in favour of paternity based on this marker.

Under our assumption of independent markers, these values can then simply be multiplied

together to yield the overall likelihood ratio, based on all the markers; and this can then, if

desired, be combined with genuine prior beliefs to obtain the correct overall posterior odds on

paternity. (It would be possible, but more complicated, to handle the multi-marker data

directly in a sequential process: thus we could start with a genuine prior probability

distribution at the hypothesis node, and, introducing one new marker at a time, propagate the

associated evidence to obtain the induced posterior distribution at the hypothesis node, based

on that marker’s data—this then becoming the prior to be used with the next marker.)

Consider again the illustrative example of section 2.1, with case data: cgt ¼ f12; 12g,
mgt ¼ f10; 12g, pfgt ¼ f10; 12g. Tables 1, 2, 3, 4 and 5 show the (conditional) probability
tables corresponding to nodes pfpg, tf ¼ pf?, tfpg, pfgt and cpg, respectively, for

Scand J Statist 29 Forensic inference from genetic markers 581

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


marker FES. Only the actually observed alleles 10 and 12, and the aggregation of all

unobserved alleles, indicated by x, need to be represented in the probability tables.
After entering the case evidence at the relevant genotype nodes, we ‘‘propagate’’ it

throughout the network, using the software. We can then interrogate the hypothesis node to

find its updated probability distribution, conditional on the evidence. The required likelihood

ratio, based on the data for this marker, is obtained from the marginal formal posterior

distribution at node tf ¼ pf?, as given in Table 6: LR ¼ 0:6584=0:3416 ¼ 1:9274—in
agreement with the analysis of section 2.1.

Our purely technical use of an ‘‘artificial’’ uniform prior probability of 0.5, to derive

likelihood ratios, should be clearly distinguished from the common forensic practice of

calculating, and quoting in court, a ‘‘probability of paternity’’ based on this uniform prior

Table 1. Probability table for pfpg

pfpg: 10 12 x

0.28425 0.25942 0.45634

Table 2. Probability table for tf=pf?

tf=pf?: yes no

0.5 0.5

Table 3. Conditional probability table for tfpg given tf=pf? and pfpg

tf=pf? yes no

pfpg 10 12 x 10 12 X

tfpg: 10 1 0 0 0.28425 0.28425 0.28425

12 0 1 0 0.25942 0.25942 0.25942

x 0 0 1 0.45634 0.45634 0.45634

Table 4. Conditional probability table for pfgt given pfmg and pfpg

pfmg: 10 12 x

pfpg: 10 12 x 10 12 x 10 12 x

pfgt: 10–10 1 0 0 0 0 0 0 0 0

10–12 0 1 0 1 0 0 0 0 0

10–x 0 0 1 0 0 0 1 0 0

12–12 0 0 0 0 1 0 0 0 0

12–x 0 0 0 0 0 1 0 1 0

x–x 0 0 0 0 0 0 0 0 1

Table 5. Conditional probability table for cpg given tfmg and tfpg

tfmg: 10 12 x

tfpg: 10 12 x 10 12 x 10 12 x

cpg: 10 1 0.5 0.5 0.5 0 0 0.5 0 0

12 0 0.5 0 0.5 1 0.5 0 0.5 0

x 0 0 0.5 0 0 0.5 0.5 0.5 1

582 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


(Essen-Möller, 1938). Such a prior assumption will often be unreasonable. All the non-DNA

evidence in the case should be used to construct a sensible and defensible prior probability

which is then combined with the likelihood ratio (summarizing all DNA evidence) to form a

posterior probability. However, this is a task that should properly be left to the judge or jury.

We advocate the presentation, along with the overall likelihood ratio, of a table, such as

Table 8 (see section 2.3.1 below), where a range of posterior probabilities is given

corresponding to a range of possible prior probabilities. For a general discussion of the

presentation of statistical evidence in court, see Dawid (2002).

We again stress that we are not recommending the use of a PES for the simple paternity case

illustrated in this section, which can be solved by a couple of lines of simple algebra. It is

presented merely as an illustrative basic model of how a PES can be formulated, which is then

suitable for extension to the more complex cases that we treat below—cases that can not be

handled by simple algebra. In particular, the PES for these more complex problems can be

built out of the same fundamental local modules that we have already described for the simple

problem above.

2.3. Missing data

In certain cases, the DNA profiles of one or more of the ‘‘principal actors’’ in the story are not

available, but there is indirect evidence, in the form of DNA profiles of various known

relatives. Then simple arguments such as those of section 2.1 can no longer be applied, and

their appropriate extension to such a case may not be obvious or practicable. However, it is

still straightforward to construct and apply a Probabilistic Expert System representation of the

problem, this being now expanded to include the other measured individuals and the

relationships between all individuals involved. We illustrate this with two real examples from

the case-work of the UCSC Forensic Genetics Laboratory.

2.3.1. Paternity case 1

In this case, the only DNA samples available were on the disputed child c1, on the putative

father pf’s undisputed child c2 by a different mother, and on pf’s brother b. In particular, no

samples were available for the putative father pf, nor for his parents gf and gm, nor for the

mother of either child. The case data, and relevant allele frequencies, are given in Table 7.

Figure 3 displays the pedigree, and Fig. 4 the corresponding Probabilistic Expert System

network. The necessary conditional probability tables are formed exactly as before. The

Probabilistic Expert System again readily supports entering of the evidence (at the grey

‘‘observation nodes’’), rapid propagation of this evidence through the network, and

interrogation of the black ‘‘target’’ node to obtain the desired inference. Assuming

independence across markers, the overall likelihood ratio in favour of paternity is obtained

as the product of all terms in the last column of Table 7, viz. 13.066. Table 8 shows the implied

posterior probability of paternity (tf ¼ pf? ¼ yes), for various values of the prior
probability. Since the jury may hold or wish to consider a range of prior probabilities, such

a table can be a useful way of presenting the impact of the DNA evidence to the court.

Table 6. Posterior probability table for tf=pf?

tf=pf?: yes no

0.65840 0.34160

Scand J Statist 29 Forensic inference from genetic markers 583

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Table 7. Observed genotypes, and their frequencies, for incomplete paternity data, Case 1

Individual: b Frequency c1 Frequency c2 Frequency

Likelihood

ratio

Marker

TH01 7 0.165 6 0.263 6 0.263

9 0.200 6 0.263 9 0.200 0.672

VWA 17 0.298 15 0.107 15 0.107

17 0.298 17 0.298 17 0.298 1.788

D3S1358 15 0.278 15 0.278 15 0.278

17 0.220 17 0.220 16 0.233 1.450

FGA 18 0.006 22 0.163 23 0.163

26 0.028 26 0.028 26 0.028 8.954

TPOX 8 0.508 8 0.508 9 0.107

9 0.107 11 0.271 10 0.089 0.459

CSF1PO 10 0.250 10 0.250 10 0.250

12 0.380 12 0.380 12 0.380 1.504

D5S818 11 0.392 11 0.392 11 0.392

11 0.392 13 0.165 11 0.392 1.207

D7S820 9 0.130 10 0.240 9 0.130

12 0.160 13 0.027 10 0.240 0.428

D13S317 11 0.312 11 0.312 11 0.312

11 0.312 11 0.312 12 0.286 2.346

Overall 13.066

Fig. 3. Pedigree for incomplete paternity data. Case 1.

Fig. 4. Network for incomplete paternity data. Case 1.

Table 8. Posterior probability of paternity for Case 1

Prior probability: 0.001 0.01 0.1 0.3 0.5 0.7 0.9

Posterior probability: 0.013 0.117 0.592 0.848 0.929 0.968 0.992

584 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


2.3.2. Paternity case 2

In this problem, DNA profiles were available on the disputed child c1 and its mother m1, on

the putative father’s two full brothers b1 and b2, on the undisputed child c2 of the putative

father by another mother m2, and on m2. No samples were available for the putative father pf,

nor for his parents gf and gm. The pedigree is given in Fig. 5. The network for the

corresponding Probabilistic Expert System is now given by Fig. 6. The detailed case data and

allele frequencies (for 10 markers) are not given here, but can be supplied on request.

Notwithstanding the greatly increased complexity, it is once again completely straightforward,

using PES software, to enter the observed evidence, propagate, and interrogate the target node

to obtain the required likelihood ratio. This was done using GENIE: the overall likelihood

ratio obtained in favour of paternity was 1303.

Networks such as the above can also be used to analyse the possibility that the true father is

some individual in the pedigree other than the putative father. Further relatives, measured or

unmeasured, could be investigated by suitable extension of the pedigree and corresponding

elaboration of the network.

3. Mutation

A problem that can complicate forensic inference from DNA profiles is the possibility of

mutation of the DNA between generations. Indeed, the microsatellite markers typically used

for forensic purposes are known to be particularly prone to mutation, with overall mutation

rates of between 5 � 10�4 and 7 � 10�3 per generation (Brinkmann et al., 1998). It is thus
possible, for example, that a putative father may seem to be excluded by the evidence, whereas

in fact he is the true father, but mutation has led to his passing on a prima facie impossible

Fig. 5. Pedigree for incomplete paternity data. Case 2.

m1mgm1pg

m1gtyc1mg

tf=pf?

gfmggfpg

pfpg

gmmggmpg

pfmg

c1pg

c1gt

b2pg b2mgb1mgb1pg

b2gtb1gt

m2pg m2mg

m2gt c2mg c2pg

c2gt

tfpg tfmg

Fig. 6. Network for incomplete paternity data. Case 2.

Scand J Statist 29 Forensic inference from genetic markers 585

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


allele to the child. Particularly in cases where such ‘‘exclusion’’ is based on a single marker, all

other markers yielding data consistent with paternity, the possibility of a mutation may need

to be taken seriously.

Within a Probabilistic Expert System representation, mutation may be incorporated into the

process whereby a child’s (say) maternal gene is determined by his mother’s two genes. First

one of these is chosen at random, as before, to form the ‘‘original’’ (transmitted) gene; then

this spawns a new ‘‘actual’’ (inherited) gene, according to some specified mutation process. (In

fact our ordering of these two stages is the opposite of Nature’s, but the end result will be

equivalent so long as the same mutation process operates on the male and female germlines.

Our construction involves fewer mutation events, thus yielding a simpler and more efficient

graphical representation.) The ‘‘actual’’ genes are thus those which can be observed as

constituents of genotypes, while the ‘‘original’’ genes are unobservable.

With this extension, and using e.g. pfopg and pfapg to represent ‘‘putative father’s

original paternal gene’’ and ‘‘putative father’s actual paternal gene’’, the network for the

‘‘simple’’ paternity problem of section 2.2 is transformed into that shown in Fig. 7. In fact in

this new problem it is still possible to perform the calculations directly, without using a PES,

although the expressions now become considerably more complex (Dawid et al., 2001). Once

again, we develop the expert system formulation of this problem mainly as a model for

handling more complex cases.

In addition to the previous probability specifications, we need to specify the transition

matrix of mutation rates, whereby an original gene mutates into an actual gene. There are

various sources of data that can be used to supply estimates of overall mutation rates, but data

on rates of transition between specific alleles are sparse, so that we need to make some

tentative assumptions on the structure of the transition matrix. For real applications it will be

important to investigate the sensitivity of the conclusions to varying assumptions about

overall and transition-specific mutation rates (Dawid et al., 2001).

Table 9 gives estimated overall mutation rates obtained from data from UCSC, and from

Brinkmann et al. (1998), who investigated 11000 meioses for nine STR markers. We have used

the estimate l ¼ ðs þ 0:5Þ=ðn þ 1Þ, where s is the number of observed mutations and n the
number of observed meioses, which can be regarded as a Bayesian estimate under the Jeffreys

Beta 1
2
; 1
2

� �
prior distribution for a binomial proportion. In particular, this simple Bayesian

Fig. 7. Simple network with mutation.

586 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


computation avoids zero estimates. For markers where mutation data were not available the

highest observed rate, 2:69 � 10�3, was taken.
For a given marker, let the mutation transition matrix be denoted by Q ¼ ðqijÞ, where qij

denotes the probability of a mutation to ‘‘actual’’ allele j, from ‘‘original’’ allele i. It is
reasonable to assume equilibrium, i.e. that the vector of gene frequencies p ¼ ðp1; . . .; pkÞT is
constant over time. Then p must be a stationary distribution for the associated transition
matrix Q, i.e. pTQ ¼ pT.
Given p and the overall mutation rate l, we can construct a transition matrix Q having

reasonable properties. Stationarity will be assured if Q is chosen to satisfy the detailed balance
condition:

piqij ¼ pjqji: ð1Þ

We can restrict attention to states having pi > 0. Let S ¼ ðsijÞ be a symmetric matrix having
sijP0 for i 6¼ j and

P
j sij ¼ 0 for all i, and let k be an adjustable positive parameter. We define

qij ¼ ksij=pi ði 6¼ jÞ; ð2Þ

and

qii ¼ 1 �
X
j6¼i

qij ¼ 1 þ ksii=pi: ð3Þ

The detailed balance equation (1) will then be satisfied. The overall mutation rate is

l ¼ 1 �
X
i

piqii ¼ k � �
X
i

sii

 !
; ð4Þ

so that we must take

k ¼ l �
X
i

sii

 !
:

,
ð5Þ

In order to ensure qiiP0, all i, we require kO minif�pi=siig. Consequently, for given S there is
an upper limit on the overall mutation rate that can be obtained from this model.

It is biologically reasonable to assume that an allele is more likely to mutate to a

neighbouring allele than to one further away, and we choose the matrix S accordingly.
Specifically, we take sij ¼ aji�jj, for i 6¼ j, where a is a fixed constant: this is similar to the step-
wise mutational model of Valdes et al. (1993). The smaller is a, the greater is the probability
that a mutation will be to a closely neighbouring allele.

Using the overall mutation rates in Table 9 and the procedure described above, we obtain,

for each marker, the mutation transition matrix to be used at all nodes representing the

‘‘actual’’ transmitted gene, such as capg of Fig. 7. Table 10 shows a real example from UCSC

casework, analysed using the PES in Fig. 7. Tables 11 and 12 give the allele frequency

distribution and the corresponding mutation transition matrix, calculated as described above

with a ¼ 0:5, for marker VWA. Again, only the observed alleles in the case at hand, and x,
representing the aggregation of all other alleles, are required. (In principle such aggregation

could now lead to violation of the conditional independence properties represented in the

Table 9. Estimated overall mutation rates per thousand generations

Marker: F13 VWA D21S11 D1S80 TH01 FES MBP APO-B COL2A1

Mutation rate: 2.55 2.23 2.69 2.69 2.49 1.76 2.69 2.69 2.69

Scand J Statist 29 Forensic inference from genetic markers 587

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Expert System. Full correctness could be ensured by avoiding the aggregation step; however,

because of the low mutation rates, in practice the effect of such aggregation will be entirely

negligible.)

Note that, for markers F13 and VWA, if mutation was not a possibility the putative father

would be excluded by the evidence. For the ‘‘non-excluding’’ markers, i.e. those other than

F13 and VWA, introducing the possibility of mutation does not affect the likelihood ratio, to 3

significant figures. It does, of course, have a profound effect for F13 and VWA, changing a

zero to a non-zero value.

Using the above mutation model, the overall likelihood ratio, obtained as the product of the

entries in the last column of Table 10, is 0.00026, a very small value which effectively excludes

this putative father. However, if the data on F13 had been absent the likelihood ratio would

have been 0.347, which is by no means negligible if there is other incriminating evidence in the

case. We thus see that a single seeming exclusion may not in reality exclude, when the

possibility of mutation is taken into account.

When there are no ‘‘seeming exclusions’’ the effect of accounting for mutation will usually

be negligible—certainly so for simple cases with no missing individuals—and it can then safely

be ignored.

Table 10. Paternity case data

Marker Mother Child

Putative

father

Likelihood

ratio

F13 7 16 5 7 7 7 0.000749

VWA 14 17 17 19 16 17 0.00277

D21S11 28 34.2 32.2 34.2 32 32.2 5.013

D1S80 24 24 18 24 18 24 2.391

TH01 6 7 6 7 6 6 2.360

FES 11 11 11 11 10 11 1.358

MBP 1 5 1 5 1 1 1.491

APO-B 45 47 37 47 37 41 1.341

COL2A1 10 14 10 14 8 14 1.629

Overall 0.00026

Excluding F13 0.347

Table 11. Selected allele frequencies for marker VWA

Allele: 14 16 17 19 x

Frequency: 0.089 0.197 0.298 0.067 0.349

Table 12. Marker VWA: Conditional probability table for capg [resp. camg] given copg [resp. comg]

j

i 14 16 17 19 x

14 0.997272 0.000391 0.000196 0.0000489 0.00209

16 0.000177 0.998652 0.000354 0.0000884 0.000729

17 0.000058 0.000234 0.999109 0.000117 0.000482

19 0.000065 0.000260 0.000520 0.996377 0.00278

x 0.000533 0.000412 0.000412 0.000533 0.998110

588 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


4. Inference about identity

We now turn to consider a problem of a somewhat different nature.

Suppose a DNA trace t belonging to an unknown criminal or from an unidentified body is
found. One wishes to know if the trace belongs to any individual in a given pedigree P, on
some of whom we have DNA data; or to some other individual in the overall population I.

Following Dawid & Mortera (1996), we denote by C the random variable indicating the
unknown individual in I leaving the trace. The scene-of-crime or trace DNA evidence is

denoted by E : vC ¼ t. Furthermore we let EP: va ¼ n denote the measured DNA evidence for
a set a of identified individuals in pedigree P. We suppose that, in the absence of the trace
evidence vC, information on v, the DNA profiles for all members of the population I, would
be irrelevant to the identity of C. Using the notation for conditional independence of Dawid
(1979), this property is expressed as:

C ?? v: ð6Þ

Given all the evidence ðE; EPÞ, the likelihood Li of the hypothesis that the trace belongs to
individual i 2 I is given by:

Li / prðva ¼ n; vC ¼ tjC ¼ iÞ ¼ prðva ¼ n; vi ¼ tjC ¼ iÞ / prðvi ¼ tjva ¼ nÞ; ð7Þ

by (6). For any individual i 2 I unrelated to P, (7) is just the match probability prðvi ¼ tÞ.
For each marker, a single propagation in the probabilistic expert system representing the

pedigree P will calculate (7) for all i 2 P simultaneously. These likelihoods can then be
multiplied across markers (assuming independence), and combined with prior probabilities,

using Bayes’s theorem, to yield the posterior probability that C ¼ i for various i 2 I.

4.1. A murder example

To indicate the complexity which can be handled by this approach, we construct a network for

a fictional criminal case, taken from Egeland et al. (1997b). A mutilated murdered body has

been found, of unknown identity and of male sex. There are a number of individuals who it

could be, labelled I.1, II.1, II.2, III.1 and IV.1. There is also a possibility that it is none of these

(nor related to any of them), represented by i 2 U. DNA profiles are available from the body,
and from living individuals a ¼ fIV.2, IV.3, V.1g. All these individuals are known to be
related according to the pedigree P given in Fig. 8. We construct the associated Probabilistic

Fig. 8. Murder case: Complex pedigree.

Scand J Statist 29 Forensic inference from genetic markers 589

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Expert System, incorporating mutation, as in Fig. 9. The (fictional) genotype data, for a single

marker, are given in Table 13, and the associated allele frequencies in Table 14.

In the network we merely have to enter and propagate the DNA profile evidence, EP,

observed at the ‘‘data nodes’’ (genotypes for individuals IV.2, IV.3, V.1 in the pedigree), so

obtaining the likelihood Li, as given by (7), for each ‘‘suspect’’ individual i 2 P [ U. In
particular for any i 2 U the likelihood Li is simply the prior probability prðvi ¼ tÞ, i.e. the
(estimated) frequency in the data-base of type t genotype. Note that the network also provides
likelihoods for the female individuals I.2 and III.2, as well as for the living individuals IV.2,

Fig. 9. Network for murder case.

590 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


IV.3 and V.1, although in all these cases the corresponding prior probabilities prðC ¼ iÞ must
be zero. This is not an error, since the likelihood is a measure of the conditional probability of

the DNA data observed, under the hypothesis that a specific individual supplied the body;

under the assumptions incorporated in the model this conditional probability can be positive,

even when the posited hypothesis is known to be false. Of course, in such a case the value of

the likelihood is irrelevant: the posterior probability will be zero because the prior is.

Table 15 gives the likelihoods Li, both in the absence of mutation and allowing for the
possibility of mutation (using a simplistic 9-allele mutation transition matrix Q having
qii ¼ 0:96, qij ¼ 0:005, i 6¼ j).
The representation and analysis adopted for this problem are somewhat different from those

used in section 2 and section 3 above. The same result could in principle have been obtained

here by introducing, as there, a ‘‘query node’’, which now would have to be a ‘‘child’’ of all

individuals to whom the body could belong. However, that creates a large clique of connected

nodes, making such a representation inefficient for computation. Conversely, we could have

addressed the problems of section 2 and section 3 by the same logic as used here; however,

although this would have provided a straightforward and efficient approach to those

problems, we consider that the clarity and simplicity obtained by explicitly representing the

hypothesis node in the graph is a strong reason to do so whenever feasible.

The above example appeared in the unpublished preliminary version (Egeland et al., 1997b)

of the paper Egeland et al. (1997a). These papers describe and illustrate the program PATER

(Mostad & Egeland, 1998), based on a routine for organization of pedigree calcula-

tions—without, however, our more detailed structuring of the network. In Egeland et al.

(1997b) the authors state that PATER had not been able to solve the above problem

incorporating mutation, after running for 12 hours on a Sparc-Solaris workstation (their

calculations involve consideration of about 1014 different configurations). In contrast, when

Table 13. Genotype data, murder case

Individual: IV.2 IV.3 V.1 body

Genotype: ad ac ab ac

Table 14. Allele frequencies, murder case

Allele: a b c d x

Frequency: 0.01 0.15 0.05 0.35 0.44

Table 15. Likelihood Li for pedigree of Fig. 8

Individual i No Mutation Mutation

I.1 0.0084 0.0078

I.2 0.0084 0.0078

II.1 0.0158 0.0199

II.2 0.0084 0.0086

III.1 0.0330 0.0450

III.2 0.0330 0.0362

IV.1 0.4906 0.3859

IV.2 1 1

IV.3 1 1

V.1 0 0

Unrelated 0.001 0.001

Scand J Statist 29 Forensic inference from genetic markers 591

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


formulated as we have done, it can be solved, using the HUGIN software, instantaneously with

just one propagation. This underlines the vital importance of constructing an appropriate

initial representation of the problem, and using good computational algorithms. Mostad and

Egeland have since produced a new program, FAMILIAS (http://www.nr.no/familias), that

solves the above problem more efficiently, using a genetic ‘‘peeling algorithm’’ (Elston &

Stewart, 1971; Cannings et al., 1978) closely related to the general purpose algorithm used by

HUGIN.

5. Conclusions and related problems

This article has attempted to demonstrate the value of applying existing general-purpose

probabilistic expert system methodology and software to address problems of forensic

genetics. Its principal contribution should be seen as comprised in our suggestions for

translating such a problem into a suitable PES representation. Although we have largely

proceeded by example, certain general principles emerge: in particular, wherever possible one

should aim for a fine-grained representation (e.g. incorporating individual genes, not just

genotypes), and build the overall structure out of simple repeatable modules, simply connected

together. However, the best representation of any specific problem can vary from one problem

type to another.

In future work we plan to develop intelligent software to provide a suitable mix of

standardized and problem-tailored routines. We also propose to address departures from

some of the simplifying assumptions we have made here. Thus we have assumed Hardy–

Weinberg and linkage equilibrium, i.e., independence within and across markers; and, further,

that all founders in a pedigree, including unrepresented individuals, can be regarded as drawn

at random from the same homogeneous population. We aim to relax these requirements, e.g.

by considering populations composed of partially distinct subpopulations (Roeder et al., 1998;

Foreman et al., 1997; Dawid & Pueschel 1999). We have also assumed here that we have fully

accurate and relevant figures for gene frequencies and mutation rates. These assumptions will

be relaxed by extending the framework to allow Bayesian and other forms of statistical

learning of relevant parameters from data.

There is a wide range of further identification problems that can be tackled using

Probabilistic Expert System representations and computations, either along the lines set out

above or through extensions, elaborations or variations of them. We plan to study these with

the overall aim of attaching them to a suitable general representational framework.

One such problem, readily handled by the methods presented here, arises when a crime

suspect’s DNA is not available—perhaps he escaped the country—but DNA evidence is

available on some of his relatives. One then needs to compute the likelihood that the crime

trace belonged to the fugitive, given the DNA traces of some of his close relatives. This

problem can be handled in a similar fashion to paternity cases with missing data, thus avoiding

complex algebraic case-by-case computations. Analogous forensic problems arise when

someone has been kidnapped, a body part has been sent by the kidnappers, and, given DNA

profiles from the kidnapped person’s parents, sibling’s or other kin, one wishes to compute the

likelihood that the part belongs to the kidnapped person. The idea of transforming a complex

pedigree into a PES also has less gory applications in cases where a prospective immigrant

claims relationship with one or more citizens.

Another important and complex problem that can easily be formulated and solved using a

PES (Mortera, 2002) is the analysis of DNA profiles containing a mixture of genetic material

from two or more persons (Evett & Weir, 1998, ch. 7). This is common in rape cases, where a

sample may contain biological material from the victim, multiple perpetrators, and one or

592 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


more consensual partners; it can also occur in criminal cases where there has been a scuffle or

brawl, for example. Here again the most appropriate PES representation (generally close in

spirit to those considered here, albeit a little different in detail) may depend on the specific

problem and the identification question that is to be resolved.

A somewhat different kind of problem, unfortunately of increasing importance, is that of

identification of multiple remains following disasters such as wars, fires, or earthquakes. A

celebrated instance was the discovery in Yekaterinburg of human remains thought to be those

of the executed last Tsar Nicholas of Russia, his family and servants (Gill et al., 1994). There

was a collection of skeletons, of ascertainable sex and (in some cases) approximate age, from

which DNA profiles were obtained. A number of possible pedigrees relating the skeletons to

each other and to other known individuals (including Prince Philip, Duke of Edinburgh) can

be entertained, and a probabilistic network developed to describe each of these, allowing

determination of the most likely pedigree given the evidence. The information on sex and age

reduces the very large number of possible pedigrees and establishes plausible terminal nodes,

i.e. children who cannot have had offspring. The program FAMILIAS can be used to automate

the handling of such multiple pedigree problems (Egeland et al., 2000). Further development is

envisaged to integrate its features with the approach presented here.

The exact computational approach embodied in a PES analysis, like that of its forerunners

in algorithms for peeling genetic pedigrees, is widely but not universally applicable. In

particular, for even quite small ‘‘loopy’’ pedigrees exact analysis rapidly becomes computa-

tionally intractable even with the most sophisticated algorithms. In such cases it is usual to

resort to approximate methods, such as Gibbs sampling or similar Monte Carlo Markov chain

techniques (Gilks et al., 1996). Methods such as ‘‘blocking Gibbs’’ (Jensen et al., 1995)

combine features of both approaches, and hold out promise of extending the range of

problems that can be handled by the type of graphical representation we have considered here.

Here too, identification of an appropriate graphical representation will be essential for

computational feasibility.

Table 16. Summary of the notation used in this article

Notation Definition

pf putative father

tf true father

m mother

m1, m2 mother 1, mother 2

c child

c1, c2 child 1, child 2

b brother

b1, b2 brother 1, brother 2

gf grandfather

gm grandmother

mg maternal gene

pg paternal gene

gt genotype

apg actual paternal gene

amg actual maternal gene

opg original paternal gene

omg original maternal gene

pfpg, pfmg, pfgt, etc putative father’s paternal gene, maternal gene, genotype, etc.

mapg, mopg, etc mother’s actual, original paternal gene, etc.

I.x, II.x, etc individual x, x ¼ 1; . . . ; n, of generation I, II, etc.
tf=pf? query node: ‘‘true father = putative father?’’

Scand J Statist 29 Forensic inference from genetic markers 593

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Acknowledgements

We wish to thank Hugin Expert A/S for allowing free use of their software. Also, since the

graphical interfaces to current PES software are not optimized for the kinds of problem we

treat here, we are indebted to Robert Cowell for access to a preliminary version of his program

FINEX (Cowell, 2001), which eases the specification task, constructs HUGIN networks, and

conducts likelihood analyses. We are grateful to Michael Boehnke, Marina Dobosz, Thore

Egeland, and Steffen Lauritzen for valuable discussions and assistance. The comments of an

anonymous referee helped to improve the paper. The work was supported by grants from

MURST, CNR and ESF (HSSS).

References

Brinkmann, B., Klintschar, M., Neuhuber, F., Hühne, J. & Rolf, B. (1998). Mutation rate in human

microsatellites: influence of the structure and length of the tandem repeat. Amer. J. Human Genet. 62,

1408–1415.

Cannings, C., Thompson, E. A. & Skolnick, M. H. (1978). Probability functions on complex pedigrees.

Adv. Appl. Probab. 10, 26–61.

Cowell, R. G. (2001). FINEX: Forensic identification by network expert systems. Research Report 22,

Department of Actuarial Science and Statistics, The City University, London.

Cowell, R. G., Dawid, A. P., Lauritzen, S. & Spiegelhalter, D. J. (1999). Probabilistic networks and expert

systems. Springer Verlag, New York.

Dawid, A. P. (1979). Conditional independence in statistical theory (with discussion). J. Roy. Statist. Soc.

Ser. B 41, 1–31.

Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems.

Statist. Comput. 2, 25–36.

Dawid, A. P. (2002). Bayes’s theorem and weighing evidence by juries. In Bayes’s theorem (ed. R.

Swinburne). Proc. British Acad. 113, in press.

Dawid, A. P. & Mortera, J. (1996). Coherent analysis of forensic identification evidence. J. Roy. Statist

Soc. Ser. B 58, 425–443.

Dawid, A. P. & Mortera, J. (1998). Forensic identification with imperfect evidence. Biometrika 85, 835–

849.

Dawid, A. P. & Pueschel, J. (1999). Hierarchical models for DNA profiling using heterogeneous

databases. In Bayesian statistics 6 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith),

pp. 187–212. Oxford University Press, Oxford.

Dawid, A. P., Mortera, J. & Pascali, V. L. (2001). Non-fatherhood or mutation? A probabilistic approach

to parental exclusion in paternity testing. Forensic Sci. Internat. 124, 55–61.

Egeland, T., Mostad, P. F. & Olaisen, B. (1997a). A computerised method for calculating the probability

of pedigrees from genetic data. Sci. Justice 37, 269–274.

Egeland, T., Mostad, P. F. & Olaisen, B. (1997b). Probability assessments of family relations. Technical

Report, Norwegian Computing Center, Oslo, Norway.

Egeland, T., Mostad, P. F., Mevåg, B. & Stenersen, M. (2000). Beyond traditional paternity and

identification cases: selecting the most probable pedigree. Forensic Sci. Internat. 110, 47–59.

Elston, R. C. & Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Human

Heredity 21, 523–542.

Essen-Möller, E. (1938). Die Beweiskraft der Ähnlichkeit im Vaterschaftsnachweis. Theoretische

Grundlagen. Mitt. Anthropol. Ges. 68, 9–53.

Evett, I. W. & Weir, B. S. (1998). Interpreting DNA evidence. Sinauer, Sunderland, MA.

Foreman, L. A., Smith, A. F. M. & Evett, I. W. (1997). Bayesian analysis of deoxyribonucleic acid

profiling data in forensic identification applications (with discussion). J. Roy. Statist. Soc. Ser. A 160,

429–469.

Gilks, W. R., Richardson, S. & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice.

Chapman & Hall, London.

Gill, P. E., Ivanov, P. L., Kimpton, C., Piercy, R., Benson, N., Tully, G., Evett, I., Hagelberg, E. &

Sullivan, K. (1994). Identification of the remains of the Romanov family by DNA analysis. Nature

Genetics. 12, 417–420.

594 A. P. Dawid et al. Scand J Statist 29

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.


Jensen, C. S., Kong, A. & Kjœrulff, U. (1995). Blocking Gibbs sampling in very large probabilistic expert

systems. Internat. J. Human–Comput. Stud. 42, 647–666.

Mortera, J. (2002). Analysis of DNA mixtures using probabilistic expert systems. In Highly structured

stochastic systems (eds P. J. Green, N. L. Hjort & S. Richardson). Oxford University Press, Oxford, in

press.

Mostad, P. F. & Egeland, T. (1998). Probability assessments of family relations using the program

‘‘pater’’. Technical Report, Norwegian Computing Center, Oslo, Norway.

Roeder, K., Escobar, M., Kadane, J. B. & Balazs, I. (1998). Measuring heterogeneity in forensic databases

using hierarchical Bayes models. Biometrika 85, 269–287.

Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L. & Cowell, R. G. (1993). Bayesian analysis in expert

systems (with discussion). Statist. Sci. 8, 219–283.

Thompson, E. A. (2000). MCMC estimation of multi-locus genome sharing and multipoint gene location

scores. Internat. Statist. Rev. 68, 53–73.

Valdes, A. M., Slatkin, M. & Freimer, N. B. (1993). Allele frequencies at microsatellite loci: the stepwise

mutation model revisited. Genetics 133, 737–749.

Weber, J. L. & May, P. E. (1989). Abundant classes of human DNA polymorphism which can be typed by

the polymerase chain reaction. Amer. J. Human Genet. 44, 388–397.

Received November 2000, in final form October 2001

A. Philip Dawid, Department of Statistical Science, University College London, Gower Street, London
WC1E 6BT, UK.
E-mail: dawid@stats.ucl.ac.uk

Scand J Statist 29 Forensic inference from genetic markers 595

� Board of the Foundation of the Scandinavian Journal of Statistics 2002.