Simultaneous estimation of per cell division mutation rate and turnover rate from bulk tumor sequence data


Simultaneous estimation of per cell division

mutation rate and turnover rate from bulk

tumour sequence data

Gergely Tibély1,2, Dominik Schrempf2, Imre Derényi2,3, Gergely J. Szöllősi1,2,4

1MTA-ELTE “Lendület” Evolutionary Genomics Research Group, Pázmány P. stny.
1A, H-1117 Budapest, Hungary

2Department of Biological Physics, Eötvös University, Pázmány P. stny. 1A, H-1117
Budapest, Hungary

3MTA-ELTE Statistical and Biological Physics Research Group, Pázmány P. stny.
1A, H-1117 Budapest, Hungary

4Institute of Evolution, Centre for Ecological Research, Konkoly-Thege M. út 29-33.

H-1121 Budapest, Hungary

February 12, 2021

Abstract

Tumors often harbor orders of magnitude more mutations than heal thy tis-
sues. The increased number of mutations may be due to an elevated mutation
rate or frequent cell death and correspondingly rapid cell turnover leading to
an increased number of cell divisions and more mutations, or some combina-
tion of both these mechanisms. It is difficult to disentangle the two based on
widely available bulk sequencing data where mutations from individual cells are
intermixed. As a result, the cell linage tree of the tumor cannot be resolved.
Here we present a method that can simultaneously estimate the cell turnover
rate and the rate of mutations from bulk sequencing data by averaging over
ensembles of cell lineage trees parameterized by cell turnover rate. Our method
works by simulating tumor growth and matching the observed data to these
simulations by choosing the best fitting set of parameters according to an ex-
plicit likelihood-based model. Applying it to a real tumor sample, we find that
both the mutation rate and the intensity of death is high.

Author Summary

Tumors frequently harbor an elevated number of mutations, compared
to healthy tissue. These extra mutations may be generated either by an in-
creased mutation rate or the presence of cell death resulting in increased
cellular turn over and additional cell divisions for tumor growth. Sepa-
rating the effects of these two factors is a nontrivial problem. Here we
present a method which can simultaneously estimate cell turnover rate
and genomic mutation rate from bulk sequencing data. Our method is
based on the maximum likelihood estimation of the parameters of a gen-
erative model of tumor growth and mutations. Applying our method to a

1

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


human hepatocellular carcinoma sample reveals an elevated per cell divi-
sion mutation rate and high cell turnover.

1 Introduction

Cancer is an evolutionary phenomenon within a host organism that unfolds on
the timescale of years or more. New mutations can appear with each cell di-
vision, while cells can also die for reasons such as lack of nutrients or immune
reactions. Due to the limitations of bulk sequencing, which only essays muta-
tion frequencies for a population of cells from each tumor sample and does not
resolve individual cells’ genotype, basic evolutionary parameters. In particular,
the cell turnover rate and per cell division mutation rate remain unknown, with
estimated values spanning several orders of magnitudes [1].

While tumors can contain a large number of mutations, it is not clear whether
this is due to an elevated mutation rate or frequent cell death, as frequent cell
death results in more cell divisions, which, in turn, gives rise to more mutations.
There are arguments for both cases [2, 3, 4, 5, 6], but distinguishing between
these two alternatives is difficult becasue we cannot resolve the tumor’s cell
lineage tree from bulk sequencing data.

In previous work [2, 1], an elevated number of mutations was observed,
but only the combined effect of the mutation rate and the death rate could
be estimated. Williams et al. [7] targeted the problem of separating these two
quantities by separately sequencing in bulk multiple samples from the same
tumor thus resolving a coarse grained cell linage tree. However, it is not clear
whether this approach resolves the cell lineage tree in sufficient detail to identify
the regime of frequent cell death when the number of mutations is orders of
magnitude larger than under growth without cell death.

Here, we describe a method to simultaneously estimate the per cell division
mutation rate and the turnover rate (the ratio of death and birth rates) of a
tumor from bulk sequencing data. The estimation is based on a maximum like-
lihood fit of the parameters of a birth-death model to the measured mutant and
wild-type read counts. While requiring only a single tumor-normal sample pair,
the fitting procedure can differentiate between death rates, which are extremely
close to the critical value where the birth rate equals the death rate, resulting
in accurate estimation of the mutation rate across orders of magnitudes. The
rest of the paper is structured as follows. After introducing our model and the
fitting procedure in Sec. Methods, we assess model accuracy on simulated data
in Sec. Results on synthetic data. Results on empirical data are described in
Sec. Results on empirical data, and conclusions are given in Sec. Discussion.

2 Methods

Model

We describe the evolution of tumor cells with the cell linage tree, i.e. the bi-
furcating tree traced out by cell divisions. As cells that have died cannot be
observed by sequencing we consider the tree spanned by surviving cells. The
leaves of this tree correspond to extant cells and internal nodes to observed cell

2

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


divisions. To model the descendance of the extant cells, we employ the con-
ditioned birth-death process with birth rate α and death rate β and a fixed
number of cells n sampled [8]. We measure branch lengths in numbers of ex-
pected birth events (the product of birth rate and time). Consequently, the role
of the birth rate α can be considered as a scaling constant that sets the unit of
time and we consider it to be equal to 1 without loss of generality. As result the
death rate determines to the turnover rate: t = β/α = β.

Mutations occur with a rate µ per site per cell division, mutations are con-
sidered neutral and we neglect the probability that a site is hit by a mutation
by more than one time, in accordance with the infinite site hypothesis.

The data available from bulk sequencing is the mutant and wildtype read-
counts of sites. Therefore, we will use the site frequency spectrum, which can
be estimated from readcount data, to separate the effects of the mutation rate
and cell death intensity. The site frequency spectrum reflects the branch length
distribution of the tumors cell lineage tree. The tree’s leaves are the sequenced
cells, and its root is the most recent common ancestor of these cells. Chang-
ing the turnover rate modifies the shape of the cell linage tree by changing the
relative lengths of branches closer to the root compared to terminal ones and
as a result modifying the site frequency spectrum. Changing the mutation rate,
on the other hand, simply results in more mutations, thus leaving the shape
of the tree, and by extension the overall shape of the site frequency spectrum
unchanged (see Fig. 1). It should be noted, however, that the information we
will use is more detailed than the site frequency spectrum, namely, the read
count pairs of mutated sites, which contain more information than just one
rational number. E.g., 10 mutant reads out of 30 reads and 1 out of 3 both
contribute 1 mutation to the frequency 1/3, while the uncertainty of the first
case is significantly lower than for the second case. It also makes quite straight-
forward to include nucleotide-dependent transition probabilites, or trinucleotide
context-based effects.

Site frequency spectra derived from tumors alos contain the effects of the
ploidy of the sites and the contamination of the sample by normal cells. The
corresponding spectrum is termed Variant Allele Frequency (VAF) spectrum.
VAF frequencies are also affected by the finite sequencing depth, which gives
rise to a stochastic variation in the observed allele frequencies.

Throughout the paper, the following notation is used: Branches of the cell
lineage tree are denoted by the index k, the length of branch k is denoted by lk
and L =

∑
k lk denotes the sum of all branch lengths in the tree. The numbers

of mutations per site from cell divisions along branch k are Poisson distributed
and their sum is also Poisson distributed. The number of expected cell divisions
along branch k is lk, therefore, the distribution of the number of mutations per
site on branch k is a Poisson distribution with parameter lkµ. Similarly, the
total number of mutations is Poisson distributed with parameter Lµnsites.

Inference

To compare different combinations of mutation and turnover rates describing
the observed empirical data we employ a maximum likelihood approach. First,
we derive the likelihood of the observed data, L(D|µ,t), as a function of the
mutation rate and the turnover rate. As described below we maximize this
likelihood function averaged over a random sample of cell lineage trees with

3

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


Figure 1: Two possible scenarios for the generation of mutations along cell lin-
eage trees. a): different turnover rates lead to different lineage tree shapes. Bifur-
cations are cell divisions, leaves are cells comprising the bulk sequencing sample.
Note that the (surviving) tree topologies are the same, only branch lengths dif-
fer. b): mutations, symbolized by purple stars, accumulate at cell divisions. High
turnover rate and low mutation rate can lead to the same number of observed
mutations as low turnover rate and high mutation rate, however, the mutation
spectrum of the trees are different. c): For simulated trees of 10000 leaves, the
differences in the branch length distribution are clearly visible. d): VAFs of the
mutation spectra. Fractions of mutant cells are binned (note the logarithmic
scale). Ploidy is set to two, contamination is zero. Simulated sequencing depth
is 1000.

fixed turnover rate t in order to estimate the parameters t and µ that are most
likely to have generated the observed data.

4

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


First we derive L(D|µ,T ) the likelihood of the observed data for a fixed cell
lineage tree T . It is assumed that sites collect mutations independently of each
other, consequently, L(D|µ,T ) takes the form of a product over sites:

L(D|µ,T ) =
∏
i in
sites

p(mi|µ,T ,ri) (1)

where mi is the number of reads exhibiting a mutation at site i, and ri is the
total number of reads covering site i. To calculate the probability of observing mi
mutant reads out of a total of ri reads we consider the following alternatives: i) if
mi = 0 either a mutation occurred, with probability Pmut(µ,L) = 1−exp(−µL)
(see also Sec. Methods), but no mutant read, i.e. mi = 0 was observed out of
ri reads with probability F[0,ri,T ], or no mutation occurred with probability
1 − Pmut(µ,L) or ii) a mutation occurred with probability Pmut(µ,L) and mi
mutant reads where observed out of ri, with probability F(mi,ri,T ):

p(mi|µ,T ,ri) =

=

{
Pmut(µ,L) ·F(0,ri,T ) + (1 −Pmut(µ,L)) , mi = 0
Pmut(µ,L) ·F(mi,ri,T ) mi > 0

(2)

To compute the probability F(m,r,T ) of observing m mutant reads out of r
total reads given the cell linage tree T , we assume that the mutant reads descend
from a single mutation that occurred at somepoint along branch k, which has a
length lk and from which a fraction fk of sequenced cells descend, and take the
sum over all branches:

F(m,r,T ) =
∑
k

lk
L
· Binom(m,r,fk) =

∑
k

lk
L
·
(
m

r

)
·fmk (1 −fk)

r−m, (3)

where L =
∑

k lk and Binom(m,r,fk) is the probability mass function of the
binomial distribution, i.e. the probability of getting exactly m successes in r
independent Bernoulli trials with a probability of success fk. We consider mul-
tiple mutations at the same site as a single mutation, and neglect all subsequent
mutations after the first one. In all our applications we verified that µL � 1 is
fulfilled.

To take into consideration sequencing errors, we must consider that they can
lead to an excess of false mutation reads. To account for sequencing error, we
introduce a parameter ε denoting the probability of a sequencing error at each
position of each read. For ε > 0 new possibilities arise: mutant reads can be real
mutants or false mutants, due to sequencing errors. It is also possible that all
mutant reads at a site are false mutants, and there may or may not be a real
mutation. We neglect the case when two or more mutations happen at the same
site. Each position in each read can now be either wild type, false wild type,
real mutant or false mutant,

5

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


p(mi|µ,T ,ri,ε) ≈

≈ Pmut(µ,L)
∑
k

lk
L

ri!

(ri −mi)!mi
(
fk(1 −ε) + (1 −fk)ε

)mi·
·
(
(1 −fk)(1 −ε) + fkε)

)ri−mi
+ (1 −Pmut(µ,L))

(
ri
mi

)
εmi (1 −ε)ri−mi (4)

Note that Eq. 4 contains the mi = 0 case.
Finally, we introduce the ability to differentiate mutant types, to conform

the case of real DNA, which has 3 possible mutant types. So far, it was assumed
that each site can have 2 states, wild type or mutant, corresponding to a DNA
consisting of only two types of nucleotides, instead of four. Therefore, instead of
the mutant read count m, we introduce three mutant read counts, corresponding
to the three possible mutant types, m(1),m(2),m(3). Consequently, the input
data now consists of triplets of mutant read counts, instead of one scalar mutant
readcount. This leads to the use of a multinomial distribution, with four states:
wild type and 3 mutant types. The possibility of more than one real mutant types
at the same site is still neglected, being very rare, technically a second-order
process in the mutation probability of one site. We also neglect the probability
of more than one error hitting the same site. The likelihood function at a single
site is then

p(m
(1)
i ,m

(2)
i ,m

(3)
i |µ,T ,ri,ε) ≈

≈ Pmut(µ,L)
∑
k

lk
L

1

3

3∑
j=1

Mult

((
m

(j)
i ,m

(j+1)
i ,m

(j+2)
i ,ri −

∑
j′

m
(j′)
i

)
;

ri;
(
p

(j)
m (fk,ε),p

(j+1)
m (fk,ε),p

(j+2)
m (fk,ε),pw(fk,ε)

))
+

+ (1 −Pmut(µ,L)) · Mult

((
m

(1)
i ,m

(2)
i ,m

(3)
i ,ri −

∑
j

m
(j)
i

)
; ri;

(
p

(j)
m (0,ε),p

(j+1)
m (0,ε),p

(j+2)
m (0,ε),pw(0,ε)

))
(5)

and

p
(j)
m (fk,ε) = fk(1 −ε) + (1 −fk)ε/3 (6)

p
(j+1)
m (fk,ε) = fkε/3 + (1 −fk)ε/3 (7)

p
(j+2)
m (fk,ε) = fkε/3 + (1 −fk)ε/3 (8)
pw(fk,ε) = fkε/3 + (1 −fk)(1 −ε) (9)

where (j + 1) and (j + 2) denote the other two possible mutant types with cyclic
notation (j) = (j + 3), and Mult is a multinomial distribution the arguments
of which denote (random variables), ndraws, (event probabilities). The factor of
1/3 is due to the assumption that only one true mutation can be present at one

6

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


site, and each of the 3 possible mutated forms has the probability 1/3. Each
Mult(m(j) . . .) term is a conditional probability conditioned on the true mutant
type being (j).

The straight-forward approach for treating the unknown cell linage tree as
a nuisance parameter would be to average over all trees T :

L(D|µ,t) =
∑
T
L(D|µ,T ) ·pBD(T |t), (10)

where pBD(T |t) is the probability of the cell lineage tree T given conditioned
birth-death process with turnover rate t. Due to the very large number of possi-
ble trees the above average, however, is intractable and we must results to sam-
pling a finite number of trees drawn from the conditioned birth-death process
with fixed t. Based on empirical experience we found that using the geometrical
mean of L(D|µ,T ) for a finte sample of trees sampled according to pBD(T |t)
results in more robust inference. The geometric mean approximates the aver-
age probability of inference [29] or equivalently the average surprisal [30] of cell
linage trees given the turnover rate t, which we denote

L̄(D|µ,t) =
∏
T
L(D|µ,T )pBD(T |t) = exp

(∑
T
pBD(T |t) lnL(D|µ,T )

)
. (11)

In practice, during inference of the turnover rate t and mutation rate µ the
log-average over a finte number of trees drawn from the conditioned birth-death
process with turnover rate t is maximized:

lnL̄(D|µ,t) =
1

ntrees

∑
T

lnL(D|µ,T ) (12)

Generating trees

To generate cell division trees from birth-death conditioned process, we use the
ELynx software suite [11], which allows freely adjustable birth rates, death rates,
and tree sizes.

Generating synthetic samples

For generating synthetic samples of read counts of mutated DNA sites, trees
simulated by ELynx are used as genealogical trees of hypothetical tumors. For
each site, first we determine the total readcount at that site. Then, we draw
random numbers to check whether any of the branches contributes a mutation,
according to the Poisson process described in Sec. Inference. If there is a muta-
tion, the true mutant readcount is drawn from a hypergeometric distribution.
The number of successes of the hypergeometric distribution is the number of
leaves of the selected branch. The number of failures is the total number of
leaves multiplied by ploidy and divided by the hypothetical purity of the sam-
ple, minus the number of successes. The number of trials is the total readcount.
Finally, errors are introduced by drawing a quadruplet of readcounts (“wildtype
errors”) from a multinomial distribution, with probabilities (ε/3,ε/3,ε/3, 1−ε),
the number of trials is the wildtype readcount. “Mutant errors” are drawn from
another multinomial distribution, with probabilites (1 − ε,ε/3,ε/3,ε/3), the

7

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


number of trials is the mutant readcount. The final readcounts are given by the
sum of the two drawn quadruplets.

The mutation rate for different turnover rates is chosen such that the total
number of observed real mutations should remain close to each other, i.e., the
estimation algorithm should have a similar amount of input data.

Calculating the likelihood

The goal is to find the maximum of the likelihood as the function of the mutation
rate, turnover rate, and the error rate, to be able to use it for estimating the
mutation and turnover rates by Eq. 10. The input is the read counts of the
DNA sites. We use pre-generated division trees from the ELynx suite at pre-
determined turnover rate values. Between these pre-determined turnover rate
values, the likelihood function is interpolated using cubic splines. In the case
of synthetic input datasets, the tree used to generate the test dataset is never
included in the likelihood calculation.

The maximum of the likelihood function for each fixed turnover rate value
is obtained by optimizing the error rate using Brent’s method, implemented
in Julia’s Optim package, and estimating the mutation rate from the number
of input mutations and the branch lengths of the currently fitted tree. Only
mutations having read counts high enough to exclude sequencing errors are
taken into account in the mutation rate estimation. The estimated mutation
rate is averaged over trees, using uniform weights. Therefore, the likelihood
function is optimized for ε at different t values, which come from a pre-defined
set, for which the trees can be generated in advance, avoiding the need for new
trees at each step of the optimization process.

3 Results on synthetic data

No sequencing errors

Figs. 2 shows the estimated turnover rates and mutation rates as functions of
the true turnover rates and mutation rates. The method can reasonably differ-
entiate between datasets with different true turnover rates-mutation rates and
estimate their values. Fig. 3 shows the joint estimation of mutation rate-turnover
rate parameter pairs. The data points are arranged into lines, corresponding to
constant numbers of observed mutations, obeying

Nobs mut(µ,t) = µ ·E

(∑
k

l(k) ∗ ccdf(Binom(nseq,fk), 0)

)
trees(t)

(13)

where ccdf(. . . ) is the complementary cumulative distribution function of a bi-
nomial distribution, evaluated at 0. The parameters of the binomial distribution
are the average sequencing depth nseq and the fraction of leaves under branch
k, fk. The expected value is taken over the trees generated with turnover rate t.
The lines defined by Eq. 13 can be numerically approximated for any µ,t points,
averaging over a number of trees.

We checked the dependency of the results on the sizes of the trees. According
to Fig. 4, estimation of large true turnover rates becomes increasingly harder
as the tree sizes decrease. This can be attributed to the fact that differences

8

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


  10-5

  10-4

  10-3

  10-2

  10-1

  100

10-4 10-3 10-2 10-1 100

e
st

im
a
te

d
 1

-t

dataset 1-t

 10-9

 10-8

 10-7

10-9 10-8 10-7

e
st

im
a
te

d
 µ

dataset µ

Figure 2: Estimated turnover rates (left) and mutation rates (right) for different
true values. 10 synthetic datasets for each true value. The trees used for fitting
have 10000 leaves, 10000 trees were used for each t value of the loglikelihood(t)
measurements (see Fig. 7). The continuous line is a guide for the eye, correspond-
ing to y = x. Points are slightly dispersed horizontally for clarity. Horizontal
ordering of the data points is the same for both subplots, e.g., the rightmost
point in each group of points corresponds to dataset no. 10 in both plots.

Figure 3: Joint estimations of mutation rate-turnover rate parameter pairs. 10
synthetic datasets for each true parameter pair, each of which is denoted by one
color. True parameter values are indicated by large full circles. Solid lines show
the numerical approximation of µ(1 − t), for Nobs mut = 5 · 105, 5 · 104, 5 · 103.

between the effects of very high turnover rates are observable on branches having
a very small relative number of leaves, therefore large trees are required to

9

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


distinguish between high turnover rates.

Figure 4: The effect of tree sizes on the estimations. Estimated turnover rates
for different fitted trees sizes: 100 (top left), 1000 (top right), 10000 (bottom
left), 100000 (bottom right).

The accuracies of the estimates for different datasets are not equal, besides
the effect of the size of the trees in the case of high turnover rates. Differences
between estimates for different datasets can be due to 3 possible factors: the
trees used in the fitting process, the generated input data, or the tree used for
generating the data. To check these factors, we chose a dataset which resulted
in a turnover rate estimation deviating from the true value (dataset no. 7 for
1 − t = 1 on Fig. 2, 1 − t = 0.47). We calculated the estimated turnover rate
values using 10 independent sets of 1000 trees. The estimations ranged from
1−t = 0.39 to 0.54. Therefore, the deviation of the estimate from 1.0 cannot be
attributed to the sample of fitting trees. Then, we generated 10 more datasets
using the same tree as for the original dataset. The estimated turnover rate
values were between 1−t ∈ [0.44, 0.49], even more closely matching the original
estimate. Consequently, the effect does not depend on the generated data but on
the tree used to generate the data. It seems that the deviation of the estimates
from the true turnover rates is due to the fluctuation of the shapes of the trees
used for sample generation.

10

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


Effects of sequencing errors

To estimate the effect of sequencing errors, we calculated the estimations of the
turnover rates, applying different amounts of errors to the same data (exactly
the same mutant and wild type readcounts for each mutation). In this case,
the error rate of the data was also estimated by the fitting procedure, along
with the mutation and turnover rates. The influence of sequencing errors on the
estimation of the turnover rates is shown on Fig. 5. For an error rate of 10−3,
which is frequently cited as the error rate of the Illumina sequencing technology
[13], the estimated turnover rates can have significant deviations from the true
values.

For lower error rates, the estimations approach the true values, however,
outliers remain even for ε = 10−8. We note that the loglikelihoods of the false
estimations were better than those of the true values. We tried estimating the
parameters while leaving out the least frequent mutations, to reduce the effect
of errors, but the estimated parameters deviated significantly more from their
true values.

4 Results on empirical data

To estimate the turnover and mutation rates of real tumors, a real human tumor
sample is required. Due to the fitting method’s sensitivity to high sequencing
error rates, we need a sample which is sequenced using a very low error rate
technology. Such samples are much less ubiquitous than those by the standard
technology, and are usually restricted to very short genome segments, mostly
nonhuman. Nevertheless, we found a sample of a human hepatocellular carci-
noma (HCC) [14], which was sequenced using the o2n sequencing technology
[15], providing error rates between 10−5-10−8, which is significantly lower than
the 10−3 rate of the standard Illumina process. Besides the low error rate, the
amount of sequenced positions is enough to cover the targeted region 3400x [15],
which is also much better than those of standard quality datasets (typical se-
quencing depth is around 30). High sequencing depth results in more identified
mutations and more precise mutation frequencies. Sequencing was targeted at
a 410k basepair wide region of the genome, which is much narrower than the
whole human genome. This is a typical shortcoming of low error rate sequencing
methods. Still, as our fitting procedure is much more sensitive to the sequencing
error rate than to the amount of the input mutations (compare the deviations
on Figs. 3 and 5), the dataset provides significantly better input than typical
sequencing data.

The raw sequencing data was preprocessed according to [15], using the code
provided by the authors. The DNA contents of 10000 cells were sequenced [14],
along with a sample of neighboring normal tissue. Mutations were called using
VarScan 2 [16], which is flexible and easy to adapt to the requirements of the
fitting procedure. 923383 sites remained after preprocessing, with sequencing
depth being at least 8 (default for VarScan 2) in both tumor and normal samples.
The distribution of sequencing depths is wide, ranging from 8 to 10853, with a
mean of 904. For mutation calling by VarScan 2, the minimum number of mutant
reads was set to 1 and the strand filter switched off. Although the number of false
positives increases with these parameter choices, the resulting called mutations

11

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


Figure 5: The effect of varying the error rate. Sequencing error rates are ε = 10−3

(top left), 10−4 (top right), 10−5 (middle left), 10−6 (middle right), 10−7

(bottom left), 10−8 (bottom right). 10000 trees, tree size = 10000. X coordi-
nates are slightly dispersed for clarity. Open circles are results corresponding to
error rates used in the fit fixed to their true values, crosses correspond to error
rates estimated by the parameter fit. Vertical lines show the ranges between the
first and last 10-quantiles, based on 10000 beta value estimations. Each open
circle-cross pair corresponding to the same dataset is vertically aligned.

correspond better to the error model of our fitting procedure than an error
rate which changes sharply with threshold frequency or readcount values. The
minimum variant frequency was set to 10−6 to include even the least frequent
mutation. Purity was set to 0.85, in accordance with [14]. We also checked that
the default somatic p-value threshold does not exclude any candidate somatic

12

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


mutations. Other parameter settings were unaltered from their default values.
Mutation frequencies were corrected for copy number variation (CNV), using
VarScan 2 with default parameters, and for ploidy of the sex chromosomes.
CNV detection for targeted sequencing data is a more difficult task than for
whole genome data, and VarScan 2 was found to be a stable performer [17].
Sites having multiple variant types (i.e., number of reads of wildtype plus most
frequent mutant type being lower than the sequencing depth) were checked
manually. Readcounts of all 4 possible genotypes were identified for all variant
sites.

After all these steps, 2284 mutations were identified. The variant allele fre-
quency spectrum is shown of Fig. 6.

 0

 10

 20

 30

 40

 50

 0  0.2  0.4  0.6  0.8  1

c
o
u
n
t

mutation frequency

 1

 10

 100

 1000

 10000

 0  0.2  0.4  0.6  0.8  1
mutation frequency

Figure 6: Variant allele frequency spectrum of a human hepatocellular carcinoma
sample [14], obtained by o2n sequencing [15].

To estimate the sequencing error rate, the fitting procedure was applied to
the mutation data with various fixed error rates in the range 10−8-10−4. The
maximum of the loglikelihood corresponded to ε = 10−7. It is a plausible value,
as [15] estimated the error rate between 10−5-10−8. Based on estimating 10000
best error rates, the error rate should be between 9.3 · 10−8 and 1.2 · 10−7.

Having determined the error rate, we estimated the mutation and turnover
rates, with the error rate fixed, using 105 trees. Fig. 7 shows the estimated
turnover rate, 1−t = 1.09·10−3, and the mutation rate, µ = 9.26·10−8 per site
per cell division. Corresponding to the range of the error rate, the turnover rate
ranges within 1.09·10−3-1.10·10−3. The mutation rate ranges between 9.23·10−8-
9.29 ·10−8. Neglecting mutations over frequency 0.46 does not alter the results.
For illustration, Fig. 8 shows the VAF of a synthetic sample, generated using
the tree fitting best the empirical data.

The estimated mutation rate is rather high, compared to estimations of the
order of 10−9-10−10 per site per cell division for healthy human somatic tissues
[18]. In comparison with mutation rates of tumors, it is not an outstanding
value [1]. Meanwhile, the turnover rate is also high, being very close to the birth
rate. Possible causes include the effect of the immune system, the deleterious
nature of driver mutations or competition for resources among tumor cells. In
conclusion, for this tumor sample, the high number of mutations is due to a
combination of an elevated mutation rate and a high turnover rate.

The results allow estimating the number of cell division rounds from the
founding cell to the biopsied tumor. The average height of simulated trees with
the estimated parameters is 2009 cell divisions. It should be noted that a naive

13

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


-34000

-33500

-33000

-32500

-32000

-31500

-31000

1e-7 1e-6 1e-5

lo
g
li
k
e
li
h
o
o
d

µ

Figure 7: Loglikelihood-turnover rate and loglikelihood-mutation rate curves of
the HCC data. Interpolation between data points is by cubic splines.

 0

 10

 20

 30

 40

 50

 0  0.2  0.4  0.6  0.8  1

c
o
u
n
t

mutation frequency

Figure 8: VAF of a synthetic sample, generated using the tree which fits the
empirical data the best. The grey outline shows the empirical VAF.

estimation of tree height using log2(2.7 · 109) successive branches of average
length 1/(1.1·10−3) is wrong, due to the very different shapes of surviving trees
compared to all trees, most of which go extict before reaching 1/(1.1·10−3) size.

It is also possible to estimate the lifetime of the HCC sample and the cell
division rate of the HCC tumor. The diameter of the tumor is 35 mm, while
the length of a HCC cell is 25 µm [14]. This gives a total number of 2.7 · 109
cells in the whole tumor. The median HCC tumor volume doubling time is 86
days [19]. Based on these figures, the lifetime of the analysed sample is around
7 years, and the cell division rate is estimated to be around 1/32 1/hour.

14

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


5 Discussion

In summary, we described a method to simultaneously estimate the mutation
rate and the turnover rate, making it possible to answer the question which of
them is responsible for the elevated number of mutations in tumors. In par-
ticular, the mutated sites’ read counts, which are closely related to the shape
of the site frequency spectrum, contain useful information about the turnover
rate (death rate, relative to the birth rate), even in the presence of a moderate
sequencing error rate. The sequencing error rate can also be estimated. It is also
quite straightforward to elaborate the model by including nucleotide-dependent
transition probabilites, or trinucleotide context-based effects.

The accuracy of the estimation is influenced by 4 factors. First, the sharp-
ness of the peak of the loglikelihood function, which tends to be narrow, for the
expected amount of input data. Second, the finite amount of trees used in the
fit, causing only a slight dispersion for 104 trees. Third, the shape of the true
lineage tree, which, for death rates extremely close to the birth rate, can distort
the estimation by one order of magnitude. Finally, the assumptions behind the
model (birth-death process with constant rates, neutral mutations from a Pois-
son process) also contribute to the uncertainty of the estimations. According
to the results, the estimation method works sufficiently well to discern cases of
small difference between the birth and death rates (α − β � α) and cases of
the death rate being much lower than the birth rate. Without such capability,
the answer to the question “Is it the mutation rate or the death rate?” would
always be “mutation rate”.

Although the method is presented in the context of human tumors, it can
handle healthy tissues, and samples from other species, too. In theory, any pop-
ulation, descending from one ancestor and possessing genetic material, can be
analyzed, however, lengthier genomes giving rise to more mutations are easier,
due to the increase in the input data for the estimation.

On the more practical side, we also discovered that averaging the loglikeli-
hoods over trees, instead of the likelihoods, gives a significant improvement in
the robustness of the results.

Concerning sequencing errors, the noise level in the standard Illumina tech-
nology makes applying the method to typical samples impractical. One solution
is to use a sequencing technology with much lower error rates, e.g., [21, 22], or
even below the 10−6 error rate of the PCR process, [13, 15]. It should be noted,
however, that these technologies have been applied to short DNA segments only,
resulting in a reduced number of mutations as input data. Another possibility
is to apply noise filtering to standard sequencing data, e.g., deepSNV [23], and
modify the error model of the fitting process accordingly.

Furthermore, when there is no estimation of the order of magnitude of the
sequencing error rate, variance in the accuracy of the method can be quite large
in individual cases, despite the much better behavior of the averaged results.

Despite the shortcomings, it is clear that the signal does exist in the site
frequency spectrum, the mean of the estimated turnover rates monotonically
changes with the testing datasets true turnover rates, and are clearly not inde-
pendent of them.

Besides successes on synthetic data, we were also able to analyze an em-
pirical sample of a hepatocellular carcinoma. We simultaneously estimated the
mutation rate and the turnover rate. Both quantities were estimated to be much

15

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


higher than for healthy tissues, mutation rate being 9.3 · 10−8 per site per cell
division, and turnover rate t = 0.9989. In other words, the high number of mu-
tations in this tumor is caused by a combination of both high mutation and
turnover rates. Using the turnover rate, we also estimated the number of cell
division rounds in the tumor’s lifetime, and its cell division rate. The results
suggest that tumor cells are constantly dividing, but the growth of the tumor
is limited by other factors changing on a much longer timescale, e.g., securing
sufficient blood supply, which cause most new tumor cells to die and slow down
tumor growth. With such a high turnover rate, the ability of limitless replication
is essential for tumor growth.

It is interesting to note that high turnover rates are able to reproduce sub-
clonal peaks in the VAF, using a purely neutral birth-death process. In other
words, subclonal peaks are not necessarily the consequence of selection, neutral
processes can also produce them, indicating strong cell death. On this basis, it
is possible to give a definition of subclones as branches in the lineage tree, close
to the root and long enough to appear as peaks on the VAF spectrum. Using
this definition, there is no need to explain different parts of the VAF spectrum
with different models[24].

In this work, the turnover rate was held constant during the evolutionary
process. There are signs that it is more realistic to assume a turnover rate
which changes during tumor growth [2]. In our case, the estimated strong cell
death suggests that the tumor reached a slowly growing phase, in line with a
Gompertzian model of tumor growth [25, 26], which is corroborated by the large
sizes of observed tumors (diameter ≥ 1cm) used in the doubling time estimation
[19]. It is possible that in the earlier stages of tumor development, cell death was
less frequent and doubling time was shorter. It might be the case that the rate
of cell division is constant during tumor growth, and doubling time is set by the
turnover rate, which is, in turn, limited by external factors. It is an interesting
direction for future work to extend the model by allowing the turnover rate to
be time-dependent.

The combined effect of the estimated mutation and turnover rates is a very
high effective mutation rate between cell divisions where both daughter branches
survive, µeff being in the order of 10

−4-10−5 per site per surviving cell division.
While this value looks suprisingly large, it is logical that the combination of a
slowly growing tumor and fast dividing tumor cells leads to a very large number
of mutations.

Currently, the method uses a simple birth-death model for tumor growth. In
the future, a more realistic growth model, including e.g., spatial effects [27, 28],
would enhance the applicability of the method. Another possibility for improve-
ment is to model spatial sampling of tissues, in which the measured mutation
frequencies intertwine the correlated ancestry of sampled cells with the preva-
lence of the mutations.

Author Contributions

(According to https://journals.plos.org/ploscompbiol/s/authorship#loc-author-
contributions.)

GJSz conceptualized the research project. GT, ID and GJSz performed the
formal analysis. Funding was acquired by ID and GJSz. GT carried out the

16

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


investigation. GT, ID and GJSz contributed to the methodology. GT and DS
developed the necessary software. GT provided the visualization. GT wrote the
draft. DS, ID and GJSz reviewed and commented on the manuscript.

Acknowledgements

GT and GJSz received funding from the European Research Council under
the European Unions Horizon 2020 research and innovation programme under
grant agreement no. 714774. GJSz was also supported by the grant GINOP-
2.3.2.–15–2016–00057.

References

[1] Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Iden-
tification of neutral tumor evolution across cancer types. Nat Gen.
2016;48:238–244.

[2] Bozic I, Gerold JM, Nowak MA. Quantifying Clonal and Sub-
clonal Passenger Mutations in Cancer Evolution PLoS Comput Biol.
2016;12(2):e1004731.

[3] Tomlinson I, Sasieni P and Bodmer W. How Many Mutations in a Cancer?
Am J Pathol. 2002;160:755-758.

[4] Araten DJ, Golde DW, Zhang RH, Thaler HT, Gargiulo L, Notaro R et
al. A Quantitative Measurement of the Human Somatic Mutation Rate.
Cancer Research 2005;65:8111.

[5] Loeb LA, Bielas JH and Beckman RA. Cancers Exhibit a Mutator Pheno-
type: Clinical Implications. Cancer Research 2008;68:3551.

[6] Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al.
Quantification of subclonal selection in cancer from bulk sequencing data.
Nat Genet. 2018;50:895-901.

[7] Werner B, Case J, Williams MJ, Chkhaidze K, Temko D, Fernández-Mateos
J, et al. Measuring single cell divisions in human tissues from multi-region
sequencing data. Nat Comm. 2020;11:1035.

[8] Gernhard T. The conditioned reconstructed process. J Theor Biol.
2008;253:769.

[9] Maruvka YE, Kessler DA, Shnerb NM. The Birth-Death-Mutation
Process: A New Paradigm for Fat Tailed Distributions. PLoS ONE
2011;6(11):e26480.

[10] Kessler DA, Levine H: Scaling solution in the large population limit of
the general asymmetric stochastic Luria-Delbrück evolution process. J Stat
Phys. 2015;158:783–805.

[11] Schrempf D. The ELynx Suite; 2019 [cited 2020 Sept 01] Repository:
GitHub [Internet]. Available from: https://github.com/dschrempf/elynx

17

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


[12] Höhna S, May MR, Moore BR. TESS: an R package for efficiently sim-
ulating phylogenetic trees and performing Bayesian inference of lineage
diversification rates. Bioinformatics 2016;32(5):789-791.

[13] Kennedy SR, Schmitt MW, Fox EJ, Kohrn BF, Salk JJ, Ahn EH, et al.
Detecting ultralow-frequency mutations by Duplex Sequencing. Nat Protoc.
2014;9:2586.

[14] Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, et al. Extremely high ge-
netic diversity in a single tumor points to prevalence of non-Darwinian cell
evolution. PNAS 2015;112:E6496-E6505.

[15] Wang K, Lai S, Yang X, Zhu T, Lu X, Wu C, et al. Ultrasensitive and
high-efficiency screen of de novo low-frequency mutations by o2n-seq. Nat
Comm. 2017;8:15335.

[16] Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al.
VarScan 2: somatic mutation and copy number alteration discovery in can-
cer by exome sequencing. Genome Res. 2012;22:568-76.

[17] Zare F, Dow M, Monteleone N, Hosny A, Nabavi S. An evaluation of copy
number variation detection tools for cancer using whole exome sequencing
data. BMC Bioinformatics 2017;18:286.

[18] Lynch M. Evolution of the mutation rate. Trends Genet. 2010;28:345.

[19] An C, Chou YA, Choi D, Paik YH, Ahn SH, Kim M-J, et al. Growth rate of
early-stage hepatocellular carcinoma in patients with chronic liver disease.
Clin Mol Hepatol. 2015;21:279.

[20] Stadler T. On incomplete sampling under birth-death models and connect-
sion to the sampling-based coalescent. J Theor Biol. 2009;261:58-66.

[21] Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH,
et al. High-throughput DNA sequencing errors are reduced by orders of
magnitude using circle sequencing. PNAS 2013;110:19872-19877.

[22] Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and
quantification of rare mutations with massively parallel sequencing. PNAS
2011;108:9530-9535.

[23] Gerstung M, Beisel C, Rechsteiner M, Wild P, Schraml P, Moch H, et
al. Reliable detection of subclonal single-nucleotide variants in tumour cell
populations. Nat Comm. 2012;3:811.

[24] Caravagna G, Heide T, Williams MJ , Zapata L, Nichol D, Chkhaidze K,
et al. Subclonal reconstruction of tumors by using machine learning and
population genetics. Nat Gen. 2020;52:898–907.

[25] Laird AK. Dynamics of Tumour Growth: Comparison of Growth Rates and
Extrapolation of Growth Curve to One Cell. Br J Cancer 1965;19:278–291.

[26] Lo CF. A modified stochastic Gompertz model for tumour cell growth.
Comp Math Methods Medicine 2010;11:3-11.

18

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/


[27] Antal T, Krapivsky PL, Nowak MA. Spatial evolution of tumors with suc-
cessive driver mutations. Phys Rev E 2015;92:022705.

[28] Noble R, Burri D, Kather JN, Beerenwinkel N. Spatial structure governs
the mode of tumour evolution. BioRxiv [Preprint]. 2019 bioRxiv 586735
[posted 2019 Mar 23; revised 2019 Apr 13, cited 2020 Sept 24]: [19 p.].
Available from: https://doi.org/10.1101/586735

[29] Nelso KP. Assessing Probabilistic Inference by Comparing the Generalized
Mean of the Model and Source Probabilities. Entropy 2017;19:286.

[30] Nelso KP. Inference assessment on a probability scale. 51st Annual
Conference on Information Sciences and Systems (CISS) 2017; 1-
5.10.1109/CISS.2017.7926106.

19

.CC-BY-NC 4.0 International licenseperpetuity. It is made available under a
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in 

The copyright holder for thisthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430830doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430830
http://creativecommons.org/licenses/by-nc/4.0/

	Introduction
	Methods
	Results on synthetic data
	Results on empirical data
	Discussion