LiquidCNA: tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations LiquidCNA: tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations Eszter Lakatos1⇤, Helen Hockings2,3, Maximilian Mossner1, Weini Huang4, Michelle Lockley2,5, Trevor A. Graham1⇤ 1 Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK 2 Centre for Cancer Cell and Molecular Biology, Barts Cancer Institute, Queen Mary University of London, London, UK 3 Barts Health NHS Trust, St Bartholomew’s Hospital, West Smithfield, London, UK 4 School of Mathematical Sciences, Queen Mary University of London, London, UK 5 Department of Gynaecological Oncology, Cancer Services, University College London Hospital, London, UK ⇤ Correspondence: e.lakatos@qmul.ac.uk; t.graham@qmul.ac.uk Abstract Cell-free DNA (cfDNA) measured via liquid biopsies provides a way for minimally-invasive monitoring of tumour evolutionary dynamics during therapy. Here we present liquidCNA, a method to track subclonal evolution from longitudinally collected cfDNA samples based on somatic copy number alterations (SCNAs). LiquidCNA utilises SCNA profiles derived through cost-e↵ective low-pass whole genome sequencing to automatically and simulta- neously genotype and quantify the size of the dominant subclone without requiring prior knowledge of the genetic identity of the emerging clone. We demonstrate the accuracy of liquidCNA in synthetically generated sample sets and in vitro and in silico mixtures of cancer cell lines. Application in vivo in patients with metastatic lung cancer reveals the progressive emergence of a novel tumour sub-population. LiquidCNA is straightfor- ward to use, computationally inexpensive and enables continuous monitoring of subclonal evolution to understand and control therapy-induced resistance. 1 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Introduction Liquid biopsies, primarily the analysis of cell free DNA (cfDNA) present in blood samples, o↵er the potential for regular longitudinal and minimally invasive monitoring of cancer dynamics [1, 2, 3, 4, 5, 6, 7]. Circulating cfDNA is released into the blood via apoptosis or necrosis of cells. Tumour-derived cfDNA in the blood is detectable from tumours as small as 50 million cells [8], it shows correlation with disease stage [9, 10], and o↵ers the same diagnostic potential as tissue-based biopsies [7]. cfDNA is an aggregate of DNA shed from multiple locations and multiple malignant cells across the body and hence a single sample can provide a comprehensive overview of systemic disease. Consequently, cfDNA is an exceptional resource for non-invasive tracking of tumour composition and for monitoring response to therapy or clinical relapse. Typically, cfDNA analysis has focused on the detection of driver gene single nucleotide variants (SNVs), with the size of mutation-bearing clones inferred from the relative se- quencing read count at the mutation site. For instance, in high-grade serous ovarian cancer (HGSOC) the frequency of TP53 mutation in cfDNA is a measure of tumour burden and is predictive of treatment response [11]. In colorectal cancer, KRAS mutation frequency in cfDNA is predictive of response to anti-EGFR therapy [4]. Somatic copy number alterations (SCNAs) are widespread in cancers [12, 13, 14], and have been used extensively to track tumour composition and dynamics over time [15, 16, 17, 18]. SCNAs can be detected in cfDNA without prior knowledge of the tumour SCNA profile, through measurement of the relative number of reads mapping within ‘bins’ spaced across the genome [19]. Relative di↵erences in read count between bins can be sensitively detected even when the total read count is low [20, 21, 22], meaning that SCNAs can be detected with a fraction of the sequencing depth required for SNV detection. Therefore SCNA profiling o↵ers a high-throughput and cost-e↵ective means to evaluate cfDNA samples [23, 24, 25, 26, 27, 28]. Whilst measuring clone sizes based on the frequency of SNVs is straightforward, de- riving quantitative information on the proportion of tumour population that carries a particular SCNA is challenging. Tumour cells are not the only contributors to the cfDNA pool, and an SCNA can in theory change the copy number to any non-negative integer value. Thus total read count per bin is a noisy compound function of the relative tumour cell contribution to the total cfDNA pool, and the specific copy number of the alteration. Here we present a new method to identify and track tumour subclonal evolution based solely on measurement of SCNAs from longitudinal cfDNA samples. Our algorithm, named liquidCNA, firstly determines the contribution of tumour DNA to the total cfDNA pool (i.e. cellularity/purity) and then uses SCNA data to characterise and quantify the size of the most pervasive (putatively resistant) subclone emerging or contracting over time. The e�cacy of the method is demonstrated using synthetic datasets, in vitro cell line mixtures, and in vivo via longitudinal analysis of cfDNA from lung cancer patients undergoing targeted treatment. 2 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Results Emergent subclone tracking from copy number information First, we derive a mathematical definition of the problem of tracking an emergent (pu- tatively resistant) tumour subclone from longitudinal cfDNA samples, typically taken throughout the course of treatment. We consider a tumour cell population undergoing continuous evolution characterised by two cell types, ancestral tumour cells (A) and an emerging subclone (S). We assume that liquid biopsies contain DNA originating from an- cestral and subclonal tumour cells, as well as contaminating DNA from normal cells (N). The proportion of DNA arising from cells of the emergent subclone within the tumour is expressed by the subclonal-ratio, ri, while the overall proportion of tumour-originating DNA is termed the purity or tumour fraction of the sample, denoted by pi. We consider that the copy number (CN) profile of each sample has been measured – for example using low-pass whole genome sequencing (lpWGS) – and so the genome can be divided into segments, contiguous regions of constant CN. Each measured segment CN in sample i (C j i ) is the combination of each cell population’s CN at the jth genomic location (2 for normal cells and C(A) and C(S) for ancestral and subclonal tumour cells, respectively), weighted by the proportions of the three populations (Fig. 1). C j i = 2 + pi � (1 � ri)C(A)j + riC(S)j � 2 � . (1) We assume that each segment can fall into one of three categories depending on its CN in ancestral and subclonal tumour cells. Clonal alterations (and unaltered segments) are at the same CN in both tumour populations, and their measured CN is only a↵ected by the purity of a sample. Subclonal segments represent SCNAs that are unique to the emerging subclone. Their measured CN is influenced by the subclonal-ratio of a sample, as well as sample purity. Finally, segments that do not follow either of these patterns – due to uncertain measurements or ongoing instability – are termed unstable. Our aim is to estimate the underlying purity and subclonal-ratio, pi and ri, from longitudinal CN measurements of clonal and subclonal segments (Fig. 1). Estimation of subclonal-ratio Estimation is carried out in three steps (Fig. 2a and Methods). First, the purity of each sample is assessed using the distribution of segment CN values. We assume that the majority of segments have integer CN in all tumour cells, hence the distribution is expected to have distinct peaks at regular intervals of pi, corresponding to clonal segments with CN of 1, 2, 3, etc. (Fig. 2b). We derive the purity estimate as the value that minimises the squared error between observed and expected peaks (Fig. 2c). The inferred purity values are used to correct the segment CN values, thus estimating the tumour-specific CN of each segment. LiquidCNA does not require a mainly diploid tumour genome (i.e. major peak at CN= 2) to derive correct estimates, but will derive erroneous conclusions if the CN values 3 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ – as measured by the CN quantification software, e.g. QDNAseq [19] – are incorrectly centred (e.g. major peak is defined as copy number 2, but the true value is copy number 3). To control for this an initial manual check of the CN profile is recommended prior to applying liquidCNA and renormalisation to the correct ploidy if required. Next, for every segment we compute the change in CN, �CN, between each sample and a baseline sample that is assumed to have negligible proportions of the emerging (putatively resistant) subclone – for example a sample taken upon diagnosis or before start of therapy. �CN values naturally highlight subclone-associated segments altered in non-baseline samples, as these segments display markedly positive (CN gain compared to baseline) or negative (CN loss) values (Fig. 2d). From these �CNs we then establish the set of segments that are subclonal and the sample ordering that reflects increasing subclonal proportions. To do this, we examine each possible order of samples, classifying each segment as clonal (if the variance of its �CNs across samples is below a pre-defined threshold), subclonal (if it shows monotone change in �CN value along the order of the samples - i.e. if the �CNs are consistent with an emerging subclone) or unstable (if it does not correlate with sample order) according to that order (Fig. 2e). The order with the highest proportion of segments classified as subclonal is selected, and these subclonal segments are used for downstream computation of tumour composition (Fig. 2f). The methodology ensures that the dominant subclone associated with the most pervasive SC- NAs is evaluated and that subclonal-ratio inference is robust to segments with unstable CN. Finally, we compute the relative and absolute subclonal-ratio of each sample using the identified set of subclonal segments. Relative subclonal-ratios are defined as the median ratio of segment �CNs compared to the sample with the maximum subclonal proportion (Fig. 2g). The absolute subclonal-ratio is computed based on the assumption that sub- clonal segment CN values correspond to distinct SCNAs that di↵er between ancestral and subclonal cells. The subclonal-ratio of sample i is therefore derived as the shared mean (ri) of a mixtures of Gaussian distributions with constrained means �ri, +ri, etc., fitting the �CN distribution of subclonal segments (Fig 2h). We also provide the 95% confidence interval of the absolute subclonal-ratio estimate based on the shared variance of the fitted Gaussians (Fig 2i). LiquidCNA outputs both relative and absolute subclonal-ratio measures, since for most applications the relative value holds su�cient information on how the subclonal (putative resistant) population changes between time-points. Relative proportions are also less susceptible to the measurement noise in the measured segment CNs, while a combination of low subclonal proportion and high sequencing noise can cause the fitting of absolute subclonal-ratio estimates to fail to converge. Synthetic mixed populations We first evaluated the performance of liquidCNA using synthetic datasets where input values of subclonal proportion and purity were known. We generated synthetic datasets 4 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ characteristics matching typical longitudinal measurements of patients. In order to simu- late imperfect measurements, we added varying levels of normally distributed measurement noise (defined by the dimensionless parameter �) to bin-wise CN values (Fig. 3a-c and Methods). We evaluated the accuracy of the purity estimation on 250 synthetic samples (Fig. 3d), and found that purity p could be estimated within 2% of the true tumour fraction in 90% of samples at noise levels �  1. The error on the purity estimation was greater when the noise was increased (Fig. 3e), and was most pronounced in samples with high noise and low tumour fraction. Consequently, we restricted our subsequent analysis to only cases of higher purity (pi � 0.1). Next, we derived subclonal-ratios using purity-corrected CN profiles on the higher purity subset of synthetic mixtures. We set a threshold to filter out clonal segments (see Fig. 2e) such that at least 10 segments were retained and the proportion of retained segments classified as subclonal was maximal following segment classification. Fig. 3f shows the true and estimated subclonal-ratios for 50 synthetic experiments. Overall, we found that subclonal-ratio was estimated with ⇠ 5% error, and the accuracy was influenced by measurement noise (Fig. 3g). Relative subclonal-ratios (calculated compared to the sample with highest subclonal proportion) were estimated with higher accuracy (error ⇠ 3%, Fig. S1a-b). We found that computing absolute subclonal-ratios in a two-step process from these values yielded similar results to direct estimation by fitting a Gaussians mixture model, and provided an estimate even in cases where the direct estimation did not converge (Fig. S1c and Methods). The proportion of unstable segments, unlike noise, had little e↵ect on the estimation accuracy (Fig. S2). Mixtures of ovarian cancer cell lines Next, we evaluated liquidCNA on real data derived from in vitro mixtures of two paired high grade serous ovarian cancer (HGSOC) cell lines [29] (see Method and Table S1). HGSOC cells were ideally suited for this evaluation as high levels of chromosomal insta- bility are a hallmark of the disease [30, 31]. We anticipated that liquidCNA will be most applicable for the tracking of subclonal evolution in malignancies with high CNA burden [26]. We divided a population of OVCAR4 cells into two aliquots, and the first aliquot was untreated and classified as ‘sensitive’. In a process described in detail by Hoare et al. [29], cells from the second aliquot were cultured so that they evolved resistance to platinum- containing chemotherapy and thus were termed ‘resistant’. In addition to the high SCNA burden inherited from the ancestral sensitive cell line, resistant cells acquired new SCNAs during the in vitro evolution of resistance (Figure 4a). We then mixed, in varying known proportions, the genomic DNA extracted from the two cell lines, with sensitive cells representing the ancestral and resistant cells the emerging subclonal population. The mixtures were further diluted with DNA from blood samples of healthy volunteers assumed to have a diploid genome; this modelled the e↵ect of normal 5 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ contamination in patient samples (Table S1). These DNA mixtures were sequenced to mean depth 1.3x and composite SCNA profiles were generated (see Methods). In addition, we generated further in silico mixtures by sampling and mixing genome-aligned reads from sequencing data from each of the three cell types sequenced individually. In these mixtures, we controlled the total number of reads per sample to study the e↵ect of variable read depth and associated measurement noise. First, we used liquidCNA to estimate the purity of in vitro mixed samples (samples S0-S5). The purity of each sample was estimated to be lower than the theoretical mixing proportion (Fig. 4b). In the in silico mixed samples, we found that there was a strong linear relationship between estimated and true purity (Fig. 4c). The underestimation of purity in the samples might be explained by our definition of theoretical purity in the in vitro and in silico mixing procedure (respectively defined as proportion of DNA weight versus the proportion of read counts). A highly aneuploid genome will likely have a higher weight than a diploid genome, therefore mixing of equal weights results in a higher pro- portion of normal genomes than expected. Our purity estimates were in agreement with observed peaks of the CN distribution (Fig. S3a), further confirming that there was no bias in the estimation. By fitting a linear model to the estimates, the theoretical tumour fraction could be fully recovered, as illustrated by the ‘corrected’ estimates of samples S0-S5 (Fig. 4b). The number of reads (sequencing depth) did not systematically influ- ence the accuracy of estimating tumour fraction, but purity estimates of samples with low tumour fraction were noisier at low read depth (Fig. 4c). In summary, liquidCNA pro- vided an accurate estimate for purity values when true purity was above 10%. Decreased measurement accuracy below 10% purity is consistent with our observations on synthetic data and is similar to reported limitations of other methods quantifying tumour fraction from lpWGS cfDNA [24, 20, 22]. Therefore, for samples below 10% predicted purity, we advise to discard the sample from downstream analysis, although low-purity samples may be usable if a very accurate purity estimate can be derived by other means. Next, we inferred the subclonal-ratio for cell line mixtures using purity-corrected �CN values, with sample S0 used as the baseline sample for both in vitro and in silico sample sets. We could correctly order cell line mixtures according to subclonal-ratios without any a priori information (Fig. S3b), and both absolute subclonal-ratio and relative subclonal changes were estimated on average within 2% and 3% of the true subclonal percentage (Fig. 4d,f). In particular, we noted that samples S4 and S3 were accurately estimated as having an equal subclonal-ratio, despite originating from di↵erent biological replicates with di↵erent tumour purity, which was reflected in the small confidence intervals of their estimates. We also note that even though there were no truly unstable segments in this dataset as measurements were not taken over time, three non-clonal segments were clas- sified as such, probably due to higher noise in their measured CN value. Using datasets of randomly selected in silico samples with 50 million reads, we con- firmed that our algorithm could accurately infer the subclonal-ratio of samples, in partic- ular when considering relative proportions (Fig. 4e,g). Although the estimation quality 6 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ decreased with lower read counts (Fig. S4), in most cases the estimated absolute and relative subclonal-ratio was within 15% and 10% of the true subclonal proportion, re- spectively. Furthermore, we found that cases with high estimation error were typically caused by low-purity samples, which could be easily identified and removed without a priori information, as demonstrated in Fig. S5. Using the known theoretical mixing values of tumour-DNA content – instead of data- derived estimates – to derive purity-corrected CN values increased the estimation error, especially in low read count samples (Fig. S6). This finding emphasises that non-diploid genomes might bias alternative measurement methods and internal consistency in the method of deriving sample characteristics (purity and subclonal-ratio) is crucial when assessing the dynamics of the subclonal population. Subclonal analysis of patient samples We used liquidCNA to analyse emergent subclones in longitudinal cfDNA samples from pa- tients with non-small cell lung cancer (NSCLC) undergoing therapy, as previously reported by Chen and colleagues [24]. The liquid biopsies were collected as part of the FIGARO study (GO27912, NCT01493843), a randomised phase II trial designed to evaluate the e�cacy of pictilisib, a selective inhibitor of phosphatidylinositol 3 kinase [32]. Pictilisib or placebo was given in combination with standard chemotherapy regimen which was de- termined based on the subtype of NSCLC. Blood samples were taken at baseline (day 1 of the first treatment cycle) and at 6-week intervals up to the end of treatment (EOT). DNA was isolated from the plasma of liquid biopsies and sequenced using lpWGS to an average depth of 0.5x, as described in details in [24]. Chen et al. [24] identified several SCNAs in EOT samples that were absent at baseline and described several genes within these regions that might be associated with resistance. We sought to apply liquidCNA to these cases to corroborate their observations, and further to quantify the size of emergent subclones over time in these patients. We obtained the lpWGS data (fastq files) and performed CN profiling (see Methods) on patients with cfDNA samples from �3 time-points (n = 32). We identified three patients (1306, 2760 and 3209) whose sample series fulfilled the following criteria: (i) had a cfDNA sample taken on the first day of therapy with purity above ⇠20%; (ii) and had at least two non-baseline samples with purity above ⇠20%. Patients 1306 and 3209 were in the experimental arm of the study, while patient 2760 was assigned to the control arm; and all three patients have progressed during the course of the trial. We ran liquidCNA on data from the three selected patients (discarding samples with purity below 10% (Fig. S7)) and examined the genomic segments that liquidCNA identified as subclonal relative to baseline samples (Fig. 5). While we observed a good overlap with the CNs previously reported to be associated with subclonal evolution through therapy (Figures 5 and S8 of [24]), we also found a few segments that were missed or additionally identified by liquidCNA. The original study focused on the comparison of pre- and post-treatment and highlighted SCNAs occurring 7 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ between the first and last time-points. As our analysis put equal focus on all time-points, it classified some of the previously identified segments as unstable if the CN progression was not consistently with subclone evolution. Furthermore, some segments were too small to pass our initial filtering. On the other hand, liquidCNA was able to identify subclonal segments which were at an abnormal CN in the baseline sample, and subsequently showed diploid CN or a further gain/loss in subclonal tumour cells. For example, in the samples from Patient 1306, whilst liquidCNA identified subclonal SCNAs on chromosomes 2, 6 and 8 that overlapped with the findings of the original study; it also detected additional subclonal changes on chromosomes 17 and 18. However, we did not observe the previously described focal loss on chromosome 7 (harbouring the gene MLL3), probably due to its small size. Overall, we identified 10, 12 and 17 subclone-associated SCNAs in patients 1306, 2760 and 3209, respectively. A further 6 segments in patient 2760 were classified as non-clonal but ’unstable’ as the CN over time was not consistent with the pattern defined by the emerging subclone. As samples from patient 2760 had lower purity, these inconsistent CN changes might have resulted from measurement noise. We found that the emerging subclone accounted for 10 to 30% of the tumour derived DNA in the cfDNA in the three patients evaluated. Patient 2760 showed evidence of a subclonal proportion consistently around 30%, which could be explained by samples from this patient taken at later time-points. Samples from patient 3209 obtained at 18 weeks and end of therapy contained below 20% DNA derived from subclonal tumour cells (Fig. 5). Patient 1306, on the other hand, showed a contracting subclone that reduced in proportion from 20% presence at week 18 to <10% at the end of therapy. In case the total population size was known – which might be accessible from additional measurements of the tumour-associated cfDNA pool –, the tumour subclone fractions established here could also be converted into growth rates to enable future predictions of the tumour dynamics. 8 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Discussion We present liquidCNA, a computational algorithm to infer longitudinal subclonal dynam- ics using copy number measurements. Our algorithm performs simultaneous analysis of several longitudinal samples to identify sample purity, subclonal SCNAs and the abun- dance of an emerging subclone. LiquidCNA distinguishes between SCNAs that are associ- ated with the emerging subclone and those showing unstable behaviour, and consequently is not confounded by uncertain CN measurements. We validate our method both on synthetic SCNA datasets, and in vitro and in silico mixtures of two ovarian cancer cell lines. We successfully infer the proportion of the dominant subclone in all of the above datasets, with good accuracy across a range of sample qualities defined by the noise level or sequenced reads. In patients with lung cancer, liquidCNA applied to lpWGS data derived from longitudinal liquid biopsies (cfDNA) shows the emergence of subclones during therapy and identifies genomic regions associated with the emergent tumour cells. We demonstrate that liquidCNA can identify and quantify emerging subclones from cfDNA samples, therefore enabling tracking of tumour subclone evolution through the course of therapy. Deciphering the evolutionary trajectory of cancer can aid prognostic and therapeutic decision-making and further our understanding of therapy-induced drug resistance [33]. Measuring the dynamics of tumour composition is particularly crucial for prospective monitoring during an adaptive therapy regime aiming to control resistant subclones [34, 35, 36]. Furthermore, the proportion of cfDNA that is tumour-derived (what we term ’purity’) in itself is a promising biomarker for determining initial therapy response and prognosis [5, 37], as well as for tracking tumour progression during and after therapy [2, 10, 6, 24]. We note that there are limitations in our liquidCNA method. Since our inference relies on heterogeneous copy number profiles and subclone-specific SCNAs, we cannot analyse cancer (sub)types with very low chromosomal instability, for example microsatellite un- stable tumours. Conversely, extremely high levels of ongoing instability might bias our analysis due to the lack of stable subclone-associated SCNA profile, and therefore liq- uidCNA is not suitable for oligo-metastatic disease if spatially separate metastases carry distinct karyotypes. Furthermore, the accuracy of our estimation reduces at low purity (below 10%). However, a tumour fractions above this regime were observed in a sub- stantial number of patients, especially in late stage disease where liquidCNA can o↵er the largest benefit, [24, 20, 6, 9, 22, 38]. In addition, recent studies have shown that the unique fragment length of tumour-derived cfDNA can be utilised to enrich for tumour purity either experimentally or bioinformatically [39, 40, 41]. Finally, liquidCNA tracks a single dominant subclone associated with the largest set of subclone-specific SCNAs, and if there are multiple smaller subclones (with less or no associated SCNAs), these will be ignored by the algorithm. In summary, we provide a robust tool to derive quantitative information about dy- 9 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ namic changes in clonal composition from SCNA measurements derived from cfDNA. LiquidCNA enables real-time non-invasive tracking of subclonal tumour evolution, which can provide new insights into the evolution of SCNAs and the dynamical emergence of therapy-associated resistance. Acknowledgements We thank Ann-Marie Baker for reviewing the clarity of the text, and Steve Gendreau and Craig Cummings from Genentech, Inc. for providing access to patient cfDNA sequencing results and for their critical comments on the presentation of the data. This work was supported by the Wellcome Trust (grant 202778/Z/16/Z to T.A.G.) and Cancer Research UK (grant A19771 to T.A.G. supporting E.L.; Advanced Clinician Scien- tist Fellowship C41405/A19694 to M.L.; Clinical Research Training Fellowship to H.H.). M.L. also received support from a Barts and The London Charity Strategic Research Grant (467/2244). T.A.G. also received founding from the National Institutes of Health, National Cancer Institute (grant U54 CA217376). Author contributions E.L., W.H., M.L. and T.A.G. conceived and designed the study. M.L. and T.A.G. acquired funding for the study. E.L. developed the inference method and performed bioinformatic analysis. H.H. and M.M. performed in vivo experiments and sequencing. E.L. and T.A.G. wrote the original draft, and all authors reviewed and approved the manuscript. Competing interests The authors declare no competing interest. 10 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ 1 2 3 4 Genomic segment clonal subclonal unstable Purity (pi) =Subclonal-ratio (ri) = Normal contamination Tumour cell mixture Sample 2 Sample 3 Sample 1 Sample 1 Sample 2 Sample 3 1 2 3 1 2 3 1 2 3 Genomic segment M ea su re d co py n um be r1 2 3 1 2 3 1 2 3 Genomic segment Tu m ou r c op y nu m be r C op y nu m be r 1 2 3 4 Subclonal/resistant tumour cells Ancestral/sensitive tumour cells Figure 1: Schematic of copy number measurements. The first panel shows the SCNA profile of ancestral (in yellow) and subclonal (in red) tumour cells. At di↵erent sampling time-points, the overall tumour SCNA profile is a mixture of these profiles (second panel), influenced by the composition of tumour-derived DNA depicted on the pie-charts. Clonal, subclonal and unstable segments are indicated in yellow, red and blue, respectively. Note that the CN of clonal segments remains the same. In the liquid biopsies taken at each time-point, contamination from normal cells leads to ’flattened’ measured SCNA profiles (last panel) due to normal cells having a neutral karyotype. This contamination a↵ects the CN of each segment. Our aim is to estimate purity (pi) and subclonal-ratio (ri) based on clonal and subclonal SCNAs. 11 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ 0 1 2 3 1.0 0.5 0.0 0.5 1.0 ΔCN in Sample4 N um be r o f s eg m en ts 0.0 0.2 0.4 0.6 Sample2 Sample3 Sample4 Sample5 S ub cl on al -r at io 0 1 2 3 1.5 2.0 2.5 3.0 Segment CN D en si ty 0.00 0.05 0.10 0.15 0.1 0.2 0.3 0.4 Purity estimate E rr or o f f it b c d Ordered samples clonal/normal unstable subclonale X1 X2 X3 X4 X5 X1 X2 X3 X4 X5 X1 X2 X3 X4 X5 0.0 0.5 1.0 1.5 S eg m en t C N Segment CN distribution Purity p1, p2, … Purity-corrected segment CNs baseline sample Segment classification Maximal Relative subclonal-ratio r1,N, r2,N, … 0.25 0.50 0.75 1.00 Sample2 Sample3 Sample5 S ub cl on al -r at io c om pa re d to S am pl e4 0 1 2 Sample1 Sample2 Sample3 Sample5 Sample4 C N a Sample order 0 1 2 Sample1 Sample3 Sample2 Sample4 Sample5 S eg m en t C N Score: 1 (16%) Order 24 Order 3 Order 2 Order 1 Optimal: Order 3 (Score = 17) Order 3f g subclone sample Subclonal-ratio r1, r2, … Optimal subclonal-ratio: 0.65 h i Subclonal segments ΔCN compared to Figure 2: Illustration of the estimation algorithm. (a) Outline of the steps of the estima- tion algorithm. (b) Purity estimation based on the peaks of the distribution of segment CNs. Green lines show the peaks expected at an example purity of 0.21. (c) The error of a range of purity estimates, computed from the distance of observed and estimated peaks in (b). Each line corresponds to a smoothing kernel applied to the raw segment CN distribution. The optimal purity is indicated with arrow. (d) Change in segment CN values (�CNs) plotted according to an example sample order. The number of subclonal segments computed in (e) is indicated below. (e) Classification of segments based on the sample order in (d). Segments with low variance are classified as clonal (in grey). Non- clonal segments are evaluated whether they follow a quasi-monotone pattern (indicated by the shaded regions) and classified as unstable (outside of shaded region, in blue) or subclonal (in red). (f) �CN values plotted according to the optimal sample order max- imising subclonal segments. Line colours indicate the class of each segment as in (e). (g) Relative subclonal-ratio estimation compared to maximal subclonal-ratio sample (right- most in (f)). Points show individual segment-wise estimates, with an example segment highlighted in black. Black line shows the median. (h-i) Subclonal-ratios and confidence intervals inferred by fitting a Gaussian mixture model to the �CN distribution of sub- clonal segments. The components of the best fit with means �r and r are shown in green and magenta in (h). 12 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ 0.0 0.2 0.4 0.6 0.5 1 2 4 Noise level (sigma) E rr or in s ub cl on al -r at io e st im at io n 0.00 0.05 0.10 0.15 0.20 0.5 1 2 4 Noise level (sigma) E rr or in p ur ity e st im at io n sigma=2 sigma=4 sigma=0.5 sigma=1 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 True subclonal-ratio E st im at ed s ub cl on al -r at io d e f g x 5 r: 5-80% p: 4-45%-2,-1,0,+1,+2 1,2,3,4,5 1 2 3 4 5 6 1 2 3 4 Ancestral CN S ub cl on al C N 0 5 10 15 20 25 number of segments c p = 0.26 r = 0.35 x 80 a b 0.5-4 sigma=2 sigma=4 sigma=0.5 sigma=1 0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 True purity E st im at ed p ur ity Figure 3: Estimation of mixtures of synthetic cell populations. (a) Parameters used to randomly sample synthetic datasets including simulated measurement noise. The font- size of copy number states indicates their probability. (b) A randomly generated sample. The heatmap depicts the distribution of segment CNs in ancestral and subclonal cells, and the proportion of cell populations is shown on the pie-chart (red: subclonal, yellow: ancestral, grey: normal). (c) Copy number profile of the sample in (b), with raw bin-wise and segmented copy number values shown in black and red, respectively. (d) Estimated purity of 1,000 synthetic samples with varying levels of noise (�), plotted against the true theoretical purity. The y = x line is indicated with dashes. (e) Error of purity estimation (absolute di↵erence to true purity) for samples with noise level indicated on the x axis. (f) True and estimated subclonal-ratio of 200 synthetic datasets (1,000 samples) with varying levels of noise (�). (g) Error in subclonal-ratio estimation for datasets with increasing noise level. Box-plot elements in (e)(g) stand for: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. 13 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ 0.00 0.25 0.50 S1 S2 S3 S4 S5 Sample S ub cl on al ra tio 0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 True purity E st im at ed p ur ity 0.2 0.4 0.6 0.2 0.4 0.6 True subclonal-ratio E st im at ed s ub cl on al -r at io ba f g Ancestral/sensitive cell line (B0) Subclonal/resistant cell line (B1) c d e Figure 4: Estimation of mixtures of high grade serous ovarian cancer cell lines. (a) Copy number profile of the ancestral/sensitive and subclonal/resistant HGSOC cell lines. Raw bin-wise and segmented copy number values are shown in black and red, respectively. Resistant-specific subclonal SCNAs are highlighted. (b) Purity estimates of samples S0- S5. Corrected values are computed using the linear fit in (c). Theoretical purity values are indicated by maroon diamonds. (c) True (theoretical) and estimated tumour purity of 120 in silico HGSOC cell line mixtures. y = x and the linear fit of the estimates (y = 0.81x) are shown with dashed and solid lines, respectively. Point shape and shade indicate total number of reads per sample. (d) Subclonal-ratio estimates for samples S1-S5. Shaded and empty bars indicate estimates derived using direct (Gaussian fit) and two-step (from relative ratios in (f)) methods, respectively. Error bars show 95% confidence interval of the direct estimate, maroon diamonds indicate theoretical values. (e) True and estimated subclonal-ratio of 50 in silico datasets constructed of samples from (c) with 50 million reads. (f) Relative subclonal-ratio estimates for samples S1-S4, compared to S5. Estimates from each subclonal segment are shown with dots, the median estimates are indicated by black lines, and true values with maroon diamonds. (g) True and estimated relative subclonal-ratio in the 50 datasets shown in (g). 14 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ 0 10 20 % subclonal cells 0 10 20 30 % subclonal cells 0 10 20 % subclonal cells baseline SCNA subclone-associated SCNA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 18 20 22 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Chromosome C op y nu m be r 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 18 20 22 1 2 3 1 2 3 1 2 3 1 2 3 Chromosome C op y nu m be r Patient 1306 Patient 2760 Patient 3209 Baseline Week 18 Week 24 End of therapy Baseline Week 24 Week 30 End of therapy Baseline Week 18 End of therapy a b c 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 18 20 22 1 2 3 1 2 3 1 2 3 Chromosome C op y nu m be r baseline SCNA subclone-associated SCNA baseline SCNA subclone-associated SCNA Figure 5: Estimation in cfDNA samples from patient data. Subclone-specific copy number changes and subclonal-ratio in lung cancer patients (a) 1306, (b) 2760, and (c) 3209 from [24]. Left: purity-corrected SCNA profiles. Yellow bars show the CN of each segment in the baseline sample, and red bars indicate subclonal deviations from this value in non-baseline samples. Regions of subclone-specific CNAs are also indicated by darker shades. Right: estimated resistant proportion of each sample with 95% confidence intervals. Note that only samples with >10% purity were analysed (c.f. S7). A bar of CN=6 on chromosome 3 (indicated by asterisk) has been omitted from (c) for better visualisation. 15 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Methods Formal definition of the problem Copy number measurements We consider a tumour that consists of two distinct cell populations, ancestral (A) and subclonal (S) tumour cells, and continuously sheds cell-free DNA (cfDNA) into the blood circulation. A typical scenario would be ancestral cells representing drug-sensitive tumour cells present before cancer therapy, and subclonal cells denoting the emerging subclone with resistance to therapy. The proportion of DNA originating from these two cell types changes over time as we take measurements via blood samples (Fig. 1). Since cell-free DNA found in blood can also originate from normal (non-tumour) cells of the body, the measured DNA is contributed by a mixture of the two tumour cell populations (A and S) and normal cells (N). At each time-point i the proportion of these three populations in the measured sample, si, depends on the proportion of all tumour-derived DNA (the purity of the sample, pi) and the proportion of subclone-derived DNA from the tumour (the subclonal-ratio, ri): Ni = 1 � pi; Ai = pi · (1 � ri); Si = pi · ri. (2) Our aim is to track the dynamics of the subclonal (putatively resistant) population by determining the subclonal-ratio for each time-point, ri, or the change in subclonal-ratio between time-points, ri/rk = rik. To this end, we use the copy number values as typically measured by lpWGS of the sequential cfDNA samples. Let us consider distinct genomic regions with homogeneous copy number state, seg- ments. We assume that the copy number (CN) state of most segments stays constant over time in a particular population. Therefore the jth segment is characterised by a set of three time-independent absolute CN states, C(N)j, C(A)j, C(S)j, corresponding to the local CN in normal, ancestral and subclonal cells, respectively. The copy number of segment j as measured in the ith sample, C j i , is the combination of these three absolute CNs, weighted by the proportions of DNA derived from the three cell populations at that time-point (Ni, Ai, Si). We know that normal cells are at a diploid state, hence C(N) j = 2 for all j. Therefore, using the purity and subclonal-ratio defined in Eq. (2), C j i = 2 + pi � (1 � ri)C(A)j + riC(S)j � 2 � . (3) Since all cells in a cell population share the absolute CN for a given segment, the values C(S)j and C(A)j are always integers. Therefore in theory, measured CNs from a given sample should be limited to a discrete set of values defined by these integer states, making it possible to solve the set of equations formed by Eq. (3) for pi and ri using linear algebra. However, we have to take into account that all real sequencing measurements have a level of imprecision introducing variation on top of this relationship. Using the term �ij to represent the noise in the ith measurement of segment j, Eq. (3) becomes, C j i = 2 + pi � (1 � ri)C(A)j + riC(S)j � 2 � + �ij. (4) 16 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ with the magnitude and family of this noise depending on the specifics of the technology used for CN measurement, especially the sequencing depth [19]. This measurement noise – associated with a continuous distribution – broadens the set of C j i values, rendering a linear algebra solution impossible. Hence, our aim becomes to derive an inference of pi and ri despite this unknown noise, �ij. Segment classification Each segment can fall into three categories depending on their respective copy number states in the two types of cells. (i) Clonal segments have the same absolute CN in ancestral and subclonal tumour cells, C(A)j = C(S)j. A special case of clonal segments are segments of neutral CN, where C(A)j = C(S)j = 2. (ii) Subclonal segments have di↵erent absolute CNs in the ancestral and subclonal tumour population, C(A)j 6= C(S)j. These segments represent SCNAs that distinguish the subclone from its ancestor, even though they are not necessarily associated with a selective/phenotypic di↵erence (e.g. drug-resistance) directly. (iii) Unstable segments are neither clonal nor associated with the emergent subclone, and therefore are best described by a time-dependent tumour-wide CN value, ⇣(T) j i , that does not depend on ri. These segments can arise if a genomic region cannot be measured reliably or if on-going genomic instability introduces novel SCNAs during the time tracked by our samples. We can assume that the number of such segments is small compared to the total number of measured segments. Depending on whether segments are clonal, subclonal or unstable, their measured CN across samples will change according to the subclonal-ratio and purity of each sample. For simplicity, we omit the term �ij and its derivatives, but the reader should keep in mind that all equations are subject to measurement noise: C j i = 2 + pi(C(A) j � 2), if the segment is clonal, (5) C j i = 2 + pi � C(A)j � 2 + ri(C(S)j � C(A)j) � , if the segment is subclonal, (6) C j i = 2 + pi(⇣(T) j i )), if the segment is unstable. (7) Figure 1 illustrates how the measured CN of segments depend on the parameters ri and pi highlighted above. In the following sections, we use Eqs. (5) & (6) to estimate the underlying parameters, pi and ri, via three steps (Fig. 2). Estimation algorithm Purity estimation Purity estimation is carried out based on clonal (including neutral) segments. In general, we expect the majority of segments to fall into this category. Consequently, for the ma- jority of segments their measured copy number follows Eq. (5). Since C(A)j can take only integer values, the distribution of segment CNs is expected to have distinct peaks at regular intervals of pi. 17 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Using a peak-finder algorithm on the smoothed distribution of measured CN values, we directly compare the peaks to the values expected at a given purity, {2 � pi, 2, 2 + pi, 2 + 2pi, . . . }, as shown in Fig. 2b. The error of the fit to a purity, pi, is evaluated as the summed squared distance between each peak and the closest observed peak, X C(A) min � (2 + pi(C(A) � 2)) � peaks)2 � . (8) As the detected peaks of the data depend on the smoothing kernel used on the distribution, we perform this computation for a wide range of smoothing bandwidths (0.5 ⇥ �2.5⇥ the default value) and derive the purity estimate, p̂i, as the value that minimises the mean and/or median error across the range (Fig. 2c). Then, we use the derived p̂i to re-normalise the measured copy number values and thus eliminate normal contamination. We gain an estimate of the tumour-specific CN (C(T) j i ), a mixture of ancestral and subclonal CNs: Ĉ(T) j i = 1 p̂i · (Cji � 2) + 2 ⇡ C(A) j + ri(C(S) j � C(A)j). (9) Note that, due to the noise in measurements, peaks from close absolute CNs can become indistinguishable in low-purity samples. Therefore we expect purity values below 5% to be indistinguishable (unless high sequencing depth is available) and also advise to discard samples with low purity (typically pi < 0.1) as erroneous purity estimations can bias downstream computation. Identifying subclonal segments and sample order Next, we aim to identify the subset of segments with subclone-specific subclonal SCNAs that reflect the changes in subclonal-ratio over time. To easily assess the change in segment CNs, we designate a sample as baseline, and compute the change in segment CN, �CN, between each sample and this baseline sample. Typically, the sample taken upon diagnosis or before start of therapy (usually the first time-point, s1) can be used. We can assume that this sample has no or only negligible population of the emerging subclone, and therefore represents a pure ancestral population: r1 ⇡ 0 �! C(T) j 1 ⇡ C(A) j. Hence the change in CN of a subclonal segment compared to the baseline becomes, �C(T) j i = C(T) j i � C(T) j 1 = ri � C(S)j � C(A)j � . (10) Furthermore, Eq. (10) provides an informative quantity even if the baseline sample is not pure, as �C(T) j i nonetheless describes the change in subclone-specific SCNAs. In order to uncover which segments are truly subclonal, and how the subclonal-ratio changes over measurements, we need to identify a pervasive pattern across samples, and the subset of segments that consistently follows it. If the samples were taken so that 18 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ the subclonal population increases over time-points, this pattern would be a monotone increase or decrease for all segments with subclone-specific SCNAs. While we cannot assume that the samples are taken in order of increasing subclonal proportions (e.g. a change of treatment between sampling times might lead to fluctuating population size in a resistance-associated subclone), we can aim to re-arrange them to follow this rule. Consequently, we rephrase our aim as deriving (i) a set of subclonal segments that follow a monotone pattern across ordered samples; and (ii) an ordering of samples that is correlated with by the maximum number of (subclonal) segments. Formally, we are looking for a subset of segments, {j1, j2, . . . } and a permutation of samples (starting from the designated baseline sample), s1, si, . . . , sN, where for every segment j 2 {j1, j2, . . . } either �C(T) j i+1 � �C(T) j i > �✏, 8i or (11) �C(T) j i+1 � �C(T) j i < ✏, 8i holds for all i for a pre-defined accuracy level, ✏. We use an ✏ > 0 accuracy level to allow for samples with near-equal subclonal-ratio measured with uncertainty. We find that, for typical lpWGS datasets, ✏ ⇡ 0.02 � 0.05 works well to account for the underlying measurement noise. Figs. 2d-f illustrate the derivation of optimal sample order and subclonal segment set. We first separate clonal segments: since these have relative CN values of 0, apart from some measurement noise, we filter out any segment that has a standard deviation below a pre-defined threshold. We then evaluate Eq. (11) over all remaining segments and over all orderings of the samples. As we expect 4-6 time-points per dataset, an exhaustive search of all possible permutations is feasible. Given a permutation, each segment is classified according to whether it follows Eq. (11) – these are candidate subclone-specific and unstable segments, respectively (Fig. 2e). The optimal sample order is defined as the permutation that maximises the number of subclonal segments (Fig. 2f). Subclonal-ratio estimation Finally, we use the set of segments identified as subclonal, and compute the subclonal- ratio of each time point. We derive the (absolute) subclonal-ratio, ri, for each sample using Eq. (10). As both C(A)j and C(S)j are assumed to be integers, and we know that C(A)j 6= C(S)j, �C(T) j i 2 {. . . , �2ri, �ri, ri, 2ri, . . . }, 8 j 2 {j1, j2, . . . }. (12) To take into account that the measured �CNs compared to the baseline, �Ĉ(T) j i , are influenced by noise, we fit these values with a mixture of Gaussian distributions where the mean of the Gaussians follows Eq. (12), as illustrated in Fig 2h. The subclonal-ratio of a sample is derived as the constrained mean parameter, ri, of the Gaussian mixture 19 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ optimising the fit (Fig. 2i). The 95% confidence interval of the inferred subclonal-ratio is computed based on the (shared) variance of the fitted constrained Gaussians. The measurement noise propagated from segment CNs can lead to high spread in values, making estimates less robust and rendering the resolution of low subclonal-ratios (ri  0.1) challenging, occasionally leading to the Gaussian-fitting step to fail. Therefore we also derive relative subclonal-ratios, which allow for a more general application not limited to good quality samples. In particular, relative values are compared to the maximal sample since its subclonal-ratio is assumed to be the most robust against measurement noise. We compute the relative deviation of each normalised subclonal tumour segment CN, �C j iN = �C(T) j i �C(T) j N = ri(C(S) j � C(A)j) rN(C(S)j � C(A)j) = ri rN , (13) giving rise to a distribution of relative subclonal-ratio estimates (Fig. 2g). We derive a point estimate for the relative ri of each sample as the median of this set, r̂iN = median ⇣ �C j iN ⌘ , j 2 {j1, j2, . . . }. (14) Absolute subclonal-ratio estimates can then be derived using these relative estimates in a two-step estimation process (as opposed to the direct estimation above): we derive rN based on Eq. (12), and subsequently compute riN · rN to retrieve ri. Generating synthetic datasets We constructed synthetic datasets of 80 segments (of length varying between 120 and 800 bins) and 5 time-points as illustrated in Fig. 3a. For each segment, we generated sensitive segment copy number states (C(S)j) by randomly sampling from {1, 2, 3, 4, 5}, with neutral and close-to-neutral states occurring with higher frequency. Subclone-specific absolute CNs (C(S)j) were assigned by randomly sampling from C(A)j +{�2, �1, 0, 1, 2}, with no change (giving rise to clonal segments) having a higher weight. For each sample, si, we assigned purity and subclonal-ratio randomly from the ranges 0.04 < pi < 0.45 and 0.05 < ri < 0.8, with the exception of the baseline samples, where r1 < 0.04. We then recreated the measurement procedure of computing noise-ridden raw CN values in a given segment, j, by adding a normally distributed noise. The magnitude (standard deviation) of the noise was controlled by the noise level parameter, � (representing di↵erences arising from e.g. sequencing depth) and the CN of the segment (reflecting higher variance in higher CN states): rawCbini = 2 + pi � (1 � ri)C(A)j + riC(S)j � 2 � + Normal(0, f(�, C j i )). The final CN value of each segments, Ĉ j i , was computed as the mean of all rawC bin i contained in the segment. In addition, we selected 2.5-15% of segments as unstable, and re- sampled their tumour-specific CN value to be independent of ri. Fig. 3b-c show parameters of a synthetic sample and its copy number profile. 20 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ Generating in vitro and in silico cell line mixtures HGSOC cell line OVCAR4 was obtained from Prof Fran Balkwill (Barts Cancer Institute, UK) and grown in DMEM media containing 10% FBS and 1% penicillin/streptomycin. A resistant/subclonal HGSOC cell line (Ov4Cis) was generated by culturing an aliquot of the ancestral OVCAR4 cell line in increasing concentrations of cisplatin. For further details on cell culture and the celll lines, see [29]. We then extracted genomic DNA from both cell lines and from blood samples from healthy volunteers using QIAamp DNA Micro Kit (Qiagen, Hilden, Germany). Genomic DNA from the three sources was mixed in varying proportions (Table S1), measured as the mass of DNA inputted from each source, to a total of 20 ng DNA per sample and subjected to sonication with the Covaris M220 system. Libraries were prepared using the NEBNext Ultra II kit (New England Biolabs, Hitchin, United Kingdom) with 4 cycles of PCR amplification, indexed with unique dual indexing primers and sequenced on Illumina NovaSeq 6000 to a mean depth of 1.3x. In silico mixtures were generated by bioinformatically mixing sequencing reads of DNA derived from the ancestral/sensitive, subclonal/resistant tumour cell lines and healthy blood cells. Similarly to synthetic samples, for each in silico sample we randomly assigned purity, 0.1 < pi < 0.45, and subclonal-ratio, 0.05 < ri < 0.8. We then sampled reads (using samtools view -s) from aligned read (bam) files of ‘pure’ ancestral, subclonal and normal samples (B0, B1 and N0) in proportions to match pi(1 � ri), piri and 1 � pi, respectively. We also varied the total number of reads per sample (as a proxy for sequencing depth and consequently measurement noise), and generated 30-30 samples with 50, 20, 10, and 5 million total reads each. Processing lpWGS samples Fastq files derived from lpWGS samples (generated via sequencing cell line mixtures or obtained from [24]) were aligned to the human reference genome (version hg19, using bwa). We then processed bam files using the QDNAseq R package [19] employing DNAcopy for segmentation [42]. QDNAseq produced two copy number values for each genomic bin: a raw pre-segmentation and a segmented value grouping bins of equal CN together. The CN of bins on the pre-defined blacklist of QDNAseq and of those with <75% mappability was set to NA. Raw and segmented CN values for all cell line samples are available from https://github.com/elakatos/liquidCNA_data. Since QDNAseq returns normalised CN values (with neutral state at 1), we multiplied all values by 2 before proceeding with the estimation algorithm and re-normalised segment CN values to be centred at 2 exactly. We then re-defined segment boundaries using the ensemble of samples as regions of constant CN in all samples. This way break-points present in only a sub-set of samples (such as a subclone-specific SCNA) gave rise to segments handled separately for all samples. Updated segments with length below 6 mega- bases (120 bins of 50kb (cell line mixtures) or 12 bins of 500kb (patient cfDNA samples)) 21 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://github.com/elakatos/liquidCNA_data https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ were excluded from the downstream analysis to filter out short segments sensitive to localised measurement biases. Finally, we curated each segment CN by discarding bins with the most extreme 2.5% of raw segment values, and re-calculating the segment CN value as the mean of normal distribution fitted to the remaining raw CNs. We found that this curation had negligible e↵ect for most segments, but successfully improved assigned segment CN values for more error-prone genomic regions. Data availability Aligned sequencing data from HGSOC cell lines and in vitro mixtures (listed in table S1) are available from the European Nucleotide Archive (accession PRJEB42332). Raw and post-segmentation copy number values for these samples are available from https: //github.com/elakatos/liquidCNA_data. Code availability Estimation functions of liquidCNA implemented in R (version 4.0.3), an illustrative ex- ample in a Jupyter notebook and code generating and analysing synthetic and in silico data are available from https://github.com/elakatos/liquidCNA. 22 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://github.com/elakatos/liquidCNA_data https://github.com/elakatos/liquidCNA_data https://github.com/elakatos/liquidCNA https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ References [1] Siravegna, G., Marsoni, S., Siena, S. & Bardelli, A. Integrating liquid biopsies into the management of cancer. Nature Reviews Clinical Oncology 14, 531–548 (2017). URL https://doi.org/10.1038/nrclinonc.2017.14. [2] Ng, S. B. et al. Individualised multiplexed circulating tumour dna assays for monitor- ing of tumour presence in patients after colorectal cancer surgery. Scientific reports 7, 40737–40737 (2017). URL https://pubmed.ncbi.nlm.nih.gov/28102343. [3] Rothwell, D. G. et al. Utility of ctdna to support patient selection for early phase clinical trials: the target study. Nat Med 25, 738–743 (2019). [4] Khan, K. H. et al. Longitudinal liquid biopsy and mathematical modeling of clonal evolution forecast time to treatment failure in the prospect-c phase ii colorectal cancer clinical trial. Cancer Discov 8, 1270–1285 (2018). [5] Fernandez-Garcia, D. et al. Plasma cell-free dna (cfdna) as a predictive and prognostic marker in patients with metastatic breast cancer. Breast Cancer Research 21, 149 (2019). URL https://doi.org/10.1186/s13058-019-1235-8. [6] Conteduca, V. et al. Plasma tumour dna as an early indicator of treatment response in metastatic castration-resistant prostate cancer. British Journal of Cancer (2020). URL https://doi.org/10.1038/s41416-020-0969-5. [7] Nakamura, Y. et al. Clinical utility of circulating tumor dna sequencing in advanced gastrointestinal cancer: Scrum-japan gi-screen and gozila studies. Nature Medicine (2020). URL https://doi.org/10.1038/s41591-020-1063-5. [8] Diaz, L. A. J. et al. The molecular evolution of acquired resistance to targeted egfr blockade in colorectal cancers. Nature 486, 537–540 (2012). [9] Bettegowda, C. et al. Detection of circulating tumor dna in early- and late-stage human malignancies. Sci Transl Med 6, 224ra24 (2014). [10] Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor dna with broad patient coverage. Nature Medicine 20, 548–554 (2014). URL https: //doi.org/10.1038/nm.3519. [11] Parkinson, C. A. et al. Exploratory analysis of tp53 mutations in circulating tumour dna as biomarkers of treatment response for patients with relapsed high-grade serous ovarian carcinoma: A retrospective study. PLoS medicine 13, e1002198 (2016). URL https://europepmc.org/articles/PMC5172526. [12] Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010). URL https://doi.org/10.1038/ nature08822. 23 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1038/nrclinonc.2017.14 https://pubmed.ncbi.nlm.nih.gov/28102343 https://doi.org/10.1186/s13058-019-1235-8 https://doi.org/10.1038/s41416-020-0969-5 https://doi.org/10.1038/s41591-020-1063-5 https://doi.org/10.1038/nm.3519 https://doi.org/10.1038/nm.3519 https://europepmc.org/articles/PMC5172526 https://doi.org/10.1038/nature08822 https://doi.org/10.1038/nature08822 https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ [13] Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: The next generation. Cell 144, 646–674 (2011). URL https://doi.org/10.1016/j.cell.2011.02.013. [14] Sansregret, L., Vanhaesebroeck, B. & Swanton, C. Determinants and clinical impli- cations of chromosomal instability in cancer. Nature Reviews Clinical Oncology 15, 139–150 (2018). URL https://doi.org/10.1038/nrclinonc.2017.198. [15] Li, X. et al. Temporal and spatial evolution of somatic chromosomal alterations: a case-cohort study of barrett’s esophagus. Cancer Prev Res (Phila) 7, 114–127 (2014). [16] Hieronymus, H. et al. Tumor copy number alteration burden is a pan-cancer prog- nostic factor associated with recurrence and death. Elife 7 (2018). [17] Rubin, C. E. et al. Dna aneuploidy in colonic biopsies predicts future development of dysplasia in ulcerative colitis. Gastroenterology 103, 1611–1620 (1992). [18] Zaccaria, S. & Raphael, B. J. Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. Nature Commu- nications 11, 4301 (2020). URL https://doi.org/10.1038/s41467-020-17967-y. [19] Scheinin, I. et al. Dna copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res 24, 2022–2032 (2014). [20] Adalsteinsson, V. A. et al. Scalable whole-exome sequencing of cell-free dna reveals high concordance with metastatic tumors. Nature Communications 8, 1324 (2017). URL https://doi.org/10.1038/s41467-017-00965-y. [21] Van Roy, N. et al. Shallow whole genome sequencing on circulating cell-free dna allows reliable noninvasive copy-number profiling in neuroblastoma patients. Clin Cancer Res 23, 6305–6314 (2017). [22] Hovelson, D. H. et al. Rapid, ultra low coverage copy number profiling of cell-free dna as a precision oncology screening strategy. Oncotarget 8, 89848–89866 (2017). [23] Chin, S.-F. et al. Shallow whole genome sequencing for robust copy number profil- ing of formalin-fixed para�n-embedded breast cancers. Experimental and Molecular Pathology 104, 161–169 (2018). URL http://www.sciencedirect.com/science/ article/pii/S0014480017306470. [24] Chen, X. et al. Low-pass whole-genome sequencing of circulating cell-free dna demonstrates dynamic changes in genomic copy number in a squamous lung cancer clinical cohort. Clinical Cancer Research 25, 2254–2263 (2019). URL https://clincancerres.aacrjournals.org/content/25/7/2254. https:// clincancerres.aacrjournals.org/content/25/7/2254.full.pdf. 24 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1016/j.cell.2011.02.013 https://doi.org/10.1038/nrclinonc.2017.198 https://doi.org/10.1038/s41467-020-17967-y https://doi.org/10.1038/s41467-017-00965-y http://www.sciencedirect.com/science/article/pii/S0014480017306470 http://www.sciencedirect.com/science/article/pii/S0014480017306470 https://clincancerres.aacrjournals.org/content/25/7/2254 https://clincancerres.aacrjournals.org/content/25/7/2254.full.pdf https://clincancerres.aacrjournals.org/content/25/7/2254.full.pdf https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ [25] Belic, J. et al. mfast-seqs as a monitoring and pre-screening tool for tumor-specific aneuploidy in plasma dna. Adv Exp Med Biol 924, 147–155 (2016). [26] Vanderstichele, A. et al. Chromosomal instability in cell-free dna as a highly specific biomarker for detection of ovarian cancer in women with adnexal masses. Clin Cancer Res 23, 2223–2231 (2017). [27] Taylor, F., Bradford, J., Woll, P. J., Teare, D. & Cox, A. Unbiased detection of somatic copy number aberrations in cfdna of lung cancer cases and high-risk controls with low coverage whole genome sequencing. Adv Exp Med Biol 924, 29–32 (2016). [28] Wei, T. et al. Genome-wide profiling of circulating tumor dna depicts landscape of copy number alterations in pancreatic cancer with liver metastasis. Mol Oncol 14, 1966–1977 (2020). [29] Hoare, J. et al. Platinum resistance induces diverse evolutionary trajecto- ries in high grade serous ovarian cancer. bioRxiv (2020). URL https: //www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378. https:// www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378.full.pdf. [30] Nelson, L. et al. A living biobank of ovarian cancer ex vivo models reveals profound mitotic heterogeneity. Nature Communications 11, 822 (2020). URL https://doi. org/10.1038/s41467-020-14551-2. [31] Network, C. G. A. R. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011). URL https://pubmed.ncbi.nlm.nih.gov/21720365. [32] Soria, J.-C. et al. A phase ib dose-escalation study of the safety and pharmacoki- netics of pictilisib in combination with either paclitaxel and carboplatin (with or without bevacizumab) or pemetrexed and cisplatin (with or without bevacizumab) in patients with advanced non–small cell lung cancer. European Journal of Cancer 86, 186 – 196 (2017). URL http://www.sciencedirect.com/science/article/pii/ S0959804917312522. [33] Housman, G. et al. Drug resistance in cancer: an overview. Cancers (Basel) 6, 1769–1792 (2014). [34] Gatenby, R. A., Silva, A. S., Gillies, R. J. & Frieden, B. R. Adaptive therapy. Cancer Res 69, 4894–4903 (2009). [35] Enriquez-Navas, P. M., Wojtkowiak, J. W. & Gatenby, R. A. Application of evolu- tionary principles to cancer therapy. Cancer Res 75, 4675–4680 (2015). [36] Zhang, J., Cunningham, J. J., Brown, J. S. & Gatenby, R. A. Integrating evo- lutionary dynamics into treatment of metastatic castrate-resistant prostate can- cer. Nature Communications 8, 1816 (2017). URL https://doi.org/10.1038/ s41467-017-01968-5. 25 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378 https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378 https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378.full.pdf https://www.biorxiv.org/content/early/2020/07/24/2020.07.23.200378.full.pdf https://doi.org/10.1038/s41467-020-14551-2 https://doi.org/10.1038/s41467-020-14551-2 https://pubmed.ncbi.nlm.nih.gov/21720365 http://www.sciencedirect.com/science/article/pii/S0959804917312522 http://www.sciencedirect.com/science/article/pii/S0959804917312522 https://doi.org/10.1038/s41467-017-01968-5 https://doi.org/10.1038/s41467-017-01968-5 https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/ [37] Choudhury, A. D. et al. Tumor fraction in cell-free dna as a biomarker in prostate can- cer. JCI Insight 3 (2018). URL https://doi.org/10.1172/jci.insight.122109. [38] Phallen, J. et al. Direct detection of early-stage cancers using circulating tumor dna. Science translational medicine 9, eaan2415 (2017). URL https://pubmed.ncbi. nlm.nih.gov/28814544. [39] Mouliere, F. et al. High fragmentation characterizes tumour-derived circulating dna. PLOS ONE 6, 1–10 (2011). URL https://doi.org/10.1371/journal.pone. 0023418. [40] Underhill, H. R. et al. Fragment length of circulating tumor dna. PLOS Genetics 12, 1–24 (2016). URL https://doi.org/10.1371/journal.pgen.1006162. [41] Mouliere, F. et al. Enhanced detection of circulating tumor dna by fragment size analysis. Science Translational Medicine 10 (2018). URL https://stm.sciencemag. org/content/10/466/eaat4921. https://stm.sciencemag.org/content/10/466/ eaat4921.full.pdf. [42] Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array cgh data. Bioinformatics 23, 657–663 (2007). 26 .CC-BY-NC-ND 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted January 7, 2021. ; https://doi.org/10.1101/2021.01.05.425414doi: bioRxiv preprint https://doi.org/10.1172/jci.insight.122109 https://pubmed.ncbi.nlm.nih.gov/28814544 https://pubmed.ncbi.nlm.nih.gov/28814544 https://doi.org/10.1371/journal.pone.0023418 https://doi.org/10.1371/journal.pone.0023418 https://doi.org/10.1371/journal.pgen.1006162 https://stm.sciencemag.org/content/10/466/eaat4921 https://stm.sciencemag.org/content/10/466/eaat4921 https://stm.sciencemag.org/content/10/466/eaat4921.full.pdf https://stm.sciencemag.org/content/10/466/eaat4921.full.pdf https://doi.org/10.1101/2021.01.05.425414 http://creativecommons.org/licenses/by-nc-nd/4.0/