HaVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences

Authors and institutional addresses
Phuoc Truong Nguyen 1, Ilya Plyusnin 2,3, Tarja Sironen 1,3, Olli Vapalahti 1,3,4, Ravi Kant †1,3, Teemu Smura †1,4

1. Department of Virology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
2. Institute of Biotechnology, University of Helsinki, Helsinki, Finland
3. Department of Veterinary Biosciences, University of Helsinki, Helsinki, Finland
4. Department of Virology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
†Correspondence to: Ravi.Kant@helsinki.fi or Teemu.Smura@helsinki.fi

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.12.431018; this version posted February 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract
Background: SARS-CoV-2 related research has increased in importance worldwide since December 2019. Several new variants of SARS-CoV-2 have emerged globally, of which the most notable and concerning at present are the UK variant B.1.1.7, the South African variant B.1.351 and the Brazilian variant P.1. Detecting and monitoring novel variants is essential in SARS-CoV-2 surveillance. While there are several tools for assembling virus genomes and performing lineage analyses to investigate SARS-CoV-2, each is limited to performing one or a few functions separately.

Results: Because there is a lack of publicly available pipelines that can perform fast reference-based assemblies on raw SARS-CoV-2 sequences and also identify lineages to detect variants of concern, we have developed an open source bioinformatic pipeline called HaVoC (Helsinki university Analyzer for Variants Of Concern). HaVoC can reference assemble raw sequence reads and assign the corresponding lineages to SARS-CoV-2 sequences.

Conclusions: HaVoC is a pipeline that uses several bioinformatic tools to perform the analyses needed for investigating genetic variance among SARS-CoV-2 samples. The pipeline is particularly useful for those who need an accessible and fast tool to detect and monitor the spread of SARS-CoV-2 variants of concern during local outbreaks. HaVoC is currently being used in Finland for monitoring the spread of SARS-CoV-2 variants. The HaVoC user manual and source code are available at https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc, respectively.

Keywords
SARS-CoV-2, variant detection, reference assembly, lineage identification, coronavirus, sequence analysis.

Background
Emerging pathogens pose a continuous threat to mankind, as exemplified by the Ebola virus epidemic in West Africa in 2014 [1], the Zika virus pandemic in 2015 [2], and the ongoing Coronavirus disease 2019 (COVID-19) pandemic. These viruses are zoonotic, i.e. they have crossed species barriers from animals to humans, like the majority of emerging human pathogens [3, 4]. The likelihood of this host switching is enhanced by several factors, e.g. global movement of people and animals, environmental changes, increased proximity of humans, wildlife and livestock, and population expansion into new environments [5].

The mutation and evolution rate of RNA viruses is considerably higher than that of their hosts, which is advantageous for viral adaptation. Mutations in the viral genome are mostly silent or, if they affect the phenotype, related to attenuation, although mutations can also lead to more pathogenic strains. A new virus variant may have one or more mutations that separate it from the wild-type virus already circulating among the general population.

Coronaviruses (family Coronaviridae) are enveloped single-stranded RNA viruses, which cause respiratory, enteric, hepatic, and neurological diseases of a broad spectrum of severity among different animals and humans. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a novel evolutionarily divergent virus responsible for the present pandemic, has devastated societies and economies globally. SARS-CoV-2 has already infected more than 100 million people in 221 countries, causing over 2.2 million deaths as of 3 February 2021 [6]. In autumn 2020, a new variant of SARS-CoV-2 known as 20B/501Y.V1 (B.1.1.7) was detected in south-eastern England, Wales, and Scotland [7]. This variant has since spread globally to more than 80 countries. The variant has accumulated 23 mutations (13 nonsynonymous mutations, four amino acid deletions and six synonymous mutations), making the virus more transmissible [8].
Another variant, 20C/501Y.V2 (B.1.351), was detected in South Africa and is genetically distant from the UK 20B/501Y.V1 variant [9]. This South African variant, which has two mutations in the receptor-binding motif that mainly forms the interface with the human ACE2 receptor, has also spread widely and now circulates globally. Some existing vaccines against SARS-CoV-2 have been reported to be less effective against the 20C/501Y.V2 variant [10–12]. A third variant being closely monitored is P.1, first detected in Brazil [13]. Interestingly, all three variants have a mutation in the receptor binding domain (RBD) of the spike protein at position 501, where the amino acid asparagine (N) has been replaced with tyrosine (Y), enabling specific PCR assays to detect the N501Y mutation [14].

As more transmissible coronavirus variants are circulating worldwide, the role of researchers and technology specialists in controlling the pandemic has received more emphasis. Surveillance of virus variants by sequencing SARS-CoV-2 genomes provides a fast way to monitor variants and their spread; however, there are only a few publicly available methods for quick reference-based consensus assembly and lineage assignment for SARS-CoV-2 samples. To address this, we have developed a simple pipeline called HaVoC (Helsinki university Analyzer for Variants Of Concern) that performs both tasks.
This will provide the end user with a quick and accessible method of variant identification and monitoring. The pipeline was developed to be run on Unix/Linux operating systems, and thus can also be used on remote servers, e.g. those of CSC – IT Center for Science, Finland.

Implementation
HaVoC consists of a single shell script that performs a reference-based consensus assembly for each query SARS-CoV-2 FASTQ sequence library and assigns a lineage to it, processing the libraries one after another. The script can be started by typing the following line into a command line terminal:

    sh HaVoC.sh [FASTQ directory]

The computation of consensus sequences starts with the tool detecting FASTQ files generated via paired-end sequencing in the given input directory and checking that each query FASTQ file has its corresponding counterpart, i.e. its mate file. The file names are shortened to be more concise, e.g. Query-Seq:1_X123_Y000_R1_000.fastq.gz becomes Query-Seq:1_R1.fastq.gz. The pipeline accepts FASTQ files in both gzipped and uncompressed format.

For the analyses, the user can choose which bioinformatic tools to use. This is done by setting the corresponding option (tools_prepro, tools_aligner or tools_sam) in the options section at the beginning of the script file.
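
As an illustration of the options section described above, the following is a minimal sketch and not HaVoC's actual source. It sets the three tool options and rejects unsupported values before any analysis starts. The variable names tools_prepro, tools_aligner and tools_sam come from the text; the helpers check_choice and validate_options are hypothetical, and the value spellings other than "fastp" and "trimmomatic" (here "bwa", "bowtie", "sambamba" and "samtools") are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical options block; only the variable names are taken from the text.
tools_prepro="fastp"      # or "trimmomatic"
tools_aligner="bwa"       # assumed spelling; alternative assumed to be "bowtie"
tools_sam="sambamba"      # assumed spelling; alternative assumed to be "samtools"

# check_choice NAME VALUE ALLOWED...
# Returns 0 if VALUE equals one of the ALLOWED words; otherwise prints an
# error message to stderr and returns 1. (Illustrative helper, not part of HaVoC.)
check_choice() {
    opt_name=$1
    opt_value=$2
    shift 2
    for allowed in "$@"; do
        if [ "$opt_value" = "$allowed" ]; then
            return 0
        fi
    done
    echo "Unsupported value for $opt_name: '$opt_value' (expected one of: $*)" >&2
    return 1
}

# validate_options: checks the three tool options in turn and returns
# non-zero as soon as one of them holds an unsupported value.
validate_options() {
    check_choice tools_prepro  "$tools_prepro"  fastp trimmomatic &&
    check_choice tools_aligner "$tools_aligner" bwa bowtie &&
    check_choice tools_sam     "$tools_sam"     sambamba samtools
}

if validate_options; then
    echo "Options OK: $tools_prepro, $tools_aligner, $tools_sam"
fi
```

A check of this kind, placed directly after the options section, would let a script stop with a clear message about a mistyped tool name instead of failing partway through an assembly.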
For example, if the user wants to use Trimmomatic to pre-process FASTQ files, the following line can be changed

from
    tools_prepro="fastp"
to
    tools_prepro="trimmomatic"

Other options include the number of threads, the minimum coverage below which a region is masked (min_coverage), and whether to run pangolin to assign lineages to the consensus genome (run_pangolin). An additional option allows HaVoC to be run on the CSC servers (run_in_csc).

Pre-alignment quality control, e.g. removing and trimming low-quality reads and bases and removing adapter sequences, can be done with either fastp [15] or Trimmomatic [16]. The reads are then aligned to the reference genome of SARS-CoV-2 isolate Wuhan-Hu-1 (GenBank accession code: NC_045512.2) with BWA-MEM [17] or Bowtie 2 [18]. The resulting SAM and BAM files are processed (sorting, filling in mate coordinates, marking duplicate alignments, and indexing reads) with Sambamba [19] or Samtools [20], and the low-coverage regions are masked with BEDtools [21]. After masking, variants are called with LoFreq [22] before the consensus sequence is computed with BCFtools, which is part of the Samtools suite [20]. Finally, the consensus sequence is analyzed with pangolin [23] to assign a lineage. The whole process is depicted in Fig. 1.

Fig. 1 Flowchart describing processes and steps performed by the HaVoC pipeline.
The pipeline constructs consensus sequences from all FASTQ files in an input directory and then compares the resulting sequences to other established SARS-CoV-2 genomes to assign them the most likely lineages. The pipeline requires a FASTA file of adapter sequences for FASTQ pre-processing and a reference genome of SARS-CoV-2 in a separate FASTA file. The adapter file is not required when running the pipeline with the fastp option. Input files are highlighted in green and the outputs in red.

Usage example
We demonstrate a common use case for HaVoC with FASTQ files containing reads for SARS-CoV-2 sequences, provided by the Viral Zoonoses Research Unit at the University of Helsinki, Finland. The Example_FASTQs folder contains paired-end FASTQ files for the UK variant (UK-variant-1) and the South African variant (S-Africa-variant-1). To analyse these example files, the aforementioned command is run as follows:

    sh HaVoC.sh Example_FASTQs

Results
The FASTQ files are processed and analyzed with the default options, which use the faster bioinformatic tools (fastp, BWA-MEM and Sambamba), in ca. 2–4 minutes, depending on the performance of the platform (local or server). After HaVoC has finished the analyses, each FASTQ file is moved to its respective result folder within the FASTQ directory. Each result folder contains a FASTA file for the consensus sequence (e.g.
UK-variant-1_consensus.fa) and a CSV file with the lineage information produced by pangolin (e.g. UK-variant-1_pangolin_lineage.csv). In addition to these main result files, each directory contains the original FASTQ files, BAM files (original, indexed and sorted), variant call format (VCF) files with mutation data, the BED file used for masking regions, and fastp report files with the results of FASTQ processing. The resulting directory and file structure with the example files will look as follows:

Example_FASTQs/
    UK-variant-1/
        UK-variant-1.bam
        UK-variant-1_R1.fastq.gz
        UK-variant-1_R2.fastq.gz
        UK-variant-1_consensus.fa
        UK-variant-1_fixmate.bam
        UK-variant-1_indel.bam
        UK-variant-1_indel.vcf
        UK-variant-1_indel_flt.vcf
        UK-variant-1_lowcovmask.bed
        UK-variant-1_markdup.bam
        UK-variant-1_namesort.bam
        UK-variant-1_pangolin_lineage.csv
        UK-variant-1_sorted.bam
        fastp.html
        fastp.json
    S-Africa-variant-1/
        S-Africa-variant-1.bam
        S-Africa-variant-1_R1.fastq.gz
        S-Africa-variant-1_R2.fastq.gz
        S-Africa-variant-1_consensus.fa
        S-Africa-variant-1_fixmate.bam
        S-Africa-variant-1_indel.bam
        S-Africa-variant-1_indel.vcf
        S-Africa-variant-1_indel_flt.vcf
        S-Africa-variant-1_lowcovmask.bed
        S-Africa-variant-1_markdup.bam
        S-Africa-variant-1_namesort.bam
        S-Africa-variant-1_pangolin_lineage.csv
        S-Africa-variant-1_sorted.bam
        fastp.html
        fastp.json

Each of the example UK variant samples should be assigned to lineage B.1.1.7 and the South African variant samples to B.1.351 (with pangoLEARN release 2021-02-06). Note, however, that as more sequences are uploaded and the pangolin lineage nomenclature is updated, the assigned lineages may differ from the expected ones described in this paper. Regions with low coverage (below 30 with the default setting) are replaced with the letter N during masking and represent gaps in the final consensus sequences.

HaVoC is comparable to alternative combinations of tools, e.g. Jovian and pangolin, in both speed and accuracy. These tools, however, operate separately, and at the time of writing there is no single public tool that can perform both a reference-based consensus assembly and a lineage identification in an easily accessible manner.

Conclusions
Early detection and understanding of the potential impact of emerging variants of SARS-CoV-2 are of primary importance and can assist in more efficient surveillance and control of the disease. The emergence of novel SARS-CoV-2 variants of concern is made more likely, and is accelerated, by the high mutation rates typical of RNA viruses and the growing number of transmissions and infections both locally and globally.

With the rising number of variants detected worldwide, many of them associated with increased transmissibility and lower vaccine efficacy, there is an emerging need for fast, efficient and reliable pipelines to help detect, identify and trace SARS-CoV-2 lineages.
These pipelines should also be accessible to researchers who may not be familiar with complex bioinformatic tools or with scripting pipelines.

To meet these challenges, we have developed HaVoC, a simple, reliable and user-friendly pipeline that can be downloaded from our repository and run without being installed. All its dependencies can be installed via existing package managers, of which we recommend Bioconda. HaVoC could help in the current pandemic by detecting variants of concern in the sequencing centers and public health or other organisations that are currently monitoring and tracing variants of concern worldwide. HaVoC is currently used in Finland for detecting and tracing SARS-CoV-2 variants of concern, mainly B.1.1.7, B.1.351 and P.1.

Availability and requirements
Project name: HaVoC (Helsinki university Analyzer for Variants Of Concern)
Project home page: https://www.helsinki.fi/en/projects/havoc and https://bitbucket.org/auto_cov_pipeline/havoc
Operating system(s): Linux, Mac
Programming language: Shell script
Other requirements: Trimmomatic or fastp, BWA-MEM or Bowtie 2, Samtools, BEDtools, BCFtools, LoFreq and pangolin.
License: GNU GPL
Any restrictions to use by non-academics: license needed

List of abbreviations
SARS-CoV-2 - Severe acute respiratory syndrome coronavirus 2
COVID-19 - Coronavirus disease 2019
HaVoC - Helsinki university Analyzer for Variants Of Concern

References
1. Dixon MG, Schafer IJ, Centers for Disease Control and Prevention (CDC). Ebola viral disease outbreak--West Africa, 2014. MMWR Morb Mortal Wkly Rep. 2014;63:548–51.
2. Kindhauser MK, Allen T, Frank V, Santhana RS, Dye C. Zika: the origin and spread of a mosquito-borne virus. Bull World Health Organ. 2016;94:675-686C. doi:10.2471/BLT.16.171082.
3. Taylor LH, Latham SM, Woolhouse ME. Risk factors for human disease emergence. Philos Trans R Soc Lond B Biol Sci. 2001;356:983–9. doi:10.1098/rstb.2001.0888.
4. Woolhouse MEJ, Gowtage-Sequeria S. Host range and emerging and reemerging pathogens. Emerging Infect Dis. 2005;11:1842–7. doi:10.3201/eid1112.050997.
5. Morens DM, Fauci AS. Emerging Pandemic Diseases: How We Got to COVID-19. Cell. 2020;182:1077–92. doi:10.1016/j.cell.2020.08.021.
6. Worldometer - COVID-19 Virus Pandemic. https://www.worldometers.info/coronavirus/. Accessed 3 Feb 2021.
7. Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, et al. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological. 2020.
https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563. Accessed 2 Feb 2021.
8. Leung K, Shum MH, Leung GM, Lam TT, Wu JT. Early transmissibility assessment of the N501Y mutant strains of SARS-CoV-2 in the United Kingdom, October to November 2020. Euro Surveill. 2021;26. doi:10.2807/1560-7917.ES.2020.26.1.2002106.
9. Tegally H, Wilkinson E, Giovanetti M, Iranzadeh A, Fonseca V, Giandhari J, et al. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv. 2020. doi:10.1101/2020.12.21.20248640.
10. Mahase E. Covid-19: Novavax vaccine efficacy is 86% against UK variant and 60% against South African variant. BMJ. 2021:n296. doi:10.1136/bmj.n296.
11. Kupferschmidt K. Vaccine 2.0: Moderna and other companies plan tweaks that would protect against new coronavirus mutations. Science. 2021. doi:10.1126/science.abg7691.
12. Edwards E. J&J says vaccine effective against Covid, though weaker against South Africa variant. NBC News. 2021. https://www.nbcnews.com/health/health-news/j-j-vaccine-effective-against-covid-though-weaker-against-south-n1255400. Accessed 10 Feb 2021.
13. Faria NR, Claro IM, Candido D, Franco LAM, Andrade PS, Coletti TM, et al. Genomic characterisation of an emergent SARS-CoV-2 lineage in Manaus: preliminary findings. Virological. 2021.
https://virological.org/t/genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-manaus-preliminary-findings/586. Accessed 3 Feb 2021.
14. Centers for Disease Control and Prevention (CDC). Emerging SARS-CoV-2 Variants. https://www.cdc.gov/coronavirus/2019-ncov/more/science-and-research/scientific-brief-emerging-variants.html. Accessed 12 Feb 2021.
15. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.
16. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20. doi:10.1093/bioinformatics/btu170.
17. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013.
18. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9. doi:10.1038/nmeth.1923.
19. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–4. doi:10.1093/bioinformatics/btv098.
20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9. doi:10.1093/bioinformatics/btp352.
21. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. doi:10.1093/bioinformatics/btq033.
22. Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al.
LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40:11189–201. doi:10.1093/nar/gks918.
23. pangolin. https://github.com/cov-lineages/pangolin. Accessed 12 Feb 2021.

Declarations
Ethics approval and consent to participate
Not Applicable.

Consent for publication
Not Applicable.

Availability of data and materials
Publicly available at https://bitbucket.org/auto_cov_pipeline/havoc.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was supported by the Academy of Finland (grant number 336490), VEO - European Union’s Horizon 2020 (grant number 874735) and the Jane and Aatos Erkko Foundation.

Authors' contributions
Conceptualization: PTN IP RK TS TSi OV. Development: PTN IP RK TS. Testing/Formal analysis: PTN IP RK TS. Funding acquisition: TSi OV. Investigation: PTN IP RK TS. Methodology: PTN IP RK TS. Project administration: RK TS OV. Resources: PTN RK IP TS TSi OV. Validation: PTN IP RK TS. Writing – original draft: PTN RK. Writing – review & editing: IP TS TSi OV.

Acknowledgements
None.

Authors' information
None.