key: cord-0965070-0gmz34fs authors: Zrelovs, N.; Ustinova, M.; Silamikelis, I.; Birzniece, L.; Megnis, K.; Rovite, V.; Freimane, L.; Silamikele, L.; Ansone, L.; Pjalkovskis, J.; Fridmanis, D.; Priedite, M.; Caica, A.; Gavars, M.; Perminovs, D.; Storozenko, J.; Savicka, O.; Dimina, E.; Dumpis, U.; Klovins, J. title: First report on the Latvian SARS-CoV-2 isolate genetic diversity date: 2020-09-09 journal: nan DOI: 10.1101/2020.09.08.20190504 sha: b34fe78aae7824579067c7b32a9b1d03f755f54a doc_id: 965070 cord_uid: 0gmz34fs Remaining a major healthcare concern with nearly 27 million confirmed cases worldwide at the time of writing, novel severe acute respiratory syndrome coronavirus - 2 (SARS-CoV-2) has caused more than 880 thousand deaths since its outbreak in China, December 2019. First case of a person testing positive for SARS-CoV-2 infection within the territory of the Republic of Latvia was registered on 2nd of March 2020, nine days prior to the pandemic declaration by WHO. Since then, more than 230 000 tests were carried out confirming a total of 1330 cases of COVID-19 in the country as of 20th of August 2020. Rapidly reacting to the spread of the infection, an ongoing sequencing campaign was started mid-March in collaboration with the local testing laboratories, with an ultimate goal in sequencing as much local viral isolates as possible, resulting in first full-length SARS-CoV-2 isolate genome sequences from the Baltics region being made publicly available in early April. With 109 viral isolates representing ~8.2% of the total COVID-19 cases in the country being completely sequenced as of today, here we provide a first report on the genetic diversity of Latvian SARS-CoV-2 isolates. Current novel coronavirus disease (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which was formerly known as 2019 novel coronavirus (2019-nCoV), and is often referred to as Human coronavirus 2019 (hCoV-19), responsible for a sudden rise in pneumonia cases in Wuhan, China, late December 2019, was preventively deemed a Public Health Emergency of International Concern by WHO as early as 30th January, 2020 with only as few as 7836 cases confirmed worldwide back then. With rapidly growing number of confirmed positive cases throughout the world, SARS-CoV-2 quickly became arguably the most sequenced virus in history with nearly 55 thousands viral isolate near complete genome sequences of high quality available publicly at the time of writing at GISAID repository thanks to the unprecedented rate of collaborations between researchers and unpublished data sharing with the goal of effectively tackling the novel disease [1, 2] . First reported genomic sequence of SARS-CoV-2 was deduced from a metagenomic RNA of bronchoalveolar lavage fluid specimen sampled from a patient who worked at Wuhan seafood market, where the epidemiological onset of human-to-human transmission of a novel zoonotic coronavirus is thought to have taken place [3] , although evidence of an earlier contraction of the disease that was not associated with the seafood market has been documented, leading to the conclusion that the primary spill-over event has taken place elsewhere [4, 5] . The sequence of a 29903 base long non-segmented positive-sense single-stranded RNA molecule representing complete genome of the aforementioned isolate Wuhan-Hu-1 was deposited in GenBank [6] on 5 th of January, 2020 and is now known as a SARS-CoV-2 reference sequence available under accession numbers NC_045512.2 or MN908947.3. While viral family Coronaviridae, that comprises α/β/Δ/γ coronavirus genera, representatives are somewhat unique in comparison with most other RNA viruses in regards to their large genome size of ~30 kb, genomic organization of individual species does not differ much among other lower taxa within the family, while boasting variable number of open reading frames. The genome of SARS-CoV-2 begins with a 265 base long 5' UTR region starting with a leader sequence followed by a 21 290 base long ORF1ab, comprising about 70% of the genome length, that translates into two polyproteins via -1 ribosomal frameshift and encodes 16 non-structural proteins (nsp1-nsp16). The remaining part of the genome comprises ORFs coding for structural and accessory proteins of unknown function, sequentially: Spike glycoprotein (S), ORF3a, envelope protein (E), membrane glycoprotein (M), ORF6, ORF7a, ORF7b, ORF8, nucleocapsid phosphoprotein (N), ORF10, followed by 3' UTR ending in poly(A) tail. However, no evidence that would support the expression of SARS-CoV-2 ORF10 encoded protein of unknown function is yet found in the literature [7] . compared to other, milder symptoms causing, community-acquired human coronaviruses (HCoV-229E, HCoV-OC43, HCoV-HKU1 and HCoV-NL63) [8] . Studies on the origin of novel coronavirus have revealed that complete genomic sequence of SARS-CoV-2 suggests a more close, although not a direct parental, ancestral relationship with bat (~96% overall nucleotide homology with RaTG13[9]) and pangolin coronaviruses (up to ~92% homology, with S protein ACE2 receptor binding domain amino acid sequence being 97.4% identical to SARS-CoV-2 [10]), than to those of humans(~79% and ~50% identity to SARS-CoV and MERS-CoV, respectively [11] , and, while bats are already a long-time acknowledged reservoir of SARS-CoV-like β-coronaviruses [12, 13] , the assumption that pangolins can serve as a natural host for CoVs has been made only recently before the emergence of SARS-CoV-2 [14, 15] . Although the current risk of animal-human transmission of COVID-19 is considered low, a number of felines [16] , canines [17] and minks [18] worldwide have been reported to be infected with SARS-CoV-2. With a steadily growing number of complete hCoV-19 genome sequences, early efforts to classify novel isolates based on their genetic make-up have resulted in numerous proposals of different SARS-CoV-2 isolate classification systems [19] [20] [21] , some of which (e.g. PANGOLIN lineages) are complementary. However, with more than 84 000 of hCoV-19 genome sequences being available publicly as of now, ongoing efforts to aid in the classification of newly sequenced viral isolates have resulted in the general acceptance of GISAID's team developed SARS-CoV-2 major clade and lineage nomenclature system based on the specific combinations of 9 hCoV-19 genetic markers [1] . In accordance with this system, hCoV-19 isolates can be classified in at least six distinct major clades, namely: S, L (containing reference sequence Wuhan-Hu-1), V, G, GH, GR and O (other) isolate clades (Table 1) . Mid-August, 2020, the most represented clades Worldwide are GR, G and GH, roughly corresponding to 31.72%, 23.54% and 22.03% of total hCoV-19 isolates, respectively. All three of these clades are characterized by C241T base substitution in 5' UTR region, C3037T silent mutation in ORF1a and missense A23403G mutation that causes Aspartic acid at position 614 of spike glycoprotein (S) to change to Glycine (S-D614G), that is associated with higher viral loads and, in turn, is hypothesized to increase the infectivity of these genotypes, with isolates bearing this mutation quickly becoming dominant ones in various regions throughout the world [22] [23] [24] . More recent clades GR and GH are further distinguished from the ancestral G genotype by G25563T mutation resulting in position 57 of ORF3a protein to change from glutamine to histidine for clade GH, and G28882A that changes glycine at nucleocapsid phosphoprotein (N) aa position 204 to arginine for clade GR. While the exact effect of GH clade-defining G25563T change in apoptosis-inducing transmembrane ORF3a protein (Q57R) remains unknown, it does not seem to affect any of the conserved functional domains distinguishable within the protein [25, 26] . Whereas, G28882A mutation associated with GR genotype is almost always a trinucleotide mutation of neighboring loci resulting in GGG to AAC change at positions 28881, 28882 and 28883, respectively. This trinucleotide mutation results in two (R203K and G204R) consecutive amino acid changes in N protein, which, in turn, might have potential implications on nucleocapsid phosphoprotein structure and/or function via reduction of conformational entropy and changes in inter-residue interactions in the proximity of the mutated amino acid positions (elaborated on in [27] ). The currently estimated mutation rate of SARS-CoV-2 is around 9.86 x 10 −4 to 1.85 x 10 −4 substitutions per position per year [28] , and, based on the isolates sequenced worldwide up to date, there is evidence that mutations in nearly every position in the genome of SARS-CoV-2 have already been documented [29] . In this study, we are reporting the first results of an ongoing massive sequencing campaign that allows us to elaborate on the genetic diversity of SARS-CoV-2 isolates from Latvian patients. For viral genome analysis either oropharyngeal or nasopharyngeal swabs obtained from COVID-19 patients or already extracted RNA samples were provided to Latvian Biomedical Research and Study Centre by the three accredited diagnostic laboratories (E. Gulbis Laboratory, Central Laboratory and Latvian Centre of Infectious Diseases) covering diagnostics of all officially reported cases of SARS-CoV-2. RNA extraction from oropharyngeal and nasopharyngeal swabs and the following SARS-CoV-2 detection was performed by multiple different methods according to standard procedures of each laboratory. These included manual Trizol-based RNA extraction (TRI reagent, Sigma) and automated purification methods with STARMag 96 X 4 Universal Cartridge Kit (Seegene Inc.), NucliSENS easyMAG (bioMérieux), QIAamp 96 Virus QIAcube HT Kit (QIAGEN). The presence of SARS-CoV-2 in the purified RNA samples was estimated by, either commercial or in-house RT-QPCR (Allplex™ 2019-nCoV Assay, Seegene Inc) methods. Samples showing amplification (ct < 40) of at least one viral gene (RdRp, E, N) were considered as positive and directed to next-generation sequencing. Metatranscriptome sequencing was the first-choice method for the SARS-CoV-2 genome analysis. Nevertheless, since the majority of samples showed an insufficient number of sequencing reads mapping to the SARS-CoV-2 genome and could not be reliably analyzed, targeted sequencing approaches were considered. A . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint methodological strategy plan was developed in order to apply the most effective nextgeneration sequencing method for each sample according to the quantity of SARS-CoV-2 (Supplementary figure 1). At first, RT-QPCR was repeated for each sample in order to evaluate the quantity of viral RNA with a common approach for all samples. Three SARS-CoV-2 genome-specific primer pairs and probes targeting different regions of the nucleocapsid protein (N) gene implemented in the 2019-nCoV RUO Kit (IDT), and SOLIScript® 1-step CoV Kit (Solis Biodyne) were used for the amplification. Probes N1 and N2 specifically detected SARS-CoV-2, while the N3 probe universally detected all currently recognized clade 2 and 3 viruses within the subgenus Sarbecovirus [30] . To evaluate the RNA extraction and PCR efficiency, simultaneous amplification of the human RNase P gene was performed and a control sample containing a plasmid with the SARS-CoV-2 nucleoplasmid protein gene (2019-nCoV_N_Positive Control, IDT) was added to each reaction set. The potential contamination was evaluated by a negative control (nuclease-free water instead of RNA) added to each sample set. RT-QPCR was conducted on the ViiA 7 Real-Time PCR System (Thermo Fisher Scientific), and only the samples showing amplification (ct < 40) of all three SARS-CoV-2 nucleoplasmid protein genes were further directed to metatranscriptome sequencing. Samples exhibiting poor amplification of viral genes (ct > 40 for at least one target region) were considered for one of targeted sequencing approaches: hybridization capture or amplification of SARS-CoV-2. In order to eliminate contaminating DNA, DNase I treatment (NEB) of RNA samples was performed, followed by rRNA depletion with MGIEasy rRNA Depletion Kit (MGI Tech Co. Ltd). Complementary DNA libraries were prepared using MGIEasy RNA Library Prep Set (MGI Tech Co. Ltd). Quantity and quality of both RNA and cDNA were evaluated using the Qubit 2.0 fluorometer and Agilent 2100 Bioanalyzer system, respectively. The presence of the SARS-CoV-2 genome was repeatedly tested in each cDNA library by Q-PCR before sequencing, using the same primers and probes (2019-nCoV_N_Positive Control, IDT) together with TaqMan™ Gene Expression Master Mix (Thermo Fisher Scientific). After multiple experimental tests, a ct value threshold of 25 was chosen for N1 and N3 probes for cDNA libraries to be forwarded to metatranscriptome sequencing (N2 probe appeared to be unstable and therefore uninformative). Metatranscriptome cDNA libraries were sequenced on the DNBSEQ-G400RS sequencing platform with DNBSEQ-G400RS High-throughput Sequencing Set (PE150) (MGI Tech Co. Ltd), obtaining at least 100 million 150-bp-paired-end sequencing reads per each sample. Those libraries that failed to pass the ct threshold (ct > 25 for N1 and N3) were directed to a targeted approach. One of the targeted sequencing strategies was based on the enrichment of the SARS- is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint sequenced on Illumina MiSeq system with MiSeq Reagent Kit v3 (150-cycle), obtaining at least 1 million of around 75-bp-paired-end reads per sample. The second targeted approach involved multiplexed primers for amplification of the whole SARS-CoV-2 genome. QIAseq SARS-CoV-2 Primer Panel (QIAGEN) [31] was used together with QIAseq FX DNA Library Kit (QIAGEN) for cDNA library preparation. Nextgeneration sequencing was performed on Illumina MiSeq system with MiSeq Reagent Kit v2 (300-cycles), obtaining at least 1 million of around 150bp paired-end reads per sample. Adapter clipping was performed with cutadapt 1. 16 [32] . Subsequent read trimming was performed with fastp 0.20.0 [33] using 5 base sliding window trimming from both ends with quality threshold 20. Reads with length less than 75 bp or an average quality of less than 20 were removed. Quality controlled reads were then aligned against SARS-CoV-2 isolate Wuhan-Hu-1 reference genome (Accession number: NC_045512.2) with bowtie2 2.3.5.1 [34] . Variant calling and consensus sequence construction were implemented using bcftools 1.10. 2 [35] . Average coverage for each of the genomes was calculated using samtools and in-house awk [36] scripts. Less than 1% of the missing bases were allowed for a genome to be considered successfully sequenced and missing bases were treated as reference bases from the Wuhan-Hu-1 genome. Consensus sequences of the successfully sequenced isolates were then proceeded to the manual variant quality inspection by sequence alignment map visualization in IGV [37], sequences that have passed the manual variant quality check were immediately publicly shared by deposition to GISAID database [1] . Variant annotations were performed using coronapp SARS-CoV-2 genome autoannotation web server by comparisons to reference sequence [38] and the results were summarized with the help of custom R scripts, ggplot2 R library was used for plot visualizations [39] [40] . Sequences of the Latvian SARS-CoV-2 isolates and Wuhan-Hu-1 reference sequence were aligned using Clustal-Omega v 1. Bayesian phylogenetic trees were estimated using BEAST v1.10. 4 [46] , employing GTR nucleotide substitution model with empirical base frequencies and invariant site proportion assuming strict molecular clock. Taxon sets corresponding to samples belonging to clades L, G, GR and GH were set to be monophyletic. Coalescent is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint exponential growth prior (growth rate prior: Laplace with scale 100; population size prior: Lognormal with mu 1 and sigma 2) with growth rate parametrization [47, 48] was selected and Markov chain Monte Carlo (MCMC) was run for 30 million states sampling log parameters and trees every 3 000 states. Tracer v 1.7.1 [49] was used for MCMC trace (log file) inspection to evaluate sufficiency of sampling (all parameters had an ESS of more than 500) and infer substitution rate and date of the most recent common ancestor estimates. To summarize Bayesian phylogenetic inference, maximum clade credibility time-scaled tree was generated in TreeAnnotator v 1.10.4 (distributed with BEAST package) using 10% of the states (3 million) as the burn-in and visualized using FigTree v 1.4.4 [50] . With 1330 cumulative positive cases as of 20 th of August 2020 (1093 people recovered, 204 active cases of the disease and 33 COVID-19 associated deaths), 109 hCoV-19 isolates representing ~8.2% of the total local COVID-19 cases have been completely sequenced as of today, making Latvia one of the leading countries not only in regards to the containment of the spread of COVID-19 disease, but in the number of sequenced SARS-CoV-2 isolates to the cumulative number of positive COVID-19 cases ratio as well ( Figure 1 ). . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Figure 1 . Daily numbers of positive COVID-19 cases (A) and tests performed (B) in Latvia. X axis is the same for both tiles and represents daily time series from 28 th of February, 2020 to 18 th of August, 2020. The red vertical line indicates the date of the first COVID-19 case registered in Latvia. A) Y value represents the total number of positive cases registered on a given day. Blue area shows the number of successfully sequenced isolates, while the red area represents the positive cases not sequenced during this study. B) Y value represents the number of tests carried out on a given date in Latvia. Reacting to the emergence of the SARS-CoV-2 in Latvia, a high-throughput framework for SARS-CoV-2 isolate sequencing and data analysis with capabilities of near real-time tracking of the epidemiological situation in Latvia was built to aid the governmental decision-making and study the molecular epidemiology of hCoV19. One of the challenges to obtain good quality sequences for maximal number of samples is the variable quality of input material that can be caused by highly variable viral loads, different collection, storage and RNA isolation methods. Although for the current study we did not have the information on the severity of COVID-19 symptoms for particular cases, it should be noted that the absolute majority of cases in Latvia are with low symptom severity expected to have lower concentration of virus in diagnostic samples. We therefore developed an approach to verify sample quality and select appropriate sequencing method to recover maximal available information from . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint existing samples ensuring cost efficacy of the process ( Figure S1 ). According to this strategy developed during the implementation of the study, complete SARS-CoV-2 genome sequence was successfully obtained by metatranscriptome approach for 37 viral isolates, 56 samples were analyzed by amplification of SARS-CoV-2 genome with multiplexed primers, and for 16 isolates enrichment of SARS-CoV-2 genome was performed by hybridization capture method prior the sequencing. As of now it could be cautiously speculated that the obtained results on the SARS-CoV-2 genotype distributions might be somewhat representative of a whole Baltics region, taking the geographical proximity, travel habits and mild governmental travel regulations between the Baltic states during the most of the pandemic into the account. However, the extent of similarity between the isolates circulating in different Baltic states currently cannot be reliably established due to SARS-CoV-2 isolate undersequencing in neigboring Estonia and Lithuania, and the founder effect of multiple independent (re-)introductions of different SARS-CoV-2 genotypes, as well as containment effectivity of respective COVID19 cases, in each of the countries should not be overlooked. Major isolate clade distributions across distinct geographical regions show clear spatial differences of the epidemic ( Figure 2 ) and a trend of "older" isolate clades L and S losing their initial prevalence to the dominance of the more recently-emerged G-associated clades (G, GH, GR) that seem to be accountable for the majority of the cases worldwide since the middle of March 2020. GR, which is the most common isolate clade in Latvia (54.13% of cases), is also a dominant clade in Europe and South America. Currently, GH still seems to be the most common isolate clade circulating throughout North America, but a rise in the number of GR isolates can be observed since the middle of May 2020. The prevalence of GR and, in particular, G clade isolates is also currently on the rise in Africa, and, to a very moderate amount in Oceania and Asia. The relatively high number of isolates not corresponding to any of the currently recognized major SARS-CoV-2 clades (dubbed "Other" or belonging to the "O" clade as of now) in Asia and Oceania makes it possible to speculate about it being indicative of either (but not mutually exclusive), poor quality of the sequences obtained or the possibility of novel clade emergence originating from these regions in the future, should their spread not be effectively contained. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Figure 2 . Distribution of sequenced hCoV-19 isolates by clades in major regions of the world, worldwide and in Latvia. Y axis depicts cumulative complete hCoV-19 genome count (with unambiguous collection date) from a particular region and has different scale within the subplots. X axis is the same for all subplots and depicts sampling time-series from 24 th of December, 2019 till 5 th of August, 2020. After joining of the neighboring loci, among 109 local isolates, 188 different unique mutational events (125 nonsynonymous, 57 synonymous, 6 substitutions in extragenic regions, and a single deletion) that affected 186 positions of the SARS-CoV-2 genome were registered from a total of 1009 variants that were identified. 113 out of 188 distinct mutational events were registered only in one of the 109 samples, while 75 were present in two or more samples (Supplementary table 1 and supplementary figure 2 ). NSP3 was found to be the mature peptide most frequently affected by nonsynonymous substitutions (24 distinct variants resulting in an amino acid change), followed by an N protein that had 15 non-synonymous SNVs documented among Latvian SARS-CoV-2 isolates. Among the most frequently mutated proteins, NSP2, S and NSP12b mature peptides harbored 13, 12 and 10 different amino acid altering mutations, respectively. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Based on the current coronapp web-server [38] report updated at 2020-08-14 (n=73889), most frequent mutational events worldwide are: A23403G corresponding to S:D614G, C3037T silent mutation, C14408T resulting in NSP12b:P314L, C241T extragenic substitution and GGG28881ACC trinucleotide mutation of neighboring loci resulting in N:RG203KR, G25563T -ORF3a:Q57H. All five of these mutations were also among most frequent mutational events registered in Latvian samples: 5' UTR C241T extragenic substitution that was present in 105 out of 109 sequenced genomes, while X axis is discrete and shows genome position corresponding to one of the ten most frequent mutated positions (ordered in descending order). Y axis represents mutation occurrence among the samples (same as numbers within the respective upper boundary of bars) at a given position. Labels in the middle of bars represent the nucleotide change at a given position and its effect on the respective protein amino acid sequence. Color coding is based on the variant class and bars are colored according to the legend. Note: bars representing different mutations of the same locus are stacked (position 28881). It was noted that four of out of five aforementioned mutations (with the exclusion of C14408T) are in the genome positions serving as markers for current SARS-CoV-2 isolate major clade definition and correspond to GR clade, that is the most represented clade Worldwide and hosts more than a half of the sequenced isolates in is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Latvia ( Figure 3 and table 1 ). The C14408T substitution resulting in NSP12b:P314L amino acid change has been previously reported to co-occur with C241T, C3037T and A23403G mutations [51] , which is consistent with our data, where four of these SNPs were simultaneously present 104/109 of the Latvian SARS-CoV-2 isolates sequenced up to date. While no experimental evidence of C14408T substitution implications on the NSP12b (RdRp) activity is yet present, isolates bearing this variant were previously speculated to have more mutations, and elaborations about possible implications of RdRp mutations on antiviral drug resistance were made [52] . The fitness of G and Gderived strains, as denoted by the recent rise of their prevalence throughout different regions of the world, is hardly explainable only by the founder effect alone, thus highlighting the fact that further evidence on molecular and clinical implications of the most common substitutions in the genomes of currently circulating SARS-CoV-2 is urgently needed to improve the measures of containment of COVID-19 and develop effective antiviral therapies and vaccines, that would help to not only combat the present virus of immediate concern, but also be of vital importance for other coronaviruses to yet emerge. Root-to-tip regression analysis with the "best-fitting root" and "correlation" function options resulted in a correlation coefficient of the analysis being estimated at 0.6098 and a determination coefficient (R 2 ) equaling to 0.3718 (Supplementary Figure 3) . Although having some of the sequences that diverged more or less than expected at their sampling date, the dataset had a moderate association between sequence divergence and sampling date, implying suitability for phylogenetic molecular clock analysis. Following Bayesian phylogenetic inference, mean mutation rate derived from Latvian SARS-CoV-2 isolates was found to be 7.126 x 10 -4 substitutions per site per year (5.5086 x 10 -4 -8.7086 x 10 -4 , 95% highest posterior density interval), roughly corresponding to an average of 21 mutational events in genome per year (95% HPD: ~16 -~26), and lies within mutation rate ranges predicted by other researchers. Based on the analysis, the estimated most recent common ancestor of the isolates has emerged on 18 th of November, 2019 (7 th October, 2019 -25 th December, 2019, 95% interval). Our molecular clock analysis further supported the more recent divergence of clades GR and GH with the most common ancestor for both clades dating back to 6 th of February, 2020 (95% HPD GR: 12 th of January, 2020 -26 th of February, 2020; 95% HPD GH: 12 th of January, 2020 -29 th of February, 2020). The 95% HPD date ranges are consistent with the collection dates of unambiguously dated genomes belonging to clades GH and GR deposited at GISAID (accessed 2020-08-19). Earliest reported SARS-CoV-2 genome belonging to clade GH was collected on 27 th of January 2020 in Cardiff, Whales (GISAID accession: EPI_ISL_474597), while earliest reported GR clade genome was collected on 16 th of February 2020 in London, England (GISAID accession: EPI_ISL_466615). While the TMCRA of GR and GH ancestral clade G was dated to 25 th of January, 2020 (95% HPD: 24 th of December, 2019 -20 th of February, 2020) and earliest reported sequences with unambiguous collection date belonging to clade G were collected on 24 th of January, 2020 in China, cities of Zhejiang and Chengdu (GISAID accessions: EPI_ISL_422425, EPI_ISL_451345). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Our phylogenetic analysis of the local isolates suggests multiple unlinked initial introductions of already divergent SARS-CoV-2 isolates to Latvia. Just two weeks after the first positive case of COVID-19 was documented in Latvia on the 2 nd of March, isolates representing at least three major SARS-CoV-2 clades (L, GR and GH) were already circulating within the country corresponding to at least four epidemiologically unlinked introductions. No isolates belonging to clade L (most similar to the initial Wuhan-Hu-1 reference) were sequenced after the end of March and local circulation of clade G representatives was not detectable since early March, while clade GH and, specifically, GR isolates seem to have taken hold of the epidemic without showing signs of ceasing their proliferation within the Latvian population, however, recent reintroduction event possibility should not be ruled out due to cancelation of travel restrictions and insufficient testing of those entering the country. With more than half of the sequenced isolates belonging to the widely represented GR clade, up to this date, no isolates representing clades V or S were documented among the sequenced Latvian COVID-19 cases (Figures 2 and 4) . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Neighbor-joining phylogenetic tree was built to more apparently infer genetic distances between the samples ( Figure 5 ). Although of satisfactory topology, supporting major clade clustering, the tree evidently shows the possible discrepancies between the reported sampling dates and expected sequence divergence (e.g. some of the samples most divergent from the root are dated with the end of April, while some of the most recent ones are twice less divergent), which is not attributable to sequencing errors or the possibility of co-infection by two different "strains". Identical sequences sampled within a short date range ( Figure 5 .) might be strongly indicative of epidemiologically linked transmission, given the relatively small daily amount of positive COVID-19 cases in Latvia even during the peaks of the disease spread. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint While providing interesting insights on the COVID-19 situation in Latvia, which might be representative of Baltics region to an extent, given the scarce amount of isolate genomes available from neighboring countries, it, however, should be noted, that the main drawback for each of the presented analyses is stemming from the available dataset -discrete early sampling with some of the dates since first positive case not being sampled at all (Figure 1 ). Another major drawback is the unavailability of patient/isolate epidemiological data that could be linked to the respective cases sequenced (e.g. sequence epidemiological linkage, patient travel history etc.) in the frame of this study. Inclusion of additional data and retrospective sequencing of a larger number of cases that would allow for a more in-depth analysis of epidemiological situation will be performed and published elsewhere. In conclusion, the high-throughput framework for SARS-CoV-2 isolate sequencing and data analysis in Latvia has been built by Latvian Biomedical Research and Study Centre early on during the start of the pandemic, tested with the help of both, governmental and local private laboratory sample providers, and proposed as a pivotal tool to monitor the local outbreaks and aid in decision making. This framework allowed us to ensure the sequencing of viral isolates from almost all new cases starting from the beginning of July, 2020 with fast date delivery to the Centre for Disease Prevention and Control in Latvia allowing to link the epidemiological data to the isolates being sequenced. We believe that this framework is of vital importance for rapid implementation of the most suitable public health measures, possible transmission history deduction and viral evolution monitoring for the prevention of future epidemiological outbreaks. This research is funded by the Ministry of Education and Science, Republic of Latvia, project "Establishment of COVID-19 related biobank and integrated platform for research data in Latvia", project No. VPP-COVID-2020/1-0016, project "Multidisciplinary approach to monitor, mitigate and contain COVID19 and other future epidemics in Latvia", project No. VPP-COVID-2020/1-0008 and 04.06.2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Supplementary figure 2. Mutational landscape of Latvian SARS-CoV-2 isolates. Y axis shows the mutated position of a reference SARS-CoV-2 genome. X axis shows the cumulative mutation count at a given position. Number to the right of bars indicate cumulative mutation count at a given position and bars are color-coded according to the protein that the corresponding site participates in encoding. Note, that Y axis is discrete and only positions with mutations documented in local isolates are shown. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 9, 2020. . https://doi.org/10.1101/2020.09.08.20190504 doi: medRxiv preprint Global initiative on sharing all influenza data -from vision to reality disease and diplomacy: GISAID's innovative contribution to global health A new coronavirus associated with human respiratory disease in China Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia Clinical features of patients infected with 2019 novel coronavirus in Wuhan The Architecture of SARS-CoV-2 Transcriptome Zoonotic origins of human coronaviruses Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Novel SARS-like betacoronaviruses in bats, China Bats Are Natural Reservoirs of SARS-Like Coronaviruses Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)? PLOS Pathog Viral metagenomics revealed sendai virus and coronavirus infection of malayan pangolins (manis javanica) Transmission of SARS-CoV-2 in Domestic Cats Infection of dogs with SARS-CoV-2 SARS-CoV-2 infection in farmed minks, the Netherlands A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus A Unique Clade of SARS-CoV-2 Viruses is Associated with Lower Viral Loads in Patient Upper Airways. medRxiv Prepr Serv Heal Sci The D614G mutation of SARS-CoV-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent sera. bioRxiv An updated analysis of variations in SARS-CoV-2 genome The ORF3a protein of SARS-CoV-2 induces apoptosis in cells Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility Variant analysis of COVID-19 genomes NextStrain: Real-time tracking of pathogen evolution Ltd hCoV-19/Latvia/024/2020 EPI_ISL_450524 2020-04-19 11 37x Metatranscriptome DNBSEQ-G400RS, MGI Tech Co Ltd hCoV-19/Latvia/029/2020 EPI_ISL_486412 2020-04-29 14 7839x Metatranscriptome DNBSEQ-G400RS, MGI Tech Co 79x Metatranscriptome DNBSEQ-G400RS, MGI Tech Co The authors acknowledge the use of infrastructure provided by High Performance Computing Centre of Riga Technical University.