key: cord-1026219-y62zkfom authors: Alteri, C.; Cento, V.; Piralla, A.; Costabile, V.; Tallarita, M.; Colagrossi, L.; Renica, S.; Giardina, F.; Novazzi, F.; Gaiarsa, S.; Matarazzo, E.; Antonello, M.; Vismara, C.; Fumagalli, R.; Epis, O. M.; Puoti, M.; Perno, C. F.; Baldanti, F. title: Genomic epidemiology of SARS-CoV-2 reveals multiple lineages and early spread of SARS-CoV-2 infections in Lombardy, Italy date: 2020-07-20 journal: nan DOI: 10.1101/2020.07.19.20152322 sha: 76895a450f20dca53b81caaa1a0f7058c483fd7a doc_id: 1026219 cord_uid: y62zkfom From February to April, 2020, Lombardy (Italy) was the area who worldwide registered the highest numbers of SARS-CoV-2 infection. By extensively analyzing 346 whole SARS-CoV-2 genomes, we demonstrated the simultaneous circulation in Lombardy of two major viral lineages, likely derived from multiple introductions, occurring since the second half of January. Seven single nucleotide polymorphisms (five of them non-synonymous) characterized the SARS-CoV-2 sequences, none of them affecting N-glycosylation sites. These two lineages, and the presence of two well defined clusters inside Lineage 1, revealed that a sustained community transmission was ongoing way before the first COVID-19 case found in Lombardy. Since coronavirus disease 2019 , was initially reported in China on 30th December 48 2019 1,2 , SARS-CoV-2 spreads world-wide and, as of 14 th June 2020, there have been 7.55 million 49 confirmed infections and 423,000 deaths reported worldwide (World Health Organization, 2020). 50 In Italy, the first case of evident SARS-CoV-2 transmission occurred at February 20 th , in Codogno, 51 Lombardy, when a young man affected by an interstitial pneumonia was diagnosed for SARS-CoV-52 2. Since that day, the number of diagnosed COVID-19 cases exponentially increased and Lombardy, 53 become the area most affected by the COVID-19 pandemic, with a total number of 89,018 infections 54 at June 2, out of 233,197 total cases in Italy. Thereafter, the COVID-19 epidemic grew exponentially 55 during the first days of March 2020, peaking on 21 March 2020 with 6,557 newly confirmed cases 56 per day. Two months later, reported COVID-19 cases in Italy dropped to ~600 per day, indicating 57 the epidemic is close to be contained. 58 Lombardy is the most populated region in Italy (10 million inhabitants) and one of the widest. Milan 59 represents the largest metropolitan area in Italy and the third most populated functional urban area 60 in Europe with a strong economical and transportation links to Europe and outside. This scenario 61 makes Lombardy the perfect place to host and favor the spread of a highly transmissible virus, such 62 as SARS-CoV-2. 63 In this study, an integrated approach including epidemiological and viral genetic data was used to 64 reconstruct the pattern of SARS-CoV-2 spread in this region. Whole genome sequencing was 65 The severity of the disease was classified into mild, moderate, or severe, if showing i) mild clinical 78 symptoms without sign of pneumonia on imaging, ii) fever and respiratory symptoms with radiological 79 findings of pneumonia, iii) respiratory distress, with oxygen saturation ≤93% at rest, mechanical 80 ventilation, or presence of multiorgan failure (septic shock) and/or admission to intensive care unit 81 (ICU) hospitalization. 82 Total RNAs were extracted from nasopharyngeal swabs by using QIAamp Viral RNA Mini Kit, 84 followed by purification with Agencourt RNAClean XP beads. Both the concentration and the quality 85 of all isolated RNA samples were measured and checked with the Nanodrop. Virus genomes were 86 generated by using version 1 of the CleanPlex SARS-CoV-2 Research and Surveillance Panel 87 (https://www.paragongenomics.com/product/cleanplex-sars-cov-2-panel/), according to the 88 manufacturer's protocol starting with 50 ng total RNA and followed by Illumina sequencing on a 89 NextSeq 500. Briefly, the multiplex PCR was performed with 2 pooled primer mixture and the cDNA 90 reverse transcribed with random primers was used as a template. After 10 rounds of amplification, 91 the two PCR products were pooled and purified. Then the digestion reaction was performed to 92 remove non-specific PCR products, followed with second PCR reaction for barcoding with 24 rounds 93 of amplification. Libraries were checked using High Sensitivity Labchip and quantified with Qubit. 94 Equimolar quantity of libraries was pooled, and the obtained run library mix was loaded at 1.5pM 95 into NextSeq500 for sequencing in the Mid Output format with paired-end 2 x 150 bp. The Illumina 96 sequencing platform takes less than 26 hours to obtain 30.2 Gb of sequencing data (compressed 97 format), achieving between 170.000 and 856.000 paired-end fragments per sample (340.000 and 98 1.712.000 sequences), with a mean coverage depth of 2.500. 99 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint Reference-based assembly of the metagenomic raw data was performed as follows. Illumina 101 adapters were removed, and reads were filtered for quality (average q28 threshold and read length 102 > 135 nt) using FASTP. 3 First and last 15 nucleotides were then removed from all reads. The 103 mapping of cleaned reads was performed against the GenBank reference genome NC_045512.2 104 (Wuhan, collection date: December 2019) using BWA-mem 4 , and consensus sequences were 105 generated using samtools 1.10. 5 Single nucleotide polymorphisms (SNP variants) were called 106 through a pipeline based on samtools/bcftools 6 , and all SNPs having a minimum supporting read 107 frequency of 40% with a depth ≥100 were retained in the consensus sequence. 108 The consensus sequence obtained merging the most represented SNP (intra-patient prevalence 110 >40%) was retained per patient. Moreover, all available whole-genome SARS-CoV-2 sequences (n 111 = 3244) on GISAID (gisaid.org) on 3 May 2020 were downloaded. Sequences from GISAID that 112 were error-rich, those without a date of sampling and identical sequences from each country 113 outbreak were removed. Finally, the dataset was reduced by only retaining the earliest, and the most 114 nucleotides. We used both the maximum likelihood (ML) and Bayesian coalescent methods to 120 explore the phylogenetic structure of SARS-CoV-2. The ML phylogeny was estimated with RAxML 7 121 using the under the best-fit model of nucleotide substitution GTR+I 8 with gamma-distributed rate 122 variation 9 . Tree topology was assessed with the fast bootstrapping function with 1000 replicates. The 123 ML tree was inspected in TempEst 10 , in order to define the correlation between genetic diversity 124 (root-to-tip divergence) and time of sample collection (Supplementary Figure 1) . In order to obtain a 125 corresponding time-scaled maximum clade credibility tree, a Bayesian coalescent tree analysis was 126 undertaken with BEAST v1.10.4, 11 using the HKY+Q4 substitution model with gamma-distributed 127 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. Data were analyzed using Rgui and the statistical software package SPSS (v32.0; SPSS Inc., 139 Chicago, IL). 140 Sequences are in the process to be deposited in Gisaid (https://www.gisaid.org/) and will be 142 completed by acceptance of the manuscript. Additional data that support the findings of this study 143 are available from the authors upon request. symptoms, ranging from mild to severe. Twenty-five samples were excluded due to failed 151 amplification (n=9), or poor genomic coverage (<60%, n=16). The final study population thus 152 consisted of 346 patients, whose demographic and clinical characteristics are reported in Table 1 . 153 This study population was well representative of the whole Lombardy region, with exception of the 154 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint East part and North valleys (i.e. Brescia, Mantua, Valtellina and Valcamonica, Figure 1 ). Two-155 hundred patients (57.8%) were male, and the median age was of 72 (IQR: 53-83) years. Fever was 156 the most common COVID-19 symptom at admission, followed by cough and dyspnea. Chest 157 radiographs or CT scan confirmed a classical bilateral interstitial pneumonia for 24.1% of them. 158 Patients with severe COVID-19 manifestation more frequently suffered from at least one chronic 159 comorbidity, compared to those with moderate and mild manifestations (p=0.002), with greater 160 impact played by hypertension and chronic kidney disease (p<0.001). 161 SARS-CoV-2 sequence reads were able to cover from 94.0% to 99.7% of the reference genome 163 The genetic pairwise distance of the 346 sequences indicated that the SARS-CoV-2 sequences 168 evolved progressively during time (rho=0.465, Supplementary Figure 1 and 3) . Seven single 169 nucleotide polymorphisms (SNP) were shared among >5 genomes, with a prevalence ranging from 170 3.6% (20268, A to G, syn in nsp15) to 99.2% (14408, C to T, non-syn P to L in RdRp) (Figure 2 A 171 and B). All SNPs have been detected in at least one previously published SARS-CoV-2 sequence 172 in GISAID. Five out of these seven SNPs were non-synonymous, and three mapped within SARS-173 CoV-2 structural proteins. Notably, 2 non-synonymous SNPs reside in S protein: the C to T at 174 position 23575 (S protein; amino acid T to I; intra-patient prevalence ~100%) found in 23 North Italian 175 sequences and previously detected in less than 1% of samples in China, Europe and North America, 176 and the A to G at position 23403 (S protein; amino acid D to G; intra-patient prevalence ~54%), 177 present in 67.8% of North Italian sequences and firstly detected in China with a prevalence of 1.7%, 178 then rapidly selected and spreading in all countries with a prevalence ranging from ~17% in Asia to 179 ~70% in Europe and North America. 180 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. Table 1 . 188 According to the topology of the ML tree, the genetic distance from SARS-CoV-2 reference strain 189 was lower for isolates in Lineage 1 respect to isolates in Lineage 2 (2.7x10 -4 [2.0x10 -4 ;3.7x10 -4 ] vs (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. This suggests that the initial spread of viral strains belonging to the two Lineages involved mainly 212 the territory in which the first access occurred, with little communication between the two highly 213 affected areas. 214 The Bayesian molecular clock analysis estimated a mean evolutionary rate of 1.47 x 10 -215 genomes worldwide implies a very low posterior probabilities (<0.50) for most of the nodes. 217 Concordantly, SARS-CoV-2 sequences in Lineage 1 did not form a monophyletic group (Figure 4) . spread in all countries, reaching the 37.6% in Italy (intra-patient prevalence:100%) (Figure 2A) . 231 Single nucleotide polymorphisms (SNPs), the non-syn A26530G in M (intra-patient prevalence 232 ~98%) and the syn A20268G in nsp15 (intra-patient prevalence ~54%), firstly identified in South Asia 233 and Europe (Figure 2A) , characterized sequences in Cluster A by Cluster B. 234 From the molecular clock analysis, we were able to estimate the times of the most recent common 235 ancestor (tMRCA) of all monophyletic traces. Lineage 2 has earlier tMRCAs that coincide with 24 th 236 January 2020 (95% HPD 20 January to 28 January 2020). Clusters A and B have few days later 237 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint tMRCAs, around 27 th January 2020 (95% HPD 23 January to 31 January 2020) and 31 th January 238 2020 (95% HPD 26 January to 5 February 2020) (Figure 4) . By looking at the distribution of 239 Lombardy sequences in the Bayesian Tree, it is thus conceivable that the circulation of SARS-CoV-240 2 started in this region with at least two different and nearly contemporaneous foci in the second half 241 of January, one month before the first COVID-19 diagnosis in Codogno (Lodi). This is in line with 242 evidences of initial outbreaks in other European countries, including Germany, in the second half of 243 The Bayesian reconstruction also allowed to estimate the median tMRCA estimate of the COVID-19 245 pandemic that was 4 December 2019 (95% HPD 29th November to 21th December 2019; Figure 4) , 246 consistent with previous estimates. 14-15 247 These data on the genomic epidemiology of SARS-CoV-2 in Lombardy, based on the largest number 249 of whole genome SARS-CoV-2 sequences generated in a single study, indicate the simultaneous 250 circulation of two lineages of SARS-CoV-2 likely starting from multiple introductions, which occurred 251 in the second half of January (Figure 3 and 4) , one month before the first COVID-19 case detected 252 in Codogno (Lodi). Of note, the ML tree showed that these two chains of transmission were wide in 253 size and fast in spreading. 254 The MRCAs of these two SARS-CoV-2 lineages were closely related in time, as they dated just one 255 month before the first diagnosed case, yet they showed different preferential geographical 256 circulations. Lineage 1 mostly interested the Southern Lombardy, including Lodi and Cremona (Table 257 1), while Lineage 2 predominated in the North of Lombardy, mostly in Bergamo and its adjacent 258 territories (such as Alzano and Nembro, that represented a major focus of the epidemics). The 259 predominance of these lineages in different territories, the lack of a monophyletic signal for Lineage 260 1, and the detection of two well supported clusters inside it (Figure 4) (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint Altogether, the spread of two lineages in the two most populated cities of the Lombardy region (Milan 266 and Pavia), along with a broadly comparable level of genetic pairwise distance, and similar viral load 267 levels within them, support their similar fitness, and hence exclude the occurrence of a strong 268 competition between them. Consistent with this, using a Chi-squared test, no trend of association 269 between disease severity (including evidence of interstitial pneumonia, and severe presentation) and 270 lineages was detected (Table 1) Cluster A and B, in Lineage 1, are characterized by the presence of two SNPs, the non-syn A26530G 285 (intra-patient prevalence ~98%), leading to the D3G mutation within a B-cell epitope of M protein 19 , 286 and the syn A20268G in nsp15 (intra-patient prevalence ~54%). 287 Looking at the overall variability of SARS-CoV-2, we found that only 7 SNPs (2 out 7 synonymous) 288 characterized our consensus sequences, highlighting a good conservation rate of this virus along 289 time. This conservation rate is confirmed within the spike structural protein, where only 2 mutations, 290 one of them at low prevalence (i.e, C23575T, corresponding to the amino acid variant T671I), were 291 detected. None of these mutations have a role in altering pre-existing N-glycosylation sites or in 292 creating new ones, 20 which may be beneficial for the development of vaccines strategies. 293 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. phylogenetic lineages is small and may be similar to the rate of potential errors introduced during 306 reverse transcription, PCR amplification, or sequencing. 22 To overcome these problems, Bayesian 307 approach, known to be a powerful way to estimate species divergence, and thus expected to provide 308 more robust results, was applied. Moreover, the integration of host characteristics (such as 309 geographical location, collection date and clinical manifestations) aided phylogenetic interpretation. 310 Moreover, the intra-host variability of SARS-CoV-2, and the role of potential existing minority variants 311 has not been investigated here. Initial evidences suggest that intra-host variation of SARS-CoV-2 312 can be frequently found among clinical samples (median number of intra-host variants: 1-4), but at 313 the same time these variants were not observed in the population as polymorphisms, probably 314 suggesting a bottleneck or purifying selection involved. 23,24 Thus, ad hoc designed studies are 315 necessary to provide an extensive overview of SARS-CoV-2 intra-host variability and minority 316 variants description, if and how these minority variants can spread in the population, and their 317 potential role in virulence and transmissibility. 318 In the peak of epidemic, SARS-CoV-2 diagnosis was mainly addressed to symptomatic cases or 319 subjects at high risk of exposure (i.e. health care workers exposed to positive patients without 320 adequate protection). This approach may have caused a substantial underestimation of positive 321 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint subjects, preventing to include in our analysis asymptomatic infections, whose role in the 322 transmission chains and in influencing the evolution of epidemic remains a challenge to investigate. 323 Moreover, even if the number of samples here analysed is noteworthy, the epidemic in the East (i.e. 324 Brescia and Mantua) and the valleys of the North (i.e. Valtellina, and Valcamonica, Figure 1 ) is poorly 325 represented. With exception of Brescia, that represents today one of the most COVID-19 affected 326 area after Bergamo, Cremona and Lodi, Mantua and the valleys account today for the 6.7% of 327 confirmed COVID-19 cases in Lombardy, a percentage that does not represent a major issue for the 328 good reproducibility of SARS-CoV-2 epidemic in the region. 329 In conclusion, this study allows the identification of SARS-CoV-2 lineages circulating in the most 330 affected COVID-19 area at the beginning of 2020, representing a huge reserve of genetic information 331 of a virus that became able to pandemic spread, and caused in Lombardy more than 16,000 deaths 332 in some weeks. We cannot exclude that this multiple and simultaneous circulation of SARS-CoV-2 333 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 20, 2020. . https://doi.org/10.1101/2020.07.19.20152322 doi: medRxiv preprint A new coronavirus associated with human respiratory disease in China A pneumonia outbreak associated with a new coronavirus of probable bat origin fastp: an ultra-fast all-in-one FASTQ preprocessor Hardware acceleration of BWA-MEM genomic 345 short read mapping for longer read lengths A statistical framework for SNP calling, mutation discovery, association mapping and population 348 genetical parameter estimation from sequencing data Multiallelic calling model in bcftools RAxML version 8: a tool for phylogenetic analysis and post-analysis of large 352 phylogenies Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over 356 sites: approximate methods Exploring the temporal structure of 359 heterochronous sequences using TempEst (formerly Path-O-Gen). Virus evolution 2, vew007 Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. 362 Virus evolution 4, vey016 Bayesian analysis of elapsed times in continuous-time 364 Markov chains Investigation of a COVID-19 outbreak in Germany 366 resulting from a single travel-associated primary case: a case series Genomic Epidemiology of SARS-CoV-2 in Guangdong Province Phylodynamic Analysis, 176 genomes Introductions and early spread of SARS-CoV-2 374 in the New York City area Structural, glycosylation and 377 antigenic variation between 2019 novel coronavirus (2019-nCoV) and SARS coronavirus The SR-rich motif in SARS-CoV nucleocapsid protein is important 380 for virus replication Phosphorylation of the arginine/serine dipeptide-rich motif of the severe 382 acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization, 383 translation inhibitory activity and cellular localization Site-specific glycan analysis of the SARS-386 CoV-2 spike Emergence of Drift Variants That May Affect COVID Phylogenetic interpretation during outbreaks requires 391 caution Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients 394 Intra-host site-specific 396 polymorphisms of SARS-CoV-2 is consistent across multiple samples and methodologies Conflict of Interest: The authors have no financial and non-financial competing interests that might 400 be perceived to influence the results and/or discussion reported in this paper cognitive disorders (n=11). f Data available for 180 patients. g Data available for 237 patients. h Diagnosed by X Ray or CT Scan . Data available for 232 patients. i Real-time reverse transcription PCR Ct (cycle threshold) values of these samples ranged from 9 to 35 (GeneFinderTM COVID-19 Plus RealAmp Kit, ELITech; AllplexTM 2019-nCoV Assay, Seegene We thank Biodiversa s.r.l. for providing technical support, particularly Dr. Marco 403