key: cord-0835895-h3ftgzqz authors: Zhang, W.; Govindavari, J. P.; Davis, B.; Chen, S. C.; Kim, J. T.; Song, J.; Lopategui, J.; Plummer, J. T.; Vail, E. title: Analysis of SARS-CoV-2 Genomes from Southern California Reveals Community Transmission Pathways in the Early Stage of the US COVID-19 Pandemic date: 2020-06-13 journal: nan DOI: 10.1101/2020.06.12.20129999 sha: ad642086a6abcb842037d7dff1cfb54931388003 doc_id: 835895 cord_uid: h3ftgzqz
Given the high mortality rate and wide spread of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) within the United States (US) population, understanding the mutational pattern of SARS-CoV-2 has global implications for detection and therapy to prevent further escalation. Los Angeles has become an epicenter of the SARS-CoV-2 pandemic in the US. Efforts to contain the spread of SARS-CoV-2 require identifying its genetic and geographic variation and understanding the drivers of these differences. For the first time, we report genetic characterization of SARS-CoV-2 genome isolates in the Los Angeles population using targeted next generation sequencing (NGS). Samples were collected at Cedars Sinai Medical Center from patients with confirmed SARS-CoV-2 infection. We identified and diagnosed 192 patients by our in-house qPCR assay. In this population, the highest frequency variants were known mutations in the 5'UTR, AA193 protein, RdRp and the spike glycoprotein. SARS-CoV-2 transmission within the local community was tracked by integrating mutation data with patient postal codes, which identified two predominant community spread clusters. Notably, significant viral genomic diversity was identified. Less than 10 percent of the Los Angeles community samples resembled published mutational profiles of SARS-CoV-2 genomes from China, while >50 percent of the isolates shared close similarities with those from New York State.
Based on these findings, we conclude that SARS-CoV-2 was likely introduced into the Los Angeles community predominantly from New York State but also via multiple other independent transmission routes, including but not limited to Washington State and China.
With the emergence of the COVID-19 global pandemic caused by Severe Acute [...] [6]. The A type is thought to be the most ancestral form (T29095C). The B type (T8782C, C28144T) is most prominent in East Asia and remained there without acquiring additional mutations, which is suggestive of founder effects or environmental resistance against this type. The C type (G26144T) is found mainly in Europeans. More specifically, sequences from US East Coast patients appear to originate from the European population (C type), whereas sequences from US West Coast patients, primarily from the Seattle and Northern California area, resemble the A and C types [6]. While this published study is disputed, it does support the current belief that different SARS-CoV-2 genome isolates from China [...]
Los Angeles is the largest city on the US West Coast and the second major city to take precautionary measures restricting its residents to their homes as fatalities emerged in early March 2020. Cedars Sinai Medical Center (CSMC) serves more than 1 million people and is the largest health service center west of the Mississippi. An in-house SARS-CoV-2 RT-qPCR diagnostic test was adopted on March 21, 2020, allowing our clinical laboratory to rapidly screen and identify COVID-19 positive patients. After transmission from China, our timeline for SARS-CoV-2 testing follows other reported introductions into different global populations [8-12]. To date, the sole Los Angeles deposited SARS-CoV-2 genome is not linked to a particular mode of introduction by Nextstrain [3]. Based on these cumulative findings, we hypothesize that SARS-CoV-2 in the local Los Angeles community was likely disseminated from a US West Coast strain that was directly transmitted from China.
In an effort to further understand this evolving virus, we performed next generation sequencing (NGS) analysis on samples from COVID-19 positive patients. We conducted phylogenetic analyses on this unique West Coast population to identify local community spread within the greater Los Angeles area. A broader geographical comparison of these early Southern California isolates with New York State, Washington State and China isolates was conducted to ascertain possible early transmission pathways of SARS-CoV-2 dissemination into Los Angeles. Here, for the first time, we report the trends for potential sources of SARS-CoV-2 introduction into the Los Angeles community.
Clinical specimens were collected by nasopharyngeal swabs from patients presenting with COVID-19-like symptoms. Total nucleic acid was extracted using the QIAamp Viral RNA Mini Kit on the QIAcube Connect (Qiagen, Germantown, USA). All patients were first assessed by a CSMC-developed RT-qPCR diagnostic test for SARS-CoV-2 viral RNA. Specifically, the nucleic acid was screened for the presence of SARS-CoV-2 using real-time single-plex RT-qPCR for the SARS-CoV-2 Nsp3 gene. All samples were diagnostically COVID-19 positive, with amplification of the targeted region crossing the threshold before 40 cycles. In total, 189 COVID-19 positive samples were used for parallel NGS analysis. All samples were quantified by Qubit, and 100 ng of total RNA was processed for 1st strand and 2nd strand cDNA synthesis using the NEBNext Ultra II Directional RNA Library Prep Kit modular workflow (New England Biolabs, Boston, USA) according to the manufacturer's recommendations. Target enrichment of 200 ng cDNA was performed [...]
All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 13, 2020.
We sequenced 192 specimens that tested positive for SARS-CoV-2 using the Illumina Respiratory Virus Targeted Panel. These specimens were collected between March 22nd and April 15th from 82 females (42.7%) and 110 males (57.3%) with a median age of 59.5 years. The overall distribution is shown in Figure 1. Of those 192 cases, 21 were deceased (10.9%), 122 were admitted and subsequently discharged (63.5%), 11 were admitted and currently hospitalized for treatment (5.7%), and 38 were outpatients who were not hospitalized for COVID-19 (19.8%). The pool of 192 SARS-CoV-2 positive samples yielded 2,222,425,974 raw reads. A minimum of ~1M reads per sample was generated and mapped to the SARS-CoV-2 reference. A total of 1,737,684,077 reads (78% of total) were mapped to the SARS-CoV-2 reference genome (NC_045512.2) obtained from NCBI GenBank using BWA-MEM (0.7.17-r1188) [13] with default parameters. The mapping ratio varied from 0.3% to 98.98% and positively correlated (R² = 0.42) with the Ct value obtained from RT-qPCR. Overall, low mapping ratios with less than 50% genome coverage were associated with samples that had high Ct values (>30 cycles) in the RT-qPCR diagnostic test.
Whole-genome comparison of the CSMC samples revealed >99.8% identity with the SARS-CoV-2 reference genome (NC_045512.2). Mutational analyses of this sample set against the reference genome revealed a total of 518 mutated sites across the length of the SARS-CoV-2 genome (Figure 2). 84% of the variants were private, and five variants were found in more than half of all samples sequenced (Table 1). In total, we identified 82 sites that were mutated in more than two isolates in this cohort.
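The site-level tallies described above (private variants, sites mutated in more than two isolates, variants present in more than half of the samples, and the share of reads mapped) can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this study: the function names, the sample IDs and the toy variant calls are assumptions made for the example; only the two read counts and the variant labels are taken from the text.

```python
from collections import Counter

# Read counts reported in the text.
TOTAL_READS = 2_222_425_974
MAPPED_READS = 1_737_684_077


def mapped_fraction(mapped, total):
    """Fraction of raw reads that mapped to the reference genome."""
    return mapped / total


def tally_variant_sites(calls):
    """Summarize how variants are shared across samples.

    `calls` maps a sample ID to the set of variant labels called in that
    sample (a hypothetical input format, e.g. parsed from per-sample VCFs).
    """
    n_samples = len(calls)
    counts = Counter(v for variants in calls.values() for v in variants)
    return {
        # Variants seen in exactly one sample.
        "private": sorted(v for v, c in counts.items() if c == 1),
        # Variants seen in more than two samples.
        "in_more_than_two": sorted(v for v, c in counts.items() if c > 2),
        # Variants seen in more than half of all samples.
        "in_majority": sorted(v for v, c in counts.items() if c > n_samples / 2),
        "mean_per_sample": sum(len(v) for v in calls.values()) / n_samples,
    }


# Toy calls for four hypothetical samples (not study data).
toy_calls = {
    "S1": {"G25563T", "C1059T"},
    "S2": {"G25563T", "C1059T", "C1887T"},
    "S3": {"G25563T", "C28144T"},
    "S4": {"C28144T", "T8782C"},
}

summary = tally_variant_sites(toy_calls)
print(f"mapped: {mapped_fraction(MAPPED_READS, TOTAL_READS):.1%}")
print(summary)
```

Applied to real per-sample variant calls, the same counting would give the number of private sites, the sites shared by more than two isolates, and the mean number of mutations per sample.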
These CSMC isolates contained on average 5.1 mutations per sample. The top 20 mutated sites, their predicted alterations and frequencies are summarized in Table 1 and [...], respectively. Mutations at G25563T (ORF3a) and C1059T (nsp2) have been reported to co-occur. Furthermore, Type A (ancestral) and Type C (European) mutations were not among the highest-frequency mutations defining the CSMC isolates. Type B is defined by two mutations, but only the C28144T mutation was frequently observed in our population.
[...] While these variants tightly segregate into two main clusters of the tree, they do not appear to track with sample collection date (Supplemental Figure 2). The genomic diversity in our population appears to reflect early introduction into the community and persisted throughout the collection period. From our local phylogenetic tree analysis, 17 patients, representing >10% of our sample population, were identified in one cluster (red) (Figure 3). Further metadata analysis revealed that these patients lived in the same or adjacent postal codes and were all members of the same religious denomination.
An additional community transmission cluster was observed as a tightly associated cluster containing eight patients (green) (Figure 3); the viral genomes of these patients exclusively share variant C1887T (Supplemental Figure 3). We did not observe other obvious connections among samples outside of those two clusters.
[...] resemble isolates from China. The remaining isolates (26%) lie within cluster D, defined by Washington State deposited SARS-CoV-2 genomes. Interestingly, the only cluster from which SARS-CoV-2 genomes from CSMC are excluded is one of the two Washington State clusters (F).
We present for the first time a comprehensive study of a sample population from one of the COVID-19 epicenters in the US. A caveat to our sample collection is that emergency department admissions are less frequent for younger patients, so the cohort is biased toward patients older than 18 years. Despite this, the average age of CSMC patients was ~60 years, which is consistent with other observations that older adults are susceptible to COVID-19 [10]. Patients whose samples had lower cycle numbers by diagnostic RT-qPCR (i.e., higher viral copy numbers) generally had poorer outcomes. Moreover, higher viral loads detected by RT-qPCR also correlated with a higher percentage of SARS-CoV-2 genome coverage by sequencing. From a technical perspective, 48 of 192 patients had poorer-quality sequencing coverage (<50%) and were also diagnostically positive by RT-qPCR at >30 cycles. The remaining 144 patients passed all quality control metrics for sequencing and alignment, generally at <25 cycles of RT-qPCR [19].
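The coverage rule described above can be written as a simple filter. The sketch below is illustrative only: the `Sample` record, the function names and the example values are assumptions. The 50% coverage threshold and the 30-cycle Ct cutoff are the figures quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str   # hypothetical identifier
    ct: float        # RT-qPCR cycle threshold
    coverage: float  # fraction of the reference genome covered (0 to 1)


def passes_coverage_qc(sample, min_coverage=0.5):
    """True when at least `min_coverage` of the genome is covered."""
    return sample.coverage >= min_coverage


def summarize_qc(samples, ct_cutoff=30.0):
    """Count QC passes and failures, and failures with Ct above the cutoff."""
    failed = [s for s in samples if not passes_coverage_qc(s)]
    return {
        "passed": len(samples) - len(failed),
        "failed": len(failed),
        "failed_high_ct": sum(1 for s in failed if s.ct > ct_cutoff),
    }


# Made-up examples, not study data.
toy_samples = [
    Sample("S1", ct=18.2, coverage=0.99),
    Sample("S2", ct=24.7, coverage=0.93),
    Sample("S3", ct=33.1, coverage=0.31),
    Sample("S4", ct=36.5, coverage=0.08),
]

qc = summarize_qc(toy_samples)
print(qc)
```

In the cohort described here, the same rule separates the 48 low-coverage samples (25% of 192) from the 144 that passed quality control.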
Hence, when using NGS approaches for diagnostic purposes, a potential caveat is that genome sequencing favors patients with higher viral titers and may not capture those with low viral copy numbers. The local phylogenetic tree demonstrated two large clusters, which were mainly defined by six high-frequency mutations. Phylogenetic analysis of these samples by [...]
This finding is further validated in 1) our local phylogenetic tree, which separates into two main clusters, and 2) our global tree, in which our population closely resembles SARS-CoV-2 genomes geographically distributed with the majority from New York City, followed by a large fraction from Washington State, together identifying possible routes for dissemination to Southern California. Given that Seattle was the first documented US appearance of SARS-CoV-2, the introduction of the virus from Washington State [7] is consistent with our tree, the timing of our data sampling and our hypothesis. However, despite our earlier predictions, an even larger portion of our sample population had a high resemblance to genomes from New York State.
With New York City being the largest epicenter of SARS-CoV-2 [7,26,27] and our samples appearing within three separate clusters of New York isolates, SARS-CoV-2 was likely disseminated through multiple introductions from New York State. Furthermore, the CSMC population also clustered tightly within a large Washington State cluster that likely disseminated to Southern California, appearing as a major cluster in our local population. Although we restricted our analyses to these three geographical origins, given the high genomic diversity we found amongst the CSMC SARS-CoV-2 isolates, the large impact of COVID-19 on the Los Angeles community likely originated from independent disseminations of the virus via multiple geographical routes.
1. Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
2. Global initiative on sharing all influenza data - from vision to reality
3. Nextstrain: real-time tracking of pathogen evolution
4. A pneumonia outbreak associated with a new coronavirus of probable bat origin
5. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
6. Phylogenetic network analysis of SARS-CoV-2 genomes
7. The emergence of SARS-CoV-2 in Europe and the US
8. Virological assessment of SARS-CoV-2
9. Preliminary case report on the SARS-CoV-2 cluster in the UK, France, and Spain
10. Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China
11. COVID-19 National Incident Room Surveillance Team. COVID-19
12. The new coronavirus that came from the East: analysis of the initial epidemic in Mexico
13. Fast and accurate short read alignment with Burrows-Wheeler transform
14. BCFtools/csq: haplotype-aware variant consequences
15. MAFFT multiple sequence alignment software version 7: improvements in performance and usability
16. FastTree 2 - approximately maximum-likelihood trees for large alignments
17. TreeTime: Maximum-likelihood phylodynamic analysis
18. Genotyping coronavirus SARS-CoV-2: methods and implications
19. Comparison of seven commercial RT-qPCR diagnostic kits for COVID-19
20. The global population of SARS-CoV-2 is composed of six major subtypes
21. CoV Genome Tracker: tracing genomic footprints of Covid-19 pandemic
22. Genome-wide data inferring the evolution and population demography of the novel pneumonia coronavirus (SARS-CoV-2)
23. Comprehensive Transcriptomic Analysis of COVID-19 Blood, Lung, and Airway
24. Tracing two causative SNPs reveals SARS-CoV-2 transmission in North America population
25. Identification of Novel Missense Mutations in a Large Number of Recent SARS-CoV-2 Genome Sequences
26. Rapid whole genome sequence typing reveals multiple waves of SARS-CoV-2 spread
27. Introductions and early spread of SARS-CoV-2 in the New York City area
The authors would like to thank all those who are helping in the fight against SARS-CoV-2, especially the research community, which has kindly shared genomic data that are publicly available for download from www.gisaid.org.
WZ, JP and EV conceived of the project and experimental design. JP, BS, SC executed the experiment and generated the sequencing data. WZ conceived and executed all data analyses. WZ, JP and EV contributed to the writing of the manuscript. JL, JK, JS and JPG gave helpful insight and suggestions from conception through to execution and writing.
This project was funded by an internal grant to Eric Vail generously provided by the Department of Pathology and Laboratory Medicine, Cedars Sinai Medical Center. The authors declare no conflicts of interest.