key: cord-0909297-qfk7vs9b authors: Villarino, Elsa; Deng, Xianding; Kemper, Carol A; Jorden, Michelle A; Bonin, Brandon; Rudman, Sarah L; Han, George S; Yu, Guixia; Wang, Candace; Federman, Scot; Bushnell, Brian; Wadford, Debra A; Lin, Wen; Tao, Ying; Paden, Clinton R; Bhatnagar, Julu; MacCannell, Tara; Tong, Suxiang; Batson, Joshua; Chiu, Charles Y title: Introduction, Transmission Dynamics, and Fate of Early SARS-CoV-2 Lineages in Santa Clara County, California date: 2021-04-21 journal: J Infect Dis DOI: 10.1093/infdis/jiab199 sha: c09aa60ec15991e96e6208c75fad42953ff85a08 doc_id: 909297 cord_uid: qfk7vs9b
We combined viral genome sequencing with contact tracing to investigate the introduction and evolution of SARS-CoV-2 lineages in Santa Clara County (SCC), California from January 27 to March 21, 2020. Of 558 persons with COVID-19, 101 genomes from 143 available clinical samples comprised 17 different lineages including SCC1 (n=41), WA1 (n=9, including the first 2 reported deaths in the United States, diagnosed post-mortem), D614G (n=4), ancestral Wuhan Hu-1 (n=21), and 13 others (n=26). Public health intervention may have curtailed the persistence of lineages that appeared transiently during February–March. By August, only D614G lineages introduced after March 21 were circulating in SCC.
The COVID-19 pandemic caused by the novel SARS coronavirus 2 (SARS-CoV-2) emerged from Wuhan, China, in December of 2019 and rapidly spread throughout the world, causing approximately 59 million cases and 1.4 million deaths as of November 22, 2020 [1]. The first confirmed SARS-CoV-2 case in the United States (U.S.) was diagnosed in a resident of Washington State on January 20, 2020 [2], and, since then, multiple introductions into the United States have been reported [3] [4] [5] [6] [7] [8] [9], resulting in widespread community dissemination nationwide [9].
For outbreaks caused by SARS-CoV-2, health response and action play critical roles in the recognition and isolation of suspected infectious cases. Contact tracing is a classic epidemiologic tool to study outbreaks of infectious disease and track patterns of transmission that can inform public health interventions [10]. Genomic epidemiology using viral whole-genome sequencing (WGS) complements contact tracing during outbreak investigations and can track virus evolution and spread in an epidemic [11]. WGS of SARS-CoV-2 has been used to identify (i) undetected transmission of the WA1 lineage associated with the first reported SARS-CoV-2 case in the U.S. from Washington State in January 2020 [3], (ii) multiple introductions of SARS-CoV-2 lineages into Northern California [4], (iii) coast-to-coast transmission [5], and (iv) importation of a viral lineage containing a D614G mutation (A23403G single nucleotide variant, SNV) in the viral spike protein to New York from Europe [6, 8, 12], with subsequent dispersion throughout the U.S. [12]. However, few studies to date have included sampling and analysis of dynamic changes in SARS-CoV-2 genotypes within a single community over time. Here we sequenced a demographically representative sampling of SARS-CoV-2 strains circulating in SCC during January 27-March 21, 2020 and analyzed publicly available viral WGS data to mid-October 2020, to investigate the introduction, transmission, and persistence or disappearance of SARS-CoV-2 lineages in this community.
Viral whole-genome sequencing, assembly, and phylogenetic analysis
Viral whole-genome sequencing and Sanger sequencing confirmation of SNVs were performed as previously described (Supplementary Methods) [4, 23].
Complete, high-quality SARS-CoV-2 genomes (n=19,922) from the global COVID-19 pandemic that had completely specified date information and were collected from infected persons on or prior to March 23, 2020 were downloaded from the Global Initiative on Sharing of All Influenza Data (GISAID) database (August 10, 2020 build) [24, 25], which has been expanded to include SARS-CoV-2 genomes, and processed using the NextStrain bioinformatics pipeline Augur [26]. After addition of the 101 newly sequenced genomes in the current study to the data set, a total of 20,223 genomes were aligned using MAFFT v7.4 [27] as implemented in Augur, and a maximum likelihood phylogenetic tree was constructed using IQTREE v1.6 [28]. Branch locations were estimated using a maximum-likelihood discrete traits model. The resulting tree was visualized in the NextStrain web application Auspice [26] and in Geneious v11.1.5 [29]. Smaller subtrees consisting of viruses in the WA1, SCC1, and SCC3 lineages were also constructed using the Augur pipeline. Multiple sequence alignments of clusters were generated using MAFFT v7.388 [27] and visualized in Geneious (Supplementary Methods).
Lineage and cluster information extracted from the phylogenetic analyses was merged with the information stored in the Epi-DB. To assess whether the COVID-19 cases diagnosed by the SCCPHL were representative of the population of cases diagnosed in SCC during the period of our study, we compared gender, age, race/ethnicity, and home address against the information for all cases reported to the CalREDIE database. For cases classified as travel-associated, such as imported cases, we evaluated whether the identified genomic lineage was consistent with the reported travel history.
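The date-based inclusion rule for public genomes described above (a completely specified collection date on or before March 23, 2020) can be illustrated with a minimal Python sketch. The record layout (a dict with a `date` field in `YYYY-MM-DD` form) and the function names are assumptions made for illustration; they are not the metadata schema or the code used in the study.

```python
from datetime import date, datetime

# Cutoff stated in the text: collected on or prior to March 23, 2020.
CUTOFF = date(2020, 3, 23)


def parse_full_date(text):
    """Return a date if text is a fully specified YYYY-MM-DD string, else None.

    Partial dates such as "2020-03" or "2020" are rejected.
    """
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def keep_record(record, cutoff=CUTOFF):
    """Keep a (hypothetical) metadata record only if its collection date
    is completely specified and falls on or before the cutoff."""
    collected = parse_full_date(record.get("date") or "")
    return collected is not None and collected <= cutoff


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [
        {"strain": "example/1", "date": "2020-03-23"},
        {"strain": "example/2", "date": "2020-03"},
        {"strain": "example/3", "date": "2020-04-01"},
    ]
    print([r["strain"] for r in records if keep_record(r)])
```

Filtering on a strictly parsed date, rather than on a string prefix, is one simple way to exclude records whose day or month is missing.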
For all other COVID-19 cases that were determined to be locally acquired, we used the genomic data to confirm all links involving 2 or more persons that had been identified by contact tracing and epidemiologic investigation. For comparison of individual characteristics between COVID-19 infected persons with sequenced genomes and those for whom samples were unavailable for genomic sequencing or recovered genomes had insufficient coverage, we calculated p-values using the chi-square goodness of fit test. A p-value of less than 0.05 was considered statistically significant.
(70.6%) had recoverable SARS-CoV-2 genomes with sufficient breadth of coverage (70%) across the genome for phylogenetic analysis. There were no statistically significant differences in gender or race/ethnicity between the 101 sequenced cases with viral WGS and 457 other cases in SCC (Table 1), but there were differences in age, with sequenced cases being older overall (p=0.032). There was a higher proportion of deaths (p=0.0010) among the sequenced cases, a finding that was consistent with early criteria prioritizing testing of hospitalized persons with serious COVID-19 disease and with cases being
International travel as a risk factor for COVID-19
The first two cases in January 2020 were identified in international travelers [14]. We found 129 cases of this 4-SNV lineage reported globally as of March 21, 2020, with major clusters in India [14], southeast Asia, and California (n=10 cases). Two of 10 persons with international travel history (UC135, who traveled to Asia, and UC162, who returned from a trip to Central America but also attended a large party in the San Francisco Bay Area) were found to be infected with viruses of the SCC1 lineage (Figure 2C and 3A).
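The chi-square goodness-of-fit comparison named in the statistical methods above can be sketched in Python as follows. This is a minimal illustration for a two-category characteristic (1 degree of freedom), where the upper-tail p-value has a closed form via the complementary error function. The counts and proportions in the example are hypothetical and are not taken from Table 1; the function names are likewise illustrative. For characteristics with more than two categories, a general chi-square distribution is needed (for example `scipy.stats.chisquare`).

```python
import math


def chi_square_gof(observed, expected_props):
    """Pearson chi-square goodness-of-fit statistic and degrees of freedom.

    observed: counts per category among the sequenced cases.
    expected_props: proportions per category in the comparison population.
    """
    total = sum(observed)
    expected = [total * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # Hypothetical example: 101 sequenced cases split into two categories,
    # compared with hypothetical proportions in the non-sequenced cases.
    stat, df = chi_square_gof([60, 41], [0.5, 0.5])
    print(df, round(stat, 3), round(p_value_df1(stat), 4))
    # A p-value below 0.05 would be called significant under the stated threshold.
```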
To assess whether there were cases and deaths associated with COVID-19 in
CDC confirmation of SARS-CoV-2 infection in post-mortem tissue specimens was obtained in April 2020 from 2 persons who had died at home in February from an unknown respiratory illness [7]. The viral genomes associated with both cases, C-D1 and C-D2, were determined by CDC to be part of the WA1 lineage with 5 and 3 SNVs, respectively (Figure 2B and 3C) [4], suggesting that infection had likely been acquired locally. In a third medical examiner case of an elderly male who died at home (UC187), clinical samples were tested at SCCPHL and found to be positive for SARS-CoV-2; the virus was subsequently found to belong to the D614G lineage by viral WGS (Figure 2A and 3A).
On February 26, 2020, the first case of community transmission of SARS-CoV-2 in California (UC4) was reported [4], and SCCPHD was notified. One extended family member
On February 29, the SCCPHD initiated an investigation of a COVID-19 outbreak among workers at SJC airport. Of 11 confirmed cases, all 9 with available viral genomes, sequenced from 5 workers, 2 household contacts, and 2 HCWs, were of the SCC1 lineage that shares the G29711T SNV (Figure 2A, 2C, and 3A; Table 2, cluster G).
Overall, 41 genomes out of 101 in the current study were assigned to the SCC1 lineage. Epidemiologic links were known a priori in 27 (69.5%) of cases, grouped into 7 clusters, including the aforementioned SJC airport cluster [4] that includes a household transmission event and two HCWs, a cluster associated with a grocery store that also involved a resident from Solano County [4], and 5 other household transmission events of which 2 had a history of domestic travel and 2 had a history of international travel (Table 2,
In addition to Wuhan Hu-1, WA1, SCC1, SCC2, SCC3, Solano County, and D614G, ten other lineages were identified among cases in our series, including the aforementioned 4-SNV lineage in returning traveler UC184 (Figure 2A and 3A).
For the majority of these lineages (9 of 10, 90%), only 1 person from SCC was identified as being infected by a virus from each lineage, and these singleton cases were attributed to unknown community exposure. The one exception was UC180 and UC185; both adult males were infected with the A12557G lineage, although an epidemiologic link between the two cases was not determined.
We performed genotype analysis of all 3,660 full-length sequenced genomes from California deposited in the GISAID database that had been collected during January 27-September 30, 2020. In January 2020, sequenced genomes from SCC consisted mostly of Asian lineages, with 0-1 SNVs as compared to the ancestral Wuhan Hu-1 lineage. The WA1, SCC1, and D614G lineages emerged in February, and SCC1 expanded to become the single dominant lineage in the county in March (~25% of the sequenced genomes during that month, and 40.6% of the complete sample set) (Figure 4A). The SCC1 and WA1 lineages declined in number and disappeared in March and June, respectively, while the proportion of genomes from the D614G lineage rapidly increased, becoming the single predominant genotype in SCC by June (Figure 4A and 4B). The A12557G and C25692T lineages were common in April and May, the latter lineage in part due to its association with a large skilled nursing facility outbreak (unpublished data), but disappeared afterwards (Figure 4A and 4B). Similarly, additional lineages that were introduced to SCC from January to March 2020, including those associated with discrete household clusters (Solano County, G14718T, and G26591T), disappeared by August 2020 (Figure 4A and 4C). Overall, similar longitudinal changes in lineage frequency were observed across the state of California (Figure 4A and 4B).
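The monthly lineage proportions discussed above (the kind of summary plotted in Figure 4) can be computed with a simple tally. The following Python sketch assumes each genome has already been reduced to a `(month, lineage)` pair; that input format, the function name, and the example records are hypothetical and are not the study's data or code.

```python
from collections import Counter, defaultdict


def monthly_lineage_fractions(records):
    """Return {month: {lineage: fraction of that month's genomes}}.

    records: iterable of (month, lineage) pairs, e.g. ("2020-03", "SCC1").
    """
    counts = defaultdict(Counter)
    for month, lineage in records:
        counts[month][lineage] += 1
    fractions = {}
    for month, lineage_counts in counts.items():
        total = sum(lineage_counts.values())
        fractions[month] = {
            lineage: n / total for lineage, n in lineage_counts.items()
        }
    return fractions


if __name__ == "__main__":
    # Hypothetical genomes, for illustration only.
    example = [
        ("2020-03", "SCC1"),
        ("2020-03", "SCC1"),
        ("2020-03", "D614G"),
        ("2020-03", "WA1"),
        ("2020-06", "D614G"),
    ]
    for month, fracs in sorted(monthly_lineage_fractions(example).items()):
        print(month, fracs)
```

A lineage that is absent from a month simply has no entry for that month, which is how a disappearance would show up in this representation.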
In September 2020, all sequenced genomes in SCC and California were of the D614G lineage (Figure 4A-C). However, an analysis of additional SNVs in the four D614G genomes sequenced from SCC from January to March 2020 revealed that these sublineages disappeared from SCC and California by August 5 (Figure 4C, left), indicating that continuation of the D614G lineage in SCC was most likely due to ongoing introduction into the county after March 2020 rather than persistent community transmission.
In this study, we combined the power of genomic epidemiology with public health surveillance using contact tracing to monitor the introduction and community transmission of at least 17 SARS-CoV-2 lineages circulating in Santa Clara County, California from January 27 to March 21, 2020. We identified 2 cases in which the infection was initially thought by contact tracing to have been associated with international travel, but viral genome analysis suggested that the individuals had likely been infected by a locally circulating strain (SCC1). Viral WGS also identified a new epidemiological link at a local church between seemingly unrelated cases. Finally, we were able to elucidate the cause of death in three previously unexplained cases as unrecognized SARS-CoV-2 infections, and to determine their phylogenetic placement in the WA1 lineage. Genomic epidemiology has rapidly emerged as an indispensable tool for investigating and monitoring spread of outbreaks such as COVID-
The 3 decedent cases in our study highlight the unmet need for expanded SARS-CoV-2 testing during the early stages of the pandemic in the U.S. that would have likely revealed cases of cryptic viral transmission not linked to ostensible travel history. They also underscore the value of performing autopsies and post-mortem testing early as an additional system for identifying the spread and shortening the time to assess the threat of the virus in a community.
A robust public health genomic surveillance system of sufficient scope and scale to address pandemic threats such as SARS-CoV-2 needs access to many different types of samples for testing [15]. The D614G lineage containing a spike protein coding mutation is thought to have arisen in Germany from China in late January 2020 [16], and rapidly spread via travel through Europe, and from there, to the U.S., associated with a large outbreak in New York City [6, 8, 12]. Epidemiologic, in vitro cell culture, and rodent model data to date [12, 17, 18] support the notion that D614G lineage viruses achieve higher viral loads and are more infectious than other strains although, notably, there is no evidence of increased pathogenicity. Thus, a potential fitness advantage may explain persistence and predominance of the D614G lineage in SCC, the U.S., and globally [12, 19], although some have attributed the rise of the D614G lineage to random founder effects [20]. The disappearance of the sublineages from the 4 sequenced D614G viruses in the study indicates that the surge in D614G cases in the county during the summer was mainly fueled by ongoing exogenous introduction. Our results confirm that SARS-CoV-2 community transmission was already occurring by late January 2020, when available testing was extremely limited, and earlier than the first officially reported case in SCC on February 27 [21]. Given the diversity of viral lineages uncovered in this study, it is likely that no local intervention, short of shutting down all travel into and out of the region, could have prevented these repeated introductions into SCC.
1. An interactive web-based dashboard to track COVID-19 in real time
2. First Case of 2019 Novel Coronavirus in the United States
3. Cryptic transmission of SARS-CoV-2 in Washington State
4. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California
5. Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States
6. Introductions and early spread of SARS-CoV-2 in the New York City area
7. Evidence for Limited Early Spread of COVID-19 Within the United States
8. Sequencing identifies multiple early introductions of SARS-CoV-2 to the New York City Region
9. Public Health Response to the Initiation and Spread of Pandemic COVID-19 in the United States
10. Efficacy of contact tracing for the containment of the 2019 novel coronavirus (COVID-19)
11. Towards a genomics-informed, real-time, global pathogen surveillance system
12. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
13. Public Health Responses to COVID-19 Outbreaks on Cruise Ships - Worldwide
14. A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates
15. Implementation Framework: Toward a National Genomic Surveillance Network
16. Evolutionary and structural analyses of SARS-CoV-2 D614G spike protein mutation now documented worldwide
17. SARS-CoV-2 spike D614G variant confers enhanced replication and transmissibility
18. Spike mutation D614G alters SARS-CoV-2 fitness
19. Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity
20. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2
21. Rapid Sentinel Surveillance for COVID-19
22. CDC. CDC 2019-Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel. Available at
23. Sensitive Full-Genome Sequencing of Severe Acute Respiratory Syndrome Coronavirus 2
24. disease and diplomacy: GISAID's innovative contribution to global health
25. Global initiative on sharing all influenza data - from vision to reality
26. Nextstrain: real-time tracking of pathogen evolution
27. MAFFT multiple sequence alignment software version 7: improvements in performance and usability
28. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
29. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data
We want to thank the CDC and California Department of Public Health (CDPH) teams that joined the SCCPHD and conducted many of the epidemiologic and contact tracing investigations, the staff from the UCSF CAT core facility for sequencing samples, and the staff from the China Basin clinical laboratory for the Illumina NextSeq sequencing efforts. We thank the staff of the CDC IDPB for analysis of autopsy tissues and facilitating the submission of tissue specimens. We also thank all the authors and research groups who have contributed genome data on GISAID. Author credits for specific GISAID contributions can be found on https://www.gisaid.org.
This work has been funded by the Innovative Genomics Institute (CYC), the New
Assembled SARS-CoV-2 genomes in this study were uploaded to GISAID as FASTA files (accession numbers in Supplementary Table S1