key: cord-0918891-y8ddsai0
authors: Goswami, Chirayu; Sheldon, Michael; Bixby, Christian; Keddache, Mehdi; Bogdanowicz, Alexander; Wang, Yihe; Schultz, Jonathan; McDevitt, Jessica; LaPorta, James; Kwon, Elaine; Buyske, Steven; Garbolino, Dana; Biloholowski, Glenys; Pastuszak, Alex; Storella, Mary; Bhalla, Amit; Charlier-Rodriguez, Florence; Hager, Russ; Grimwood, Robin; Nahas, Shareef A.
title: Identification of SARS-CoV-2 variants using viral sequencing for the Centers for Disease Control and Prevention genomic surveillance program
date: 2022-04-25
journal: BMC Infect Dis
DOI: 10.1186/s12879-022-07374-7
sha: 1b9287fe2730fdc05573bd392d793858b60a1578
doc_id: 918891
cord_uid: y8ddsai0

BACKGROUND: The Centers for Disease Control and Prevention contracted with laboratories to sequence the SARS-CoV-2 genome from positive samples across the United States to enable public health officials to investigate the impact of variants on disease severity as well as the effectiveness of vaccines and treatment. Herein we present the initial results correlating RT-PCR quality control metrics with sample collection and sequencing methods from full SARS-CoV-2 viral genomic sequencing of 24,441 positive patient samples between April and June 2021. METHODS: RT-PCR confirmed (N Gene Ct value < 30) positive patient samples, with nucleic acid extracted from saliva, nasopharyngeal and oropharyngeal swabs were selected for viral whole genome SARS-CoV-2 sequencing. Sequencing was performed using Illumina COVIDSeq™ protocol on either the NextSeq550 or NovaSeq6000 systems. Informatic variant calling, and lineage analysis were performed using DRAGEN COVID Lineage applications on Illumina’s Basespace cloud analytical system. All sequence data and variant calls were uploaded to NCBI and GISAID. RESULTS: An association was observed between higher sequencing coverage, quality, and samples with a lower Ct value, with < 27 being optimal, across both sequencing platforms and sample collection methods. Both nasopharyngeal swabs and saliva samples were found to be optimal samples of choice for SARS-CoV-2 surveillance sequencing studies, both in terms of strain identification and sequencing depth of coverage, with NovaSeq 6000 providing higher coverage than the NextSeq 550. The most frequent variants identified were the B.1.617.2 Delta (India) and P.1 Gamma (Brazil) variants in the samples sequenced between April 2021 and June 2021. At the time of submission, the most common variant > 99% of positives sequenced was Omicron. CONCLUSION: These initial analyses highlight the importance of sequencing platform, sample collection methods, and RT-PCR Ct values in guiding surveillance efforts. These surveillance studies evaluating genetic changes of SARS-CoV-2 have been identified as critical by the CDC that can affect many aspects of public health including transmission, disease severity, diagnostics, therapeutics, and vaccines. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12879-022-07374-7.

Goswami et al. BMC Infectious Diseases (2022) 22:404 Background As the number of patients confirmed to be infected with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus and developing coronavirus disease 2019 (COVID- 19) increased, international sequencing efforts began to determine genetic variations in SARS-CoV-2 genes that may potentially increase viral transmission and pathogenicity [1] . However, the number of genomes sequenced within the United States (U.S.) deposited in the online genome repository, GISAID, in March of 2021 represented only 1.6% of the number of COVID-19 cases that month. As international sequencing efforts increased, variants of high concern were iden- Among the SARS-CoV-2 variants of concern, Beta B.1.351 (South Africa) and Gamma P.1 (Brazil) have high potential to reduce the efficacy of some vaccines [2] [3] [4] [5] . The B.1.617 variant (India) has become a variant of interest for its high transmission rate and ability to evade immune responses [6] . At the time of submission of this manuscript the first cases of the South African variant B.1.1.529 (Omicron) have been detected, but not yet identified in the U.S. [21] These genomic sequencing efforts have allowed scientists to identify not only SARS-CoV-2 positive patients and viral sequence variants, but to also monitor how new viral variants evolve and to understand how these changes affect the characteristics of the virus, with this information ultimately being used to better understand health impacts. In addition, these variants of concern are actively being monitored to determine if they may reduce the efficacy of currently approved vaccines against SARS-CoV-2 [7] .

In March of 2021, the Centers for Disease Control (CDC) contracted with large commercial diagnostic labs to sequence patient samples across the U.S. [8] . These laboratories committed to sequencing up to 6000 samples per week, with the capacity to scale up in response to the nation's needs (Fig. 1) . The purpose of this program is to perform routine analysis of genetic sequence data to enable the CDC and its public health partners to identify and characterize variant viruses-either new ones identified in the U.S. or those already identified abroad-and to investigate how variants impact COVID-19 disease severity and the effectiveness of vaccines and treatment. Infinity Biologix LLC (IBX), located in Piscataway NJ, was awarded one of these contracts to participate in this surveillance program.

In this study we describe the results of full-length genomic sequencing and surveillance of the SARS-CoV-2 virus from the first 24,441 confirmed positive saliva (SA) samples and a small subset of nasopharyngeal and oropharyngeal (NP, OP) swab-based samples sequenced between April and June 2021.

IBX was designated by the CDC to participate in its Genomic Surveillance for SARS-CoV-2 Variants program [1]. This program conducts genomic surveillance using a random sampling of previously confirmed positive samples from across the U.S., including all 50 states, Washington DC, Puerto Rico and major U.S. territories and possessions. A target of up to 6000 genomic sequences per week was established including State and Zip Code as the required demographics data. CDC requested, when possible and if available, additional demographic data as age, sex, and ethnicity of the patients.

Diagnostic patient samples for SARS-CoV-2 testing arrived via multiple external test clients, approved by IBX and Food and Drug Administration (FDA). Demographic, vaccination, ethnicity, sex and age were available Keywords: COVID-19, SARS-CoV-2, Centers for Disease Control, Next generation sequencing, Reverse transcription polymerase chain reaction, Cycle threshold, Lineage, Variant (See figure on next page.) Fig. 1 CDC tracking of emerging variants through the pipeline for genomic surveillance (https:// www. cdc. gov/ coron avirus/ 2019-ncov/ varia nts/ cdc-role-surve illan ce. html). As part of the CDC National SARS-CoV-2 Strain Surveillance (NS3) System, contracted laboratories select for sequencing a set of deidentified specimens that were previously subjected to SARS-CoV-2 RT-PCR testing and determined to be positive. Generally, the samples then undergo a three-step process for generating sequence data. Specimen preparation and sequencing: SARS-CoV-2 RNA is extracted and converted to complimentary DNA, enriched, and loaded into the next-generation sequencers. Sequence reads are aligned to SARS-COV-2 reference strain using the k-mer detection method. Aligned reads are then used to generate the consensus sequence, call variants and lineage determination. The information along with sequencing quality control statistics are transferred to the CDC repository electronically. Published data are made available to scientists around the world through public repositories if provided by the patient during sample collection. Most samples (> 95%) were saliva-based and collected in Minnesota (MN) and New Jersey (NJ) given the presence of IBX laboratories in each state.

The criteria for positive samples selection were established using a nucleocapsid protein (N Gene) Cycle Threshold (Ct) value threshold of ≤ 30.0 on the IBX TaqPath SARS-CoV-2 QPCR Assay. Per the CDC requirement, samples to be sequenced must have been no more than 10 days old following confirmation of a positive test result.

All samples selected for sequencing had nucleic acid freshly extracted from the primary sample source independently of the material extracted for the initial RT-PCR testing. Automated RNA extraction from either SA, NP or OP was carried out using the Chemagic Viral DNA/ RNA 300 Kit H96 (CMG-1033-S, PerkinElmer) on the Chemagic 360 Nucleic Acid Extractor (2024-0020, Perkin Elmer).

All samples were tested using the Infinity BiologiX TaqPath 

FASTQ files were generated for each sample after demultiplexing of the raw sequencing data on Base Space Sequence Hub (BSSH). Detection of the virus, generation of the consensus sequence and lineage/clade determination was performed using the DRAGEN COVID Lineage App v3.5.1 and v3.5.2 on BSSH.

Spearman's rank correlation coefficient, a rank-based analogue of Pearson's correlation, was used to measure the possibly non-linear but still monotonic relationship between mean Ct and % Genome Coverage. Tukey's test was used to test for differences in mean Ct value among the three collection methods.

Each sample was evaluated for the presence of the virus in the sequencing data using a k-mer based algorithm prior to performing variant calling steps. Each read was broken down into all possible contiguous 32 bases segments (32-mers) and compared to a pre generated list of 32-mers from all the amplicons in the CovidSeq assay (98 from SARS-CoV-2 and 11 from human control genes). An amplicon was considered detected when at least 150 k-mers were matched between the sample and reference. When 5 or more SARS-CoV-2 amplicons were detected, the virus was recorded as present in the sample and the variant calling pipeline was invoked.

Reads were aligned against the NC_045512.2 sequence (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1) using the DRAGEN Map/Align algorithm. Reads pairs that aligned to the same start and end positions were marked as duplicate by default (only the highest quality pair is retained for variant calling). No quality pre-filtering of the data was performed by default, but portions of reads that do not match the reference were soft clipped by the aligner (removed for variant calling but retained in the Binary Alignment Map (BAM) file) to eliminate mismatches due to a drop in read quality or chimeric artifacts from the PCR. For variant calling, the DRAGEN Somatic default down sampling parameters were used (no more than 10,000 reads per 300 base window around a variable position and no more than 50 reads starting at the same position were considered, random removal with fixed seed for reproducibility). Each variant was then assigned a somatic quality (SQ) score and marked as 'weak_evidence' if falling below a fixed threshold (SQ < 3.0 hard filtering). Variant calling results were saved in a VCF formatted file and a consensus sequence generated using the bcftools CONSENSUS command. Any base covered with less than 10 reads was assigned an N (hard-masking) and any variable base was assigned the major allele base by default. Leading and trailing masked bases were then removed from the consensus by default.

Consensus sequences were analyzed with the Pangolin and NextClade pipelines to generate the lineage and clade information, respectively. By default, the latest version of the PANGOLearn and NextClade databases were downloaded for the most up to date lineage and clade information.

Data delivery to CDC was performed through our internal web application. Along with connecting to the Bas-eSpace file system, the application also connected to the CDC S3 data bucket using standard Amazon Web Services (AWS) S3 command line connection protocol using key based authentication [1]. Data transfer was performed on a per sample basis per CDC guidelines. For each sample the application created a file stack with pertinent files for samples that pass the quality thresholds.

The application then opened a connection to the CDC S3 bucket and created a dated folder. Batch folders were created inside the dated primary folder when more than one batch of transfers were processed on a given day. One subfolder per sample was then created in the batch folder. The application then transferred all the pertinent files for the sample into the sample folder and continued this process until all sample data were transferred. The application created a cumulative metadata file containing the metadata associated with the samples transferred and transferred the file to the S3 bucket.

A total of 24,441 SARS-CoV-2 positive samples (April to June of 2021) were sequenced (Additional file 1: Table S1 ). Of these, the majority (24,237) samples were SA, 131 and 73 were from NP and OP, respectively ( 

Of all the samples, 73.5% (17,974) samples were sequenced on the NextSeq 550, while 26.46% (6467) were sequenced on the NovaSeq 6000. The average sequencing coverage between the two was ascertained using average sequencing coverage over the SARS-COV-2 genome and fraction of nucleotides masked due to sequencing ambiguity in the consensus sequence generated. The global average SARS-CoV-2 genome coverage for all samples sequenced across both instruments was 1324×. Samples run on the NovaSeq 6000 had twice the average coverage (2133×) compared to samples run on NextSeq550 (1034×) because more reads per sample were generated on the NovaSeq 6000. The fraction of masked nucleotides in consensus sequence generated was globally 5.9%. The ambiguity fraction rate using the NovaSeq6000 was 2.9% while genomes sequenced on the NextSeq550 had a higher ambiguity rate at 6.9% (Fig. 2a-c) .

The mean RT-PCR Ct values for the three sample types used in this study across NP, OP and SA were 20.83, 21.83 and 22.70 respectively. By Tukey's test, the mean Ct values differed significantly between SA and NP (p = 4e−7), but not between the other sample pairs. With respect to SARS-CoV-2 sequence data quality metrics, several sample specific patterns were identified. The rate of failure of analytical detection of SARS-CoV-2 sequence was highest in OP samples (4.11%, 3 of 73 samples), followed by SA samples (1.58%, 382 of 24,237 samples). The sample type exhibiting the lowest rate of SARS-CoV-2 detection failure was NP (0%, 0 of 131 samples). An important measure of input sample and sequence data quality is reflected in the percentage of consensus sequence masked as ambiguous. Higher values of this percentage are indicative of greater proportion of data that is not informative in variant determination. Figure 3a depicts the average percentage of consensus sequence masked as ambiguous for the three sample types used in this study. With a value Fig. 2 Genome coverage and ambiguity rates between sequencing instruments. a Distribution of average coverage over genome in samples sequenced using NextSeq550 and NovaSeq600. b Distribution of percent ambiguous nucleotides (masked) in consensus sequence. c Fraction of consensus sequence masked due to nucleotide ambiguity in samples sequenced on NextSeq550 and NovaSeq6000 instruments sequenced on NextSeq550 and NovaSeq600 instruments of 0.5%, NP samples yielded the highest quality sequence data, while SA (5.84%) and OP swabs (41%) performing relatively less well. Another critical measure of quality, sequencing depth, was assessed for the three sample types as depicted in Fig. 3b . This analysis determined that NP and SA samples yielded the greatest depth, exhibiting values of 1948× and 1323×, respectively, while OP sample coverage was the lowest at 586×.

Samples were selected for inclusion in the sequencing study based on a minimum threshold of SARS-CoV-2 detection by RT-PCR (N Gene Ct value threshold ≤ 30). Table 2 ).

We observed a strong inverse association between higher coverage and N gene Ct values as determined by RT-PCR (partial r, controlling for sequencer, = − 0.58, partial r-squared = 0.34, p < 1e−15) ( Fig. 5) (Fig. 5) . Group 1, with lowest Ct samples had the highest genomic coverage while group 4 with highest Ct values, had the lowest mean genomic coverage. We identified only eight samples in our dataset that had mean genome coverage less than 100×. Of the remaining 24,433 samples we determined that a mean Ct value of 26.5 for NextSeq 550 and 27.9 for NovaSeq 6000 was a threshold for producing high quality genome sequencing reads (Fig. 5a-c) .

We identified a total of 161 lineages of SARS-CoV-2 variants in our dataset of 24, 237 positive SARS-CoV-2 samples. The 10 most prevalent lineages are listed in Table 3 . For a full list of all variants identified refer to Additional file 1: Table S1 . The prevalence and transmission of Delta (B.1.617) continued to rise and was the most prevalent between the months of July and December 2021 (Fig. 6a) . The only CDC variant of interest that was not identified in our study within the sample set and time frame tested was B.1.617.3. Further details on trends in lineage discovery in our study as the data continued to be accumulated between the months of June 2021 to February 2022 demonstrated a rapid rise and major prevalence of the Omicron variant BA.1 are provided in Additional file 1: Table S1 and Fig. 6a and b.

Extending the data set into March 2022 (Fig. 6b) sub-lineages of BA.1, BA 1.1 and BA.2 continue to increase at the time of this submission.

Genomics-based SARS-CoV-2 surveillance is an important tool for monitoring viral transmission during the current and future phases of the COVID-19 pandemic.

In this work, we sequenced the genomes from > 24,000 SARS-CoV-2-positive samples collected during routine diagnostic testing. Furthermore, we analyzed the genomic epidemiology and sequencing applications of SARS-CoV-2 for the CDC surveillance program between March 2021 and June 2021, with a focus on variant and lineage determination, quality of sequencing results between sequencing instruments and the association between Ct values, genome coverage and variant detection. Various NGS-based approaches have been developed to perform SARS-CoV-2 WGS using different sequencing platforms [12, 13] . These include direct RNA sequencing and metagenomics, amplicon-based methods and oligonucleotide capture-based methods. We employed COVIDseq for mass-scale SARS-CoV-2 genomic surveillance [14] and demonstrated that COVIDseq enables near-complete coverage of the SARS-CoV-2 genome. We used statistical analysis to demonstrate that there were differences in average coverage between the sequencing instruments with the NovaSeq 6000 having larger output per sample compared to the NextSeq 550. In addition, we determined that a Ct value of 26.5 for NextSeq and 27.9 for NovaSeq served as thresholds for producing high quality genomes, above these thresholds sequencing quality degraded. High quality genomes were identified as ones that have a fraction of ambiguous nucleotides in the consensus sequence of less than 10% and average genome coverage > 100×. Sequencing of samples with Ct values > 35 has previously been reported to show a sizable fraction of the reads are aligned to the human genome, independently of the method used to prepare the libraries, resulting in inconsistency in lineage and variant detections [14] .

We demonstrate that reports regarding variants of concern, such as transmission of the Delta variant, are consistent with other US and international reports [15] . We identified a total of 161 lineages of the SARS-CoV-2 variants in our dataset, between April to June 2021, and the top 10 lineages were consistent with data reported to date beyond the present study. In addition, as is the nature of RNA viruses, each new variant and strains that are identified may gain a natural advantage and become the dominant strain [15, 16] (Fig. 6) . Moreover, with the increased number of positive cases in summer and winter of 2021, these data may also be consistent with vaccine breakthrough infections [15] . However, additional studies are warranted. Additional variants, such as the Delta variant, where the variant of interest may not Although the Delta variant continues to represent most SARS-CoV-2 infections, sequencing data between the months of June and September 2021 are demonstrated increasing rates of transmission of another variant of concern, the Mu (B.2.621) variant, first identified in Colombia in January of 2021 [17] . Although the Mu variant accounts for only about 0.1% of cases within the present dataset, 77 positive samples with B.1.621 (Mu) have been identified since June of 2021.

By comparing the RT-PCR Ct values with the ability to detect variants of SARS-CoV2, a likely correlation with the clinical features of the infection may be inferred. Increased genome coverage associated with lower Ct values are likely due to higher viral load in each sample [18] . These results are consistent with previous reports where the ability to accurately detect variants correlates with lower Ct values [19] . However, most of these samples are from SA-based collections. Conflicting evidence regarding the sensitivity between various collection methods for detecting SARS-CoV-2 positive patients have been previously reported SARS-CoV-2 [6, 20, 21] . Thus comparison between SA and other collections method remains to be performed. The data presented in this study indicate that NP and SA are the optimal sample types for SARS-CoV-2 surveillance sequencing studies, both in terms of strain identification and sequencing depth of coverage. Additional data and studies will need to be performed to further elucidate sensitivities associated with collection methods. As demonstrated by the data above (Table 1) , most samples tested derived from patients who indicated they were not vaccinated at the time of testing. Of those who were vaccinated, at the time of this study, the prominent variant in those breakthrough infections (April-June 2021) was the B.1.1.7 (UK) variant. During this same period 28 cases of the Delta (B.1.617) variant were identified. However, as demonstrated in Fig. 6a , from June to July 2021 the occurrence of the UK variant dramatically decreases with a significant rise in the Delta variant that also correlated with an increase is positive cases throughout the U.S. At that time this accounted for the prominent variant identified throughout the US. Vaccination status of samples sequenced from July through the remainder of 2021 are currently being collected (data not published). Thus, additional studies and analyses will be required to correlate, vaccination status, viral loads, sequencing variant impacts and clinical severity. The incomplete demographic data of all samples in aggregate which were sequenced limits the ability to stratify variant transmission between not only geographical regions but also within Sex, age and other factors that may contribute to increased transmission rate. In November of 2021, the more transmissible variant from South Africa B.1.1.529 (Omicron) was detected [22, 23] . This variant quickly became more transmissible and prevalent throughout not only the U.S but the world, accounting for 99% of all SARS-CoV-2 positive cases (Fig. 6a, b) . Of interest is the concern that patients positive for this variant may be missed as the variant results in an S gene drop out. For these reasons it is imperative the appropriate methods for detection of positive SARS-CoV-2 patients can detect the N and Orf genes as well. At the time of this submission, sub-lineages of Omicron (BA.2) are increasing within the population accounting for increased positivity rates throughout Europe and is increasing within the U.S. [22] (Fig. 6b) .

In summary, the CDC surveillance screening program for SARS-CoV-2 variant transmission using whole viral genome sequencing is an important approach for population-based surveillance and control of viral transmission in the next phase of the COVID-19 pandemic. As SARS-CoV-2 strains continue to be sequenced by government, private, and academic entities all over the world, the sequencing data must continue to be shared publicly. These sequencing efforts will allow genomicbased surveillance of the virus and data sharing and contribute to efforts to understand population-level spread and control of the SARS-CoV-2 virus.

SARS-CoV-2 variants B.1.351 and P.1 escape from neutralizing antibodies

BNT162b2-elicited neutralization of B.1.617 and other SARS-CoV-2 variants

Sensitivity of infectious SARS-CoV-2 B.1.1.7 and B.1.351 variants to neutralizing antibodies

Structures of SARS-CoV-2 B.1.351 neutralizing antibodies provide insights into cocktail design against concerning variants

Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera

Covid-19: new coronavirus variant is identified in UK

Impact of the Delta variant on vaccine efficacy and response strategies

High throughput detection and genetic epidemiology of SARS-CoV-2 using COVIDSeq next-generation sequencing

An optimized, amplicon-based approach for sequencing of SARS-CoV-2 from patient samples using COVIDSeq assay on Illumina MiSeq sequencing platforms

A comparison of whole genome sequencing of SARS-CoV-2 using amplicon-based sequencing, random hexamers, and bait capture

Whole genome sequencing of SARS-CoV-2: adapting Illumina protocols for quick and accurate outbreak investigation during a pandemic

COVseq is a cost-effective workflow for mass-scale SARS-CoV-2 genomic surveillance

Transmission event of SARS-CoV-2 delta variant reveals multiple vaccine breakthrough infections

Genetic variants of SARS-CoV-2-what do they mean?

Characterization of the emerging B.1.621 variant of interest of SARS-CoV-2

Viral dynamics and real-time RT-PCR Ct values correlation with disease severity in COVID-19

SARS-CoV-2 variants of concern are associated with lower RT-PCR amplification cycles between

Change in saliva RT-PCR sensitivity over the course of SARS-CoV-2 infection

Saliva as a goldstandard sample for SARS-CoV-2 detection

Rapid spread of SARS-CoV-2 Omicron subvariant BA.2 in a single-source community outbreak

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

The authors wish to thank the Centers for Disease Control for guidance and wide base of knowledge making this study possible for the team efforts and the wide base of knowledge making this study possible. The authors wish to thank the New Jersey and Minnesota IBX qPCR and Sequencing teams for contributing to the testing and sequencing efforts throughout these studies.

The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12879-022-07374-7.Additional file 1. Sample accession numbers submitted and successfully retrieved from GISAID. CG: conceptualization, formal analysis, investigation, visualization, writingoriginal draft; MS: conceptualization, formal analysis, investigation, visualization, writing-original draft; CB: conceptualization, formal analysis, investigation, visualization, writing-original draft; MK: methodology, investigation, writing-original draft; YW: methodology, investigation,-original draft; JS: methodology, investigation-original draft; JM: methodology; JL: supervision, methodology; EK: supervision, methodology; SB: Statistical Analysis, conceptualization, writing-original draft; DG: supervision; G.B, Supervision, funding acquisition, review and editing; A.W.P. Sample acquisition, writing-review and editing; MP: funding acquisition, writing-review and editing; AB: funding acquisition, writing-review and editing; FCR: funding acquisition, writingreview and editing; RH: funding acquisition, writing-review and editing; RG: funding acquisition, writing-review and editing; SAN; conceptualization, formal analysis, investigation, visualization, writing-original draft, funding acquisition, writing-review and editing. All authors read and approved the final manuscript.

This work was supported in part by the Centers for Disease Control and Prevention Contract No. 75D30121C10587. The funder was involved in the requirements for approval of viral sequencing and transfer of data through NCBI and GISAID. The funder was not involved in collection, analysis, and interpretation of data and in writing the manuscript.

High quality consensus sequences that resulted in definitive lineages as determined by CDC were successfully uploaded to GISAID (https:// www. gisaid. org/). The accession numbers for each sample submitted to and approved by CDC by IBX are publicly available in GISAID database. See Additional file 1: Table S1 for the accession numbers retrieved from GISAID. Raw data are available upon reasonable request from the corresponding author.

This study was determined to be exempt under 45 CFR § 46.104(d)(4), exempt from informed consent, and waived by Western Institutional Review Board (WCG) IRB Work Order #1-1493985-1 because the research involved is recorded by the investigator in such a manner that the identity of the human subjects cannot readily by ascertained directly or through identifiers linked to the subjects, the investigator does not contact the subjects, and the investigator will not re-identify subjects.

Not applicable.

Not applicable.