key: cord-0894549-hh1fuwdh
authors: Lee, Sin Hang
title: A Routine Sanger Sequencing Target Specific Mutation Assay for SARS-CoV-2 Variants of Concern and Interest
date: 2021-11-29
journal: Viruses
DOI: 10.3390/v13122386
sha: 991de0fb74af48b73bd6c84c365e670d32177e84
doc_id: 894549
cord_uid: hh1fuwdh

As SARS-CoV-2 continues to spread among human populations, genetic changes occur and accumulate in the circulating virus. Some of these genetic changes have caused amino acid mutations, including deletions, which may have a potential impact on critical SARS-CoV-2 countermeasures, including vaccines, therapeutics, and diagnostics. Considerable efforts have been made to categorize the amino acid mutations of the angiotensin-converting enzyme 2 (ACE2) receptor binding domain (RBD) of the spike (S) protein, along with certain mutations in other regions within the S protein as specific variants, in an attempt to study the relationship between these mutations and the biological behavior of the virus. However, the currently used whole genome sequencing surveillance technologies can test only a small fraction of the positive specimens with high viral loads and often generate uncertainties in nucleic acid sequencing that need additional verification for precise determination of mutations. This article introduces a generic protocol to routinely sequence a 437-bp nested RT-PCR cDNA amplicon of the ACE2 RBD and a 490-bp nested RT-PCR cDNA amplicon of the N-terminal domain (NTD) of the S gene for detection of the amino acid mutations needed for accurate determination of all variants of concern and variants of interest according to the definitions published by the U.S. Centers for Disease Control and Prevention. This protocol was able to amplify both nucleic acid targets into cDNA amplicons to be used as templates for Sanger sequencing on all 16 clinical specimens that were positive for SARS-CoV-2.

The COVID-19 crisis has continued its pace. According to real time world statistics, as of 7 September 2021, there were >222 million cumulative human cases with >4.5 million deaths due to COVID-19 since the outbreak [1]. In the meantime, numerous amino acid mutations of the spike (S) protein of SARS-CoV-2, the causative agent of COVID-19, are being recognized as whole genome sequencing data generated by next generation sequencing (NGS) technology have been used more widely for genomic surveillance [2]. Great efforts have been made to categorize these S protein amino acid mutations or substitutions into specific groups, according to their combination profiles. A few of these groups are referred to as variants in an attempt to correlate these amino acid mutation profiles with a possible increased transmissibility, increased virulence, or reduced effectiveness of vaccines against them [3, 4].

The U.S. Centers for Disease Control and Prevention (CDC) has selected four variants of concern, namely the Alpha, Beta, Gamma, and Delta variants, to be closely monitored for their potential impact on critical SARS-CoV-2 countermeasures, including vaccines, therapeutics, and diagnostics. In addition, four variants of interest, namely the Eta, Iota, Kappa, and Pango Lineage B.1.617.3 variants, are being monitored and characterized [3] .

As widely reported in mass media, the SARS-CoV-2 Delta variant has spread around the world [5] and is becoming the variant of most concern [6] . However, the science and

Since both the WHO and the CDC definitions of SARS-CoV-2 variants of concern and interest depend on determination of the specific profiles of amino acid mutations from K417 to N501 in the ACE2 RBD, supplemented by mutations in other regions of the 1273 amino acid chain of the spike protein, especially by those in the NTD [3, 12] , a brief review of these common mutations is needed in order to select the target segments of the S gene for Sanger sequencing.

GISAID automatically updates its site of hCoV-19 spike glycoprotein mutation surveillance dashboard. The updates include spike protein changes in amino acid sequences of the ACE2 receptor binding domain (RBD) newly submitted to GISAID, displayed in structures organized by the most common clades. The 24/25 August 2021 dashboard data showed the new clades (Figure 1), all of which contain mutations commonly used for Delta variant categorization. While the GISAID hCoV-19 S protein mutation surveillance focuses on the ACE2 RBD mutations, some researchers have pointed out that the Delta variant has several unique mutations in the ACE2 RBD and the N-terminal domain (NTD) of the spike protein. The mutations in the NTD, such as T19R, G142D, E156G, F157del, and R158del, are involved in the enhanced infectivity by the BNT162b2-immune sera. The neutralizing activity of sera from vaccinated individuals, as well as convalescent COVID-19 patients, decreases for the Delta variant compared to the wild-type SARS-CoV-2 [15] [16] [17]. Both ACE2 RBD and NTD mutations should be evaluated [7] on all positive samples to understand the pathogenicity of the SARS-CoV-2 variants. The CDC's classifications and definitions of SARS-CoV-2 variants of concern (VOCs) and variants of interest (VOIs) are summarized in Table 1.

Based on information retrieved from the GenBank database, a sequence of 116 amino acids from T393 to Y508, highlighted yellow in Figure 2, contains the entire ACE2 RBD from K417 to N501. A sequence of 160 amino acids from M1 to Y160, highlighted green in Figure 2, covers the entire NTD whose mutations are used as additional characteristics for variant categorization [3]. Assuming the classification algorithms defined by the CDC (Table 1) to be valid and stringent, accurate determination of the mutations of the amino acids from S45 to R158 and from positions K417 to N501 should be adequate for variant categorization.

Figure 2. The first 508 amino acids of SARS-CoV-2 S protein with highlighted NTD M1 to Y160 and ACE2 RBD T393 to Y508, retrieved from the GenBank database (Seq ID# NC_045512.2). The amino acids whose mutations (Figure 1 and Table 1) are used for variant determination are typed in red. The amino acids in the ACE2 RBD are highlighted yellow, and those in the NTD are highlighted green.

The materials used for method development were the residues of 16 SARS-CoV-2 positive nasopharyngeal swab specimens from patients with clinical respiratory infections. These were previously tested patient specimens without patient identifications and were purchased from Boca Biolistics Reference Laboratory, Pompano Beach, FL, a commercial reference material laboratory endorsed by the U.S. Food and Drug Administration (FDA) as a supplier of clinical samples positive for SARS-CoV-2 by RT-qPCR assays. According to the commercial supplier, the swabs were immersed in VTM after collection and stored in a freezer at −80 °C following the initial testing.

In the author's laboratory, these 16 swab rinse specimens were proven to contain SARS-CoV-2 genomic RNA by successful bi-directional Sanger sequencing of a 398-bp N gene cDNA PCR amplicon. These 16 sequencing-confirmed positive samples were among the 30 specimens that were purchased and were initially classified as positive by RT-qPCR tests granted emergency use authorization (EUA) by the FDA for the presumptive qualitative detection of nucleic acid from the 2019-nCoV [18] . The general characteristics of these 30 swab specimens were previously published in detail elsewhere [19] . According to the commercial supplier, all these positive samples were re-tested by an EUA N gene RT-qPCR assay with Ct values ranging from 14.55 to 36.71. Nevertheless, only 16 of the 30 samples were shown to contain SARS-CoV-2 genomic RNA by partial N gene sequencing [19] . The test results of these 30 RT-qPCR positive clinical specimens collected from patients suspected of SARS-CoV-2 infection were used to fulfill the requirement for Clinical Laboratory Improvement Amendments (CLIA) certification to perform routine partial N gene Sanger sequencing for SARS-CoV-2 detection and reflex target S gene Sanger sequencing to determine variants of concern and interest. According to the FDA guidance, false results generated by RT-qPCR tests can be investigated using Sanger sequencing [20] . There are no FDA-authorized diagnostic test kits for SARS-CoV-2 variant determination.

Instead of cell-free fluid samples, which are used for most RT-qPCR assays, cellular components are routinely included in the material being tested in this assay [21]. The initially published protocol was slightly modified. Briefly, about 1 mL of the residue of the nasopharyngeal swab rinse in VTM was transferred to a graduated 1.5 mL microcentrifuge tube and centrifuged at ~16,000× g for 5 min to pellet all cells and cellular debris. The supernatant was discarded except the last 0.2 mL, which was left in the test tube with the pellet. To each test tube containing the pellet with 0.2 mL supernatant, 200 µL of digestion buffer containing 1% sodium dodecyl sulfate, 20 mM Tris-HCl (pH 7.6), 0.2 M NaCl, and 700 µg/mL proteinase K was added. The mixture was digested for 1 h in a heated shaker set at 47 °C. After digestion, an equal volume (400 µL) of acidified 125:24:1 phenol/chloroform/isoamyl alcohol mixture (Thermo Fisher Scientific Inc., Waltham, MA, USA) was added to each tube. After vortexing twice for extraction and centrifugation at ~16,000× g for 5 min to separate the phases, the liquid in the phenol/chloroform phase was pipetted out and discarded. To the remaining aqueous phase solution, 300 µL of acidified 125:24:1 phenol/chloroform/isoamyl alcohol mixture was added for a second extraction. After a second centrifugation at ~16,000× g for 5 min to separate the phases, 200 µL of the aqueous supernatant without any material at the interface was transferred to a new 1.5 mL microcentrifuge tube for nucleic acid purification [21].

As reported by the CDC, nested PCR is the necessary step to generate SARS-CoV-2 cDNA amplicons to be used as the templates for Sanger sequencing [22] . The sequences, the sizes of the amplicons, and the reference location of the major primers used in this study are listed in Table 2 . 

The primary and nested RT-PCR conditions were described in detail previously [21] . This nested RT-PCR protocol has been shown to be able to amplify a single copy of target SARS-CoV-2 RNA to be used as template for Sanger sequencing [21] .

Sanger sequencing of the nested PCR amplicons was performed as previously described [21] . The workflow from nucleic acid extraction to variant determination by Sanger sequencing is summarized in Figure 3 . 

As reported previously [21], the RNase P gene selected by the CDC as the extraction control for its RT-qPCR test panel was not always amplifiable by conventional PCR for DNA sequencing. A segment of human BRCA1 gene was chosen as the internal cellular extraction control in the current protocol. BRCA1 gene is always present and is only found in mammalian cells [23]. Other house-keeping genes may be used instead after validation.

Since mutations are widely scattered in the S protein amino acid chain among the variants (Table 1), and PCR amplification of different specific segments of the 3822-base S gene may be needed for Sanger sequencing and for the differentiation of emerging variants, it is important to confirm that all RNA extracts from the clinical specimens, which were positive for an N gene segment [19, 21], also contained an intact S gene. One way to achieve this goal without performing an entire S gene sequencing was to use the SB7/SB8 nested PCR primer set to amplify a 490 bp cDNA at position 21628-22117 (Table 2) and the VF3/VF4 nested PCR primer set to amplify a 315 bp cDNA at position 24913-25227 (Table 2) on all 16 nasopharyngeal samples that previously tested positive for the N gene. The representative parts of these two sequences from one sample are shown in Figure 4.

The upper panel of Figure 4 was excised from an electropherogram of a 490 bp amplicon sequence of the S gene defined by the nested PCR primers SB7 and SB8 (Table 2). The computer-generated sequence has been converted to a 5′-3′ reading that was re-typed under the upper electropherogram with the last three bases of the SB7 forward PCR primer "CAC" underlined. The number 21646-21717 indicates the position of this segment of sequence in the SARS-CoV-2 genome. The letter "T" in red means that the wild-type nucleotide in this position has undergone a nonsynonymous mutation causing an H49Y amino acid mutation (CAT > TAT). The lower panel was excised from an electropherogram of a 315 bp amplicon sequence of the S gene defined by the nested PCR primers VF3 and VF4 (Table 2). The computer-generated sequence in a 5′-3′ reading direction was re-typed under the electropherogram with the entire 21-base reverse PCR primer underlined. The number 25088-25227 indicates the position of this segment of sequence in the SARS-CoV-2 genome.
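
The codon changes quoted in this article, CAT > TAT for H49Y and GAA > AAA for E484K, can be checked against the standard genetic code. The following is a minimal illustrative sketch, not part of the published protocol; the function name `describe_codon_change` and the use of "=" to mark a synonymous change are conventions assumed here for illustration.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# for the first, second, and third base.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def describe_codon_change(position, ref_codon, alt_codon):
    """Name the amino acid change caused by replacing ref_codon with alt_codon.

    Hypothetical helper: returns e.g. 'H49Y' for a nonsynonymous change and
    'H49=' when both codons encode the same amino acid.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return f"{ref_aa}{position}="
    return f"{ref_aa}{position}{alt_aa}"


print(describe_codon_change(49, "CAT", "TAT"))   # H49Y
print(describe_codon_change(484, "GAA", "AAA"))  # E484K
```

A single base call therefore decides whether a site is reported as mutated, which is why an unambiguous electropherogram peak at each key codon matters for variant categorization.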

Since the two sequences illustrated in Figure 4 are >3000 nucleotides apart within the S gene of the SARS-CoV-2 genome, their presence in one sample supported the interpretation that the sample being tested contained an intact S gene and was suitable as the material for the development of methods for S gene target specific mutation assays.

Initially, attempts were made to design primary and nested PCR primers to amplify a 1524-base segment of the S gene, encoding the first 508 amino acids of the SARS-CoV-2 spike protein (Figure 2 ), including the NTD and the ACE2 RBD in a single amplicon. It has been reported that, under certain conditions, the entire >1500-base bacterial 16S rRNA gene can be amplified by PCR [24, 25] . However, all attempts failed. A single >1500-bp S gene cDNA PCR amplicon could not be generated from the nasopharyngeal swab samples used for this study.

According to the CDC's definitions, all SARS-CoV-2 variants of concern and of interest contain at least one amino acid mutation in the S protein ACE2 RBD from K417 to N501 (Table 1). However, R403T has also appeared recently at the GISAID hCoV-19 S protein mutation surveillance dashboard along with other mutations for emerging variant characterization (Figure 1). Therefore, a diagnostic base-calling electropherogram must contain a 297-base unambiguous sequence covering 99 amino acid codons (nucleotide position 22769-23065). For routine diagnostic convenience, these 297 bases must be present in one single computer-generated sequence on an electropherogram to confirm that the positive isolate is not a variant of concern or interest or to provide mutation information for variant identification. To fulfill these requirements, a pair of SS1/SS2 primary RT-PCR primers and a pair of SS3/SS4 nested PCR primers (Table 2) were selected to amplify a 460 bp primary PCR cDNA amplicon and a 437 bp nested PCR amplicon, respectively. These two pairs of primers were proven to be successful for the amplification of a 437 bp nested PCR amplicon to be used as sequencing templates from all 16 samples proven to contain a segment of N gene sequence. One of these electropherograms, showing the sequence encompassing the codons from R403 to N501, is presented in Figure 5.

In Figure 5, the sequencing electropherogram shows 19 underlined codons of R403, K417, N439, V445, G446, L452, L455, F456, K458, A475, S477, T478, E484K (GAA > AAA mutation), G485, F490, Q493, S494, P499, and N501 in the ACE2 RBD region of the SARS-CoV-2 spike protein gene. Nonsynonymous mutations of the nucleotides in these 19 codons are routinely monitored for surveillance by GISAID (Figure 1).

The amino acids in the NTD region used by the CDC to define variants span from L5 to R158, a segment of 154 amino acids with a coding nucleic acid sequence of 462 bases. Attempts to generate a 569 bp nested PCR amplicon from the 16 clinical samples known to be positive for SARS-CoV-2 by partial N gene sequencing were successful in nine samples only (9/16). By necessity, the sizes of the primary RT-PCR amplicon and the nested PCR amplicon were reduced to 505 bp and 490 bp, respectively, to gain PCR sensitivity while using the SB5/SB6 pair for the primary PCR primers and the SB7/SB8 pair as the nested PCR primers to generate a 490 bp amplicon (Table 2) as the template for Sanger sequencing from all 16 samples. This 490 bp amplicon covers the codons of 17 key amino acids in a region from A67 to R158, i.e., a total of 92 codons with a 276-base sequence. Mutations in these 17 key amino acids are used by the CDC [3] to help distinguish variants of concern and interest for surveillance. Since several deletions are involved in these mutation profiles and bi-directional Sanger sequencing may be needed for verification of some of these deletions, the size of the NTD nested PCR amplicon is longer than necessary for a one-directional reading so that the key mutation sites are not placed too close to the PCR primer sites in case bi-directional sequencing is needed to confirm an SNP or a deletion toward the 3′ end of a nested PCR primer site. A typical computer-generated electropherogram showing the codons of the 17 amino acids in the NTD, which the CDC uses to help define variants, is presented in Figure 6.

In Figure 6, the sequencing electropherogram shows 17 underlined codons of amino acids, A67, H69, V70, G75, T76, D80, T95, D138, G142, Y144, Y145, H146, W152, E154, E156, F157, and R158 of the SARS-CoV-2 spike protein in the NTD, which may mutate in different variants of concern and of interest. These 17 amino acid mutations, along with the ACE2 RBD amino acid mutations, are used for variant categorization (Table 1).
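
The codon and base counts quoted for the sequencing targets (99 codons and 297 bases for R403 to N501 at nucleotide position 22769-23065; 154 codons and 462 bases for L5 to R158; 92 codons and 276 bases for A67 to R158) follow from simple coordinate arithmetic. The sketch below reproduces it under one assumption that is not stated in this section: that the S gene starts at nucleotide 21563 of NC_045512.2, a value consistent with the 22769-23065 coordinates given in the text. The function name `codon_span` is hypothetical.

```python
# Assumption: first base of the S gene start codon in NC_045512.2.
# It is consistent with the text, which places R403-N501 at 22769-23065.
S_GENE_START = 21563


def codon_span(first_codon, last_codon):
    """Return (codons, bases, genome_start, genome_end) for an inclusive codon range."""
    n_codons = last_codon - first_codon + 1
    n_bases = 3 * n_codons
    genome_start = S_GENE_START + 3 * (first_codon - 1)
    genome_end = genome_start + n_bases - 1
    return n_codons, n_bases, genome_start, genome_end


for label, first, last in [
    ("ACE2 RBD R403-N501", 403, 501),
    ("NTD L5-R158", 5, 158),
    ("NTD A67-R158", 67, 158),
]:
    codons, bases, start, end = codon_span(first, last)
    print(f"{label}: {codons} codons, {bases} bases, genome {start}-{end}")
```

Under the same assumption, the 1273 codons of the spike protein plus one stop codon give the 3822-base S gene length quoted in the text.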

Assuming the CDC's variant classification algorithms to be valid and stringent, the profiles of the amino acid mutations listed in Table 1 can be simplified into Table 3 , using combinations of the mutations in the ACE2 RBD and the NTD for accurate variant categorization. 

When RNA viruses are allowed to transmit from population to population, genetic change invariably occurs due to RNA polymerase copying errors, which may lead to single nucleotide nonsynonymous mutations and indel mutations. The wild-type Wuhan-Hu-1 SARS-CoV-2 spike protein has 1273 amino acids encoded by a 3822-base S gene. However, as of 23 August 2021, the number of S protein amino acid mutations reported worldwide had already reached 2860 [26]. Even randomly mixing a small fraction of these mutations will result in an enormous number of combination profiles. Therefore, as a matter of necessity, the CDC can only select the most prevalent profiles, for example, the mutation combinations listed in the definitions of variants of concern and interest (Table 1), for analyses. However, in the United States, COVID-19 patients and their healthcare providers were not even allowed to know if the SARS-CoV-2 detected in their specimens was a Delta variant [27] because no variant test had been authorized for clinical use, and the variant surveillance tests vary greatly from laboratory to laboratory.
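
To give a sense of scale for the combination profiles mentioned above, the sketch below counts the unordered ways of choosing k mutations out of the 2860 reported. It is only an illustration: it assumes that any set of mutations could co-occur, which is a simplification because mutations at the same residue are mutually exclusive. The function name `profile_count` is hypothetical.

```python
from math import comb

N_REPORTED_MUTATIONS = 2860  # S protein amino acid mutations reported as of 23 August 2021 [26]


def profile_count(k):
    """Number of unordered k-mutation combinations, ignoring biological constraints."""
    return comb(N_REPORTED_MUTATIONS, k)


for k in (2, 3, 5, 8):
    print(f"profiles with {k} mutations: {profile_count(k):,}")
```

Pairs alone already number more than four million, so surveillance definitions can only track a small, prevalence-driven subset of the possible profiles.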

A university laboratory director in California was quoted as claiming that an L452R mutation is often a telling sign and that about 94% of the samples analyzed by his laboratory that show that mutation are proven to be Delta [28]. "Right now, we are assuming any new case is Delta given the high probability", an infectious disease specialist at the University of California in San Francisco reportedly declared [29]. It is generally believed that there is a "Lack of testing" for Delta variant and that "without adequate data, policymakers are just swinging in the dark," as stated by a clinical professor of population and public health sciences at the University of Southern California [30]. Therefore, there is an urgent need for a science-based routine testing method for accurate detection of the key S protein amino acid mutations on all samples positive for SARS-CoV-2 so that the Delta and other variants of concern or of interest can be properly and consistently identified for further analyses.

The currently widely used whole genome/NGS technology is an emerging, not yet stable technology for general use in disease diagnosis. There is a strong opinion within the EuroGentest and the European Society of Human Genetics that, for genes that are responsible for a significant proportion of the defects, the sensitivity should not be compromised by the transition from Sanger to NGS [31] . In addition, there is a high percentage of uncertainties of base calls associated with computational errors and biases in NGS [11] . While the NGS technique is widely applied, varying error rates have been observed [32] . The first genomic sequences of SARS-CoV-2 isolates from patient specimens in China [33] and in the United States [22] were verified by Sanger sequencing to avoid base-calling errors. Since specific variant classification is based on certain key amino acid mutations in the S protein, which in turn depend on accurate determination of SNPs and indel mutations of the S gene sequence, Sanger sequencing is the method of choice if the information derived from variant testing is used to influence patient management and policy making.

Sanger sequencing needs a properly prepared template, which is usually a PCR amplicon of the target nucleic acid, for example, a segment of the S protein gene. In molecular diagnostics, the size of the PCR amplicon of the target DNA or cDNA is usually <450 bp. Attempts to amplify large templates in complex samples often lead to PCR failures [34]. It is technically impossible to amplify the entire 3822-base S gene as one single amplicon to be used as a Sanger sequencing template. PCR amplification of a 405 bp fragment from the SARS-CoV genome for sequencing and comparing the sequence of the amplicon with reference sequences in the GenBank database was the established method for molecular detection of SARS-CoV during the 2003 outbreak [35, 36]. The CDC's standard diagnostic protocol for SARS-CoV recommended using three specific primers to perform heminested PCR and to sequence a 348-bp heminested PCR amplicon "to verify the authenticity of the amplified product" [37]. With accurate diagnosis, prompt isolation of patients, and early treatment, the SARS 2003 outbreak ended in June with 8098 reported cases and 774 deaths worldwide [38] without a variant of concern reported. It is of interest to note that the CDC developed a sequencing-based molecular test to facilitate ending the SARS epidemic so quickly by using just 15 positive SARS patient samples for method development and that a method for the determination of SARS-CoV-2 variants in sewage was developed by nested RT-PCR amplification of the S gene in only six samples, followed by conventional Sanger sequencing of the cDNA PCR amplicons [39]. The method presented in this article followed the CDC's established SARS 2003 protocols [35] [36] [37] to sequence two ~400-base segments of the S gene of SARS-CoV-2 for accurate determination of SNP and indel mutations, which are used to determine amino acid changes to further define variants. Specific points of discussion are presented as follows.

This article introduces a generic target specific mutation assay for the accurate detection of variants of concern (VOCs) and variants of interest (VOIs) by sequencing two nested RT-PCR amplicons of the SARS-CoV-2 spike protein gene, one located in the ACE2 RBD and one in the NTD region. Since the sample being tested includes the phenol-extracted digestate of virus-infected cells instead of cell-free fluid only [19, 21] , more viral genome copies are available for testing in this assay as compared to other commercial assays. In addition, the nested RT-PCR technology routinely amplifies the target nucleic acid for a total of 60 cycles to raise detection sensitivity. Therefore, this target specific mutation assay can determine the amino acid mutations accurately in samples with low viral loads when the whole genome/NGS surveillance technologies may fail.

Traditionally, the CDC recommends sequencing of an approximately 400-bp RT-PCR amplicon to verify the authenticity of the amplified product [35] [36] [37] in molecular testing for SARS-CoV. Phenol-chloroform has been shown to be a 10^6 times more sensitive extraction method than the popular commercial QIAamp blood kit in the detection of HBV DNA in serum samples [40].

Any meaningful correlative analysis linking variants to clinical and epidemiological data must be based on precise determination of S protein amino acid mutations, which are the basis for variant categorization. Variant testing should be conducted routinely on all samples positive for SARS-CoV-2. The current surveillance programs select less than 5% of the positive samples with high viral loads for variant testing by whole genome/NGS; this selection generates highly biased and potentially misleading information, on which public policy with impacts on society and the economy is based. A high SARS-CoV-2 viral load in a clinical specimen is not invariably associated with disease severity [41].

When RNA viruses are allowed to pass from host to host, only those mutations that can be passed down to descendant viruses in subsequently infected individuals can be observed, documented, and reported in the literature [42] . The WHO and the U.S. CDC have selected the mutations of eight amino acids, namely K417, L452, S477, T478, E484, F490, S494, and N501, in the spike protein ACE2 RBD as the key mutations to create a limited number of VOCs and VOIs for surveillance purposes (Table 1) . The WHO and CDC seem to advise that the absence of mutations in these eight amino acids rules out VOCs or VOIs, although such advice has not been clearly stated on record.

As SARS-CoV-2 continues to spread, more new amino acid mutations in the ACE2 RBD have accumulated in the circulating strains and have been reported to GISAID (Figure 1). Some of these new profiles may contain a mixture of mutations, each of which is considered unique for a specific variant, such as E484K, T478K, K417T, and N501Y (Figure 1 and Table 1). It is not clear whether these new profiles are considered to be VOCs, or whether they are classified as Delta variants when a T478K mutation is present.

According to the official classification algorithms, the T478K mutation in the RBD is unique to the Delta variant; by definition, the spike protein of the Delta variant contains eight mutations, including four mutations in the NTD (T19R, G142D, 156-157del, and R158G), two in the RBD (L452R and T478K), one mutation close to the furin cleavage site (P681R) and one in the S2 region (D950N) [3, 43] . The Delta variant was reported to become the most dominant SARS-CoV-2 worldwide in the summer of 2021. In the United States during the week of 22-28 August 2021, 99.1% of the SARS-CoV-2 isolates were classified as Delta variants [44] .

However, the WHO's definition for the Delta variant is a profile of T19R, G142D, 157del, and 158del in the NTD plus L452R and T478K in the ACE2 RBD [12], and Public Health England uses P681R as the key mutation to define the Delta variant [13]. It is not clear which classification algorithm is being used to define Delta variants in different parts of the world. In the United States, some specialists simply assumed "any new case is Delta" [29]. Since the whole genome/NGS surveillance technology tends to generate uncertainties of base calls associated with computational errors and biases [11], it is not known how many of the new cases have been erroneously classified as Delta variants as a result of computational errors and biases. Based on information available in the public domain, the sequence data for variant surveillance have not been verified by Sanger sequencing as stringently as those used to identify the initial Wuhan-Hu-1 SARS-CoV-2 strain [22, 33].

When RNA viruses are subjected to passages as in serial culture transfers, an accumulation of mutations will occur [45] . The same biological process takes place among humans in the current COVID-19 pandemic.

The spread of SARS in 2003 ceased in June. There was no SARS-CoV variant of concern in 2003 because the epidemic ended too soon for a significant number of mutations to accumulate in the circulating viruses.

During the 2020 COVID-19 outbreak, it took 11 months for the first variant of concern, an Alpha variant of SARS-CoV-2, to develop and to be isolated from a 58-year-old human male on 24 November 2020 in England, United Kingdom [46]. The accumulation of amino acid mutations and the emergence of SARS-CoV-2 variants of concern were probably the result of uncontrolled transmission of the RNA virus among populations [47].

For example, E484K is the unique mutation in combination with K417N and N501Y in the ACE2 RBD that is used to define the Beta variant, the so-called South Africa variant, first reported in December 2020 [48]. However, a search of the SARS-CoV-2 genomic sequence database in GenBank revealed that solitary E484K mutations in the ACE2 RBD without concomitant K417N or N501Y were already reported to GenBank from a specimen ). In addition to the solitary T478K mutation in the ACE2 RBD with a wild-type NTD sequence and a wild-type D950, the sequences mentioned above also contain a P681H mutation instead of the P681R that is used to define the Delta variant. It is not clear if these isolates are being classified as Delta variants. They are certainly not descendants of the Delta variant originating in India. According to the currently accepted classification algorithms, P681H only occurs in the Alpha variant (Table 1).

The GenBank database contains numerous SARS-CoV-2 spike protein amino acid mutation profiles, which may be mistaken as Delta variant if a stringent variant classification algorithm is not followed. A few potential sequence profiles that can be mistaken for a Delta variant are listed as follows: 

Currently, there is a coronavirus Delta variant scare being generated in the United States to the point that the created public anxiety may have a negative impact on the U.S. economic recovery from the pandemic [49] , although even the CDC does not know exactly how many U.S. coronavirus deaths are attributable to Delta variant infections [50] . Nevertheless, according to the data published up to 5 July 2021 by Public Health England, the system recorded a total number of 170,063 cases of Delta variant infection and 259 deaths among this group of patients [51] , with a mortality rate of 0.15%. In the same document, there were 225,864 cases with Alpha variant infection and 4264 deaths in the same group with a mortality rate of 1.89%. So, the Alpha variant is at least 10 times more deadly than the Delta variant.

For comparison, the Chinese data show that up to 3 March 2020, before any variants of concern emerged, there were 80,270 confirmed COVID-19 cases with 2981 deaths in China, most of which were from the epicenter of the outbreak [52] . The mortality rate of the wild-type Wuhan-Hu-1 SARS-CoV-2 infections is 2981/80,270 = 3.71%, which is about twice as high as the mortality rate of Alpha variant infections.
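The percentages quoted above are crude ratios of reported deaths to reported cases. As a minimal sketch, the arithmetic can be reproduced in Python from the figures given in the text; the function name and cohort labels are illustrative only and are not part of any cited data system.

```python
def crude_mortality_percent(deaths: int, cases: int) -> float:
    """Return deaths as a percentage of cases (crude ratio, no adjustment)."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return 100.0 * deaths / cases


# (deaths, cases) exactly as quoted in the text
cohorts = {
    "Delta (PHE, up to 5 July 2021)": (259, 170_063),
    "Alpha (PHE, same document)": (4264, 225_864),
    "Wuhan-Hu-1 (China, up to 3 March 2020)": (2981, 80_270),
}

rates = {name: crude_mortality_percent(d, c) for name, (d, c) in cohorts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")

# Ratios used in the comparisons made in the text
alpha_vs_delta = rates["Alpha (PHE, same document)"] / rates["Delta (PHE, up to 5 July 2021)"]
wuhan_vs_alpha = rates["Wuhan-Hu-1 (China, up to 3 March 2020)"] / rates["Alpha (PHE, same document)"]
print(f"Alpha/Delta ratio: {alpha_vs_delta:.1f}")
print(f"Wuhan-Hu-1/Alpha ratio: {wuhan_vs_alpha:.1f}")
```

The two ratios printed at the end correspond to the "at least 10 times" and "about twice" comparisons made in the text.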

Therefore, the Delta variant is not more dangerous or more deadly than the wild-type Wuhan-Hu-1 strain or the Alpha variant. The high number of Delta variants being reported in the literature may have resulted from over-extrapolation bias based on sequencing of a very limited number of specially selected samples with surveillance testing methods of uncertain accuracy in unregulated laboratories. Generally, surveillance testing using sequencing technology to identify SARS-CoV-2 genetic variants can be performed in a facility that is not CLIA certified, provided that patient-specific results are not reported to (1) the individual who was tested or (2) their health care provider [53] . There are no quality control measures to identify potential flaws in coronavirus variant testing in the United States because the surveillance testing results are not for patient management, even though they are being used as the basis for the formulation of public health policies.

The high number of Delta variants being reported to the government for surveillance purposes may simply indicate that many SARS-CoV-2 strains with certain amino acid mutations described in the CDC's definition for the Delta variant (Table 1) have acquired a genetic profile that enables them to have a higher replication rate in the host than the others, but their pathogenicity may have been reduced to the level of that of the common human coronaviruses, such as types 229E, NL63, OC43, and HKU1 [54]. This heterogeneous group of SARS-CoV-2 strains may have been detected more often because there are more virus copies in the samples being tested as a result of their higher replication rates. A higher rate of detection does not necessarily indicate that the virus variant is more transmissible unless actual movement of the variant among close contacts has been studied by epidemiological tracing research supported by accurate variant testing. Transmissibility of a virus is primarily determined by the infectivity of the pathogen [55], not the viral load of the donor.

According to the CDC update for the week ending 28 August 2021, the combined proportion of cases attributed to the Delta variant is estimated to be greater than 99% in the United States. It is expected that Delta will continue to be the predominant circulating variant [56] . However, the 99% attribution to Delta is an estimate. Laboratories may use different profiles of amino acid mutations to define the Delta variant. Some reports were based on assumptions only [29] .

Notably, some researchers in the field use a profile of T19R, G142D, E156G, F157del, R158del, L452R, T478K, D614G, P681R, and D950N in the spike protein to define the Delta variant by following the GISAID database [57]. According to the latter system of classification, the Delta variant lacks T95I and has E156G and R158del [57], two mutations that are not in line with the CDC's definition for the Delta variant (Table 1). GenBank Sequence ID OU534154 also lists an NTD/ACE2 RBD sequence containing G142D, E156G, F157del, R158del, L452R, and T478K with neither T19R nor T95I in the NTD. Based on the various issues discussed above, the actual number of Delta variants categorized according to the CDC's stringent definition is unknown. All statistics based on correlations between the Delta variant and its biological characteristics are highly questionable because the SARS-CoV-2 isolates currently classified as Delta variants may actually consist of numerous genetic variants.
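The disagreement among these definitions can be made concrete with a short sketch. The Python snippet below encodes the Delta profiles exactly as they are quoted in this article and reports which required mutations an observed profile lacks under each one. The dictionary keys and function names are hypothetical, and mutation labels are compared as literal strings, so "157del" and "F157del" are deliberately treated as different entries; a production classifier would need to normalize nomenclature first. A mutation not listed for a sequence is treated as absent, which is an assumption for regions outside the reported NTD/RBD sequence.

```python
# Spike mutation profiles used to define "Delta", as quoted in the text.
DELTA_DEFINITIONS = {
    "CDC": {"T19R", "G142D", "156-157del", "R158G",
            "L452R", "T478K", "P681R", "D950N"},
    "WHO": {"T19R", "G142D", "157del", "158del", "L452R", "T478K"},
    "PHE key mutation": {"P681R"},
    "GISAID-based [57]": {"T19R", "G142D", "E156G", "F157del", "R158del",
                          "L452R", "T478K", "D614G", "P681R", "D950N"},
}


def missing_mutations(observed, definitions=DELTA_DEFINITIONS):
    """Map each definition to the sorted mutations it requires but `observed` lacks."""
    observed = set(observed)
    return {name: sorted(required - observed)
            for name, required in definitions.items()}


def matching_definitions(observed, definitions=DELTA_DEFINITIONS):
    """Return the names of definitions whose required mutations are all present."""
    return [name for name, missing in missing_mutations(observed, definitions).items()
            if not missing]


# NTD/ACE2 RBD profile listed for GenBank Sequence ID OU534154 in the text
ou534154 = {"G142D", "E156G", "F157del", "R158del", "L452R", "T478K"}

print(matching_definitions(ou534154))
for name, missing in missing_mutations(ou534154).items():
    print(f"{name}: missing {missing}")
```

Under this literal comparison, the OU534154 profile satisfies none of the four definitions, which illustrates why the choice of classification algorithm matters.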

In order to fully realize the potential of genomic epidemiology, there is a need for routine sequencing of viral nucleic acid established in parallel with COVID-19 testing [58], on all positive samples, including those with low viral loads. Even with high viral load samples, it took several months for the CDC to accurately verify the entire ~30,000-base sequence of a SARS-CoV-2 whole genome, using both the NGS and the nested PCR/Sanger sequencing technology [22]. Such an approach, even if used to sequence only the entire 3822-base spike protein gene, is not practical in routine diagnostic work because the common RT-PCR amplicon size in SARS-CoV diagnostic testing is ~348 bp [38]. If an NGS technology is used for the diagnostic work, under certain circumstances it may be necessary to sequence as many as 10 PCR amplicons to verify or to correct the base-calling uncertainties generated by the computational errors and biases of the NGS technology [9, 11] in a gene target 3822 bases long among numerous non-target nucleic acids in a nasopharyngeal swab sample.

This article proposes routine sequencing of a 437-bp nested PCR cDNA amplicon of the S gene ACE2 RBD (Figure 5) on all samples that are positive for a SARS-CoV-2 RNA gene. If there is no amino acid mutation in the RBD, the SARS-CoV-2 detected is not a VOC or a VOI. If the RBD sequencing shows any amino acid mutations, an additional 490-bp nested PCR cDNA amplicon of the S gene NTD is sequenced (Figure 6). Since a properly executed computer-generated sequencing electropherogram does not have ambiguous base calls, the codons of the amino acids in the ACE2 RBD and in the NTD can be easily determined without the need for bioinformatic services.
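The two-step logic described above (read the RBD codons, then sequence the NTD only if an amino acid change is found) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the laboratory's actual software: the function names are hypothetical, the codon pairs in the example are illustrative inputs, and the sketch handles substitutions only, not deletions.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def call_substitution(position, ref_codon, obs_codon):
    """Return e.g. 'N501Y' for a nonsynonymous change, or None if the residue is unchanged."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    obs_aa = CODON_TABLE[obs_codon.upper()]
    return None if ref_aa == obs_aa else f"{ref_aa}{position}{obs_aa}"


def rbd_substitutions(codon_pairs):
    """codon_pairs maps residue position -> (reference codon, observed codon)."""
    calls = (call_substitution(pos, ref, obs)
             for pos, (ref, obs) in sorted(codon_pairs.items()))
    return [c for c in calls if c is not None]


def next_step(rbd_calls):
    """Apply the reflex rule proposed in the text to the list of RBD calls."""
    if not rbd_calls:
        return "not a VOC or VOI; no NTD sequencing needed"
    return "sequence the 490-bp NTD amplicon"


# Illustrative codons read from a hypothetical RBD electropherogram
example = {452: ("CTG", "CGG"), 478: ("ACA", "AAA"), 501: ("AAT", "AAT")}
calls = rbd_substitutions(example)
print(calls, "->", next_step(calls))
```

A synonymous base change, such as AAT to AAC at a given codon, yields no call, so it does not trigger NTD sequencing in this sketch.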

Assuming the CDC's definitions based on amino acid mutations for variant determination to be valid and stringent, even the recently reported Mu variant (PANGO lineage B.1.621), which is characterized by a combination of R346K, E484K, N501Y, D614G, and P681H [59], can be distinguished from other VOCs and VOIs by the protocol proposed because there are no concomitant mutations in the NTD sequence in the presence of only E484K and N501Y in the ACE2 RBD sequence for the Mu variant.

By the same token, a newly reported South Africa variant with PANGO lineage C.1.2, which contains multiple substitutions (R190S, D215G, E484K, N501Y, H655Y, and T859N) and deletions (Y144del, L242-A243del) within the spike protein [60], can be distinguished from other VOCs and VOIs by the demonstration of only E484K and N501Y in the ACE2 RBD and a Y144del in the NTD without other concomitant mutations in the two amplicons targeted for Sanger sequencing.
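The two examples above share the same pair of changes in the RBD amplicon and differ only in the NTD amplicon, which is why both amplicons are needed. A minimal Python sketch of this exact-match lookup follows; the signature table contains only what the text states for Mu and C.1.2, the names are hypothetical, and the change at position 484 is written as E484K, following the residue naming used for E484 elsewhere in this article.

```python
# Mutations expected within the two sequenced amplicons, as described in the text.
SIGNATURES = {
    "Mu (B.1.621)": {"rbd": frozenset({"E484K", "N501Y"}), "ntd": frozenset()},
    "C.1.2": {"rbd": frozenset({"E484K", "N501Y"}), "ntd": frozenset({"Y144del"})},
}


def match_signature(rbd_calls, ntd_calls, signatures=SIGNATURES):
    """Return variants whose RBD and NTD amplicon profiles equal the observed calls exactly."""
    observed = (frozenset(rbd_calls), frozenset(ntd_calls))
    return [name for name, sig in signatures.items()
            if (sig["rbd"], sig["ntd"]) == observed]


rbd = {"E484K", "N501Y"}
print(match_signature(rbd, set()))        # wild-type NTD amplicon
print(match_signature(rbd, {"Y144del"}))  # Y144del in the NTD amplicon
```

Exact matching is used so that any additional, unexpected mutation in either amplicon returns no match instead of a false assignment.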

The protocol presented in this article is able to sequence a 437-bp nested RT-PCR cDNA amplicon of the ACE2 RBD and a 490-bp nested RT-PCR cDNA amplicon of the N-terminal domain (NTD) of the S gene for the detection of the amino acid mutations needed for accurate determination of all variants of concern and variants of interest defined by the CDC and the WHO in samples positive for SARS-CoV-2, regardless of their viral loads. In order to fully realize the potential of genomic epidemiology, there is a need for routine diagnostic sequencing of viral nucleic acid established in parallel with COVID-19 testing on all positive samples, including those with low viral loads. Currently, there are no authorized SARS-CoV-2 variant diagnostics. In the United States, a Sanger sequencing-based variant determination assay certified under the CLIA program can be used as a routine diagnostic test for patient management and follow-up.

Next-Generation Sequencing (NGS) in COVID-19: A Tool for SARS-CoV-2 Diagnosis, Monitoring New Strains and Phylodynamic Modeling in Molecular Epidemiology

SARS-CoV-2 Variant Classifications and Definitions

WHO. Tracking SARS-CoV-2 Variants

3 Charts Show How Far Covid Delta Variant Has Spread around the World. PUBLISHED THU

WHO. Episode #45-Delta Variant. 5

World Health Organization. Regional Office for Europe. Methods for the Detection and Identification of SARS-CoV-2 Variants

COVID-19) CDNA National Guidelines for Public Health Units

CB144F5FCA2584F8001F91E2/$File/COVID-19-SoNG-v4.7.pdf

Rapid Communication. SARS-CoV-2 Variant Testing. Ver.1, Released 4.28.2021. The Association for Molecular Pathology (AMP)

Effectiveness of mRNA BNT162b2 COVID-19 vaccine up to 6 months in a large integrated health system in the USA: A retrospective cohort study

Computational Errors and Biases in Short Read Next Generation Sequencing

WHO. Coronavirus Disease (COVID-19): Weekly Epidemiological Update

SARS-CoV-2 Variants of Concern and Variants under Investigation in England

Clinical Laboratory Improvement Amendments (CLIA)

The SARS-CoV-2 Delta Variant Is Poised to Acquire Complete Resistance to Wild-Type Spike Vaccines

In vitro and in vivo functions of SARS-CoV-2 infection enhancing and neutralizing antibodies

Reduced neutralization of SARS-CoV-2 B.1.617 by vaccine and convalescent serum

Letter Dated

qPCR is not PCR Just as a Straightjacket is not a Jacket-the Truth Revealed by SARS-CoV-2 False-Positive Test Results. COVID-19 Pandemic Case Stud

FDA

Testing for SARS-CoV-2 in cellular components by routine nested RT-PCR followed by DNA sequencing

Severe Acute Respiratory Syndrome Coronavirus 2 from Patient with Coronavirus Disease, United States

Rapid evolution of BRCA1 and BRCA2 in humans and other primates

Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases

Long PCR-RFLP of 16S-ITS-23S rRNA genes: A high-resolution molecular tool for bacterial genotyping

You're Not Allowed to Know If You Have the Delta Variant

Sequencing Used to Identify Delta, Other Coronavirus Variants

How Do You Know If You Have the Delta Variant of COVID-19?

Where's the Data on Delta? Lack of Testing, Info Makes It Hard to See Virus's Full Scope

Guidelines for diagnostic next-generation sequencing

Systematic evaluation of error rates and causes in short samples in next-generation sequencing

Identification of a novel coronavirus causing severe pneumonia in human: A descriptive study

Effect of amplicon size on PCR detection of bacteria exposed to chlorine

Severe acute respiratory syndrome: Identification of the etiological agent

A novel coronavirus associated with severe acute respiratory syndrome

Severe Acute Respiratory Syndrome (SARS)

Key SARS-CoV-2 Mutations of Alpha, Gamma, and Eta Variants Detected in Urban Wastewaters in Italy by Long-Read Amplicon Sequencing Based on Nanopore Technology

Comparison of hepatitis B virus DNA extractions from serum by the QIAamp blood kit, GeneReleaser, and the phenol-chloroform method

COVID-19 viral load not associated with disease severity: Findings from a retrospective cohort study

Mechanisms of viral mutation

Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization

Complexities of Viral Mutation Rates

NR-54000 SARS-Related Coronavirus 2, Isolate hCoV19/England/204820464/2020 (Viruses)

SARS-CoV-2 Infection: New Molecular, Phylogenetic, and Pathogenetic Insights. Efficacy of Current Vaccines and the Potential Risk of Variants

SA Reaches Grim Milestone of 1 Million Covid-19 Cases

Will the Delta Variant Scare American Diners and Shoppers into Staying Home? Monday 26

Fact Check-the Delta Variant Death Toll Is Not Zero in the United States, as Posts Claim

SARS-CoV-2 Variants of Concern and Variants under Investigation in England

Wuhan and Hubei COVID-19 Mortality Analysis Reveals the Critical Role of Timely Supply of Medical Resources

CLIA SARS-CoV-2 Variant Testing Frequently Asked Question. Date: 3/19/2021. Does a Facility that Performs Surveillance Testing to Identify SARS-CoV-2 Genetic Variants Need a CLIA Certificate

CDC. Common Human Coronaviruses. Available online

Transmissibility and transmission of respiratory viruses

Data Tracker Weekly Review. Interpretive Summary for 3

Delta spike P681R mutation enhances SARS-CoV-2 fitness over Alpha variant

Emergence of novel SARS-CoV-2 variants in the Netherlands. Sci. Rep. 2021, 11, 6625

eCDC. SARS-CoV-2 Variants of Concern as of 6

The Continuous Evolution of SARS-CoV-2 in South Africa: A New Lineage with Rapid Accumulation of Mutations of Concern and Global Detection

The author thanks Wilda Garayua for her technical assistance.

Sin Hang Lee is Director of the Milford Molecular Diagnostics Laboratory specialized in developing DNA sequencing-based diagnostic tests implementable in community hospital laboratories.