key: cord-0800582-z6zmigf8
authors: Wang, Rui; Hozumi, Yuta; Yin, Changchuan; Wei, Guo-Wei
title: Mutations on COVID-19 diagnostic targets
date: 2020-09-20
journal: Genomics
DOI: 10.1016/j.ygeno.2020.09.028
sha: 45045c57e4452484e62b6080d485fa506f7ceff4
doc_id: 800582
cord_uid: z6zmigf8

Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic when there is neither a preventive vaccine nor a specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). If currently used diagnostic reagents are undermined, a large number of false-positive and false-negative test results will follow. Based on genotyping of 31,421 SARS-CoV-2 genome samples collected up to July 23, 2020, we reveal that essentially all of the current COVID-19 diagnostic targets have undergone mutations. We further show that SARS-CoV-2 has the most mutations on the targets of various nucleocapsid (N) gene primers and probes, which have been widely used around the world to diagnose COVID-19. To understand whether SARS-CoV-2 genes have mutated unevenly, we compute the mutation rate and mutation h-index of all SARS-CoV-2 genes, which indicate that the N gene is one of the most non-conservative genes in the SARS-CoV-2 genome. We show that, because of human immune response-induced APOBEC mRNA (C > T) editing, diagnostic targets should also be selected to avoid cytidines. Our findings might enable the optimal selection of conservative SARS-CoV-2 genes and proteins for the design and development of COVID-19 diagnostic reagents, prophylactic vaccines, and therapeutic medicines.

Availability: Interactive real-time online Mutation Tracker.

[...] nucleocapsid (N) gene, i.e., N1 and N2, as probes for the specific detection of SARS-CoV-2. The panel has also selected an additional primer/probe set, the human RNase P gene (RP), as a control.
Many other diagnostic primers and probes based on the RNA-dependent RNA polymerase (RdRP), envelope (E), and nucleocapsid (N) genes have been designed [4] and/or designated by the World Health Organization (WHO), as shown in Table S1 of the Supporting Material, which provides the details of 54 commonly used diagnostic primers and probes [5]. Diagnostic kits are often static over time, yet SARS-CoV-2 is mutating rapidly. Accordingly, different primers and probes have been reported to show nonuniform performance [6] [7] [8]. In this study, we genotype 31421 SARS-CoV-2 genome isolates from around the globe and reveal numerous mutations on the COVID-19 diagnostic targets commonly used around the world, including those designated by the US CDC. We identify and analyze the SARS-CoV-2 mutation positions, frequencies, and encoded proteins in the global setting. These mutations may impact diagnostic sensitivity and specificity and should therefore be considered when designing new testing kits as part of the current effort in COVID-19 testing, prevention, and control. We propose diagnostic target selection and optimization based on nucleotide-based and gene-based mutation-frequency analysis.

We first genotype 31421 SARS-CoV-2 genome samples from around the globe as of July 23, 2020. The genotyping results reveal 13402 single mutations among these virus isolates. On average, a SARS-CoV-2 isolate carries eight co-mutations. A large number of mutations may occur on all of the SARS-CoV-2 genes and have broad effects on diagnostic kits, vaccines, and drug development. Moreover, we cluster these mutations by the K-means method, resulting in at least six distinct global subtypes of the SARS-CoV-2 genomes, from Cluster I to Cluster VI. Table 1 shows the mutation distribution clusters with sample counts (SC) and total single mutation counts (MC) in 20 countries.
Table 1. Sample counts (SC) and total single mutation counts (MC) per cluster.

         Cluster I     Cluster II    Cluster III   Cluster IV    Cluster V     Cluster VI
Country  SC    MC      SC    MC      SC    MC      SC    MC      SC    MC      SC    MC
CA       113   835     80    561     9     106     42    417     84    525     33    290
AU       173   1204    587   5048    75    1010    195   2127    165   885     132   1076
DE       69    504     25    121     5     58      26    209     27    144     43    366
FR       100   718     14    55      2     22      48    523     74    465     10    83
UK       295   2328    1927  12777   2171  27636   1623  16123   1890  11835   2919  25576
IT       1     8       8     104     33    561     24    308     57    283     24    192
RU       7     52      2     32      19    219     7     53      32    187     119   968
CN       3     22      287   1155    2     32      7     50      8     35      3     26
JP       18    134     243   1001    23    272     9     79      23    139     191

It is noted that N-China-F [5] is the most widely used reagent among all primers/probes, but the primer's target on the SARS-CoV-2 genome has 15 mutations involving thousands of samples, which may account for the low efficacy of certain COVID-19 diagnostic kits in China [11]. Note that primers and probes are typically short, around 20 nucleotides in length. Currently, most primers and probes used in the US target the N gene [5]. However, Table 2 shows that multiple mutations have been found in all of the targets of the US CDC-designated COVID-19 diagnostic primers. The targets of N gene primers and probes used in Japan, Thailand, and China, including Hong Kong, have undergone multiple mutations involving many clusters. Therefore, the N gene may not be an optimal target for diagnostic kits, and the current test kits targeting the N gene should be updated accordingly to maintain testing accuracy. Notably, the targets of four E gene primers and probes have only six mutations, and so far no mutation has been detected on the targets of ORF1ab-China-R and SC2-R, suggesting that these are two relatively reliable diagnostic primers. However, the target of nCoV-IP2-12759R, recommended by Institut Pasteur, Paris, has six mutations. Overall, targets of the envelope and RNA-dependent RNA polymerase based primers and probes have fewer mutations than those of the N gene. This observation suggests that the N gene is particularly prone to mutations.
The accumulation of viral mutations is driven by natural selection, polymerase fidelity, the cellular environment, features of recent epidemiology, random genetic drift, host immune responses, gene editing [12], the replication mechanism, etc. [13, 14]. SARS-CoV-2 has a higher fidelity in its transcription and replication process than other single-stranded RNA viruses because it has a proofreading mechanism regulated by NSP14 [15]. Nevertheless, 13402 single mutations have been detected in 31421 SARS-CoV-2 genome isolates. Because of technical constraints, genome sequencing is subject to errors, so some "mutations" might result from sequencing errors rather than actual mutations. Additionally, mRNA editing, such as that by APOBEC [12], which defends against viral invasion as part of the human immune system, can create fatal mutations. Both cases may lead to single-nucleotide polymorphisms (SNPs) without a descendant. We report that, among all 31421 genome isolates, 13402 individual mutations have at least one descendant.

It is well known that the sensitivity of diagnostic primers and probes depends on their target positions. Specifically, the beginning of a primer or probe is not as important as its end. A high-frequency mutation at the right end of a primer or probe target could produce more false negatives in diagnostics. Importantly, for primers involving significant mutations, polymerase chain reaction (PCR) annealing temperatures are estimated based on correctly matched sequences [16]. Annealing temperatures for primers and probes involving mutations are given in Tables S4-S56.

Table 2 shows that the degree of mutation on various diagnostic targets varies dramatically. It is therefore of great importance to know how to select an optimal viral diagnostic target to avoid potential mutations. We discuss such target optimization via both nucleotide-based analysis and gene-based mutation analysis.
[...] mutations on SARS-CoV-2 are of the C > T type, due to strong host cell mRNA editing known as APOBEC cytidine deaminase [12]. Therefore, researchers should avoid cytosine bases as much as possible when designing diagnostic test kits.

To further understand how to design the most reliable SARS-CoV-2 diagnostic targets, we carry out gene-level mutation analysis. Figure 8 and Table 3 [...] and ORF6, except for ORF7b, have higher mutation ratios. To account for mutation frequency, we introduce the mutation h-index, defined as the maximum value of h such that the given gene section has h single mutations that have each occurred at least h times. Normally, larger genes tend to have a higher h-index. Figure 8 shows that, with a moderate length, the N gene has the second-largest h-index of 44, which is close to the largest h-index of 47 for NSP3. Therefore, selecting SARS-CoV-2 N gene primers and probes as diagnostic reagents for combating COVID-19 is not an optimal choice. Moreover, a few primers and probes used in Japan are designed on the spike and NSP2 genes. However, the high mutation ratios and h-indices of the spike and NSP2 genes indicate that these diagnostic reagents may not perform well. Furthermore, we design a website called Mutation Tracker to track the single mutations on 26 SARS-CoV-2 proteins, which will be an intuitive tool to inform other research on regions to be avoided in future diagnostic test development.

As an unsupervised classification algorithm, the k-means clustering method partitions a given dataset $X = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$ into $k$ clusters $\{C_1, \ldots, C_k\}$. The standard procedure of the k-means clustering method aims to obtain the optimal partition for a fixed number of clusters. First, we randomly pick $k$ points as the cluster centers and then assign each data point to its nearest cluster. Next, we calculate the within-cluster sum of squares (WCSS), defined below, to update the cluster centers iteratively:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|_2^2,$$

where $\mu_j$ is the mean of the points located in the $j$-th cluster $C_j$.
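The standard k-means procedure described above (random initial centers, nearest-center assignment, mean update, WCSS) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names `squared_l2`, `mean_point`, and `kmeans`, the iteration cap, the fixed random seed, and the toy points are all assumptions introduced here.

```python
import random


def squared_l2(a, b):
    """Squared L2 distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means; returns (labels, centers, wcss).

    Assumes n_iter >= 1 and k <= len(points).
    """
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_l2(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    # Within-cluster sum of squares for the final partition.
    wcss = sum(squared_l2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Toy data (hypothetical): two well-separated groups in the plane.
toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_labels, toy_centers, toy_wcss = kmeans(toy_points, k=2)
print(toy_labels, toy_wcss)
```

Calling `kmeans` for a range of k values and recording each returned WCSS yields the data for a WCSS-versus-k plot.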
In the WCSS, $\|\cdot\|_2$ denotes the $L_2$ distance. The k-means clustering method described above aims to find the optimal partition for a fixed number of clusters. However, seeking the best number of clusters for the SNP variants is essential as well. In this work, by varying the number of clusters k, the WCSS can be plotted against the corresponding number of clusters. The location of the elbow in this plot is taken as the optimal number of clusters. Such a procedure is called the Elbow method, which is frequently applied in k-means clustering problems. Specifically, we apply k-means clustering with the Elbow method to determine the optimal number of subtypes of SARS-CoV-2 SNP variants. The pairwise Jaccard distances between different SNP variants are used as the input features for the k-means clustering method.

[1] WHO. Coronavirus disease 2019 (COVID-19) situation report-185. Coronavirus Disease (COVID-2019) Situation Reports.
[2] Improved molecular diagnosis of COVID-19 by the novel, highly sensitive and specific COVID-19-rdrp/hel real-time reverse transcription-PCR assay validated in vitro and with clinical specimens.
[3] A new coronavirus associated with human respiratory disease in China.
[4] Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR.
[5] Diagnosing COVID-19: The disease and tools for detection.
[6] Comparative analysis of primer-probe sets for the laboratory confirmation of SARS-CoV-2.
[7] Evaluation of a quantitative RT-PCR assay for the detection of the emerging coronavirus SARS-CoV-2 using a high throughput system.
[8] Analytical sensitivity and efficiency comparisons of SARS-CoV-2 RT-PCR assays. medRxiv.
[9] Comparative performance of SARS-CoV-2 detection assays using seven different primer/probe sets and one assay kit.
[10] Development of genetic diagnostic methods for novel coronavirus 2019 (nCoV-2019) in Japan.
[11] Chinese Firm to Replace Clinical Laboratory Test Kits After Spanish Health Authorities Report Tests from China's Shenzen Bioeasy Were Only 30% Accurate.
[12] APOBEC-mediated editing of viral RNA.
[13] Mechanisms of viral mutation.
[14] Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear.
[15] Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus.
[16] Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry.
[17] GISAID: Global initiative on sharing all influenza data-from vision to reality.
[18] Clustal Omega. Current Protocols in Bioinformatics.
[19] Distance between sets.
[21] Presence of mismatches between diagnostic PCR assays and coronavirus SARS-CoV-2 genome.

Khan et al. analyzed the presence of mutations/mismatches on 27 diagnostic assays [21] [...]

SARS-CoV-2 nucleocapsid (N) gene primers and probes have the most mutations.

The authors declare that they have no conflict of interest.

The authors thank the IBM TJ Watson Research Center, the COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance. GWW thanks Dr. Jeremy S Rossman for valuable comments.

[...] emergence of viral variants that are no longer detectable by certain diagnostic tests is a real possibility. A cocktail test kit is needed to mitigate mutations. We propose nucleotide-based and gene-based diagnostic target optimizations to design the most reliable diagnostic targets. We analyze a full list of SNPs for all 31421 genome isolates, including their positions and mutation types.
This information, together with the ranking of the degree of conservativeness of SARS-CoV-2 genes or proteins given in Table 3, enables researchers to avoid non-conservative genes (or their proteins) and mutated nucleotide segments in designing COVID-19 diagnostics, vaccines, and drugs.

SARS-CoV-2 genome sequences from infected individuals dated between January 5, 2020, and July 23, 2020, are downloaded from the GISAID database [17] (https://www.gisaid.org/). We only consider the records in GISAID with complete genomes (> 29000 bp) and submission dates. The resulting 31421 complete genome sequences are aligned to the reference SARS-CoV-2 genome [3] by using Clustal Omega multiple sequence alignment with default parameters [18]. Gene variants are recorded as SNPs. The Jaccard distance [19] is employed to compute the similarities among genome samples. The resulting distance matrix is used in the k-means clustering of all genome samples.

The Jaccard distance, which is widely used in the phylogenetic analysis of human or bacterial genomes, measures the dissimilarity between SNP variants. Given two sets $A$ and $B$, we first define the Jaccard similarity coefficient

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

and the Jaccard distance is the difference between one and the Jaccard similarity coefficient:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.$$
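As a minimal sketch of how the Jaccard distance between SNP profiles and the resulting distance matrix might be computed, the Python below treats each genome sample as a set of SNPs. The function names, the `(position, reference base, mutated base)` tuple encoding, the convention for two empty sets, and the toy profiles are assumptions made for illustration; they are not taken from the authors' pipeline or from real isolates.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty SNP sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance: one minus the Jaccard similarity coefficient."""
    return 1.0 - jaccard_similarity(a, b)


def distance_matrix(snp_profiles):
    """Symmetric matrix of pairwise Jaccard distances (zero diagonal)."""
    n = len(snp_profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(snp_profiles[i], snp_profiles[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical SNP profiles, one set per sample; values are toy data only.
toy_profiles = [
    {(100, "C", "T"), (200, "A", "G"), (300, "C", "T")},
    {(100, "C", "T"), (200, "A", "G")},
    {(400, "G", "T")},
]
for row in distance_matrix(toy_profiles):
    print(row)
```

Each row of the resulting matrix can then serve as the feature vector of one sample when the distances are passed to a clustering method.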