key: cord-0788006-sx1tiro9 authors: Yin, Changchuan title: Latent periodicity-2 in coronavirus SARS-CoV-2 genome: evolutionary implications date: 2021-01-26 journal: J Theor Biol DOI: 10.1016/j.jtbi.2021.110604 sha: ec6da14a05568bc88505223f83df80f043f5a5c2 doc_id: 788006 cord_uid: sx1tiro9

The ongoing global pandemic of the infectious disease COVID-19, caused by the 2019 novel coronavirus (SARS-CoV-2, formerly 2019-nCoV), presents critical threats to public health and the economy. The genome of SARS-CoV-2 has been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end, we present a mathematical method for the genomic spectrum, a kind of barcode, of SARS-CoV-2 and common human coronaviruses. The genomic spectrum is constructed according to the periodic distributions of nucleotides and therefore reflects the unique characteristics of the genome. The results demonstrate that coronavirus SARS-CoV-2 exhibits predominant latent periodicity-2 in the regions of non-structural proteins 3, 4, 5, and 6. Further analysis of the latent periodicity-2 regions suggests that the dinucleotide imbalances have increased during evolution and may confer evolutionary fitness on the virus. In particular, SARS-CoV-2 isolates have shown increased latent periodicity-2 and periodicity-3 during the COVID-19 pandemic. The special strong periodicity-2 regions and the intensity of periodicity-2 in the SARS-CoV-2 whole genome may become diagnostic and pharmaceutical targets in monitoring and curing the COVID-19 disease.

The current global pandemic of COVID-19 caused by the novel coronavirus SARS-CoV-2, formerly 2019-nCoV, has been a severe threat to pub- [...] found in 2017 (Zhou et al., 2018), caused millions of piglet deaths, but no human cases. We may postulate that SARS-CoV-2 evolved from its progenitor through mutations that conferred evolutionary fitness during host shift and adaptation.
SARS-CoV-2 is a fast-evolving pathogen that continuously acquires mutations over generations of infection in the host (Yin, 2020; Wang et al., 2020; Korber et al., 2020). This fact suggests that SARS-CoV-2 had evolved mutations in critical proteins and genome structures prior to establishing human infection. Understanding how SARS-CoV-2 jumped from animals and adapted to infect humans is important for the surveillance of virus evolution and diversity, and therefore ultimately for controlling the COVID-19 pandemic and preventing future SARS-like outbreaks.

SARS-CoV-2 is an enveloped positive-strand RNA virus with an exceptionally long (29.9 kb) genome (Zhou et al., 2020). The genome consists of a 5' leader cap sequence along with a 3' poly(A) tail, genes encoding nonstructural proteins (nsps) and structural proteins, as well as several accessory proteins. Approximately two-thirds of the genome comprises two large overlapping open reading frames (ORF1a and ORF1ab), encoding polyproteins that are subsequently cleaved by viral proteases to generate 16 nonstructural proteins (nsp1 to nsp16). Nonstructural proteins are essential for RNA replication, transcription, and immune evasion. Nsp3, a large multidomain and multifunctional protein, plays essential roles in virus replication. The papain-like protease (PLpro) activity of nsp3 is responsible for the initial processing of the ORF1a protein. In addition, nsp3, together with nsp4 and nsp6, recruits intracellular membranes to form double-membrane vesicles (DMVs) to support viral RNA replication. Nsp5 is a second viral protease (3C-like protease, 3CLpro) that cleaves both ORF1a and ORF1ab proteins. The downstream regions of the genome encode the structural proteins: the spike (S) protein, the nucleocapsid (N) protein, the envelope (E) protein, and the membrane (M) protein. The four structural proteins are all required to produce a structurally complete viral particle.
The S protein mediates viral attachment to the host ACE2 receptor, and the subsequent fusion between the viral and host cell membranes enables the virus to enter host cells. The nucleocapsid (N) protein, one of the most abundant viral proteins, can bind to the RNA genome and participates in replication, assembly, and the host cell response during viral infection (McBride et al., 2014).

The SARS-CoV-2 genome has been sequenced and annotated, but little is known about the complex structure of the genome in light of evolutionary fitness and host infectivity. Conventionally, RNA viruses are considered to use encoded proteins to interact with the components of the cellular response. However, numerous discoveries show that virus RNA structures, determined by genome composition, may also play central roles in maximizing virus replication and evolutionary fitness (Jensen and Thomsen, 2012). For example, recent studies show that increased CG and TA dinucleotides in both coding and noncoding regions of echovirus 7 inhibit replication initiation during post-entry in several cell lines (Fros et al., 2017). Therefore, RNA viruses mimic host mRNA composition, for example, the dinucleotide frequencies (Fros et al., 2017). Animal genomes have a bias in their dinucleotide composition, and the heavy under-representation of CG and TA dinucleotides is especially well known. Most animal RNA and small DNA viruses suppress genomic CG and TA dinucleotide frequencies, apparently mimicking host mRNA composition (Di Giallonardo et al., 2017). If a virus RNA composition or structure is very different from host mRNA, RIG-I (retinoic acid-inducible gene I)-like receptors may detect RNA molecules that are absent from the uninfected host (Goubau et al., 2013). Detecting evolutionarily conserved microbial structures, known as pathogen-associated molecular patterns (PAMPs), is an important feature of the innate immune system.
Host cells possess intrinsic defense pathways that prevent replication of viruses with increased CG and TA frequencies through mechanisms independent of codon usage (Belalov and Lukashev, 2013).

The genomic spectrum demonstrates dinucleotide, trinucleotide, and multinucleotide distributions. The dinucleotide distributions are often considered as the signature of a genome (Karlin and Burge, 1995), and the periodicity-3 patterns are distinguishing characteristics of protein-coding regions (Tsonis et al., 1991). The strengths of the periodicity-2 and periodicity-3 in a genome are determined by the perfect levels and copy numbers of dinucleotide and trinucleotide repeats, respectively. Because latent periodicity-2 and periodicity-3 are the essential characteristics of a genome, in this study we examine only these two periodicities from the genomic spectrum when analyzing the genome. Note that this study investigates the latent periodicity-2 in SARS-CoV genomes, rather than dinucleotide repeats. In particular, sufficiently long stretches of tandem dinucleotide repeats are absent from viral genomes. Latent periodicity-2 does not require the occurrence of perfect dinucleotide repeats.

In this study, we present the genomic spectrum of SARS-CoV-2 and identify predominant latent periodicity-2 in ORF1a (nsps 3-6), a region that is instrumental in interacting with host immune systems and in pathogenicity and host infectivity. These latent periodicity-2 regions are essential to the survival and infectivity of the microbe and can be considered as one of the PAMPs in SARS-CoV-2. These genomic elements are evidence of the evolutionary fitness of SARS-CoV-2 in host shift and adaptation. Tracking the evolution of these elements may provide insights into the zoonotic origin of SARS-CoV-2 and the control of COVID-19 disease.
In addition, the critical regions of ORF1a (nsps 3-6) that have elevated periodicity-2 can be key pharmaceutical targets and possibly a basis for attenuated virus structures in vaccine development.

To inspect the traits of the SARS-CoV-2 genome, we use our periodicity analysis method to survey the periodic distributions of nucleotides and the resulting periodicities in the genome. We previously proposed the periodicity analysis method to quantitatively detect the nucleotide repeats and periodicities in a genome (Yin, 2017). The method uses nucleotide distributions on periodic positions in a genome and identifies approximate repeat structures as the signatures of the genome. Because we have added functionality to the original method, such as smoothing the periodicity profile, we describe the method in detail here, although the technical algorithms were delineated previously (Yin, 2017). Our computer programs for the periodicity analysis of a genome are publicly available in the GitHub repository https://github.com/cyinbox/DNADU.

The approximate tandem repeats and the resulting periodicity-p in a genome can be identified by counting the nucleotides of the same type over the positions corresponding to some integer period p. To formulate the strength of a periodicity, we construct four numerical vectors, one for each of the four nucleotides. The elements in each vector are the counts of the corresponding nucleotide on the periodic positions. The four numerical vectors of nucleotides are named congruence derivative (CD) vectors (Yin and Wang, 2016; Yin, 2017). The CD vector of a nucleotide for a specific periodicity is constructed from the cumulative frequencies of the nucleotide at these periodic positions (Definition 2.1).

Definition 2.1. For a DNA sequence of length n, let u_α(k) = 1 when the nucleotide α appears at position k, and u_α(k) = 0 otherwise, where α ∈ {A, T, C, G} and k = 1, · · · , n.
The congruence derivative vector of the nucleotide α of the DNA sequence for periodicity p is defined as

$$f_\alpha(j) = \sum_{\substack{1 \le k \le n \\ \operatorname{mod}(k,p) = \operatorname{mod}(j,p)}} u_\alpha(k), \qquad j = 1, \ldots, p, \tag{1}$$

where mod(k, p) is the modulo operation and returns the remainder after division of k by p, and f_α = (f_α(1), f_α(2), · · · , f_α(p)). The four congruence derivative vectors f_α of periodicity p for nucleotides A, T, C and G form a congruence derivative (CD) matrix of size 4 × p. The columns of the CD matrix indicate nucleotide frequencies at the periodic positions k = pt − q, where k is the position index of a DNA sequence, t = 1, 2, . . ., and q = p − 1, . . . , 2, 1, 0. For example, in the CD matrix of periodicity 5 for a DNA sequence, the first column shows the nucleotide frequencies at periodic positions k = 1, 6, 11, . . . , 5t − 4; the second column shows the nucleotide frequencies at periodic positions k = 2, 7, 12, . . . , 5t − 3; the third column shows the nucleotide frequencies at periodic positions k = 3, 8, 13, . . . , 5t − 2, and so on. The CD matrix of a DNA sequence describes nucleotide frequencies at all periodic positions and can be used to efficiently compute the Fourier power spectrum and determine periodicities in the DNA sequence (Yin and Wang, 2016). Therefore, the CD vector reflects the arrangement of repetitive sequence elements and intrinsic periodicities in the DNA sequence.

The periodicity strength can then be calculated from the statistical properties of the nucleotide frequencies over the periodic positions. Since the CD matrix contains the nucleotide frequencies on periodic positions, the variance of the matrix elements can measure how unevenly the nucleotides are distributed. For the CD matrix of periodicity p, the sum of the 4p elements of the matrix is equal to the length n of the DNA sequence, and the mean of the elements of the CD matrix is n/(4p).
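The construction in Definition 2.1 can be illustrated with a minimal sketch in plain Python. The function name `cd_matrix` and the row order A, T, C, G are choices made here for illustration and are not taken from the DNADU repository; characters other than A, T, C, G are simply skipped, which is also an assumption.

```python
NUCLEOTIDES = "ATCG"  # row order of the CD matrix (an illustrative choice)


def cd_matrix(seq, p):
    """Return the 4 x p congruence derivative matrix as a list of rows.

    Column j (0-based) counts the nucleotides found at 1-based positions
    k with k = j + 1, j + 1 + p, j + 1 + 2p, ... so column 0 covers
    k = 1, p + 1, 2p + 1, ... as in the periodicity-5 example.
    """
    row_of = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    matrix = [[0] * p for _ in NUCLEOTIDES]
    for idx, nt in enumerate(seq.upper()):  # idx is 0-based, position k = idx + 1
        if nt in row_of:                    # assumption: skip ambiguous symbols
            matrix[row_of[nt]][idx % p] += 1
    return matrix


def cd_total(matrix):
    """Sum of all 4p elements; equals the sequence length n for clean input."""
    return sum(sum(row) for row in matrix)


if __name__ == "__main__":
    m = cd_matrix("ATATATATCG", 2)
    print(m)            # rows A, T, C, G; columns odd and even positions
    print(cd_total(m))  # length of the sequence
```

For the toy sequence ATATATATCG and p = 2, A and C fall on odd positions and T and G on even positions, so the element sum is 10 and the element mean is 10/8, in line with the n/(4p) statement.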
Therefore, to quantify the nucleotide distribution, we define the normalized distribution uniformity (NDU) of a DNA sequence using the CD matrix (Definition 2.2).

Definition 2.2. For a DNA sequence of length n, let f_{i,j} be an element of the CD matrix of periodicity p. The normalized distribution uniformity of periodicity p of the DNA sequence, NDU(p), is defined in Eq. (2) as a normalized quadratic function of the deviations of the elements f_{i,j} from their mean n/(4p).

From Definition 2.2, we notice that the normalized distribution uniformity at periodicity p is an intuitive description of the level of imbalance of nucleotide frequencies on periodic positions. It is a quadratic function of the nucleotide frequencies and depends on the sequence length and the periodicity length. NDU(p) can be used to indicate the presence and intensity of the periodicity p in a DNA sequence. This method characterizes repetitive elements in terms of the repeat consensus, copy number, and perfect level (Yin, 2017).

When using a sliding window along a genome, the periodicities of 2 to 10 are calculated in each window segment. The sliding window length is 250 bp with a step size of one base. Therefore, a two-dimensional periodicity spectrum is formed for the genome. The two-dimensional spectrum of a genome can be considered as the signature or the barcode of the genome.

The nucleotide frequencies over the phased positions are affected by insertions and deletions (indels) because indels change the phases. To mitigate the impact of indels on the periodicity-p magnitude, we insert 0, 1, ..., p − 1 zeros at evenly divided points of a DNA sequence and compute the periodicity-p from these sequences. The periodicity magnitude is the maximum of the periodicity magnitudes obtained from all the sequences with insertions. Because of the computational complexity for long periodicities, this indel mitigation approach is used only when computing the periodicity-2 and periodicity-3 in the SARS-CoV-2 isolates from the COVID-19 pandemic.
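The sliding-window spectrum described above (250 bp window, one-base step, periodicities 2 to 10) can be sketched as follows. The score used here is an assumption: the sum of squared deviations of the CD matrix elements from n/(4p), divided by n. It follows the verbal description of Definition 2.2 but is not claimed to reproduce the exact normalization of Eq. (2); the names `ndu_assumed` and `periodicity_spectrum` are hypothetical.

```python
def ndu_assumed(seq, p):
    """Assumed uniformity score: sum((f_ij - n/(4p))^2) / n.

    This stands in for NDU(p); the true normalization of Eq. (2)
    may differ from the 1/n factor used here.
    """
    n = len(seq)
    mean = n / (4 * p)
    counts = {}
    for idx, nt in enumerate(seq):
        key = (nt, idx % p)
        counts[key] = counts.get(key, 0) + 1
    total = 0.0
    for nt in "ATCG":
        for j in range(p):
            total += (counts.get((nt, j), 0) - mean) ** 2
    return total / n


def periodicity_spectrum(genome, window=250, periods=range(2, 11), step=1):
    """Two-dimensional spectrum: one row per window, one column per period."""
    spectrum = []
    for start in range(0, len(genome) - window + 1, step):
        segment = genome[start:start + window]
        spectrum.append([ndu_assumed(segment, p) for p in periods])
    return spectrum


if __name__ == "__main__":
    toy = "AT" * 150                      # perfect dinucleotide repeat, 300 bp
    rows = periodicity_spectrum(toy)
    print(len(rows), len(rows[0]))        # number of windows, number of periods
    print(rows[0].index(max(rows[0])) + 2)  # period with the strongest signal
```

On a perfect AT repeat the period-2 column dominates every window under this assumed score; with real genomes the signal is latent, so the profile along the genome is what carries the information rather than any single window.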
To locate the positions of a repeat region in a genome, we smooth and filter the corresponding sliding-window periodicity by moving average convolution (De Jong, 1989). Then the peaks in the periodicity profile are detected using the Z-score algorithm (Brakel, 2020). The peak positions are used to demarcate repeats in a genome. Note that in addition to the proposed NDU method, the exact locations and the sequence patterns of repeats in a genome can be further depicted using RepeatMasker (Tarailo-Graovac and Chen, 2009) or Tandem Repeats Finder (Tarailo-Graovac and Chen, 2009).

In a nutshell, to compute distribution uniformities of different periodicities of a DNA sequence, we first scan the sequence in different periodicity sizes, construct the congruence derivative matrix of each periodicity, and compute the NDU(p) of these periodicities p. The periodicity with the maximum distribution uniformity reflects the predominant pattern of repetitive elements. The NDU values of periodicities indicate the perfect levels and copy numbers of the corresponding repeat regions.

This study depends on the complete genomes of coronaviruses, including SARS-CoV-2 (Wu et al., 2020), Severe Acute Respiratory Syndrome (SARS)-related coronavirus, Middle East Respiratory Syndrome coronavirus (MERS-CoV), and human-infecting coronaviruses (human-CoVs). These genome data were retrieved from the National Center for Biotechnology Information (NCBI) GenBank. The bat SLCoV/RaTG13 complete sequence was reported by Zhou et al. (2020) and retrieved from the GISAID repository (http://www.GISAID.org) (Shu and McCauley, 2017). A total of 3479 complete genomes of SARS-CoV-2 isolates from around the globe, dated Jan. to Dec. 2020, were randomly sampled and downloaded. The complete SARS-CoV-2 genomes satisfy the following conditions: the genome sequences have no uncertain nucleotides N, and the genome lengths are full according to the reference genome.
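Returning to the peak-location step described at the start of this section, the smoothing and Z-score detection can be sketched as below. This is a simplified version in the spirit of the smoothed Z-score approach, not the exact implementation used in the study; the parameters `lag`, `threshold`, and `influence`, and all function names, are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def moving_average(values, width):
    """Centered moving average; edges average over the available neighbours."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def zscore_peaks(values, lag, threshold, influence):
    """Flag points deviating more than `threshold` standard deviations
    from the mean of the previous `lag` filtered points (+1 up, -1 down)."""
    signals = [0] * len(values)
    filtered = list(values[:lag])
    for i in range(lag, len(values)):
        window = filtered[i - lag:i]
        avg, sd = mean(window), pstdev(window)
        if abs(values[i] - avg) > threshold * sd:
            signals[i] = 1 if values[i] > avg else -1
            # damp the contribution of flagged points to later statistics
            filtered.append(influence * values[i] + (1 - influence) * filtered[i - 1])
        else:
            filtered.append(values[i])
    return signals


def peak_regions(signals):
    """Group consecutive +1 signals into (start, end) index intervals."""
    regions, start = [], None
    for i, s in enumerate(signals):
        if s == 1 and start is None:
            start = i
        elif s != 1 and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(signals) - 1))
    return regions


if __name__ == "__main__":
    profile = [1.0] * 30 + [5.0] * 5 + [1.0] * 30   # toy periodicity-2 profile
    flags = zscore_peaks(profile, lag=10, threshold=3.0, influence=0.0)
    print(peak_regions(flags))
```

In practice the profile would first be passed through `moving_average`, and the resulting intervals would be shifted by the window offset to obtain genome coordinates; suitable parameter values depend on the noise level of the profile.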
The genome data in this study are listed and acknowledged in the supplementary material.

3.1. Genomic spectrum of coronavirus SARS-CoV-2 reveals rich periodicity-2 pattern in nsps

To identify the signature features of the coronavirus SARS-CoV-2 genome, we employ the periodicity spectrum analysis to identify the characteristic periodicities in the genome. We create the genomic spectrum (barcode) of SARS-CoV-2 (Fig. 1(a)) using the sliding-window NDU method and compare it with the counterparts of SLCoV/RaTG13 (Fig. 1(b)), SARS-CoV/Tor2 (Fig. 1(c)), and MERS-CoV (Fig. 1(d)). From the spectrum comparison, we observe that SARS-CoV-2 and SLCoV/RaTG13 both have pronounced periodicity-2 in four regions, while both SARS-CoV/Tor2 and MERS-CoV have only an extremely low level of periodicity-2 in the corresponding regions. The strong periodicity-2 in SARS-CoV-2 encouraged us to investigate the causes in detail.

To locate the regions rich in periodicity-2 motifs, we verify that periodicity-2 and periodicity-3 are strong signals among all genomic periodicities (Fig. 2(a, d)), and detect the peaks of the sliding-window periodicity profiles (Fig. 2(b, c)). The peak positions are used to demarcate the dinucleotide motif regions in the genome. The dinucleotide motif regions (dinucleotide TN islands, where N is A, T, C or G) are in ORF1a, and the corresponding genes are listed in Table 1. From the genomic spectrum analysis, the regions with a relative abundance of dinucleotide motifs, particularly the dinucleotide motif TN regions, are mapped to the genes of ORF1a (nsp3, nsp4, and nsp6) (Fig. 3 and Table 1). However, these dinucleotide motifs are weak or imperceptible in the corresponding regions in the SARS-CoV and MERS-CoV genomes.

In the coronavirus SARS-CoV-2 RNA genome, the replicase gene of 20 kb encodes two overlapping polyproteins, ORF1a (replicase 1a) and ORF1ab (replicase 1ab). The genome structure and the corresponding dinucleotide regions identified are illustrated in Fig. 3.
The two polyproteins are responsible for viral replication and transcription. The expression of the C-proximal portion of pp1ab requires (-1) ribosomal frame-shifting. The first dinucleotide motif region is in the coding region of the papain-like proteinase (PL proteinase, non-structural protein 3, nsp3). Nsp3 is the largest essential component of the replication and transcription complex. The PL proteinase in nsp3 cleaves nsps 1-3 and blocks the host innate immune response, promoting cytokine expression (Lei et al., 2018; Serrano et al., 2009). The second dinucleotide motif region is in the coding region of non-structural protein 4 (nsp4). Nsp4 is responsible for forming double-membrane vesicles (DMVs). The third dinucleotide motif region is in the coding region of the C-terminal part of the 3CLpro protease (3 chymotrypsin-like proteinase, 3CLpro) and nsp6. The 3CLpro protease is essential for RNA replication. It is responsible for processing the C-terminus of nsp4 through nsp16 in all coronaviruses (Anand et al., 2003). Therefore, the conserved structure and catalytic sites of 3CLpro may serve as attractive targets for antiviral drugs (Kim et al., 2012). Together, nsp3, nsp4, and nsp6 can induce DMVs (Angelini et al., 2013).

In summary, the high latent periodicity-2 islands found in this study are located in the host-interaction regions of the genome of SARS-CoV-2. These special periodicity-2 regions in ORF1a most likely contribute to the adaptive immune response, therefore implying evolutionary fitness. Coincidentally, previous work on MERS-CoV using co-evolution analysis revealed that nsp3 represents a preferential selection target in the adaptive evolution of zoonotic MERS-CoV to a new host (Forni et al., 2016). Our finding that nsp3 is involved in evolutionary fitness is consistent with the discovery in MERS-CoV.

We investigate the correlation between the contents of dinucleotides AA and TA in SARS-CoV genomes and virulence. We examine the dinucleotides in the genomic region from 6 kb to 12 kb.
The increased periodicity-2 in the genomic regions of SARS-CoV-2 and SARS-like CoVs results from the unbalanced distributions of dinucleotides. To understand the evolutionary tendency of coronavirus genomes, we examine the genomic spectra of four major SARS-like coronaviruses (SLCoVs): pangolin-SLCoV, SLCoV/ZXC21, SLCoV/WIV1, and SLCoV/Shaanxi2011; the bat SLCoVs naturally occur in horseshoe bats (Rhinolophidae). Because pangolin-SLCoV was found to be similar to SARS-CoV-2, the pangolin was tentatively postulated as an intermediate animal host of SARS-CoV-2. SLCoV/ZXC21 is the second most similar strain to SARS-CoV-2, with 82% similarity (Hu et al., 2018). SLCoV/WIV1 is closely related to SARS-CoV/Tor2 in terms of genome identity and ACE2 binding in human cells (Ge et al., 2013). SLCoV/Shaanxi2011 was found in 2011.

The results show that pangolin-SLCoV displays three major latent periodicity-2 regions in nsp3 and nsp4, but lacks the corresponding dinucleotide repeats in 3CLpro and nsp6 that are found in SARS-CoV-2 (Fig. 4(a)). SARS-CoV-2 is therefore closest to SLCoV/RaTG13 (Fig. 1(b)) and SLCoV/ZXC21 (Fig. 4(b)), not pangolin-SLCoV. If pangolins are the intermediary hosts of SARS-CoV-2, and SARS-CoV-2 indeed evolved from pangolin-SLCoV, we may infer that pangolin-SLCoV would need to evolve dinucleotide repeats in 3CLpro and nsp6 to gain evolutionary fitness before infecting human hosts. We also observed the evolutionary trend between SLCoV/WIV1 (Fig. 4(c)) and SLCoV/Shaanxi2011 (Fig. 4(d)). The genomic spectrum of SLCoV/Shaanxi2011 is similar to that of SLCoV/WIV1, but has an additional, increased dinucleotide repeat in 3CLpro and nsp6. This new dinucleotide repeat in the 3CLpro and nsp6 region of SLCoV/Shaanxi2011 is consistent with the regions found in SARS-CoV-2, SLCoV/RaTG13, and SLCoV/ZXC21. Therefore, the high latent periodicity-2 regions in 3CLpro and nsp6 probably play an important role in the evolutionary fitness of SARS-CoV-2 to human hosts.
The results show that only SARS-CoV and SARS-like CoVs have low latent periodicity-2. Low dinucleotide content can be regarded as a sign of an early stage of evolutionary fitness in interacting with the human immune system; low dinucleotide content may then render the virus highly virulent, because the virus has not adapted to the host immune system and the host immune system reacts intensely.

As we have seen, latent periodicity-2 is a special signal in CoV genomes and is caused by unbalanced dinucleotide motif content. To investigate the correlation between dinucleotide content and pathogenicity of coronaviruses, we produce and compare the genomic spectra of four common human coronaviruses (Fig. 5). The classical human coronavirus 229E (HCoV-229E) and human coronavirus OC43 (HCoV-OC43) were identified in the 1960s. The two viruses are close relatives and are similar in their pathogenicity in humans. Both HCoV-229E and HCoV-OC43 can infect young children, the elderly, and people with low immune function. Almost 100% of children are infected in early childhood, mainly with self-limiting upper respiratory infections, such as the common cold, and intestinal infections. Symptoms caused by the HCoV-OC43 strain are generally more severe than those caused by the HCoV-229E virus. From the genomic spectrum analysis, we observe higher periodicity-2 in HCoV-229E than in HCoV-OC43 (Fig. 5(a, c)). These periodicity-2 rich regions correspond to the three sub-regions in SARS-CoV-2. A high periodicity-2 value may attenuate virus replication and therefore reduce severe virulence. This agrees with the correlation between dinucleotide motif content and pathogenicity described previously. The spectra of HCoV-NL63 and HCoV-HKU1 demonstrate extremely high periodicity-2, as well as periodicity-3, in the corresponding regions (Fig. 5(b, d)). HCoV-NL63 and HCoV-HKU1 are the most common human CoVs and cause only mild cold symptoms or no symptoms (Pyrc et al., 2007).
Again, these high dinucleotide repeats may contribute to the mild pathogenicity of these two viruses.

We wish to know whether SARS-CoV-2 originally evolved from SARS-CoV, or whether SARS-CoV-2 would evolve toward SARS-CoV. The answer to this question may help us to predict the evolution of the SARS-CoV-2 virus for better disease prediction and control. Because genomes of SARS-related coronaviruses spanning a long time period are rarely available, we infer the evolution of coronaviruses by tracking the trend of HCoV-229E coronaviruses over the last six decades, starting from the first human-infecting HCoV-229E identified in 1962 (Thiel et al., 2001). The HCoV-229E virus causes the common cold, but occasionally it can be associated with more severe respiratory infections in children, the elderly, and persons with underlying illness. Using the measurements of dinucleotides in the genomic regions, the trend of HCoV-229E coronaviruses may be used to infer the evolutionary stages of bat coronaviruses. In a similar way, we may then determine the origin of SARS-CoV-2 if the trend of SARS-CoV-2 is compared with SARS-like coronaviruses.

The human coronavirus HCoV-229E strains used in the dinucleotide trend analysis are from different historical periods. The reference genome HCoV-229E in the evolutionary analysis was obtained from the infectious HCoV-229E, the 1973-deposited laboratory-adapted prototype strain of HCoV-229E (VR-740). The HCoV-229E prototype strain was originally isolated in 1962 from a patient in Chicago. The first clinical HCoV-229E isolate from a US patient in 2012 was included in this study (Farsani et al., 2012). The periodicity trends of the HCoV-229E coronaviruses demonstrate that both periodicity-2 (Fig. 6(a)) and periodicity-3 (Fig. 6(b)) in the coronaviruses increase with evolutionary time. SARS-CoV-2 has relatively high periodicity-2 and periodicity-3.
The evolutionary origin of coronaviruses can be inferred from the trends of periodicity-2 and periodicity-3. Therefore, we may compare the periodicity-2 and periodicity-3 in SARS-CoV-2 and SARS-like coronaviruses to understand the evolutionary origin of SARS-CoV-2.

To investigate the correlation of dinucleotide repeats with virus virulence in SARS-related coronaviruses, we compare the spectra of the genomic region at coordinates 6k-12k bp of five SARS-related coronaviruses. This region contains abundant dinucleotide repeats: three dinucleotide repeats are located approximately in the 6k-8k, 8k-10k, and 10k-12k sub-regions. The five SARS-related coronaviruses have different levels of virulence. The most virulent virus is MERS-CoV, followed by SARS-CoV. The periodicity-2 magnitudes, which reflect the distribution of dinucleotides, are compared and shown in Fig. 7(a, b). We may see that the MERS-CoV genomic region has the lowest dinucleotide level in all three dinucleotide sub-regions. SARS-CoV also has a low dinucleotide level, but higher than that of MERS-CoV. Fig. 8 also shows the depleted CG and GC contents in the SARS-CoV-2 genome, which is consistent with a recent study (Xia, 2020). We may therefore infer that a low dinucleotide level correlates with high virulence: the lower the dinucleotide level, the higher the virus virulence. This postulation is supported by the observation of the dinucleotides in three SARS-related coronaviruses. Compared with SARS-CoV, the SARS-like bat SLCoV/Rp3 has similar dinucleotide distributions in sub-regions 1 (6k-8k) and 3 (10k-12k), but a higher dinucleotide distribution in sub-region 2 (8k-10k). SLCoV/Rp3 has lower virulence than SARS-CoV. The SARS-like SLCoV/ZXC21, which shares the highest sequence identity with SARS-CoV-2, shows slightly lower dinucleotide distributions in sub-regions 2 and 3, and a slightly higher dinucleotide distribution in sub-region 1.
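The sub-region comparison above rests on counting dinucleotides in windows of the genome. A minimal sketch of such a count is given below; the function names, the 1-based inclusive coordinate convention, and the use of the odds ratio ρ_xy = f_xy / (f_x f_y) as a relative-abundance measure are assumptions for illustration and are not taken from the study's code.

```python
from collections import Counter

BASES = "ATCG"


def dinucleotide_frequencies(seq):
    """Relative frequencies of the 16 overlapping dinucleotides in seq."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [d for d in pairs if d[0] in BASES and d[1] in BASES]
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return {a + b: 0.0 for a in BASES for b in BASES}
    return {a + b: counts[a + b] / total for a in BASES for b in BASES}


def odds_ratio(seq, dinuc):
    """Relative abundance f_xy / (f_x * f_y); values below 1 suggest depletion."""
    seq = seq.upper()
    clean = [c for c in seq if c in BASES]
    f_x = clean.count(dinuc[0]) / len(clean)
    f_y = clean.count(dinuc[1]) / len(clean)
    return dinucleotide_frequencies(seq)[dinuc] / (f_x * f_y)


def subregion(genome, start, end):
    """Slice with 1-based inclusive coordinates (assumed convention)."""
    return genome[start - 1:end]


if __name__ == "__main__":
    toy = "TTTATA"
    freqs = dinucleotide_frequencies(toy)
    print(freqs["TT"], freqs["TA"], freqs["CG"])
    print(odds_ratio(toy, "TA"))
```

For a real genome string, calls such as `dinucleotide_frequencies(subregion(genome, 6001, 8000))` would give the TT, TA, and CG levels of the first sub-region; the exact boundaries used in the figures are not specified beyond the approximate 6k-8k, 8k-10k, and 10k-12k ranges.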
We notice that the trends of periodicity-2 and periodicity-3 from MERS-CoV, SARS-CoV, SARS-like CoVs, and SARS-CoV-2 are increasing (Fig. 8(a, b)). Based on the previous analysis of the periodicity trends of coronaviruses at different times, we may infer that SARS-CoV-2 originates from the SARS-like CoVs.

3.5. The dynamics of latent periodicity-2 in SARS-CoV-2 genome during evolution

It is well established that dinucleotide composition imbalance in RNA viruses may affect virus replication, specifically attenuating or strengthening virus virulence during evolution (Fros et al., 2017; Gu et al., 2019). Clinical evidence has suggested that SARS-CoV-2 has lower virulence than SARS-CoV. To investigate whether the increased dinucleotide repeats correlate with virus virulence, we compare the dinucleotide frequencies in the three dinucleotide motif rich islands of SARS-CoV-2 and SARS-CoV/Tor2. The result shows that both SARS-CoV-2 and SARS-CoV have an abundance of dinucleotides TT and TA in the whole genome and in the three prominent dinucleotide motif regions (Fig. 8(a, b, c, d)), and extreme CG deficiency in sub-region 3 (Fig. 8(d)). The dinucleotides TT and TA are increased in SARS-CoV-2 compared with SARS-CoV (Fig. 8(a, b, c, d)).

The role of the increased dinucleotides TT and TA in the SARS-CoV-2 genome is possibly to attenuate virus replication. One mechanism for attenuating a virus by dinucleotide bias is that the dinucleotide regions fold into a special structure that is a target for cellular RNA cleavage, which is a fundamental host response for controlling viral infections (Zhou et al., 1993). The 2,5-oligoadenylate synthetase/RNase L system is an innate immunity pathway that responds to a pathogen-associated molecular pattern (PAMP) to induce degradation of viral and cellular RNAs, thereby blocking viral infection. In higher vertebrates, this process is often regulated by interferons (IFNs).
Ribonuclease L (RNase L, L is for latent) is an interferon (IFN)-induced antiviral ribonuclease which, upon activation, destroys all RNA within the cell, both cellular and viral (Silverman, 2007). RNase L cleaves hepatitis C virus (HCV) RNA at single-stranded TT and TA dinucleotides throughout the open reading frame (ORF). An interesting discovery is that in the bacterium Mycoplasma pneumoniae, which is a respiratory infection agent, the genome also has extreme relative abundances of dinucleotides TT and TA (Karlin, 1998). Therefore, we may postulate that the dinucleotide TT and TA regions in SARS-CoV-2 are possibly cleaved by RNase L during infection.

3.6. Increases in periodicities in SARS-CoV-2 genomes during the COVID-19 pandemic

During the ongoing COVID-19 pandemic in the year 2020, large numbers of SARS-CoV-2 isolate genomes with transmission dates from around the globe have accumulated in the GISAID database. To gather additional evidence supporting our hypothesis that the latent periodicity-2 in SARS-CoV-2 increases during human infection, we measure the latent periodicities 2 and 3 in SARS-CoV-2 genomes during the COVID-19 pandemic (Fig. 9). The high-quality SARS-CoV-2 genomes are randomly selected worldwide for each month of 2020. The results show that in the infection period from Jan. to Sept. 2020, the SARS-CoV-2 population had increasing latent periodicity-2 (Fig. 9(a)) and periodicity-3 (Fig. 9(b)) in the genome variants. Notably, the periodicity-2 and periodicity-3 in the period Oct.-Dec. 2020 are slightly lower than in Jul.-Sept. 2020 (Fig. 9). The decreases of periodicities 2 and 3 of the SARS-CoV-2 genomes in Oct.-Dec. 2020 can probably explain the virulence increase in the second wave of the global COVID-19 pandemic in Oct.-Dec. 2020. Therefore, these results further support our hypothesis that the latent periodicity-2 and periodicity-3 in an RNA virus increase during human infection.
The changes of periodicity magnitude can be global indicators of the genome in tracking virus evolution. Nevertheless, further analysis of the dynamics of periodicities 2 and 3 across different geographic regions and clinical outcomes will be needed to understand the correlation between the periodicities and SARS-CoV-2 virulence during infection and evolution.

In this study, we investigate the latent periodicity-2 and periodicity-3, which are caused by approximate tandem repeats in the SARS-CoV-2 genome. The approximate dinucleotide motifs in the genomic spectra are revealed by the periodicity analysis. We discover that the strength of these repeats correlates with the evolutionary fitness of the virus to a human host, maximizing its survival in epidemics instead of destroying the host. Therefore, RNA viruses mimic host mRNA composition, such as the dinucleotide composition (Fros et al., 2017). Most vertebrate RNA and small DNA viruses suppress genomic CG and TA dinucleotide frequencies, apparently mimicking host mRNA composition. Regions with an abundance of dinucleotides TT and TA are most likely common pathogenicity islands in microbial genomes. This study on SARS-CoV-2 provides additional evidence that the increased content of dinucleotides TT and TA in SARS-CoV-2 is the result of interaction with the host during virus evolution. We consider these three dinucleotide abundance regions as pathogenicity islands of the SARS-CoV-2 genome. In addition, these special regions may contribute to RNA replication and can be recognized by cellular RNase L for RNA degradation in the immune response. However, this study is only a theoretical analysis of the genomes. The actual functional consequences of these dinucleotide imbalances and their impacts on transmissibility and pathogenesis should be determined by biochemical experiments and animal models.
The choice of latent periodicity-2 as the genome signature is the simplest, but combining periodicity-2 and periodicity-3 would be more informative (Karlin and Burge, 1995). The coding periodicity-3 is natural and is the most pronounced periodicity in viral genomes. As we have shown for SARS-CoV-2 genomes sampled during the COVID-19 pandemic in 2020, both periodicity-2 and periodicity-3 increase with infection time. We will investigate the contributions of the two periodicities to the virulence and transmission of SARS-CoV-2 in detail in future research.

In humans and other mammals, APOBEC3 (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3) systems help protect the organism from viral infections. Regarding the origin of dinucleotide repeats in the SARS-CoV-2 genome, we speculate that the molecular mechanism behind the increase in dinucleotide repeats during evolutionary fitness is possibly APOBEC3-mediated editing of viral RNA, in which cytosine (C) is often mutated to uracil (U) by deamination (Bishop et al., 2004). Therefore, many dinucleotide TN motif regions could be generated as periodicity-2 islands in the RNA genome by the APOBEC defense system.

Cross-species transmission of coronaviruses from wildlife reservoirs may lead to disease outbreaks in humans, posing a severe threat to human health. To date, most studies on the zoonotic origin of SARS-CoV-2 focus primarily on the Spike protein, which is essential for the entry of virus particles into the cell. Mutations in the Spike protein, or the acquisition of a potential cleavage site for furin proteases, may confer zoonotic transmissibility on SARS-CoV-2 (Andersen et al., 2020; Hoffmann et al., 2020). However, the Spike protein is required but not sufficient for zoonotic coronavirus transmission.
For instance, the Spike protein of the SARS-like SHC014-CoV can only enable the chimeric virus SHC014-MA15 to infect human cells when the Spike protein is integrated into a wild-type SARS-CoV backbone (Menachery et al., 2015). The SARS-CoV backbone contains the ORF1ab replication components. Our study provides evidence for the importance of the ORF1ab replication regions in evolutionary fitness. Consequently, these ORF1ab replication components are critical for zoonotic transmission. In particular, the results of this study suggest that as these dinucleotide repeat regions in ORF1ab of the SARS-CoV-2 genome increase, the virus becomes better adapted to human hosts and its virulence therefore decreases. Modification of these repeat regions could thus produce an attenuated virus, which may be effective and safe for vaccine development.

This study on the genomic spectrum of SARS-CoV-2 reveals that regions rich in the dinucleotide TT in ORF1a might contribute to evolutionary fitness through host immune evasion. Accordingly, efforts to monitor SARS-CoV-2 and to develop antiviral drugs should take into account the molecular characteristics and changes of nsps 3-6, which are vital components in the interaction with human hosts. These special dinucleotide motif regions should be investigated in detail for their functions and for the phenotype changes they cause in SARS-CoV-2. Importantly, these high periodicity-2 regions can possibly be prophylactic and therapeutic targets for controlling COVID-19.

Finally, we clarify some aspects of the method used in this study. In genome analysis, the Fourier transform is often used to detect periodicities and repeats in a DNA sequence. The Fourier transform method computes the power spectrum at each frequency on the four binary indicator sequences that mark the positions of the four nucleotides in the DNA sequence. In the periodicity detection method of this study, we introduce the mean over the four nucleotides, i.e., n/4p, in Eq(2), instead of the mean of the individual nucleotide composition.
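The contrast between the two choices of mean can be illustrated with a small sketch. It is not a reproduction of Eq(2), whose exact form is not restated here. It assumes that the periodicity statistic is a sum of squared deviations of position-class counts (positions grouped by index modulo p) from a mean, and it reads n/4p as n/(4p). The function names and the toy sequence are hypothetical.

```python
import cmath

NUCLEOTIDES = "ACGT"


def position_counts(seq, p):
    """counts[nt][j] = occurrences of nt at positions i with i % p == j."""
    counts = {nt: [0] * p for nt in NUCLEOTIDES}
    for i, nt in enumerate(seq.upper()):
        if nt in counts:
            counts[nt][i % p] += 1
    return counts


def fourier_power(seq, p):
    """Sum over the four binary indicator sequences of the DFT power at
    frequency n/p. Requires len(seq) to be divisible by p."""
    seq = seq.upper()
    n = len(seq)
    if n % p != 0:
        raise ValueError("sequence length must be divisible by p")
    k = n // p
    total = 0.0
    for nt in NUCLEOTIDES:
        # DFT coefficient of the 0/1 indicator sequence of nt at frequency k.
        coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, s in enumerate(seq) if s == nt)
        total += abs(coeff) ** 2
    return total


def squared_deviation(seq, p, uniform_mean):
    """Assumed statistic: squared deviations of position-class counts from
    either the uniform mean n/(4p) or each nucleotide's own mean count."""
    n = len(seq)
    counts = position_counts(seq, p)
    total = 0.0
    for nt in NUCLEOTIDES:
        mean = n / (4 * p) if uniform_mean else sum(counts[nt]) / p
        total += sum((c - mean) ** 2 for c in counts[nt])
    return total


if __name__ == "__main__":
    toy = "TATATATATACG"  # hypothetical toy sequence, length divisible by 2 and 3
    for p in (2, 3):
        print(p, fourier_power(toy, p),
              squared_deviation(toy, p, uniform_mean=False),
              squared_deviation(toy, p, uniform_mean=True))
```

Under these assumptions, the per-nucleotide-mean statistic is proportional to the Fourier power at frequency n/p (by a factor of 2 for p = 2 and 1.5 for p = 3). The uniform-mean statistic is never smaller, because for each nucleotide it adds p times the squared difference between that nucleotide's own mean count and n/(4p), so it also responds to compositional imbalance.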
If we take the mean of the individual nucleotide composition in Eq(2), the periodicity-2 and periodicity-3 spectra from the NDU method are the same as the corresponding Fourier power spectra (Yin and Wang, 2016). Therefore, we consider the use of the mean n/4p in Eq(2) an advantage of our original method for capturing periodicity-2 (Yin, 2017). In addition, the periodicity detection method analyzes the four nucleotides in parallel, yet the periodicity spectrum is reported only as the sum over the four nucleotide types (Eq(2)). The periodicity profiles of the individual nucleotide types could be more informative. In particular, the rankings of the periodicity-2 intensities of SARS-CoV-2 and SARS-CoV are both T>C>G>A (SARS-CoV-2: periodicity-2(A)=36.5391, periodicity-2(T)=75.2827, periodicity-2(C)=65.8607, periodicity-2(G)=43.5443; SARS-CoV/Tor2: periodicity-2(A)=18.3036, periodicity-2(T)=48.8777, periodicity-2(C)=37.7693, periodicity-2(G)=26.2949). The ranking of the coding periodicity-3 intensities for the complete genome of SARS-CoV-2 is T>C>G>A (periodicity-3(A)=25.2457, periodicity-3(T)=70.5814, periodicity-3(C)=48.1901, periodicity-3(G)=45.9472), whereas for SARS-CoV/Tor2 the corresponding ranking is T>G>C>A (periodicity-3(A)=14.1041, periodicity-3(T)=47.2578, periodicity-3(C)=26.2882, periodicity-3(G)=29.4027). The intensities for the same nucleotide type also differ markedly between the two viruses. Such four-nucleotide profiles of human CoVs are worth investigating in the future.

The author sincerely thanks the researchers worldwide who sequenced and shared the complete genomes of SARS-CoV-2 and other coronaviruses from GISAID (https://www.gisaid.org/). This research depends on these precious data. The author especially thanks Prof. Jiasong Wang (Nanjing University, China), Dr. Gang Cheng (the University of Illinois at Chicago), and Dr. Guo-Wei Wei (Michigan State University) for valuable discussions.
The author is grateful to three anonymous reviewers for their constructive comments on the methods and presentation of the paper.

The supplementary materials contain the list of genomes used in this study, the genome curation methods, and the periodicity values of SARS-CoV-2 isolates in the COVID-19 pandemic. The supplementary materials of this paper are in the separate folders of this paper.

Coronavirus main proteinase (3CLpro) structure: basis for design of anti-SARS drugs
The proximal origin of SARS-CoV-2
Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles
Causes and implications of codon usage bias in RNA viruses
APOBEC-mediated editing of viral RNA
Complete genome sequence of human coronavirus strain 229E
Peak signal detection in real-time time series data
Emerging coronaviruses: genome structure, replication, and pathogenesis
Smoothing and interpolation with the state-space model
Dinucleotide composition in animal RNA viruses is shaped more by virus family than by host species
The first complete genome sequences of clinical isolates of human coronavirus 229E
Coronaviruses: an overview of their replication and pathogenesis
Extensive positive selection drives the evolution of nonstructural proteins in lineage C betacoronaviruses
CpG and UpA dinucleotides in both coding and noncoding regions of echovirus 7 inhibit replication initiation post-entry
Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor
Cytosolic sensing of viruses
Dinucleotide evolutionary dynamics in influenza A virus
A multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells
Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats
Sensing of RNA viruses: a review of innate immune receptors involved in recognizing RNA virus invasion
Global dinucleotide signatures and analysis of genomic heterogeneity
Dinucleotide relative abundance extremes: a genomic signature
Broad-spectrum antivirals against 3C or 3C-like proteases of picornaviruses, noroviruses, and coronaviruses
Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
The coronavirus nucleocapsid is a multifunctional protein
A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence
The novel human coronaviruses NL63 and HKU1
Nuclear magnetic resonance structure of the nucleic acid-binding domain of severe acute respiratory syndrome coronavirus nonstructural protein 3
GISAID: Global initiative on sharing all influenza data - from vision to reality
Viral encounters with 2',5'-oligoadenylate synthetase and RNase L during the interferon antiviral response
Epidemiology, genetic recombination, and pathogenesis of coronaviruses
Using RepeatMasker to identify repetitive elements in genomic sequences
Infectious RNA transcribed in vitro from a cDNA copy of the human coronavirus genome cloned in vaccinia virus
Periodicity in DNA coding sequences: implications in gene evolution
Review of bats and SARS
Decoding SARS-CoV-2 transmission, evolution and ramification on COVID-19 diagnosis, vaccine, and medicine
WHO, 2020. Coronavirus disease 2019 (COVID-19) situation report - 130. Coronavirus Disease
A new coronavirus associated with human respiratory disease in China
Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense
Novel SARS-like betacoronaviruses in bats
Identification of repeats in DNA sequences using nucleotide distribution uniformity
Genotyping coronavirus SARS-CoV-2: methods and implications
Periodic power spectrum with applications in detection of latent periodicities in DNA sequences
Expression cloning of 2-5A-dependent RNAase: a uniquely regulated mediator of interferon action
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin
A pneumonia outbreak associated with a new coronavirus of probable bat origin