key: cord-0929148-k2juhyex
authors: Hassan, Sk. Sarif; Moitra, Atanu; Rout, Ranjeet Kumar; Choudhury, Pabitra Pal; Pramanik, Prasanta; Jana, Siddhartha Sankar
title: On spatial molecular arrangements of SARS-CoV2 genomes of Indian patients
date: 2020-05-09
journal: bioRxiv
DOI: 10.1101/2020.05.01.071985
sha: c39fe876990cd1a7f1590ba0fc16564743f87605
doc_id: 929148
cord_uid: k2juhyex

The whole world has been experiencing a pandemic caused by SARS-CoV2 since December 2019. A thorough understanding beyond just sequential similarities among the protein coding genes of SARS-CoV2 is important in order to differentiate it from, or relate it to, the other known CoVs of the same genus. In this study, we compare three genomes from India, namely MT012098 (India-Kerala), MT050493 (India-Kerala) and MT358637 (India-Gujrat), with NC_045512 (China-Wuhan) to view the spatial as well as molecular arrangements of the nucleotide bases of all the genes embedded in these four genomes. Based on different features extracted for each gene embedded in these genomes, the corresponding phylogenetic relationships have been built. Differences in the phylogenetic tree arrangement from one gene to another suggest that the three genomes of Indian origin either came from three different origins or that evolution of the viral genome is a very fast process. This study would also help to understand the virulence factors, disease pathogenicity, origin and transmission of SARS-CoV2.

COVID-19, the disease caused by SARS-CoV2, emerged in late December 2019 in Wuhan, China, and has since spread to various countries across the world [1]. Presently this disease, declared a pandemic by the WHO, is a major health concern [2]. The coronavirus family comprises different CoVs, each with a single-stranded, positive-sense RNA genome of approximately 26-32 kb [3]. The CoVs are classified into four genera: the α-CoVs, β-CoVs, γ-CoVs and δ-CoVs [4].
One of the most important genera of coronaviruses is the β-CoVs, to which the present SARS-CoV2 belongs [5]. The β-CoVs mainly infect humans, bats and other animals such as camels and rabbits [6]. Two-thirds of the SARS-CoV2 genome from the 5' end is occupied by the ORF1 gene, which encodes sixteen non-structural proteins, and the 3' end contains various structural protein coding genes, including those for the surface (S), envelope (E), membrane (M) and nucleocapsid (N) proteins [7]. In addition, six accessory protein coding genes, such as ORF3a, ORF6, ORF7a, ORF7b and ORF8, are also present in the SARS-CoV2 genome [8]. The spatial arrangement of genes over the SARS-CoV2 genome is presented in Fig. 1 [9].

The polyprotein ORF1ab encoded by the ORF1 gene plays key roles in virus pathogenesis, cellular signalling and modification of cellular gene expression [10]. The envelope (E) protein plays multiple roles during infection, including virus morphogenesis [11, 12]. The N protein encoded by the gene N plays a vital role in virus morphogenesis and assembly [13, 14]. The M gene encodes the M protein, which plays a central role in virus morphogenesis and assembly via its interactions with other viral proteins [15, 16]. It determines the shape of the virions and promotes membrane curvature. The M gene sequence of SARS-CoV2 shares 90.1% and 39.2% similarity with those of SARS-CoV and MERS-CoV, respectively [17]. The S protein (S gene) is one of the most important structural proteins, which the virus uses as a key to enter host cells. The spike protein attaches the virion to the cell membrane by interacting with the host receptor and infects the host cell [18, 19]. In viral replication, the accessory proteins such as ORF3a, ORF6, ORF7a, ORF7b, … The SARS-CoV2 genome shows 79.6% identity with the SARS-CoV1 genome [21]. It is reported that the Spike glycoprotein of the Wuhan coronavirus is modified via homologous recombination [22].
SARS-CoV2 is phylogenetically more closely related to SARS-CoV than to MERS-CoV [23]. Still, the proximal origin of COVID-19 transmission and the evolutionary relationship of SARS-CoV2 with other coronaviruses remain controversial. The outbreak and infectious behaviour of the SARS-CoVs and the lack of effective treatments for CoV infections demand a detailed understanding of coronaviral molecular biology, with a specific focus on both the structural proteins and the accessory proteins. From a molecular biology perspective, figuring out why the virus is so much more virulent and infectious than other CoVs belonging to the genus β-CoVs is one of the most important aspects to look into [24]. The present SARS-CoV2 genomes are classified, based on SNPs, into two broad groups known as L and S [25]. The nature of virulence is also associated with the L type of CoV2 genomes.

Clearly, information on sequential similarity among genes and genomes of various CoVs is not enough to decipher the deep message embedded in the RNA sequence regarding characteristics such as virulence, infection and transmission capacities. So, in order to find out the genomic information of the two types (L and S) of SARS-CoV2, an attempt is made to discover the molecular and spatial organizations of each gene embedded in a sample of four genomes, of which three are from India and one is from China-Wuhan.

In the NCBI virus database, as on 30th April 2020, three complete genomes of SARS-CoV2 from Indian patients are available, viz. MT012098 (India-Kerala), MT050493 (India-Kerala) and MT358637 (India-Gujrat), which we consider for the present study. As a reference genome, NC_045512 (China-Wuhan) is taken. Note that the genomes MT012098, MT050493 and NC_045512 belong to the S-type and the other genome, MT358637 from India, belongs to the L-type as per the classification based on SNPs [25].
Information on the lengths and names (strictly following the NCBI database) of all eleven genes embedded in the four genomes is presented in Table 1, where * denotes absence of the gene in the respective genome. The gene ORF7b is absent in the genomes MT012098 and MT050493, as marked with '*' in Table 1.

From the sequence-based similarity, a phylogenetic relationship is given in Fig. 2, which shows that the genomes NC_045512 from Wuhan and MT012098 from India are very close to each other, with 99.98% sequential similarity, as mentioned in the article by Yadav P.D. et al. [26].

First, each sequence is transformed to a binary sequence of 1s and 0s as per Definition 1. Here purine (A, G) and pyrimidine (T, C) bases are represented as 1 and 0 respectively. This binary representation is named the purine-pyrimidine representation [27, 28, 29, 30, 31].

Each sequence is also transformed to a binary sequence of 1s and 0s with respect to a nucleotide base B as per Definition 2. Hence four binary representations, one for each nucleotide B ∈ {A, T, C, G}, are derived for a given nucleotide sequence. These binary representations are the spatial templates of the nucleotides. Each of these spatial templates is analysed using the methods described in the following.

The Shannon entropy (SE) of a binary sequence is that of a Bernoulli process with probability p of the two outcomes (0/1) [32, 33]. It is defined as

$$SE = -\sum_{i=1}^{2} p_i \log_2(p_i)$$

where $p_1 = k/l$ and $p_2 = (l-k)/l$; here $l$ is the length of the binary sequence and $k$ is the number of 1's in the binary sequence of length $l$. The binary Shannon entropy is a measure of the uncertainty in a binary sequence. If the probability p = 0, the event is certain never to occur, and so there is no uncertainty, leading to an entropy of 0. Similarly, if the probability p = 1, the result is certain, so the entropy must be 0.
When p = 0.5, the uncertainty is at a maximum and consequently the SE is 1.

Shannon entropy is a measure of the amount of information (a measure of uncertainty). Conservation of each of the four nucleotides has been determined using Shannon entropy [34]. For a given RNA sequence, the conservation SE is calculated from $p_{N_i}$, the occurrence frequency of a nucleotide $N_i$ in the RNA sequence.

The Hurst Exponent (HE) is used to interpret the trend of a sequence, which could be positive or negative [35]. The HE belongs to the unit interval (0, 1). If the HE lies within (0, 0.5), the sequence is negatively trending. If the HE belongs to (0.5, 1), the sequence is positively trending. If the HE turns out to be 0.5, the sequence is random. The HE is computed for a sequence $b_n$ of length $n$.

In addition to these two parameters, Shannon entropy and Hurst exponent, some basic derivative features such as nucleotide frequency (Freq), double nucleotide frequency, codon usage frequency, GC content and purine-pyrimidine density are obtained for a given nucleotide sequence [29, 31]. Also, based on nucleotide densities, a decreasing order (density order) is obtained for a given sequence. It is worth noting that the first positive frame has been considered to determine codons and double nucleotides over a given gene.

Important features of all the genes ORF1-10 of the four genomes have been analysed using the above methods and are interpreted in the following sections. For a given gene, we define a feature vector as (length, frequency of individual nucleotides, GC content, % of purines and pyrimidines, frequency of each codon usage, frequency of each double nucleotide, Shannon entropy (SE) and Hurst Exponent (HE) of the spatial representations of each nucleotide and of purine-pyrimidine).
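The binary representations and the binary Shannon entropy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the entropy uses base-2 logarithms (consistent with SE = 1 at p = 0.5), and the Hurst exponent and conservation SE are left out because their formulas are not given in the text.

```python
import math

PURINES = {"A", "G"}  # pyrimidines are T and C


def purine_pyrimidine_template(seq):
    """Purine-pyrimidine representation: purine -> 1, pyrimidine -> 0."""
    return [1 if base in PURINES else 0 for base in seq.upper()]


def nucleotide_template(seq, base):
    """Spatial template of one nucleotide B: 1 where B occurs, else 0."""
    return [1 if b == base else 0 for b in seq.upper()]


def binary_shannon_entropy(bits):
    """SE = -sum(p_i * log2(p_i)) with p_1 = k/l and p_2 = (l - k)/l."""
    l = len(bits)
    if l == 0:
        raise ValueError("empty sequence")
    k = sum(bits)  # number of 1's
    entropy = 0.0
    for p in (k / l, (l - k) / l):
        if p > 0:  # a term with p = 0 contributes nothing
            entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    toy = "ATGGCCTAA"  # toy sequence, not taken from any of the four genomes
    print(purine_pyrimidine_template(toy))
    for b in "ATCG":
        print(b, binary_shannon_entropy(nucleotide_template(toy, b)))
```

Applied to a gene sequence, the five templates (A, T, C, G and purine-pyrimidine) each yield one SE value, which is how the SE entries of the feature vector would be populated under these assumptions.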
Here we briefly state the findings based on the feature vectors derived for every gene embedded in the four genomes, and discuss them accordingly. Before making specific observations about codon and double nucleotide usages, we present Table 2, which describes the molecular information of each gene with associated remarks. In Table 3 …

In Table 4, we present the codon usages of the gene E across all four genomes. In contrast, it is found that the amino acid V is encoded in the primary protein sequence by three different codons, GTA, GTC and GTT, which are used with different frequencies, viz. 3, 1 and 7 respectively. Interestingly, the amino acid Tryptophan (W), which is encoded by TGG only, does not appear in the protein sequence E over the four genomes. Similar to the codon TGG, there are several other codons which also do not appear in the gene E across the genomes.

The frequencies of usage of the double nucleotides in the gene E over the four genomes are given in Table 5. The double nucleotide GG is not used at all in the gene E over the said genomes. It seems there is no bias in the use of double nucleotides, unlike in the case of codon usages.

• From Table 3, it follows that all the spatial representations of the nucleotides A, C, T and G are positively trending in the gene E over the four genomes. Based on the … C. Among the four spatial representations, the most positively trending is that of the nucleotide A, which suggests that purine bases are more positively trending than pyrimidine bases.
• Further, the SE of the spatial representation of the nucleotide T is the highest compared to the other nucleotides.
• Table 3 shows that the nucleotide conservation entropy turns out to be significantly close …

Codon usages: The frequencies of codon usage in the gene M over the four genomes are given in Table 6.
It shows that the start codon ATG and the stop codon TAA occur with frequencies four and one respectively in the gene M over the four genomes. Among the six codons which encode the amino acid L, the codon CTT is used in the gene M with the highest frequency (12) … The double nucleotide TT is used most, with frequency 32. It is worth noting that the double nucleotide GG, unused in the gene E, is used seventeen times in the gene M of the four genomes.

The frequencies of codon usage in the gene N over the four genomes are presented in Table 8. The frequencies are exactly the same in the three genomes NC_045512 (G1), MT012098 (G2) and MT050493 (G3), whereas those in MT358637 (G4) vary slightly. Note that the codons TAG, TGT, TGC and TGA are absent in the gene N across the four genomes. The codons CAA and CAG, which encode the amino acid Glutamine (Q), are used respectively …

The frequencies of usage of the double nucleotides are given in Table 9. Note that the frequencies of the double nucleotides AT and GC in the gene N of the genome MT358637 differ from those in the other three genomes, as shown in Table 9.

Table 9: Double nucleotide frequencies in the gene N (G1-G3: NC_045512, MT012098, MT050493; G4: MT358637)

    Pair  G1-G3  G4      Pair  G1-G3  G4
    AA    77     77      CA    58     58
    AT    34     35      CT    37     37
    AC    51     51      CC    35     35
    AG    43     43      CG    17     17
    TA    17     17      GA    43     43
    TT    40     40      GT    17     17
    TC    35     35      GC    47     46
    TG    45     45      GG    34     34

It is noted that the highest frequency of usage is attained by the double nucleotide AA, unlike in the previous cases. Clearly there is no bias in the choice of double nucleotides in the gene N.

• Table 3 …

Here we list all the codons with their respective frequencies in the gene S over the four genomes NC_045512 (G1), MT012098 (G2), MT050493 (G3) and MT358637 (G4) in Table 10. The … Table 11. It is found that all sixteen double nucleotides are used in the S gene over the four genomes. TT is used with the highest frequency and CG is present with the lowest frequency over the gene S of the four genomes.
It is noticed that TG is the only double nucleotide present in the gene S with equal frequency (166) over the four genomes.

The longest gene, ORF1, which encodes 16 non-structural proteins, has length 21290 across the four genomes. Although the length is fixed among all the genomes, there is a slight variation in the frequencies of the nucleotide bases, as presented in Table 2. In addition, the nucleotide density and AT-GC density of the gene ORF1 across the four genomes are shown in Fig. 3.

Codon usage frequencies in the gene ORF1; the four values in each row are the frequencies in the four genomes:

    Codon  Frequencies         Codon  Frequencies
    ATG    266 266 266 266     CAT    102 102 102 102
    GTT    245 245 246 245     CCT    99  98  99  99
    AAA    243 243 243 242     GAG    93  93  93  93
    TTA    220 220 220 221     CCA    87  86  86  87
    ACA    219 219 219 219     CAC    83  83  83  83
    AAT    218 218 218 218     TTC    78  78  78  77
    TTT    217 218 217 218     AGA    76  76  76  76
    TTG    210 210 210 210     ATC    75  75  75  75
    ACT    201 202 201 201     TAG    72  72  72  72
    ATT    199 198 198 199     ACC    70  70  70  70
    GAA    197 197 197 197     GTC    70  70  70  70
    GCT    186 186 186 186     TGG    68  68  68  68
    GGT    185 185 185 185     CTC    66  66  66  66
    CTT    180 179 180 180     GGA    59  59  59  59
    AAG    171 171 171 172     TAA    58  58  59  58
    TGT    155 155 155 155     TGC    53  53  53  53
    ATA    150 150 150 150     AGG    52  52  52  52
    CAA    149 149 148 149     GCC    51  51  51  51
    TAT    148 148 148 148     GGC    48  48  48  48
    CTA    146 147 147 145     CGT    42  42  42  42
    GTG    145 145 145 145     TGA    42  42  42  42
    GAT    144 144 144 144     TCC    35  35  35  35
    TAC    141 141 141 141     AGC    27  27  26  27
    TCT    141 142 141 141     ACG    24  24  24  24
    GTA    140 140 140 140     GGG    23  23  23  23
    TCA    129 129 129 129     TCG    23  23  23  23
    AAC    122 122 122 122     GCG    22  22  22  22
    CTG    122 122 122 122     CGC    19  19  19  19
    CAG    116 116 116 116     CCC    16  16  16  16
    GCA    111 111 111 111     CCG    12  12  12  12
    AGT    110 110 111 110     CGA    9   9   9   9
    GAC    108 108 108 108     CGG    8   8   8   8

Table 13 gives the frequency of amino acids in the primary protein sequence of ORF1
with a graphical frequency distribution shown in Fig. 6. Table 13 shows that the amino acid Leucine is used most, with frequency 944, over the protein ORF1, whereas Tryptophan is used least in ORF1.

The frequencies of double nucleotides over the gene ORF1 of the four genomes are presented in Table 14, with a bar plot in Fig. 7. It is seen that all sixteen double nucleotides are present over the gene ORF1 across the four genomes. The most frequently used double nucleotide is TT, with frequency 1133 in ORF1 across all the four genomes. It is noteworthy that the frequency of TT in ORF1 over the Indian genome MT358637 is 1135. Not much variance in the frequencies of double nucleotides in ORF1 is observed from one genome to another.

HEs and SEs of spatial representations:
• From Table 3, it is quite clear that the spatial representations are positively trending. Each of the four nucleotide spatial representations has its own positive autocorrelation (trend- …

Codon usages: The frequencies of codon usage in the gene ORF10 are given in Table 15. Due to the absence of the necessary codons, the amino acids Tryptophan and Glutamine are absent in the protein sequence of ORF10. Among the three possible stop codons, only TAG is present in ORF10. Among the four possible codons, only CCG has been chosen to encode the amino acid Proline (P) in the gene ORF10. Thus a codon choice bias is observed in the gene ORF10 across the four genomes.

The usages of double nucleotides over the gene ORF10 are tabulated in Table 16. The double nucleotide CC is entirely absent from the gene ORF10 across the four genomes. It is noted that TA has the highest frequency in the gene ORF10, unlike in the previous cases: in the aforementioned genes, the double nucleotides AA and TT had the highest frequencies in the corresponding gene.
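The codon and double nucleotide counts discussed above can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and it assumes that both codons and double nucleotides are read as non-overlapping words from the first positive frame, as the methods state. That reading is consistent with the Table 9 counts for the gene N, which sum to 630 in each column.

```python
from collections import Counter


def count_words(seq, size):
    """Count non-overlapping words of `size` bases in the first frame.

    Trailing bases that do not fill a complete word are ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % size
    return Counter(seq[i:i + size] for i in range(0, usable, size))


def codon_usage(seq):
    """Codon frequencies (words of three bases)."""
    return count_words(seq, 3)


def dinucleotide_usage(seq):
    """Double nucleotide frequencies (words of two bases)."""
    return count_words(seq, 2)


def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)


if __name__ == "__main__":
    toy = "ATGGCCTAA"  # toy sequence, not taken from any of the four genomes
    print(codon_usage(toy).most_common())
    print(dinucleotide_usage(toy).most_common())
    print(round(gc_content(toy), 2))
```

A missing codon or double nucleotide simply has count zero in the returned `Counter`, so absences such as CC in ORF10 can be checked with `dinucleotide_usage(seq)["CC"] == 0`.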
• From the HEs of the gene ORF10 across the four genomes, it is observed that the spatial representations turn out to be positively auto-correlated. The spatial binary representation of the nucleotide A is the most auto-correlated representation, whereas the least auto-correlated spatial representation is that of the nucleotide G.
• As in the case of the other genes, the uncertainty of the presence and absence of pyrimidine bases is at its maximum in ORF10 across the four genomes.

Codon usages: The frequencies of codon usage over the gene ORF3a across the four genomes are presented in Table 17. Note that the codons TCG, CCC, GCG, TAG, TGA, CGA, CGG and GGG do not appear in the gene ORF3a.

Table 17: Codon usage frequencies in the gene ORF3a across the four genomes

    ATG 4     TCT 3     TAT 8     TGT 3
    TTT 8     TCC 4     TAC 9     TGC 4
    TTC 6     TCA 8     TAA 1     TGA 0
    TTA 3     TCG 0     TAG 0     TGG 6
    TTG 9     CCT 7     CAT 4     CGT 1
    CTT 10    CCC 0     CAC 4     CGC 1
    CTC 5     CCA 3     CAA 5     CGA 0
    CTA 1     CCG 2     CAG 4     CGG 0
    CTG 2     ACT 13    AAT 4     AGT 5
    ATT 9     ACC 2     AAC 4     AGC 2
    ATC 5     ACA 6     AAA 7     AGA 3
    ATA 7     ACG 3     AAG 4     AGG 1
    GTT 14    GCT 7     GAT 7     GGT 7
    GTC 3     GCC 3     GAC 6     GGC 3
    GTA 7     GCA 3     GAA 10    GGA 4
    GTG 1     GCG 0     GAG 1     GGG 0

The stop codons TAG and TGA are not used at all in the gene ORF3a over the four genomes. All the codons necessary for encoding the twenty amino acids are present in the gene ORF3a.

All sixteen double nucleotides are present in the gene ORF3a across the four genomes, as shown in Table 18. The highest frequency is attained by the double nucleotide TT in ORF3a over the four genomes. So there is no choice bias of double nucleotides in ORF3a.

• The spatial representations of A, T, C and G as well as of the purine-pyrimidine bases are found to be positively auto-correlated.
• The SEs of the binary representations of the nucleotides A, C, T and G over the gene ORF3a are invariant across the genomes, as found in Table 3.
The SE of the binary spatial representation of the purine and pyrimidine bases of the gene ORF3a across the four genomes is 0.99454, which is very close to 1 and represents maximum uncertainty.
• From the nucleotide conservation SEs of ORF3a as found in the …

Codon usages: The frequency of each codon in ORF6 remains invariant across the genomes, as observed from Table 19.

Table 19: Codon usage frequencies in the gene ORF6 across the four genomes (only the following entries are available)

    ATG 3     TCT 2     TAT 1     TGT 0
    TTT 3     TCC 1     TAC 1     TGC 0
    TTC 0     TCA 1     TAA 1     TGA 0
    TTA 3     TCG 0     TAG 0     TGG 1
    TTG 0     …

The … Table 20. The double nucleotides TA and GG attain the highest and lowest frequency respectively in the gene ORF6 across the four genomes.

• Clearly, from the HEs in Table 3, all five spatial representations (four nucleotides and purine-pyrimidine) are positively auto-correlated/trending in the gene ORF6 across the four genomes.
• From Table 3, it is seen that the SEs of the representations of A and T are almost the same, and hence it is concluded that the spatial representations of A (purine base) and T (pyrimidine base) are similar. The SE of the spatial representation of purine and pyrimidine bases in ORF6 is again the same over the four genomes, namely 0.99992, which is very near to one.
• The SE of the conservation of nucleotides over ORF6 is identical across the three genomes NC_045512, MT012098 and MT050493, with the value 0.97703, whereas that in the genome MT358637 is 0.98152.

Codon usages: The codon usages over the gene ORF7a across the four genomes are tabulated in Table 21. It is observed that the frequencies of codon usage are invariant over the four genomes.

Double nucleotide usages: Table 22 shows that all the double nucleotides are present with frequency at least 2. The highest frequency, 24, is attained by the double nucleotide TT in ORF7a. The double nucleotides CC and CG both have the least frequency, which is 2.
• The spatial representation of A is found to be the most positively trending, since its HE is the maximum among all. The HE of the purine-pyrimidine arrangement is 0.60385, which indicates positive auto-correlation in the representation.
• From the SEs of the nucleotide bases in ORF7a as mentioned in Table 3, it is observed that the amount of uncertainty of presence of the nucleotide G over its binary representations … MT012098, MT050493, as the SEs in these two cases are found to be identical. As previously seen, in the genome MT358637 the nucleotide conservation in ORF7a is found to be more uncertain, as the SE is close to 1.

Codon usages: The codon frequencies in the gene ORF8 across the four genomes are given in Table 23. … turned out to be 24. It is found that the double nucleotide CA is present twelve times in the gene ORF8 over the genome MT050493, whereas in each of the other three genomes the gene ORF8 contains CA only eleven times. It is also observed that TA is present seventeen times in ORF8 over the genome MT050493, whereas in the rest of the genomes it is present eighteen times in the gene ORF8.

• The highest amount of autocorrelation is observed in the spatial representation of T, as found from Table 3 … to be the same, namely 0.99515.
• From the … The amino acids Lysine, Arginine, Glycine and Proline are absent from the protein sequence of … Table 26 shows that only the double nucleotides CG and GG are absent from the gene sequence of ORF7b across the two genomes.

• The spatial representations are all positively trending, as the HEs are found to be greater than 0.5.
• The SEs of the spatial representations of the nucleotides A, C, T and G in ORF7b are found to be the same in the two genomes; they are 0.78637, 0.68404, 0.99403 and 0.55410 respectively.
The SE of the binary representation of the purine-pyrimidine bases is also the same, namely 0.94566, which is significantly lower than in the other genes.
• The nucleotide conservation SEs of the gene ORF7b in the two genomes NC_045512 and MT358637 are found to be non-identical: 0.91796 and 0.98152 respectively. The uncertainty of nucleotide conservation over the gene ORF7b is higher in the genome MT358637, which implies that the nucleotides in ORF7b of the genome NC_045512 are conserved more than those of the other.

… are identical, as shown in Fig. 8. Fig. 8 shows two phylogenetic trees: one for the four genes E, ORF3a, ORF6 and ORF7a, and the other for the two genes M and N. Interestingly, the phylogenetic trees for ORF1, ORF8, ORF10 and S are different from each other (Fig. 9).

The phylogenetic relationship (Fig. 3) … other branch of the binary tree. Here again, the sequence-similarity based phylogeny (Fig. 2) is not linearly reflected when the phylogeny is derived by accounting for the spatial and molecular organizations of the genes M and N.

Distinct phylogenetic relationships are developed in Fig. 9 from the features of the genes ORF1, ORF10, ORF8 and S. From the phylogeny based on the features of the gene S, as shown in Fig. 9, it is found that the genomes NC_045512 and MT050493 are closest to each other, as they belong to the same level of the phylogeny. These two genomes are close to the genome MT358637. The genome MT012098 is only distantly related to the other genomes, since it belongs to a branch at the primary binary level of the phylogeny. This phylogeny discriminates the genome MT012098 from NC_045512 according to the spatial and molecular organization of the gene S, although they are sequentially very close to each other, as mentioned previously.

The gene ORF1 discriminates the genome MT050493 from the others, as depicted in the phylogeny based on the features of the gene ORF1.
Further, it discloses the closeness among the three other genomes NC_045512, MT012098 and MT358637. The genomes NC_045512 and MT358637 are the closest, as shown in Fig. 9 (up-right). The genome MT012098 belongs to the other branch at the same level of the phylogenetic tree.

As per the phylogenetic relationship based on the features of ORF10, it is found that the genomes MT012098 and MT050493 are closest to each other, as they belong to binary branches at the same level. The genome MT358637 is then close to them at the upper level of the binary tree of the phylogeny. The genome NC_045512 is distantly related to the cluster of the three other genomes. It is worth noting that the SEs of conservation of nucleotides in the gene ORF10 are the determining features of closeness among the genomes.

From the phylogenetic relationship among the four genomes based on the features extracted for the gene ORF8, it is found that the three genomes NC_045512, MT012098 and MT358637 are close to each other, as they all belong to a single branch of the binary phylogenetic tree, whereas the genome MT050493 belongs to the other branch. It is observed that the spatial arrangements of the nucleotide bases C and T in the gene ORF8 make the genome MT050493 different from the other three.

It is observed from the phylogeny in Fig. 2 … ORF7a. These two genomes are then close to NC_045512, since they fall under the same binary branches. These three genomes are distantly related to MT358637. It is worth noting that the phylogeny based on sequential similarities among the four genomes does not simply agree with the phylogeny based on the spatial features.

Differences in the phylogenetic tree arrangement among these genes suggest that the three genomes of Indian origin either came from three different origins or that evolution of the viral genome is a very fast process. Irrespective of the evolution, a subtle bias in codon usage remains.

… ORF7b.
• The GC content of these genes varies widely over the closed interval [27.95, 47.20]. Note that the genes ORF6 and N have the lowest and highest GC content respectively across all the SARS-CoV2 genomes. One may note that the GC content (47.2%) of N is significantly high, which may distinctly characterize it from the other structural proteins E (38.15%), M (42.6%) and S (37.3%).
• The distribution of purine and pyrimidine bases over each gene across the four genomes is found to be highly uncertain. That is, the purine and pyrimidine bases are equally likely to appear in the sequences, although these purine-pyrimidine spatial organizations are positively trending.
• From Table 3, it is to be pointed out that a consistently higher nucleotide conservation SE (0.98152) over all the genes is observed in the genome MT358637 (India-Gujrat).
• Though the three Indian genomes show 99% identity with the China-Wuhan genome, the phylogenetic trees differ with respect to individual genes. This suggests that recombination has more impact than mutation in evolving SARS-CoV2 on a geocentric basis in India. Uncertainty in the distribution of purine and pyrimidine bases over each gene across the four genomes further suggests that the possibility of recombination within the coronavirus genus exists for the evolution of SARS-CoV2 in India, which may have different virulence behaviour.

SH and PPC conceived the problem. AM and RKR coded and produced the results. SH, PPC, SSJ and PP analysed the data, and PPC, PP and SSJ supervised the project. SH wrote the initial draft, which was checked and edited by all other authors to generate the final version.
[1] Identification of a new human coronavirus
[2] COVID-19: a new challenge for human beings
[3] Genotype and phenotype of COVID-19: their roles in pathogenesis
[4] SARS-associated coronavirus
[5] Ecology, evolution and classification of bat coronaviruses in the aftermath of SARS
[6] Animal-to-human SARS-associated coronavirus transmission?
[7] A genomic perspective on the origin and emergence of SARS-CoV-2
[8] Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine
[9] Genomic characterization of a novel SARS-CoV-2
[10] COVID-2019: the role of the nsp2 and nsp3 in its pathogenesis
[11] The outbreak of SARS-CoV-2 pneumonia calls for viral vaccines
[12] Analysis of SARS-CoV E protein ion channel activity by tuning the protein and lipid charge
[13] The SARS-CoV nucleocapsid protein: a protein with multifarious activities
[14] The epitope study on the SARS-CoV nucleocapsid protein
[15] The membrane protein of SARS-CoV suppresses NF-κB activation
[16] The M protein of SARS-CoV: basic structural and immunological properties
[17] Coronavirus disease 2019 (COVID-19): current status and future perspective
[18] The spike protein of SARS-CoV: a target for vaccine and therapeutic development
[19] MERS-CoV spike protein: a key target for antivirals
[20] The structural and accessory proteins M, ORF 4a, ORF 4b, and ORF 5 of Middle East respiratory syndrome coronavirus (MERS-CoV) are potent interferon antagonists
[21] A new coronavirus associated with human respiratory disease in China
[22] COVID-19 infection: origin, transmission, and characteristics of human coronaviruses
[23] Phylogenetic analysis and structural modeling of SARS-CoV-2 spike protein reveals an evolutionary distinct and proteolytically-sensitive activation loop
[24] The molecular biology of coronaviruses
[25] On the origin and continuing evolution of SARS-CoV-2
[26] Full-genome sequences of the first two SARS-CoV-2 viruses from India
[27] Fractals and hidden symmetries in DNA
[28] Quantitative description of genomic evolution of olfactory receptors
[29] A quantitative genomic view of the coronaviruses: SARS-CoV2
[30] Spatial distribution of amino acids of the SARS-CoV2 proteins
[31] Analysis of purines and pyrimidines distribution over miRNAs of human, gorilla, chimpanzee, mouse and rat
[32] On the entropy of a hidden Markov process
[33] Decomposition of DNA sequence complexity
[34] Relative von Neumann entropy for evaluating amino acid conservation
[35] Time-dependent Hurst exponent in financial time series

Authors acknowledge the financial support from IACS.

The preferred stop codon in ORF8 is TAA across the four genomes, and it is used only once. It is worth mentioning that the gene ORF7a, of the same length (366 bases), uses only the stop codon TGA. Interestingly, all twenty amino acids are present in the ORF8 protein. A list of the frequencies of double nucleotides over the gene ORF8 across the four genomes is given in Table 24. It is found that all sixteen double nucleotides are used in ORF8 across the four genomes. The highest frequency of TT in the gene ORF8 is …

The authors do not have any conflicts of interest to declare.