key: cord-0162648-7pl7k7k2 authors: Laha, S. K. title: A Comparative Genomic Analysis of Coronavirus Families Using Chaos Game Representation and Fisher-Shannon Complexity date: 2021-07-13 journal: nan DOI: nan sha: 068a603af3c81904a26f63d236b7a07fa0cc72d4 doc_id: 162648 cord_uid: 7pl7k7k2

Since its first emergence in Wuhan, China, in December 2019, the COVID-19 pandemic has caused an unprecedented health crisis throughout the world. The novel coronavirus disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which belongs to the Coronaviridae family. In this paper, a comparative genomic analysis of eight coronaviruses, namely Human coronavirus OC43 (HCoV-OC43), Human coronavirus HKU1 (HCoV-HKU1), Human coronavirus 229E (HCoV-229E), Human coronavirus NL63 (HCoV-NL63), Severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome-related coronavirus (MERS-CoV), Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and Bat coronavirus RaTG13, has been carried out using the Chaos Game Representation and Fisher-Shannon Complexity (CGR-FSC) measure. Chaos Game Representation (CGR) is an alignment-free method to visualize a one-dimensional DNA sequence as a two-dimensional fractal-like pattern. The two-dimensional CGR pattern is then quantified by the Fisher-Shannon Complexity (FSC) measure. The CGR-FSC measure can uniquely identify the viruses, and their similarity/dissimilarity can be revealed in the Fisher-Shannon Information Plane (FSIP).

data in geosciences [31], the evolution of the daily maximum surface temperature distributions [32] and the time series data of Standardized Precipitation Index (SPI) [33].
In this paper, the Fisher-Shannon information approach has been applied for a comparative genomic analysis of eight coronaviruses, namely Human coronavirus OC43 (HCoV-OC43), Human coronavirus HKU1 (HCoV-HKU1), Human coronavirus 229E (HCoV-229E), Human coronavirus NL63 (HCoV-NL63), Severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome-related coronavirus (MERS-CoV), Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and Bat coronavirus RaTG13. Of these viruses, Bat coronavirus RaTG13 infects the horseshoe bat, Rhinolophus affinis, whereas the remaining seven viruses are known to infect humans.

Chaos Game Representation (CGR) is a process through which one-dimensional sequences are mapped into a two-dimensional space by an iterated function system (IFS). CGR, an alignment-free method, was proposed by Jeffrey [8] to visualize DNA sequences. The resulting two-dimensional representation is a scatter plot in which fractal-like patterns can be observed that are difficult to see in the original 1D sequences such as DNA, RNA, or protein sequences. The 2D planar CGR space is a continuous unit square whose four vertices are assigned to the four nucleotides, i.e., Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). The coordinates of these four nucleotides are A = (0, 0); T = (1, 0); G = (1, 1) and C = (0, 1). In this Cartesian plane, a nucleotide sequence of any length can be uniquely determined. The CGR coordinates are determined iteratively by the following process: starting from the point (0.5, 0.5), the first nucleotide is placed halfway between the starting point and the vertex corresponding to that nucleotide. Each successive nucleotide is then plotted halfway between the previous nucleotide position and the vertex representing that nucleotide. For a DNA sequence, the equation for the above iterated function system (IFS) is given by

$$X_i = \tfrac{1}{2}\left(X_{i-1} + g_{ix}\right), \qquad Y_i = \tfrac{1}{2}\left(Y_{i-1} + g_{iy}\right), \qquad (X_0, Y_0) = (0.5, 0.5),$$

where $(g_{ix}, g_{iy})$ denotes the vertex assigned to the $i$-th nucleotide.

It should be noted that the Fisher information measure (FIM) is conceptually different from Fisher information, which is the information content of the random variable X about its distribution parameters. Both the FIM and the Shannon entropy power (SEP) have been applied to study signal complexity.
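As an illustration of the CGR iteration described above, the following is a minimal Python sketch, not the author's implementation. The vertex assignment and the starting point (0.5, 0.5) come from the text; the function names `cgr_coordinates` and `cgr_walk` are hypothetical, skipping ambiguous symbols such as N is an assumption, and the CGR-walk is interpreted here as the pointwise sum of the X and Y coordinates.

```python
# Vertex assignment taken from the text: A=(0,0), T=(1,0), G=(1,1), C=(0,1).
VERTICES = {"A": (0.0, 0.0), "T": (1.0, 0.0), "G": (1.0, 1.0), "C": (0.0, 1.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points (x, y), one per recognised nucleotide."""
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in VERTICES:
            continue  # assumption: ambiguous symbols (e.g. N) are skipped
        vx, vy = VERTICES[base]
        # Move halfway from the previous point towards the nucleotide's vertex.
        x, y = 0.5 * (x + vx), 0.5 * (y + vy)
        points.append((x, y))
    return points


def cgr_walk(points):
    """One-dimensional series x_i + y_i (assumed reading of the CGR-walk)."""
    return [x + y for x, y in points]


points = cgr_coordinates("ATGC")
xs = [p[0] for p in points]   # CGR-X series
ys = [p[1] for p in points]   # CGR-Y series
walk = cgr_walk(points)       # CGR-walk series
```

The X series, the Y series and the walk are each a one-dimensional sequence, so each can be passed separately to a complexity estimator.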
The Fisher-Shannon Complexity (FSC) is defined as the product of FIM and SEP:

$$C_X = \mathrm{FIM} \cdot \mathrm{SEP}.$$

It can be shown that $C_X \geq 1$, where the equality holds if and only if X has a Gaussian distribution [29].

The whole-genome reference sequences of the viruses in the present study were downloaded from NCBI GenBank. The downloaded genome sequences are in FASTA format, and their accession IDs are given in the following Table.

The phylogenetic tree of the eight coronaviruses considered in the present study, namely HCoV-OC43, HCoV-HKU1, HCoV-229E, HCoV-NL63, SARS-CoV, MERS-CoV, SARS-CoV-2 and RaTG13, is shown in Fig. 1. From the phylogenetic tree in Fig. 1 it is evident that SARS-CoV-2 is most similar to RaTG13, followed by SARS-CoV. The CGR maps of the eight coronaviruses are shown in Fig. 2. Although fractal patterns in the whole-genome sequences are discernible in these figures, the maps look very similar, and it is difficult to compare the genomic sequences by visual inspection alone. Therefore, it is important to measure their complexity numerically for a comparative analysis. Thus, as

These values can then be used for a comprehensive similarity/dissimilarity analysis of the whole-genome sequences. The Fisher-Shannon Information Planes for the X and Y coordinates of the above-mentioned viruses are shown in Figs. 3 and 4, respectively. From the FSIP along the X coordinates (Fig. 3) it can be seen that HCoV-OC43, RaTG13 and SARS-CoV-2 form a close cluster. Thus, it may be inferred that these viruses are very similar in their CGR-X coordinates. The horizontal stripes, i.e., the X-coordinates, give the dinucleotide similarities of AC, CA, GT and TG.

Finally, we also obtain the FSIP of the CGR-walk, defined earlier as the summation of the X and Y coordinates, which results in a one-dimensional time series. This FSIP is shown in Fig. 6, and it can again be seen that SARS-CoV-2, SARS-CoV and RaTG13 lie close together, which indicates that they are genetically very similar.
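The following minimal Python sketch shows one way to compute a point in the Fisher-Shannon Information Plane from a one-dimensional series such as the CGR-X, CGR-Y or CGR-walk values. It is not necessarily the estimator used in this study. It assumes the standard definitions SEP = exp(2H)/(2πe), with H the differential entropy, and FIM = ∫ f'(x)²/f(x) dx, which do not survive in the extracted text. It also assumes a Gaussian kernel density estimate with a rule-of-thumb bandwidth. The function name `fisher_shannon` and its parameters are hypothetical.

```python
import math
import random


def fisher_shannon(samples, grid_size=400):
    """Estimate (SEP, FIM, FSC) of a 1D sample via a Gaussian kernel density.

    Assumptions: Gaussian kernel, rule-of-thumb bandwidth, Riemann-sum
    integration on a uniform grid. This is an illustrative sketch only.
    """
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    h = 1.06 * std * n ** (-0.2)          # rule-of-thumb bandwidth (assumption)
    lo, hi = min(samples) - 4 * h, max(samples) + 4 * h
    dx = (hi - lo) / (grid_size - 1)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    entropy = 0.0   # differential entropy H = -integral f log f
    fim = 0.0       # Fisher information measure = integral (f')^2 / f
    for k in range(grid_size):
        x = lo + k * dx
        f = 0.0
        df = 0.0
        for s in samples:
            u = (x - s) / h
            w = math.exp(-0.5 * u * u)
            f += w
            df -= u * w / h               # derivative of the kernel w.r.t. x
        f *= norm
        df *= norm
        if f > 1e-300:                    # skip grid points with no density
            entropy -= f * math.log(f) * dx
            fim += df * df / f * dx

    sep = math.exp(2 * entropy) / (2 * math.pi * math.e)
    return sep, fim, sep * fim


# Usage on synthetic Gaussian data: the FSC should lie near its lower bound of 1.
random.seed(0)
gaussian_sample = [random.gauss(0.0, 1.0) for _ in range(400)]
sep, fim, fsc = fisher_shannon(gaussian_sample)
```

Plotting the (SEP, FIM) pair of each virus gives one point per genome in the information plane. Because the FSC is unchanged by rescaling the data, it compares the shapes of the distributions rather than their spread.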
Thus, from Figs. 3-6 it can be concluded that SARS-CoV-2 most closely resembles the bat coronavirus RaTG13, followed by SARS-CoV. This is consistent with the earlier reported result that the SARS-CoV-2 genome bears 79.6% sequence similarity to SARS-CoV and 96% similarity to the bat coronavirus [34].

In this paper, a comparative genome analysis of eight coronaviruses, namely HCoV-OC43, HCoV-HKU1, HCoV-229E, HCoV-NL63, SARS-CoV, MERS-CoV, SARS-CoV-2 and RaTG13, is carried out using Chaos Game Representation and Fisher-Shannon Complexity.

Chaos game representation of gene structure
Predicting DNA duplex stability from the base sequence
Evolution of long-range fractal correlations and 1/f noise in DNA base sequences
Long-range correlations in nucleotide sequences
H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences
Z curves, an intutive tool for visualizing and analyzing the DNA sequences
Non-standard bioinformatics characterization of SARS-CoV-2
Graphical and numerical representations of DNA sequences: statistical aspects of similarity
Survey on encoding schemes for genomic data representation and feature learning-from signal processing to machine learning
Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison
Chaos game representation of proteins
Chaos game representation of protein structures
Similarity Studies of Corona Viruses through Chaos Game Representation
Chaos game representation dataset of SARS-CoV-2 genome
Analysis of similarity/dissimilarity of DNA sequences based on chaos game representation
The fractal geometry of nature. WH Freeman
Elements of information theory
The general problem of the stability of motion
Theory of statistical estimation
Fisher information, disorder, and the equilibrium distributions of physics
Fisher information and nonlinear dynamics
Information theoretic inequalities
Analysis of signals in the Fisher-Shannon information plane
Advanced analysis of temporal data using Fisher-Shannon information: theoretical development and application in geosciences
Spatio-temporal evolution of global surface temperature distributions
Fisher Shannon analysis of drought/wetness episodes along a rainfall gradient in Northeast Brazil
A pneumonia outbreak associated with a new coronavirus of probable bat origin
SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis
R: A language and environment for statistical computing. R Foundation for Statistical Computing
Deep learning on chaos game representation for proteins. Bioinformatics