key: cord-0756986-psh5jevz authors: Machado, J. A. Tenreiro; Rocha-Neves, João M.; Andrade, José P. title: Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov’s complexity and Shannon’s information theories date: 2020-07-04 journal: Nonlinear Dyn DOI: 10.1007/s11071-020-05771-8 sha: 2701fe2b0a37403283695341d00563523cdb33eb doc_id: 756986 cord_uid: psh5jevz This paper tackles the information of 133 RNA viruses available in public databases under the light of several mathematical and computational tools. First, the formal concepts of distance metrics, Kolmogorov complexity and Shannon information are recalled. Second, the computational tools available presently for tackling and visualizing patterns embedded in datasets, such as the hierarchical clustering and the multidimensional scaling, are discussed. The synergies of the common application of the mathematical and computational resources are then used for exploring the RNA data, cross-evaluating the normalized compression distance, entropy and Jensen–Shannon divergence, versus representations in two and three dimensions. The results of these different perspectives give extra light in what concerns the relations between the distinct RNA viruses. In December 2019, a mysterious pneumonia with unknown etiology was reported in the city of Wuhan, Province of Hubei, China [1] . The International Committee on Taxonomy of Virus (ICTV) named the virus as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2] . Globally, to the 30th of July 2020, according to the World Health Organization Coronavirus Disease 2019 (COVID-19) Situation Report, 10,185, 374 confirmed cases of COVID-19 are reported, resulting in 503,862 deaths. The scientific community reacted as never before, and many researchers focused on this urgent topic [3] [4] [5] [6] [7] [8] [9] [10] . The mathematical and computer science communities are also studying this challenging problem, and we testimony the recent emergence of new models and algorithmic approaches. This common multidisciplinary research will allow the human kind to implement a robust and fast response to a problem that is causing a severe global socioeconomic disruption, and most probably will lead to the largest global recession since the Great Depression. The present paper follows this trend by studying the genetic information by means of mathematical and computational tools. The starting point is the information encoded in the ribonucleic acid (RNA), a polymeric molecule essential in various biological roles. The evolutionary origin and divergence of eucaryotes is mostly recoverable from their genetic relationships. The phylogeny of core genes, such as those for ribosomal proteins, provides a reasonable representation of many billions of years [11] . Unfortunately, the diversity of viruses prevents such a reconstruction of virus evolutionary histories as they lack any equivalent set of universally conserved genes on which to construct a phylogeny. Viral diversity is far greater than that of other organisms, with significant differences in their genetic material, RNA or deoxyribonucleic acid (DNA), and configurations (double or single stranded), as well as the orientation of their encoded gene. The smallest virus genomes [12] contain over 2,500 genes [13] . The RNA is a nucleic acid that is essential to all forms of life, and it is found in nature generally as single-stranded (ss) rather than a paired double-stranded (ds) like DNA. In DNA, there are four bases: adenine (A) complementary to thymine (T) and cytosine (C) complementary to guanine (G). In the RNA, uracil (U) is used instead of thymine. Like DNA, RNA can carry genetic information. An RNA virus is a virus that has RNA as its genetic material encoding a number of proteins. This nucleic acid is usually single-stranded RNA (ssRNA), but there are double-stranded RNA (dsRNA) viruses. These viruses exploit the presence of RNA-dependent RNA polymerases in eucaryotes cells for replication of their genomes. Many human diseases are caused by such RNA viruses such as influenza, severe acute respiratory syndrome (SARS), COVID-19, Ebola disease virus, chikungunya, Zika virus, influenza B and Lassa virus [14] . The RNA is usually sequenced indirectly by copying it into complementary DNA (cDNA), which is often amplified and subsequently analyzed using a variety of DNA sequencing methods. Therefore, the genomic sequences of the RNA viruses are published presenting the four bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T). This genetic information is analyzed by means of the Kolmogorov's complexity and Shannon's information theories. In the first case, the so-called normalized information distance (NID) is adopted. In the second, a statistical approach is considered, by constructing histograms for the relative frequency of three consecutive bases (triplets). The histograms are interpreted under the light of entropy, cumulative residual entropy and Jensen-Shannon divergence. The results obtained for each theory, that is, the values assessing the virus genetic code under the light of the Kolmogorov's complexity and the Shannon's information, are further processed by means of advanced computational representation techniques. The final visualization is obtained using the hierarchical clustering (HC) [15] [16] [17] [18] [19] [20] and multidimensional scaling (MDS) techniques [21] [22] [23] [24] [25] [26] [27] . Three alternative representations, namely dendrograms, hierarchical trees and three-dimensional loci, are considered. According to the ICTV, human coronavirus belongs to the Betacoronavirus genus, a member of the Coronaviridae family, categorized in the order Nidovirales [28] . It has been categorized into several genera, based on phylogenetic analyses and antigenic criteria, namely: (i) Alphacoronavirus, responsible for gastrointestinal disorders in humans, dogs, pigs and cats; (ii) Betacoronavirus, including the bat coronavirus, the human severe acute respiratory syndrome (SARS) virus, the Middle Eastern respiratory syndrome (MERS) virus and now the SARS-CoV-2; (iii) Gammacoronavirus, which infects avian species; and iv) Deltacoronavirus [2, 29] . Four coronaviruses broadly distributed among humans (229E, OC43, NL63 and, HKU1) frequently cause only common cold symptoms [2, 12] . The two other strains of coronaviruses are linked with deadly diseases and zoonotic in origin, i.e., the SARS-CoV and MERS-CoV [30] . In 2002-2003, there was an outbreak of SARS beginning in Guangdong Province in China and affecting 27 countries subsequently [31] . It was considered the first pandemic event of the twenty-first century, due to the SARS-CoV infecting 8098 individuals with 774 deaths [31] . In 2012, in the Middle East, MERS-CoV caused a severe respiratory disease that affected 2494 individuals and 858 deaths [2] . In both epidemics, bats were the original host of these two coronaviruses [2] . Coronaviruses contain a positive-sense, singlestranded RNA genome. The genome size for coronaviruses ranges from 26.4 to 31.7 kilobases, and it is one of the largest among RNA viruses. The rapid spread of SARS-CoV-2 raises intriguing questions such as whether its evolution is driven by mutations, recombination or genetic variation [32] . This information is now being applied to the development of diagnostic tools, effective antiviral therapies and in the understanding of viral diseases. Although numerous studies have been done from the biological and medical perspective, to help further understand SARS-CoV-2 and trace its origin, this paper reports the use of multidimensional scale techniques for the finding of the similarities and the relationships among COVID-19 strains themselves and between other described viruses. For that, we have collected a set of 37 complete genome sequences of SARS-CoV-2 virus obtained in several countries from patients with COVID-19. To help verify whether it is possible to trace the original or intermediate host of SARS-CoV-2, we obtained the genomic sequences of SARS-CoV-2 virus in other hosts, including bats (16 genomic sequences), pangolins (8 genomic sequences) and environment (market of Wuhan) [13] . We have also collected 23 genomic sequences of the coronavirus that cause mild symptoms related to common cold in man (229E, OC43, NL63 and, HKU1). The genomic sequences of SARS-CoV (10 genomic sequences) and a MERS-CoV (13 genomic sequences) were also gathered. For comparison and control, we also obtained sequences of other deadly pathogenic RNA viruses such as Lassa (6 genomic sequences), Ebola (6 genomic sequences), dengue (7 genomic sequences), chikungunya (1 genomic sequence) and influenza B (2 genomic sequences). The diagram of Fig. 1 summarizes the main historical flow of coronavirus in what concerns the twenty-first-century epidemics. To the authors' best knowledge, this paper analyzes for the first time a large number of viruses associated with the combination of several distinct mathematical and computational tools. In short, we have the crosscomparison of: -The genomic sequences of 133 viruses. The results indicate clearly the superior performance of the approaches based on the Kolmogorov's complexity measure and the MDS three-dimensional visualization. Moreover, the characteristics of the coronaviruses within the large set of tested cases are highlighted. We conclude that: -The association between the Kolmogorov perspective and the three-dimensional MDS representation leads to be the best results. -The clusters are easily distinguishable and we observe the relation between the new SARS-CoV-2 virus and some CoV found in bats and in the pangolin. Bearing these ideas in mind, the paper is organized as follows. Section 2 introduces the mathematical and computational tools adopted in the follow-up. Section 3 analyzes the data by means of the complexity and information theories accompanied by the HC and MDS computational visualization resources. Finally, Sect. 4 summarizes the main conclusions. 2 Fundamental tools Evaluating the similarity degree between several objects, each having a number of features requires the definition of a distance. The calculation of a function d of two objects x A and x B can be interpreted as a distance if dðx A ; x B Þ ! 0 and satisfies the following axioms [33] : We find in the literature a variety of different functions that can tackle datasets and shed light to distinct characteristics [34] . For a set of numerical vectors x 1 ; . . .; x N ð Þ T 2 R n , the Minkowski norm L n : P N k¼1 x k j j n À Á 1 n , and in particular the Manhattan and Euclidean cases, L 1 and L 2 for n ¼ 1 and n ¼ 2, are often used [35] . In the case of DNA analysis, these norms support different algorithms [36] [37] [38] that allow the comparison of data sequences. We can also mention metrics such as the Chi-square [39] , Hamming [40] and edit [41] distances. Nonetheless, the selection of the optimal distances for a specific application poses relevant challenges [42] [43] [44] [45] . In fact, the adoption of a given metric often depends on the user experience that performs several numerical experiments before selecting one or more distances. Due to these issues, assessing the similarity of several objects is not a straightforward process and we can adopt both non-probabilistic and probabilistic information measures to obtain distinct perspectives between object similarities. The Kolmogorov complexity, or algorithmic entropy, addresses the information measurement without relying on probabilistic assumptions. The information measurement focuses an individual finite object described by a string and takes into consideration that 'regular' strings are compressible [46] . The Kolmogorov complexity of a string x, denoted as K(x), can be defined as the length of the shortest binary program that, given an empty string w at its input, can compute x on its output in a universal computer, and then halts. Therefore, K(x) can be interpreted as the length of the ultimate compressed version of x. The information distance of two strings (or files) fx A ; x B g 2 R, where R denotes the space of the objects, can also be computed by means of the conditional Kolmogorov complexity Kðx A jx B Þ [47, 48] . This concept can be read as the length of the shortest program to obtain x A , if x B is provided for input. In heuristic terms, if the two strings are more/less similar, then the calculation is less/more difficult and, consequently, the size of the program is smaller/larger. Therefore, the following inequality always holds Under the light of these concepts, the universal distance metric [47] denoted as normalized information distance (NID): ð3Þ was proposed. From equation (2), we have that the NID may take values in the range [0, 1]. Moreover, we have NIDðx A ; x A Þ % 0 and NIDðx A ; wÞ % 1, where w is an empty object that has no similarity to x A . It is shown [47] that the NID is a distance because it satisfies the axioms defined in (1), up to some additive precision, but is non-computable [47] . In spite of this limitation, the NID gives the basis for the so-called normalized compression distance (NCD), which is a computable distance [45] . The computability comes with the cost of using a good approximation of the Kolmogorov complexity by a standard compressor CðÁÞ (interested readers for the discussion between the equivalence of the NID and the NCD can find further details in [33] ). The NCD is given by: The NCD has values in the range 0\NCDðx; yÞ\1 þ , assessing the distance between the files x A and x B . The parameter [ 0 reflects 'imperfections' in the compression algorithm. The values of Cðx A Þ and Cðx B Þ are the sizes of each of the compressed files x A and x B , respectively, and Cðx A x B Þ is the compressed size of the two concatenated files considered by the compressor as a single file. Let us consider, for example, that Cðx B Þ ! Cðx A Þ. Expression (4) says that the distance NCDðx A ; x B Þ assesses the improvement due to compressing x B using data from the previously compressed x A and compressing x B from scratch, expressed as the ratio between their compressed sizes. Obviously, the approximation of the NID by means of the NCD poses operational obstacles. Due to the noncomputability of the Kolmogorov complexity, we cannot predict how close is the NCD to the real value of the NID, and the approximation may yield arithmetic problems, particularly in the case of small strings where numerical indeterminate forms may arise [33, 49] . Moreover, the compressor (as an approximation of the Kolmogorov complexity) must be 'normal' in the sense that given the object x A x A the compressor C should produce an object with almost to an identical size to the compressed version of x A . This is a limitation for the universality of the NCD since in specific applications the best performing loss-less algorithms (e.g., JBIG, JPEG2000 and JPEG-LS in image compression) do not satisfy such propriety [50] . Nevertheless, key results were already obtained using the NCD in image distinguishability [51] , image OCR [52] , malware recognition [53] and genomic analysis [54, 55] . Information theory [56] has been applied in a variety of scientific areas. The most fundamental piece of the theory is the information content I of a given event x i having probability of occurrence P X ¼ x i ð Þ: where X is a discrete random variable. The expected value of the information, the socalled Shannon entropy [57, 58] , is given by: where E Á ð Þ denotes the expected value operator. Expression (6) obeys the four Khinchin axioms [59, 60] . Several generalizations of Shannon entropy have been proposed, relaxing some of those axioms, and we can mention the Rényi, Tsallis and generalized entropy [61] [62] [63] , just to name a few. A recent and interesting concept is the cumulative residual entropies e given by [64, 65] : Within the scope of information theory, we can formulate also the concept of distance discussed in Sect. 2.1. The Kullback-Leibler divergence between the probability distributions X 1 and X 2 is defined as [34, [66] [67] [68] [69] : From this, we obtain the Jensen-Shannon divergence, JSD X 1 k X 2 ð Þ , or distance, given by: where Þcan be calculated as: The HC is a computational technique that assesses a group of N objects X i , i ¼ 1; Á Á Á ; N, in a q-dimensional space, and tries to rearrange them in a visual structure with objects Y i highlighting the main similarities between them in the sense of some predefined metric [70] . Let us consider N objects defined in a q-dimensional real-valued space and a distance (or dissimilarity) measure d ij between pairs of objects. The first step is to calculate a N  N-dimensional matrix, D ¼ ½d ij , where d ij 2 R þ for i 6 ¼ j and d ii ¼ 0, ði; jÞ ¼ 1; Á Á Á ; N, stands for the object to object distances [71] . The HC uses the input information in matrix D and produces a graphical representation consisting in a dendrogram or a hierarchical tree. The so-called agglomerative or divisive clustering iterative techniques are usually adopted for processing the information. In the first approach, each object starts in its own cluster and the computational iterations merge the most similar items until having just one cluster. In the second, all objects start their own cluster and the computational iterations separate items, until each object has his own cluster. For both approaches, the numerical iterations follow a linkage criterion, based on the distances between pairs, for quantifying the dissimilarity between clusters. The maximum, minimum and average linkages are possible criteria. The distance d x A ; x B ð Þ between two objects x A 2 A and x B 2 B can be assessed by means of several metrics such as the average linkage given by [72] : The clustering quality can be assessed by means of the cophenetic correlation [73] : where x(i, j) and y(i, j) stand for the distances between the pairs X i and X j , in the initial measurements, and Y i and Y j , in the HC chart, respectively, and x denotes the average of x. Values of c close to 1 (to 0) indicate a good (weak) cluster representation of the original data. In MATLAB, c is computed by means of the command cophenet. In this paper, we adopt the agglomerative clustering and the average linkage [74, 75] for tackling the matrix of distances based on the JSD metric (10) . The MDS is also a computational technique for clustering and visualizing multidimensional data [76] . Similarly to what was said for the HC, the input of the MDS is the matrix D ¼ ½d ij , ði; jÞ ¼ 1; Á Á Á ; N. The MDS is to adopt points for representing the objects in a d-dimensional space, with d\q, that try to reproduce the original distances, d ij . The MDS performs a series of numerical iterations rearranging the points in order to optimize a given cost function called stress S. We have, for example, the residual sum of squares and the Sammon criteria: The resulting MDS points have coordinates that produce a symmetric matrix U ¼ ½/ ij of distances that approximate the original one D ¼ ½d ij . In MATLAB, the commands cmdscale and Sammon can be adopted for the classical MDS and the Sammon stress criterion. The interpretation of the MDS locus is based on the patterns of points [77, 78] . Therefore, objects with strong (weak) similarities are represented by fairly close (distant) points. The MDS locus is interpreted on the relative positions of the point coordinates. So, the absolute position of the points or the shape of the clusters has usually a special meaning, and we can magnify, translate and rotate the locus for achieving a good visualization. In the same line of reasoning, the axes of the plot do not have units or physical meaning. The quality of the produced MDS locus can be evaluated using the stress plot and the Shepard diagram. The stress plot shows S versus d and decreases monotonically. Usually, low values of d are adopted, and present computational resources allow a direct three-dimensional representation, but often some rotations and magnification are required to achieve the best visual perspective. The Shepard diagram plots / ij against d ij for a given value of d, and a narrow (large) scatter of points indicates a good (weak) fit between the original and the reproduced distances. The information of 133 publicly released genomic sequences was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) and the GenBank of the National Center for Biotechnology Information (NCBI) databases (https://www.gisaid. org/, https://www.ncbi.nlm.nih.gov/genbank). The information regarding the sequences and serial numbers is given in Table of The mathematical tools of Kolmogorov and Shannon theories are used to compare and extract relationships among the data and to identify viral genomic patterns. The viral genomes are analyzed from the perspective of dynamical systems using HC and MDS. Dendrograms and trees are generated by HC algorithms, and a three-dimensional visualization through MDS visualization. Several clusters with medical and epidemiological interest were visualized. The MDS loci provide a key visualization of the relation of SARS-CoV-2 with the other known coronavirus affecting humans. The several non-coronavirus pathogenic viruses analyzed are very well delimited in several independent and easily delineated clusters (Zika, Chikungunya, Dengue, Lassa, Ebola, Influenza B). The several types of coronaviruses that cause mild disease (common flu) are very well separated from the other coronavirus clusters. Interestingly, the types, 229E, HkU1, NL63 and OC43, form separated miniclusters within a big well-defined cluster There are two well-defined CoV-19 or SARS-CoV clusters. The two environmental SARS-CoV-2 were aggregated with the human SARS-CoV-2. To differentiate the human MERS-CoV, a zoom was made to isolate it forming a separate cluster. The MERS-CoV obtained from camels was included in this cluster. SARS-CoV formed another independent and wellsegregated cluster. The CoV from bats and pangolin is dispersed among these several clusters of coronavirus forming sometimes clusters of few elements. Note that the bat CoV RaTG13 is near to one of the SARS-CoV clusters. This coronavirus is the closest known relative of SARS-CoV-2 [79] . For processing the RNA information consisting of ASCII files with the four nitrogenous bases, we consider two approaches. The first one follows the Kolmogorov complexity theory described in Sect. 2.2. Therefore, we consider the compressors zlib and bz2 (see https://www.zlib.net and https://sourceware. org/bzip2/) followed by the NCD distance (4). Nonetheless, the two algorithms give almost identical results, and therefore, only the zlib is considered in the follow-up. The application of the NCD produces a symmetrical matrix D, 133  133 dimensional that is tackled and visualized by means of a dendrogram and a HC tree obtained by the program Phylip http://evolution. genetics.washington.edu/phylip.html and MDS using MATLAB https://www.mathworks.com/products/ matlab.html. The corresponding charts are represented in Figs. 2, 3 and 4. It is known that the three-dimensional plots require some rotation in the computer screen. Therefore, Fig. 5 shows the MDS locus in a different perspective, without point labels and the cluster of SARS-CoV-2 connected by a line. The second approach follows the Shannon information theory described in Sect. 2.3. Therefore, we start by considering non-overlapping (codon or anticodon) triplets of bases and the histograms of relative frequency for each virus. In a second phase, after having the histogram for the complete set of virus, we process them using entropy concepts. Before continuing, we must clarify that we tested the adoption of ntuples, with n ¼ 1; 2; 3; 4, in the DNA information analysis of a large set of superior species such as mammals, where the genetic information is considerably larger and the construction of histograms for large values of n does not pose problems of statistical significance (since the number of histograms bins grows as 4 n ). It was observed that n ¼ 3 is a 'good value,' since n ¼ 1 leads to poor results, n ¼ 2 improves things but is still insufficient, while n ¼ 4 gives almost identical results to n ¼ 3. In a first experiment, we consider the Shannon entropy H vs fractional cumulative residual entropy e as possible descriptors of the 133 histograms. Figure 6 18 19 4 22 23 24 3 69 75 60 65 71 95 61 67 72 94 5 82 11 92 87 10 16 93 78 6 68 73 133 74 80 58 63 66 131 89 79 59 64 132 90 70 77 86 8 76 12 38 111 44 119 39 112 40 113 45 120 46 121 51 56 129 25 98 99 125 26 101 32 109 27 102 33 106 28 105 29 103 34 107 30 104 35 108 31 36 110 117 124 37 43 118 116 123 42 122 115 41 114 47 52 57 62 130 48 53 85 127 49 54 50 55 128 84 100 126 83 7 81 9 14 91 96 15 20 21 97 88 1 Fig. 2 Dendrogram for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory shows the resulting two-dimensional plot, that is, the Shannon entropy H vs the fractional cumulative residual entropy e, for the set of 133 viruses. We observe some separation between virus, but we have some difficulties due to the high density of points in some areas. We now try the other computational techniques introduced in Sect. 2. In the follow-up, we consider identical input information, that is, the JDS and the corresponding matrix D for the set of 133 viruses, and we compare the distinct computational clustering and visualization techniques. Figures 7, 8 and 9 show the dendrogram, the HC tree and the three-dimensional MDS locus, respectively. The dendrogram is the simplest representation and it is straightforward to interpret. However, this technique does not take full advantage of the space. The HC tree uses more efficiently the twodimensional space, but we now have more difficulties in reading the most dense clusters. The three-dimensional MDS takes advantage of the computational representation, but requires some rotation/shift/magnification in the computer screen. Successive magnifications can be done and are necessary to achieve a more distinct visualization. Therefore, we can say that all representations have their own pros and cons, although the three-dimensional MDS is, a priori, the most efficient method. Since the three-dimensional plot requires some rotation in the computer screen, Fig. 10 shows the MDS locus in a different perspective, without point labels and the cluster of SARS-CoV-2 connected by a line. The MDS 3D following the Kolmogorov method isolated the several groups of virus in several extended groups. The cluster of SARS-CoV-2 is very extended. Note that the SARS-CoV-2 virus found in the market of Wuhan is very near of the SARS-CoV-2 cluster. The CoV found in the pangolin also forms a cluster mixing with the upper part of the SARS-CoV-2 cluster. Bat CoV virus forms also a diffuse cluster lacking some contiguity. The bat CoV Yunnan RaTG13 is very near the SARS-CoV-2 cluster. There is a big cluster formed by the different CoV that causes a mild respiratory disease (common cool). Other pathogenic RNA virus clusters, i.e., Lassa virus, Zika, influenza B and Ebola, can be easily separated from the CoV clusters. However, with this methodology, the cluster formed by the dengue virus is not very far, in some views, from a part of the cluster formed by the CoV that causes mild respiratory disease. The SARS CoV virus forms another independent cluster. MERS-CoV from humans forms another extended cluster and in the vicinity of the CoV virus from camel. This paper explored the information of 133 RNA viruses available in public databases. For that purpose, several mathematical and computational tools were applied. On the one hand, the concepts of distance and the theories developed by Kolmogorov and Shannon for assessing complexity and information were recalled. The first involved the normalized compression distance and the zlib compressor for the DNA. The second understood the use of histograms of triplets of bases and their assessment through entropy, cumulative residual entropy and Jensen-Shannon divergence. On the other hand, the advantages of hierarchical clustering and multidimensional scaling algorithmic techniques were also discussed. Representations in two and three dimensions, namely by dendrograms and trees, and loci of points or points and lines, respectively, were compared. The results revealed their pros and cons for the specific application of the set of viruses under comparison. The MDS 3D in the Kolmogorov perspective leads to be the best visualization method. We have not only the clusters easily distinguishable, but we find also the relation between the new SARS-CoV-2 virus and some CoV found in bats (the primary host of the virus) and in the pangolin, the likely intermediate host. The SARS-CoV-2 found in the environment, namely the Market of Wuhan where the epidemic probably started, is indistinguishable from the SARS-CoV-2 found in humans. This type of methodology may help to study how an animal virus jumped the boundaries of species to infect humans, and pinpoint its origin, as this knowledge can help to prevent future zoonotic events [80] . The statistical and computational techniques allow different perspectives over viral diseases that may be used to grasp the dynamics of the emerging COVID-19 disease. These methodologies [79] may help interpreting future viral outbreaks and to provide additional information concerning these infectious agents and understand the dynamics of viral diseases. A novel coronavirus from patients with pneumonia in China Evolutionary trajectory for the emergence of novel coronavirus SARS-CoV-2 Complete genome sequence of middle east respiratory syndrome coronavirus isolated from a dromedary camel in Egypt Early dynamics of transmission and control of COVID-19: a mathematical modelling study Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period Genetic evolution analysis of 2019 novel coronavirus and coronavirus from other species Epidemic analysis of COVID-19 in china by dynamical modeling Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus COVID-19 evolves in human hosts Toward an efficient method of identifying core genes for evolutionary and functional microbial phylogenies Molecular epidemiology and evolutionary histories of human coronavirus OC43 and HKU1 among patients with upper respiratory tract infections in Kuala Lumpur The rapidly expanding universe of giant viruses: mimivirus, pandoravirus, pithovirus and mollivirus Fundamentals of Molecular Virology An efficient algorithm for a complete link method Hierarchical clustering via joint between-within distances: extending Ward's minimum variance method Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms The Elements of Statistical Learning: Data Mining, Inference, and Prediction Tidal analysis using timefrequency signal processing and information clustering Rare and extreme events: the case of COVID-19 pandemic Theory and Methods of Scaling The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika 27(I and II Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis Multidimensional Scaling Modern Multidimensional Scaling-Theory and Applications Is multidimensional scaling suitable for mapping the input respiratory impedance in subjects and patients? Comput Analysis of UV spectral bands using multidimensional scaling The molecular biology of coronaviruses Coronavirus envelope protein: current knowledge Origin and evolution of pathogenic coronaviruses Severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats Genetic diversity and evolution of SARS-CoV-2 Clustering by compression Encyclopedia of Distances Taxonomy of nominal type histogram distance measures A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering complexity for DNA sequences Relationship of bacteria using comparison of whole genome sequences in frequency domain Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm The distance function effect on k-nearest neighbor classification for medical datasets A Comparison of Categorical Attribute Data Clustering Methods Secure approximation of edit distance on genomic data Normalized forms of two common metrics A new study on distance metrics as similarity measurement Feature Extraction: Foundations and Applications Perceptually based comparison of image similarity metrics Three approaches to the quantitative definition of information Information distance Kolmogorov complexity with error Common pitfalls using the normalized compression distance: what to watch out for in a compressor Image similarity using the normalized compression distance based on finite context models Using normalized compression distance for image similarity measurement: an experimental study Normalized compression distance of multisets with applications On normalized compression distance and large malware On the Approximation of the Kolmogorov Complexity for DNA Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA A mathematical theory of communication Entropy and Information Theory Generalised information and entropy measures in physics Mathematical Foundations of Information Theory Information theory and statistical mechanics On measures of information and entropy Possible generalization of Boltzmann-Gibbs statistics Fractional order generalized information Cumulative residual entropy, a new measure of information & its application to image alignment Fractional cumulative residual entropy Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete Generalized information measures and their applications: a brief survey Divergence measures based on the Shannon entropy Measures between probability density functions Clustering Algorithms Multidimensional scaling visualization using parametric similarity indices On the Surprising Behavior of Distance Metrics in High Dimensional Space The comparison of dendrograms by objective methods PHYLIP (phylogeny inference package), version 3.5 c A Primer to Phylogenetic Analysis Using the PHYLIP Package A survey on multidimensional scaling Relativistic time effects in financial dynamics Multidimensional scaling analysis of virus diseases Profile of a killer: the complex biology powering the coronavirus pandemic The proximal origin of SARS-CoV-2 Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations Acknowledgements The authors thank all those who have contributed and shared sequences to the GISAID database (https://www.gisaid.org/). The authors also thank those who have contributed to the GenBank of the National Center for Biotechnology Information (NCBI) databases (https://www. ncbi.nlm.nih.gov/genbank). The authors also thank Rómulo Antão for the help in handling the information with the compressors zlib and bz2. 72 91 77 78 81 82 79 15 32 34 39 50 51 53 52 57 59 48 47 55 49 54 31 115 123 124 127 125 120 119 116 117 121 122 118 37 2 6 7 101 102 105 9 10 16 17 46 60 61 44 45 35 128 129 28 65 25 67 14 87 70 126 18 19 23 20 21 22 132 24 133 130 131 112 114 109 94 110 111 113 93 98 95 96 97 26 3 4 73 74 75 76 88 92 90 89 84 85 86 83 80 8 13 107 12 106 103 99 104 108 100 11 62 66 64 69 63 68 5 29 30 27 40 33 36 43 42 38 56 41