key: cord-0975185-o9k9fr3d
authors: Machado, J. A. Tenreiro; Rocha-Neves, J. M.; Azevedo, Filipe; Andrade, J. P.
title: Advances in the computational analysis of SARS-COV2 genome
date: 2021-08-27
journal: Nonlinear Dyn
DOI: 10.1007/s11071-021-06836-y
sha: bab92b97e060dfd1e0544fef70ad014f4515105f
doc_id: 975185
cord_uid: o9k9fr3d

Given a data-set of Ribonucleic acid (RNA) sequences we can infer the phylogenetics of the samples and tackle the information for scientific purposes. Based on current data and knowledge, the SARS-CoV-2 seemingly mutates much more slowly than the influenza virus that causes seasonal flu. However, very recent evolution poses some doubts about such conjecture and shadows the out-coming light of people vaccination. This paper adopts mathematical and computational tools for handling the challenge of analyzing the data-set of different clades of the severe acute respiratory syndrome virus-2 (SARS-CoV-2). On one hand, based on the mathematical paraphernalia of tools, the concept of distance associated with the Kolmogorov complexity and Shannon information theories, as well as with the Hamming scheme, are considered. On the other, advanced data processing computational techniques, such as, data compression, clustering and visualization, are borrowed for tackling the problem. The results of the synergistic approach reveal the complex time dynamics of the evolutionary process and may help to clarify future directions of the SARS-CoV-2 evolution.

The severe acute respiratory syndrome virus (SARS-CoV-2) is a single-stranded Ribonucleic Acid (RNA) beta-coronavirus presenting a 29,903 nucleotides-long genome. It caused an outbreak in the Chinese city of Wuhan, in December 2019. Subsequently, the new coronavirus has spread worldwide in less than 4 months, and a pandemic situation was declared by the World Health Organization (WHO). The disease was named on February 2020, Coronavirus Disease 2019 (COVID- 19) , and the disease is now responsible for more than 145,000,000 confirmed cases, and 4,230,000 deaths reported worldwide as of August 3, 2021 (source: World Health Organization https:// covid19.who.int). There are significant differences in case fatality rates (proportion of deaths from a specific disease compared to the total number of individuals diagnosed with the disease for a particular period) between countries, possibly related to the efficacy of the measures adopted to limit viral spreading, the demographic pyramid, i.e., the distribution of various age groups in a particular country, and the screening, test and tracing strategy [1] .

Using the data from the database of the Global Initiative on Sharing Avian Influenza Data (GISAID) Consortium, major clades of SARS-CoV-2 were identified [2] . The initial RNA genome (accession number: NC 045512.2), identified in Wuhan is named as clade 'L'. This root clade suffered mutations due to numerous factors, and some new clades emerged with stable mutations: clade 'G' (presenting an alteration of the spike protein S-D614G was the first dominant variant) and its two derivatives 'GH' (with ORF3a-Q57H mutation) and 'GR' (affected by RG203KR mutation); clade 'V' (variant of the ORF3a coding protein NS3-G251), and clade 'S' (variant ORF8-L84S). Other alleles or combinations different from the previously described clades are classified as clade 'O' [2] . These data confirmed, until now, the relatively low variability presented by SARS-CoV-2 compared to other respiratory viruses [3] . The different fatality rates, speed of transmission, and infectiousness profiles observed in different countries were probably not related to differences in virulence of the clades and their characteristic mutations. However, very recently, some new mutations were found with potential epidemiological consequences. For example, 12 human cases were identified in September 2020 in North Jutland with a unique variant called 'cluster 5', a combination of mutations that have been not previously described. All 12 cases were linked to the mink farming industry or the local community [4] . However, the clinical presentation, severity, and duration of COVID-19, and transmission among those infected were similar to that of other circulating SARS-CoV-2 viruses. The variant cluster 5 has not been detected since September despite extensive sequencing and data sharing and is thought to be a dead-end, owing to the very restricted spreading and infectiousness to humans [4] .

More disturbing is the new variant strain of SARS-CoV-2 that contains 23 mutations, eight of which are in the spike protein the virus uses to bind to and enter human cells. The spike protein is also the focus of most COVID-19 vaccines that are now being administered in numerous countries. Moreover, the diagnostics tests of COVID-19 are also based on the protein sequence found on the Wuhan reference strain spike. Therefore, their efficacy can be changed by these genomic variations. The appearance of these mutations can also lead to immunological resistance and vaccine escape [5] . The new British variant was reported for the first time in December 2020 and has become highly prevalent globally and responsible for another COVID-19 wave in numerous countries [6] . Based on these mutations, this variant has been predicted to potentially be more quickly transmissible than other circulating strains of SARS-CoV-2. The variant is referred to as SARS-CoV-2 VUI 202012/01 (i.e., Variant Under Investigation, year 2020, month 12, variant 01), or B.1.1.7, as the lineage of the clade GR was classified on GISAID. Variant B.1.1.7 presents increased transmissibility and increased virus load. However, apparently, there is no association with more severe disease [7] although the increased infectiousness can lead to more deaths due to the strain of the health systems of the affected countries. In Mid-November in South Africa, a new lineage of the clade GH has emerged but shared one of the mutations described in the British variant. The South African virus variant (B.1.351), known as 'triple variant' is distinct from the UK variant, but both contain an unusually high number of mutations, with potential functional significance, compared to other SARS-CoV-2 lineages, and it can, apparently, partially escape to some available vaccines [8] . Several other variants were described more recently in several countries. Examples are the variants P1 and P2 in Brazil, variants B.1.429 and B.1.526 found in the USA, and yet more recently, a variant first sequenced in India (variant B.1.167) [9] . In April 2021, the South Africa, Brazil, and India variants caused or are causing large disease outbreaks in their respective countries with increased excess deaths due to rupture of the hospital capacities.

The SARS-CoV-2 B.1.1.7 variant was already detected at the end of December 2020 in France, Denmark, Holland, and Italy [10] . In 31 of May, the WHO has assigned simple, easy to say and remember labels for the most important variants of SARS-CoV-2, using letters of the Greek alphabet (see Tracking SARS-CoV-2 variants at https://www.who.int/en/activi ties/tracking-SARS-CoV-2-variants/#:∼:text=WHO% 20and%20its%20international%20networks,variant% 2c%20and%20prevent%20its%20spread).

The most important variants of concern according to this new classification are: Alpha, corresponding to lineage B. Table 1 ).

In the Summer of 2021, the SARS-CoV-2 Delta variant is becoming dominant in Europe, North America and other parts of the world and is responsible for a new wave, mainly in the unvaccinated population [11] . It is highly transmissible and it also appears to affect vaccine effectiveness and breakthrough infections in vaccinated individuals appear to be more frequent with this variant [11] . It was reported that the viral load is 1000 times higher for Delta compared with previous variants of the initial wave of infections and may have a faster replication rate, a reduced incubation period, and greater viral shedding [12] . The Delta variant was found to be approximately 64% more transmissible than the Alpha variant that was dominant in the waves of the end of December 2020 and first months of 2021. On the other hand, Alpha was already estimated to be 50% more transmissible than the D614G strain, responsible for the first wave in the beginning of 2020 [13] .

The so-called coronavirus 'waves' can be more contagious and spread faster than the initial ones due to the new variants that can also present other epidemiological problems. There is no definitive evidences that the various variants are associated with an higher disease severity. However, there is a clear risk that future epidemic 'waves' may be larger and, therefore, associated with greater burden for the health systems and society due to the lockdowns. Therefore, with this work we try to understand SARS-CoV-2 new variants and the relations with the already known virus clades using new strategies for finding the relationships among them [14] [15] [16] . It is important to understand the recent evolution of the virus partially responsible for the socalled 'waves' [17] . With this regard computational techniques associated with mathematical tools are a promising strategy to tackle genomic data-sets. This approach was tested in [18, 19] using the Kolmogorov complexity and Shannon information theories, associated with clustering techniques [20, 21] such as the Multidimensional Scaling (MDS). Given its successful application in a primary set of 133 items, encompass-ing a variety of virus, this paper extends the study to a more challenging problem, namely of analyzing and comparing the SARS-CoV-2 mutations. For that purpose a new data-set of 307 virus including RNA information from the beginning of the spread up to the day of writing this paper is selected. Furthermore, based on the aforementioned mathematical tools, we include a larger set of indices to allow a more complete comparison. In the case of the Kolmogorov complexity we consider four indices [22, 23] , normalized information distance, Compression-based Dissimilarity Measure, Chen-Li Metric and Compression-based Cosine (that are abbreviated by the acronyms N C D, C DM, C L M and Cos S). In the scope of the Shannon information theory we consider also four metrics, namely the Jaccard, Jensen-Shannon, Jeffreys and Topsøe distances (denoted as d J a , d J S , d J e and d T o ) [24, 25] . A third type and distinct type of assessment is also included and consists of the Hamming distance (d Ha ), widely used in information theory [26] .

Following these ideas the rest of the paper is organized as follows. Section 2 introduces the fundamental tools. The main mathematical concepts involved with distance, Kolmogorov complexity, Shannon information and Hamming metric are summarized. Additionally, the computational tools such as data compression and MDS are also included. Section 3 describes and analyses the data-set of very close, but distinct, information for a large number of RNA sequences. Section 4 develops a synergistic performance of a variety of measures associated with the MDS clustering and visualization. The results shed light on the dynamics of the evolutionary process of the SARS-CoV-2 lineages. Finally, Sect. 5 presents the conclusions.

A function d(·, ·) stands for the distance between two objects x and y if satisfies three axioms [27], namely identity d(x, y) = 0 if x = y, symmetry d(x, y) = d(y, x) and triangle inequality d(x, y) ≤ d(x, z) + d(y, z). These axioms imply the non-negativity (or separation condition) d(x, y) ≥ 0. On the other hand, they allow the definition of a plethora of different functions, with distinct pros and cons [24, 25] . Based on these notions, several algorithms [28] [29] [30] were adopted for comparing data sequences [26, [31] [32] [33] . However, users must have in mind that the selection of a set of distances for a given application requires some experience and that a number of numerical trials are usually necessary before finding the 'best' ones [34] [35] [36] [37] .

In the final part of the paper, the Appendix presents the mathematical and algorithmic fundamentals of the distances used in this paper for assessing the genetic information.

Compression data algorithms can be classified as 'lossless' and 'lossy'. Lossless compression algorithms are typically used for archival or other high fidelity purposes and reduce the size of files without losing any information in the file, which means that we can reconstruct the original data from the compressed file. Lossy compression algorithms reduce the size of files by discarding the less important information in a file, which can significantly reduce file size but also affect file quality.

In this paper we used the BZip2 compression algorithm which is based on Burrows-Wheeler transform [38] . This compressor has the extension BZ2 designating a pure data compression format not providing file archival feature. In this algorithm the speed is somewhat slower than for the compressor LZW (extension .Z) and Deflate (extension .zip and .gz) compression algorithms [39] . These employ the classic Deflate algorithm (even if correctly implemented Bzip2 algorithm can be easily made parallel, and benefit of recent multi-core CPU), but faster than more powerful compression schemes as in RAR format, 7Z format, and new ZIPX format. The compression ratio, also, is usually intermediate between the older Deflatebased ZIP/GZ files and modern RAR, 7Z and ZIPX formats [40] .

Let us consider a group of N objects x i , i = 1, · · · , N , in a q-dim space. The MDS is a computational method for [41] that re-organizes them in a structure where the objects are represented by points trying to highlight the similarities between them in the sense of a predefined distance [42] . The process starts by calculating a N × N dimensional matrix, D = [d i j ], with d i j ∈ R + for i = j and d ii = 0, (i, j) = 1, · · · , N , giving object to object distances [43] . In a second phase, the MDS calculates the point coordinatesx i in a d < q-dim space, trying to mimic the original distances. The MDS technique includes a numerical iterations for optimizing a cost function, often called Stress, that compares the distances

The MDS pointsx i have coordinates that yield a symmetric matrixD = [d i j ] of distances that approximate D. The MDS results are interpreted based on the clusters, and eventually of patterns, of points [18, 44] . Therefore, similar objects are represented by nearby points, and the opposite for dissimilar objects. Different distances produce distinct MDS maps and it is up to the user to choose the metrics that reflect better the characteristics of the objects under analysis. By other words, the different distances are correct from the mathematical point of view, but the association of each metrics with the MDS algorithm may produce disparate patterns in the plots. In some cases the emerging patterns, although different, lead to similar conclusions. In other cases, some distances reflect better (or worst) the information embedded in the dataset and the selection of the 'best' metric depends on a trial and error set of tests based on the user experience.

The RNA is commonly sequenced indirectly by copying it into the complementary DNA (cDNA). Then the cDNA is amplified and analyzed using a number of DNA sequencing methods. The sequences of the RNA are published in the databases presenting the bases, adenine (A), cytosine (C), guanine (G), and thymine (T ). Some symbols, such as N (unspecified or unknown nucleoside), R, (unspecified purine nucleoside), Y (unspecified pyrimidine nucleoside) and others, permeate only a small percentage of the information and are not considered [45] . The information about the N = 307 GS was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) available at https://www.gisaid.org/. The information regarding the sequences, serial, clade/variant and country are listed in the Tables 5, 6, 7, 8 and 9. The genetic information is organized in 8 clades {GH, GR, O, GV, G, L, S, V} with 10 elements each, making a total of 80 cases. The recent advent of new variants correspond to the remaining 227 additional items as follows:

• South Africa, with 10 cases for the variant 'South Africa Triple Variant', denoted as TV-ZA • Denmark, with 10 cases for the variant 'Mink Cluster V', denoted as CL5-DK • England, with 10 cases for the variant VUI2020/01, denoted as VUI-GB • Italy, with 10 cases for the variant VUI2020/01, denoted as VUI-IT • Denmark, with 10 cases for the variant VUI2020/01, denoted as VUI-DK • Portugal, with 10 cases for the variant VUI2020/01, denoted as VUI-PT • USA, with 10 cases for the variant VUI2020/01, denoted as VUI-US • a mixture of several cases scattered along the world to give an higher spatial diversity In synthesis, we have collected a first set of 80 GS of the SARS-CoV-2 virus obtained in several countries during a first period of the outbreak. The second set includes 227 recent GS. The smaller number of cases the recent genomic data for some countries is limited to the data set available at the time of writing this paper. All ASCII files have approximately 30 kBytes.

The N = 307 GS exhibit very small differences and, therefore, are difficult to distinguish. We can first characterize them by their length L that varies between minimum and maximum values of L min = 28560 and L max = 29900 symbols. Moreover, we have an average and standard deviation of L av = 29773.5 and L sd = 109.31 symbols, respectively. This small variability in the size is relevant for the reliability when comparing strings with the Kolmogorov-and Hamming-based metrics.

As mentioned before, the viral RNA information is represented by ASCII files with the four nitrogenous bases. Therefore, we can consider the grouping of k s = {1, 2, 3}, consecutive symbols. For simplicity, we denote the corresponding sub-strings by S 1 , S 2 , S 3 and obtain the statistics listed in Tables 2, 3 and 4. The term 'others' stands for a small number of other symbols distinct of the four nucleotide bases. Therefore, occasionally we find the symbols N, M, S, R, Y, K, H, V, and W, that abbreviate aNy (A, C, G or T), aMino (A or C), Strong interaction (3 H-bonds, G or C), puRine (A or G), pYrimidine (C or T), Keto (G or T), not-G (A, C or T), not-T (A, C or G), and Weak interaction (2 H-bonds, A or T), respectively. Figure 1 shows the cumulative numbers of cases for sub-strings including k s = {1, 2, 3} symbols in the N = 307 GS.

As a complementary analysis of the relationship between GS we use the VOSviewer software tool [46] [47] [48] [49] [50] for constructing and visualizing bibliometric networks (https://www.vosviewer.com/). The program was built having in mind scientometric applications, but can be used in the present case if we consider the associations of symbols as keywords in a standard technical text. For that purpose we construct k s -tuples of consecutive symbols in the GS in order to have 'words' (i.e., sub-strings) with k s consecutive symbols. We considered the VOSviewer options 'Full counting', 'Minimum number of occurrences of a term = 5' and 'Number of terms selected = 28'.

For example, Fig. 2 shows the network for the sub-strings with k s = 3 consecutive symbols in the i = 1 genomic sequence. For the other virus the results are of the same type. We observe the very small relevance of 'phrases' with several triplets and, on the other hand, the complex network relationship between triplets.

From the Tables 2, 3 and 4 and Figs. 1 and 2 we verify that the case of k s = 3 symbols represent a good compromise between complexity and accurate description of the information content. This conclusion follows previous observations with genetic data [19, [51] [52] [53] .

The existence of the symbols classified as 'other' and the variation of size in the GS can be considered as a kind of noise. However, as verified with the previous tests they reflect in very small numbers. Also, in most cases we considered about 10 GS for each type of virus. Therefore, we can proceed with the analysis of the genetic information knowing its robustness against possible volatility in the data.

The mathematical description of the viral information is based on the Kolmogorov, Shannon and Hamming perspectives implemented by the distances Ha presented in the Appendix. The MDS clustering and visualization is used to unravel relationships between the data and to identify possible patterns. In the case of the Kolmogorov complexity we consider the compressor BZip2 (https://www.zlib.net). In the case of the Shannon information, we start by calculating the 64-bin histograms (i.e., the triplets {A A A, A AC, . . ., GGT , GGG}) for the triplets (k s = 3) of the nitrogenous bases and then we calculate the distances between them. The resulting matrix D, 307 × 307 dimensional, that is processed by MDS using the Matlab command cmdscale.

The distances following the Kolmogorov theory {N C D, C L M, C DM, Cos S} yield almost similar MDS plots. This behavior was also observed in previous studies with distinct data-sets [54] . In what concerns the distances based on the Shannon theory {d J a , d J s , d J e , d T o } we note that d J a produces a slightly different plot from the group {d J s , d J e , d T o }, which return charts having just small differences. On the other hand, the distance d Ha leads to a very different chart. Therefore, for parsimony, for these sets of distances we depict just the plots for the N C D, d J acc , d J S and d Ha .

The MDS charts produced by the four distances are represented in Fig. 3 , 4, 5 and 6, respectively. In all cases the MDS plots require a careful rotation to get the correct 3-dimensional perspective and assessment, since the planar projections in the figures are not totally capable of depicting their structure.

Several MDS loci reveal different clusters, but, in general we do not have a clear group for each variant of the virus. The Kolmogorov-and the Shannonbased metrics show some clusters, but with a mixture of many variants, while the Hamming scheme gives the worst map in the perspective of clustering. Nonetheless, we can ask a different and more relevant question Table 3 Statistics of 2-tuples of symbols in the N = 307 GS Table 4 Statistics of 3-tuples of symbols in the N = 307 GS which is how to assess the 'dynamics' of the evolutionary process. We must note that the available information just reflects 'time samples' of the variants, that is, the date where the procedure for collecting, identifying and recording the GS took place. Consequently, we do not have a precise control of the time elapsed between the real mutation and the laboratory measurement. Nonetheless, we can have a good idea of the dynamical behavior if we include time information in the MDS plots, even if some 'noise' is present in the time information. It is well known that each distance captures a given characteristic of the phenomenon under analysis. Therefore, we wander if some measure associating several distances could reveal better the patterns embedded in the datataset. In this line of thought we can design a 'generalized' distance by weighting several of the previous distances. If we consider the distances N C D, d J acc , d J S and d Ha , then we can define the new metric:

where μ NC D , μ J a , μ J S and μ Ha are weight factors for the distances N C D, d J acc , d J S and d Ha , respectively, so that μ NC D + μ J a + μ J S + μ Ha = 1. We can adjust the numerical values of the wight factor to reflect the importance of each distance for improving the MDS representation in the sense of providing a more clear visualization of the dynamical effect. In our case, after some experiments, and having in mind the dynamics in time, we consider the weights μ NC D = 0.55, μ J a = 0.2, μ J S = 0.2 and μ Ha = 0.05. Figure 11 shows the MDS plot where we observe the emergence of five clusters denoted by the

We verify (i) the first and second clusters, A and B, exhibit a very low scattering in contrast with the others, (ii) the existence of some noise in the evolutionary process, (iii) that time flows in discrete steps in the MDS representation, that is, the time evolution is not continuous, (iv) that we can have some evolutionary bifurcations in time as it is visible for {D, E}, and (v) clusters for different time instants may be close in the MDS locus. A deep reflection upon these results seems consistent with our present knowledge of the evolutionary process. In fact, we can interpret and justify the previous assertions as follows:

• the initial strains of virus had very similar characteristics, contrary to the more recent variants that reveal an increasing variability, which agrees with the first observation • evolution and, in particular, mutations, occur randomly which justifies the second conclusion • mutations do not emerge continuously in time and, in fact, they are the result of a multitude of issues such as environmental, social, geographical and economical factors. Therefore, new variants that lead to a relevant number of infections are expected to emerge without a clear time pattern which is in accordance with the third consideration It is important to analyze the effect of the clustering algorithm, the dimension of the visualization space and the type of representation. In this perspective, the Generalized distance (1) and, consequently, the same matrix D used for the MDS scheme, is now used with the set of programs Phylip [55, 56] Figures 12 and 13 shows the dendrogram and tree generated by Phylip for the generalized distance, d Ge , respectively. The time evolution is represented by colors as before.

The 2-dim graphical representations are easier to visualize in the sense that the user does not needs to for the initial 60% of the total period. In the case of a graphical output my means of a tree we have a more intricate pattern, with the objects placed in the form of two 'arcs'. The outer arc covers the period of time from beginning up to approximately 75%, while the inner arc corresponds to the rest, that is, to the more recent 25% GS. Moreover, the initial period of time of about 30% is place it the top of the outer arc. In summary, the strategy followed in this paper is consistent with present-day understanding about the SARS-CoV-2 genome and the adoption of several distinct distances allows users to have a complementary interpretation of the information embedded in the dataset.

The information of 307 RNA viruses available in a public database was explored by means of an association of mathematical and computational tools. The notions Kolmogorov complexity, Shannon information and Hamming distance, on the perspective of analytic tools, and the ideas of compression and MDS algorithms, on the point of view of computational tools, were considered. Three sets of indices of dissimilarity were adopted, allowing a broad comparison of the results, with four distances based on the BZip2 compression, four distances using 2-dimensional histograms of consecutive triplets, and one metric using the Hamming distance between triplets of bases. From these, the MDS algorithm allowed an efficient clustering and 3-dim visualization. The MDS plots revealed pros and cons of the alternative distances adopted for Fig. 13 Tree representation of the SARS-COV2 time dynamics during the period 2020/Feb/24-2021/Apr/23 for N = 307 GS using the generalized distance, d Ge assessing the set of viruses. The problem at stake proved to pose a considerable challenge and no clear clusters emerged for the virus variants included in the dataset. This motivated a new question, namely the relevance of clustering the variants when thinking about its evolutionary dynamics, since many variants have minor differences between themselves. The idea of assessing the dynamics in time lead to the design of a generalized distance taking advantage of the characteristics of distinct metrics and allowing a adjustment to the phenomenon by means of weight factors. The results showed interesting dynamical effects in terms of forming clear clusters in time. Besides the analytic tools, several algorithmic mechanisms were explored, such as the Matlab, VOSviwer and Phylip programs, which illustrate the diversity and richness of present day computational strategies. The synergistic perspectives provided by distinct processing tools and graphical representations allow comparing the genomic data and provide a computational strategy for exploring future viral outbreaks. The association of analytic and computational techniques may help interpreting the phylogeny of these new strain outbreaks, associate its dynamics and selective pressures, and give additional insight for the quick development and testing of tailored countermeasures. The computational analysis of the SARS- CoV-2 genome may have a role in the early detection of potential variants of concern and help in the characterization of the risk posed to global public health. This is important as it may contribute to the global monitoring of SARS-CoV-2 variants and to improve the search for a more effective response to the COVID-19 pandemic. Italy/LAZ-AMC3-5015/2020|EPI_ISL_717978|2020-12-14  GR  VUI2020/01  Italy  VUI-IT   112  hCoV-19/Italy/CAM-TIGEM-174/2020|EPI_ISL_736786|2020-12-21  GR  VUI2020/01  Italy  VUI-IT   113  hCoV-19/Italy/CAM-TIGEM-181/2020|EPI_ISL_736787|2020-12-21  GR  VUI2020/01  Italy  VUI-IT   114 

The Kolmogorov complexity tackles the assessment of information without considering probabilistic notions [57] . Given a string x, the Kolmogorov complexity can be defined as 'the length of the shortest program that, given an empty string ψ, can compute x and then halts'. Consequently, the Kolmogorov complexity can be somehow viewed as the length of the compressed form of the string x.

The information distance between two strings [22, 23] leads to concepts such as, normalized information distance (N C D) [58, 59] , Chen-Li Metric (C L M) [60] [61] [62] , Compression-based Dissimilarity Measure (C DM) [63] and Compression-based Cosine (Cos S) [54] , given by:

The operator C(x) represents the length (measured as a number of bits) of string x after being compressed by a given algorithm C(·). The concatenation of two strings, x and y, is denoted as x y and, therefore, C(x y) gives the number of bits when compressing together x and y. Moreover, the notation C(x|y) stands for the length of x when conditionally compressed with a model built on string y.

Sculley and Brodley [54] noted that the compression has an associated feature space, N. Therefore, they map a string x into a vectorx ∈ N and C(x) = ||x|| 1 . After some simplifications, we verify that the four compression distances (2) use the expression ||x|| 1 + ||ȳ|| 1 − ||x +ȳ|| 1 since we can write N C D(x, y) = 1 − ||x|| 1 + ||ȳ|| 1 − ||x +ȳ|| 1 max ||x|| 1 , ||ȳ|| 1 ,

C L M(x, y) = 1 − ||x|| 1 + ||ȳ|| 1 − ||x +ȳ|| 1 ||x +ȳ|| 1 ,

C DM(x, y) = 1 − ||x|| 1 + ||ȳ|| 1 − ||x +ȳ|| 1 ||x|| 1 + ||ȳ|| 1 ,

Cos S(x, y) = 1 − ||x|| 1 + ||ȳ|| 1 − ||x +ȳ|| 1 ||x|| 1 ||ȳ|| 1 .

(3d)

The four indices reduce to the form 1− ||x|| 1 +||ȳ|| 1 −||x+ȳ|| 1

f (x,ȳ)

, where the normalizing term f (x,ȳ) varies for each case.

In the scope of the Information theory [64] , the socalled information content I of an event x i with probability P (X = x i ) is defined as:

where X denotes a discrete random variable.

The expected value of the information, E(I ), named Shannon entropy [65, 66] , becomes:

that follows the four Khinchin axioms [67, 68] .

In the case of a two-dimensional distribution (X 1 , X 2 ) we have the joint entropy H (X 1 , X 2 ) and mutual information M I (X 1 , X 2 ) given by:

Several distances can be defined from these mathematical tools. In the follow-up we adopt the Jaccard and Jensen-Shannon distances, d J a (X 1 , X 2 ) and d J S (X 1 X 2 ) respectively.

The Jaccard distance represents a set-theoretic interpretation of information based on the normalized distance M I (X 1 ,X 2 ) H (X 1 ,X 2 ) and is formulated as:

The Jensen-Shannon divergence is the symmetric version of the Kullback-Leibler divergence d K L (X 1 X 12 ) and is given by:

where X 12 = 1 2 (X 1 + X 2 ). The expression of d J S (X 1 , X 2 ) can be rewritten as:

In the Shannon's entropy family we include also the Jeffreys (d J e ) and Topsøe (d T o ) expressions given by:

Hamming distance

We adopt also a distance based on standard concepts for comparing the symbols forming the genetic sequences (GS). In information processing, the Hamming metric between two strings of equal length is given by the number of positions at which the corresponding symbols are different [26, 69] . Therefore, we start by considering triplets of k s consecutive symbols in each string. This means that we form sub-strings with k s successive symbols in the DNA strand, so that for a genetic sequence x with length L x we obtain n x = L x 3 of k stuples. The assessment of similarity between the k-th character, k = 1, . . . , k s , in the i-th pair of sub-strings (i.e., the x k and y k symbols in the two sub-strings i extracted from the x and y sequences), is then based on the concept of Hamming metric:

where the indices k and i stand for the k-th symbol in the i-th sub-string (k = 1, . . . , k s , and i = 1, . . . , n x ), respectively. In general the strings x and y have slightly different lengths and we simply adopt the length n xy = min n x , n y . For calculating the total distance between two strings we test the 'normalized' version of the Hamming distance:

where δ i (x k , y k ) = 1 − δ i (x k , y k ) and the term 'normalization' means that the final values are given in relation to the total number of compared sub-strings, n xy .

Luxembourg VUI 196 hCoV-19/Switzerland/ZH

Demographic science aids in understanding the spread and fatality rates of COVID-19

Proc. Natl. Acad. Sci

Geographic and genomic distribution of SARS-CoV-2 mutations

Genomic variance of the 2019-nCoV coronavirus

COVID mink analysis shows mutations are not dangerous -yet

Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations

Genomic characteristics and clinical effect of the emergent SARS-CoV-2 B.1.1.7 lineage in London, UK: a wholegenome sequencing and hospital-based cohort study. The Lancet Infectious Diseases

Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa

Sequence analysis of 20, 453 severe acute respiratory syndrome coronavirus 2 genomes from the Houston metropolitan area identifies the emergence and widespread distribution of multiple isolates of all major variants of concern

Estimated transmissibility and severity of novel SARS-CoV-2 variant of concern 2020/12/01 in England

Outbreak of SARS-CoV-2 infections, including COVID-19 vaccine breakthrough infections

Viral infection and transmission in a large, well-traced outbreak caused by the SARS-CoV-2 delta variant

Increased household transmission of COVID-19 cases associated with SARS-CoV-2 variant of concern B.1.617.2: a national case-control study

Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine

Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil

Oxford-AstraZeneca COVID-19 vaccine efficacy

Rare and extreme events: the case of COVID-19 pandemic

Multidimensional scaling analysis of virus diseases

Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov's complexity and Shannon's information theories

A computational perspective of the periodic table of elements

Multidimensional scaling and visualization of patterns in prime numbers

Information distance

Kolmogorov complexity with error

Taxonomy of nominal type histogram distance measures

Encyclopedia of distances

Error detecting and error correcting codes

A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering complexity for DNA sequences

Relationship of bacteria using comparison of whole genome sequences in frequency domain

Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm

A comparison of categorical attribute data clustering methods

The distance function effect on k-nearest neighbor classification for medical datasets

Secure approximation of edit distance on genomic data

Normalized forms of two common metrics

A new study on distance metrics as similarity measurement

Feature extraction: foundations and applications

Perceptually based comparison of image similarity metrics

A block sorting lossless data compression algorithm

A technique for high-performance data compression

Comparison of lossless data compression algorithms for text data

A survey on multidimensional scaling

Clustering algorithms

Multidimensional scaling visualization using parametric similarity indices

Relativistic time effects in financial dynamics

IUPAC-IUBMB Joint Commission on Biochemical Nomenclature and Nomenclature Commission of IUBMB

Software survey: VOSviewer, a computer program for bibliometric mapping

A unified approach to mapping and clustering of bibliometric networks

Visualizing bibliometric networks

Constructing bibliometric networks: a comparison between full and fractional counting

Citation-based clustering of publications using CitNetExplorer and VOSviewer

Shannon information and power law analysis of the chromosome code

Can power laws help us understand gene and proteome information?

Fractional order description of DNA

Compression and machine learning: a new perspective on feature space vectors

PHYLIP (phylogeny inference package), version 3.5 c

A primer to phylogenetic analysis using the PHYLIP package

Three approaches to the quantitative definition of information

The similarity metric

The similarity metric

A compression algorithm for DNA sequences and its applications in genome comparison

An information-based sequence distance and its application to whole mitochondrial genome phylogeny

Shared information and program plagiarism detection

Towards parameter-free data mining

ACM SIGKDD international conference on Knowledge discovery and data mining

A mathematical theory of communication

Entropy and information theory

Generalised information and entropy measures in physics

Mathematical foundations of information theory

Information theory and statistical mechanics

Inferring HIV transmission dynamics from phylogenetic sequence relationships

The authors thank all those who have contributed and shared sequences to the GISAID database available at https://www.gisaid.org/.

The reformatted data that support the findings of this study are available in http://ave.dee.isep.ipp.pt/ jtm/FILES/datasets/Archive.zip.

This appendix describes the metrics adopted in the paper. The Kolmogorov complexity and Shannon information theories are revisited and several distances are recalled. Additionally, the Hamming distance is also included.