key: cord-0930605-m8jm632d authors: Cilibrasi, Rudi L.; Vitányi, Paul M. B. title: Fast Phylogeny of SARS-CoV-2 by Compression date: 2022-03-22 journal: Entropy (Basel) DOI: 10.3390/e24040439 sha: e40467ea6abf778a399511a57a4c91eff6c452a6 doc_id: 930605 cord_uid: m8jm632d The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. It has been validated on objects from different domains always using essentially the same software. We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which is responsible for causing the COVID-19 disease, using the alignment-free compression method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6500 viruses. The results suggest that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6500 viruses are identified (given by their registration code) with larger NCDs. The NCDs are compared with the NCDs between the mtDNA of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here, we use it for the complex case of determining this similarity between the COVID-19 virus, SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely resemble earlier results from by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method, showing that one can achieve equivalent results both simply and quickly. In the 2019-2022 pandemic of the COVID-19 disease caused by the SARS-CoV-2 virus [1] , studies of the phylogeny and taxonomy of this virus use essentially two methods: alignment-based analysis, for example [2] ; and alignment-free machine learning [3] . In alignment-based methods, one customarily takes two or more sequences over a bounded alphabet of letters and makes them identical by changing letters, introducing gaps, etc., where each of these operations carries a cost. The total cost is computed by adding these sub costs. The alignment with the least total cost is preferable. Problems include, e.g., similar regions on sequences being compared may be far apart, causing massive alignment costs and a computation cost for an alignment which may be too high or forbidding. Alignment methods have many more drawbacks which make alignment-free methods more attractive [4] [5] [6] [7] . Machine learning algorithms learn a concept using an input consisting of many examples of that concept. It is alignment-free but uses complicated algorithms with many parameters to set. This was used in the phylogeny analysis of the COVID-19 virus in [3] together with decision trees and other devices. The purpose of this study was to identify the closest viruses to the SARS-CoV-2 virus isolate and determine their structural relations (taxonomy) in phylogeny trees in Figures 1 and 2 and the NCDs in Table 1 using the compression method. This method is alignment-free, based on the lossless compression of the whole-genome sequences of base pairs [8] of the involved viruses, and is called the normalized compression distance (NCD) method or compression method for short. The method computes the NCDs between pairs of genomes resulting in an NCD-matrix. Figure 1 . These 60 viruses have the least NCD with the SARS-CoV-2 virus. The figure therefore gives a visual representation of the structural relations close to the SARS-CoV-2 virus. We use human mtDNA as an outgroup; it is approximately the same size as the viruses and it can be assumed to be completely different. Using an unrelated sequence which is called the "outgroup" is a common method to determine where the root of a phylogenetic tree is: where it joins the tree, there is the root. Since S(T) = 0.998948, the tree almost perfectly represents the NCD distance matrix concerned in the Supplemental Material. We computed the NCDs between 6751 viruses and a selected SARS-CoV-2 virus but we can only visualize part of them. With this method, we quantify the proximity relations between pairs of viruses by their NCDs and compare them to similar relations between the mtDNAs of familiar mammal species in order to gain an intuition as to what they mean. Phylogeny analysis is widely used but there are challenges of, among others, accuracy and speed. As a measure of similarity, the NCD seems a good idea to solve these challenges. The compression method determines the similarity of two sequences using a single formula once, and thus does not use many examples. Since the NCD assesses similarity for sequences in all domains, we apply it to the complex problem of SARS-CoV-2 phylogeny. We verify whether the results using this method are comparable with those from the other used methods. The conclusion is that the NCD method is an alignment-free method that gives accurate phylogeny results, while being possibly faster than both alignment-based methods and other alignment-free methods. Previous studies have pointed to bats as the origin of the SARS-CoV-2 virus. It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus. From phylogenetic analysis and genome organization, it was identified as a SARS-like coronavirus, and to have the highest similarity to the SARS bat coronavirus RaTG13 and similar to the two bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45, see for example [2, 3] . We computed the NCDs of 6751 + 15,409 = 22,160 virus pairs plus the NCDs used for Figures 1 and 2 . Altogether, this took approximately 5-10 h in a combined run on a home desktop computer (a mini-computer called Meerkat from a Linux computer company called System76). It uses less than 2 s per pair which comes to more than 2000 pairs per hour. The same program can be re-used for different phylogenies and questions. This is possibly the easiest and fastest method to establish whole-genome phylogeny. Table S1 in the Supplemtary Materials or [11] . A string is a sequence of finite length over a finite alphabet-here A, C, G, T. Let Z be a lossless compressor ('lossless' means that the original can be reconstructed from the compressed version, that is, nothing of the compressed string gets lost). Use Z(z) with z = x, y, xy, where xy is the concatenation of the x and y (file y appended to file x), to denote the length of the compressed version of string z. The normalized compression distance (NCD) is given by where we eliminate the subscript 'Z' when the compressor Z is understood. A complete mathematical derivation of this distance appeared in [12, 13] and especially in [11] where the name NCD was coined. A brief derivation is given in Section 7. An explanation using a lot of hand-waving is as follows. The compressed version of xy with length Z(xy) can be viewed as the concatenation of the compressed version of string x with the help of string y and the compressed version of string y with the help of string x. Subtracting Z(x) from Z(xy) yields at most the length of the compressed version of y with the help of x, which we may write as Z(y|x), while subtracting Z(y) from Z(xy) yields at most Z(x|y). Assume that Z compresses every string to the shortest string from which it can be reconstructed. Then, we can replace "at most" with "about". The numerator Z(xy) − min(Z(x), Z(y)) consists of the maximum of the two lengths Z(x|y) and Z(y|x). This in turn can be viewed as a distance expressing resemblance. For example, if x is a string and y is the empty string (the string of length 0 without any letters), then this distance is Z(x). Since string x may be larger than string y or vice versa, if we want to determine the similarity of x and y, then we need to normalize. The denominator max(Z(x), Z(y)) is the normalizing factor taking care that the NCD is between 0 and 1 in all cases. The NCD is a family of compression functions parameterized by the used data compressor Z. In a compressor, the "window size" is crucial. Window size is an upper bound on the number of bytes of the input: here, it is essentially the summed number of bytes in the two objects x and y being compared. Common compressors are gzip, which is a Lempel-Ziv compressor with a small sliding window size of 32 kB, bzip2, which is a blocksorting compressor based on the Burrows-Wheeler transform with a larger window size of 256 kB, and PPMZ, which uses an adaptive statistical data compression technique based on context modeling and prediction and with unlimited window size [14] . The objects being compared for similarity must fit in approximately one-half of the window size. In [11] , there are given necessary and sufficient conditions for computing the NCD with a compression program Z. The NCD is an approximation of a concept called NID with the same Formula (1) with Z replaced by the Kolmogorov complexity K (for Kolmogorov complexity, see Section 7). Since K(x) is close to a greatest lower bound on Z(x), and NCD K (x, x) = 0, the NCD Z turns out to be more accurate the better the compressor Z satisfies NCD Z (x, x) = 0 and this also holds for other properties (symmetry, identity, resistance to noise). These properties are to various degrees satisfied by good real-world compressors such as gzip, bzip2, PPMZ, and zpaq. In [15, 16] , the authors systematically investigated how far the performances of compressors gzip, bzip2, and PPMZ satisfy these properties. With Z = gzip usually NCD Z (x, x) is between 1 2 and 1 (very bad). With Z = bzip2, it is lower but nowhere near 0, and NCD PPMZ (x, x) usually is, for example in the genomic experiment of Table S1 in the Supplemental Materials, in between 0.002 and 0.006. We used zpaq as an compression algorithm since [17] "bench marked 430 settings of 48 representative compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on 27 representative FASTA-formatted test data sets including individual genomes of various sizes, DNA and RNA data sets, and standard protein data sets". Figure 1A of the cited paper compares the compressors with respect to their compression ratio: zpaq is the best of the considered general-purpose compressors and competitive with the best considered specialized sequence compressors.We computed the NCD zpac (x, x) with x, the selected SARS-CoV-2 virus isolate serving as standard for the NCDs with other viruses, as can be seen in Section 3, in two cases: • The first 1700 bytes of the selected sequence where the compressed size alone is 445 and paired with itself (concatenated) is 460, yielding an NCD of 0.0337078651685393. • For the full virus sequence, (approximately 28,000 bases) the compressed size alone is 7181 and when paired with itself, it has a compressed size of 7208, yielding an NCD of 0.00375992201643225. The second item shows that, in this respect, zpac is similar to PPMZ (the mtDNA sequences are typically 17,000 bases), while the two items together show that the NCD zpac is not linear. Researchers from the data-mining community noticed that the NCD is a parameterfree, feature-free, data-mining tool. They have tested a simplified version of the NCD on a large variety of sequence benchmarks. Comparing the compression method with 51 major parameter-loaded methods found in the seven major data-mining conferences (SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, PKDD, and PAKDD) in a decade, on every database of time sequences used, ranging from heartbeat signals to stock market curves, they established the clear superiority of the compression-based method for clustering heterogeneous data, for anomaly detection, and competitiveness in the clustering domain data [18] . Many researchers use this method, introduced in [11, 13] , for example [19] [20] [21] [22] . In the Supplemental Material, we treat two examples from [11] , testing this method on known phylogenies before applying it to complex novel data such as those of the SARS-CoV-2 virus. Using the compression method to compute the phylogeny of the SARS-CoV-2 virus, we first obtained a list of over 6500 viruses that we wanted to compare to the SARS-CoV-2 virus, without duplicates, including partially sequenced viruses, and SARS-CoV-2 viruses (all apart from one). A particular SARS-CoV-2 virus isolate served as the standard against which to determine the NCDs of the other viruses. We selected this SARS-CoV-2 virus isolate from the many, at least 15,500, examples available on 17 July 2020 in the GISAID database. We computed for each virus in the list of over 6500 viruses its NCD with the selected SARS-CoV-2 virus isolate. As the compression program, we used zpaq. Subsequently, we ordered the resulting NCDs from the smallest to the largest. The virus causing the smallest NCD distance with the SARS-CoV-2 virus isolate was regarded as the most similar to that virus. We took the 60 viruses which had the least NCDs with the selected SARS-CoV-2 virus isolate and computed the phylogeny of those viruses and the SARS-CoV-2 virus isolate. We then compared the last mentioned virus with 37 close viruses including pangolin ones, to determine the relation between the SARS-CoV-2 virus and the pangolin viruses. We downloaded the data in two parts and stored them on the public repository GitHub. The two data sets were as follows. One data set was obtained from the authors of the machine-learning-approach study [3] and consisted of the 6751 virus sequences used in that paper. There was one SARS-CoV-2 virus among them. The second data set of 66,899 SARS-CoV-2 virus sequences was downloaded from the hCoV19 directory of GISAID [23] on 17 July 2020. After data cleaning, all 21 virus sequences with /2017, /2018, or /2019 in their name, presumably incorrectly classified, were excluded. This last set contains three pangolin viruses with NCDs against the selected SARS-CoV-2 virus of, respectively 0.738175, 0.874027, and 0.873367. All of the GISAID data with /2020 in the directory were SARS-CoV-2 viruses. Because one of them had the registration code "hcov-19", identical to the name of the directory itself, we reduced the data set by removing that sequence: gisaid_hcov-19_2020_07_17_22.fasta.fai hCoV-19 29868 1713295574 80 82. On the sequences initially downloaded from GISAID, we applied a lowercase transformation to each to reduce pointless variability. After that, we computed a histogram of all the characters in the sequence and counted the size of each group. Many sequences contained the base pairs A, C, G, N, T or other letters. We retained the viruses in the list after deduplication and filtering for A, C, G, T. This reduced the GISAID download to a set of unique 15,578 sequences with the known nucleotides A, C, G, and T. Each viral sequence is an RNA sequence and is approximately 30,000 RNA base pairs in size. The total size of all the sequence data together is in the order of two gigabytes. We then looked at whether there was much variation among the SARS-CoV-2 viruses themselves since this may invalidate the NCD distance between the inspected viruses and the selected SARS-CoV-2 virus. The worst NCD against the selected SARS-CoV-2 virus was 0.874027, namely gisaid hcov-19 2020 07 17 22.fasta