title: Phylogenetic Study of 2019-nCoV by Using Alignment Free Method (Evolutionary Bifurcation of Novel Coronavirus Mutants)
authors: Gao, Yang; Li, Tao; Luo, Liaofu
date: 2020-03-03

The phylogenetic tree of SARS-CoV-2 (nCov-19) viruses is reconstructed from the similarity of their genome sequences. The tree topology of Betacoronavirus is remarkably consistent with biologists' systematics. Because the tree construction retains enough information about virus mutants, it is suitable for studying the evolutionary relationships among the novel coronavirus mutants transmitted among humans. The emergence of 14 main mutants is studied, and these strains can be classified into eight bifurcations of the phylogenetic tree. Three types of virus mutation are found: the mutation among sub-branches of the same branch, the off-root mutation, and the root-oriented mutation between large branches of the tree. From the relation between viral mutation and host selection, we find that individuals with low immunity provide a special environment for the positive natural selection of virus evolution. This gives a mechanism that explains why large mutations between two distant branches generally occur in the nCov-19 phylogenetic tree. The finding is helpful for formulating strategies to control the spread of COVID-19.

The rapidly expanding genetic diversity of SARS-CoV-2 (nCov-19) viruses provides a great deal of information about virus mutation. GISAID introduced a nomenclature system for major clades based on marker mutations [1]. GISAID clades were then augmented with more detailed lineages assigned by the Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO lineage) tool [2].
On 31 May 2021 the World Health Organization (WHO) announced that letters of the Greek alphabet would be used to refer to the main COVID-19 variants [3]. More than ten variants (Alpha, Beta, Gamma, Delta, Kappa, Epsilon, Theta, Iota, Zeta, Lambda, Omicron, etc.) have now spread widely among humans. What is their evolutionary relationship? The nCov-19 viruses spreading in humans are thought to belong to Sarbecovirus of Betacoronavirus, which points to a possible bat origin [4] [5] [6] [7] [8] [9]. However, there should exist some intermediate animal hosts between the bat origin and the virus spreading in humans. Although recent work found that the genomes of Malayan pangolin coronaviruses are highly similar to that of nCov-19, little is known about which animal is the real intermediate host [10] [11] [12] [13]. We set this problem aside for the time being and focus in this paper on how nCov-19 mutates in humans. We hope this will shed light on tracing the source of nCov-19 and searching for its intermediate host.

For the phylogenetic calculation we start from information theory, with an emphasis on the base correlation property of the genome sequence. We previously proposed the IC-PIC method and used it to deduce the phylogenetic tree of Betacoronavirus [14] [15] [16]. A careful analysis of the early 2019-nCoV tree was given in [16]. However, limited by the virus data available at that time, we were unable to study the various mutants of the novel coronavirus. In this paper we use the IC-PIC method for phylogenetic analysis based on a wealth of virus mutation data, to reveal how the novel coronavirus mutates to produce evolutionary bifurcations. We have suggested that the average mutual information (AMI) and the k-departed base correlation can be regarded as the signature of a given genome sequence [17].
The average mutual information is called the information correlation (IC), defined by

$$ IC_k = \sum_{i,j} p_{i(k)j} \log_2 \frac{p_{i(k)j}}{p_i \, p_j} $$

and the k-departed base correlation is called the partial information correlation (PIC), defined by

$$ PIC_{ij}(k) = p_{i(k)j} - p_i \, p_j $$

where $p_i$ is the probability of base i in the sequence and $p_{i(k)j}$ is the joint probability of the base pair ij separated by distance k (k = 0, 1, 2, …). In the following we study the SARS-CoV-2 phylogeny using the IC-PIC algorithm based on the above set of genome sequence signatures.

The nCov-19 virus genomes used in our analyses were downloaded from the GISAID platform (https://gisaid.org). Each genome sequence is converted into an IC-PIC matrix with 17 rows (1 IC for a given k and 16 PICs for the different base-pair categories) and d columns (the distance k between the two bases of a pair, k = 0, 1, …, d-1). The only parameter in the algorithm is the range of d, denoted K. K is determined from the best-fit construction of the tree. In general the deduced tree changes with K and becomes stable at some large value. The work is carried out on the IC-PIC web server [18]. After the input data are uploaded in FASTA format, the K value is set, and the Neighbor-Joining (NJ) option is chosen, the server runs the program and deduces a phylogenetic tree for each d (d = 1 to K in steps of 1). The evolutionary distance between any two genomes is calculated as the Euclidean distance between their respective 17×d IC-PIC matrices, and an unrooted NJ tree is then generated. Finally, the K phylogenetic trees are combined into a consensus tree. All trees were constructed with the NEIGHBOR and CONSENSE programs in the PHYLIP package [19]. The robustness of the tree topology was estimated by branch support.

The whole-genome-based phylogenetic tree for nCov-19 deduced by the IC-PIC method is given in Fig 1. The sequence data of 150 viruses are used to reconstruct the tree.
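As an illustration of the signature and distance described above, the following Python sketch builds a 17×d IC-PIC matrix for one sequence and computes the Euclidean distance between two such matrices. It is not the IC-PIC web server's code. It assumes that IC is the standard average mutual information, sum over base pairs of p_i(k)j log2(p_i(k)j / (p_i p_j)), that PIC is the difference p_i(k)j - p_i p_j, that k = 0 means adjacent bases, and that pairs containing a non-ACGT symbol are skipped. The function names `ic_pic_matrix` and `ic_pic_distance` are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
PAIRS = [a + b for a, b in product(BASES, repeat=2)]  # 16 base-pair categories


def ic_pic_matrix(seq, d):
    """Return a 17 x d matrix as a list of rows.

    Row 0 holds IC for k = 0..d-1; rows 1..16 hold PIC for the pairs
    AA, AC, ..., TT. Assumption: k counts the bases lying between the
    two members of a pair, so k = 0 means adjacent bases.
    """
    seq = seq.upper()
    valid = [b for b in seq if b in BASES]
    if not valid:
        raise ValueError("sequence contains no A, C, G or T")
    # Single-base probabilities p_i over the unambiguous bases.
    p = {b: valid.count(b) / len(valid) for b in BASES}
    matrix = [[0.0] * d for _ in range(17)]
    for k in range(d):
        gap = k + 1  # index offset between the two bases of a pair
        counts = dict.fromkeys(PAIRS, 0)
        for x, y in zip(seq, seq[gap:]):
            if x in BASES and y in BASES:  # skip ambiguous symbols (assumption)
                counts[x + y] += 1
        total = sum(counts.values())
        if total == 0:
            raise ValueError("sequence too short for k = %d" % k)
        for row, pair in enumerate(PAIRS, start=1):
            p_ij = counts[pair] / total          # joint probability p_i(k)j
            indep = p[pair[0]] * p[pair[1]]      # p_i * p_j
            matrix[row][k] = p_ij - indep        # PIC entry
            if p_ij > 0:
                matrix[0][k] += p_ij * math.log2(p_ij / indep)  # IC term
    return matrix


def ic_pic_distance(m1, m2):
    """Euclidean distance between two IC-PIC matrices of equal shape."""
    return math.sqrt(
        sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    )
```

Under these assumptions, calling `ic_pic_distance` on every pair of genomes for a fixed d gives one distance matrix; repeating this for d = 1 to K would give the K distance matrices from which the NJ trees and the consensus tree are built.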
The consensus tree is derived from 50 trees (K=50) based on IC-PIC matrices. The robustness of the tree topology was estimated by branch support. The NJ consensus tree is drawn with avian viruses as the out-group. A comparison between K=50 and larger K shows that the deduced tree is already stable at K=50.

2) The spatio-temporal localization of mutants, their dynamical variation, and their evolutionary trajectory. L, S, and V are early amino acid mutants. They are thought to have no important amino acid mutation effects. The D614G mutation in the spike protein, which clearly increases the infectivity of the virus, was first reported in Italy in February 2020. The mutant is named G. Then, in May 2020, the L452R mutation, which has a strong ability of immune escape, was reported in the USA. The mutant is named Epsilon or GH. From GH to Omicron there are 12 important new mutants, given in Fig 2. Note that each of these mutants appeared in a particular area, because geographical isolation plays a role in species formation. Each of them also appeared in a particular time interval of 2020, apart from two (Theta and [...]

[...] needs external selection, and the selection comes from incomplete immunity of the host (weak immunity to the virus). Therefore, from the relation between mutation and selection, we infer that immune escape mutations that are actually selected can only arise in hosts with incomplete immunity [21]. The point that individuals with low immunity provide a special environment for virus evolution has been tested experimentally. As early as the spread of Alpha in the UK, nCov-19 evolution during the treatment of chronic infection was studied [22]. It was observed that the [...]

Third, the relation between virus mutation and selection in humans is discussed. We found that individuals with low immunity provide a special environment for virus evolution.
Since the conditions of selection for virus evolution in humans might be very different from the usual pattern, large mutations between two distant branches, and even root-oriented mutations, can occur.

The above conclusions are helpful for formulating a COVID-19 prevention and control strategy. The nCov-19 mutations occurred mainly in 2020, but they continued in 2021. There is no evidence that the COVID-19 pandemic is coming to an end with Omicron. On the contrary, the longer the disease spreads, the more likely the virus is to mutate. Since individuals with low immunity provide positive selection for virus mutation, the international community should pay special attention to the health of these people in order to minimize mutation of the virus. More noteworthy still, asymptomatic infected persons may have incomplete immunity [24], and they need to be vaccinated to strengthen their immunity.

The schematic diagram is deduced from the tree topology of Fig 1 for 115 human nCov-19 viruses. The number written at each node indicates the number of bifurcations. The number given in parentheses indicates the number of viruses that the branch contains.
Phylogenetic Clustering by Linear Integer Programming (PhyCLIP)
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
SARS-CoV-2 Variants of Interest and Concern naming scheme conducive for global discourse
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus
Genomic and protein structure modeling analysis depicts the origin and infectivity of 2019-nCoV
Complete genome characterization of a novel coronavirus associated with severe human respiratory disease in Wuhan, China
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Isolation and Characterization of 2019-nCoV-like Coronavirus from Malayan Pangolins
Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China
Phylogenetic network analysis of SARS-CoV-2 genomes
Decoding evolution and transmissions of novel pneumonia coronavirus using the whole genomic data
Coronavirus phylogeny based on base-base correlation
Genome-based phylogeny of dsDNA viruses by a novel alignment-free method
Phylogenetic study of 2019-nCoV by using alignment-free method
The average mutual information profile as a genomic signature
PHYLIP-Phylogeny inference package (ver. 3.69)
Immune system modulation and viral persistence in bats
Neutralization of variant under investigation B.1.617 with sera of BBV152 vaccinees
SARS-CoV-2 evolution during treatment of chronic infection
Prospective mapping of viral mutations that escape antibodies used to treat COVID-19

We gratefully acknowledge the authors and the originating and submitting laboratories of the sequences from GISAID's EpiFlu(TM) Database on which this research is based.