key: cord-0882256-hp6kt4ud authors: Kurpas, Monika; Jaksik, Roman; Kimmel, Marek title: Evolutionary analysis of genomes of SARS-CoV-2-related bat viruses suggests old roots, constant effective population size, and possible increase of fitness date: 2022-03-01 journal: bioRxiv DOI: 10.1101/2022.02.28.482287 sha: 0b0393716204fc42e19063d36afdcb77bb599378 doc_id: 882256 cord_uid: hp6kt4ud It is of vital practical interest to understand the co-evolution of bat β-coronaviruses with their hosts, since a number of these most likely crossed the species boundaries and infected humans. Complete sequences of 47 consensus genomes are available for bat β-coronaviruses related to the SARS-CoV-2 human virus. We carried out several types of evolutionary analyses using these data. First, using the publicly available BEAST 2 software, we generated phylogenetic trees and skyline plots. The roots of the trees, both for the entire sequences and subsequences coding for the E and S proteins as well as the 5’ and 3’ UTR regions, are estimated to be located from several decades to more than a thousand years ago, while the effective population sizes remained largely constant. Motivated by this, we developed a simple estimator of the effective population size in a Moran model with constant population, which, under the model is equal to the expected age of the MRCA measured in generations. Comparisons of these estimates to those produced by BEAST 2 shows qualitative agreement. We also compared the site frequency spectra (SFS) of the bat genomes to those provided by the Moran Tug-of-War model. Comparison does not exclude the possibility that overall fitness of the bat β-coronaviruses was increasing over time as a result of directional selection. Stability of interactions of bats and their viruses was considered likely on the basis of specific manner in which bat immunity is tuned, and it seems consistent with our analysis. It seems to be a predominant belief that the recent β-coronavirus (Severe Acute Respiratory Syndrome Coronavirus 2, SARS-CoV-2), which causes acute respiratory disease COVID-19, is of zoonotic origin [23] . Other hypotheses, including genetic engineering of the virus, are considered controversial in the research community [2] . For previous epidemics such as the Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) caused by β-coronaviruses, it has been proven that the reservoir of viruses capable of infecting humans is, among other, the population of bats [23] . It is an interesting question, what is the nature of the interaction of the population of the viruses with bat host. In this study, we used consensus genomic sequences of bat β-coronaviruses to infer the pattern of their evolution and to estimate the relevant parameters. The simulation analysis was supported by a phylogenetic analysis carried out using the Bayesian Evolutionary Analysis by Sampling Trees 2 (BEAST 2) [4] software, which allows the creation of a phylogenetic tree, estimation of the age of the most recent common ancestor (MRCA) of the studied sequences and estimation of size of the studied population using the Bayesian Skyline algorithm. We cross-checked the sophisticated and comprehensive BEAST 2 estimates by using the maximum parsimony analysis [11, 12] . We also developed estimates of mutation rate and effective population size under the hypothesis of neutrality and under sampling that provided genomes captured at different dates. We also proposed to use a version of the Tug-of-War model introduced by McFarland [18] , to model the site frequency spectra (SFS) of the population of viruses and, by comparison with the data, to further refine conclusions from phylogenetic analysis. In addition, we carried out the analyses using the 3' and 5' UTR regions which may include regulatory elements and the genes coding for the envelope (E) and spike (S) proteins. This latter is a glycoprotein that binds to the ACE2 receptor present in the cytoplasmic membrane of the host cells and thus allows the virus to penetrate them [22] . The analysis was carried out using 47 nucleotide sequences of SARS genomes isolated from bat hosts, downloaded from the National Center for Biotechnology Information (NCBI) Virus database [1] (full list of sequences can be found in Supplementary Information). Samples were dated from the period 2004-2017. The sequences were aligned using MUSCLE multiple sequence aligner [9, 8] based on which we assessed the variability at each genomic position relative to the consensus sequence. For each variant we calculated the fraction of sequences in which an alternative allele was observed, and summarized the data for each gene. Gene positions inside the created consensus sequence were determined using local sequence alignment, based on coordinates of SARS genes obtained from the Reference Sequence (RefSeq) database under accession number NC 004718.3. RefSeq database, created by NCBI, is an open access, annotated and curated collection providing single records for each natural biological molecule (DNA, RNA or protein) for major organisms including viruses, bacteria and eukaryotes. Based on aligned samples, we deduced the ancestral sequence; see further on. Mutation frequency study was carried out in the R computer language, using seqinr, msa and Biostrings libraries and visualized using Gviz and ggplot2. Individual parts of the genome (3'UTR, 5'UTR regions; envelope and spike gene coding fragments) were cut from the alignment based on Multiple Sequence Alignment (MSA) annotations (see Supplementary Information) . The length of the complete sequence in the alignment is 30821 nucleotides. BEAST 2 (Bayesian Evolutionary Analysis Sampling Trees 2) is a cross-platform program for Bayesian phylogenetic analysis of molecular sequences. It uses Markov chain Monte Carlo (MCMC) method to estimate rooted, time-measured phylogenies. BEAST 2 uses MCMC algorithm to average over tree space, so that each tree is weighted proportional to its posterior probability. BEAST's graphical user interface, BEAUti (Bayesian Evolutionary Analysis Utility), allows to specify the inference parameters and create .xml files used for calculation. We performed analyses using the following parameters in corresponding BEAUti booktabs: • Partitions -allowing browsing uploaded nucleotide/amino acid sequences and division into partitions corresponding to reading frames. In our case nucleotide sequences were either not partitioned (3'UTR, 5'UTR regions) or partitioned (gene E and S; whole genome) in accordance with the assumption of higher mutation rates of third nucleotide in the codon (split {1,2} + 3) with an appropriate reading frame. We linked trees and molecular clocks of both partitions, to generate one set of phylogenetic trees. Site models remained independent. • Tip Dates -allowing appropriate collection date assignment for each sequence. • Site Model -allowing choosing substitution model for nucleotide/amino acid sequence. We used HKY (Hasegawa, Kishino, and Yano [16] ) substitution model with estimated substitution rate and empirical nucleotide frequencies. • Clock Model -allowing choosing clock model type. In our inference we used the strict clock model. We analyzed the output from BEAST 2 using Tracer (version 1.7.1) [20] , which graphically and quantitively summarizes the distributions of continuous parameters and provides diagnostic information. We used Tracer also to obtain data for Bayesian Skyline plots. [11] . This program carries out the unrooted maximum parsimony tree-making algorithm, analogous to the Wagner trees [17] , on DNA sequences. The method of Fitch [13] was used to count the number of base changes needed for a given tree. In the maximum parsimony method the phylogenetic tree that minimizes the total number of character-state changes is sought. The assumptions of Dnapars method [10] are briefly listed. Each site evolves independently. Different lineages evolve independently. The probability of a base substitution at a given site is small over the lengths of time involved in a branch of the phylogeny. The expected amounts of change in different branches of the phylogeny do not vary by so much that two changes in a highrate branch are more probable than one change in a low-rate branch. The expected amounts of change do not vary enough among sites that two changes in one site are more probable than one change in another. ' The input for this method was the MSA described earlier on. All parameters were left unchanged except the parameter 5: "Print sequences at all nodes of tree". As a result we obtained the sequences at tree nodes and the ancestral sequence. The most-parsimonious tree often underestimates the evolutionary change that has occurred. In order to make tree estimation more reliable we used the bootstrap method seqboot from PHYLIP package. As an input MSA was used. As an output the set of 100 bootstrapped samples was obtained. In the next step it was used in the DNApars method in order to obtain tree estimates with the parameter "Analyze multiple data sets" enabled. In the final step we checked the support for specific forks in consensus tree using Consense algorithm from PHYLIP package. The analysis was performed using "Majority Rule extended" parameter. We used for simulations a version of the Moran Model, inspired by the Tug-of-War model of McFarland [18] . Consider a population of a fixed number N of individuals, each of them characterized by a pair of integers γ i = (α i , β i ), corresponding to the numbers of drivers and passengers in its genotype, respectively. This pair determines the fitness x i of the i-th individual by the formula where s > 0 and d ∈ (0, 1) are parameters describing selective advantage of driver mutations over passenger mutations. This model is a version of the "Tug-of-War" model introduced by McFarland [18] . We provide hypotheses for the complete version. In this paper we will also use the neutral version, with The entire population may be identified with the vector The accompanying vector of fitness values determines the future evolution of the population, seen as being under drift and selection pressure, as follows: The time to the first selection/drift event for the entire Γ is exponentially distributed with After this time, one individual dies and is replaced by an exact copy of one of the remaining individuals, the probability that the i-th individual dies and is replaced by the j-th (j = i) being x j (N −1)Σx . The process then continues with Γ modified by replacing its i-th coordinate by its j-th coordinate. Moreover, each individual may, after an independent exponentially distributed time with parameter λ, and independently of other individuals, undergo a mutation event, changing its state to either (α + 1, β) or (α, β + 1) with probabilities p ∈ (0, 1) and q = 1 − p, respectively. Mathematical properties of this model are subject of separate publications, in preparation. Here, we used stochastic simulations to explore predictions of this model. The following list (see also Fig. 1 ) summarizes the notation used. Expectation of the pairwise difference under assumption that l = k equals to since the expected count of mutations in time T lk equals µT lk , while under Moran model and the ISM the expected time to the MRCA equals N , so that the expected count of pairwise differences between k and l, equals 2N µ. Given sample values of T k , k = 1, 2, . . . , n and d lk , 1 ≤ k < l ≤ n, we obtain estimating equations for µ and 2N , by minimizing the sum of squares Differentiating with respect to µ and 2N and equating the derivatives to 0, we further obtain where Σd = 1≤k