key: cord-0726742-bnqmj8v6 authors: Rodríguez-Román, Eduardo; Gibbs, Adrian J. title: Comparisons of the genome of SARS-CoV-2 and those of other betacoronaviruses date: 2020-07-13 journal: bioRxiv DOI: 10.1101/2020.07.12.199521 sha: 7a539696b3b7a770b1fbc808cc0741662f712c1c doc_id: 726742 cord_uid: bnqmj8v6 The genome of SARS-CoV-2 virus causing the worldwide pandemic of COVID-19 is most closely related to viral metagenomes isolated from bats and, more distantly, pangolins. All are of sarbecoviruses of the genus Betacoronavirus. We have unravelled their recombinational and mutational histories. All showed clear evidence of recombination, most events involving the 3’ half of the genomes. The 5’ region of their genomes was mostly recombinant free, and a phylogeny calculated from this region confirmed that SARS-CoV-2 is closer to RmYN02 than RaTG13, and showed that SARS-CoV-2 diverged from RmYN02 at least 26 years ago, and both diverged from RaTG13 at least 37 years ago; recombinant regions specific to these three viruses provided no additional information as they matched no other Genbank sequences closely. Simple pairwise comparisons of genomes show that there are three regions where most non-synonymous changes probably occurred; the DUF3655 region of the nsp3, the S gene and ORF 8 gene. Differences in the last two of those regions have probably resulted from recombinational changes, however differences in the DUF3655 region may have resulted from selection. A hexamer of the proteins encoded by the nsp3 region may form the molecular pore spanning the double membrane of the coronavirus replication organelle (Wolff et al., 2020), and perhaps the acidic polypeptide encoded by DUF3655 lines it, and presents a novel target for pharmaceutical intervention. Sequences were downloaded from the Genbank and GISAID databases. They were edited 22 using BioEdit (Hall, 1999) , aligned using the neighbor-joining (NJ) option of ClustalX 23 (Jeanmougin et al., 1998) , and the maximum likelihood (ML) method PhyML 3.0 (ML) (Guindon 24 and Gascuel, 2003) . Sequences were tested for the presence of phylogenetic anomalies using the 25 full suite of options in RDP4 with default parameters (Maynard-Smith, 1992; Holmes et al., 1999; 26 Padidam et al., 1999; Gibbs et al., 2000; Martin and Rybicki, 2000; McGuire and Wright, 2000; 27 Posada and Crandall, 2001; Martin et al., 2005; Boni et al., 2007; Lemey et al., 2009; Martin et 28 al., 2015) ; anomalies found by four or fewer methods and with greater than 10 -5 random probability 29 were ignored; statistical support for their topologies was assessed using the SH method 30 (Shimodaira and Hasegawa, 1999) . Trees were drawn using Figtree Version 1.3 31 (http://tree.bio.ed.ac.uk/soft ware/figtree/; 12 May 2018) and a commercial graphics package. Patristic distances within trees were calculated using Patristic 1.0 (Fourment and Gibbs, 2006) to 33 convert trefiles to matrices of pairwise branch lengths. 34 Pairs of sequences were individually aligned using the TranslatorX server (Abascal et al., 35 2010; http://translatorx.co.uk). They were then compared using the DnDscan method (Gibbs et 36 al., 2007) , which is a simple heuristic method for scanning aligned sequences, codon-by-codon and codon position-by-position, to identify the NS and S changes that may have occurred 38 converting one codon to the other. NS and S variation is taken to be the sum of the scores for all 39 pairwise position comparisons within that codon. Each comparison involves substituting a 40 nucleotide of one codon with the homologous nucleotide of the other codon and then checking 41 how this affects the amino acid it encodes using the standard genetic code. The process is then 42 reversed, replacing the nucleotides of the second codon of the pair with those of the first, again 43 only one at a time. Thus, there are six possible exchanges between a single codon pair. If, say, the 44 first codon is ACT (Thr) and the second GGA (Gly), then all three nucleotides differ and six out 45 of six changed codons are produced. Substituting of the first position of ACT (Thr) with the first 46 nucleotide of the second codon (GGA) will generate GCT (Ala), a NS change, and similarly 47 substituting of the second position C with the second position G generates AGT (Ser), also a NS 48 change, and the third generates ACA (Thr), a S change. Likewise swopping the second GGA 49 (Gly) generates AGA (Arg), GCA (Ala) and GGT (Gly), which are NS, NS and S changes 50 respectively. In all, the pairwise comparison provides a score of 2/6 S changes, and 4/6 to the NS. Pairs of codons that are identical merely contribute 0/6 to both the total S and NS scores for the 52 window position, and indels are treated as 6/6 NS changes. These calculations make no assumption 53 about the direction of evolutionary change nor of the optimal or most parsimonious path of 54 substitution between two codons. The aim is to assess each of the single possible substitutions 55 indicated by two homologous but different codons. The results for each codon position in the 56 alignment are recorded in a CSV file so that they can be further processed for viewing. The scores 57 used for Fig. 3 , for example, were running (overlapping) sums of 5 codon scores, and thus the 58 NS=5.0 maxima represent five adjacent codons each with a maximum NS score of one. The theoretical isoelectric points of the DUF3655 peptides were calculated using the online 60 ProtParam facility of the ExPASy (Gasteiger et al., 2005 ; https://web.expasy.org/protoparam/). In mid-May 2020 a BLAST search (Altschul et al., 1990) of the Genbank databases was 64 made using the SARS-CoV-2 Wuhan-Hu-1 sequence (NC_045512) as a query, and over 100 65 related full-length genomic sequences were identified. These were downloaded, and two from the 66 GISAID database that had been discussed in reports, were added (Rodríguez-Román and Gibbs, 67 2020). The sequences were aligned using MAFFT with its L option. A Neighbor-Joining (NJ) 69 phylogeny of these sequences identified eight distinct genomic sequences in the SARS-CoV-2 70 lineage, together with eleven others in a more distant divergence that included the SARS-CoV 71 reference sequence (NC_004718), and with an outgroup of ten other coronavirus genomes. These 72 were checked for recombination using the Recombination Detection Program (RDP 4.95) (Martin 73 et al., 2015) . Recombinants were detected in, and between, all betacoronaviruses, but not between 74 them and the outgroup sequences. Eleven were chosen for analysis; all eight from the SARS-CoV- Genes are found in all three reading frames of coronavirus genomes, therefore the 11 78 sequences were aligned using MAFFT-L, and BioEdit (Hall, 1999) was used to create, for each of 79 the eleven, a single concatenated alignment of their open reading frames (i.e. all the genes in the 80 same reading frame). We call these, concats. The concats were aligned, using their encoded amino 81 acids as guide, by the TranslatorX online server (Abascal et al., 2010; http://translatorx.co.uk) with 82 its MAFFT option (Katoh and Standley, 2013) , and further refined by hand resulting in a concat 83 alignment of 29,286 nts. The maximum likelihood (ML) phylogeny of the eleven complete sarbecovirus concats 85 (Fig. 1A) , calculated by the PhyML method, confirmed that they form two lineages diverging from indicating that they resulted from a recombination event that occurred after the crown group 117 diverged from the ZXC21 and ZC45 branch, but before RaTG13 diverged. Confusingly however 118 the second of these recombinant regions was also found in the pangolin G/1/19 concat! The recombination map also shows the complex recombinational history of the spike gene, 120 the position of which is coloured yellow in the simplified genomic map at the top of Fig. 2 . This 121 is confirmed by the phylogeny of that region (Fig. 1C) , which is fully supported statistically except 122 for the SARS-1, Rf4092 and HKU3-8 cluster (mean 0.91 SH). The spike phylogeny has pangolin 123 genes immediately basal to the SARS-2 and RaTG13 twig, and the spike region of YN02 gene is 124 shown to be from the SARS-1 lineage. However, it is essential to realize that, although we know 125 the hosts from which the isolates were collected, other hosts may have been infected en route. (Pybus et al., 2020) and 8 x 10 -4 s/s/y (Resende et al., 2020) . The mean of these rate estimates is 0.99 x10 -3 s/s/y, and, assuming that the virus is 134 evolving at the same rate as the '1-11496' region of its genome, then the mean patristic distances Finally, we compared the concat sequences directly in pairs, not only to identify any 147 regions that were evolving abnormally, but also to confirm the recombination map patterns shown 148 in Fig. 2 . We used the DnDscan method (Gibbs et al., 2007 -see Methods) as this enables simple 149 visual comparisons to be made, as well as numerical. Fig. 3 shows the synonymous (S -blue) and 150 non-synonymous (NS -gold) differences in five of 45 pairwise possible comparisons of eleven concats. It can be seen that S differences occur throughout most of the comparisons, but NS 152 differences are most obvious in three regions of the genomes. There are slightly fewer S 153 differences between SARS-2 and YN02 than between SARS-2 and RatG13 or between YN02 and 154 RaTG13, and this confirms the phylogenetic tree (Fig. 1B) ; it shows that SARS-2 is closest to 155 YN02 as the total DnDscan scores for the '1-11496' regions of SARS-2 v YN02 are S 100.0 NS 156 27.5, but, for the other combinations, S134.0 NS 30.1 and S131.5 NS37.6, respectively. The 157 largest NS differences are in the DnDscans of the spike protein gene, especially its RBD region 158 and an adjacent "-PRRA-" insertion (Andersen et al. 2020) . Again, the recombination map results 159 (Fig. 2) are confirmed by the DnDscan as the spike region of the SARS-2 x RaTG13 comparison, 160 especially its 5' end, has few NS differences, as they share an 'unknown' recombinant (nts 21257 -161 22152). There are also two other regions of the concats consistently showing larger numbers of NS 163 differences. One is centred on the 'Domain of Unknown Function' (DUF) 3655 region of the nsp3, 164 a "disordered binding region" (Prates et al. 2020) , that is N' terminally adjacent to the ADP-ribose 165 phosphatase. This region of increased NS differences was found to some extent in all concat 166 comparisons suggesting that its differences result from evolution/selection, whereas the other, the 167 ORF 8 region near the 3' end of the genome, was not found in some comparisons, such as SARS-168 2 x RaTG13, and may therefore have resulted from recombination. The DUF3655 region is 169 discussed below. We have discombobulated the recombinational and mutational history of the SARS-2 lineage 173 of betacoronaviruses and their metagenomes using the published genomic sequences, despite the 174 possibilities, in this metagenomic age, of the sort of problems outlined by Chan and Zhan (2020) . 175 We have shown that the 5' third of their genome is largely free of recombinant regions, n-rec, 176 whereas the remainder is a mélange of recombinant regions from various 'parental' genomes. The RatG13, ZXC21 and ZC45 all of which come from bats, and in that phylogeny SARS-2 is more 182 closely related to YN02 than RaTG13. This is confirmed by the DnDscan comparisons of the 183 three viruses. YN02, however, has a recombinant region in its 3' half (nts 21098-24042) of 184 'unknown' parentage, but which is probably close to the pangolin virus G/1/19, and which is not 185 present in SARS-2 or RaTG13. Thus, most of the SARS-2 concat, especially its 5' 39%, is closest 186 to the homologous regions of YN02, but the intact concats of SARS-2 and RaTG13 are more where their 'structural genes' are translated, and together they are assembled to form progeny 204 virions (Hsin et al., 2018) . Table 2 shows the DUF3655 peptides of the eleven betacoronaviruses 205 with the acidic and basic residues outlined with different colours; acidic residues in red, and the 206 few basic residues in blue, and with the theoretical pI of these peptides ranging from 3. 01 -3.40, 207 in sharp contrast to the nucleocapsid protein encoded by ORF9 which binds the progeny genomes 208 in the cytosol and has a pI of 10.07 (McBride et al., 2014; Verheije et al., 2010) . Table 2 termini that are more variable in length and composition. The fact that the DUF3655 protein is so 213 acidic indicates its likely function in the pore where it may both electrostatically stabilize the lumen 214 of the pore (Desikan et al., 2020) and ensure that long negatively charged nucleic acid molecules, 215 like progeny viral genomes, are held centrally in the lumen of the pore as they pass through. The FFPred Prediction database of PSIPRED (Cozzetto et al., 2016) found that the most 217 likely "biological process" of SARS-2, YN02 and RaTG13 that involves their DUF3655 proteins 218 is "regulation of metabolic process" (mean probability 0.978) and "regulation of gene expression" 219 (0.908), their "molecular function" is "nucleic acid binding" (0.966) and "DNA binding" (0.890) 220 and their "cellular compartment" is "membrane" (0.785). The DUF3655 region seems to have evaded virological, medical and pharmaceutical 222 scrutiny so far (e.g. Chen and Zhong, 2020; Wei et al., 2020) . We suggest that it is probably 223 involved in a unique rate-limiting step of the coronavirus replicative cycle, and may make CoV 224 infections susceptible to drugs, like chloroquine, that increase cellular pH 225 (https://www.sciencemediacentre.org/expert-reaction-to-questions-around-potential-treatments-226 for-covid-19/ March 18 2020). The detailed analysis of this region, specially from residues 9 to 227 27, which have many negatively charged amino acids (Asp and Glu) ( Table 2) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations Basic local alignment search tool The proximal origin of SARS-CoV-2 The role of the nsp2 and nsp3 in its pathogenesis Global patterns in coronavirus diversity An exact nonparametric method for inferring mosaic structure in sequence triplets The PSIPRED Protein Analysis Workbench: 20 years on Evolution and epidemic spread of SARS-CoV-2 in Brazil Single source of pangolin CoVs with a near identical Spike RBD to SARS-CoV-22 Genomics functional analysis and drug screening of SARS-CoV-2 FFPred 3: feature-based function prediction for all Gene Ontology domains. Sci Rep An In silico Algorithm for Identifying Amino Acids that Stabilize Oligomeric Membrane-Toxin Pores through Electrostatic Interactions bioRxiv preprint doi Temporal signal and the phylodynamic threshold of SARS-CoV-2 Analyses of evolutionary dynamics in viruses are hindered by a time-dependent bias in rate estimates Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA PATRISTIC: a program for calculating patristic distances and graphically comparing the components of genetic change Protein identification and analysis tools on the ExPASy server The variable codons of H3 influenza A virus haemagglutinin genes Sister-Scanning: a Monte Carlo procedure for assessing signals in recombinant sequences Coronaviridae Study Group of the International Committee on Taxonomy of V (2020) The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2 The coronavirus proofreading exoribonuclease mediates extensive viral recombination bioRxiv preprint doi A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT Phylogenetic evidence for recombination in dengue virus Nucleocapsid protein-dependent assembly of the RNA packaging signal of Middle East respiratory syndrome coronavirus Multiple sequence alignment with Clustal X MAFFT multiple sequence alignment software version 7: improvements in performance and usability Origin and cross-species transmission of bat coronaviruses in China Identifying recombinants in human and primate immunodeficiency virus sequence alignments using quartet scanning Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection Short Title: Recombination and origin of SARS-CoV-2 bioRxiv preprint doi Major Concerns on the Identification of Bat Coronavirus Strain RaTG13 and Quality of Related Nature Paper RDP4: Detection and analysis of recombination patterns in virus genomes RDP: detection of recombination amongst aligned sequences A modified bootscan algorithm for automated identification of recombinant sequences and recombination breakpoints Analyzing the mosaic structure of genes Review The Coronavirus Nucleocapsid Is a Multifunctional Protein TOPAL 2.0: Improved detection of mosaic sequences within multiple alignments Crystal structures of SARS-CoV-2 ADP-ribose phosphatase (ADRP): from the apo form to ligand complexes Possible emergence of new geminiviruses by frequent recombination Evaluation of methods for detecting recombination from DNA sequences: computer simulations Functional Immune Deficiency Syndrome via Intestinal Infection in COVID-19 Preliminary analysis of SARS-CoV-2 importation & establishment of UK transmission lineages Genomic surveillance of SARS-CoV-2 reveals community transmission of a major lineage during the early pandemic phase in Brazil Ecology and Evolution of Betacoronaviruses The trouble with sliding windows and the selective pressure in BRCA1 Multiple comparisons of log-likelihoods with applications to phylogenetic inference The Coronavirus Nucleocapsid Protein Is Dynamically Associated with the Replication-Transcription Complexes Synonymous mutations and the molecular evolution of SARS-Cov-2 origins bioRxiv preprint doi Genome-wide CRISPR screen reveals host genes that regulate SARS-CoV-2 infection bioRxiv preprint doi A molecular pore spans the double membrane of the coronavirus replication organelle bioRxiv preprint doi A new coronavirus associated with human respiratory disease in China Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins Zoonotic origins of human coronaviruses A pneumonia outbreak associated with a new coronavirus of probable bat origin