key: cord-0995061-9olc3cmu authors: Dimonaco, Nicholas J.; Salavati, Mazdak; Shih, Barbara title: Hacking the diversity of SARS-CoV-2 and SARS-like coronaviruses in human, bat and pangolin populations date: 2020-11-24 journal: bioRxiv DOI: 10.1101/2020.11.24.391763 sha: 3104fbee1b2b64f41ae935664d3ad33037badba7 doc_id: 995061 cord_uid: 9olc3cmu In 2019, a novel coronavirus, SARS-CoV-2/nCoV-19, emerged in Wuhan, China, and has been responsible for the current COVID-19 pandemic. The evolutionary origins of the virus remain elusive and understanding its complex mutational signatures could guide vaccine design and development. As part of the international “CoronaHack” in April 2020 (https://www.coronahack.co.uk/), we employed a collection of contemporary methodologies to compare the genomic sequences of coronaviruses isolated from human (SARS-CoV-2;n=163), bat (bat-CoV;n=215) and pangolin (pangolin-CoV;n=7) available in public repositories. Following de novo gene annotation prediction, analysis on gene-gene similarity network, codon usage bias and variant discovery were carried out. Strong host-associated divergences were noted in ORF3a, ORF6, ORF7a, ORF8 and S, and in codon usage bias profiles. Lastly, we have characterised several high impact variants (inframe insertion/deletion or stop gain) in bat-CoV and pangolin-CoV populations, some of which are found in the same amino acid position and may be highlighting loci of potential functional relevance. Background 24 The continued and increasing occurrence of pandemics that threaten worldwide public health due 91 We were able to collate 215 bat-CoV genomes of varying families (Alphacoronaviruses and Beta-92 coronaviruses) with only one exhibiting a small proportion or genomic uncertainty (presence of 93 0.45% 'N' nucleotide). However, only 7 pangolin-CoV genomes, of which 5 were annotated as Be-94 tacoronaviurs, were available at the start of this study. 3 pangolin-CoV genomes also contained 95 levels of the ambiguous 'N' nucleotide, two of them at high levels (6. 88 The tree produced was used as an analytical anchor for which we could use to refer to in the 105 codon usage bias and variant analysis. High impact variants and codon usage clusters were plotted 106 on the tree to show their distribution across the different clades along the topology of the tree. Wuhan 7 11 11 13 46 Charite 9 11 11 12 117 Bat 2 9 9 12 215 Pangolin 10 11 12 17 7 Table 1 . This table presents the distribution of the number of predicted genes for each dataset. Bat-CoV exhibit the widest distribution of gene count, and pangolin-CoV has the highest number of gene count, with one genome having 17 predicted genes. These outliers have low sequence or assembly quality. In the case of the pangolin-CoV genome reporting 17 genes, it has low quality ('NNNN') nucleotide regions spanning the centre of genes, which causes PROKKA to identify the two ends of one gene. The median gene count only varying in bat-CoVs, likely attributed to the large phylogenetic variation exhibited across the bat genomes. BLAST was utilised in attempts to capture a number of genes with strong homology to the SARS- There also appears to be some degree of genera and species separation for bat hosts. The majority of the Sarbecovirus affect the bat genus Rhinolophus (column b, light blue, dark blue and purple), whereas a much smaller proportion of the Alphacoronavirus are found in bats of this genus. Some clades overlap with specific bat species, including Rhinolophus ferrumequinum, Rhinolophus sinicus and Scotophilus kuhlii. The results from the analysis made in later parts of this study are also highlighted, including c) codon usage bias clusters, d-f) high impact variants with multiple variants are found in the same amino acid position, g-j) other high impact variants with a single amino acid change found in > 10 genomes, k-l) other high impact variants. found in coronaviruses that can promote ribosomal frameshifting Baranov et al. (2005) . The gene 127 sequences generated by PROKKA and BLAST (E and ORF10) were used for downstream analysis, 128 including gene-gene network graph, codon usage bias analysis, and a gene-presence summary A gene-gene similarity network analysis was used to compare genes across SARS-CoV-2, bat-CoV 135 and pangolin-CoV. The advantage of using a 3D network approach to visualise this information was 136 that it simplifies complex information as patterns. Genes sharing high similarity form independent 137 clusters. In cases where there is a high degree of dissimilarity in a gene for different host species, 138 a pattern of 2 or more distinct clusters would take place, with each cluster comprised of genes 139 derived from samples of the same host-species. In genes where there is a medium level of dissimi-140 larity across host-species, two or more cluster would appear fused and potentially break apart into 141 distinct clusters if the edge threshold were increased. Both of these patterns are observed within 142 this dataset. Distinct separation by host species are seen in ORF1a, ORF3a,ORF6, ORF7a, ORF8 143 and S (Figure 2 ). The strongest host-species separation observed were between SARS-CoV-2 and 144 bat-CoV; pangolin-CoV always group closer to SARS-CoV-2 than to bat-CoV. In the cases of ORF3a, 145 ORF8 and S, complete separation was observed between bat-CoV and human SARS-CoV-2 ( Figure 146 2B & C). One bat-CoV genome, RaTG13, was more similar to SARS-CoV-2 and pangolin-CoV than 147 the remainder of the bat-CoV for S (2C). For ORF3a, three bat genomes (MG772933, MG772934 148 and MN996532; named bat-SL-CoVZC45, bat-SL-CoVZXC21 and RaTG13 respectively) clustered to-149 gether with SARS-CoV-2 and pangolin-CoV rather than with the remainder of the bat genomes ( Fig-150 ure 2). These same three genomes are the only bat-CoV with ORF8 that co-cluster with SARS-CoV-2 ORF8 under the percentage identity threshold (≥80%) set for building the network graph. Other 152 bat-CoV ORF8 were so distinct from SARS-CoV-2 ORF8 that they do not co-cluster, even when edge 153 filtering was removed. To investigate whether if potential gene transfer or recombination that may have come from 155 more distantly related bat-CoV, we sought for unusual co-clustering between genes characterised 156 from bat-CoV and SARS-CoV-2. We did not observe such pattern; RatG13 co-cluster with SARS-CoV-157 2 for many genes, but it is also the most similar bat-CoV to SARS-CoV-2 at a genome level. In summary, the use of gene-gene network analysis enables us to determine groups of closely 165 related genes, which not only highlights genes showing strong host-species separation, but also 166 characterise clusters of related genes that may be absent or highly different from the reference 167 genome of interest, such as ORF8. 6 genes, ORF1ab, ORF3, ORF6, ORF7a, ORF8 and S, showed a 168 strong host-species separation in the network graph. In particular, with the exception of S, where 169 bat-SL-CoVZC45, bat-SL-CoVZXC21 clustered closer to bat-CoVs, the bat genomes, bat-SL-CoVZC45, 170 bat-SL-CoVZXC21 and RaTG13, clustered together with SARS-CoV-2 than the remainder of the bat-171 CoV for these 5 genes. 172 Figure 2 . Gene-gene similarity network analysis. Each node represents a gene defined by PROKKA or a DNA segment similar to genes from the SARS-CoV-2 reference genome. The nodes were compared against each other using BLAST, and nodes with high similarity (BLAST score ≥ 60 and a query coverage ≥ 80%) were connected with an edge. The network graph is labelled with host species. The black font in the graph indicates the corresponding SARS-CoV-2 gene names ("ORF" omitted) for the larger clusters, whereas blue font indicate additional non-coding sequences defined by PROKKA. Instead of the full length ORF1ab ( 21k in length), ORF1a and ORF1b were defined by PROKKA as two separate genes. Notably ORF1a, ORF3a, ORF6, and ORF8 and S, show strong separations between nodes from different species. ORF8 from 3 bat-CoV co-cluster with ORF8 from SARS-CoV-2 (RaTG13, bat-SL-CoVZC45 and bat-SL-CoVZXC21 respectively). The remaining bat-CoV ORF8 do not co-cluster with SARS-CoV-2 ORF8 even without the edge filtering threshold. For S, the bat-CoV RaTG13 co-cluster with COVID-19 and pangolin. A cluster of bat-CoVs break off for ORF1b and M, suggesting a large amount of variation amongst bat-CoV for these genes. 173 In our exploratory analysis during the Hackathon event, we attempted to capture gene-level ex-174 pression evidence for each of the predicted ORFs. However, following the event, we recognise that 175 RNA virus gene expression cannot be captured through standard RNAseq analysis pipeline. We During the 5 day hackathon, we endeavoured to utilise the genomic data aggregated by the sci-229 entific community and undertook a multifaceted and comprehensive exploration of the genomic 230 sequences (or "similarities and differences") of coronaviruses infecting bat and pangolin hosts, 231 available at the time. We have compared SARS-Cov-2 to all bat-CoV and pangolin-CoV genomes 232 from the listed data repositories (NCBI, VIPR and Databiology) without selecting for strains to rep-233 resent any specific genera, species or sub-strain. Our comparisons spanned across several levels: More recently, evidence has also been found for inter-host recombination events in a SARS-CoV-2 254 patient, which may have lead to new traits such as increased virulence from multiple strains Yi 255 (2020). In the attempts to address the potential recombination events or gene transfers between a 257 strain distantly related to SARS-CoV-2 and a strain more closely related to SARS-CoV-2, we sought 258 to annotate, characterise and compare genes from our diverse sets of coronaviruses. RNA virus 259 genomes are often compact, with little intergenic distance between genes, even those of the Coro- In addition to examining the overall sequence similarity of between genes derived from bat-CoV, 319 pangolin-CoV and SARS-CoV-2, we have also examined the codon usage within and across genes. There has been little research on ORF10 function, and its expression has been debated over. expressed in severe COVID-19 patient cases but barely detectable in moderate cases Liu et al. 381 (2020). Discrepancies in ORF10 expression may be due to differences in the level of infection and 382 host cell-type used in the studies, however the variants noted, show potential functions due to 383 host-species-level conservation. The RNASeq dataset (n=4) was obtained from the publicly available project PRJCA002326 at Na-533 tional Genomic Data Centre of Beijing Genomics Institute. The details of the samples can be found in https://bigd.big.ac.cn/bioproject/browse/PRJCA002326. Gene-gene network graph using amino acid sequences 762 Appendix 0 Figure 5 . Gene-gene similarity network analysis. Each node represents a amino acid sequence defined by PROKKA or BLAST (ORF10 and E). The nodes were compared against each other using BLAST, and nodes with high similarity (BLAST score ≥ 60 and a query coverage ≥ 80%) were connected with an edge. The network graph is labelled with with SARS-CoV-2 gene names ("ORF" omitted). When the network graph is coloured by host species, genes showing higher degree of variability across species are highlighted. Similar to the network analysis on nucleotide sequences (Figure 2 ). Genes ORF3a, ORF6, ORF7b, ORF8, ORF10 and S show strong separation between nodes from different species. The degree of separation in ORF1ab are stronger than ORF10 in the nucleic acid network graph; the reverse is true for the amino acid network graph. Database resources of the national center for biotechnology information. Nucleic acids 598 research BCFtools/csq: haplotype-aware variant consequences A severe acute respiratory syndrome coronavirus that lacks the E gene is attenuated in vitro and in vivo Intra-genome variability in the dinucleotide composition of SARS CoV-2. bioRxiv Programmed-1 Ribosomal Frameshifting in SARS Coronavirus An-609 tonarakis SE, Taschner PEM. HGVS Recommendations for the Description of Sequence Variants: 2016 Up-610 date Data, disease and diplomacy: GISAID's innovative contribu-613 tion to global health Nonstructural proteins NS7b and NS8 are likely to be phylogenetically associated 615 with evolution of 2019-nCoV. Infection Graphia: A platform for the graph-based visualisation and analysis of complex data Multivariate analyses of codon usage of SARS-CoV-2 and other betacoron-620 aviruses ORF8 and ORF3b antibodies are accurate serological markers of early and late SARS-CoV-2 infection. Nature 623 Publishing Group Notable sequence homology of the ORF10 protein introspects the architecture of SARS-COV-2. 626 bioRxiv SARS-CoV2 envelope protein: non-synonymous mutations and its conse-628 quences SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven 631 protease inhibitor The SARS epidemic in Hong Kong: what lessons have we learned Prodigal: prokaryotic gene recognition and 635 translation initiation site identification The species Severe acute respiratory syndrome-related coronavirus: classifying 637 2019-nCoV and naming it SARS-CoV-2 Codon usage similarity between viral and some host genes suggests a codon-specific translational regulation Identification of 691 two critical amino acid residues of the severe acute respiratory syndrome coronavirus spike protein for its 692 variation in zoonotic tropism transition via a double substitution strategy R: A language and environment for statistical computing. R Foundation for Statistical Computing phytools: An R package for phylogenetic comparative biology (and other things) Ev-699 idence for strong mutation bias towards, and selection against, U content in SARS-CoV-2: implications for 700 vaccine design. Molecular biology and evolution The structure of a rigorously conserved 703 RNA element within the SARS virus genome Mechanisms of viral mutation. Cellular and molecular life sciences Coronavirus envelope protein: current knowledge Prokka: rapid prokaryotic genome annotation Codon usage in yeast: cluster analysis clearly differentiates highly and lowly 710 expressed genes Clustal Omega for making accurate alignments of many protein sequences Thinking outside the triangle: replication fidelity of the largest RNA viruses RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioin-716 formatics Epidemiology, genetic recombination, and 718 pathogenesis of coronaviruses One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase 721 and exonuclease activities Severe 723 acute respiratory syndrome coronavirus ORF7a inhibits bone marrow stromal antigen 2 virion tethering 724 through a novel mechanism of glycosylation interference A mobile genetic element in the SARS-CoV-2 genome is shared with multiple 726 insect species. bioRxiv Distribution and evolutionary history of the mobile genetic element s2m in coron-728 aviruses E protein interacts with PALS1 and alters tight junction formation and epithelial morphogenesis Signal hotspot mutations in SARS-CoV-2 genomes evolve as the virus spreads 733 and actively replicates in different parts of the world Cryo-EM structure 739 of the 2019-nCoV spike in the prefusion conformation A virus-binding hot spot on human 742 angiotensin-converting enzyme 2 is critical for binding of two different coronaviruses Isolation of SARS-CoV-2-related 745 coronavirus from Malayan pangolins 2019 novel coronavirus is undergoing active recombination Using ggtree to Visualize Data on Tree-Like Structures. Current protocols in bioinformatics Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak A Genomic Perspective on the Origin and Emergence of SARS-CoV-2 From SARS and MERS to COVID-19: a brief summary and 757 comparison of severe acute respiratory infections caused by three highly pathogenic human coronaviruses. 758 Respiratory research GWHACBD01000001 GWHACAY01000001 GWHACAZ01000001 GWHACBA01000001 GWHACBB01000001 hCoV−19/Wuhan/WH05/2020 Appendix 0 Figure 6 . Relative synonymous codon usage (RSCU) was calculated as the ratio of the observed frequency of codon to the expected frequency under the assumption of equal usage between synonymous codons for the same amino acids. This was carried out for each gene. The total number of genomes used in each plot are indicated in the top left corner for bat-CoV (b), pangolin-CoV (p),SARS-CoV-2 Charite dataset (c) and Wuhan dataset (w). As well as a strong separation between bat-CoV and SARS-CoV-2, there is also some separation within bat-CoV for most genes. Whilst we have illustrated the PCA based on RSCU for all genes, the interpretation for some of the shorter genes should be done with caution as they do not encompass the full spectrum of amino acids.Appendix 0 Figure 7 . Relative synonymous codon usage (RSCU) was calculated as the ratio of the observed frequency of codon to the expected frequency under the assumption of equal usage between synonymous codons for the same amino acids. This was carried out for each gene. The total number of genomes used in each plot are indicated in the top left corner for pangolin-CoV (p), SARS-CoV-2 Charite dataset (c), Wuhan dataset (w), cluster 1 bat-CoV (cl1), cluster 2 bat-CoV (cl2) and cluster 3 bat-CoV (cl3). The clustering for bat-CoV refers to the k-means clustering performed on PCA of RSCU using multiple genes ( Figure 3) . As well as a strong separation between bat-CoV and SARS-CoV-2, there is also some separation within bat-CoV for most genes. The clusters seen in RSCU PCA with multiple gene remain together for all genes where the majorities of the genomes are present. Whilst we have illustrated the PCA based on RSCU for all genes, the interpretation for some of the shorter genes should be done with caution as they do not encompass the full spectrum of amino acids.Appendix 0 Figure 8 . Transcript level expression estimated using Kallisto on SARS-CoV-2 in broncho-alveolar flush (BALF) samples (n=4) collected from 2 patients in Wuhan outbreak. The results is shown here is inaccurate and for record purpose only. This analysis was done during the Hackathon event, during which we had not appreciated the importance of removing reads from the host organism nor did we recognised the lack of distinction between reads mapping to the viral genome or mRNA using this method.Appendix 0 Figure 9 . Synonymous codon ratios are the ratio between the number of a given codon divided by the total number of codon coding for the same amino acid. By sorting this ratio in blocks of synonymous codons, this heatmap illustrate the preferential codons for each amino acid for each dataset across all genes. A number of codon usage bias are consistent across most genes and datasets. For instance, GCT is preferentially used for Alanine and GTT for Valine. On the whole, there seem to be less of a preferential codon use for bat, especially in longer genes or when multiple genes are accounted for, as per indicated by the higher frequency of more evenly distributed codon within each amino acid (i.e. for the bat dataset, the heatmap colours are of a similar level within each amino acid ). Codons with GCs are generally underrepresented, such as in Arg (Arginine), Pro (Proline) and Ser (Serine). * The values in this row is generated by combing codons from multiple genes, E, N, S, ORF1ab, ORF3a, ORF10.Appendix 0 Figure 10 . The coordinate map of all variants called against the human reference SARS-Cov-2 genome. Each horizontal track shows the variants present in the host-specie group. The colours shows the gene annotation origin of the variant and the shape consequence