key: cord-0888694-ewvdl06h authors: Vandelli, Andrea; Monti, Michele; Milanetti, Edoardo; Armaos, Alexandros; Rupert, Jakob; Zacco, Elsa; Bechara, Elias; Ponti, Riccardo Delli; Tartaglia, Gian Gaetano title: Structural analysis of SARS-CoV-2 genome and predictions of the human interactome date: 2020-07-24 journal: bioRxiv DOI: 10.1101/2020.03.28.013789 sha: 41059e8d45463949a055a3ba3591705b188ef9a4 doc_id: 888694 cord_uid: ewvdl06h Specific elements of viral genomes regulate interactions within host cells. Here, we calculated the secondary structure content of >2000 coronaviruses and computed >100000 human protein interactions with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The genomic regions display different degrees of conservation. SARS-CoV-2 domain encompassing nucleotides 22500 – 23000 is conserved both at the sequence and structural level. The regions upstream and downstream, however, vary significantly. This part codes for the Spike S protein that interacts with the human receptor angiotensin-converting enzyme 2 (ACE2). Thus, variability of Spike S may be connected to different levels of viral entry in human cells within the population. Our predictions indicate that the 5’ end of SARS-CoV-2 is highly structured and interacts with several human proteins. The binding proteins are involved in viral RNA processing such as double-stranded RNA specific editases and ATP-dependent RNA-helicases and have strong propensity to form stress granules and phase-separated complexes. We propose that these proteins, also implicated in viral infections such as HIV, are selectively recruited by SARS-CoV-2 genome to alter transcriptional and post-transcriptional regulation of host cells and to promote viral replication. 
comparison, balanced lists of single and double stranded regions were used for the calculations: a confidence score of 10% indicates that we compared the SHAPE reactivity values of 3000 nucleotides associated with the highest CROSS scores (i.e., double stranded) and 3000 nucleotides associated with the lowest CROSS scores (i.e., single stranded). From low (10%) to high (0.1%) confidence scores, we observed that the predictive power, measured as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), increases monotonically, reaching a value of 0.73 (the AUC is 0.74 for the 10 highest/lowest scores; Fig. 1C), which indicates that CROSS is able to reproduce SHAPE-MaP in great detail.

We also assessed CROSS performance on structures of betacoronavirus 5' and 3' ends [33-36] (Fig. 1D). In this analysis, we used RFAM multiple sequence alignments of betacoronavirus 5' and 3' ends and the relative consensus structures (RF03117 and RF03122) [33-36]. We generated the 2D representation of the nucleotide chains of the consensus structures. We extracted the 'secondary structure occupancy', as defined in a previous work 20, and counted the contacts present around each nucleotide. Following the procedure used for the comparison with SHAPE-MaP, different progressive cut-offs were used for ranking all the structures using balanced lists of single and double stranded regions: 10% indicates that we compared 600 nucleotides associated with the highest number of contacts and 600 nucleotides associated with the lowest number of contacts. From low (10%) to high (0.1%) confidence scores, we observed that the AUC of the ROC increases monotonically, reaching a value of 0.75 (the 10 highest/lowest scores have an AUC of 0.78; Fig. 1C), which indicates that CROSS is able to identify the reported double and single stranded regions in great detail.
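The balanced comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `rank_auc` and `balanced_auc` are hypothetical, the AUC is computed with a simple pairwise rank statistic, and the sketch assumes that high SHAPE reactivity marks single-stranded nucleotides, so reactivities are negated before scoring.

```python
def rank_auc(positives, negatives):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def balanced_auc(predicted, reactivity, fraction):
    """AUC on balanced lists built from the extremes of the predicted scores.

    predicted  : per-nucleotide predicted propensity (high = double stranded)
    reactivity : per-nucleotide measured reactivity (assumed high = single stranded)
    fraction   : share of nucleotides taken from each extreme (0.10 for 10%)
    """
    n = len(predicted)
    if n < 2 or n != len(reactivity):
        raise ValueError("need two equally long lists with at least 2 values")
    # Same number of nucleotides from each extreme, never overlapping.
    k = min(max(1, int(round(n * fraction))), n // 2)
    order = sorted(range(n), key=lambda i: predicted[i])
    single = order[:k]    # lowest predicted scores
    double = order[-k:]   # highest predicted scores
    # Assumption: low reactivity indicates pairing, hence the sign flip.
    pos = [-reactivity[i] for i in double]
    neg = [-reactivity[i] for i in single]
    return rank_auc(pos, neg)


def auc_profile(predicted, reactivity, fractions=(0.10, 0.05, 0.01, 0.001)):
    """AUC at progressively stricter cut-offs (illustrative fractions)."""
    return {f: balanced_auc(predicted, reactivity, f) for f in fractions}
```

With a genome of roughly 30000 nucleotides, `fraction=0.10` selects 3000 nucleotides from each extreme, matching the list sizes quoted in the text. The intermediate fractions in `auc_profile` are placeholders, since the text only states the 10% and 0.1% end points.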
We also tested the ability of CROSS to recognize specific secondary structures in representative cases for which we studied both the 3' and 5' ends: NC_006213 or Human coronavirus OC43 strain ATCC VR-759, NC_019843 or Middle East respiratory syndrome coronavirus, NC_026011 or Betacoronavirus HKU24 strain HKU24-R05005I, NC_001846 or Mouse hepatitis virus strain MHV-A59 C12 and NC_012936 or Rat coronavirus Parker (Supp. Fig. 2).

In summary, our analysis identifies several structural elements in the SARS-CoV-2 genome 11. Different lines of experimental and computational evidence indicate that transcripts containing a large number of double-stranded regions have a strong propensity to recruit proteins 14 and can act as scaffolds for protein assembly 15,16. We expected the 5' end to attract several host proteins because of its enrichment in secondary structure elements. The binding would not just involve end of a long RNA stem, the overall region is enriched in double-stranded nucleotides but the specific interaction takes place in a single-stranded element.

To demonstrate the conservation of nucleotides 22000-23000 (fragment 23), we divided this region and the adjacent ones (nucleotides 21000-22000 and 23000-24000) into sub-fragments. We then used the RF-Fold algorithm of the RNAFramework suite 30 to fold the different sub-regions using CROSS predictions as soft constraints. The structural motifs identified with this procedure were employed to build covariance models (CMs) that were then searched in our set of coronaviruses using the 'Infernal' package 38. We found that nucleotides 501-750 within fragment 23 have the highest number of matches at different confidence thresholds, implying a higher chance of sequence and structure conservation across coronaviruses (E-values of 10, 1 and 0.1; Fig. 2B).
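The counting step behind this comparison can be illustrated with a short sketch. It does not call Infernal itself: it assumes the search results have already been reduced to (sub-fragment label, E-value) pairs, which is a hypothetical intermediate format, and it treats a hit as a match when its E-value is at or below the threshold. The function names and the toy hits are illustrative only.

```python
from collections import Counter

# Thresholds quoted in the text (E-values of 10, 1 and 0.1).
E_VALUE_THRESHOLDS = (10, 1, 0.1)


def count_matches(hits, thresholds=E_VALUE_THRESHOLDS):
    """Count hits per sub-fragment for each E-value threshold.

    hits : iterable of (sub_fragment_label, e_value) pairs
           (hypothetical, pre-parsed search output)
    Returns {threshold: Counter({label: number_of_hits})}.
    """
    counts = {t: Counter() for t in thresholds}
    for label, e_value in hits:
        for t in thresholds:
            if e_value <= t:
                counts[t][label] += 1
    return counts


def most_matched(counts, threshold):
    """Label of the sub-fragment with the most hits, or None if no hit passes."""
    ranked = counts[threshold].most_common(1)
    return ranked[0][0] if ranked else None


# Toy input, invented purely to exercise the functions.
toy_hits = [
    ("501-750", 0.05),
    ("501-750", 0.5),
    ("1-250", 5.0),
    ("251-500", 20.0),
]
toy_counts = count_matches(toy_hits)
```

A sub-fragment that stays on top as the threshold tightens from 10 to 0.1 is the kind of signal the text interprets as combined sequence and structure conservation.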
We specifically counted the matches falling in the Spike S region (+/-1000 nucleotides to take into account the division of the genome into fragments; Supp.

Our analysis suggests that the structural region between nucleotides 22000 and 23000 of the Spike S region is conserved among coronaviruses (Fig. 2) and that the binding site for ACE2 shows little variation in human SARS-CoV-2 strains (Fig. 3B). By contrast, the region upstream, which has a propensity to bind sialic acids 10,44,45, showed poor structural content and high variability (Fig. 3B).

replication

To gain insight into how the virus replicates in human cells, we predicted SARS-CoV-2 interactions with the whole RNA-binding human proteome. Following a protocol to study structural conservation in viruses 13, we first divided the Wuhan sequence into 30 fragments of 1000 nucleotides each, moving from the 5' to the 3' end, and then calculated the protein-RNA interactions of each fragment with catRAPID omics (3340 canonical and putative RNA-binding proteins, or RBPs, for a total of 102000 interactions) 18. Proteins such as Polypyrimidine tract-binding protein 1 PTBP1 (Uniprot P26599) showed the highest interaction propensity (or Z-score; Materials and Methods) at the 5' end, while others such as Heterogeneous nuclear ribonucleoprotein Q HNRNPQ (O60506) showed the highest interaction propensity at the 3' end, in agreement with previous studies on coronaviruses (Fig. 4A) 46.

For each fragment, we predicted the most significant interactions by filtering according to the Z-score. We used three thresholds in ascending order of stringency (Z ≥ 1.50, 1.75 and 2) and removed from the list the proteins that were predicted to interact promiscuously with more than one fragment. Fragment 1 corresponds to the 5' end and is the most contacted by RBPs (around 120 with Z ≥ 2 high-confidence interactions; Fig.
4B), which is in agreement with the observation that highly structured regions attract a large number of proteins 14. Indeed, the 5' end contains multiple stem-loop structures that control RNA replication and transcription 47,48. By contrast, the 3' end and fragment 23 (Spike S), which are still structured but to a lesser extent, attract fewer proteins (10 and 5, respectively), and fragment 20 (between Orf1ab and Spike S), which is predicted to be unstructured, does not have predicted binding partners.

The interactome of each fragment was analysed using cleverGO, a tool for Gene Ontology (GO) enrichment analysis 49. Proteins interacting with fragments 1, 2 and 29 were associated with annotations related to viral processes (Fig. 4C; Supp. Table 2). Considering the three thresholds applied (Materials and Methods), we found 23 viral proteins (including 2 pseudogenes) for fragment 1, 2 proteins for fragment 2 and 11 proteins for fragment 29 (Fig. 4D). Among the high-confidence interactors of fragment 1, we discovered RBPs involved in positive regulation of viral processes and viral genome replication, such as double-stranded RNA-specific editase 1 ADARB1 (Uniprot P78563), 2-5A-dependent ribonuclease RNASEL (Q05823) and 2'-5'-oligoadenylate synthase 2 OAS2 (P29728; Fig. 5A). Interestingly, OAS2 has been reported to be upregulated in human alveolar adenocarcinoma (A549) cells infected with SARS-CoV-2 (log fold change of 4.2; p-value of 10^-9 and q-value of 10^-6) 50.
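The fragment-wise selection described above (splitting the genome into 1000-nucleotide fragments, applying the thresholds Z ≥ 1.50, 1.75 and 2, and discarding proteins that pass for more than one fragment) can be sketched as follows. This is an illustration under stated assumptions rather than the catRAPID omics procedure: the Z-scores are taken as already computed, the input layout and function names are hypothetical, and the protein labels in the usage example are placeholders.

```python
Z_THRESHOLDS = (1.50, 1.75, 2.0)  # thresholds quoted in the text


def split_into_fragments(sequence, size=1000):
    """Consecutive, non-overlapping fragments from the 5' to the 3' end."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def specific_interactions(z_scores, threshold):
    """Select fragment-specific interactions at a given Z-score threshold.

    z_scores : {(protein, fragment_index): z} with precomputed Z-scores
               (hypothetical input layout)
    Returns {fragment_index: sorted list of proteins}, keeping only proteins
    that pass the threshold for exactly one fragment.
    """
    passing = {}
    for (protein, fragment), z in z_scores.items():
        if z >= threshold:
            passing.setdefault(protein, set()).add(fragment)

    per_fragment = {}
    for protein, fragments in passing.items():
        if len(fragments) == 1:  # drop promiscuous binders
            per_fragment.setdefault(next(iter(fragments)), []).append(protein)
    return {f: sorted(ps) for f, ps in per_fragment.items()}


# Placeholder scores, invented purely to exercise the functions.
toy_scores = {
    ("RBP_A", 1): 2.3,
    ("RBP_A", 2): 0.4,
    ("RBP_B", 1): 2.1,
    ("RBP_B", 29): 2.5,   # passes for two fragments, so it is removed
    ("RBP_C", 29): 1.8,
}
by_threshold = {t: specific_interactions(toy_scores, t) for t in Z_THRESHOLDS}
```

One design point worth noting: promiscuity is evaluated separately at each threshold here, so a protein removed at Z ≥ 1.50 can reappear at Z ≥ 2 if only one of its interactions survives. Whether the original analysis applied the filter per threshold or once overall is not stated in the text.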
While double-stranded RNA-specific adenosine deaminase ADAR (P55265) is absent from our library because its length does not meet catRAPID omics requirements 18, the omiXcore extension of the algorithm, specifically developed for large molecules 51, attributes the same binding propensity to both ADARB1 and ADAR, thus indicating that the interactions with SARS-CoV-2 are likely to occur (Materials and Methods). Moreover, experimental work indicates that the family of ADAR deaminases is active in bronchoalveolar lavage fluids derived from SARS-CoV-2 patients 52 and is upregulated in A549 cells infected with SARS-CoV-2 (log fold change of 0.58; p-value of 10^-8 and q-value of 10^-5) 50.

We also identified proteins related to the establishment of integrated proviral latency, including X-ray repair cross-complementing protein 5 XRCC5 (P13010) and X-ray repair cross-complementing protein 6 XRCC6 (P12956; Fig. 5A). In accordance with our calculations, a comparison of A549 cell responses to SARS-CoV-2 and respiratory syncytial virus indicates upregulation of XRCC6 in SARS-CoV-2 (log fold change of 0.92; p-value of 0.006 and q-value of 0.23) 50. Moreover, previous evidence suggests that the binding of XRCC6 takes place at the 5' end of SARS-CoV-2, thus giving further support to our predictions 53. Nucleolin NCL (P19338), a protein known to be involved in coronavirus processing, was also predicted to bind tightly to the 5' end (Supp.

Some DNA-binding proteins such as Cyclin-T1 CCNT1 (O60563), Zinc finger protein 175 ZNF175 (Q9Y473) and Prospero homeobox protein 1 PROX1 (Q92786) were included because they could have RNA-binding ability (Fig. 5A) 55. As for fragment 2, we found two canonical RBPs: Heterogeneous nuclear ribonucleoprotein Q HNRNPQ and Nucleolin NCL 54.
In addition, we discovered that the highly structured region at the 5' end has the largest number of protein partners, including ATP-dependent RNA helicase DDX1, which was previously reported to be essential for HIV-1 and coronavirus IBV replication 63,64, and the double-stranded RNA-specific editases ADAR and ADARB1, which catalyse the hydrolytic deamination of adenosine to inosine. Other with ATP-dependent RNA helicase DHX9 90 as well as 2-5A-dependent ribonuclease RNASEL and 2'-5'-oligoadenylate synthase 2 OAS2, which control viral RNA degradation 91,92. Interestingly, DDX1, XRCC6 and OAS2 were found to be upregulated in human alveolar adenocarcinoma cells infected with SARS-CoV-2 50, and DDX1 knockdown has been shown to reduce the number of sgmRNA in SARS-CoV-1 infected cells 60

The idea that SARS-CoV-2 sequesters different elements of the transcriptional machinery is particularly intriguing and is supported by the fact that a large number of proteins identified in our screening are found in stress granules 75. Indeed, stress granules protect the host innate immunity and are hijacked by viruses to favour their own replication 87. Moreover, as coronavirus transcription uses discontinuous RNA synthesis that involves high-frequency recombination 54, it is could act as scaffold to attract host proteins 14,15. In agreement with our hypothesis, it has been very recently shown that the coronavirus nucleocapsid protein N can form protein condensates based on a viral RNA scaffold and can merge with human cell protein condensates 84, which provides a potential mechanism of host protein sequestration.

Representative cases are shown in black (Gordon et al. 67) and grey (Schmidt et al. 73

A Novel Coronavirus from Patients with Pneumonia in China
Coronaviruses and immunosuppressed patients.
The facts during the third epidemic
Treatment Coronavirus (COVID-19)
Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor
Furin cleavage of the SARS coronavirus spike glycoprotein enhances cell-cell fusion but does not affect virion entry
Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
Structures of MERS-CoV spike glycoprotein in complex with sialoside attachment receptors
Cryo-electron microscopy structure of a coronavirus spike glycoprotein trimer
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
Identification of sialic acid-binding function for the Middle East respiratory syndrome coronavirus spike glycoprotein
The Structure and Functions of Coronavirus Genomic 3' and 5
A high-throughput approach to profile RNA structure
Prediction Shows Evidence for Structure in lncRNAs
RNA structure drives interaction with proteins
An Integrative Study of Protein-RNA Condensates Identifies Scaffolding RNAs and Reveals Players in Fragile X-Associated Tremor/Ataxia Syndrome
Phase separation drives X-chromosome inactivation: a hypothesis
catRAPID omics: a web server for large-scale prediction of protein-RNA interactions
Quantitative predictions of protein interactions with long noncoding RNAs
Predicting protein associations with long noncoding RNAs
RNAct: Protein-RNA interaction predictions for model organisms with supporting experimental data
Cloaked similarity between HIV-1 and SARS-CoV suggests an anti-SARS strategy
Inhibition of furin-mediated cleavage activation of HIV-1 glycoprotein gp160
Differential downregulation of ACE2 by the spike proteins of severe acute respiratory syndrome coronavirus and human coronavirus NL63
Conserved structural RNA domains in regions coding for cleavage site motifs in hemagglutinin genes of influenza viruses
Genome Composition and Divergence of the Novel
Coronavirus
ViennaRNA Package 2.0
CROSSalive: a web server for predicting the in vivo structure of RNA molecules
Structural and functional conservation of the programmed -1 ribosomal frameshift signal of SARS-CoV-2
RNA Framework: an all-in-one toolkit for the analysis of RNA structures and post-transcriptional modifications
A Phylogenetically Conserved Hairpin-Type 3′ Untranslated Region Pseudoknot Functions in Coronavirus RNA Replication
Genome-wide mapping of therapeutically-relevant SARS-CoV-2 RNA structures
The 3' cis-acting genomic replication element of the severe acute respiratory syndrome coronavirus can function in the murine coronavirus genome
RNA-RNA and RNA-protein interactions in coronavirus replication and transcription
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Infernal 1.1: 100-fold faster RNA homology searches
The EMBL-EBI search and sequence analysis tools APIs in 2019
RNA-Seq methods for transcriptome analysis
The proximal origin of SARS-CoV-2
A pneumonia outbreak associated with a new coronavirus of probable bat origin
T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension
Distinct Roles for Sialoside and Protein Receptors in Coronavirus Infection
In-Silico evidence for two receptors based strategy of SARS
Structural determinants and mechanism of HIV-1 genome packaging
An Overview of Their Replication and Pathogenesis
Protein aggregation, structural disorder and RNA-binding ability: a new approach for physico-chemical and gene ontology classification of multiple datasets
Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19
omiXcore: a web server for prediction of protein interactions with large RNA
Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2
Specific viral RNA drives the SARS CoV-2 nucleocapsid to phase
separate
RNA-RNA and RNA-protein interactions in coronavirus replication and transcription
Insights into RNA biology from an atlas of mammalian mRNA-binding proteins
HIV Gag polyprotein: processing and early viral particle assembly
The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function
What retroviruses teach us about the involvement of c-Myc in leukemias and lymphomas
Nucleocapsid Phosphorylation and RNA Helicase DDX1
The Cellular RNA Helicase DDX1 Interacts with Coronavirus Nonstructural Protein 14 and Enhances Viral Replication
SARS-Coronavirus-2 Nsp13 Possesses NTPase and RNA Helicase Activities That Can Be Inhibited by Bismuth Salts
A DEAD box protein facilitates HIV-1 replication as a cellular co-factor of Rev
The cellular RNA helicase DDX1 interacts with coronavirus nonstructural protein 14 and enhances viral replication
Cyclin T1 domains involved in complex formation with Tat and TAR RNA are critical for tat-activation
Role of the human and murine cyclin T proteins in regulating HIV-1 tat-activation
A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing
The role of A-kinase anchoring protein 95-like protein in annealing of tRNALys3 to HIV-1 RNA
The La-related protein LARP7 is a component of the 7SK ribonucleoprotein and affects transcription of cellular and viral polymerase II genes
The Architecture of SARS-CoV-2 Transcriptome
Systematic Analysis of the Protein Interaction Network for the Human Transcription Machinery Reveals the Identity of the 7SK Capping Enzyme
A direct RNA-protein interaction atlas of the SARS-CoV-2 RNA in infected human cells
ChEMBL: towards direct deposition of bioassay data
Context-Dependent and Disease-Specific Diversity in Protein Interactions within Stress Granules
A
Concentration-Dependent Liquid Phase Separation Can Cause Toxicity upon Increased Protein Expression
DrLLPS: a data resource of liquid-liquid phase separation in eukaryotes
Liquid Nuclear Condensates Mechanically Sense and Restructure the Genome
Phase separation of signaling molecules promotes T cell receptor signal transduction
Regulation of stress granules in virus systems
organelles, phase separation, and intrinsic disorder
A stimulatory role for the La-related protein 4B in translation
LARP4B is an AU-rich sequence associated factor that promotes mRNA accumulation and translation
SARS-CoV-2 nucleocapsid protein undergoes liquid-liquid phase separation stimulated by RNA and partitions into phases of human ribonucleoproteins
The SARS-CoV-2 nucleocapsid protein is dynamic, disordered, and phase separates with RNA
A proposed role for the SARS-CoV-2 nucleocapsid protein in the formation and regulation of biomolecular condensates
Viral Regulation of RNA Granules in Infected Cells
Binding of the SARS-CoV-2 Spike Protein to Glycans
The architecture of SARS-CoV-2 transcriptome. bioRxiv (now Cell in press
phosphorylates nuclear DNA helicase II/RNA helicase A and hnRNP proteins in an RNA-dependent manner
The nature of the catalytic domain of 2'-5'-oligoadenylate synthetases

Supp. Figure 1. We employed CROSSalign 13, 12