key: cord-0301434-re645zfz authors: Delli-Ponti, Riccardo; Mutwil, Marek title: Computationally-inferred structural landscape of the complete genomes of Dengue serotypes and other viral hemorrhagic fevers date: 2020-04-25 journal: bioRxiv DOI: 10.1101/2020.04.23.056986 sha: 66185cf46f33bb30a8c596981161eb084429d567 doc_id: 301434 cord_uid: re645zfz With more than 300 million potentially infected people every year, and with the expanded habitat of mosquitoes due to climate change, dengue cannot be considered anymore only a tropical disease. The RNA secondary structure is a functional characteristic of RNA viruses, and together with the accumulated high-throughput sequencing data could provide general insights towards understanding virus biology. Here, we profiled the RNA secondary structure of >7500 complete viral genomes from 11 different families of viral hemorrhagic fevers, including dengue serotypes, ebola, and yellow fever. Our results suggest not only an interesting lack of secondary structure for very aggressive and virulent viruses such as DENV-2 and ebola but also a different correlation between secondary structure and the number of interaction sites with human proteins, for example, with an anti-correlation of −0.84 for chikungunya. We demonstrate that the secondary structure and presence of protein-binding domains in the genomes to build similarity trees can be used to classify the viruses. We also used structural data to study the geographical distribution of dengue, finding a significant difference between DENV-3 from Asia and South-America, which could imply different evolutionary routes of this subtype. Dengue is a mosquito-borne virus that can potentially infect more than 300 million people a year in more than 120 countries 1, 2 . Dengue infection can further evolve into a severe hemorrhagic fever, which could lead to shock and death. Due to climate change, the disease is now threatening an increasing number of countries, with cases reported in Europe 2 . The existence of four different serotypes (DENV-1, 2, 3, 4), with also a fifth recently reported 3 , complicates the development of an effective vaccine 4 . The four serotypes show not only significant differences in sequence similarity 5 but also distinctive infection dynamics. For example, DENV-1 is the most wide-spread serotype, followed by DENV-2 6, 7 , which is also more often associated with severe cases 8 . However, the mechanisms behind Dengue infections and the complete set of differences between the serotypes are still unclear. Dengue is only one of the different Viral Haemorrhagic Fevers (VHF), comprising single-stranded RNA viruses from different families, such as flavivirus and filovirus. These viruses can be extremely lethal in humans, as in the case of Ebola. Other VHFs show not only similarities to Dengue in terms of symptoms, but also in the transmission vector. For example, Yellow Fever is also a mosquito-borne VHFs, with higher mortality but a slower rate of evolutionary change compared to Dengue 9 . The mild VHF Chikungunya shares the same vector with Dengue, mosquito Aedes aegypti, and the two viruses can even coexist in the same mosquito 10 . However, even with thousands of viral genomes available, the most common approach to understand the similarities between those viruses is still based only on the sequence. The secondary structure of RNA viruses is fundamental for many viral functions, from encapsidation to egression from the cell and host defence [11] [12] [13] . Specific structures in the UTRs were found to be functional, for example, in Dengue, but also in HIV and coronaviruses 11, 14, 15 . Other structural regions, including the 3' UTR, were found conserved not only in Dengue serotypes but also between Dengue and Zika 16 . Moreover, the singlestranded RNA (ssRNA) viruses preserve their structure (folding) even if their sequence mutates rapidly 17, 18 . Thus, the folding shows the potential to be used to classify different viral species and subspecies. However, while the RNA secondary structure is an informative element to characterise viruses, the secondary structure of only a few viral genomes has been experimentally characterized 16, 19 . Consequently, while thousands of viral genomes have been sequenced, we can only rely on in silico data to study their secondary structure. Furthermore, predicting the RNA secondary structure of entire viral genomes can be challenging, due to usually large sizes of >10000 nucleotides (nt), where most thermodynamic algorithms used to model the secondary structure drop in performance after 700 nt 20 . In our work, we computationally profiled the RNA secondary structure of >7500 viral genomes (prioritising Dengue serotypes and in general VHFs) using Computational Recognition of Secondary Structure (CROSS), a neural network trained on experimental data, which was successfully applied to predict HIV genome structure 21 . We mapped the secondary structure properties of the viruses on the world map, to study the genome interaction with proteins, and to further classify and understand the viruses. The viral genomes were downloaded from NCBI, selecting for each specific virus the fasta sequences containing the keyword complete genome. NCBI data was also used to extract the geographical information of Dengue viruses as well as the serotypes. Fragmented or incomplete genomes passing the filter were also removed by looking at the sequence length. The secondary structure profiles were computed using the CROSS algorithm. CROSS is a neural network-based machine learning approach trained on experimental data, able to quickly profile large and complex molecules such as viral genomes without length restrictions. CROSS was already used to profile the complete HIV genome, also showing an AUC of 0.75 with experimental SHAPE data 21 . For a comprehensive analysis, we used the Global Score model, considering nucleotides with a score >0 as double-stranded, and <0 as single-stranded. For the purpose of computing the structural content of a complete genome (i.e., % double-stranded nucleotides), the total number of nucleotides with a score >0 was averaged for the total length of each genome. For the plots showing the complete secondary structure of dengue regions, we used the MFE structure computed using RNAfold 22 . To analyse the protein binding motifs in the viral genomes, we selected 5-mer motifs from the table S3 of Dominguez et al. 23 . All of the 520 possible redundant motifs (270 nonredundant) were selected for further analysis. We scanned for the motifs on the complete genomes of the different strains, selecting only perfect matches. The number of motifs normalized by the average genome length of the different families of viruses was used to define a score for the number of potential interactions with proteins, according to the following formula: where m is the number of exact motifs found in a genome, and avg(n) is the average length of the genome of a specific species. The structural content, the averaged number of binding domains, and their correlation were employed to build dendrograms. To this end, we computed the Euclidean distance between the values associated with each virus using statistical software R. We then used the hclust function based on a ward.D2 module to build the dendrograms according to the hierarchical clustering. Here, we analysed, for the first time to our knowledge, the secondary structure profiles of the complete genomes of more than 7500 ss-RNA viruses ( Table 1 ; Materials and Methods: Source of viral genomes). The structural profiles were generated using the CROSS algorithm, a fast and comprehensive alternative to profile the structural content (i.e., % of double-stranded nucleotides) of long and complex RNA molecules, such as viruses ( 21 ; see Materials and Methods: RNA secondary structure). To analyse Dengue secondary structure, we selected one strain for each serotype, selecting strains that were widely used in previous publications 24 . In general, the four serotypes show significant differences in sequence, with around 65-70% sequence similarity 5 . Their secondary structure also shows notable differences ( Figure 1 ). For example, the 3' UTR of DENV-1 shows a peculiar structural peak, compared to the others. Interestingly, DENV-1 and DENV-2 share the highest structural peak around 6000 nucleotides, while DENV-3 and DENV-4 also have the highest structural peak in common, but at position 4000. We further expanded our analysis to cover the 4 different serotypes of the dengue virus, comprising ~4000 genomes ( Figure 2 ; Table 1 ). The analysis revealed that DENV-2 and DENV-3 are less structured than DENV-1 and DENV-4. A lower amount of complex secondary structure can improve the translation efficiency 25 , suggesting that DENV-2 and DENV-3 could be more efficiently translated in the host cell. Interestingly, dengue serotypes tend to be less structured than other Flaviviruses, such as West Nile Fever, Yellow Fever, Tick-borne Encephalitis, and Japanese Encephalitis ( Figure 2 ). Even if not properly a VHF, we also used as comparison Zika, due to the similarities with Dengue not only in the vector (Aedes aegypti), but also in terms of secondary structure domains 16 . Interestingly, while Tick-borne Encephalitis and Zika genomes are more structured (average double-stranded nucleotides > 56%), Nile Fever and Japanese Encephalitis have a similar structural distribution, especially since they are also close in the species tree 26 . To further compare and classify the secondary structure of other VHFs outside of flavivirus, we also included >500 genomes of Ebola and Chikungunya ( Figure 2 , Table 1 ). The analysis revealed that the other viruses are significantly more structured than Dengue (mean structural content for Flaviviruses and Dengue serotypes is 0.55, 0.51, respectively; Kolmogorov-Smirnov < 2.2e-16), with the exception of Ebola, which is not only the most deadly virus in our dataset (50% mortality rate according to WHO; 27 ), but also one of the less structured (mean structural content 0.50). To further study the secondary structure content for the >7500 viral genomes, we also analysed the 5' and 3' UTRs (first 1000 nt considered 5' UTR; last 1000 nt considered 3' UTR; Figure 3a , b). DENV-3 is the only serotype with both UTRs more structured than the entire genome (5' UTR= 0.55 and 3' UTR=0.53; Figure 2 ), while DENV-1 has a highly structured 5' UTR (structural content = 0.53). This result is in line with the experimental Parallel Analysis of RNA Structure (PARS) data coming from human RNAs, where the UTRs were more structured than the CDS 28 . This suggests that some viruses tend to mimic the secondary structure of human mRNAs to be efficiently translated by the cellular machinery 29 . This is also further supported in Dengue, where a complex structure at the 3' UTR was shown to mimic the absent polyA, to enhance translation 30 . Interestingly, Ebola has the least structured UTRs (structural content 5' UTR = 0.46; 3' UTR = 0.41). In Zika, the 3' UTR is more structured than the 5' (Figure 3c ; 3' UTR=0.56, 5' UTR = 0.50). Chikungunya shows not only the highest structural variability in the 3' UTR (3' UTR = 0.43, Figure 3b ), but also a very structured 5' UTR ( Figure 3a) . Finally, Ebola, DENV-1, and DENV-3 exhibit a more structured 5' UTR, especially when compared with DENV-2, DENV-4 and Japanese Encephalitis, which tend to be more structured (Figure 3c ). The overall similarities and differences in structure are an additional feature that could be employed to characterise the different viruses. For the next step, the structural content (mean of the % of double-stranded nucleotides for all the viral genomes) in a specific species was used to hierarchically group the 11 different viruses (Materials and Methods: Hierarchical clustering; Table 1 ). The resulting dendrogram clustered the Dengue serotypes, showing that they are more structurally similar compared to other viruses ( Figure 4 ). The structural similarities of Dengue serotypes, together with the similarity between West Nile Fever and Japanese Encephalitis, are in agreement with the Phylogenetic Tree of Viral Hemorrhagic Fever 26 . The structural content revealed interesting clustering of the viruses. For example, while having a different genome sequence, Dengue viruses also cluster together with Ebola, since they share a less structured genome. Interestingly, DENV-2, the most virulent and spread serotype 7, 8 , is similar in structural content to Ebola, probably the most lethal viral hemorrhagic fever 27 . According to its structural content, Zika is also part of the sub-cluster, together with West Nile Fever and Japanese Encephalitis. The mosquito-transmitted Yellow Fever and Chikungunya form a cluster, indicating that their structural content is similar. This is partially in agreement with the VHFs tree 26 . Interestingly, since Tick-borne Encephalitis is more structured than any of these viruses, it forms an outlier. To conclude, these results indicate that the level of secondary structure inside a viral genome can be used as a metric to build a tree of similarities, which could be further employed to classify viruses. During translation and replication, ss-RNA viruses are naked RNA molecules inside human host cells. Previous studies already showed that genomes of the Dengue viruses interact with multiple human proteins during the infection and that the protein binding can enhance or inhibit the virulence 31 . Furthermore, RNA binding proteins tend to exhibit an altered activity during viral infection, in some cases due to the presence of highly abundant viral RNA, which can compete for the interaction with cellular RNA 32 . To study the relationship between human proteins and the viral RNA structures, we selected binding motifs from RNA Bind-n-Seq (RBNS) data of 78 human RNA-binding proteins 23 , and searched the complete viral genomes for these motifs (Materials and Methods: Protein-RNA interactions). We observed that the 4 dengue serotypes have a different presence of protein binding domains, with DENV-2 showing the highest number of motifs, followed by DENV-4 (Supplementary Figure 1) . Similarly to the structural content analysis above, we used the number of protein binding domains to classify the viruses. To further understand how the connection between structure and interaction with proteins can classify viruses, we compared the resulting trees ( Figure 4) . Interestingly, the Dengue cluster is almost perfectly maintained, except that, for the number of protein interactions, DENV-2 is more similar to Chikungunya than Ebola, which in turn is more related to Yellow Fever. Furthermore, clustering of West Nile Fever, Japanese Encephalitis and Zika is partially maintained when using structure and interaction with proteins. To conclude, by analyzing thousands of different viral genomes, we identified specific clusters, conserved both in terms of secondary structure content and the potential number of interactions with proteins. Since both the structural content and the number of binding motifs could be used to classify the viruses, we hypothesized that there is a correlation between these two features. We found an overall high anti-correlation (r=-0.74; p-value < 2.2e-16) between the number of protein-binding motifs and the structural content in Dengue, meaning that less structured Dengue viruses tend to bind more proteins (Figure 5a) . Interestingly, the different dengue serotypes cluster together according to their structure and the interaction with proteins ( Figure 5a) . Also, the serotypes show a different trend when independently analysed, with DENV-3 and DENV-4 exhibiting the highest influence of the structure on the number of possible interacting proteins (r is -0.31 and -0.36; p-value < 2.2e-16 and 7.177e-07, respectively; Figure 5b ). Next, we compared the secondary structure and protein binding motifs of the other VHFs. The general picture is quite complex, with some viruses showing a propensity to interact with proteins if they are more or less structured. Similarly to dengue, the mild hemorrhagic fever Chikungunya shows a high anticorrelation (r=-0.84; p-value < 2.2e-16, Figure 6 ). Conversely, Tick-borne Encephalitis and Zika show a positive correlation (Pearson correlation of 0.23 and 0.89; p-value 0.01 and <2.2e-16 respectively; Figure 6 ). The high positive correlation, especially for Zika, suggests that specific highly-structured viruses are also prone to interact with human proteins. DENV-3 and DENV-4 act similarly to Chikungunya, having unstructured genomes, allowing potentially more interactions with proteins ( Figure 5b) . Finally, Japanese Encephalitis and West Nile Fever show almost no correlation of the structure and on the possible interaction with proteins (r~0). These results indicate a complex relationship between the structural content and protein-binding motifs. To further understand this phenomenon, we used the correlation between structure and number of motifs to cluster the different viruses (Supplementary Figure 2) . Similarly to what we observed above, Japanese Encephalitis and West Nile Fever are grouped together, but this time also close to DENV-2 and Ebola. While Zika is the outlier together with Chikungunya, the first is a Flavivirus but not a VHF, while the second is a mild VHF but also a Togavirus (Supplementary Figure 2) . We also studied the geographical connection between Dengue serotypes and the secondary structure content. We found that the African DENV-1 is predicted to be more structured than the other serotypes, even when compared with the Asian strains (Supplementary Figure 23 Kolmogorov-Smirnov = 0.001), while on the contrary, Asian DENV-3 is more structured than the African strains (Kolmogorov-Smirnov = 0.07; Supplementary Figure 3) . The correlation between binding motifs and secondary structure at the geographical level portrays a quite complex scenario (Figure 7) . The clearest clusters are identifiable for DENV-3, where the South-American strains are not only less structured, but also highly interacting with proteins. DENV-3 from South-America is also more distant from the Asian strains (Euclidean distance centroids x1000=5), compared, for example, with DENV-1 (Euclidean distance centroids x1000=1). The genomes of viral hemorrhagic fevers show a high level of secondary structure, especially in the UTR regions 14, 16 . This secondary structure is thought to be needed for different viral mechanisms, such as packaging and egression [11] [12] [13] . However, a comprehensive secondary structural landscape for their genomes was lacking. In our work, we computationally profiled and analysed the secondary structure profiles of more than 7500 complete viral genomes, including almost 4000 dengue samples, and 3500 other viral hemorrhagic fever-causing viruses. By studying the structural profiles, we observed that Dengue viruses are predicted to be less structured compared to viruses such as Zika, Yellow Fever, and West Nile Fever ( Figure 2) . Conversely, Dengue serotypes still tend to retain structured UTRs, probably to be efficiently translated by the cellular machinery, similarly to human mRNAs 25, 28 . We also identified a correlation between the secondary structure and the number of protein binding domains, implying that the secondary structure is employed to regulate potential binding with proteins, as observed for human RNAs 33 . For viruses, the situation is more complex, as we observed a significantly positive (Tick-borne Encephalitis and Zika) and negative (Chikungunya, DENV-3, and DENV-4) relationship between secondary structure and the potential interaction with proteins. We speculate that the opposite behaviors of the viruses are caused by the fundamental differences between viruses. For example, Zika (positive RNA genome, positive correlation between secondary structure and interaction with proteins) and Chikungunya (negative genome, negative correlation) belong to different viral families (Flavivirus and Togavirus respectively), and have a completely different capsid. We also analysed the secondary structure at a geographical level, showing that DENV-3 strains from South-America and Asia have different patterns in their structure and the potential interaction with proteins, especially when compared with DENV-1 and DENV-2 ( Figure 7) . Interestingly, this is in line with DENV-3 being the youngest serotype and the only one with a proposed origin not in Asia but in America 34 . This could explain the niched behaviour of DENV-3 in terms of structure and protein interactions, as well as supporting a possible independent origin of Asia and and American DENV-3. Interestingly, DENV-4 shows a similar trend, but there are too few samples available to explain its evolution. To our knowledge, this is the first study that employed secondary structural content and the presence of protein-binding domains to build similarity trees between VHFs. The secondary structure and interaction with proteins can be used to cluster the viruses in agreement with previous phylogenetic trees, such as Dengue serotypes, and Japanese Encephalitis with West Nile Fever. Conversely, some relationships are surprising, as for example, DENV-2 is closer to Ebola when the secondary structure is used to establish similarity, but not when using the interaction with proteins. This result suggests how different measures, especially the secondary structure content, could be used to classify further and characterise different classes of viruses. Our massive computational analysis provided novel results regarding the secondary structure and the interaction with human proteins, not only for Dengue serotypes but also for other viral hemorrhagic fevers. We envision that these approaches can be used by the scientific community to classify further and characterise these complex viruses. The global distribution and burden of dengue Refining the global spatial limits of dengue virus transmission by evidence-based consensus Surprising new dengue virus throws a spanner in disease control efforts The Lancet Infectious Diseases. The dengue vaccine dilemma Complete genome sequencing and evolutionary analysis of dengue virus serotype 1 isolates from an outbreak in Kerala, South India Global spread of dengue virus types: mapping the 70 year history Dengue viruses -an overview Serotype-specific differences in the risk of dengue hemorrhagic fever: an analysis of data Yellow fever virus exhibits slower evolutionary dynamics than dengue virus Concurrent chikungunya and dengue virus infections during simultaneous outbreaks, Gabon Structural determinants and mechanism of HIV-1 genome packaging Identification of novel RNA secondary structures within the hepatitis C virus genome reveals a cooperative involvement in genome packaging The influence of viral RNA secondary structure on interactions with innate host cell defences RNA structure duplication in the dengue virus 3' UTR: redundancy or host specificity? The structure and functions of coronavirus genomic 3' and 5' ends Structure mapping of dengue and Zika viruses reveals functional long-range interactions Conservation of mRNA secondary structures may filter out mutations in Escherichia coli evolution Simian immunodeficiency virus RNA is efficiently encapsidated by human immunodeficiency virus type 1 particles Architecture and secondary structure of an entire HIV-1 RNA genome Analysis of energy-based algorithms for RNA secondary structure prediction A high-throughput approach to profile RNA structure SHAPE directed RNA folding Sequence, structure, and context preferences of human RNA binding proteins Complete genome sequences of four serotypes of dengue virus prototype continuously maintained in the laboratory Does RNA secondary structure drive translation or vice versa? Controlling dengue: effectiveness of biological control and vaccine in reducing the prevalence of dengue infection in endemic areas Subsequent mortality in survivors of Ebola virus disease in Guinea: a nationwide retrospective cohort study Insights into RNA structure and function from genome-wide studies Dengue Virus Selectively Annexes Endoplasmic Reticulum-Associated Translation Machinery as a Strategy for Co-opting Host Cell Protein Synthesis Recent advances in deciphering viral and host determinants of dengue virus replication and pathogenesis Identification of RNA Binding Proteins Associated with Dengue Virus RNA in Infected Cells Reveals Temporally Distinct Host Factor Requirements System-wide Profiling of RNA-Binding Proteins Uncovers Key Regulators of Virus Infection An Integrative Study of Protein-RNA Condensates Identifies Scaffolding RNAs and Reveals Players in Fragile X-Associated Tremor/Ataxia Syndrome Comparative evolutionary epidemiology of dengue virus serotypes