key: cord-0812451-myem2b54 authors: Ferrarini, Mariana G.; Lal, Avantika; Rebollo, Rita; Gruber, Andreas; Guarracino, Andrea; Gonzalez, Itziar Martinez; Floyd, Taylor; de Oliveira, Daniel Siqueira; Shanklin, Justin; Beausoleil, Ethan; Pusa, Taneli; Pickett, Brett E.; Aguiar-Pulido, Vanessa title: Genome-wide bioinformatic analyses predict key host and viral factors in SARS-CoV-2 pathogenesis date: 2020-08-19 journal: bioRxiv DOI: 10.1101/2020.07.28.225581 sha: 485823ec809e68d06c8f741b9d97c612a88d3621 doc_id: 812451 cord_uid: myem2b54 The novel betacoronavirus named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) caused a worldwide pandemic (COVID-19) after initially emerging in Wuhan, China. Here we applied a novel, comprehensive bioinformatic strategy to public RNA sequencing and viral genome sequencing data, to better understand how SARS-CoV-2 interacts with human cells. To our knowledge, this is the first meta-analysis to predict host factors that play a specific role in SARS-CoV-2 pathogenesis, distinct from other respiratory viruses. We identified differentially expressed genes, isoforms and transposable element families specifically altered in SARS-CoV-2 infected cells. Well-known immunoregulators including CSF2, IL-32, IL-6 and SERPINA3 were differentially expressed, while immunoregulatory transposable element families were overexpressed. We predicted conserved interactions between the SARS-CoV-2 genome and human RNA-binding proteins such as hnRNPA1, PABPC1 and eIF4b, which may play important roles in the viral life cycle. We also detected four viral sequence variants in the spike, polymerase, and nonstructural proteins that correlate with severity of COVID-19. The host factors we identified likely represent important mechanisms in the disease profile of this pathogen, and could be targeted by prophylactics and/or therapeutics against SARS-CoV-2. Graphical Abstract . Overview of the bioinformatic workflow applied in this study. 46 SARS-CoV-2 infection elicits a specific gene expression and pathway 47 signature in human cells 48 We wanted to identify genes that were differentially expressed across multiple SARS-CoV-2 infected 49 samples and not in samples infected with other respiratory viruses. As a primary dataset, we 50 selected GSE147507 [10] , which includes gene expression measurements from three cell lines derived 51 from the human respiratory system (NHBE, A549, Calu-3) infected either with SARS-CoV-2, 52 influenza A virus (IAV), respiratory syncytial virus (RSV), or human parainfluenza virus 3 (HPIV3), 53 with different multiplicity of infection (MOI). We also analyzed an additional dataset GSE150316, 54 which includes RNA-seq extracted from formalin fixed, paraffin embedded (FFPE) histological 55 sections of lung biopsies from COVID-19 deceased patients and healthy individuals (see Figure 2A and Materials and Methods for further details). 57 Hence, we retrieved 41 differentially expressed genes (DEGs) that showed significant and 58 consistent expression changes in at least three datasets from cell lines infected with SARS-CoV-2, 59 and that were not significantly affected in cell lines infected with other viruses within the same 60 dataset (Supplementary Table 1A ). To these, we added 23 genes that showed significant and 61 consistent expression changes in two of four cell line datasets infected with SARS-CoV-2 and at least 62 Ferrarini & Lal et al., 2020 Comprehensive SARS-CoV-2 Computational Analyses one lung biopsy sample from a SARS-CoV-2 patient. Results coming from FFPE sections were less 63 consistent presumably due to the collection of biospecimens from different sites within the lung. 64 Thus, the final set consisted of 64 DEGs: 48 up-regulated and 16 downregulated of which 38 had an 65 absolute Log2FC > 1 in at least one dataset (relevant genes from this list are shown in Table 1 ). 66 SERPINA3, an antichymotrypsin which was proposed as an interesting candidate for the 67 inhibition of viral replication [13] , was the only gene specifically upregulated in the four cell line 68 datasets tested (Table 1) . Other interesting up-regulated genes were the amidohydrolase VNN2, the 69 pro-fibrotic gene PDGFB, the beta-interferon regulator PRDM1 and the proinflammatory cytokines 70 CSF2 and IL-32. FKBP5, a known regulator of NF-kB activity, was among the consistently 71 downregulated genes. We also generated additional lists of DEGs that met different filtering criteria 72 (Supplementary Table 1B , see Supplementary File 1 for the complete DEG results for each dataset). 73 In order to better understand the underlying biological functions and molecular mechanisms 74 associated with the observed DEGs, we performed a hypergeometric test to detect statistically 75 significant overrepresented gene ontology (GO) terms among the DEGs having an absolute Log2FC 76 > 1 in each dataset separately. 77 Table 1 . Log2FC for selected genes that showed significant up-or down-regulation in SARS-CoV-2 78 infected samples (FDR-adjusted p-value < 0.05), and not in samples infected with the other viruses 79 tested. Log2FC values are only provided for statistically significant samples. 80 Gene Cell 86 Non-redundant functional enrichment of DEGs. Here we report a subset of non-redundant reduced terms consistently 87 enriched in more than one SARS-COV-2 cell line which were not detected in the other viruses' datasets. We added 88 generic categories of immunity, metabolism, cellular processes and signaling/epigeneitcs to the GO terms as colored 89 dots. (C) Top 20 differentially expressed isoforms (DEIs) in SARS-CoV-2 infected samples. Y-axis denotes the 90 differential usage of isoforms (dIF) whereas x-axis represents the overall log2FC of the corresponding gene. Thus, 91 DEIs also detected as DEGs by this analysis are depicted in blue. (D) The upper right diagram depicts different 92 manners by which TE family overexpression might be detected. While TEs may indeed be autonomously expressed, 93 the old age of most TEs detected points toward either being part of a gene (exonization or alternative promoter), or a 94 result of pervasive transcription. We report the functional enrichment for neighboring genes of differentially expressed 95 TEs (DETEs) specifically upregulated in SARS-CoV-2 Calu-3 and A549 cells (MOI 2). The same categories used in 96 subfigure (B) were attributed to the GO terms reported here. Comprehensive SARS-CoV-2 Computational Analyses Consistent with the findings of Blanco-Melo et al. [10] , GO enrichment analysis returned terms 98 associated with immune system processes, Pi3K/AKT signaling pathway, response to cytokine, 99 stress and virus, among others (see Supplementary File 2 for complete results). In addition, we 100 report 285 GO terms common to at least two cell line datasets infected with SARS-CoV-2, and 101 absent in the response to other viruses ( Figure 2B , Supplementary Table 2A) , including neutrophil 102 and granulocyte activation, interleukin-1-mediated signaling pathway, proteolysis, and stress 103 activated signaling cascades. Next, we wanted to pinpoint intracellular signaling pathways that may be modulated specifically 105 during SARS-CoV-2 infection. A robust signaling pathway impact analysis (SPIA) enabled us to 106 identify 30 pathways, including many involved in the host immune response, that were significantly 107 enriched among differentially expressed genes in at least one virus-infected cell line dataset 108 (Supplementary Table 3 ). More importantly, we predicted four pathways to be specific to 109 SARS-CoV-2 infection and observed that the significant pathways differed by cell type and 110 multiplicity of infection. The significant results included only one term common to A549 (MOI 0.2) 111 and Calu-3 cells (MOI 2), namely the interferon alpha/beta signaling. Additionally, we found the 112 amoebiasis (A549 cells, MOI 0.2), the p75(NTR)-mediated and the trka receptor signaling pathways 113 (A549 cells, MOI 2) as significantly impacted. 114 We also used a classic hypergeometric method as a complementary approach to our SPIA 115 pathway enrichment analysis. While there were generally higher numbers of significant results using 116 this method, we observed that the vast majority of enriched terms (FDR < 0.05) described 117 infections with various pathogens, innate immunity, metabolism, and cell cycle regulation Table 3 ). Interestingly, we were able to detect enriched KEGG pathways common to 119 at least two SARS-CoV-2 infected cell types and absent from the other virus-infected datasets Table 4 ). The consensus solution (obtained taking 127 into account the enumeration of all solutions) in A549 cells (MOI 2) also recovered decreased fluxes 128 in several lipid pathways: fatty acid, cholesterol, sphingolipid, and glycerophospholipid. In addition, 129 we detected an increased flux common to A549 and Calu-3 cell lines in reactive oxygen species (ROS) 130 detoxification, in accordance with previous terms recovered from functional enrichment analyses. SARS-CoV-2 infection induced an isoform switch of genes associated 132 with immunity and mRNA processing 133 We wanted to analyze changes in transcript isoform expression and usage associated with 134 SARS-CoV-2 infection, as well as to predict whether these changes might result in altered protein 135 function. We identified isoforms experiencing a switch in usage greater than or equal to 30% in 136 absolute value, and retrieved those with a Bonferroni-adjusted p-value less than 0.05. After 137 calculating the difference in isoform usage (dIF) per gene (in each condition), we performed 138 predictive functional consequence and alternative splicing analyses for all isoforms globally as well as 139 at the individual gene level. Figure 1A) . These biological consequences may result from increased multiple 153 exon skipping events and alternative transcription start sites via alternative 5' acceptor sites 154 (Supplementary Figure 1B) . While not significant, these trends implicate that the SARS-CoV-2 virus 155 may globally trigger host cell machinery to generate shorter isoforms that, while not shuttled for 156 degradation, either do not produce functional proteins or produce alternative aberrant proteins not 157 utilized in non-SARS-CoV-2 tissue conditions. MamRep137. Most of the TE families uncovered are ancient elements, incapable of transposing, or 197 harboring intrinsic regulatory sequences [37, 57, 70] . Eleven of the 16 TE families specifically 198 upregulated in SARS-COV-2 infected cells are long terminal repeat (LTR) elements, and include well 199 known TE immune regulators. For instance, the MER41B (primate specific TE family) is known to 200 contribute to Interferon gamma inducible binding sites (bound by STAT1 and/or IRF1) [14, 66] . 201 Other LTR elements are also enriched in STAT1 binding sites (MLT1L) [14] , or have been shown to 202 act as cellular gene enhancers (LTR13A [16, 32] ). Given the propensity for the TE families detected to impact nearby gene expression, we further 204 investigated the functional enrichment of genes near upregulated TE families (+-5kb upstream, 1kb 205 downstream). We detected GO functional enrichment of several immunity-related terms (e.g. MHC 206 protein complex, antigen processing, regulation of dendritic cell differentiation, T-cell tolerance 207 induction), metabolism related terms (such as regulation of phospholipid catabolic process), and 208 more interestingly a specific human phenotype term called "Progressive pulmonary function 209 impairment" (Figure 2D ). Even though we did not limit our search only to neighboring genes which 210 were also DE, we found several similar (and very specific) enriched terms in both analyses, for 211 instance related to immune response, endosomes, endoplasmic reticulum, vitamin (cofactor) 212 metabolism, among others. This result supports the idea that some responses during infection could 213 be related to TE-mediated transcriptional regulation. Finally, when we searched for enriched terms 214 related to each one of the 16 families separately, we also detected immunity related enriched terms 215 such as regulation of interleukins, antigen processing, TGFB receptor binding and temperature Figure 2D ). The SARS-CoV-2 genome is enriched in binding motifs for 40 human 220 RBPs, most of them conserved across SARS-CoV-2 genome isolates 221 Our next aim was to predict whether any host RNA binding proteins interact with the viral genome. 222 To do so, we first filtered the AtTRACT database [23] to obtain a list of 102 human RBPs and 205 223 associated Position Weight Matrices (PWMs) describing the sequence binding preferences of these 224 proteins. We then scanned the SARS-CoV-2 reference genome sequence to identify potential binding 225 sites for these proteins. Figure 3 illustrates our analysis pipeline. 226 We identified 99 human RBPs with 11,897 potential binding sites in the SARS-CoV-2 227 positive-sense genome. Since the SARS-CoV-2 genome produces negative-sense intermediates as part 228 of the replication process [36] , we also scanned the negative-sense molecule, where we found 11,333 229 potential binding sites for 96 RBPs (Supplementary Table 6 ). To find RBPs whose binding sites occur in the SARS-CoV-2 genome more often than expected by 231 chance, we repeatedly scrambled the genome sequence to create 1,000 simulated genome sequences 232 with an identical nucleotide composition to the SARS-CoV-2 genome sequence (30% A, 18% C, 20% 233 G, 32% T). We used these 1,000 simulated genomes to determine a background distribution of the 234 number of binding sites found for a specific RBP. This allowed us to pinpoint RBPs with 235 significantly more or significantly fewer binding sites in the actual SARS-CoV-2 genome than 236 expected based on the background distribution (two-tailed z-test, FDR-corrected P < 0.01). To 237 retrieve RBPs whose motifs were enriched in specific genomic regions, we also repeated this analysis 238 independently for the SARS-CoV-2 5'UTR, 3'UTR, intergenic regions, and for the sequence from the 239 negative sense molecule. Motifs for 40 human RBPs were found to be enriched in at least one of the 240 tested genomic regions, while motifs for 23 human RBPs were found to be depleted in at least one of 241 the tested regions (Supplementary Table 7) . 242 We next examined whether any of the 6,936 putative binding sites for these 40 enriched RBPs 243 were conserved across SARS-CoV-2 isolates. We found that 6,581 putative binding sites, Comprehensive SARS-CoV-2 Computational Analyses RBP binding sites in coding regions are likely to be conserved due to evolutionary pressure on 247 protein sequences rather than RBP binding ability. We therefore repeated this analysis focusing only 248 on putative RBP binding sites in the SARS-CoV-2 UTRs and intergenic regions. We found 124 249 putative RBP binding sites for 21 enriched RBPs in the UTRs and intergenic regions. Of these, 50 250 putative RBP binding sites for 17 RBPs were conserved in >95% of the available genome sequences; 251 6 in the 5'UTR, 5 in the 3'UTR, and 39 in intergenic regions (Supplementary Table 8 Table 9 ). According to GTEx data [25] [25, 64] ), indicating that 263 9/25 they are present in cells that are susceptible to SARS-CoV-2 infection. We next checked whether any 264 of these RBPs are known to interact with SARS-CoV-2 proteins and found that human poly-A 265 binding proteins C1 and C4 (PABPC1 and PABPC4) bind to the viral N protein [24] . Thus, it is 266 conceivable that these RBPs interact with both the SARS-CoV-2 RNA and proteins. Finally, we 267 combined these results with our analysis of differential gene expression to identify SARS-CoV-2 268 interacting RBPs that also show expression changes upon infection. The results of this analysis are 269 summarized for selected RBPs in Table 2 . Table 11 ), a bat coronavirus 279 with a genome that is 96% identical with that of SARS-CoV-2 [4, 88] . 280 We found that the pattern of enrichment and depletion of RBP binding motifs in SARS-CoV-2 is 281 different from that of the other two viruses. Specifically, the SARS-CoV-2 genome is uniquely 282 enriched for binding sites of CELF5 in its 5'UTR, PPIE on its 3'UTR, and ELAVL1 in the viral To test whether any viral sequence variants were associated with a change in disease severity in 290 human hosts, we analyzed 1,511 complete SARS-CoV-2 genomes that had associated clinical 291 metadata. The FDR-corrected statistical results from this analysis revealed four nucleotide 292 variations that were significantly associated with a change in viral pathogenesis. Three of these 293 nucleotide changes resulted in non-synonymous variations at the amino acid level, while the last one 294 was silent at the amino acid level. The first position was a T → G (L37F) substitution located in the 295 Nsp6 coding region (p < 1.48E-5), the second position was a C → T (P323L) substitution located in 296 the RNA-dependent RNA polymerase coding region (p < 2.01E-4), the third position was an A → G 297 (D614G) substitution located in the spike coding region (p < 1.61E-4), and the fourth was a 298 synonymous C → T substitution located in the Nsp3 coding region (p < 1.77E-4). As a further 299 validation step, we performed the same analysis comparing viral sequence variants against potential 300 confounders, such as the biological sex or age group of the patients. These comparisons validated 301 that these four positions were only identified as significant in the results of the disease severity 302 analysis. Airway epithelial cells are the primary entry points for respiratory viruses and therefore constitute 305 the first producers of inflammatory signals that, in addition to their antiviral activity, promote the 306 initiation of the innate and adaptive immune responses. Here, we report the results of a 307 complementary panel of analyses that enable a better understanding of host-pathogen interactions 308 which contribute to SARS-CoV-2 replication and pathogenesis in the human respiratory system. However, GM-CSF is considered more proinflammatory than other members of its family, such as 316 G-CSF, and is associated with tissue hyper-inflammation [52] . In accordance with our results, high 317 levels of GM-CSF were found in the blood of severe COVID-19 patients [84] , and several clinical 318 trials are planned using agents that either target GM-CSF or its receptor [53] . GM-CSF, together 319 with other proinflammatory cytokines such as IL-6, TNF, IFNg, IL-7 and IL-18, is associated with 320 the cytokine storm present in a hyperinflammatory disorder named hemophagocytic 321 lymphohistiocytosis (HLH) which presents with organ failure [12] . Moreover, cytokines related to 322 cytokine release syndrome (such as IL-1A/B, IL-6, IL-10, IL-18, and TNFA), showed increased 323 positive association to the severity of the disease in the blood from COVID-19 patients [49] . Another 324 proinflammatory cytokine specifically upregulated in SARS-CoV-2 infected cells was IL-32, which 325 together with CSF2, promotes the release of TNF and IL-6 in a continuous positive loop and 326 therefore contribute to this cytokine storm [90] . Interestingly, IL-6, IL-7 and IL-18 were found to be 327 upregulated in two of the four data sets of SARS-CoV-2 infected cells. Moreover, not only 328 upregulation, but also a shift in isoform usage of IL-6 was detected in NHBE and A549 infected 329 cells. A shift in 5' UTR usage in the presence of SARS-CoV-2 may be attributed to indirect host cell 330 signaling cascades that trigger changes in transcription and splicing activity, which could also 331 explain the overall increase in IL-6 expression. 332 SERPINA3, a gene coding for an essential enzyme in the regulation of leukocyte proteases, is also 333 induced by cytokines [28] . This was the only gene consistently upregulated in all cell line samples Comprehensive SARS-CoV-2 Computational Analyses carried out to validate this hypothesis [13] . Another interesting candidate gene, which has not been 337 implicated experimentally in respiratory viral infections and was upregulated in our analysis, was 338 VNN2. Vanins are involved in proinflammatory and oxidative processes, and VNN2 plays a role in 339 neutrophil migration by regulating b2 integrin [56] . In contrast, the downregulated genes included 340 SNX8, which has been previously reported in RNA virus-triggered induction of antiviral 341 genes [13, 26] ; and FKBP5, a known regulator of NF-kB activity [27] . These results suggest that the 342 SARS-CoV-2 virus tends to indirectly target specific genes involved in genome replication and host 343 antiviral immune response without eliciting a global change in cellular transcript processing or 344 protein production. proinflammatory cytokines by other cell types [34] . Signaling pathway analysis showed that type I 359 IFN response was greatly impacted in SARS-CoV-2 infected cells (A549 and Calu-3 cells at a MOI 360 of 0.2 and 2 respectively). In the same direction, a higher expression of PRDM1 (Blimp-1) that we 361 observed in the SARS-CoV-2 infected cells, could also contribute to the critical regulation of IFN 362 signaling cascades; interestingly, the TE family LTR13, which was also upregulated upon 363 SARS-CoV-2 infection, is enriched in PRDM1 binding sites [78] . Therefore, it is possible that 364 regulatory factors involved in IFN and immune response in the context of SARS-CoV-2 infection 365 could also be attributed to TE transcriptional activation. In the same direction, we detected the 366 upregulation of several TE families in SARS-CoV-2 infected cells that have been previously 367 implicated in immune regulation. Moreover, 16 upregulated families were specific to SARS-CoV-2 368 infection in Calu-3 and A549 cell lines. The MER41B family, for instance, is known to contribute to 369 interferon gamma inducible binding sites (bound by STAT1 and/or IRF1) [14] . Functional 370 enrichment analysis of nearby genes were in accordance with these findings, since several immunity 371 related terms were enriched along with "progressive pulmonary impairment". In parallel, TEs seem 372 to be co-regulated with phospholipid metabolism, which directly affects the Pi3K/AKT signaling 373 pathway, central to the immune response and which were detected in our functional enrichment and 374 metabolism flux analyses. RBPs are another example of host regulatory factors involved either in the response of human 376 cells to SARS-CoV-2 or in the manipulation of human machinery by the virus. We aimed at finding 377 RBPs which potentially interact with SARS-CoV-2 genomes in a conserved and specific way. Five of 378 the proteins predicted to be interacting with the viral genome by our pipeline (EIF4B, hnRNPA1, 379 PABPC1, PABPC4, and YBX1) were experimentally shown to bind to SARS-CoV-2 RNA in an 380 infected human liver cell line, based on a recent preprint [67] . Among the RBPs whose potential binding sites were enriched and conserved within the 382 SARS-CoV-2 virus genomes is the EIF4B, suggesting that the SARS-CoV-2 virus protein translation 383 could be EIF4B-dependent. We also detected the upregulation of EIF4B in A549 and Calu-3 cells, 384 which might indicate that this protein is sequestered by the virus and therefore the cells need to 385 increase its production. Moreover, this protein was predicted to interact specifically with the 386 intergenic region upstream of the gene encoding the SARS-CoV-2 membrane (M) protein, one of 387 four structural proteins from this virus. Another conserved RBP, which was also upregulated in infected cells, is the Poly(A) Binding 389 Protein Cytoplasmic 1 (PABPC1), which has well described cellular roles in mRNA stability and 390 translation. PABPC1 has been previously implicated in multiple viral infections. The activity of 391 PABPC1 is modulated to inhibit host protein transcript translation, promoting viral RNA access to 392 the host cell translational machinery [72] . Importantly, the 3' UTR region of SARS-CoV2 is also 393 enriched in binding sites of the PABPC1 and the PPIE RBPs, the latter of which is known to be 394 involved in multiple processes, including mRNA splicing [9, 33] . Interestingly, PABPC1 and PABPC4 395 interact with the SARS-CoV-2 N protein, which stabilizes the viral genome [24] . This raises the 396 possibility that the viral genome, N protein, and human PABP proteins may participate in a joint 397 protein-RNA complex that assists in viral genome stability, replication, and/or 398 translation [1, 59, 62, 71, 72] . An interesting result was that the binding motifs for hnRNPA1, which has been shown to 400 interact with other coronavirus genomes, were enriched specifically in the 3'UTR of SARS-CoV-2 401 even though they were depleted in the genome overall. The hnRNPA1 protein was described to Indeed, the D → G mutation at amino acid position 614 in the Spike protein found in our analysis 410 has recently been proposed to have increased viral infectivity [38] . In addition, this same mutation 411 has also been associated with an increase in the case fatality rate [6] , however, these hypotheses need 412 further verification. The P323L mutation in the RNA-dependent RNA polymerase (RdRP) has been 413 identified previously, although in that study it was associated with changes in geographical location 414 of the viral strain [58] . Finally, the L37F mutation in the Nsp6 protein has been reported to be 415 located outside of the transmembrane domain [11] , being present at a high frequency [86] , and 416 proposed to negatively affect protein structure stability [8] . Our statistics may contain bias based on 417 the number of genome sequences being collected earlier versus later in the pandemic, genomes 418 lacking clinical outcome metadata, and in the case of the Spike D614G a potential increase of fitness 419 associated with this mutation. However, the fact that more than one of our predictions has also been 420 detected by different studies justifies future wet lab experiments to compare the effect of the other 421 identified mutations. were downloaded from GISAID on 19 May, 2020 [69] . 455 RNAseq data processing and differential expression analysis 456 Data was downloaded from SRA using sra-tools (v2.10.8; https://github.com/ncbi/sra-tools) and 457 transformed to fastq with fastq-dump. FastQC (v0.11.9; https://github.com/s-andrews/FastQC) 458 and MultiQC (v1.9) [20] were employed to assess the quality of the data used and the need to trim 459 reads and/or remove adapters. Selected datasets were mapped to the human reference genome 460 (GENCODE Release 19, GRCh37.p13) utilizing STAR (v2.7.3a) [17] . Alignment statistics were used 461 to determine which datasets should be included in subsequent steps. Resulting SAM files were 462 converted to BAM files employing samtools (v1.9) [43] . Next, read quantification was performed 463 using StringTie (v2.1.1) [60] and the output data was postprocessed with an auxiliary Python script 464 provided by the same developers to produce files ready for subsequent downstream analyses. For the 465 second gene expression dataset, raw counts were downloaded from GEO. DESeq2 (v1.26.0) [47] was 466 used in both cases to identify differentially expressed genes (DEGs). Finally, an exploratory data 467 analysis was carried out based on the transformed values obtained after applying the variance 468 stabilizing transformation [3] implemented in the vst() function of DESeq2 [48] . Hence, principal 469 component analysis (PCA) was performed to evaluate the main sources of variation in the data and 470 remove outliers. Gene ontology enrichment analysis 472 The DEGs produced by DESeq2 with an absolute Log2FC > 1 and FDR-adjusted p-value < 0.05 473 were used as input to a general GO enrichment analysis [5, 76] . Each term was verified with a 474 hypergeometric test from the GOstats package (v2.54.0) [21] and the p-values were corrected for 475 multiple-hypothesis testing employing the Bonferroni method [42] . GO terms with a significant 476 adjusted p-value of less than 0.05 were reduced to representative non-redundant terms with the use 477 of REVIGO [73] . 478 Host signaling pathway enrichment 479 The DEG lists produced by DESeq2 with an absolute Log2FC > 1 and FDR-adjusted p-value < 0.05 480 were used as input to the SPIA algorithm to identify significantly affected pathways from the R 481 graphite library [65, 75] . Pathways with Bonferroni-adjusted p-values less than 0.05 were included in 482 downstream analyses. The significant results for all comparisons from publicly available data from 483 KEGG, Reactome, Panther, BioCarta, and NCI were then compiled to facilitate downstream 484 comparison. Hypergeometric pathway enrichments were performed using the Database for 485 Annotation, Visualization and Integrated Discovery (DAVID, v6.8) [30] . Integration of transcriptomic analysis with the human metabolic network 487 To detect increased and decreased fluxes of metabolites we projected the transcriptomic data onto 488 the human reconstructed metabolic network Recon (v2.04) [77] . First, we ran EBSeq [40] [81] . In summary, we filtered for isoforms that experienced |≥ 30% | switch in usage for 497 each gene and were corrected for false discovery rate (FDR) with a q-value < 0.05. Following 498 filtering for significant isoforms, we externally predicted their coding capabilities, protein structure 499 15/25 stability, peptide signaling, and shifts in protein domain usage using The Coding-Potential 500 Assessment Tool (CPAT) [82] , IUPred2 [18] , SignalP [2] and Pfam tools respectively [19] . These 501 external results were imported back into IsoformSwitchAnalyzeR and used to further identify 502 alternative splicing events and functional consequences as well as visualize the overall and individual 503 effects of isoform switching data. Specifically, to calculate differential analysis between samples, 504 isoform expression and usage were measured by the isoform fraction (IF) value, which quantifies the 505 individual isoform expression level relative to the parent gene's expression level: isof orm expression gene expression By proxy, the difference in isoform usage between samples (dIF) measures the effect size between 507 conditions and is calculated as follows: 508 dIF = IF 2-IF 1 dIF was measured on a scale of 0 to 1, with 0 = no (0%) change in usage between conditions and 509 1 = complete (100%) change in usage. The sum of dIF values for all isoforms associated with one 510 gene is equal to 1. Gene expression data was imported from the aforementioned DESeq2 results. The top 30 isoforms per dataset comparison were identified by ranking isoforms by gene switch 512 q-value, i.e. the significance of the summation of all isoform switching events per gene between mock 513 and infected conditions. Transposable Element Analysis 515 TE expression was quantified using the TEcount function from the TEtools software [41] . TEcount 516 detects reads aligned against copies of each TE family annotated from the reference genome. DETEs 517 in infected vs mock conditions were detected using DEseq2 (v1.26.0) [47] with a matrix of counts for 518 genes and TE families as input. Functional enrichment of nearby genes (upstream 5kb and 519 downstream 1kb of each TE copy within the human genome) was calculated with GREAT [51] using 520 options "genome background" and "basal + extension". We only selected occurrences statistically 521 significant by region binomial test. was scanned with the remaining PWMs using the TFBSTools R package (v1.20.0) [74] . A minimum 528 score threshold of 90% was used to identify putative RBP binding sites. Enrichment analysis for putative RBP binding sites 530 The sequences of the SARS-CoV-2 genome, 5'UTR, 3'UTR, intergenic regions and negative sense 531 molecule were each scrambled 1,000 times. Each of the 1,000 scrambled sequences was scanned for 532 RBP binding sites as described above. The number of binding sites for each RBP was counted, and 533 the mean and standard deviation of the number of sites was calculated for each RBP, per region, 534 across all 1,000 simulations. A minimum FDR-adjusted p-value of 0.01 was taken as the cutoff for 535 enrichment. This analysis was repeated with the reference genomes of SARS-CoV and RaTG13 . Conservation analysis for putative RBP binding sites 537 The multiple sequence alignment of 27,592 SARS-CoV-2 genome sequences was downloaded from 538 GISAID [69] on May 19th 2020. For each putative RBP binding site, we selected the corresponding 539 columns of the multiple sequence alignment. We then counted the number of genomes in which the 540 sequence was identical to that of the reference genome. Viral genotype-phenotype correlation 542 All complete SARS-CoV-2 genomes from GISAID, together with the GenBank reference sequence, 543 were aligned with MAFFT (v7.464) within a high-performance computing environment using 1 544 thread and the -nomemsave parameter [55] . Sequences responsible for introducing excessive gaps in 545 this initial alignment were then identified and removed, leaving 1,511 sequences that were then used 546 to generate a new multiple sequence alignment. The disease severity metadata for these sequences 547 was then normalized into four categories: severe, moderate, mild, and unknown. Next, the sequence 548 data and associated metadata were used as input to the meta-CATS algorithm to identify aligned 549 positions that contained significant differences in their base distribution between 2 or more disease 550 severities [61] . The Benjamini-Hochberg multiple hypothesis correction was then applied to all 551 positions [7] . The top 50 most significant positions were then evaluated against the annotated 552 protein regions of the reference genome to determine their effect on amino acid sequence. Code availability 554 Code for these analyses is available at https://github.com/vaguiarpulido/covid19-research. The authors received no specific funding to support this work. We would like to thank the Virtual BioHackathon on COVID-19 that took place during April 2020 582 (https://github.com/virtual-biohackathons/covid-19-bh20) for fostering an environment that 583 triggered this collaboration and in particular the Gene Expression group for the fruitful discussions. 584 This work was performed using the computing facilities of the CC/PRABI/LBBE in France and the 585 France Génomique e-infrastructure (ANR-10-INBS-09-08), the HPC facilities of the University of 586 Luxembourg [80] . We would also like to thank Slack for providing us with free access to the 587 professional version of the platform. 588 We would also like to thank Slack for providing us with free access to the professional version of the 589 platform. Host factors in positive-strand RNA virus genome replication SignalP 5.0 improves signal peptide predictions using deep neural networks Differential expression analysis for sequence count data The proximal origin of SARS-CoV-2 Gene ontology: tool for the unification of biology. the gene ontology consortium SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate Controlling the false discovery rate: A practical and powerful approach to multiple testing Evolutionary analysis of SARS-CoV-2: how mutation of Non-Structural protein 6 (NSP6) could affect viral autophagy Cryo-EM structure of a human spliceosome activated for step 2 of splicing Imbalanced host response to SARS-CoV-2 drives development of COVID-19 An exclusive 42 amino acid signature in pp1ab protein provides insights into the evolutive history of the 2019 novel human-pathogenic coronavirus (SARS-CoV-2) Macrophage activation syndrome in adults: recent advances in pathophysiology, diagnosis and treatment Integrating transcriptomic and proteomic data using predictive regulatory network models of host response to pathogens Regulatory evolution of innate immunity through co-option of endogenous retroviruses Regulatory activities of transposable elements: from conflicts to benefits Endogenous retroviruses are a source of enhancers with oncogenic potential in acute myeloid leukaemia STAR: ultrafast universal RNA-seq aligner IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content The pfam protein families database in 2019 MultiQC: summarize analysis results for multiple tools and samples in a single report Using GOstats to test gene lists for GO term association Human coronavirus: Host-Pathogen interaction ATtRACT-a database of RNA-binding proteins and associated motifs A SARS-CoV-2-Human Protein-Protein interaction map reveals drug targets and potential Drug-Repurposing. bioRxiv Human genomics. the Genotype-Tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans SNX8 modulates the innate immune response to RNA viruses by regulating the aggregation of VISA Signal responsiveness of IκB kinases is determined by cdc37-assisted transient interaction with hsp90 Immune system disturbances in schizophrenia Clinical features of patients infected with 2019 novel coronavirus in wuhan, china Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources Heterogeneous nuclear ribonucleoprotein a1 binds to the 3'-untranslated region and mediates potential 5'-3'-end cross talks of mouse hepatitis virus RNA Systematic identification and characterization of regulatory elements derived from human endogenous retroviruses Purification and characterization of native spliceosomes suitable for three-dimensional structural analysis Natural type I interferon-producing cells as a link between innate and adaptive immunity Targeting SARS-CoV-2: a systematic drug repurposing approach to identify promising inhibitors against 3c-like proteinase and 2'-o-ribose methyltransferase The architecture of SARS-CoV-2 transcriptome Human transposable elements in repbase: genomic footprints from fish to humans Identifying SARS-CoV-2-related coronaviruses in malayan pangolins EBSeq: an empirical bayes hierarchical model for inference in RNA-seq experiments TEtools facilitates big data expression analysis of transposable elements and reveals an antagonism between their activity and that of piRNA genes An open-source software program for performing bonferroni and related corrections for multiple comparisons The sequence Alignment/Map format and SAMtools Heterogeneous nuclear ribonucleoprotein A1 binds to the transcription-regulatory region of mouse hepatitis virus RNA Diverse roles of host RNA binding proteins in RNA virus replication Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19 Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Virus-induced transposable element expression up-regulation in human and mouse host cells GREAT improves functional interpretation of cis-regulatory regions Manson, and HLH Across Speciality Collaboration, UK. COVID-19: consider cytokine storm syndromes and immunosuppression Computational studies of drug repurposing and synergism of lopinavir, oseltamivir and ritonavir binding with SARS-CoV-2 protease against COVID-19 Parallelization of MAFFT for large-scale multiple sequence alignments Linkage between coenzyme a metabolism and inflammation: roles of pantetheinase The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant Translational control of the abundance of cytoplasmic poly(a) binding protein in human cytomegalovirus-infected cells StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Metadata-driven comparative analysis tool for sequences (meta-CATS): an automated process for identifying significant sequence variations that correlate with virus attributes Poly(A)-binding protein binds to the non-polyadenylated 3' untranslated region of dengue virus and modulates translation efficiency MOOMIN -mathematical exploration of 'omics data on a MetabolIc network Single-Cell transcriptomic analysis of human lung provides insights into the pathobiology of pulmonary fibrosis graphite -a bioconductor package to convert pathway topology to gene network MER41 repeat sequences contain inducible STAT1 binding sites A direct RNA-protein interaction atlas of the SARS-CoV-2 RNA Heterogeneous nuclear ribonucleoprotein A1 regulates RNA synthesis of a cytoplasmic virus GISAID: Global initiative on sharing all influenza data -from vision to reality Interspersed repeats and other mementos of transposable elements in mammalian genomes Viral and cellular mRNA-specific activators harness PABP and eIF4G to promote translation initiation downstream of cap binding Poly(A)-binding protein (PABP): a common viral target REVIGO summarizes and visualizes long lists of gene ontology terms TFBSTools: an R/bioconductor package for transcription factor binding site analysis A novel signaling pathway impact analysis The Gene Ontology Consortium and The Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong A community-driven global reconstruction of human metabolism Transposable elements generate regulatory novelty in a tissue-specific fashion A cluster of cases of severe acute respiratory syndrome in hong kong Management of an academic hpc cluster: The ul experience IsoformSwitchAnalyzeR: analysis of changes in genome-wide patterns of alternative splicing and its functional consequences CPAT: Coding-Potential assessment tool using an alignment-free logistic regression model Immune cell profiling of COVID-19 patients in the recovery stage by single-cell sequencing TH17 responses in cytokine storm of COVID-19: An emerging target of JAK2 inhibitor fedratinib Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2 Genotyping coronavirus SARS-CoV-2: methods and implications Isolation of a novel coronavirus from a man with pneumonia in saudi arabia A pneumonia outbreak associated with a new coronavirus of probable bat origin Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2 Important role of the IL-32 inflammatory network in the host response against viral infection