key: cord-0746657-5a65q3no authors: Cheng, Xuanjin; Yan, Junran; Liu, Yongxing; Wang, Jiahe; Taubert, Stefan title: eVITTA: a web-based visualization and inference toolbox for transcriptome analysis date: 2021-05-21 journal: Nucleic Acids Res DOI: 10.1093/nar/gkab366 sha: 8f8d11843d4ad5824f1fcd7136022caa858e3cdc doc_id: 746657 cord_uid: 5a65q3no Transcriptome profiling is essential for gene regulation studies in development and disease. Current web-based tools enable functional characterization of transcriptome data, but most are restricted to applying gene-list-based methods to single datasets, inefficient in leveraging up-to-date and species-specific information, and limited in their visualization options. Additionally, there is no systematic way to explore data stored in the largest transcriptome repository, NCBI GEO. To fill these gaps, we have developed eVITTA (easy Visualization and Inference Toolbox for Transcriptome Analysis; https://tau.cmmt.ubc.ca/eVITTA/). eVITTA provides modules for analysis and exploration of studies published in NCBI GEO (easyGEO), detailed molecular- and systems-level functional profiling (easyGSEA), and customizable comparisons among experimental groups (easyVizR). We tested eVITTA on transcriptomes of SARS-CoV-2 infected human nasopharyngeal swab samples, and identified a downregulation of olfactory signal transducers, in line with the clinical presentation of anosmia in COVID-19 patients. We also analyzed transcriptomes of Caenorhabditis elegans worms with disrupted S-adenosylmethionine metabolism, confirming activation of innate immune responses and feedback induction of one-carbon cycle genes. Collectively, eVITTA streamlines complex computational workflows into an accessible interface, thus filling the gap of an end-to-end platform capable of capturing both broad and granular changes in human and model organism transcriptomes. Transcriptome profiling is an essential technique to study gene regulation in development and disease (1) . The emergence of affordable high-throughput microarray and sequencing technologies has resulted in the rapid expansion of transcriptome experiments, which in turn greatly increased the demand for robust analytical tools for data interpretation. Effective transcriptome interpretation involves three key aspects: drawing inference from published studies, translating the data into meaningful biological knowledge, and comparing multiple conditions to each other to discern underlying regulatory changes. First, knowledge from past studies is essential for hypothesis generation and data interpretation. The Gene Expression Omnibus (GEO) database, funded by the Na-tional Center for Biotechnology Information (NCBI), is the largest public repository for transcriptome datasets (144 751 data series, 4.2 million samples on 23 February 2021) (2, 3) . Despite the treasure trove of data in GEO, no tool yet exists that can systematically extract and process such data for inferential use. Most web-based GEO data analysis tools are limited in functionality: some only provide access to microarray data without differential expression (DE) analysis (4) , while others analyzing RNA-sequencing (RNA-seq) data rely on their own processed data repositories, which tend to update slowly and often exclude datasets due to unsupported species or experiment type (5) . Second, uncovering mechanistic insights from gene expression data is central to all types of transcriptomic studies. Functional enrichment analysis (aka functional profiling) is the primary technique for this purpose, and is commonly used to interpret gene lists derived from many omics platforms (6) . To date, a variety of web-based enrichment analysis tools have been developed (7) (8) (9) (10) (11) (12) (13) (14) , but these are sometimes suboptimal for interpreting transcriptome data. Surveys have shown that most tools are outdated in their gene annotation (gene set, GS) databases, sometimes by several years, which can severely impact functional interpretation and follow-up experiments (15) . When multiple GS databases are analyzed together, most tools list results separately in tables or simple graphs, which is ineffective in integrating the information (8) (9) (10) (11) (12) . Some tools rely on literature-curated resources such as Gene Ontology (GO), which are sometimes not precise enough to capture the functions of genes in the biological system of interest (8) (9) (10) (11) (12) . Approach-wise, gene-list-based overrepresentation analysis (ORA) remains predominant (7-13); alternative methods for transcriptomic studies, such as pre-ranked Gene Set Enrichment Analysis (GSEA) based on gene-set scoring (16) , are not supported by most tools. The only tool supporting GSEA to our knowledge (14) requires users to supply a separately generated rank file, setting a hurdle for nonbioinformaticians. Furthermore, most existing tools provide limited options for visualizing transcriptome patterns. Third, multiple dataset comparisons play a crucial role in interpreting multi-group experiments and in comparing new results to published data. Despite the growth in demand, no web-based tool exists yet to our knowledge that provides a complete pipeline for identifying and visualizing intersections and disjoints among multiple transcriptome profiles. Tools exist for filtering gene lists (17) or plotting certain types of visualizations such as Venn diagrams (18) (19) (20) , UpSet plots (19, 21) , and heatmaps (22) , but stringing these modules into an inference workflow remains tedious. The one tool that does provide a graphical workflow for intersection analysis, to our knowledge, is only tailored to pairwise comparisons, and does not provide interactive or customizable visualizations (23) . To address these challenges in transcriptome analysis and interpretation, we have developed eVITTA (easy Visualization and Inference Toolbox for Transcriptome Analysis; https://tau.cmmt.ubc.ca/eVITTA/). It consists of three modules that can work together or as standalones: easy-GEO accesses, analyzes, and visualizes transcriptome data in NCBI GEO; easyGSEA visually delineates gene expression patterns by functional profiling; and easyVizR com-pares and contrasts multiple datasets via an integrated intersection analysis workflow. Above all, eVITTA's interactive and user-friendly interface makes transcriptome analysis accessible for wet and dry lab biologists alike. The multiple entry and exit points in the workflow also allow users to adapt one or more individual tool(s) into their own custom analysis pipeline ( Figure 1 , Table 1 ). To test eVITTA and demonstrate its capabilities, we performed two independent evaluation studies on published gene expression datasets: (i) transcriptomes of SARS-CoV-2 infected human nasopharyngeal (NP) swab samples and (ii) transcriptomes of Caenorhabditis elegans worms deficient in sams-1/MAT1A or sbp-1/SREBP. We were able to recapitulate original findings and also discovered additional biological insights that were experimentally confirmed in studies following the original profiling experiments, demonstrating eVITTA's effectiveness in transcriptome interpretation. The eVITTA web server runs on Ubuntu Linux (18.04.5 LTS) with 32 GB memory, 16-core CPUs and a 10TB hard drive with Apache (version 2.4.29, https://httpd. apache.org), R (version 4.0.2, https://www.r-project.org/), and R Shiny Server (version 1.5.14.948, https://rstudio.com/ products/shiny/download-server/). eVITTA utilizes several third-party tools, including GEOquery (24), edgeR (25) , limma (26) , plotly (https://plotly.com/r/), Tidyverse (27) , fgsea (28) , gprofiler2 (29) , pathview (30), visNetwork (https://datastorm-open.github.io/visNetwork/), VennDiagram (31), UpsetR (32), RRHO (33) and others (Supplementary Table S1 ). eVITTA has been tested successfully on several browsers, including Safari v13.1.2, Firefox v65.0, and Chrome v86.0.4240.111. Detailed methods in Supplementary Data Section 1. Inputs: The unique GEO identifier of an NCBI GEO series, which begins with 'GSE'. Data processing: Gene expression data matrix and design matrix are retrieved automatically. If the count table in an RNA-seq study is specified as raw, genes expressed at a level less than 1 count per million (CPM) reads in at least five of the samples, or the minimum number of biological repeats in each condition, whichever less, are excluded from further analysis; if the count table in an RNAseq experiment is normalized, or if the dataset is based on microarrays, a threshold of 1 is applied likewise to exclude barely expressed genes. Next, for RNA-seq datasets, raw read counts are normalized using the trimmed mean of Mvalues (TMM) in edgeR (25) to adjust samples for differences in library size and Limma-voom (26) transformed using the default parameters; for microarray and normalized RNA-seq counts, Limma-voom (26) transformation is applied with the normalize = 'quantile' option. Then, a linear model using weighted least squares for each gene is fitted with Limma-lmFit (26); batch effect, if any, is processed as Outputs: (a) converted rank and DE tables (GSEA module); (b) enrichment table (importable into easyVizR); (c) results summary with interactive bar, bubble, keyword, Manhattan and volcano plots; (d) individual GS's statistics, and its distribution in the genome background (GSEA module) delineated with enrichment, density, box and violin plots; (e) pathway maps (KEGG (34), Reactome (35) and Wikipathways (36) ); and (f) enrichment network with clustering dendrogram. Inputs: Comma-delimited data table(s) containing identifiers, differential expression metric (e.g. log 2 -transformed fold change, enrichment score), P-value (pval), and FDR or adjusted P-value (padj). Data processing: For each selected dataset, filtered lists of genes or GSs are generated from user-selected filters (default: pval < 0.05). From filtered lists, users may select an intersection of interest by defining set relationships (Supplementary Figure S1 ). The selected intersection is highlighted in Venn and Upset plots, and terms in the intersection are used to generate interactive visualizations. Outputs: (a) filtered gene lists (importable into easyGSEA or other ORA tools) and corresponding expression tables; (b) Venn and UpSet plots; (c) heatmap for terms in chosen intersection; (d) 2D and 3D scatter plots, rank-rank hypergeometric overlap (RRHO) plot and rank-rank scatter for correlation analysis (33) ; (e) volcano and bar plot for single datasets; (f) leading-edge network (for GSEA outputs); and (g) text enrichment word cloud for identifiers. Since early 2020, the global spread of SARS-CoV-2 has led to concerted efforts to characterize its etiology in human patients. To test the analytical pipeline of eVITTA, we reanalyzed a published RNA-seq dataset (GSE152075 (37)) of nasopharyngeal (NP) swabs from 430 SARS-CoV-2-infected individuals and 54 uninfected controls (for detailed steps, see Supplementary Data Section 2.1). First, using easyGEO, we retrieved the count data and design matrix submitted by the authors, and performed DE analysis. In line with the original findings (37), we found that SARS-CoV-2 infection induced an interferon-driven antiviral response in the nasopharynx, upregulating genes encoding antiviral factors (e.g. IFIT1/2/3/6, RSAD2) and chemokines (e.g. CXCL9/10/11) (Figure 2A) . Next, to test if eVITTA's analytical capacity using combinatorial GS databases improves the sensitivity of finding molecular patterns, we performed GSEA using the default selection of biological process and pathway databases in easyGSEA. We found that olfactory transduction GSs were downregulated ( Figure 2B , Supplementary Figure S2A , Supplementary Table S3) . At the gene level, key olfactory transducers, including G protein subunit alpha L (GNAL) (38) and cyclic nucleotide-gated channel subunit alpha 4 (CNGA4) (39) , showed reduced expression ( Figure 2C , D, Supplementary Figure S2B ). The observation of deregulated olfactory signaling during SARS-CoV-2 infection agrees well with the clinical presentation of anosmia in COVID-19 cases (40) and with a recent report of transient olfactory dysfunction in mice infected with SARS-CoV-2 (41). Together, this demonstrates eVITTA's capacity to capture both broad and granular patterns in gene expression, which facilitates the identification of biological insights. To test eVITTA's functional profiling capacity and computational reproducibility, we analyzed published C. elegans transcriptomes characterizing the response to Sadenosylmethionine (SAM) deficiency. The universal methyl donor SAM is produced by SAM synthase (SAMS in C. elegans; MAT in mammals) in the one-carbon cycle (1CC). We reanalyzed published microarray and RNA-seq data characterizing responses to sams-1 RNA interference (RNAi), which were previously analyzed by ORA (42) and used to validate the C. elegans functional database and annotation tool WormCat (43) . We first compared three independent transcriptome analyses of sams-1 RNAi-treated worms, including two microarray datasets and one RNAseq dataset (42, 44) (Supplementary Table S4 ; detailed steps in Supplementary Data Section 2.2). Despite the difference of the RNA-seq study compared to the two microarrays in terms of experimental platform and upstream processing, enrichment analysis with eVITTA showed a substantial overlap ( Figure 3A ) and strong correlation in terms of significantly regulated GSs (R 2 = 0.95 and 0.87; Figure 3B , C), and also in terms of overall enrichment profiles (rho = 0.77 and 0.72; Figure 3D , E). This suggests that the eVITTA pipeline is robust enough to handle comparisons of studies with differences in upstream platforms and processing. In C. elegans, SAM deficiency induces immune responses in the absence of pathogen infection, and a similar response occurs upon depletion of the SAM-regulated lipid synthesis regulator sbp-1/SREBP (42) . We thus tested eVITTA's efficacy to capture convergent and divergent regulations following sams-1 or sbp-1 RNAi. Consistent with previous findings (42, 43) , we confirmed a strong immune signature in both sams-1 and sbp-1 deficiency (Figure 4A-B) . Interestingly, eVITTA's comprehensive GS databases allowed us to discover specific changes in one branch of innate immunity, Toll-like receptor (TLR) signaling ( Figure 4B ). Prior studies have shown that, in C. elegans, TLR signaling is required for the innate immune response against Gram-negative bacteria (45, 46) ; this may explain why the original study (44) found sams-1 RNAi worms to be exquisitely susceptible to infection by P. aeruginosa, a Gram-negative bacterium. Although sams-1 and sbp-1 RNAi affected the transcriptome in similar ways, a small set of GSs were upregulated in sams-1 RNAi but downregulated in sbp-1 RNAi ( Figure 4C -E; Supplementary Table S5 ). Most of these GSs pertain to lipid metabolism, recapitulating published findings that lipogenesis is elevated by sams-1 deficiency but suppressed by sbp-1 deficiency (42, 47) . Interestingly, the 1CC also follows this pattern, not only confirming a known negative feedback loop from sbp-1 to the 1CC (42, 47) , but also indicating that SAM deficiency alone causes compensatory induction of the 1CC, in line with a recent study (48) . Overall, this exemplifies the utility of eVITTA in revealing both (E) The rank scatter plot shows the Spearman correlation between unfiltered datasets A and C. X and Y axes: ranks of −log 10 -transformed P-values signed by ES (−log 10 (pval)*sign(ES)) in datasets A and C, respectively. n = 961; rho = 0.72; pval< 2.2e−16. ES, enrichment score, also displayed as 'Value'; GEO, Gene Expression Omnibus; GS, gene set; GSEA, pre-ranked gene set enrichment analysis; intersection, selected parameters in easyVizR '3.2 Intersection of Interest'; padj, adjusted P-value, also displayed as 'FDR'; pval, P-value, also displayed as 'PValue'; RA, Reactome Pathways; rho, Spearman's rank coefficient; RNA-seq, RNA sequencing. convergent and differential patterns in multiple datasets at high resolution. Assembling a dedicated analytical pipeline to interpret transcriptomes is a complex task with many challenges. eVITTA addresses these challenges by automating the query and analysis of NCBI GEO transcriptome data with a standardized pipeline (easyGEO), performing functional profiling with 100+ monthly-updated, species-specific GS libraries (easyGSEA), and providing a workflow for systematic comparison of expression patterns in multiple datasets (easyVizR). As illustrated in the evaluation studies, eVITTA's workflow and interactive visualizations enable efficient discovery of both broad and subtle changes in expression, which other tools were unable to fully capture. Although we developed eVITTA for transcriptome analysis and interpretation, its tools can also be applied to other omics studies. For instance, easyGSEA can functionally characterize lists of genes or proteins generated from any omics platform, and easyVizR can handle any differential expression data with statistical significance (https: //tau.cmmt.ubc.ca/eVITTA/#userguide). Like all similar web servers, eVITTA has some limitations. easyGEO cannot handle datasets where count data are missing; it also relies on user-supplied count data, which may be processed using different methods and thus cannot be used for between-study comparisons. In addition, it does not yet support datasets deposited in ArrayExpress or the European Nucleotide Archive (ENA). Future iterations of eVITTA may include access to these resources Supplementary Table S5 . (E) The graph shows the enrichment network of 11 GSs from Figure 4D . Edges reflect significant overlap of leading-edge genes in dataset A; other parameters are as in B. 1CC, one-carbon cycle; GO, gene ontology; BP, biological process; C2, WormCat Category 2; ES, enrichment score, also displayed as 'Value'; GEO, Gene Expression Omnibus; GS, gene set; GSEA, pre-ranked gene set enrichment analysis; Intersection, selected parameters in easyVizR '3.2 Intersection of Interest'; KEGG, Kyoto Encyclopaedia of Genes and Genomes; padj, adjusted P-value, also displayed as 'FDR'; pval, P-value, also displayed as 'PValue';RA, Reactome Pathways; RNA-seq, RNA sequencing; TLR, Toll-like receptor. and offer customizable DE analysis with limma (26), edgeR (25) and DESeq2 (49) . The GSEA method in easyGSEA assumes genes in a GS change in one direction (either predominantly up-or down-regulated); methods to evaluate GSs regardless of the direction are to be incorporated (50) . Future iterations of eVITTA may also adopt more effective weighing techniques in prioritizing GSs with high phenotype relevance, especially in the context of other omics data such as ChIP-seq and genomic mutation data (51) . In easyVizR, most modules rely on comparisons between filtered gene lists; more options for unbiased comparisons, such as Spearman's correlation heatmap, may be included in the future. Future iterations of eVITTA may also address challenges in comparing transcriptomes across species (51, 52) . Lastly, future releases of eVITTA may provide a more seamless user experience by providing direct links be-tween its three modules. Overall, eVITTA aims to both improve existing pipelines for omics data analysis and to make transcriptome interpretation more accessible to the wider research community. eVITTA is free and open to all users and there is no login requirement (https://tau.cmmt.ubc.ca/eVITTA/). Its source code is available in the GitHub repository (https://github. com/easygsea/eVITTA.git). Supplementary Data are available at NAR Online. RNA-Seq: a revolutionary tool for transcriptomics Gene Expression Omnibus: NCBI gene expression and hybridization array data repository The Gene Expression Omnibus database shinyGEO: a web-based application for analyzing gene expression omnibus datasets GREIN: an interactive web platform for re-analyzing GEO RNA-seq data Pathway and network analysis of cancer genomes Metascape provides a biologist-oriented resource for the analysis of systems-level datasets Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Babelomics 5.0: functional interpretation for new generations of genomic data KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs Impact of outdated gene annotations on pathway enrichment analysis Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles A-Lister: a tool for analysis of differentially expressed omics entities across multiple pairwise comparisons VennDiagramWeb: a web application for the generation of highly customizable Venn and Euler diagrams Intervene: a tool for intersection and visualization of multiple gene or genomic region sets InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams UpSet: visualization of Intersecting Sets iSEE: Interactive SummarizedExperiment Explorer. F1000Research GRACOMICS: software for graphical comparison of multiple results with omics data GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor edgeR: a Bioconductor package for differential expression analysis of digital gene expression data ) limma powers differential expression analyses for RNA-sequencing and microarray studies Welcome to the Tidyverse Fast gene set enrichment analysis 2020) gprofiler2 -an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler Pathview: an R/Bioconductor package for pathway-based data integration and visualization VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R UpSetR: an R package for the visualization of intersecting sets and their properties Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures KEGG: new perspectives on genomes, pathways, diseases and drugs The reactome pathway knowledgebase WikiPathways: connecting communities In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age Golf: an olfactory neuron specific-G protein involved in odorant signal transduction Calcium/calmodulin modulation of olfactory and rod cyclic nucleotide-gated ion channels Smell and taste disorders in COVID-19: from pathogenesis to clinical features and outcomes SARS-CoV-2 infection causes transient olfactory dysfunction in mice 2015) s-Adenosylmethionine levels govern innate immunity through distinct methylation-dependent pathways WormCat: an online tool for annotation and visualization of Caenorhabditis elegans genome-scale data Stress-responsive and metabolic gene regulation are altered in low S-adenosylmethionine A conserved Toll-like receptor is required for Caenorhabditis elegans innate immunity Toll-like receptor signaling promotes development and function of sensory neurons required for a C. elegans pathogen-avoidance behavior A conserved SREBP-1/phosphatidylcholine feedback circuit regulates lipogenesis in metazoans Caenorhabditis elegans methionine/S-adenosylmethionine cycle activity is sensed and adjusted by a nuclear hormone receptor Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Toward a gold standard for benchmarking gene set enrichment analysis Integrative pathway enrichment analysis of multivariate omics data Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution