key: cord-0273248-ys0mybgl
authors: Ahmaderaghi, Baharak; Amirkhah, Raheleh; Jackson, James; Lannagan, Tamsin RM; Gilroy, Kathryn; Malla, Sudhir B; Redmond, Keara L; Maughan, Tim; Leedham, Simon; Campbell, Andrew S; Sansom, Owen J; Lawler, Mark; Dunne, Philip D
title: The Molecular Subtyping Resource (MouSR): a user-friendly tool for rapid biological discovery from human or mouse transcriptional data
date: 2021-08-13
journal: bioRxiv
DOI: 10.1101/2021.08.12.456127
sha: 867229b8c0b1175aac57dd680fb76a4a94e13dfe
doc_id: 273248
cord_uid: ys0mybgl

Generation of transcriptional data has dramatically increased in the last decade, driving the development of analytical algorithms that enable interrogation of the biology underpinning the profiled samples. However, these resources require users to have expertise in data wrangling and analytics, reducing opportunities for biological discovery by “wet-lab” users with a limited programming skillset. Although commercial solutions exist, costs for software access can be prohibitive for academic research groups. To address these challenges, we have developed an open source and user-friendly data analysis platform for on-the-fly bioinformatic interrogation of transcriptional data derived from human or mouse tissue, called “MouSR”. This internet-accessible analytical tool, https://mousr.qub.ac.uk/, enables users to easily interrogate their data using an intuitive “point and click” interface, which includes a suite of molecular characterisation options including QC, differential gene expression, gene set enrichment and microenvironmental cell population analyses from RNA-Seq. Users are provided with adjustable options for analysis parameters to generate results that can be saved as publication-quality images.
To highlight its ability to perform high-quality data analysis, we used the MouSR tool to interrogate our recently published tumour dataset, derived from genetically engineered mouse models and matched organoids, and rapidly reproduced the key transcriptional findings. The MouSR online tool provides a unique, freely available option for users to perform rapid transcriptomic analyses and comprehensive interrogation of the signalling underpinning transcriptional datasets, which alleviates a major bottleneck for biological discovery.

In the years since the first whole genome was sequenced, the costs associated with the generation of molecular "big data" have decreased rapidly, to a point where data handling, rather than data generation, is the limiting factor in large biological discovery programmes. Furthermore, large repositories (such as TCGA (NCI and National Human Genome Research Institute, 2006) and the Gene Expression Omnibus (Edgar et al., 2002)) now provide free access to publicly available molecular data. Large international molecular subtyping projects have markedly improved our biological understanding of cancer (Sohn et al., 2017), but in doing so they have created a critical bottleneck in terms of data reduction, analysis and interpretation, resulting in an urgent need for solutions that enable rapid biological interrogation of large datasets (Cerami et al., 2012).
Given the relative paucity of translational bioinformaticians within many research groups (Gao et al., 2013), there is a need for wet-lab researchers to have access to user-friendly analytic platforms that provide rapid and statistically controlled algorithms to perform common transcriptional analysis tasks, alongside an array of tools for visualising and interrogating the resulting data. For these tools to be widely adopted, they will need to provide the non-computational user with intuitive "point and click" options for transcriptional analyses, rather than programming-based

Xeon Gold 6130 CPU @ 2.10 GHz, 16 cores. The service was given extra security and protection by being placed behind a proxy service, which means the server itself is never directly exposed to

to perform MDS analysis (Young, 2013). Unlike PCA, which reduces dimensionality while preserving the covariance of the data, MDS reduces dimensionality while preserving the distances between data points. However, both methods can provide similar results when MDS is run on Euclidean distances between data points in the high-dimensional space, because those distances then carry the same information as the covariance of the data. MDS uses the similarity matrix as input, which has an advantage over PCA as it can be applied directly to

DESeq2 also estimates the gene-wise dispersion and logarithmic fold changes: a dispersion value is estimated for each gene through a model fit procedure, and differential expression is tested using a negative binomial generalized linear model. We used the DESeq2 package to normalize the data and identify genes that are differentially expressed between the two main groups selected by the user.

MouSR Interface:
The MouSR (https://mousr.qub.ac.uk/) platform is implemented as an open-source application that enables non-computational users to rapidly go from an existing data matrix (multiple formats) to biologically meaningful results in a user-friendly way.
At each step of the process, users have the option to modify outputs through a series of on-the-fly customisable graphics, all of which can be downloaded at high resolution for future use. The standard pipeline includes initial data QC

Accessing the MouSR website presents the user with a general introduction landing page, providing an overview of the application with instructions and exemplar formats that are required for use. The user interface structure has two main sections, namely 1) Data Input and 2) Data analysis, which will be described briefly, followed by an analysis of the CRC mouse exemplar

The Data Input section is designed to be flexible in terms of acceptable file/data formats, so that users can upload their own data derived using a variety of transcriptional profiling platforms and normalisation procedures. Users are required to have two separate files: a transcriptional data matrix (input 1) and a sample information file (input 2), which together enable data analysis and generation of results (Figure 1).

In terms of data types for input 1, MouSR has been successfully tested using human and mouse data derived from a variety of microarray and RNAseq platforms, and is adaptable enough to accept data that has been processed using a range of pipelines resulting in either integers (whole numbers, e.g. RNAseq read counts) or decimals (e.g. estimated read counts). However, as DESeq2 requires non-normalised counts as input, the differential expression options will not be accessible to users who select decimal input. Additionally, the MouSR system has been designed to accept the most common file formats, including csv and txt, and to accept data files with various separators, including commas, semicolons or tabs. The data required for input 2 is a summary of basic information that relates to sample labels and groups.
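The input rules described above (comma, semicolon or tab separators; integer counts versus decimals) can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not MouSR's own code, which is built with R and shiny. The function names, the assumption that the first column holds gene IDs, the assumption that decimals use a point, and the rule of inferring the data type from the values (rather than from the user's selection, as in MouSR) are all illustrative assumptions.

```python
import csv
import io


def parse_expression_matrix(text):
    """Parse a delimited expression matrix into (sample_names, {gene_id: [values]})."""
    # Pick the candidate separator that occurs most often in the header line.
    header_line = text.splitlines()[0]
    delimiter = max([",", ";", "\t"], key=header_line.count)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    samples = rows[0][1:]  # assumption: the first column holds gene IDs
    matrix = {}
    for row in rows[1:]:
        if not row:  # skip blank lines
            continue
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


def is_integer_counts(matrix):
    """True when every value is a non-negative whole number (raw read counts)."""
    return all(v >= 0 and v == int(v) for values in matrix.values() for v in values)


def available_analyses(matrix):
    """Hypothetical gating rule: decimal input hides differential expression."""
    analyses = ["QC", "gene set enrichment", "cell population estimates"]
    if is_integer_counts(matrix):
        analyses.insert(1, "differential expression")
    return analyses


if __name__ == "__main__":
    counts = "gene;S1;S2\nApc;10;0\nKras;3;7\n"
    samples, matrix = parse_expression_matrix(counts)
    print(samples, available_analyses(matrix))
```

Gating on the data type reflects the constraint stated in the text: DESeq2 expects non-normalised counts, so a matrix containing decimals is treated here as unsuitable for the differential expression step while the other analyses remain available.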
Prior to uploading their data, users must ensure both files are in the recommended format described on the introduction page, which is aligned with a standard data matrix output containing gene ID

Using MouSR, we first perform differential gene expression analysis comparing primary tumour data from four mouse genotypes (AP, APN vs KP, KPN), and plot the resulting differential genes using the heatmap tool (Fig 3B).

for data analyses, therefore features in a testing phase will be indicated as such (i.e. beta version).

Data analytical pipelines require specific skill sets, such as data informatics and specific programming, which are not currently in the armamentarium of traditional "wet-lab" scientists.

results. However, based on the version of R and shiny that we are using, we preferred to deploy the "heatmaply" package. It is worth noting that although this single purpose tool can provide

MouSR
Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression
The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data
Shiny: Web Application Framework for R
Adobe Systems Incorporated, Canonical Ltd. (2021). shinythemes
Transpose CSV Tool (convertcsv)
Gene Expression Omnibus: NCBI gene expression and hybridization array data repository
Personalized Medicine: Recent Progress in Cancer Therapy. Cancers (Basel)
Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling
GSVA: gene set variation analysis for microarray and RNA-Seq data
Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society
Estimated impact of the COVID-19 pandemic on cancer services and excess 1-year mortality in people with cancer and multimorbidity: near real-time data on cancer care, cancer deaths and a population-based cohort study
The challenges of big data biology
DEApp: an interactive web interface for differential expression analysis of next generation sequence data
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
VolcaNoseR - a web app for creating, exploring, labeling and sharing volcano plots
PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes
The Cancer Genome Atlas Program.
The START App: a web-based RNAseq analysis and visualization resource
The murine Microenvironment Cell Population counter method to estimate abundance of tissue-infiltrating immune and stromal cell populations in murine samples using gene expression
Epithelial NOTCH Signaling Rewires the Tumor Microenvironment of Colorectal Cancer to Drive Poor-Prognosis Subtypes and Metastasis
ArrayExpress-E-MTAB-6363 - RNA-seq of intestinal cancer GEMMs
GENAVi: a shiny web application for gene expression normalization, analysis and visualization
An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation
plotly for R
Clinical Significance of Four Molecular Subtypes of Gastric Cancer Identified by The Cancer
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
TCC-GUI: a Shiny-based application for differential expression analysis of RNA-Seq count data
2020) plyr
Multidimensional Scaling: History, theory, and applications
Make Interactive Complex Heatmaps in R