key: cord-0045878-pll6ijfj authors: Settino, Marzia; Arbitrio, Mariamena; Scionti, Francesca; Caracciolo, Daniele; Di Martino, Maria Teresa; Tagliaferri, Pierosandro; Tassone, Pierfrancesco; Cannataro, Mario title: MMRF-CoMMpass Data Integration and Analysis for Identifying Prognostic Markers date: 2020-05-22 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50420-5_42 sha: 8f3b649330b8e9ca2529e6763e61831ab807535e doc_id: 45878 cord_uid: pll6ijfj
Multiple Myeloma (MM) is the second most frequent haematological malignancy in the world, although its pathogenesis remains unclear. The study of how gene expression profiling (GEP) correlates with patients' survival could be important for understanding the initiation and progression of MM. To aid researchers in identifying new prognostic RNA biomarkers as targets for functional cell-based studies, appropriate bioinformatic tools for integrative analysis are required. The main contribution of this paper is the development of a set of functionalities, extending the TCGAbiolinks package, for downloading and analysing Multiple Myeloma Research Foundation (MMRF) CoMMpass study data available at the NCI's Genomic Data Commons (GDC) Data Portal. In this context, we further present a workflow based on these new functionalities that allows users to i) download data; ii) compute and plot the Array Array Intensity correlation matrix; iii) correlate gene expression with survival to obtain a Kaplan–Meier survival plot.
Multiple myeloma (MM) is a cancer of plasma cells and it is the second most common blood cancer. Myeloma is a heterogeneous disease with great genetic and epigenetic complexity. Therefore, the identification of patient subgroups defined by molecular profiling and clinical features remains a critical need for a better understanding of disease mechanisms, drug response and patient relapse.
In this context, the Multiple Myeloma Research Foundation (MMRF) CoMMpass Study represents the largest genomic data set and the most widely published study in multiple myeloma. Transcriptomic studies have contributed substantially to revealing multiple myeloma features, distinguishing multiple myeloma subgroups with different clinical and biological patterns. Based on the hypothesis that myeloma invasion would induce changes in gene expression profiles, gene expression profile (GEP) studies constitute a reliable prognostic tool [3, 11]. Various studies have identified gene expression signatures capable of predicting event-free survival and overall survival (OS) in multiple myeloma [1, 6]. To aid researchers in identifying new prognostic RNA biomarkers as well as targets for functional cell-based studies, appropriate bioinformatic tools for integrative analysis can offer new opportunities. Among these tools, a promising approach is the use of the TCGAbiolinks package [2, 9, 10]. The main contribution of this work is to provide researchers with a new set of functions, extending the TCGAbiolinks package, that allow the MMRF-CoMMpass database to be investigated. Moreover, a simple workflow for searching, downloading and analyzing the RNA-Seq gene-level expression dataset from the MMRF-CoMMpass Study is described. The same workflow could in general be extended to other MMRF-CoMMpass datasets.
Gene expression data from multiple myeloma patients can be retrieved from MMRF-CoMMpass and from the Gene Expression Omnibus (GEO). GEO is an international public repository that archives and freely distributes high-throughput gene expression and other functional genomics datasets. The National Cancer Institute (NCI) Genomic Data Commons (GDC) [5] provides the cancer research community with a rich resource for sharing and accessing data across numerous cancer studies and projects, with the aim of promoting precision medicine in oncology.
The NCI Genomic Data Commons data are made available through the GDC Data Portal, a platform for efficiently querying and downloading high-quality and complete data. The GDC platform includes data from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and further studies. Recently, many studies have contributed additional datasets to the GDC platform, including the MMRF CoMMpass Study among others [7]. One of the major goals of the GDC is to provide a centralized repository for accessing data from large-scale NCI programs; however, it does not provide a comprehensive toolkit for data analysis and interpretation. To fulfil this need, the R/Bioconductor package TCGAbiolinks was developed to allow users to query, download and perform integrative analyses of GDC data [2, 9, 10]. TCGAbiolinks combines methods from computer science and statistics, and it includes methods for visualizing results so that a complete analysis can be performed easily.
The Cancer Genome Atlas (TCGA): TCGA contains data on 33 different cancer types from 11,328 patients and it is the world's largest and richest collection of genomic data. TCGA contains molecular data from multiple types of analysis such as DNA sequencing, RNA sequencing, copy number, array-based expression and others. In addition to molecular data, TCGA has well-catalogued metadata for each sample, such as clinical and sample information.
The National Cancer Institute (NCI) Genomic Data Commons (GDC) is a publicly available database that promotes the sharing of genomic and clinical data among researchers and facilitates precision medicine in oncology. At a high level, data in GDC are organized by project (e.g. TCGA, TARGET, MMRF-CoMMpass). Each of these projects contains a variety of molecular data types, including genomics, epigenomics, proteomics, imaging, clinical and others.
The MMRF-CoMMpass Study is a collaborative research effort with the goal of mapping the genomic profile of patients with newly diagnosed active multiple myeloma to clinical outcomes, in order to develop a more complete understanding of patient responses to treatments. The MMRF-CoMMpass Study identified many genomic alterations that had not previously been found in multiple myeloma and provided a prognostic stratification of patients, leading to advances in cancer care [8]. Recently the MMRF announced new discoveries in defining myeloma subtypes, identifying novel therapeutic targets for drug discovery and more accurately predicting high-risk disease. In 2016 the NCI announced a collaboration with the MMRF to incorporate genomic and clinical data about myeloma into the NCI Genomic Data Commons (GDC) platform.
TCGAbiolinks is an R/Bioconductor package that combines methods from computer science and statistics to address challenges in data mining and analysis of cancer genomics data stored at the GDC Data Portal. More specifically, a guided workflow [10] allows users to query, download, and perform integrative analyses of GDC data. The package provides several methods for analysis (e.g. differential expression analysis, differentially methylated regions) and methods for visualization (e.g. survival plots, volcano plots and starburst plots). TCGAbiolinks was initially conceived to interact with TCGA data through the GDC Data Portal, but it can in principle be extended to other GDC datasets, provided that their differences in formats and data availability are properly handled [9]. The GDC Application Programming Interface (API) provides developers with programmatic access to GDC functionality. TCGAbiolinks consists of several functions, but in this work we describe only the main functions used in the workflow described in Sect. 3.
More specifically:
- GDCquery uses the GDC API to search GDC data;
- GDCprepare reads the downloaded data and prepares them as an R object;
- GDCquery_clinic downloads all clinical information related to a specified project in GDC;
- TCGAanalyze_Preprocessing performs an Array Array Intensity correlation (AAIC). It defines a square symmetric matrix of Spearman correlations among samples;
- TCGAanalyze_SurvivalKM performs a univariate Kaplan-Meier (KM) survival analysis (SA) using complete follow-up, taking one gene at a time from a gene list.
The SummarizedExperiment [4] object is the default data structure used in TCGAbiolinks for combining genomic data and clinical information. A SummarizedExperiment object contains sample information, molecular data and genomic ranges (i.e. gene information). MMRF-CoMMpass presents some differences in formats and data with respect to the TCGA dataset. For example, the sample ID format in MMRF-CoMMpass is "study-patient-visit-source" (e.g. "MMRF-1234-1-BM" means patient 1234, first visit, from bone marrow). Moreover, some fields in the MMRF-CoMMpass SummarizedExperiment are missing or are named differently from those in the TCGA dataset format. To fill this gap and to make the MMRF-CoMMpass dataset suitable for the functions above, we introduced the following customized functions:
- MMRF_prepare adds the sample type information to the SummarizedExperiment object returned by GDCprepare;
- MMRF_prepare_clinical renames the data frame field "submitter_id" of the clinical information returned by GDCquery_clinic to the field name found in the TCGA dataset (i.e. bcr_patient_barcode);
- MMRF_prepare_SurvivalKM converts the MMRF-CoMMpass sample ID format in the gene expression matrix (dataGE) returned by GDCprepare into a form suitable for the TCGAanalyze_SurvivalKM function.
The following workflow describes the steps for downloading, processing and analyzing MMRF-CoMMpass RNA-Seq gene expression data using TCGAbiolinks together with the new functions listed above.
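The two format differences described above (the "study-patient-visit-source" sample ID and the differently named clinical identifier field) can be illustrated with a small base R sketch. This is not the authors' implementation of the MMRF functions: the helper names `parse_mmrf_sample_id` and `rename_clinical_id` are hypothetical, the toy clinical values are invented, and accepting both "-" and "_" as separators is an assumption made because both spellings appear in this text.

```r
# Hypothetical helpers for illustration only; they are not part of
# TCGAbiolinks and are not the authors' MMRF functions.

# Split "study-patient-visit-source" sample IDs into their components.
# Assumption: fields may be separated by "-" or by "_".
parse_mmrf_sample_id <- function(ids) {
  parts <- strsplit(ids, "[-_]")
  ok <- lengths(parts) == 4
  if (!all(ok)) {
    stop("Unexpected sample ID: ", paste(ids[!ok], collapse = ", "))
  }
  m <- do.call(rbind, parts)  # one row per ID, four columns
  data.frame(
    sample_id  = ids,
    study      = m[, 1],
    patient    = m[, 2],
    visit      = as.integer(m[, 3]),
    source     = m[, 4],
    patient_id = paste(m[, 1], m[, 2], sep = "_"),
    stringsAsFactors = FALSE
  )
}

# Rename the clinical identifier column to the TCGA-style name.
rename_clinical_id <- function(clinical) {
  names(clinical)[names(clinical) == "submitter_id"] <- "bcr_patient_barcode"
  clinical
}

# Toy input (values invented for the example).
parsed <- parse_mmrf_sample_id(c("MMRF-1234-1-BM", "MMRF_2473_1_BM"))
clin <- rename_clinical_id(data.frame(
  submitter_id = c("MMRF_1234", "MMRF_2473"),
  days_to_last_follow_up = c(100, 250)
))
print(parsed)
print(names(clin))
```

Deriving a patient-level key (`patient_id` here) from the sample ID is what allows expression columns to be matched to rows of the clinical table, which is the kind of alignment the survival step depends on.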
GDCquery uses the GDC API to search the data for a given project and data category, as well as other filters. The valid data categories for the MMRF-CoMMpass project can be found using the getProjectSummary function. The results are shown in Table 1. The following listing illustrates the use of GDCquery for searching the gene expression level dataset (HTSeq - FPKM) using the "Transcriptome Profiling" category from the list obtained with getProjectSummary. For simplicity, only a subset filtered by barcode is downloaded.
    # Search MMRF-CoMMpass gene expression data, restricted to 12 barcodes
    query.mm.fpkm <- GDCquery(
      project = "MMRF-COMMPASS",
      data.category = "Transcriptome Profiling",
      data.type = "Gene Expression Quantification",
      workflow.type = "HTSeq - FPKM",
      barcode = c("MMRF_2473", "MMRF_2111", "MMRF_2270",
                  "MMRF_2238", "MMRF_1080", "MMRF_2253",
                  "MMRF_2119", "MMRF_2468", "MMRF_1201",
                  "MMRF_2821", "MMRF_1957", "MMRF_1678")
    )
Listing 1.1. GDCquery function for searching gene expression data in MMRF-CoMMpass. The dataset is filtered by barcode.
The GDCdownload function downloads the data and saves them in a local folder, from which the GDCprepare function reads them and transforms them into a SummarizedExperiment. The clinical data (e.g. tumor stage, days to last follow-up, treatments) can be obtained using the GDCquery_clinic function, specifying "MMRF-COMMPASS" as the input project. At this point, the MMRF_prepare and MMRF_prepare_clinical functions make the output of the previous functions suitable for the TCGAbiolinks functions.
Analyse MMRF-COMMPASS Data: Once the data have been downloaded and prepared, outliers can be detected with the function TCGAanalyze_Preprocessing, which performs an Array Array Intensity correlation (AAIC). The plot in Fig. 1 shows an example of a heat map of the AAIC for MMRF-CoMMpass gene expression data. We used MMRF_prepare_SurvivalKM to prepare dataGE from GDCprepare for the TCGAanalyze_SurvivalKM function.
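The idea behind the AAIC step, a square symmetric matrix of Spearman correlations among samples used to spot outliers, can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the implementation inside TCGAanalyze_Preprocessing; the matrix `expr`, the sample names and the 0.6 cut-off on mean correlation are assumptions chosen for the example.

```r
# Minimal base R sketch of a sample-by-sample Spearman correlation matrix.
# All data are synthetic; the 0.6 cut-off is an arbitrary illustrative choice.
set.seed(1)
n_genes <- 200
base_profile <- rexp(n_genes, rate = 0.1)  # shared expression profile

# Six samples that follow the shared profile with multiplicative noise
# (rows = genes, columns = samples).
expr <- sapply(1:6, function(i) base_profile * exp(rnorm(n_genes, sd = 0.2)))
colnames(expr) <- paste0("sample_", 1:6)

# Scramble the genes of one sample so that it behaves like an outlier.
expr[, 6] <- sample(expr[, 6])

# Square symmetric matrix of Spearman correlations among samples.
aaic <- cor(expr, method = "spearman")

# Mean correlation of each sample with the others (self-correlation removed).
mean_cor <- (rowSums(aaic) - 1) / (ncol(aaic) - 1)
outliers <- names(mean_cor)[mean_cor < 0.6]

print(round(aaic, 2))
print(outliers)
# A heat map of the matrix could be drawn with: heatmap(aaic, symm = TRUE)
```

Spearman correlation works on ranks, so it is insensitive to monotone transformations such as taking the logarithm of FPKM values, which makes it a convenient choice for comparing samples before any normalization decision.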
Finally, we performed a Kaplan-Meier univariate survival analysis (KM-SA) using the TCGAanalyze_SurvivalKM function. The resulting plot makes it possible to relate gene expression to survival visually. Two thresholds are defined for each gene according to its mean expression level in cancer samples. In this example we used these expression-intensity thresholds to divide the samples into two groups (High, Low). Fig. 2 shows the correlation between survival and the most highly/lowly expressed gene.
The MMRF-CoMMpass has proven itself to be a leader in scientific innovation as well as in data sharing by deciding to incorporate its data into the GDC platform. The use of appropriate bioinformatic tools for integrative analysis of MMRF-CoMMpass data can offer great opportunities. To take advantage of them, the TCGAbiolinks package is a useful tool for data integration and analysis of cancer data. For example, TCGAbiolinks offers the possibility of integrating gene expression data from external sources (e.g. GEO), obtaining a merged result that can be used for further analysis such as differential expression analysis. The main contribution of this paper is the extension of the TCGAbiolinks package with new functions to handle MMRF-CoMMpass data available at the NCI's Genomic Data Commons (GDC) Data Portal. This will allow MM researchers to better exploit MMRF-CoMMpass data. As future work, we plan to make these new functions available as a package through a public repository and to extend them to allow further analysis of MMRF-CoMMpass data.
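Returning to the survival step of the workflow, the High/Low split and the Kaplan-Meier estimate can be sketched in base R as follows. This is a hand-written product-limit estimator on invented data, not the code inside TCGAanalyze_SurvivalKM; the function name `km_estimate`, the 0.33 and 0.67 quantile thresholds, and the expression, time and event values are all assumptions made for the example.

```r
# Minimal base R sketch of a Kaplan-Meier (product-limit) estimate for
# High vs Low expression groups. All values are invented; the quantile
# thresholds (0.33, 0.67) are illustrative assumptions.

km_estimate <- function(time, event) {
  # Distinct times at which at least one event (death) was observed.
  event_times <- sort(unique(time[event == 1]))
  at_risk <- vapply(event_times, function(t) sum(time >= t), numeric(1))
  deaths  <- vapply(event_times, function(t) sum(time == t & event == 1),
                    numeric(1))
  data.frame(time = event_times, at_risk = at_risk, deaths = deaths,
             survival = cumprod(1 - deaths / at_risk))
}

# Toy data for nine patients: expression of one gene, follow-up time,
# and event indicator (1 = death observed, 0 = censored).
expression <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
time       <- c(10, 12, 15, 7, 9, 11, 2, 4, 6)
event      <- c(1, 0, 1, 1, 0, 1, 1, 1, 0)

# Two thresholds split the samples; those in between are left out.
low_cut  <- quantile(expression, 0.33)
high_cut <- quantile(expression, 0.67)
is_low   <- expression <= low_cut
is_high  <- expression >= high_cut

km_low  <- km_estimate(time[is_low],  event[is_low])
km_high <- km_estimate(time[is_high], event[is_high])
print(km_low)
print(km_high)
```

Using two thresholds rather than a single median split leaves out samples with intermediate expression, which tends to sharpen the contrast between the two curves at the cost of a smaller sample size in each group.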
[1] Gene signature combinations improve prognostic stratification of multiple myeloma patients
[2] TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data
[3] Transcriptomic profiling of the myeloma bone-lining niche reveals BMP signalling inhibition to improve bone disease
[4] Orchestrating high-throughput genomic analysis with Bioconductor
[5] The NCI genomic data commons as an engine for precision medicine
[6] A gene expression signature for high-risk multiple myeloma
[7] Data harmonization for a molecularly driven health system
[8] A network analysis of multiple myeloma related gene signatures
[9] New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEX
[10] TCGA Workflow: analyze cancer genomics and epigenomics data using bioconductor packages
[11] Gene expression profiles in myeloma: ready for the real world?