A validated generally applicable approach using the systematic assessment of disease modules by GWAS reveals a multi-omic module strongly associated with risk factors in multiple sclerosis 1 A validated generally applicable approach using the systematic assessment of disease modules by GWAS reveals a multi-omic module strongly associated with risk factors in multiple sclerosis Tejaswi V.S. Badam1,2†, Hendrik A. de Weerd1,2†, David Martínez-Enguita2, Tomas Olsson3, Lars Alfredsson3,4,Ingrid Kockum3,Maja Jagodic3, Zelmina Lubovac-Pilav1*, Mika Gustafsson2* 1School of Bioscience, Systems Biology Research Center, University of Skövde, Sweden 2Bioinformatics, Department of Physics, Chemistry and Biology, Linköping university, Linköping, Sweden 3Department of Clinical Neuroscience, Karolinska Institutet, Center for Molecular Medicine, Karolinska University Hospital, SE-171 76, Stockholm, Sweden 4Institute of Environmental Medicine, Karolinska Institutet, Center for Molecular Medicine, Karolinska University Hospital, SE-171 76, Stockholm, Sweden †These authors contributed equally to the work. *These authors share senior authorship. Corresponding author: Mika Gustafsson (mika.gustafsson@liu.se) Running Title : Multi-omic modules in multiple sclerosis Keywords : Benchmark , Multi-omics , Network modules ,Multiple Sclerosis, Risk factors SUMMARY : Our benchmark of multi-omic modules and validated translational systems medicine workflow for dissecting complex diseases resulted in multi-omic module of 220 genes highly enriched for risk factors associated with multiple sclerosis. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 2 ABSTRACT Background: There are few (if any) practical guidelines for predictive and falsifiable multi-omics data integration that systematically integrate existing knowledge. Disease modules are popular concepts for interpreting genome-wide studies in medicine but have so far not been systematically evaluated and may lead to corroborating multi-omic modules. Methods: We assessed eight module identification methods in 57 previously published expression and methylation studies of 19 diseases using GWAS enrichment analysis. Next, we applied the same strategy for multi-omics integration of 19 datasets of multiple sclerosis (MS), and further validated the resulting module using both GWAS and risk-factor associated genes from several independent cohorts. Results: Our benchmark of modules showed that in immune-associated diseases modules inferred from clique-based methods were the most enriched for GWAS-genes. The multi-omics case study using MS revealed the robust identification of a module of 220 genes. Strikingly, most genes of the module was differentially methylated upon the action of one or several environmental risk factors in MS (n = 217, P = 10-47) and were also independently validated for association with five different risk factors of MS, which further stressed the high genetic and epigenetic relevance of the module for MS. Conclusion: We believe our analysis provides a workflow for selecting modules and our benchmark study may help further improvement of disease module methods. Moreover, we also stress that our methodology is generally applicable for combining and assessing the performance of multi-omics approaches for complex diseases. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 3 INTRODUCTION Complex diseases are the result of disruptions of many interconnected multimolecular pathways, reflected in multiple omics layers of regulation of cellular function, rather than perturbations of a single gene or protein[1]. Systems and network medicine aim to translate observed omics differences in patients using networks, in order to personalize medicine[2]. Importantly, genes that are associated with diseases are more likely to interact with each other rather than with non-disease associated genes, forming multi-omics network disease modules[3,4]. Owing to the incompleteness of the underlying multi-omics interactions, the networks are often modeled as effective gene-gene interactions, using for example STRING database[5]. Thus, network modules might be ideal tools for multi-omics analysis. However, the evaluation of performance of different module inference methods remains a poorly understood topic, which creates the need for transparent evaluation of these methods based on objective benchmarks across various diseases and omics. Genomic concordance has been suggested as a multi-omics validation principle[4,6], i.e., modules derived from one omic, such as gene expression or DNA methylation should be enriched for disease- associated single nucleotide polymorphisms (SNPs). The variety of algorithms that have been proposed and applied for identification of disease modules can be categorized into two main groups. On the one hand, there are methods which rely purely on clustering of the genes in relevant disease networks[7]. On the other hand, there are algorithms which make use of disease-associated molecules or genetic loci to reveal disease modules that correlate with disease function, such as the disease module detection (DIAMOnD) algorithm[8], clique-based methods[9],[10] and weighted gene co-expression network analysis (WGCNA)[11]. The data-derived information can either be differentially expressed genes or differentially correlated or co-expressed genes. Methods following the former approach were recently benchmarked by a metric utilizing genomic concordance within the DREAM consortia[12]. However, so far, algorithms from the latter group have not been benchmarked. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 4 In this study we analyzed, assessed, and compared the performance of eight of the most popular methods for disease module analysis using the R package MODifieR[13] on 19 different diseases including 47 expression and ten methylation datasets. We assessed the performance of the methods using genome-wide association (GWAS) enrichment analysis from the summary statistics of all assayed SNPs similarly as in DREAM[12]. The resulting workflow provided a systematic procedure for selecting the best method for each disease and set the stage for method development in the disease module area. Moreover, it allowed the predictive assessment of combining multiple datasets across several omics using GWAS, which we tested in multiple sclerosis (MS), a heterogeneous complex disease. Briefly, we derived multi-omic modules in a stepwise optimization of GWAS enrichment from transcriptomic and methylomic analyses of MS. We further evaluated the identified multi-omic MS module of 220 genes for its enrichment across DNA methylation studies of eight known lifestyle-associated risk factors of MS. Additionally, we validated the identified significant enrichment risk factors in an independent DNA methylation MS study which indeed showed a very strong and significant MS enrichment for both module genes and risk factor associations. In summary, we provide a robust multi-omics strategy that can be used to disentangle networks of affected genes in complex diseases from both genetic and environmental levels. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 5 MATERIALS AND METHODS Benchmark data A total of 47 publicly available datasets for the transcriptomic benchmark and ten publicly available datasets for the methylomic benchmark were used. To avoid bias due to subtypes of diseases and drug treatments, we searched for datasets that have only patient and control samples, and that are available for download from the GEO database. We categorized the datasets into seven distinct disease types based on the disease-trait type associations used in Choobdar et al[12]., i.e. autoimmune, cardiovascular, glycemic, inflammatory, neurodegenerative, and psychiatric and social disorders. A total of 19 complex diseases were used in the transcriptomic benchmark analysis, while six complex diseases were used in the methylation benchmark analysis. The methylation benchmark diseases belong to inflammatory, autoimmune, and glycemic disease types. MS use case data A total of 14 publicly available and one non-publicly available transcriptomic and methylomic MS- related datasets were used in the MS multi-omics integration use case. In general, every dataset in the MODifieR benchmark was also used in the MS use case, with exceptions according to certain criteria. The inclusion of transcriptomic MS datasets followed the criteria: 1) The largest dataset by sample number, per tissue, is shown in the MODifieR benchmark; 2) Replication cohorts are not included in the MS use case. Criteria for inclusion of methylomic MS datasets were the following: 1) The largest dataset by sample number, per tissue or cell type, is included in the MODifieR benchmark; 2) A single dataset for every cell-specific tissue was included in the benchmark; 3) Methylation studies that reported using whole blood as sample tissue were excluded from the MS use case, due to the high heterogeneity of this type of data. For the additional independent validation, we utilized the methylation microarray analysis of 279 blood samples analyzing from Kular et al 25 . For each of these MS patients (nMS= 139) and healthy controls (nHC= 140), we also collected their lifestyle-associated risk factors from questionnaires that (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 6 were part of the Epidemiological Investigation of Multiple Sclerosis (EIMS) study. Those factors were smoking status, prior EBV infection, sunbathing, nightshift work, alcohol consumption, as well as phenotypic features (age, sex, BMI at age of 20). Pre-processing and quality control of risk factor methylation data DNA methylation datasets were downloaded from GEO as raw IDAT files, when available, or matrices of beta values. Pre-processing of the data was performed using the Chip Analysis Methylation Pipeline (ChAMP) R package[14] , version 2.16.2. Default parameters were used for probe and sample filtering. Probes with a detection P-value above 0.01, probes with a fraction of failed (bead count less than 3) samples over 0.05, non-CpG probes, SNP-related probes, multi-hit probes, and probes located on chromosomes X and Y, were removed. Samples with a proportion of failed (NA) probe P-values over 0.1 were also removed from the analysis. Post-filtering imputation of NA values was conducted on the beta matrices, with default parameters (“combine” method, k = 5, probe cutoff = 0.2, sample cutoff = 0.1). Filtered imputed matrices were normalized applying the Beta- Mixture Quantile dilation (BMIQ) normalization method[15]�, including correction of Type-I and Type-II probe effects. Data quality was assessed by producing multi-dimensional scaling (MDS) plots of the top 1,000 most variable positions per sample, density plots for the distribution of beta values, and hierarchical clustering of samples, before and after normalization. Singular value decomposition (SVD) was used to detect the most significant components of variation in the data. Unwanted sources of variation in the normalized data were corrected using ComBat batch effect correction[16]. Module Identification The MODifieR13 R package offers nine different methods for producing disease modules for which we included all but Clique SuM exact as it is highly similar to Clique SuM. The included methods will produce modules based on the provided omics input and background network and do not include prioritization of pathway association. MODifieR methods used for module identification through this study are listed in the Supplementary Table 3. For the methods that require a network, we used the (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 7 human PPI network from STRING5 database version 11, consisting of 11,295,036 interactions among 18,746 unique genes/proteins. We filtered the network to have high confidence interactions by using the cutoff > 900 to reduce the number of false positives, resulting in a subset of 631,782 interactions between 12,123 unique genes/proteins. For co-expression methods, the network is computed within the method algorithm from the gene expression matrix. In case of the benchmark analysis, we used a stringent cutoff of score > 900, so that the runs were not computationally intensive. For the MS use case benchmark, we used the network combined score cutoff > 700. The processed matrix for each dataset and their respective phenotypic information were downloaded from GEO. The input object is prepared using the create_input_microarray function from the MODifieR package which is then used for creating the modules. The input function applies linear model using limma for comparison of patient's vs controls to get the differentially methylated or expressed genes. A dynamic cutoff of 5% in the differentially methylated or expressed genes is applied for input seed genes for the methods that require seed genes. Differential methylation analysis of risk factor data Differentially methylated probes (DMPs) were found by fitting a linear model to the data using the limma R package[17]�, version 3.42.2 implemented in the ChAMP function champ.DMP. P-values were adjusted for multiple testing using Benjamini-Hochberg False Discovery Rate (FDR) correction. Differentially methylated genes (DMGs) were obtained and annotated using the org.Hs.eg.db R package�, version 3.10.0. DMG lists were cross-checked against the STRING database version 11 PPI network used for module identification in the MS multi-omics approach (high confidence interactions, combined score > 700). DMGs that were not present in the PPI network were removed. In case of the additional MS validation dataset, a linear mixed effect model with risk factors (age, sex, BMI at age of 20, smoking, alcohol consumption, sun exposure, night shift work, contact with organic solvents) as categorical covariates was implemented to find the differentially methylated genes after the preprocessing step, as described in the preprocessing section of the methods. Since all the patients were EBV positive, we did not include it for linear mixed effect model. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 8 Validation of modules The final modules produced from each single algorithm and the consensus were evaluated using Pascal[18] (Pathway scoring algorithm). Pascal implements a fast and rigorous gene scoring and pathway enrichment pipeline that can be run on a local machine. The SNP values are converted to gene scores by computing pairwise SNP-by-SNP correlations and obtaining Z-scores from their distribution. These obtained gene scores are fused with the pathway enrichment analysis to recompute a chi-square P-value for the given set of module genes. Thus, the obtained chi-square P- value serves as the significance of the module in its enrichment of the disease-associated pathway gene loci. A combined P-value was computed for each of the methods using Fisher’s method[19], diseases, and datasets for ranking the performance of the modules in each criterion. Integration of MS single-omic modules Clique SuM was ranked as the best performing method on average for both transcriptomic and methylomic data, according to the MS GWAS enrichment of the modules calculated by Pascal. Therefore, significant Clique SuM modules (P < 0.05) were selected for further analysis (nine transcriptomic and four methylomic modules). Consensus modules were generated across each omic by applying a module count-based method, where the criteria for gene inclusion in the consensus is its presence in a certain number of single-method modules. To balance the weight of each omic in the multi-omics integration, the top four significant modules per omic were used to create each consensus (Fig. 4a, b). Single-omic Clique SuM consensus were ranked again by GWAS enrichment, and the best performing consensus per omic was selected for integration into the multi-omics module. Enrichment analyses of the MS multi-omics module Disease enrichment analysis of the multi-omics module was performed by Fisher’s exact test, with a significance threshold of P < 0.05. MS-associated genes were obtained from the gene-disease association summary provided by DisGeNET database 6.0[20]�. All genes with a known association (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 9 to the disease “multiple sclerosis” (Unified Medical Language System unique identifier C0026769) were considered MS-associated genes (n = 1,105). Pathway enrichment analysis was carried out using the function enrichKEGG from the clusterProfiler R package[21]�, version 3.14.3. P-values were adjusted for multiple testing using Benjamini-Hochberg FDR correction, with a significance threshold of adj. P < 0.05. Enrichment of the multi-omics module in MS risk-factor-associated genes was performed by Fisher’s exact test, with a significance threshold of P < 0.05. To provide a uniform comparison of MS risk factor-associated genes across datasets, the module was tested for enrichment in the top 1,000 DMGs (with at least P < 0.05) obtained from the differential methylation analysis with ChAMP for each risk factor dataset. Representation of the MS multi-omics module Experimentally validated interactions for the multi-omics module genes were obtained from STRING database version 11 (experimental score > 700) and imported into Cytoscape[22] version 3.7.2. To determine representative functional clusters of module genes, overrepresented Gene Ontology (GO) Biological Process (BP) terms in the module were found using BiNGO[23] version 3.0.4, with Benjamini-Hochberg FDR for multiple testing correction, and a significance threshold of adj. P < 0.05. Then, enriched GO terms with adj. P < 1x10-10 were summarized using REVIGO[24] server tool (medium allowed similarity = 0.7) and categories of interest were selected by uniqueness (>= 80 %), dispensability (>= 50 %), and frequency (<= 10 %) criteria. Further manual assessment was performed to group similar terms with an adequate number of genes in the network. RESULTS A benchmark comparing 337 transcriptionally derived disease modules from 19 different diseases. We compiled a benchmark source of disease modules and summary statistics of GWAS datasets from 19 well-powered case-control studies (Supplementary Table 1), some of which were previously used in the DREAM topological disease module challenge[12]. For these datasets we assessed modules using the same metric as in the recent DREAM study[12], based on the pathway scoring (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 10 algorithm (Pascal)[18]. For each disease we compiled one to five publicly available transcriptomic datasets considering both easily assessable tissues (e.g. blood) and target tissues, thereby covering 47 transcriptomic datasets in total (Fig. 1a). Modules were created using eight different methods from MODifieR[13]. In addition, we also tested if genes detected by several methods, hereafter called consensus module genes, had higher enrichment scores than single-method module genes. Enrichment scores for the non-empty modules (n = 337) from this analysis were summarized for each method and dataset (Fig. 2a). In total, we found significantly GWAS-enriched modules in 17.8% (60/337) of the single-method modules and 25.5% (12/47) of the non-empty consensus modules that combined at least three methods as a criterion. These numbers seemed higher than expected, which might have been a consequence of the same GWAS being used to evaluate multiple transcriptomic datasets of the same disease. Hence, we aggregated scores of the same disease and method as meta P-values (see Methods). Out of the 152 possible disease-method combinations, 18% of the pairs showed a significant GWAS Pascal enrichment, which is more than expected by chance (n = 27, P = 1.0 x 10-8). The most enriched method was Clique SuM, which showed significant enrichment in seven out of 19 diseases (binomial test P = 2.3 x 10-5). Many methods exhibited strong enrichments in coronary artery disease (CAD), type 2 diabetes, multiple sclerosis (MS), rheumatoid arthritis (RA), and the inflammatory bowel diseases(IBD), ulcerative colitis (UC) and Crohn’s disease (CD), while no significant enrichments were found for asthma, hepatitis C, type 1 diabetes, narcolepsy, Parkinson’s disease, or for any psychiatric and social diseases. If we instead ranked methods based on their respective module GWAS enrichment, Clique SuM showed significant association in 34% (16/47) of the modules corresponding to seven different diseases followed by consensus modules identified by two out of three methods. Lastly, DIAMOnD and co- expression-based methods all achieved significant results, although worse than Clique SuM. Next, we tested the impact of network centrality and module size as potential confounding factors of the applied performance metric. We found a significant but very modest correlation for module size (Fig. 2c, Spearman rho = 0.165, P = 2.3 x 10-3), and a non-significant correlation for interactome (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 11 centrality (Fig. 2b, rho = 0.068, P = 0.21). Thus, it is meaningful to compare results with differences in those module properties. In summary, we found that the Clique SuM method resulted in the highest disease enrichment for most diseases, while not producing significant modules for others, such as type 2 diabetes, where co-expression-based methods and DIAMOnD scored best. In general, we observed stronger enrichments for inflammatory diseases and weaker results for psychiatric and social diseases. Considering that the transcriptomic modules showed that Clique SuM was the best performing method and that the cardiovascular and inflammatory diseases were the most enriched within the Clique SuM modules, we wanted to test whether this was true for methylomic data as well. A benchmark comparing 72 methylation-based disease modules from six different diseases using GWAS. Following the same logic of the transcriptomic benchmark, we performed a similar benchmark study for methylation modules. We collected ten datasets from three different disease categories, including six complex diseases, and ran the eight MODifieR methods on them (Fig. 1a). In addition, we constructed consensus modules for each of the datasets. Modules were then tested for GWAS enrichment using Pascal. Inspecting the overall performance, we found nine single-method modules with a significant GWAS enrichment (9/72, 11.8%). Though this might be due to disease and cell type heterogeneity, the enrichment is more than expected by chance (P=9.6x 10-3). Interestingly, the inflammatory diseases such as MS and UC showed a more significant GWAS enrichment Considering that the evaluation of module performance by GWAS enrichment may be biased due to differences in module sizes and interactome centrality, we again assessed the correlation between these values. We found a significant correlation between GWAS enrichment and module size (Fig. 3c, rho = 0.235, P = 0.046) and a non-significant correlation between GWAS enrichment and interactome centrality (Fig. 3b, rho = 0.190, P = 0.109). We found that 12.5% of the disease-method combinations yielded significant GWAS enrichment, which is more than expected from an independent random selection of modules (Fisher’s exact test P = 0.031, n = 6). The highly enriched disease modules belong to MS, UC and CD. Two out of the six diseases showed significant GWAS (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 12 enrichment by using the Clique SuM modules (P = 0.032). In summary, Clique SuM method resulted in a more significant GWAS enrichment for most diseases also for the methylomic benchmark. Multi-omics approach revealed a module enriched for MS-associated genes. Considering genomic concordance as the guidance principle for the modules that show enrichment for GWAS SNPs, differentially methylated genes and differentially expressed genes, we further wanted to evaluate multiple datasets of one specific disease, i.e., MS. We compiled 11 MS transcriptomic datasets and nine methylation (Supplementary Table 2) comparisons from GEO which satisfy the pre-defined dataset criteria (see Methods). For each dataset we implemented the pipeline for module identification and scoring shown in Fig. 1b. We evaluated each module using MS SNP enrichment analysis and selected the most enriched modules per omic from this metric. This analysis again showed that Clique SuM yielded the far highest average enrichment score (meta P = 3.2 x 10-12) and was significantly enriched (P < 0.05) in 9/11 transcriptomic datasets (Fig. 4a) and 4/9 of the methylation datasets (Fig. 4b). From the significant modules generated by Clique SuM, we choose the top four modules from each of the gene transcription and methylation sets, and prioritized genes detected in modules from multiple datasets in each omic. This analysis showed that the strongest MS SNP enrichment was found for genes in at least three out of four transcriptomic modules (n=1,552; P= 6.0 x 10-7) and two out of four methylomic modules (n=324, P= 1.5x10-6). Next, we used the same principle to combine these two and found that the intersection between the gene transcription and methylation consensus resulted in a module (n = 220 genes, Fig. 4) enriched for MS-associated genes (75/220, P < 2.2 x 10-16, OR = 7.8) and with the highest GWAS enrichment (P = 8.8 x 10-9) which we hereafter referred to as the multi-omics MS module. The multi-omics MS module was enriched in genes associated with major MS pathways. As we used GWAS enrichment as a selection criterion, the high GWAS enrichment of the final module was partly expected, which led us to analyze its biological functions and their potential epigenetic associations to MS. First, pathway enrichment analysis showed that the multi-omics module genes (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 13 are significantly involved in several inter-linked immune-related pathways, most of which have been previously associated to MS, including the T cell receptor[25] (adjusted P = 3.6 x 10-47), PI3K/Akt[26] (P = 4.6 x 10-35), ErbB[27] (P = 7.7 x 10-32), Fc epsilon RI[28] (P = 8.3 x 10-30), chemokine[29,30] (P = 2.6 x 10-28), MAPK[31,32] (P = 2.0 x 10-25), and B cell receptor[32] (P = 3.9 x 10-19) signaling pathways; Th17 (P = 9.6 x 10-29), and Th1 and Th2 (P = 6.9 x 10-19) cell differentiation[33]; natural killer cell mediated cytotoxicity (P = 1.6 x 10-27); and leukocyte transendothelial migration (P = 3.9 x 10-20), which indeed supports their relevance in MS. Interestingly, the module was also highly enriched in morphogenetic and neurogenetic signaling pathways, such as the neurotrophin (adjusted P = 1.3 x 10-36), Ras (P = 1.4 x 10-36), Rap1 (P = 2.2 x 10-35), vascular endothelial growth factor (VEGF, P = 1.7 x 10-27), FoxO (P = 3.6 x 10-27), and mTOR (P = 4.1 x 10-14) signaling pathways; and in growth hormone synthesis, secretion and action (P = 6.6 x 10-31). The multi-omics MS module was enriched in genes associated with five known environmental MS risk factors validated in an independent cohort. Second, from a literature study[34,35] we found nine environmental MS risk factors of varying evidence for which we could identify methylation studies in healthy controls. For each of these risk factors we derived the top 1000 differentially methylated genes (DMGs) and tested their enrichment with the module. Intriguingly, the module was significantly enriched for genes associated with five risk factors (Fig. 5b), which included the top associated risk factors, i.e., Epstein-Barr virus (EBV) infection (Fisher exact test P = 1.5 x 10-3, OR = 2.1) and smoking (P = 1.2 x 10-4, OR = 2.3), as well as low sun exposure (P = 1.2 x 10-4, OR = 2.3), high BMI (P = 0.023, OR = 1.7) and alcohol consumption (P = 2.9 x 10-4, OR = 2.2). Then, we asked whether these putative gene-risk factor associations could be validated using an independent omics dataset with paired risk factor associations. For this purpose, we utilized methylation arrays of peripheral blood from 139 MS patients and 140 controls, which have been described previously[36]. In this analysis we also considered risk factor associations for each individual including age, sex, BMI at age of 20, smoking, alcohol consumption, sun exposure, night shift work, contact with organic solvents. This enabled analysis of DMGs for the MS and risk factor status as covariates in linear mixed effect (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 14 analysis. Indeed, the module genes were highly significantly enriched for MS (n = 217; permutation test P = 1.2 x 10-47), but also for all the tested risk factors (EBV was not included, Methods) and non- significantly associated to age and sex having 104-135 of the genes in each factor (3.9x10-8 < P < 0.013; Fig 5b). Combining all these results we found 90 of the 220 module genes to be associated with a risk factors from both the risk factor studies, 25 genes were associated with two risk factors, and seven genes were associated with three risk factors (CSK, PRKCA, PRKCZ, RUNX1, RUNX3, STAT5A, and SYNJ2) (Fig. 5c). These associations suggest that the multi-omics module is capturing a key disease network with both genetically and epigenetically driven alterations, thereby providing the possibility to use it to identify potential novel biomarkers or therapeutic targets for MS.� (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 15 DISCUSSION The analysis of case control data in the context of networks has gained increased interest to detect consistent robust gene signatures of individual diseases. The application of disease modules might vary for different researchers, but here we systematically aimed at the detection of disease genes supported by genetic association. For this purpose, our study of the transcriptome and methylome profiles of 19 diseases showed significant GWAS enrichments for several inflammatory and heart diseases, while psychiatric disorders showed no enrichments and might not be suitable for GWAS validation of modules, potentially due to differences in affected tissue types and sampling points. However, analysis of the significant results showed that methods based of differentially expressed cliques in the protein-protein interaction network demonstrated the strongest enrichments (highest scoring for Clique SuM), while those based primarily on correlations, like WGCNA, showed weak enrichments. A potential reason for this could be that GWAS has shown to be mostly associated to the central genes of the protein-protein interaction (PPI) network, but our analysis demonstrated that the correlation between GWAS enrichment and centrality was non-significant. We also tested whether there was an improvement using consensus approaches that counted the frequency of the result of multiple methods but found this not to increase performance. Moreover, we tested the same strategy on a set of inflammatory, glycemic, and autoimmune methylation datasets and found similar results. We would like to emphasize that, rather than scoring a single best working method, our result is a pipeline for evaluating modules using independent high-throughput enrichments. The work on transcription and methylation datasets suggested that MS is a disease highly enriched for GWAS, and we therefore tested if increased enrichments could be derived by their integration. We found 20 publicly available datasets and run assessment for both omics independently, which again showed Clique SuM to score highest. We then tested if improved results could be obtained using modules from multiple datasets of these two omics using consensus modules from Clique SuM. This resulted in a module of 220 genes highly enriched for GWAS (P = 8.8 x 10-9). The multi- omic module was highly enriched in immune-associated pathways, such as T cell and B cell receptor (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 16 signaling, Th1/Th2 differentiation, or leukocyte transendothelial migration. These results conform with the current hypothesis that MS is mediated by an autoreactive response of CD4+ T cells against myelin surrounding neuronal axons, preceded by their migration across the blood-brain barrier (BBB)[37]. This autoproliferation of brain-targeting Th1 cells has been shown to be driven by memory B cells, in a process mediated by HLA-DR15[38]. In addition, another enriched pathway was VEGF signaling. MS patients present high serum VEGF levels, which is related to pro-inflammatory functions and can alter the permeability of the BBB[39]. As GWAS was used for method prioritization we asked if modules instead could be validated using epigenetics and lifestyle risk factor genes that we identified to associate with MS. With this aim, we compiled a set of publicly available data from omics studies of these risk factors in healthy individuals. This analysis demonstrated that five out of eight risk factors were enriched in our module. In order to validate the use of an environmental assessment using public domain risk factor association we found an independent methylome study of MS comprising environmental data for each MS and healthy individual. This analysis showed a remarkable enrichment of the 220 module genes by 217 to differentially methylated genes for MS (P = 1.2 x 10-47), and a majority to be associated with the tested risk factors. In contrast to previously known community challenges, in our study we not only used the topological property of the network, but we also combined the methods to use an omics-based input to uncover the disease modules that might be dysregulated at each omics level, contributing to the diverse causative mechanisms behind complex diseases. Although using the PPI network as background may lead to certain knowledge bias, this kind of benchmark allowed us to look at the relevant risk factors. In our assessment of the disease modules, methods such as Clique SuM and DIAMOnD did perform better than the community-based consensus predictions. In summary, our study provides a practical integrative workflow that enables system-level analysis of heterogeneous diseases, in terms of multi-omics disease modules, as well as the validation of these by using both disease-specific GWAS and risk factors enrichment. We believe that this analysis (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 17 validates our integrated use datasets and suggest a pipeline that readily could be tested in at least in other autoimmune and cardiovascular diseases. Lastly, our study did not aim to optimize hyper- parameters for individual disease modules, and instead used default values when possible, and to the methods from the MODifieR R package implementation of the methods[13]. However, this might be an important task for specific disease and our code and processed datasets are available at GitLab (https://gitlab.com/Gustafsson-lab/modifier-benchmark). In future work, this approach can be expanded to include diverse and context-specific networks to determine whether our multi-omics modules are able to capture various other levels of granularity. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 18 DECLARATIONS ETHICS APPROVAL AND CONSENT TO PARTICIPATE Not applicable AVAILABILITY OF DATA AND MATERIALS The data used for transcriptomic benchmark and methylation benchmark are downloaded from GEO. The disease specific GWAS files are downloaded from the latest Pascal version. The processed Data for analysis is available at https://gitlab.com/Gustafsson-lab/modifier-benchmark.The risk factor (EIMS) data will be made available on request. The R-package MODifieR is available on the GitLab: https://gitlab.com/Gustafsson-lab/MODifieR; the code used for benchmark analysis and risk factor analysis is available on GitLab: https://gitlab.com/Gustafsson-lab/modifier-benchmark ; the latest Pascal version: https://www2.unil.ch/cbg/index.php?title=Pascal. COMPETING INTERESTS The authors declare no competing interests. FUNDING This work was supported by the Swedish Research Council (grant 2015-03807(M.G.), grant 2018- 02638(M.J.)), the Swedish foundation for strategic research (grant SB16-0095(M.G.)), the Center for Industrial IT (CENIIT)(M.G.), European Union Horizon 2020/European Research Council Consolidator grant (Epi4MS, grant 818170(M.J.)), Knut and Alice Wallenberg Foundation (grant 2019.0089(M.J.)) and the Knowledge Foundation (grant 20170298 (Z.L.)). Computational resources were granted by Swedish National Infrastructure for Computing (SNIC; SNIC 2020/5-177, LiU-2018-12 and LiU-2019- 25). AUTHOR CONTRIBUTIONS T.V.S.B. compiled the necessary data for the benchmark analysis. H.A.W. performed the transcriptomic benchmark analysis. T.V.S.B. performed the methylation benchmark analysis. D.M.E. and H.A.W. performed the MS use case analysis. D.M.E performed the risk factor analysis. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 19 M.J.,I.K.,T.O., and L.A., provided the raw data and collected the associated risk factor data for the independent methylation dataset. T.V.S.B performed the independent validation dataset analysis. T.V.S.B. and D.M.E. collectively made the plots and figures for the manuscript. M.G. and Z.L. designed the study. T.V.S.B. and D.M.E. prepared the manuscript. All authors discussed the results and commented on the manuscript at all stages. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 20 REFERENCES 1. Naylor S, Chen JY. NIH Public Access. Natl Institutes Heal. 2011;7:275–89. 2. Santiago JA, Bottero V, Potashkin JA. Dissecting the Molecular Mechanisms of Neurodegenerative Diseases through Network Biology. Front Aging Neurosci [Internet]. 2017;9:1–13. Available from: http://journal.frontiersin.org/article/10.3389/fnagi.2017.00166/full 3. Barabási AL, Gulbahce N, Loscalzo J. Network medicine: A network-based approach to human disease. Nat Rev Genet [Internet]. Nature Publishing Group; 2011;12:56–68. Available from: http://dx.doi.org/10.1038/nrg2918 4. Gustafsson M, Nestor CE, Zhang H, Barabási A-L, Baranzini S, Brunak S, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med [Internet]. 2014;6:82. Available from: http://genomemedicine.biomedcentral.com/articles/10.1186/s13073- 014-0082-6 5. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-cepas J, et al. STRING v11[: protein – protein association networks with increased coverage , supporting functional discovery in genome- wide experimental datasets. Oxford University Press; 2019;47:607–13. 6. Lamparter D, Lin J, Kutalik Z, Choobdar S, Hescott B, Tomasoni M, et al. Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases. SSRN Electron J. 2018;1– 63. 7. Schadt EE. Molecular networks as sensors and drivers of common human diseases. Nature [Internet]. 2009;461:218–23. Available from: http://www.nature.com/doifinder/10.1038/nature08454 8. Ghiassian SD, Menche J, Barabási AL. A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome. Rzhetsky A, editor. PLoS Comput Biol [Internet]. 2015;11:e1004120. Available from: (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 21 https://dx.plos.org/10.1371/journal.pcbi.1004120 9. Hellberg S, Eklund D, Gawel DR, Köpsén M, Zhang H, Nestor CE, et al. Dynamic Response Genes in CD4+ T Cells Reveal a Network of Interactive Proteins that Classifies Disease Activity in Multiple Sclerosis. Cell Rep. 2016;16:2928–39. 10. Wang H, Rogers G, Benson M, Jarvelin M-R, Chavali S, Ramasamy A, et al. Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biol. 2012;13:R46. 11. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9. 12. Choobdar S, Ahsen ME, Crawford J, Tomasoni M, Fang T, Lamparter D, et al. Assessment of network module identification across complex diseases. Nat Methods. 2019;16:843–52. 13. de Weerd HA, Badam TVS, Martínez-Enguita D, Åkesson J, Muthas D, Gustafsson M, et al. MODifieR: an Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks. Bioinformatics. 2020;1–2. 14. Tian Y, Morris TJ, Webster AP, Yang Z, Beck S, Feber A, et al. Genome analysis ChAMP[: updated methylation analysis pipeline for Illumina BeadChips. 2017;33:3982–4. 15. Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-cabrero D, et al. Gene expression A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. 2013;29:189–96. 16. Johnson WE, Li C. Adjusting batch effects in microarray expression data using empirical Bayes methods. 2007;118–27. 17. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. 2015;43. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 22 18. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol. 2016;12:1–20. 19. Mosteller, F. and Fisher R. A. Questions and Answers # 14 Author ( s ): Frederick Mosteller and R . A . Fisher Published by[: Taylor & Francis , Ltd . on behalf of the American Statistical Association Stable URL[: http://www.jstor.org/stable/2681650 All use subject to http://about.jsto. 1948;2:30–1. Available from: http://www.jstor.org/stable/2681650 20. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020;48:D845–55. 21. Yu G, Wang LG, Han Y, He QY. ClusterProfiler: An R package for comparing biological themes among gene clusters. Omi A J Integr Biol. 2012;16:284–7. 22. Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin , Benno Schwikowski, and Trey Ideker. Cytoscape: A Software Environment for Integrated Models. Genome Res [Internet]. 1971;13:426. Available from: http://ci.nii.ac.jp/naid/110001910481/ 23. Maere S, Heymans K, Kuiper M. Systems biology BiNGO[: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. 2005;21:3448–9. 24. Supek F, Bošnjak M, Škunca N, Šmuc T. Revigo summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6. 25. Carbone F, De Rosa V, Carrieri PB, Montella S, Bruzzese D, Porcellini A, et al. Regulatory T cell proliferative potential is impaired in human autoimmune disease. Nat Med. 2014;20:69–74. 26. Mammana S, Bramanti P, Mazzon E, Cavalli E, Basile MS, Fagone P, et al. Preclinical evaluation of the PI3K/Akt/mTOR pathway in animal models of multiple sclerosis. Oncotarget. 2018;9:8263–77. 27. Holley JE, Gveric D, Newcombe J, Cuzner ML, Gutowski NJ. Astrocyte characterization in the (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 23 multiple sclerosis glial scar. Neuropathol Appl Neurobiol. 2003;29:434–44. 28. Pedotti R, DeVoss JJ, Youssef S, Mitchell D, Wedemeyer J, Madanat R, et al. Multiple elements of the allergic arm of the immune response modulate autoimmune demyelination. Proc Natl Acad Sci U S A. 2003;100:1867–72. 29. Cui LY, Chu SF, Chen NH. The role of chemokines and chemokine receptors in multiple sclerosis. Int Immunopharmacol [Internet]. Elsevier; 2020;83:106314. Available from: https://doi.org/10.1016/j.intimp.2020.106314 30. Krumbholz M, Theil D, Cepok S, Hemmer B, Kivisäkk P, Ransohoff RM, et al. Chemokines in multiple sclerosis: CXCL12 and CXCL13 up-regulation is differentially linked to CNS immune cell recruitment. Brain. 2006;129:200–11. 31. Krementsov DN, Thornton TM, Teuscher C, Rincon M. The Emerging Role of p38 Mitogen- Activated Protein Kinase in Multiple Sclerosis and Its Models. Mol Cell Biol. 2013;33:3728–34. 32. Kotelnikova E, Kiani NA, Messinis D, Pertsovskaya I, Pliaka V, Bernardo-Faura M, et al. MAPK pathway and B cells overactivation in multiple sclerosis revealed by phosphoproteomics and genomic analysis. Proc Natl Acad Sci U S A. 2019;116:9671–6. 33. Kunkl M, Frascolla S, Amormino C, Volpe E, Tuosto L. T Helper Cells: The Modulators of Inflammation in Multiple Sclerosis. Cells. 2020;9:482. 34. Waubant E, Lucas R, Mowry E, Graves J, Olsson T, Alfredsson L, et al. Environmental and genetic risk factors for MS: an integrated review. Ann Clin Transl Neurol. 2019;6:1905–22. 35. Olsson T, Barcellos LF, Alfredsson L. Interactions between genetic, lifestyle and environmental risk factors for multiple sclerosis. Nat Rev Neurol. Nature Publishing Group; 2016;13:26–36. 36. Kular L, Liu Y, Ruhrmann S, Zheleznyakova G, Marabita F, Gomez-Cabrero D, et al. DNA methylation as a mediator of HLA-DRB1 15:01 and a protective variant in multiple sclerosis. Nat (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 24 Commun. 2018;9. 37. Compston A, Coles A. Multiple sclerosis. Lancet [Internet]. Elsevier Ltd; 2008;372:1502–17. Available from: http://dx.doi.org/10.1016/S0140-6736(08)61620-7 38. Jelcic I, Al Nimer F, Wang J, Lentsch V, Planas R, Jelcic I, et al. Memory B Cells Activate Brain- Homing, Autoreactive CD4+ T Cells in Multiple Sclerosis. Cell. 2018;175:85-100.e23. 39. Lange C, Storkebaum E, De Almodóvar CR, Dewerchin M, Carmeliet P. Vascular endothelial growth factor: A neurovascular target in neurological diseases. Nat Rev Neurol [Internet]. Nature Publishing Group; 2016;12:439–54. Available from: http://dx.doi.org/10.1038/nrneurol.2016.88 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 Figures : Figure 1. Overview of the benchmark assessment of disease modules and the integration workflow for MS. (a) Transcriptomic and methylomic datasets from 19 different diseases were used as inputs for eight MODifieR module identification methods. The resulting single-omic disease modules (n = (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 26 456) were independently assessed by GWAS enrichment analysis of the same disease using Pascal module scoring. MODifieR methods were evaluated by the combined enrichment score of their respective disease modules. (b) Multi-omic integrative workflow for multiple sclerosis (MS)- associated modules. Data from 20 case-control comparisons were used as input for module detection with MODifieR methods. Clique SuM modules presented the highest GWAS enrichment score and were therefore used to generate single-omic consensus modules. The intersection of the best transcriptomic and methylomic consensus modules resulted in an MS multi-omic module (n = 220 genes) with the highest GWAS enrichment, which was independently found to be enriched for genes associated with five known lifestyle MS risk factors using public omics data from healthy individuals. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 Figure 2. Genomic concordance of MODifieR modules on transcriptomic datasets. (a) Heatmap of PASCAL p-values for eight single-method and eight consensus MODifieR modules, identified for 47 publicly available transcriptomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module ( < 0.05; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into six disease types: cardiovascular (red), glycemic (golden), inflammatory (green), neurodegenerative (fuchsia), psychiatric and social (pink), autoimmune (dark purple), and others (light purple); and two cell types: blood (maroon), and others (light yellow). Datasets are ranked by meta P-values using Fisher’s method of the single-method module P-values across and within their disease types (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red), plus the consensus modules (blue). Single-methods and consensus were scored by meta P-values across (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 28 datasets (method score, right boxplot). Consensus x/8 indicates that the module genes are found in at least x methods out of eight. (b) Scatter plot showing Spearman correlation between module score and betweenness centrality. Modules are represented with a different shape depending on their method and colored based on the disease type. (c) Scatter plot showing Spearman correlation between module score and module size. Modules are represented with a different shape depending on their method and colored based on the disease type. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 Figure 3. Genomic concordance of MODifieR modules on methylomic datasets. (a) Heatmap of Pascal p-values for eight single-method and eight consensus MODifieR modules, identified for ten publicly available methylomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module (P < 0.05; the darker, the more significant), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into two disease types: glycemic (golden), and inflammatory (green); and two cell types: blood (maroon), and others (light yellow). Datasets are ranked by Fisher’s combined P of the single-method module P-values across and within their disease types (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression- based (yellow), and clique-based (red), plus the consensus modules (blue). Single-methods and consensus are scored by meta P-values across datasets (method score, right boxplot). Consensus x/8 indicates that the module genes are found in at least x methods out of eight. (b) Scatter plot (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 30 showing Spearman correlation between module score and betweenness centrality. Modules are represented with a different shape depending on their method and colored based on the disease type. (c) Scatter plot showing Spearman correlation between module score and module size. Modules are represented with a different shape depending on their method and colored based on the disease type. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 31 Figure 4. Genomic concordance of MODifieR modules on MS use case data. (a) Heatmap of PASCAL p-values for eight single-method MODifieR modules, identified for ten MS-related transcriptomic datasets. Module performance P-values are shown in a white to blue scale, where any shade of blue represents a significant module (P < 0.05), white represents a non-significant module, and grey represents a module of size zero. Datasets are classified into the reported MS type: MS (blue), RRMS (red), PPMS (green), SPMS (orange), and CIS (yellow); and four cell types: whole blood (maroon), PBMCs (light brown), white matter (light yellow), and CD4+ T cells (purple). Datasets are meta P- values of the single-method enrichments (dataset score, bottom boxplot). MODifieR methods are organized by algorithm type: seed-based (green), co-expression-based (yellow), and clique-based (red). Single methods are scored by P of the significant modules across datasets (method score, right boxplot). (b) Heatmap of PASCAL p-values for four single-method MODifieR modules, identified for nine MS-related transcriptomic datasets. (c-d) Bar plots of Pascal p-values for the MS consensus modules generated with Clique SuM from transcriptomic (a) and methylomic (b) datasets. (e) Union and intersection of the top performing modules, shown as a Venn diagram. Diseas e Type MS RRMS PPMS SPMS CIS Module Performance α = .05 1 10-2 10-3 10-4 ≤10-5Best Worst P Cell Type WB PBMCs WM CD4+ T cells CD14+ Monocytes CD19+ B cells CD8+ T cells a b c d e 0 2 4 6 8 1/4 2/4 3/4 4/4 Transcriptomic Cliq ue SuM consensus modules α -l o g 1 0 P * 0 2 4 6 8 α -l o g 1 0 P 1/4 2/4 3/4 4/4 Methylomic Cliq ue SuM consensus modules * Best transcriptomic consensus Best methylomic consensus IntersectionUnion ngenes 1041332 220 1656 *(P = 4.82 x 10 -8) (P = 3.74 x 10 -8) (P = 1.95 x 10 -8) (P = 8.76 x 10 -9) Diseas e Type Cell Type Mod. Disco v. MCODE Correl. Clique Clique SuM WGCNA MODA Di��CoEx DIAMOnD T1 2 4 6 8 0 α = .05 0 2 4 6 8 α = .05 11.5 -log10P Disease Type Cell Type Mod. Disco v. MCODE Correl. Cliq ue Clique SuM WGCNA MODA Di��CoEx DIAMOnD α = .05 0 2 4 6 8 -log10P T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 M1 M2 M3 M4 M5 M6 M7 M8 M9 NA NA NA NANANA NA NA NA NA NA NA NA NA NA NA α = .05 0 2 4 6 8 -l o g 1 0 P -l og 1 0 P (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 32 Figure 5. Risk factor enrichment and network visualization of the MS multi-omic module. (a) Evidence levels and effect on MS of the risk factor. � (b) Enrichment overlap of multi-omic MS DYNC1H1 JUN MAPK9 MAPK8 PRKCA PRKCE MAPK11 LCP2 RHOA DYNLL1 GRAP2 BCL6 DNM3 DNM1 PRKACB CASP3 BCL2 NRIP1 DNM2 BCL2L11 PRKACA PTEN ATF2 PRKCI BID RAC1 RAC2 RASA1 NRAS SOS1 PIK3CA HRAS CASP8 CDC42 PRKCZ PARD6A MET PLCG1 IRS1 PTK2 PGR KRAS RET HGF PIK3CB GAB1 VAV1 GRB2 ERBB2 HCK PIK3CD CRKL PIK3R2 CARM1 IGF1 PTK2B KDR VEGFA PXN EDN1 CBL BCAR1 APP SH3GL2 IQGAP1 SHC1 BDNF NGF NTRK1 PTPN6 EGFR INS GNB1 GNG2 ARID1A TRIM25 GNAI1 AR PIK3R3 PIK3R1 PTPRJ SP1 INPP5B TNF CTNNB1 NCAM1 CDH1 SPP1 SEC13 CSK TLN1 RAP1B ABL 1SRC ITGB3 PTPN11 EGF IT GB1 ITGAV SYNJ2 CD7 4 HLA-E CLTA CD4 HLA-DPB1 HLA-A PTPN22 HLA-DRA IL 10 MMP9 PIP5K1B CXCR4 CXCL12 ICAM1 LCKHLA-DRB1 AP2M1 AP2B1 FCGR1A AP1M1 MAPK14 VWF IRF7 IRF1 IRF4 IL4 IL 6 IFNG AKT3 A P2 A2 HSP90AA1 CD3D PPP2R1A GSK3B PPP2CA FGG EPS15L1 FGF2 PTPRC CD3G HSP90AB1 EPHA2 F N1 CLTC PIP5K1A VCAM1 FYN ESR1 TGFB1 ITGB2 CD8 6 NR3C1 CD80 CD3E AP2A1 RUNX1 CD28 CD4 4 CEBPB AP2S1 NFKB1 HDAC1 KIT CDK4 CCNA1 UBE2I PCNA CCND1 RELA STAT5A PRKCD PRKCQ ZAP70 RAF1 YWHAB AKT1 CD24 7 RAP1A MAPK1 MAPK3 PTAFR RAB7A MAP2K1 SMAD4 MAP3K5 CREBBP SMAD2 HMGB1 NGFR DAXX AKT2 PPARG TRIM2 4 SMAD3 MYC CTSS SIRT1 CSF2 BRCA1 SPTBN2 TP53 H2 AX SPHK1 EP3 00 JAK1 IRF3 STAT3 STAT1 STAT6 PAK1 HIF1A PLCG2PDGFB JAK2 PDGFRB CCNE1 RUNX3 RB1 EZH2CDK2 Functional Clusters Cell death and apoptosis Morphogenesis and neurogenesis Cell cycle and proliferation Chemotaxis and cell migration Response to hormone stimulus Leukocyte activation and di��erentiation Node Color Legend Low sun exposure Smoking High BMI Alcohol use EBV infection Associated with MS Signif. enriched MS risk factors Risk factor Evidence E��ect EBV infection Smoking Low sun exposure Adolescent obesity High BMI Night shift work Organic solvent exposure Alcohol consumption Oral tobacco +++ +++ ++ ++ ++ ++ + + + � Risk � Risk � Risk � Risk � Risk � Risk � Risk � Risk a c b Module enrichments 1 2 3 4 Risk factor datasets -log10 P α = .05 1 2 3 4 Validation dataset -log10 P α = .05 NA NA NA 7.4 � Risk (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783 33 module genes in the top 1,000 DMGs in risk factor datasets and independent risk factor methylation dataset (see Methods) shown as Fisher exact test P-values (threshold α=0.05). (c) Visualization of the module. Nodes (module genes) are arranged in functional clusters according to their overrepresented GO terms. Genes with a known association to MS are marked with a blue circle. Node colors display the associations to an MS risk factor for which the module is significantly enriched (red, alcohol use; green, high BMI; yellow, smoking; purple, low sun exposure; light blue, EBV infection; grey, no association). Edges were extracted from the STRINGdb v11 human PPI network of experimentally validated interactions (confidence score > 700). SUPPLEMENTARY MATERIALS Supplementary Table 1: All case-control comparisons used in the Transcriptomic and Methylomic benchmarks. Supplementary Table 2: All case-control comparisons used in the MS use case benchmark. Supplementary Table 3: All Methods implemented in the benchmark. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2020.10.26.351783doi: bioRxiv preprint https://doi.org/10.1101/2020.10.26.351783