title: Discovery methods for systematic analysis of causal molecular networks in modern omics datasets
authors: Kelly, Jack; Berzuini, Carlo; Keavney, Bernard; Tomaszewski, Maciej; Guo, Hui
date: 2022-01-28

With the increasing availability and size of multi-omics datasets, investigating the causal relationships between molecular phenotypes has become an important aspect of exploring underlying biology and genetics. This paper introduces and reviews the methods for building large-scale causal molecular networks that have been developed in the past decade. Existing methods each have their own strengths and limitations, so there is no single best approach and the choice is left to the discretion of the researcher. This review also discusses some of the current limitations to biological interpretation of these networks, and important factors to consider for future studies on molecular networks.

Molecular networks are important to understanding biological processes beyond the analysis of a single gene or molecule (1). Molecular phenotypes at all levels do not operate in isolation; their interactions make up complicated networks that contain a wealth of information. In an age when more data is being produced than ever, these networks can become increasingly complex. A molecular network consists of a set of nodes and edges. Nodes represent information from multi-omics data, including but not limited to genes, messenger RNAs (mRNAs), proteins, DNA methylation patterns and protein phosphorylation. Edges represent the relationships between nodes, and so can symbolise direct and indirect relationships between molecular phenotypes as well as transcriptional regulation.

One of the primary advantages of molecular networks is in elucidating genetic and biological mechanisms underlying disease. Even in diseases with known causative genes (e.g.
CFTR mutations causing cystic fibrosis (2) and mutations in HTT leading to Huntington's disease (3)), these genes act as part of a large network and never in isolation. Dysregulated biological processes and important 'hubs' within them can be identified as disease drivers, which may help identify drug targets that affect sets of associated genes rather than individual genes, though this has yet to be translated into clinically useful therapies (4).

Undirected networks (as shown in Figure 1A) rely on correlation between nodes to infer symmetric associations. Causal networks, by contrast, aim to distinguish directed regulatory relationships from mere associations. This approach identifies directed (as shown in Figure 1B) or mixed networks (as shown in Figure 1C). It is worth noting that directed relationships in a network do not necessarily have a causal interpretation, as they may merely depict temporal orders in the data generating process. Only if the confounders between the nodes have been adjusted for will these relationships have a causal meaning.

Identifying causal relations from gene expression data was proposed over 20 years ago (5), and since then a large number of causal inference methods have been developed using omics data. This approach is advantageous in the study of biology as it allows causality to be inferred without interventions, especially when randomised controlled trials are infeasible due to high cost and ethical issues (6). As the technology becomes more accessible, an increasing range of omics data is being collected, which allows for integrative analysis to develop a more complete picture of how different types of omics interact with one another (7).

Causal inference in molecular networks is a growing area of research. However, complex high dimensional causal networks have limited use, and their contribution to the literature is heavily restricted as they are often difficult to interpret.
There need to be approaches that allow for identification of biologically important sub-networks and a small number of targets for future research or therapeutic intervention. In this review, we discuss the current literature using causal discovery methods on molecular networks and the challenges that the area is facing. We also discuss factors that influence interpretation of causal networks, including clustering and visualisation. Previous reviews (8, 9) have focussed on introducing methodologies for building causal networks and have given few biological examples; here we focus on published methods, their applications specifically to molecular networks, and subsequent biological interpretation.

Undirected networks have been an important approach for the investigation of biological processes and identification of hub genes in disease. Traditionally, protein-protein interaction networks have been built using a combination of in vivo and in vitro methods to understand interactions; however, these approaches have huge time and financial costs, and result in noisy networks with many false positives (10). In silico approaches using omics data have been used as an alternative to better understand these undirected associations. Most commonly, molecular networks are built on the basis of co-expression, usually measured using correlation (11). Transcriptomics data, which uses mRNA levels to measure gene expression, is increasingly being generated. It has become popular to use dedicated R packages to infer undirected networks from such data. For example, weighted gene co-expression network analysis (WGCNA) (12) is particularly user-friendly, as the authors have produced extensive tutorials and guides to increase accessibility to researchers. Gene co-expression networks have successfully been used to identify important gene clusters (modules) and hub genes in many diseases, including cancer (13) and neurodegenerative disease (14).
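As a rough illustration of how a correlation-based co-expression network can be assembled, the sketch below computes pairwise Pearson correlations between gene expression profiles and keeps an edge wherever the absolute correlation passes a hard threshold. This is not the WGCNA procedure, which uses soft thresholding; the gene names, expression values, function names and the 0.8 cut-off are all hypothetical choices made for the example.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expression, threshold=0.8):
    """Undirected edges between genes whose |correlation| is at least `threshold`."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges

# Hypothetical expression values for four genes across five samples.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [1, -1, 0, -1, 1],
}
edges = coexpression_edges(expression)

# Node degree is one simple (assumed) way to flag candidate hub genes.
degree = {g: sum(g in e for e in edges) for g in expression}
```

Because the edges are symmetric, a network built this way says nothing about direction; geneA, geneB and geneC end up mutually connected whether the underlying correlation is positive or negative.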
In many cases, the genes within these modules are investigated to see if any biological pathways are enriched, using freely available resources such as the Gene Ontology (GO) (15, 16) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (17) databases. Novel network approaches to hub detection (14) have been developed to understand important genes in disease. Although they provide limited mechanistic understanding, undirected networks are important as they are often precursors to the study of causal networks.

Applications of different causal methods to omics data are covered in this review. The simplest causal network involves only the causal relationship between a pair of variables, investigating whether a single exposure can cause a single outcome. Causal networks can be made increasingly complex to investigate the relationships between thousands of variables. Here, we consider Mendelian randomisation (MR), Bayesian networks (BN) and the PC algorithm, which have been used individually with molecular phenotype data, as shown in Figure 2. We then focus on ensemble approaches that aim to reduce the limitations of any single method. A summary of the methodologies discussed here is shown in Table 1.

Figure 2. The true causal graph is shown in (B). The PC algorithm begins with an undirected, fully connected graph (i) and uses data to create a skeleton graph with undirected edges. In this case, the X1-X2 edge is removed because X1 is independent of X2 (ii), and the X1-X4 edge is removed because the nodes are independent given X3; the same is true for the X2-X4 edge (iii). Then v-structures are identified (iv) and the final edges oriented (v) (18).

MR uses single-nucleotide polymorphisms (SNPs) as 'instrumental variables' (IVs) to infer the causal effect of an exposure on an outcome. It mimics randomised controlled trials by assuming that SNP genotypes are randomly assigned to individuals within a population.
MR has three key assumptions (Figure 2A) that must be met to infer causal relationships: a) IVs are associated with the exposure of interest; b) IVs are independent of confounders (both observed and unobserved) of the exposure-outcome relationship; c) IVs affect the outcome only through the exposure of interest. Horizontal pleiotropy occurs when the IV influences the outcome other than through its effect on the exposure, breaking the assumption that genotype only affects the outcome through the phenotype of interest. Several adaptations of MR have been developed to reduce the impact of horizontal pleiotropy. Popular approaches include MR-Egger (19) (which models pleiotropy assuming that the pleiotropic effects of the IVs on the outcome are independent of their effects on the exposure), MR-PRESSO (20) (which corrects for IVs with outlier effects) and Causal Analysis Using Summary Effect estimates (CAUSE) (21) (which accounts for correlated and uncorrelated pleiotropic effects). These newer approaches have seen wide adoption in the literature as they mitigate these limitations of MR. MR-PRESSO and MR-Egger are often both applied to the same data and the results compared, to reduce the impact of pleiotropy and outliers. These approaches have been used to provide evidence supporting the causal effect of estimated glomerular filtration rate, a measure of kidney function, on chronic kidney disease, kidney stone formation, diastolic blood pressure and hypertension (22). Additionally, they have been used to show the causal effect of blood pressure on renal outcomes commonly affecting patients with hypertension (7).

In most cases, MR analysis requires that the IV-exposure and IV-outcome associations come from two independent studies (23). This is known as two-sample MR. There are a limited number of one-sample MR methods that deal with IVs, exposures and outcomes coming from a single study (19, 24).
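To make the summary-statistic setting concrete, the following sketch computes an inverse-variance weighted (IVW) estimate and an MR-Egger fit from per-SNP associations with the exposure and the outcome. IVW is not named in the text above and is included only as the usual baseline against which an MR-Egger fit is read. The function names and all numbers are hypothetical, and a constant pleiotropic offset of 0.02 is deliberately built into the toy outcome associations.

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Weighted regression of outcome on exposure associations through the origin."""
    w = [1.0 / s ** 2 for s in se_out]
    num = sum(wi * bx * by for wi, bx, by in zip(w, beta_exp, beta_out))
    den = sum(wi * bx ** 2 for wi, bx in zip(w, beta_exp))
    return num / den

def egger_estimate(beta_exp, beta_out, se_out):
    """Weighted regression with an intercept; returns (intercept, slope).

    Assumes SNP alleles are oriented so that exposure associations are positive.
    """
    w = [1.0 / s ** 2 for s in se_out]
    sw = sum(w)
    mx = sum(wi * bx for wi, bx in zip(w, beta_exp)) / sw
    my = sum(wi * by for wi, by in zip(w, beta_out)) / sw
    sxy = sum(wi * (bx - mx) * (by - my)
              for wi, bx, by in zip(w, beta_exp, beta_out))
    sxx = sum(wi * (bx - mx) ** 2 for wi, bx in zip(w, beta_exp))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical summary statistics for four SNPs.
beta_exp = [0.1, 0.2, 0.3, 0.4]                 # SNP-exposure associations
beta_out = [0.02 + 0.5 * b for b in beta_exp]   # true effect 0.5 plus offset 0.02
se_out = [0.01] * 4                             # SNP-outcome standard errors

ivw = ivw_estimate(beta_exp, beta_out, se_out)
intercept, slope = egger_estimate(beta_exp, beta_out, se_out)
```

With this toy input the intercept absorbs the built-in offset and the Egger slope recovers the generating effect, whereas the through-origin IVW estimate is pulled upwards. Real analyses would also need standard errors for the estimates, which are omitted here.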
Some extensions to MR have been developed to handle data in which the two studies share individuals (25), which in classic MR approaches leads to bias. Zou et al. (26) have developed a more flexible Bayesian MR method that can handle one, two and overlapping samples. Bayesian MR has an advantage in its flexibility in coping with complex data structures, such as overlapping samples, horizontal pleiotropy, study heterogeneity and multiple exposures and outcomes, all in a single model (26-28). Identifying IVs with similar causal estimates may improve identification of causal mechanisms. These advancements in MR methodologies provide researchers with more options to design models that better fit the assumptions of MR.

MR has been increasingly applied to infer causality (31, 32); however, applications have focused on small-to-medium scale problems, and applications to large-scale omics networks have been limited. Nevertheless, MR has found use in combination with other approaches to building molecular networks, which will be discussed shortly.

Bayesian networks (BNs) are probabilistic graphical models of data. BNs are represented as directed acyclic graphs (DAGs): graphs in which every edge is directed, no edge is bidirectional and no subset of nodes forms a closed loop. The edges of the DAG are determined via conditional independence: two nodes are not joined by an edge when they are independent conditional on some subset of the other nodes in the graph. An example of a BN is shown in Figure 2B. There are two main types of method for learning Bayesian networks: constraint-based and score-based. Constraint-based methods learn an undirected network skeleton using conditional independence testing and then assign the direction of edges between nodes that are not found to be independent. Score-based methods instead aim to optimise a scoring criterion across a search space of DAGs.
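Any search over candidate structures has to respect the acyclicity constraint described above. A minimal sketch of such a check is given below, using Kahn's topological-sort algorithm; the function name and the four-node example graph are hypothetical and chosen only for illustration.

```python
from collections import deque

def is_dag(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Repeatedly remove nodes that have no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # If a cycle exists, its nodes never reach in-degree zero.
    return visited == len(nodes)

nodes = ["X1", "X2", "X3", "X4"]
candidate = [("X1", "X3"), ("X2", "X3"), ("X3", "X4")]
with_loop = candidate + [("X4", "X1")]  # closes the loop X1 -> X3 -> X4 -> X1

candidate_ok = is_dag(nodes, candidate)
with_loop_ok = is_dag(nodes, with_loop)
```

A score-based search would typically run a check of this kind before scoring each proposed edge addition or reversal, rejecting moves that introduce a cycle.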
Bayesian networks were one of the first approaches proposed to investigate gene expression networks (5). Due to the high computational cost, most studies have been limited to inferring causal relationships within triplets of a gene regulatory network (33), with few approaches that scale to larger, more complete molecular networks.

Much of the literature using BNs to infer molecular networks has limited the size of the networks built. Mäkinen et al. (34) used BNs to investigate coronary artery disease, introducing genetic information as priors by not allowing genes that have no associated SNPs to be parents of genes that have an associated SNP. However, this was only done on a subset of genes rather than a full network. Azad and Alyami (35) used Bayesian methods to identify cancer biomarkers in acquired lapatinib resistance. Similarly, Bhattacharya and Das (36) used BNs to investigate causal genes in drug pathways for cancer, with a limited set of genes identified using machine learning and known drug target genes. Using a small dataset, they identified gene-to-gene connections that play a role in imatinib resistance in chronic myeloid leukaemia, including an ACADVL-to-PDIA5 connection present in non-responder populations and not in those that respond to imatinib. These two proteins have previously been shown to play important roles in cancer drug resistance (37). BNs have also been used to identify causal effects of microRNA (miRNA) on gene expression (38). However, these networks are very limited, with causal edges only from miRNA to gene expression, and in many cases they failed to identify known gene-gene interactions from experiment-supported databases.

Identifying the optimal BN is very difficult, and many approaches have been proposed with the aim of improving this process within transcriptional networks (35). These improvements have generally been shown to be only moderate, and remain computationally intensive for generating large networks.
Large amounts of information could be missed if only a subset of data is used to build causal networks, which is generally the approach taken with BNs due to the high computational cost. It is possible to sacrifice accuracy for speed using approximate solutions (39); however, this is not guaranteed to make it possible to build networks from data as high dimensional as omics data. To overcome this problem, BNs are being used in combination with other approaches, including MR, to optimise construction of large-scale networks.

The PC algorithm (40) (named after its initial authors, Peter Spirtes and Clark Glymour) is used to estimate Bayesian networks, starting with a fully connected undirected graph and recursively deleting edges based on conditional independence properties. This generates a completed partially directed acyclic graph (CPDAG), which consists of both directed and undirected edges. The steps the PC algorithm takes to build causal networks are shown in Figure 2C. The PC algorithm is fast for high dimensional and sparse problems, which makes it better suited to molecular network data (41).

Zhang et al. (45) used the IDA (intervention calculus when the DAG is absent) approach to infer miRNA-mRNA pair interactions, and identified differences in causal effects between different conditions. They have also used IDA to infer causal effects of long non-coding RNA (lncRNA) on mRNA within modules identified using WGCNA, in order to identify lncRNAs involved in specific biological functions (46), an approach that has since also been used in pan-cancer studies (44). Despite being faster than alternatives, the PC algorithm is still slow when applied to high dimensional datasets, and so runtime will increase as more data is integrated (18). The PC algorithm has seen limited use on its own in applications to molecular networks. However, it has been used more recently in combination with other approaches to infer causality in biological data.
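The edge-deletion and v-structure steps of the PC algorithm can be sketched as follows. This is a simplified illustration under stated assumptions rather than a reference implementation: variables are assumed jointly Gaussian, so conditional independence is tested with partial correlations and a Fisher z-test; the input is the population correlation matrix implied by a hypothetical linear model X1 -> X3 <- X2, X3 -> X4 with unit-variance noise; the sample size of 1000 is assumed; and the remaining edge-orientation rules are omitted. Nodes are indexed 0 to 3 for X1 to X4.

```python
import math
from itertools import combinations

def partial_corr(corr, i, j, cond):
    """Partial correlation of i and j given `cond`, via the recursive formula."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def independent(corr, n, i, j, cond, alpha):
    """Fisher z-test; True when zero partial correlation is not rejected."""
    r = max(min(partial_corr(corr, i, j, cond), 0.999999), -0.999999)
    z = math.sqrt(n - len(cond) - 3) * abs(math.atanh(r))
    p_value = math.erfc(z / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha

def pc_skeleton(corr, n, alpha=0.01):
    """Start fully connected; delete edges using growing conditioning sets."""
    nodes = range(len(corr))
    adj = {i: set(nodes) - {i} for i in nodes}
    sepset = {}
    order = 0
    while any(len(adj[i]) - 1 >= order for i in nodes):
        for i in nodes:
            for j in sorted(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), order):
                    if independent(corr, n, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(cond)
                        break
        order += 1
    return adj, sepset

def v_structures(adj, sepset):
    """Triples (i, k, j) oriented i -> k <- j: i, j non-adjacent, k not in sepset."""
    found = []
    for k in adj:
        for i, j in combinations(sorted(adj[k]), 2):
            if j not in adj[i] and k not in sepset[frozenset((i, j))]:
                found.append((i, k, j))
    return found

# Population covariance of the hypothetical model described in the lead-in.
cov = [[1, 0, 1, 1],
       [0, 1, 1, 1],
       [1, 1, 3, 3],
       [1, 1, 3, 4]]
corr = [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(4)]
        for i in range(4)]

adj, sepset = pc_skeleton(corr, n=1000)
colliders = v_structures(adj, sepset)
```

In this example the X1-X2 edge is removed with an empty conditioning set, the X1-X4 and X2-X4 edges are removed given X3, and X3 is identified as a collider between X1 and X2. With real data the correlation matrix would be estimated from samples, so individual tests can err.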
Research is trending towards the use of ensemble approaches to building causal molecular networks, with the aim of reducing the limitations of individual approaches and building more robust networks. MR, in particular, has been combined with other methods to help resolve network topologies and to speed up construction of causal networks by putting constraints on edge directions.

Yazdani et al. (47) proposed an approach to identifying causal networks named genome granularity DAG (GDAG). Initially, strong IVs are generated from phenotype SNP data across each chromosome independently. The structure of the undirected network for omics data is identified, and the principle of MR is used to determine the directionality of edges using the strong IVs generated previously. They have used this approach to investigate the network of metabolites (48, 49).

Augmenting Bayesian networks with the principles of MR has become popular for building molecular networks (8). Wang et al. (50) have tried to address the computational limitations of BNs on large-scale transcriptome-wide networks using a tool they have named findr. They used the SNPs that are directly associated with gene expression, known as expression quantitative trait loci (eQTLs). For each gene, the most strongly associated eQTL is selected as the IV for inferring the pairwise causal relationships between all genes in the network. These edges are ranked and assembled into a DAG (51). This method is much more efficient than, and outperforms, traditional ways of building BNs, though it has rarely been applied in practice in the literature.

Badsha and Fu (52) have developed MRPC, which incorporates the principle of MR into the PC algorithm. The principle of MR is generalised to account for a variety of causal relationships between SNPs and molecular phenotypes. MRPC begins by learning the graph skeleton using the PC algorithm with an online false discovery rate correction, and any edges between SNPs and molecular phenotypes are oriented to point from SNPs to molecular phenotypes.
MRPC then looks for v-structures in the network between any three nodes and uses the principle of MR to help orient edges. Although MRPC has been shown to be very effective for building molecular networks, there is still room for further development. Within small to medium networks MRPC performs exceptionally well; however, for very high dimensional data, as is common with multi-omics data, it is still computationally expensive and could be further optimised.

A recent paper by Zuber et al. (53) proposed a multivariable MR and Bayesian model averaging (MR-BMA) approach that can include information from many IVs using only summary statistics from genetic association studies. It assumes that the true causal risk factors are sparse among all risk factors considered, which they demonstrate is usually the case with metabolomics data. Using MR-BMA, they identified high density lipoprotein (HDL) cholesterol as a potential causal risk factor for age-related macular degeneration, supported by previous literature (54). This approach has also been used to identify apolipoprotein B as a key lipid risk factor for coronary artery disease (55).

All the above methods using the principle of MR require that the three assumptions of MR are satisfied. As multi-omics data is large and complex, using MR to sidestep the problems of confounding and reverse causation is important for causal network inference.

Causal Graphical Analysis Using GEnetics (cGAUGE) has also been proposed by Amar et al. (56) to construct causal networks. cGAUGE first identifies conditional independencies in the data, which are used to identify IVs for downstream MR and for the construction of large-scale networks; the latter procedure is called ExSep. Initially, the skeleton is found using the PC algorithm. Edges between nodes are then oriented. If SNPs are marginally associated with a node X2, but are independent of X2 given another node X1, then this is used as evidence that X1 is causal of X2.
cGAUGE does not infer causal effect sizes, so there is considerable future potential in integrating ExSep with MR and with other approaches to inferring the skeleton.

Time series data provides the opportunity to investigate molecular networks across a biological process. Generating causal networks is made much more difficult by the problems inherent to this data type. In particular, the time between measurements may be inconsistent or may not reflect the rate of change being investigated, causal relations can change greatly over time, and unmeasured confounding variables may be introduced. As multi-omics data becomes easier to generate, there has been increased interest in using time series data to investigate molecular networks (57). The most common approach to identifying causality in time series molecular data is Granger causality. Other approaches (63) have been shown to outperform Granger causality and to handle large-scale networks. Applying these approaches to molecular networks would be an important step in progressing the analysis of time series causal molecular networks.

Networks of connected genes can quickly become very complex, which severely limits biological interpretation, even in simple co-expression networks (64). Nevertheless, even when interpreting simple networks it is important to distinguish between association and causality. Inappropriate use of causal language has been a particular problem in the biological sciences in the past (65).

Causal molecular networks are often high dimensional. Many studies (35, 36) have identified smaller subsets of genes of interest through previous knowledge of pathways or clustering of undirected networks before inferring causality; however, this can miss factors that are relevant within the causal network but are not within the cluster or not identified by traditional univariate analysis.
Alternatively, constructing a causal network and then clustering the nodes would identify functionally close sets of variables that are likely to be involved in similar biological processes. Few published papers have carried out clustering within causal molecular networks. As the size of these networks grows, clustering will become increasingly important for identifying biological processes and the important causal molecules within them.

Network visualisation is often one of the first steps once networks have been created. One of the advantages of network visualisation is the ability to better communicate the results to readers and colleagues who do not have a full understanding of how the results were generated. Appropriate visualisation therefore becomes crucial to reflect the results and get the most from the data. There are many tools that assist in visualising networks, including Cytoscape (69) and Gephi (70). These tools generally offer a large amount of customisability, particularly in automatically generating layouts. However, visualising and interpreting very large and complex networks can be difficult and is often overlooked in the literature. Selecting the most appropriate way of displaying a network depends heavily on the type of network being visualised, and so requires a large amount of input from someone who understands the data and how it has been analysed. In molecular networks with multi-omics data, layering the different omics types within the visualisation to show how they interact would give a much more structured view than any available predesigned layout. Some approaches, including Bayesian networks and MR, provide causal effect sizes, which can be visualised within networks by increasing the width of edges for larger effect sizes. This allows experts from other biological fields to interpret the interactions of molecular phenotypes and is more likely to lead to future research.
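The idea of drawing stronger effects with wider edges can be sketched in a few lines. The function below is a hypothetical helper, not part of Cytoscape or Gephi: it maps absolute effect sizes linearly onto a drawing width and optionally drops edges below a chosen cut-off. The node names, effect sizes, width range and threshold are all assumed values.

```python
def edge_widths(effects, min_width=1.0, max_width=6.0, threshold=0.0):
    """Map |effect size| linearly onto a width, dropping edges below `threshold`."""
    kept = {edge: abs(value) for edge, value in effects.items()
            if abs(value) >= threshold}
    if not kept:
        return {}
    largest = max(kept.values())
    if largest == 0:
        # All retained effects are zero: draw every edge at the minimum width.
        return {edge: min_width for edge in kept}
    span = max_width - min_width
    return {edge: min_width + span * value / largest
            for edge, value in kept.items()}

# Hypothetical directed edges with estimated causal effect sizes.
effects = {
    ("SNP1", "geneA"): 0.5,
    ("geneA", "geneB"): -1.0,
    ("geneA", "geneC"): 0.1,
}
widths = edge_widths(effects, threshold=0.2)
```

The resulting dictionary could then be attached as an edge attribute in whichever visualisation tool is used. The sign of each effect is discarded here; in practice it might be encoded separately, for example by colour.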
There is potential for creating interactive networks where nodes and edges can be included or excluded by adjusting a causal effect size threshold. One of the aims of causal inference is the identification of a small number of targets for therapeutic intervention, and so effective visualisation with easy interpretation can be used by other researchers to identify networks of particular interest to them.

Building causal molecular networks is becoming increasingly important in systems biology. Inferring causality from entirely observational data is much less time consuming and less expensive than traditional randomised trials or intervention experiments. Additionally, the availability of genetic and multi-omics data is increasing massively, making causal molecular network inference a very powerful approach.

Here, we have reviewed the available approaches to building causal molecular networks. Traditional small-scale MR approaches infer causality between an exposure and an outcome. This makes MR a powerful tool when combined with other approaches to build large-scale networks, but very limited when used on its own. Bayesian network methods, including the PC algorithm, are based on conditional independence properties and rarely scale well to large multi-omics networks. Additionally, many of the methods based on Bayesian networks output a Markov equivalence class, which can leave ambiguity between directed and undirected relationships.

Ensemble approaches to inferring causal networks have attracted increasing attention as they bring together the advantages of individual approaches, e.g. augmenting Bayesian networks with the principle of MR, as in MRPC (52) and findr (50). This has allowed networks to be scaled to a much larger size; however, the computational cost is still very high. These approaches have not yet been widely applied in the literature and there is still much to improve.
Reducing the impact of unmeasured confounders and horizontal pleiotropy is important in any complex causal inference, and is why MR plays an important role in these approaches. These issues are being addressed with modern MR methods such as MR-Egger (19), CAUSE (21) and Bayesian MR, and integrating these approaches into ensemble methods should be a focus in the future.

Selecting IVs is also a challenge for large-scale causal networks. Linkage disequilibrium and pleiotropic effects can violate IV assumptions. Selecting strong IVs would potentially reduce data size, thus reducing computation time, and reduce bias. However, there is a trade-off, as including only strong IVs that explain only a small proportion of variation in the exposures may reduce the precision of the estimates. Therefore, the future challenge is to effectively identify and select valid IVs that satisfy the assumptions and are optimal for large causal molecular networks, which may prove especially difficult as it is not known whether strong IVs will exist for every phenotype.

1. Understanding biological functions through molecular networks
2. Cystic fibrosis
3. Huntington's disease
4. Applications of molecular networks in biomedicine
5. Using Bayesian networks to analyze expression data
6. Causal Queries from Observational Data in Biological Systems via Bayesian Networks: An Empirical Study in Small Networks
7. Uncovering genetic mechanisms of hypertension through multi-omic analysis of the kidney
8. Mendelian randomization and causal networks for systematic analysis of omics
9. Review of causal discovery methods based on graphical models
10. Protein-Protein Interaction Detection: Methods and Analysis
11. The Structure of a Gene Co-Expression Network Reveals Biological Functions Underlying eQTLs
12. WGCNA: an R package for weighted correlation network analysis
13. Identification of Important Modules and Biomarkers in Breast Cancer Based on WGCNA
14. Genetic networks in Parkinson's and Alzheimer's disease
15. Gene Ontology: tool for the unification of biology
16. The Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine
17. Kyoto Encyclopedia of Genes and Genomes
18. A Fast PC Algorithm for High Dimensional Causal Discovery with Multi-Core PCs
19. Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression
20. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases
21. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics
22. Trans-ethnic kidney function association study reveals putative causal genes and effects on kidney-specific disease aetiologies
23. Commentary: two-sample Mendelian randomization: opportunities and challenges
24. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. arXiv
25. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework
26. Overlapping-sample Mendelian randomisation with multiple exposures: a Bayesian approach
27. A Bayesian approach to Mendelian randomization with multiple pleiotropic variants
28. Bayesian Mendelian randomization with study heterogeneity and data partitioning for large studies
29. A robust and efficient method for Mendelian randomization with hundreds of genetic variants
30. Causal inference for heritable phenotypic risk factors using heterogeneous genetic instruments
31. Coffee intake, cardiovascular disease and all cause mortality: Observational and mendelian randomization analyses in 95 000-223 000 individuals
32. Meta-analysis and Mendelian randomization: A review
33. Large-scale local causal inference of gene regulatory relationships
34. Integrative Genomics Reveals Novel Molecular Pathways and Gene Networks for Coronary Artery Disease
35. Discovering novel cancer bio-markers in acquired lapatinib resistance using Bayesian methods
36. Fast and robust method for drug response biomarker identification and sample stratification
37. Endoplasmic Reticulum Stress-Activated Transcription Factor ATF6 Requires the Disulfide Isomerase PDIA5 To Modulate Chemoresistance
38. Co-expression network analysis identified six hub genes in association with progression and prognosis in human clear cell renal cell carcinoma (ccRCC)
39. Approximate learning of high dimensional Bayesian network structures via pruning of Candidate Parent Sets
40. Causation, Prediction, and Search. Second Edi
41. Predicting causal effects in large-scale systems from observational data
42. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information
43. Inferring microRNA-mRNA causal regulatory relationships from expression data
44. Biomarker Categorization in Transcriptomic Meta-Analysis by Concordant Patterns With Application to Pan-Cancer Studies
45. Inferring condition-specific miRNA activity from matched miRNA and mRNA expression data
46. Inferring and analyzing module-specific lncRNA-mRNA causal regulatory networks in human cancer
47. Generating a robust statistical causal structure over 13 cardiovascular disease risk factors using genomics data
48. Identification, analysis, and interpretation of a human serum metabolomics causal network in an observational study
49. Genome analysis and pleiotropy assessment using causal networks with loss of function mutation and metabolomics
50. High-Dimensional Bayesian Network Inference From Systems Genetics Data Using Genetic Node Ordering
51. Controlling false discoveries in Bayesian gene networks with lasso regression p-values. arXiv
52. Learning causal biological networks with the principle of Mendelian randomization
53. Selecting likely causal risk factors from high-throughput experiments using multivariable Mendelian randomization
54. Mendelian randomization implicates high-density lipoprotein cholesterol-associated mechanisms in etiology of age-related macular degeneration
55. High-throughput multivariable Mendelian randomization analysis prioritizes apolipoprotein B as key lipid risk factor for coronary artery disease
56. Graphical analysis for phenome-wide causal discovery in genotyped population-scale biobanks
57. A Boolean network inference from time-series gene expression data using a genetic algorithm
58. Investigating causal relations by econometric models and cross-spectral methods
59. Granger-causal testing for irregularly sampled time series with application to nitrogen signalling in Arabidopsis
60. Learning Causality: Synthesis of Large-Scale Causal Networks from High-Dimensional Time Series Data
61. Causal network reconstruction from time series: From theoretical assumptions to practical estimation
62. Causal Network Inference by Optimal Causation Entropy
63. Detecting and quantifying causal associations in large nonlinear time series datasets
64. Learning from co-expression networks: Possibilities and challenges
65. Misrepresentation and distortion of research in biomedical literature
66. Causal network models of SARS-CoV-2 expression and aging to identify candidates for drug repurposing
67. CaNDis: a web server for investigation of causal relationships between diseases, drugs and drug targets
68. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules
69. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks

This work was jointly supported by the British Heart Foundation and The Alan Turing Institute (which receives core funding under the EPSRC grant EP/N510129/1) as part of the Cardiovascular Data Science Awards (Round 2, SP/19/10/34813). No ethical approval was needed.
The authors declare that they have no conflicts of interest.