key: cord-0324186-wez4mrky authors: Lu, Jonathan; Dumitrascu, Bianca; McDowell, Ian C.; Jo, Brian; Barrera, Alejandro; Hong, Linda K.; Leichter, Sarah M.; Reddy, Timothy E.; Engelhardt, Barbara E. title: Causal Network Inference from Gene Transcriptional Time Series Response to Glucocorticoids date: 2019-03-24 journal: bioRxiv DOI: 10.1101/587170 sha: 6789b16cca59abc5761024dbd49b2e4040ff46f3 doc_id: 324186 cord_uid: wez4mrky Gene regulatory network inference is essential to uncover complex relationships among gene pathways and inform downstream experiments, ultimately paving the way for regulatory network re-engineering. Network inference from transcriptional time series data requires accurate, interpretable, and efficient determination of causal relationships among thousands of genes. Here, we develop Bootstrap Elastic net regression from Time Series (BETS), a statistical framework based on Granger causality for the recovery of a directed gene network from transcriptional time series data. BETS uses elastic net regression and stability selection from bootstrapped samples to infer causal relationships among genes. BETS is highly parallelized, enabling efficient analysis of large transcriptional data sets. We show competitive accuracy on a community benchmark, the DREAM4 100-gene network inference challenge, where BETS is one of the fastest among methods of similar performance but additionally infers whether the causal effects are activating or inhibitory. We apply BETS to transcriptional time series data of 2, 768 differentially-expressed genes from A549 cells exposed to glucocorticoids over a period of 12 hours. We identify a network of 2, 768 genes and 31, 945 directed edges (FDR ≤ 0.2). We validate inferred causal network edges using two external data sources: overexpression experiments on the same glucocorticoid system, and genetic variants associated with inferred edges in primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project. BETS is freely available as an open source software package at https://github.com/lujonathanh/BETS. Definition 3.1 (Granger causality) . For lag L, a gene g is said to Granger-cause another gene g if using X g t−1 , . . . , X g t−L , the expression value of g at times t − 1 to t − L, improves prediction of X g t , the expression 85 value of g at time t, beyond the prediction using X g t−1 , . . . , X g t−L alone. 86 To test for Granger causality from g to g, we first preprocessed the gene expression time series data 87 (STAR Methods). For every potential effect gene g, we fit all other genes g ∈ ¬g simultaneously (Equation 88 1), echoing ideas from the graphical lasso for undirected network inference [42] . Intuitively, this adapts the 89 idea of Granger causality to conditional Granger causality, where we consider how gene g Granger causes g 90 conditioning on the effects of all other genes. This approach uses the regression: where t ∼ N (0, 1). For BETS, we set L = 2. To test for an edge, if β g ,g = 0, then we say g conditionally 92 Granger-causes g at lag . We build the directed network by including a directed edge to g from every gene 93 g that has been inferred to conditionally Granger-cause g. Robustly building this network is difficult due to the high dimensionality of the problem: the number 95 of genes that could Granger-cause a given g far exceeds the available time points and technical replicates. 96 To address this challenge, BETS regularizes the VAR model parameters using an elastic net penalty (STAR 97 Methods, Figure 1A ). Elastic net regression encourages sparsity and performs automatic variable selection 98 on the genes being tested for causal influence [43] . The elastic net penalty, unlike the lasso penalty [44] , is 99 able to select groups of correlated variables and allows the number of selected variables to be greater than 100 the number of samples. This is particularly important for gene expression assays where gene expression levels 101 are often well-correlated and there are far more genes than samples. In BETS, we fit the same VAR model to a data set in which causal genes have their expression permuted Figure S1 for an overview of network inference methods. 3.2. Leading Performance on DREAM Network Inference Challenge. 114 We evaluated BETS against other directed network inference methods. We used the DREAM4 Network 115 Inference Challenge [41], a community benchmark for directed network inference using gene time series data. 116 This benchmark consisted of five data sets, each with ten time series measurements for 100 genes across 21 117 time points [41] . Evaluation was previously done by looking at the average of the area under the precision 118 recall curve (AUPR) or the area under the receiver operating characteristic (AUROC) over the five data sets 119 [33, 41] . Any method that provides a ranking of possible network edges could be evaluated in this framework. 120 We tested BETS and Enet against 20 other methods on the DREAM challenge [32, 33, 36, 45, 46] . We ran B A Figure 2 : Algorithm performance on the DREAM Community Benchmark. A) AUPR scores from 22 methods, averaged across the five DREAM networks. B) AUROC scores from 17 methods, averaged across the five DREAM networks. Arrows indicate our methods. Stars indicate methods that we ran in-house; results were consistent with reported results. The bars reach one standard deviation from the average as calculated across the five DREAM networks; no bar indicates the standard deviation is not reported. See also Tables S1, S2, S3, S4, and S5. 3.3. Application to gene transcription response to glucocorticoids. 149 To infer the causal relationships in the GC response network, we analyzed RNA-seq data collected from 150 the human adenocarcinoma and lung model cell line, A549. This consisted of two data sets. In an original 151 exposure data set, cells were exposed to the synthetic GC dexamethasone (dex) for 0, 0.5, 1, 2, 3, 4, 5, 6, 152 7, 8, 10, and 12 hours [6] . In an unperturbed data set, the cells were first exposed to dex for 12 hours, after 153 which the media was replaced and dex removed, and then measurements were taken at the same intervals 154 0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, and 12 hours. BETS was fit jointly over the two data sets. In total there 155 were 7 technical replicates (4 from original exposure and 3 from unperturbed ). A single VAR was fit on 70 156 samples: Each of the 7 replicates had 10 samples, because using a lag 2 VAR model turns 12 time points 157 into 10 samples. 158 We applied BETS to the GC-mediated expression responses to infer a causal network ( Figure 3A ). Edges 159 with selection frequency (frequency of appearance among bootstrap networks) at least 0.097 were declared 160 significant (FDR ≤ 0.2; Figure 3B ). The network contained 2, 768 nodes representing distinct genes and 161 31, 945 directed edges (0.4% of possible edges). Of these, 466 genes were causes (had an outward directed 162 edge) and all 2, 768 genes were effects (had an incoming directed edge). The out-degree distribution was 163 heavy-tailed and skewed right ( Figure 3C ) while the in-degree distribution was lighter-tailed and more sym-164 metric ( Figure 3D ). The network's edge in-degree had a heavier left tail and lighter right tail than a normal 165 distribution ( Figure 3E ). This suggests that causal genes were relatively rare (only 1/6th of network genes 166 were causes) and a fifth of those only affected a single gene, whereas genes that were effects tended to have 167 multiple causes. The network was inferred efficiently due to parallelization across genes, taking six days in 168 real time and 292 days in CPU time to perform 5.5 million elastic net model fits. To study the network with respect to the glucocorticoid system, we annotated specific genes as transcrip- To study the interactions among gene classes inferred by our network, we quantified enrichment for edges In-Degree Quantiles Table S6 . Our network identified known biological interactions between genes with immune, metabolic, and TF Table S7 . We asked whether our inferred network edges validated on overexpression versions of the same experimen-198 tal system, in which each of ten TFs was separately overexpressed over the same 12 hours of observations. Specifically, we assessed the concordance between inferred network edges g → g and their coefficient in the 3.5. Validation of network edges through lung trans-eQTLs. 219 We validated our network edges on an expression quantitative trait-loci (eQTL) study. A single nucleotide 220 polymorphism (SNP) S is an eQTL for a gene g if it is associated with g 's expression level within a 221 population. Given a true causal edge g → g, if a SNP S is a local (cis-) eQTL for g , S might also be 222 a distal (trans-) eQTL for g [56] . We used gene expression levels in primary lung tissue (n = 278) from 223 the Genotype Tissue Expression (GTEx) project v6p [10]. We observed an enrichment of low trans-eQTL 224 association p-values from the directed network compared to shuffling the SNP labels ( Figure 6A -B). This 225 suggests our network captures more valid causal effects than expected by chance. 226 We next inspected specific associations and their corresponding edges. We found 341 trans-eQTL pairs in 227 lung samples corresponding with 130 network edges (q-value FDR ≤ 0.2). There are more trans-eQTLs than 228 edges because there are multiple cis-eQTLs for some causal genes g . The 341 trans-eQTLs greatly improved 229 upon the 2 identified in the GTEx v6 trans-eQTL study [57] , demonstrating the utility of transcriptional time 230 series for prioritizing promising associations. The top trans-associations were rs2302178-CLDN1 (q-value 231 FDR ≤ 0.095, extended from the cis-association rs2302178-HS3ST6 ), rs590429-ADAMTS (q-value FDR 232 ≤ 0.11, extended from the cis-association rs2302178-OLR1 ), and rs2072783-CLIP2 (q-value FDR ≤ 0.11, 233 extended from the cis-association rs2302178-GMPR) ( Figure 6C ). 234 We searched for validated associations between immune-related genes, metabolic-related genes, and TFs. Another association was between the TF SNAI2 and gene PTPN6, where we find that the known associ-241 ation between SNP rs56800165 and SNAI2 extends to an association between the same SNP rs56800165 and We described an approach, BETS, to build directed networks using short time series observations of high- Next, we applied BETS to time series RNA-seq data from human A549 cells exposed to glucocorticoids and 266 identified a directed network of 31, 945 edges (FDR ≤ 0.2), capturing the causal relationships among genes 267 after exposure to GCs. In our network, we found enrichment of immune genes and TFs among causal genes. 268 We also found enrichment of causal edges from TFs, immune genes, and metabolic genes. We validated our 269 network first in ten overexpression data sets, replicating positive edges with overexpression effects. Let X g t,r be the expression of gene g at time t for replicate r. Let X g t = [X g t,1 , X g t,2 , . . . , X g t,R ] T be the 323 R × 1 vector of gene expression levels of gene g across R replicates at time t. The rest of the paper does not 324 mention replicates for simplicity, but here we discuss replicates for completeness. Let g be the gene we are testing to be causal for gene g and let refer to the time lag of the causal edge g → g. Let L be the maximum lag. In BETS, L = 2. 327 We model each gene g as where t ∼ N (0, 1). In other words, the expression of each gene g is modelled as a linear function of its 329 and other genes' L previous expression values, under independent Gaussian noise. α g represents the (scalar) 330 effect size of gene g's th previous value, X g t− , on its current value, X g t . β g ,g represents the (scalar) effect 331 size of the th previous value of gene g = g, X g t− , on gene g's current value, X g t . Equation 2 requires that 332 t > for the th previous value, X g t− , to exist. To demonstrate how our model is fit in practice, we reformulate Equation 2 using matrix notation. Each Define X g t , an N × 1 vector, as: We can similarly write X g t− , which is X g t with each entry replaced by its th previous value. Define X g t− , 337 a N × L matrix consisting of the first L previous vectors X g t− , i.e., for ranging in {1, . . . , L}. Let α g be a L × 1 vector of the L lagged coefficients. Next, let us formulate Equation 2 involving the genes g in matrix notation. Let X ¬g t− be a N × L(|G| − 1) Let β .,g be a L(|G| − 1) × 1 vector of the causal coefficients β g ,g where g = g. We then fit the model: where t is a N × 1 vector with each element t,n ∼ N (0, 1). To write in the most compact form, we can write Note that X G t− is a N × L|G| matrix andβ g is a L|G| × 1 vector. Thus the final matrix formulation of 345 Equation 2 is: Elastic net penalty. Because of the large number of predictors as compared to the small number of samples, 347 we use the elastic net penalty, which is a generalization of both ridge and lasso penalties. The elastic net fits 348 the following objective: Here · 1 represents the 1 -norm and · 2 represents the 2 -norm. that the the optimal value selected in some cases was the maximum value of λ = 1. We thus expanded the 353 range to {10 −5 , . . . , 10 6 } to ensure that we were not missing better hyperparameters at larger values. At this 354 point, the optimal λ was found to be 100. Hyperparameter tuning. Hyperparameters were selected using leave-one-out cross-validation (LOOCV). The 356 hyperparameter (or pair of hyperparameters, for elastic net) that minimizes the mean-squared error on the 357 held-out datapoints is selected. More specifically, we first fix a hyperparameter (λ, a). Then, for a given gene 358 g and row index i, extract the i-the row of X g t and X G t− . Refer to this extracted validation set as (X g t ) i 359 (target) and (X g t− ) i (predictors). The remaining data is the training set, (X g t ) −i (target) and (X G t− ) −i 360 (predictors). First, letβ g (λ,a),i be theβ g ELASTIC NET that is fit from the training set. 362β g (λ,a),i = arg min We then compute prediction error on the validation set, (X g t ) i − (X G t− ) iβ g (λ,a),i 2 2 ). We repeat the fit 363β g (λ,a),i and error for every row index i of X g t and for every gene g. The mean held-out cross-validation error 364 for (λ, a) is: The (λ, a) that minimizes the error in Equation 13 is selected. Permuted coefficients. We evaluate the significance of any given edge g → g through permutation. In detail, we remove the time dependency between g and g via permutations of individual gene temporal profiles over 368 time. 369 We first generate a single permuted data setX g t . For each gene, we independently shuffle the temporal 370 profile of each gene g ∈ {1, . . . , |G|} across time ( Figure 1A ). This is done separately for distinct replicates. 371 We wish to model the hypothesis of no causal relations from any gene g ∈ ¬g, upon a given effect gene 372 g. We use the unpermuted values of the effect gene X g t and the permuted values of all other causal genes g ∈ ¬g, asX ¬g t . The effect gene g remains unpermuted, as we do not consider self-regulatory loops. Permutation-based causal coefficientsβ g ,g are then fit as We use these coefficients to perform FDR calibration. Edge FDR. The result of the elastic net VAR model is a complete network whose edges are weighted according 377 to the estimated regression coefficients. For each lag ∈ {1, . . . , L} and effect gene g, we control the edge FDR at ≤ 0.05 by finding the threshold 379 T g such that For each gene pair (g , g), g ∈ ¬g, a directed edge g → g exists if for at least one of the lags ∈ {1, . . . , L}, 381 |β g ,g | > T g . Stability selection. Stability selection is used to ensure the robustness of BETS to small sample size. Stability with replacement from X G t− , the predictors, and X g t , the target (Equation 10). For each bootstrap sample, we infer a network using BETS. Each edge g → g's selection frequency, π g ,g 389 (the frequency of g → g among the bootstrap networks) is computed. (Figure 1 B) . Stability FDR. To determine the appropriate cutoff for the selection frequency of each edge (π g ,g ), we 391 generate a null distribution of selection frequencies using permutations. First, we generate a second permuted 392 data set in which we again independently shuffle the temporal profile of each gene g ∈ {1, . . . , |G|} across 393 time. This is done separately for distinct replicates. We run the selection frequency procedure on this 394 permuted data set to get the null selection frequency of each edge,π g ,g . We control the stability FDR at 0.2 by finding the threshold T b such that Because the maximum lag is 2, each edge g → g has two possible lags and thus two selection frequencies. permuted and the real empirical distributions. So, if |β g ,g 1 | exceeds |β g ,g 2 |, the lag is said to be 1 and the 399 selection frequency π 1 g ,g is used. ulation. There was also an unperturbed data set of 3 replicates in which cells were exposed to dex for 12 424 hours, after which the conditioned media was replaced and dex removed. Gene expression was profiled at 425 {0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12} hours after dex removal. We integrated the original exposure and unperturbed 426 data into a joint data set with 7 replicates. 427 We selected 2, 768 genes for analysis, which had average expression > 2 TPM and were differentially No regularization was included, and ordinary least squares was used to fit the equation. The expression X g t−1 479 of a causal gene g is fit as a single predictor without the other expression. Lag 1, not 2, is used due to the A causal edge g → g is included if I (g , g) exceeds a threshold. MI methods have the advantage of being . A causal edge g → g is included in the network if β g ,g is significantly different from 0 for some . Although complex dynamics are often nonlinear, ODE methods typically assume linearity, as small sample X g t = f (X g t−1:t−L , X A causal edge g → g is included in the network when an importance score for g -typically, the reduction 533 in variance of g from including g as a predictor-exceeds some threshold. One limitation of DT methods is 534 that they only produce a ranking of edges, without specifying the sign of the relationship between the genes 535 [32, 33]. . Each gene's expression has a linear relationship with its parents: X g t = f (pa(X g t )) + t , f ∈ LINEAR. While DBNs have been shown to be effective on smaller data sets [90], they scale poorly due to the super- in the network based on its posterior probability of existence, i.e., the sum of the posterior probabilities of 553 those networks that contain the edge. Each gene's expression has a nonlinear relationship with its parents: 554 X g t = f (pa(X g t )) + t , f ∼ GP. By allowing nonlinear relationships between genes, GPs have proven highly effective. However, like DBN, 555 they perform a search over causal graphs, and therefore suffer from the same scalability issues [39, 40]. Figure 2 . Improvement on DREAM4 100-gene Network Inference from Bootstrap. For each AUROC or AUPR column, the average is the listed value and the standard deviation is listed in parentheses. "Coefficient" denotes the result when ranking edges by their fitted coefficient, as in the original method. "Bootstrap" denotes the results when ranking edges by the frequency by which they appear in the bootstrap networks. Studying and modelling dynamic biological processes using time-569 series gene expression data Bayesian factor Two optimal strategies for active learning of causal models from interventional 605 data Active learning of causal networks with intervention experiments and optimal 607 designs An introduction to Gaussian Bayesian networks DREAM3: network inference using dynamic 611 context likelihood of relatedness and the inferelator Experimental assessment of static and dynamic algorithms for gene regulation 613 inference from time series expression data Large-scale mapping and validation of Escherichia coli transcriptional regulation from 616 a compendium of expression profiles Information-theoretic inference of large transcriptional 618 regulatory networks 620 ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular 621 context TimeDelay-ARACNE: Reverse engineering of gene networks 623 from time-course data by an information theoretic approach Testing for causality Inference of gene regulatory networks and compound mode 626 of action from time course gene expression profiles The Inferelator: 628 an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo Induction of decision trees Classification and regression trees Combining tree-based and dynamical systems for the inference of 633 gene regulatory networks dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series 635 expression data Inferring dynamic genetic networks with low order independencies Using graphical models and 639 genomic expression data to statistically validate models of genetic regulatory networks Fast Bayesian inference for gene regulatory networks using 642 A Bayesian approach to reconstructing 644 genetic regulatory networks with hidden factors An empirical Bayesian method for estimating 646 biological networks from temporal microarray data Learning gene regulatory networks from gene expression measurements using 649 non-parametric molecular kinetics CSI: a nonparametric Bayesian 651 approach to network inference from multiple perturbed time series gene expression data Generating realistic in silico gene networks for 654 performance assessment of reverse engineering methods Sparse inverse covariance estimation with the graphical lasso Regularization and variable selection via the elastic net Regression shrinkage and selection via the lasso How to infer gene networks from expression profiles, revisited, Interface 663 Focus Inferring regulatory networks from expression data using 665 tree-based methods Gene Ontology: tool for the unification of biology Gene set enrichment analysis: A knowledge-671 based approach for interpreting genome-wide expression profiles AnimalTFDB: a comprehensive 676 animal transcription factor database SOCS-1 and SOCS-3 block insulin signaling 678 by ubiquitin-mediated degradation of IRS1 and IRS2 Suppressor of cytokine signaling-3 provides a novel interface in the cross-talk between angiotensin II 682 and insulin signaling systems Suppressor of cytokine signaling (SOCS) IL-4-activated insulin receptor substrate (IRS)-2 tyrosine phosphorylation in monocytes and 685 macrophages via the proteasome A novel pathway for vitamin A signaling mediated by RXR heterodimerization 687 with NGFI-B and NURR1 Orphan receptor 689 TR3 attenuates the p300-induced acetylation of retinoid X receptor-α Causality: Lecture Notes Distant regulatory effects of genetic variation in multiple human tissues PPARγ regulates adipocyte cholesterol metabolism 698 via oxidized LDL receptor 1 Upregulation of OLR1 and IL17A genes 700 and their association with blood glucose and lipid levels in femoropopliteal artery disease Oxidized LDL receptor 1 gene polymorphism in patients with metabolic syndrome, Euro-704 pean LOX-1 boosts immunity C-type lectin-like receptor LOX-1 promotes dendritic cell-mediated class-switched B 708 cell responses Inflammation-induced effector CD4+ T cell 711 interstitial migration is alpha-v integrin dependent Snai2 is a new target to mediate 713 glucocorticoid signaling on breast cancer cell migration The SHP-1 protein tyrosine phosphatase negatively modulates 717 glucose homeostasis Deficient SOCS3 and SHP-1 expression in psoriatic T 720 cells Macrophages of multiple sclerosis patients display deficient SHP-1 723 expression and enhanced inflammatory phenotype KEGG: new perspectives on genomes, 725 pathways, diseases and drugs A continuous tumor-cell line from a human 727 lung carcinoma with properties of type ii alveolar epithelial cells On the dependency of cellular protein levels on mRNA abundance Advantages and limitations of current network inference methods Revealing strengths and 734 weaknesses of methods for gene network inference Utility and limitations of using gene 737 expression data to identify functional associations Stability selection The relationship between Precision-Recall and ROC curves The DREAM4 In-silico Network 743 Challenge: Training data, gold standards, and supplementary information Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable 746 Analysis The BioGRID interaction database: 2017 update The STRING database in 2017: quality-controlled protein-protein association 752 networks, made broadly accessible Context-specific and differential 756 gene co-expression networks via Bayesian biclustering models Matrix eQTL: ultra fast eQTL analysis via large matrix operations Statistical significance for genomewide studies Application of Granger causality to gene regulatory network 763 discovery Causality and pathway search in microarray time series experi-765 ment Characterizing 767 Dynamic Changes in the Human Blood Transcriptional Network Discovering graphical Granger causality using the truncating lasso penalty Prior knowledge driven Granger causality analysis on gene regulatory network 772 discovery OKVAR-Boost: a novel boosting algorithm 774 to infer nonlinear dynamics and interactions in gene regulatory networks Granger causality vs. dynamic Bayesian network inference: a comparative study minet: AR/Bioconductor package for inferring large transcrip-779 tional networks using mutual information Endotoxin tolerance impairs IL-1 781 receptor-associated kinase (IRAK) 4 and TGF-β-activated kinase 1 activation, K63-linked polyubiq-782 uitination and assembly of IRAK1, TNF receptor-associated factor 6, and IκB kinase γ and increases 783 A20 expression Networks of bZIP protein-protein interactions 785 diversified over a billion years of evolution A comprehensive resource of interacting protein regions for refining 788 human transcription factor networks IGFBP3 promotes esophageal cancer growth by suppressing 791 oxidative stress in hypoxic tumor microenvironment Toward 794 an understanding of the protein interaction network of the human liver MDM2 mediates ubiquitination and degradation of 797 activating transcription factor 3 APC/CCdc20 targets E2F1 for degradation in prometaphase Conversion of Bcl-2 from protector to killer by interaction with nuclear orphan receptor Nur77/TR3 Nur77 upregulates HIF-α by inhibiting pVHL-mediated 804 degradation Using an in situ proximity ligation assay to systematically profile endogenous 807 protein-protein interactions in a pathway network Cytoplasmic localization of Tristetraprolin 809 involves 14-3-3-dependent and-independent mechanisms Proteomic analyses reveal distinct chromatin-associated and soluble transcription factor complexes Cyclin E2, a 814 novel G1 cyclin that binds Cdk2 and is aberrantly expressed in human cancers Cyclin E2: a novel CDK2 partner in the late G1 and S phases of the mammalian cell cycle Cyclin E2, a novel human G1 cyclin and activating partner of CDK2 820 and CDK3, is induced by viral oncoproteins Self-assembling protein microarrays