key: cord-0042974-8z6us7v4 authors: Allen, Edward E.; Farrell, John; Harkey, Alexandria F.; John, David J.; Muday, Gloria; Norris, James L.; Wu, Bo title: Time Series Adjustment Enhancement of Hierarchical Modeling of Arabidopsis Thaliana Gene Interactions date: 2020-02-01 journal: Algorithms for Computational Biology DOI: 10.1007/978-3-030-42266-0_11 sha: ac1d1a76a5c769b1b6a526983606a533f90c803d doc_id: 42974 cord_uid: 8z6us7v4 Network models of gene interactions, using time course gene transcript abundance data, are computationally created using a genetic algorithm designed to incorporate hierarchical Bayesian methods with time series adjustments. The posterior probabilities of interaction between pairs of genes are based on likelihoods of directed acyclic graphs. This algorithm is applied to transcript abundance data collected from Arabidopsis thaliana genes. This study extends the underlying statistical and mathematical theory of the Norris-Patton likelihood by including time series adjustments. Cell signaling is accomplished via networks of transcriptional changes that lead to synthesis of distinct sets of proteins, which cause changes in growth, development, or metabolism. Treatments that elevate levels of hormones result in cascades of changes in gene expression driven by activation and synthesis of transcription factors which are required to turn on downstream genes. One approach to model these gene regulatory networks is to collect measurements of changes in abundance of gene transcripts across a time course. The expression of a gene encoding a transcriptional activator or repressor protein may signal to the next gene to either turn on or turn off downstream genes and their encoded proteins. Thus, time course transcriptomic data sets contain important information about how genes drive these changes in biological networks. Yet genome-wide transcript abundance assays examine tens of thousands of genes so identification of patterns or networks within these large data sets is difficult. It is also critical to filter the meaningful transcript changes in these data sets to remove genes whose responses are not above background or that are dissimilar due to biological or technical variation. Yet even though the bioinformatics community has developed statistical methods to filter the data [9] , additional approaches are needed to identify the networks and patterns in these large data sets. An important modern approach to statistical modeling includes Bayesian techniques involving likelihoods and posterior probabilities. Here, we extend our previous work on this problem by incorporating time series adjustments in the computation of Bayesian likelihoods. We apply this method to time course data generated in response to treatments that elevate the levels of the hormone ethylene in Arabidopsis thaliana. We take advantage of a previously published genome-wide transcriptional data set [9] , subjected to rigorous filtering and from which all the genes predicted to encode transcription factors have been identified. The goal is to predict gene regulatory networks that control time-matched developmental changes. The results in this paper are novel for several reasons. First, the methods use the hierarchical nature of the data sets. For example, replicate data are not averaged. Rather, the method constructs a model over all of the data that uses each replicate as a source of information. The assumption is that at each level of the hierarchy there are commonalities in the data and parameters. Thus, the replicate data is not independent. Second, the addition of time series adjustment to improve the independence of the model's residuals gives these techniques stronger statistical foundations. Third, the combination of Bayesian model averaging with a cutting edge genetic algorithm provides rigorous estimates of posterior probabilities for edges. These computational modeling algorithms are derived using rigorous mathematical and statistical techniques and are computationally efficient. The models produced are easily understandable. Many different techniques for modeling non-hierarchical data using gene expression data have been proposed. An excellent recent survey on this subject was given by Emily [4] . There are many techniques for modeling gene and protein networks-with various different properties-available in the literature. Our technique in this paper is a Bayesian regression type method. Variations of Bayesian modeling can be found in [7, 11, 19] . Other methods that use types of regression include [2, 21] which focus on logistic regression techniques, and [22, 23] which use Poisson regression. Other approaches to modeling these types of problems include differential equations [1] and Boolean modeling [14] . This Bayesian likelihood computational algorithm incorporates additional important features from earlier versions. Earlier variations included computing posterior probabilities for a single replicate [11] and for multiple replicates with both hierarchical [18] and independent [17] structures. Over the course of this research, the search procedure has changed from Metropolis Hastings to genetic algorithms. Genetic algorithms' execution times are typically polynomial rather than the doubly exponential execution time, in terms of the numbers of time points and genes, of Metropolis Hastings. This variation also uses a Bayesian version of the Cross generational elitist selection, Heterogeneous recombination, Cataclysmic mutation algorithm (CHC) [6] . Genetic algorithms are motivated by the operators of selection, crossover, and mutation. The CHC variation does not allow the crossover of similar parents. Once the population becomes too homogeneous, then a cataclysmic mutation event regenerates the population from the current most fit parents. The Bayesian CHC (BCHC) implemented in this paper uses a hierarchical statistical construct (the Norris-Patton Likelihood) as the fitness function. The hormone ethylene (ACC) is known to activate root growth in Arabidopsis thaliana [9] . Transcription factors (TFs) are cellular proteins that bind to DNA to turn genes either on (activation) or off (repression). Developmental changes are controlled by these genes. The data set used in this modeling process was the complete set of abundance levels of the twenty-six TFs believed/known to be involved in the activation of the growth of roots at eight time points after treatment with the ethylene precursor ACC [9] . Here, constructing an appropriate network model has potential agricultural applications in that it should lead to more complex understandings of root development. Three network modeling paradigms are generally considered in the literature: cotemporal, next state one step and next state one and two steps. A next state one step model predicts the transcript abundance relationships between genes at time j based on the transcript abundance at time j − 1. In this paper, we will only consider next state one step models; for simplicity, we will refer to next state one step as next state. The time series adjusted (tsa) next state models are an amalgamation of next state modeling with standard time series adjustments [12] . The time series adjustment methodology makes the residuals (i.e., the estimated error terms) more independent. A directed graph G = (V, E) consists of a pair of collections: V a set of vertices (or nodes); and, E a collection of directed edges between pairs of vertices. A cycle is a sequence v 1 , e 1 Directed acyclic graphs (DAGs) do not contain cycles. An example of a DAG is given in Fig. 1 . In this modeling algorithm, DAGs form the mathematical foundation of our computational approach. The vertices of a DAG represent genes and the directed edges are one-way relationships between pairs of vertices. When there is a directed edge from For any DAG D with vertex set V = {v 1 , v 2 , · · · , v n }, the vertices can be topologically sorted. This gives a total order > on V such that if v i is an ancestor Conditional probability gives that for any two events A and B, the probability Similarly, the density function f for two continuous variables y 1 and y 2 is Recursively, using the order < implied by topologically sorting the DAGs on the set of continuous variable Specific for a particular DAG D, let y 1 be the gene that cannot have any parents. Let y 2 be the gene that can have at most parent y 1 . Similarly, let y h be the gene that can have parents from the collection {y 1 , · · · , y h−1 }. Therefore, if we let y i represent the data of child i for all of the r replicates, we have for D f (y 1 , y 2 , . . . , y n |D) =f (y 1 |D)f (y 2 |y 1 , D)f (y 3 |y 1 , y 2 , D) · · · f (y k |y 1 , · · · y n−1 , D) Statistical regression models of response (child) data from predictors (parents) data over time nearly always have correlated residuals over time. This is usually due to the remaining influence of the previous time's response data. In complicated modeling situations (e.g., like ours where we need to obtain closed form likelihoods of DAGs within a hierarchical structure in order to produce posterior probabilities of edges), it is common to derive results as if there were non-correlated residuals, as we have done in previous work. Our previous work has shown utility both for simulated and biological data, but we now rigorously incorporate a time series adjustment into our model. This should result in substantially less correlated residuals and thus more accurate likelihoods for the DAGs. Since these likelihoods are the foundations for the edges' estimated posterior probabilities, these estimates should also be improved. Our time series adjustment is an integer autoregressive adjustment of order 1 in the commonly used family of Markov conditioning. It is a version of Kedem's and Fokianos' autoregressive model [12, page 184 ]. In our setting, this simply adds the child's data at the previous time as an additional regressor for the child's data at the current time. Thus, much of the child's data at the previous time's influence would be regressed out leaving less correlated, closer to independent, residuals from one time to the next. For each h, with 1 ≤ h ≤ n, f (y h |y 1 , y 2 , . . . , y h−1 , D) gives the density of y h given y h 's parent's data for DAG D. Now, let i y c be the data vector of any given child c from the i th replicate. The vector i y c has dimension t, the number utilized time points in the child c data set for a given replicate i. The symbol i x c is the t × k c regressor matrix for i y c . For next state with time series adjustment, t is the number of time points per replicate minus one since at time 1, the child data has no last previous parent data nor last previous child (tsa) data-so, the utilized child data starts at time 2. The value of k c is the number of parents of c plus two since i x c has a separate column for each of its parent's data at the previous time, a column of 1's for the intercept, and a column of the child's data at the previous time (the time series adjustment). A k c dimensional slope vector for child c's regressors is i β c . The common within replicate residual variance of child c is σ 2 c . Assumptions which detail the hierarchical structure include that for a given The proof of Theorem 1 uses the following lemmas whose computation can be found in [16] (a thesis from our research group). We include the proof of Lemma 2 to show how the computation of the likelihood includes the slope parameters i β c of each of the replicates separately. Proof. Using integration, we have Letting |M | denote the determinant of the matrix M , we have the following: Extending Lemma 2 to the product of density functions used in Lemma 1, we have: Note that g, v 0 and σ 2 c are positive free parameters. In our modeling algorithm, we set g = v 0 = σ 2 c = 1. The use of the time series adjusted next state Norris-Patton likelihood, along with a tailor-made genetic algorithm and Bayesian model averaging, allows for the rigorous estimation of posterior probabilities for all gene pair interactions. if indicator < 0 then 18: P (t) cataclysm(P (t)) 19: indicator 50 20: end if 21: Archive Archive ∪ P (t) 22: end while 23: return Archive 24: end procedure Simply put, a genetic algorithm (GA) takes the current population and produces the next generation using the operations of selection, crossover, and mutation [15] . Individuals (i.e., DAGS) are automatically moved to the next generation with preference given to those with the higher likelihoods (the elitist strategy). The first population must be initialized. The genetic algorithm terminates after a specified number of iterations. The TBCHC genetic algorithm is an extension of BCH [13] which was heavily influenced by the CHC [5] . The TBCHC fitness function includes the next state time series adjustment. The TBCHC operators of selection, crossover, mutation, and repair will be discussed in the following paragraphs. The population of each generation consists of a fixed number of DAGs. Each DAG represents gene relationships. The genetic algorithm's aim is to move from the current population of DAGs to a new generation where the overall quality improves (as measured by the Norris-Patton likelihood). The elitist strategy only moves the top 10% of DAGs from the current generation to the next and the balance is filled by crossover. As TBCHC iterates, all distinct DAGs are archived. The final gene interaction model is produced from this archived collection. Generally, the selection operator chooses which members of the current population can potentially contribute children to the next generation. In Fig. 2 selection is accomplished through a random pairing of all parents in the current population (lines 8-10). By assuming prior probabilities for the DAG, the likelihood of a given DAG D is proportional to the D's NPL [3] . Thus, the fitness of a candidate D can be computed using the NPL. The crossover operator (line 12) exchanges genetic information (i.e., directed edges) between two parents producing two new offspring. The edges chosen to be exchanged are chosen randomly. There is one caveat: if the two parents are too similar-determined by the Hamming distance between them then the two selected parent DAGs are not allowed to produce offspring (line 11). In a simple genetic algorithm, all selected parents are allowed to produce offspring. This TBCHC prohibition of mating by similar parents may result in fewer DAGS in the next population than in the current population. Since the modeling process is based on DAGs, if the crossover operator introduces a cycle in the offspring, a repair operator is applied. Selection and crossover are used exclusively in TBCHC until the population becomes too similar. At that point, cataclysmic mutation (line 17) is applied to reset the population by creating a new population of DAGs from the top 10% NPL DAGs. There are no known techniques for assigning the optimum values to the genetic algorithm parameters. However, experience and the literature give general criterion for appropriate values. Still, values are often determined on a case by case basis. The TBCHC algorithm parameters include the following: 20 parallel executions each with 600 generations; the number of initial DAGS is 400; the crossover probability is 0.30; and, the number of parents of any given node is limited to 3. Cataclysmic mutation causes the population of DAGs to be replaced by DAGs generated by crossover and mutation on the top 10% of the population to restore the candidate class to 400. This TBCHC algorithm is implemented in python 3.0 using the NetworkX [8] and dispy packages [20] . It is important to realize that each directed edge in the model is labeled by a number in the interval [0, 1] indicating the posterior Bayesian probability that the associated relationship exists in the biological network. Using Bayesian statistics, , which simply and appropriately weights each visited DAG D according to its likelihood. This methodology requires equally likely priors since in such a situation the posterior for D is proportional its likelihood [3] . In order for this estimate to reflect its true value, it is necessary that AR contain a large and varied collection of DAGs of high likelihood. Using the transcript abundance data for 26 Arabidopsis thaliana genes stimulated by ACC, gene interaction models for a next state with and without time series adjustment were computationally created, shown in Fig. 3 . Each edge is labeled by its posterior probability. Figure 4 provides comparisons of three similar models to those given in Fig. 3 . Figure 4 (a) shows a stronger and tighter distribution of posterior probabilities than Fig. 4(b) . There is significant agreement across the models for average posterior probabilities exceeding 0.8 and less than 0.2. However, for average posterior probabilities with values greater than 0.2 and less than 0.8 there is a great deal of variance, which reflects the lack of a strong posterior probability over this range. A typical underlying assumption of statistical analysis is that the residuals are independent [3, page 737]. It is well understood, however, that the residuals associated with time course data are not usually independent. By incorporating time series adjustments into the modeling process, the residuals' independence is much improved; thus, yielding a less approximated, more accurate likelihood function. The continuation of this research includes four tasks. First, the computational networks have been sent to the Muday lab for biological investigation, confirmation and interpretation. Second, in this paper, we investigated the enhancement of times series adjustment on a next state one step model. There are two other time paradigms, next state one and two steps and cotemporal, each of which has a time series adjustment analogue and a corresponding Norris-Patton likelihood. Comparing and contrasting the computational results of these three distinct modeling methods-as well as their biological interpretations-are important in understanding the gene interaction models developed using this methodology. Third, we will further consider higher order autoregressive adjustment to continue improving the independence of the residuals. Fourth, effort is underway to implement nonuniform priors in the modeling techniques. This would permit construction of gene interaction models that reflect relationships found in the literature. Modeling gene regulation networks using ordinary differential equations Detecting gene-gene interactions that underlie human diseases Probability and Statistics, 4th edn A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination Evolutionary Computation 1 -Basic Algorithms and Operators Using Bayesian networks to analyze expression data Exploring network structure, dynamics, and function using NetworkX Identification of transcriptional and receptor networks that control root responses to ethylene Bayesian model averaging: a tutorial Continuous cotemporal probabilistic modeling of systems biology networks from sparse data Regression Models for Time Series Analysis A BCHC genetic algorithm model of cotemporal hierarchical Arabidopsis thaliana gene interactions Stochastic Boolean networks: an efficient approach to modeling gene regulatory networks An Introduction to Genetic Algorithms Bayesian interaction and associated networks from multiple replicates of sparse time-course data Bayesian probabilistic network modeling from multiple independent replicates Hierarchical Bayesian system network modeling of multiple related replicates Bayesian network analysis of signaling networks: a primer dispy: distributed and parallel computing with/for python PLINK: a toolset for whole-genome association and populationbased linkage analysis BOOST: a fast approach to detecting gene-gene interactions in disease data GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies Acknowledgments. The authors thank the National Science Foundation for their support with a grant, NSF#1716279. John Farrell thanks Wake Forest University for support as a Wake Forest Fellow for Summer 2019.