key: cord-0303676-1mdim0r1
authors: Cummins, Breschine; Motta, Francis C.; Moseley, Robert C.; Deckard, Anastasia; Campione, Sophia; Gedeon, Tomáš; Mischaikow, Konstantin; Haase, Steven B.
title: Experimental Guidance for Discovering Genetic Networks through Iterative Hypothesis Reduction on Time Series
date: 2022-04-30
journal: bioRxiv
DOI: 10.1101/2022.04.28.489981
sha: 7abf669cb50e77668099c7288e2bdbf45701e813
doc_id: 303676
cord_uid: 1mdim0r1

Large programs of dynamic gene expression, like cell cyles and circadian rhythms, are controlled by a relatively small “core” network of transcription factors and post-translational modifiers, working in concerted mutual regulation. Recent work suggests that system-independent, quantitative features of the dynamics of gene expression can be used to identify core regulators. We introduce an approach of iterative network hypothesis reduction from time-series data in which increasingly complex features of the dynamic expression of individual, pairs, and entire collections of genes are used to infer functional network models that can produce the observed transcriptional program. The culmination of our work is a computational pipeline, Iterative Network Hypothesis Reduction from Temporal Dynamics (Inherent Dynamics Pipeline), that provides a priority listing of targets for genetic perturbation to experimentally infer network structure. We demonstrate the capability of this integrated computational pipeline on synthetic and yeast cell-cycle data. Author Summary In this work we discuss a method for identifying promising experimental targets for genetic network inference by leveraging different features of time series gene expression data along a chained set of previously published software tools. We aim to locate small networks that control oscillations in the genome-wide expression profile in biological functions such as the circadian rhythm and the cell cycle. We infer the most promising targets for further experimentation, emphasizing that modeling and experimentation are an∗Corresponding author: breschine.cummins@montana.edu essential feedback loop for confident predictions of core network structure. Our major offering is the reduction of experimental time and expense by providing targeted guidance from computational methods for the inference of oscillating core networks, particularly in novel organisms.

Pipeline is shown in Fig 1. The focus is on identifying key regulatory components of the GRN that form a The network finding step accepts a ranked list of gene interactions that are ideally enriched by regulatory 153 connections critical to the molecular process under consideration. Although DL×JTK and LEM have a strong 154 tendency to highly rank ground truth nodes [25] and edges [11] respectively, false positives and false negatives 155 do exist within the lists of top-ranked nodes and edges. Furthermore, even when both tools work perfectly, 156 there is no guarantee that the top pairwise LEM interactions will produce a network of complex interactions

Step Term Definition Node finding DL×JTK node ranking ranking of genes according to the most periodic gene expression

Edge finding local edge ranking ranking of activating and repressing interactions according to LEM simulation top-ranked LEM edges the top N edges in the local edge ranking, a user choice local node participation score (for gene g) the median rank of all edges in the top-ranked LEM edges that involve g local node ranking rank ordering of genes according to their local node participation score Network finding oscillation score (for a network) the proportion of network behavior that permits a stable oscillation according to DSGRN pattern match score (for a network) the proportion of stable oscillations that exhibit a DSGRN pattern match top-ranked DSGRN networks networks with the desired oscillation and pattern match scores edge prevalence score (for edge g → g )

the proportion of top-ranked DSGRN networks that include g → g global edge ranking rank ordering of edges according to their edge prevalence score global node participation score (for gene g) the median rank of all edges in the global edge ranking that involve g global node ranking rank ordering of genes according to their global node participation score Table: Key terminology for scoring and ranking nodes, edges, and networks in the Inherent Dynamics Pipeline in order of computation. 186 We studied the performance of the Inherent Dynamics Pipeline on a synthetic, strongly connected regulatory 187 network with nodes A, B, C, D, E, and F called the ground truth network shown in Fig 2 (a) . A strongly 188 connected network is one in which there exists a path connecting each node to every other node, and thus 189 there is at least one feedback loop between each pair of nodes in the network. This ground truth network was 190 designed to achieve high oscillation and pattern match scores (Table 2 .1) to mimic robust clock-like behavior. in Methods Section 4.4, which represents a node that does not participate in the simulated network. This false 195 node G provides a negative control of the algorithm in the sense of being a "true negative." Because all of the 196 nodes in the synthetic networks are strongly connected, the node finding step was not needed for the synthetic 197 data and the Inherent Dynamics Pipeline was run on the edge and network finding steps only. 198 We ran the Inherent Dynamics Pipeline beginning with the edge finding step for each of the three synthetically- edges for each pair of target/source nodes taken from A-G. As seen previously, the top of the local edge ranking 204 is enriched with true positives (Appendix Tables A.3-A. can identify true positive edges that do not strongly influence the global dynamics of the observed oscillations.

In the analysis of the data shown in Fig 2 (d) , the edge C repressed by D never appears in the top-ranked LEM 254 edges due to a low LEM likelihood score, unlike the results for (b) and (c) in the same figure. Moreover, the 255 edge D repressed by E has a zero edge prevalence score; i.e., it participates in no top-ranked DSGRN networks.

This is circumstantial evidence that the node D might play a less important role in the dataset Fig 2 (d) . We (e) that is formed by removing the node D. 259 We simulated datasets at the same parameters as in can operate as the true core oscillator of the synthetic network under some conditions. This is an example of 266 a case where a strongly connected network is not necessarily a core oscillator and illustrates the crucial role 267 parameters can play in determining the true core. Each synthetic dataset was run through the edge and network finding steps five times. The mean (blue dots) and standard deviation (blue bars) of the local node participation scores across the five runs is plotted against the mean and standard deviation of the global node participation scores. The node participation score for each simulation is computed only over the edges in the intersection of the top-ranked LEM edges of all five simulations. This excludes edges that do not have sufficiently high local edge ranks in all five simulations. Nodes located above the red diagonal line indicate an improved global ranking in their node participation score versus their local node rankings. Notice that the node G is noticeably downranked in all three panels.

Global dynamic behaviors can improve core variable inferences. Our primary goal is to use the based on node participation in networks that robustly support the observed dynamic behavior.

The node participation scores for the local versus the global node rankings are seen in 

In terms of experimental prioritization, nodes at the lower left of the diagonal should be viewed as having 286 the greatest confidence in their participation in a core oscillator, since they rank highly at both the local and 287 the global level. Next to be prioritized are those highest above the diagonal; i.e. those that are upranked 288 consistently in the global node ranking. We suggest that downranked nodes be disregarded. The downranked 289 nodes may contain false negatives; however, we observe that the true negative G is correctly identified and that 290 the area above the diagonal is enriched with true positives, making it a more promising area of investigation. interactions. We will refer to these as substantiated nodes and edges. All other nodes and edges will be 308 referred to as unsubstantiated.

In the yeast cell cycle, and even more so for non-model organisms, the collection of core oscillator genes is 310 uncertain, and many non-core genes exhibit oscillatory transcriptional dynamics. To assess the performance 311 of the Inherent Dynamics Pipeline, we included two "true negative" or unsubstantiated gene products, RIF1 312 and EDS1, that are highly oscillatory according to DL×JTK [25], but do not participate in any regulatory 313 interaction with substantiated nodes according to YEASTRACT.

Prior biological knowledge, e.g. the identity of a core regulator, or the functional activity of a regulator as 315 only a repressor or only an activator, could be used in principle to make a priori hypothesis reductions. The

Inherent Dynamics Pipeline incorporates this information using gene annotations that record whether a given 317 gene product acts as an activator, a repressor, or only as a target. The least constraining choice is to allow 318 a gene product to take any of these roles. If a gene is marked as not a target, then its corresponding node 319 in a regulatory network will have no in-edges. Likewise, if a gene product may be neither an activator nor a 320 repressor, then it will have no out-edges. The most interesting case is when a gene is both a regulator and a 321 target, but is allowed to be only an activator or only a repressor. This allows the gene to be evaluated as a 322 potential member of the core oscillator, but restricts the type of interactions that LEM will model. We call 323 such a restriction a nontrivial annotation; see the caption of Table 4 .5 for the nontrivial annotations of the 324 substantiated nodes.

Has only substantiated nodes the synthetic data case study, we see that LEM exhibits enrichment of its top ranks with substantiated edges 333 and that generally more than half of the sampled networks are consistent with the data (Appendix Table B .2 334 first column).

The scoring criteria that we use to assess top network performance is different than for the synthetic network, behavior. We also require very robust pattern matching via a pattern match score of 100% and a requirement 342 that both replicates must exhibit pattern matches. We emphasize that this is a user-defined choice that is based 343 on a biological phenotype. 344 Figure 6 : Local versus Global Node Participation Scores for the Yeast Cell Cycle Network. Mean ± standard deviation node participation scores for the scenario

. Each scenario was run through the edge and network finding steps five times. The mean (blue dots) and standard deviation (blue bars) of the local node participation scores across the five runs is plotted against the mean and standard deviation of the global node participation scores. The node participation score for each simulation is computed only over the edges in the intersection of the top-ranked LEM edges of all five simulations. This excludes edges that do not have sufficiently high local edge ranks in all five simulations. Nodes located above the red diagonal line indicate an improved global ranking in their node participation score versus their local node ranking.

Global dynamic behaviors most improve local inference when the least prior information is avail- scenarios, usually strongly, and sometimes is also poorly ranked in the local node ranking. We remark that CLN3 372 is the only substantiated node that is not a transcription factor. CLN3 is a cyclin that activates cyclin-dependent 373 kinase (CDK), which in turn regulates the activity or stability of other proteins via phosphorylation.

There are two potential explanations for the strong down ranking of CLN3. When inferring GRNs from data, the space of potential core nodes, core interactions, and core networks is too 382 large to exhaustively explore, even computationally, much less through experimentation. We demonstrate that 383 high-throughput experimental data can be leveraged by using the software tool Inherent Dynamics Pipeline [24] 384 to iteratively reduce these spaces and provide experimental guidance. We show the efficacy of this method on a 385 synthetic network designed to exhibit robust oscillations and on yeast cell cycle data that displays controllable 386 oscillations and a large body of experimental evidence for a particular network topology.

The Inherent Dynamics Pipeline network discovery tool consists of a node finding step implemented with in the core oscillator (Figure 4 and Figure 6 ).

For the synthetic data, top-ranked networks were required to have an oscillation score of 100% to represent 400 robustness, while for the yeast data, the oscillation score was limited to the range 10-40% to account for 401 phenotypic plasticity of the cell cycle network. The fact that in addition to oscillatory behavior the network 402 exhibits steady state behavior in the conditions that trigger one of its checkpoints highlights a difficulty in 403 optimization of any kind for discovering networks. The mixture of cell plasticity in phenotype versus robust 404 expression of a behavior is unknown, and therefore the classification of networks as "top performers" is uncertain.

Moreover, evolution dictates that a cell only requires a sufficiently good solution, not the best solution, that is 406 achievable under unknown developmental constraints. These constraints impact the collection of networks over 407 which evolutionary optimization can occur, which will be highly limited with respect to all of network space.

These uncertainties speak to the necessity of incorporating as much biological knowledge as possible in addition 409 to time series data in order to increase the chances of discovering the true molecular interaction network.

The fact that the greatest gains in edge ranking by the Inherent Dynamics Pipeline come in situations 411 where annotation information is the most sparse ( Figure 5) parameters for these ODE models will simply be referred to as "parameters." The parameter space for a 441 switching system ODE is decomposed by DSGRN into a finite number of regions [42] and DSGRN computations 442 are performed over these regions rather than over individual real values. Each such region is called a DSGRN 443 parameter in previous publications, but we will use the term "DSGRN parameter region" in this work for clarity.

Lastly, there are user choices for controlling the behavior of the numerical methods DL×JTK, LEM, and DSGRN 445 pattern matching in the Inherent Dynamics Pipeline. These will be referred to as "hyperparameters." (1)

First to each gene is associated its so-called "regulator score", which is taken to be the standard deviation of the Alan Hutchinson [47] was modified.

The Local Edge Machine (LEM) algorithm [11] adopts a Bayesian framework to perform inference of functional 485 gene regulation. Namely, a prior distribution on the space of single-edge regulatory models of a given gene G is where each gene product can achieve a (local) maximum or minimum expression level. A consequence is that 526 the potential for oscillatory behavior can be identified from the STG, as well as the stability of the oscillations 527 and the order of the maxima and minima of different gene product concentrations within the oscillations. 528 We propose that ordering the extrema in a time series dataset is an appropriate description of the observed 529 dynamical behavior, taking noise into account. This representation is coarse, but it qualitatively captures 530 certain characteristics, such as relative frequency and phase differences. Given any collection of gene products, 

The DL×JTK node ranking prioritizes gene time series for further analysis in the Inherent Dynamics Pipeline.

LEM ingests the top-ranked nodes and produces a ranked-ordered list of edges called the local edge ranking 555 that is used to initiate the network finding step. The procedure for the network finding step is to pick a seed 556 network composed of the top few LEM edges, and then to stochastically search in a neighborhood around 557 the seed network for strongly connected networks that are sampled using a larger portion of the local edge 558 ranking. We call edges permitted in the construction of networks the top-ranked LEM edges. The increased 559 permissivity allows the possibility of including network edges that may have been downranked by LEM due 560 to either stochastic computation or experimental noise. Self-repressing edges are currently removed from the 561 top-ranked LEM edges for technical reasons, but this functionality is expected to be added in the near future.

Using the top-ranked LEM edges, the network finding step produces a sample of candidate networks in 563 the neighborhood around the seed network. User-supplied scoring constraints are then employed to identify 564 a collection of top performing networks. These scoring constraints are based on the fact that the number of 565 DSGRN parameters is finite, which allows proportions of DSGRN parameter space with the desired dynamical 566 behavior to be computed. In this work, there are two numerical scores that we use to choose top regulatory 567 networks. The first is the proportion of DSGRN parameters that exhibit stable oscillations (oscillation score).

The second is the proportion of stably oscillating DSGRN parameters that exhibit a pattern match to at least one 569 dataset within the stable oscillation (pattern match score). When we have replicate experimental datasets, 570 as we do for the S. cerevisiae data, we also require at least one pattern matching success for each replicate.

Candidate networks that meet the chosen criteria are called the top-ranked DSGRN networks.

A rank-ordered list of edges called the global edge ranking is created by measuring the participation of 573 each edge in the top-ranked DSGRN networks. Every edge is assigned an edge prevalence score that is the 574 proportion of top-ranked DSGRN networks in which it appears, i.e., for the edge i → j, the edge prevalence 575 score P i→j is

where T is the number of top networks and N i→j is the number of top networks that have the edge i → j (or 577 i j for a repressing edge). P i→j is nonzero for any edge i → j that participates in at least one network that 578 can faithfully reproduce the observed data to the requested degree of robustness and accuracy.

The edge prevalence score defines the global edge ranking, which is a re-ranking of the top-ranked LEM 580 edges according to their ability to participate in complex networks with a desired phenotypic behavior. When 581 ties in the edge prevalence score exist, they are broken by local edge rank. Any edge with a zero prevalence 582 score is given the worst possible ranking: the number of top-ranked LEM edges.

In addition to ranking edges, we can also revise the DL×JTK node ranking for experimental prioritization.

The method is the same for either the local edge ranking from LEM or the global edge ranking from DSGRN. 585 We use the (local or global) node participation score, which is computed for each node g by collecting the 586 ranks of all edges either coming into g or emanating from g and taking the median of these ranks. The global 587 edge and node rankings together provide guidance to the experimentalist desiring to prioritize experiments. We constructed a collection of six-node network topologies that robustly exhibit oscillations across DSGRN 592 parameter regions, and then we chose the top ranked of these with nontrivial regulation at a node (see node C 593 in Fig 2 (a) ). We generated synthetic time series data from this network from three different DSGRN parameter Fig 2 (a) . The network's oscillation score is 100% and the pattern match scores for each time series 601 dataset ranged from 45.8% to 51.2%.

We then added a spurious time series, "node" G, to the dataset to evaluate the performance of the Inherent 603 Dynamics Pipeline with imperfect data. The time series for G was generated by

where t is the vector of time points used to simulate the synthetic data.

To generate the synthetic gene expression profiles from the network topologies given in Fig 2 (a) and (e), 606 we simulated systems of ODEs with Hill function nonlinearities as specified in Equations 4 and 5 respectively, 607 with parameters given in Table 4 .1.

608 Figure 7 : The true negative time series G for the synthetic network.

Note that the parameter Γ D,C is one or two orders of magnitude smaller for node C in the first parameter (Fig 2 (b) ). This may partially explain the observation that node D serves an Table 4 .4: The list of 9 substantiated and 2 unsubstantiated yeast cell cycle genes.

To choose "true positive" regulatory edges, we use YEASTRACT, a database that compiles experimental 663 evidence for regulatory interactions in the yeast genome [7] . We assume that regulatory edges that have 664 documented transcriptional evidence in YEASTRACT are "ground truth", or substantiated edges. When 665 using YEASTRACT, we specified that expression level data was required for a regulatory interaction; binding-666 only relationships were not used. We augment the substantiated edges list by the three interactions WHI5 667 repressed by CLN3, SWI4 repressed by WHI5, and SWI4 repressed by NRM1 from the cell cycle network model 668 in [33] . The full list of 24 substantiated edges is in Table 4 

The edge finding hyperparameters are given in Table 4 2.60 ± 0.80 0.00 ± 0.00 9.00 ± 0.00 Table of Results for Edge Finding. All numbers are means over five separate runs of the Inherent Dynamics Pipeline plus/minus one standard deviation. Column 1 contains the dataset that was analyzed. Column 2 reports the the total number of pairwise interactions modeled by LEM; in this case that is 7 2 interactions between 7 nodes multiplied by 2 for positive and negative regulation. Columns 3 and 4 report the number of true positive and false positive edges with a LEM probability score greater than the chosen threshold. This indicates the number of ground truth and non-ground truth edges in the seed network for the network finding step out of 10 total ground truth edges (see Fig 2 (a) ). Column 5 is the number of ground truth edges available for network sampling in the network finding step; i.e. the number of ground truth edges out of 10 that are in the top-ranked LEM edges, which has approximately 45 edges. It varies slightly according to the size of the seed network. The number of sampled networks that have at least one pattern match for at least one dataset out of 2000 sampled networks. Column 3: Top networks are those networks with an oscillation score of 100% and a pattern match score of 50% or above. Column 4: The number of true positive edges with a zero edge prevalence score, i.e., those that are false negatives. These are above and beyond those missing from local edge ranking, as indicated in the last column of 

Towards a proteome-scale map of the human protein-protein 704 interaction network

A proteome-scale map of the human interactome network

High-quality binary protein interaction map of the yeast interactome 710 network

Chip-seq: A method for global identification of regulatory elements in the 712 genome

A gene-coexpression network for global discovery of conserved 714 genetic modules

Yeastract: an upgraded database for the analysis of transcription regulatory 717 networks in saccharomyces cerevisiae

YEASTRACT+: 721 a portal for cross-species comparative genomics of transcription regulation in yeasts

A comprehensive survey of regulatory network 724 inference methods using single cell rna sequencing data

Estimation of gene regulatory networks from cancer transcriptomics data

Causal network inference by optimal causation entropy

The inferelator: an 734 algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo

Timedelay-aracne: Reverse engineering of gene networks from 737 time-course data by an information theoretic approach

Statistical inference of the time-varying structure 740 of gene-regulation networks

Addressing false discoveries in network 743 inference

Integrative random forest for gene regulatory network inference

Inferring regulatory networks from expression data 747 using tree-based methods

Hierarchical structure and the prediction of missing links in 749 networks

Missing and spurious interactions and the reconstruction of complex net-751 works

Semi-supervised network inference using simulated gene expression dynamics

Wisdom of crowds for robust gene network inference

Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the u.s

Evaluation of individual and ensemble probabilistic forecasts of covid-19 mortality in the us

Conservation of dynamic characteristics 801 of transcriptional regulatory elements in periodic biological processes

An efficient nonparametric algorithm for 804 detecting rhythmic components in genome-scale data sets

Comparison of computational 807 methods for the identification of cell cycle-regulated genes

Machine learning helps identify chrono as a circadian 811 clock component

Dsgrn software

Dsgrn software

Checkpoints couple 817 transcription network oscillator dynamics to cell-cycle progression

The cell-cycle transcriptional network generates and transmits a 821 pulse of transcription once each cell cycle

Reconciling conflicting models for 823 global control of cell-cycle transcription

Transcriptional regulatory 828 networks in saccharomyces cerevisiae

Serial regulation of transcriptional regulators in the yeast cell 832 cycle

Global control of cell-cycle transcription by coupled cdk and network oscillators

The forkhead transcription factor hcm1 regulates 839 chromosome segregation genes and fills the s-phase gap in the transcriptional circuitry of the cell cycle

The whi1+ gene of saccharomyces cerevisiae 842 tethers cell division to cell size and is a cyclin homolog

Daf1, a mutant gene affecting size control, pheromone arrest, and cell cycle kinetics of saccha-844 romyces cerevisiae

Dilution of the cell cycle inhibitor whi5 846 controls budding-yeast cell size

Extending combinatorial 849 regulatory network modeling to include activity control and decay modulation

Space for Switching Systems

Python tutorial

Inherent dynamics visualizer, an inter-856 active application for evaluating and visualizing outputs from a gene regulatory network inference pipeline

A new measure of rank correlation

An efficient, minimal-storage procedure for calculating the mann-whitney u, generalized 861 u and similar distributions

pyJTK: Python implementation of the JTK CYCLE statistical test

The possible effects of the aggregation of the molecules of haemoglobin on its dissociation curves

Asymptotic Approximations of Integrals

An algorithmic approach to chain recurrence

Lattice structures for attractors I

Lattice structures for attractors II

Lattice structures for attractors III

Identifying robust hysteresis in networks

Using extremal events to 880 characterize noisy time series

Model rejection and parameter reduction via time 883 series

 Fig 2 (b) . These are the edges present in the top-ranked LEM edges in all five computations. The notation A=act(B) should be read "A activated by B". Boxed global edge ranks denote ground truth edges. Notice that the ground truth edge D repressed by A was not a top-ranked LEM edge for at least one computation and is therefore not listed. All edges with a zero prevalence score are given the worst possible rank. The edges are sorted by median global edge ranking. Fig 2 (d) . These are the edges present in the top-ranked LEM edges in all five computations. The notation A=act(B) should be read "A activated by B". Boxed global edge ranks denote ground truth edges. Notice that the ground truth edge C repressed by D was not a top-ranked LEM edge for at least one computation and is therefore not listed. All edges with a zero edge prevalence score are given the worst possible rank. The edges are sorted by median global edge ranking. The number of sampled networks that have at least one pattern match for at least one dataset out of 4000 sampled networks. Column 3: Top networks are those networks with an oscillation score of 10-40% and a pattern match score of 100%. Column 4: The number of substantiated edges with a zero edge prevalence score, meaning they are probable false negatives. Column 5: The number of unsubstantiated edges with a zero edge prevalence score, i.e. probable true negatives. For example, in row 1, six substantiated edges are probable false negatives on average during the network finding step above and beyond the two substantiated edges lost from the top-ranking LEM list due to a low edge ranking (see the last column of Table B .1 in row 1 showing 22/24 edges in the top-ranked LEM edges). In addition, on average 16 unsubstantiated edges are identified as probable true negatives.