key: cord-0016330-5pbfq6hc
authors: Sun, Xiaoqiang; Zhang, Ji; Nie, Qing
title: Inferring latent temporal progression and regulatory networks from cross-sectional transcriptomic data of cancer samples
date: 2021-03-05
journal: PLoS Comput Biol
DOI: 10.1371/journal.pcbi.1008379
sha: 41f69fc9af1f6335a941c44839d8e69975940be2
doc_id: 16330
cord_uid: 5pbfq6hc

Unraveling molecular regulatory networks underlying disease progression is critically important for understanding disease mechanisms and identifying drug targets. The existing methods for inferring gene regulatory networks (GRNs) rely mainly on time-course gene expression data. However, most available omics data from cross-sectional studies of cancer patients often lack sufficient temporal information, leading to a key challenge for GRN inference. Through quantifying the latent progression using random walks-based manifold distance, we propose a latent-temporal progression-based Bayesian method, PROB, for inferring GRNs from the cross-sectional transcriptomic data of tumor samples. The robustness of PROB to the measurement variabilities in the data is mathematically proved and numerically verified. Performance evaluation on real data indicates that PROB outperforms other methods in both pseudotime inference and GRN inference. Applications to bladder cancer and breast cancer demonstrate that our method is effective to identify key regulators of cancer progression or drug targets. The identified ACSS1 is experimentally validated to promote epithelial-to-mesenchymal transition of bladder cancer cells, and the predicted FOXM1-targets interactions are verified and are predictive of relapse in breast cancer. Our study suggests new effective ways to clinical transcriptomic data modeling for characterizing cancer progression and facilitates the translation of regulatory network-based approaches into precision medicine.

information of the breast cancer patients used for network prediction were downloaded from the NCBI GEO database (GSE7390). The microarray and ChIP-seq data used for network validation were downloaded from the NCBI GEO database (GSE40766, GSE40762, GSE62425, GSE2222, GSE58626 and GSE27830). The clinical gene expression data used for survival analysis were downloaded from the NCBI GEO database (GSE2990, GSE12093, GSE5327, GSE1456, GSE2034, GSE3494, GSE6532 and GSE9195). The gene expression RNAseq and phenotype information associated with the TCGA COAD dataset were downloaded from the UCSC Xena website (https://xena.ucsc.edu/) via the following links: https://tcga-xena-hub.s3.us-east-1. amazonaws.com/latest/TCGA.COAD.sampleMap/ HiSeqV2.gz and https://tcga-xena-hub.s3.us-east-1.amazonaws.com/latest/TCGA.COAD.sampleMap/ COAD_clinicalMatrix; The gene expression RNAseq and phenotype information associated with the TCGA SKCM dataset were also downloaded from the UCSC Xena website (https://xena.ucsc.edu/) via the following links: https://tcga-xena-hub.s3.useast-1.amazonaws.com/latest/TCGA.SKCM. sampleMap/HiSeqV2.gz and https://tcga-xena-hub. s3.us-east-1.amazonaws.com/latest/TCGA.SKCM. sampleMap/SKCM_clinicalMatrix. A recent requantification of the LPS scRNA-seq dataset (GSE48968) was downloaded from the conquer database (http://imlspenticton.uzh.ch:3838/ conquer/). The code for PROB is available at https://github.com/SunXQlab/PROB. The numerical data underlying graphs in the manuscript is available at S1_Data.xlsx in the Supporting Information.

evolution and genetics studies [19] , for instance, phylogenetic trees based on microarray data [20] and genetic linkage maps based on genetic markers [21] . To this end, we propose that the latent temporal order of cancer progression status (i.e., latent-temporal progression) could be estimated from the cross-sectional data based on transcriptomic similarity between patient samples. Leveraging the latent-temporal ordering, we could represent the GRN as a nonlinear dynamical system. What's more, however, considering the technical variability or measurement error in the RNA-sequencing or microarray data (e.g., variations in sample preparation, sequencing depth and measurement noise and bias) [22, 23] , it's indispensably important to guarantee the robustness of the GRN inference.

In this study, we present PROB, a latent-temporal progression-based Bayesian method of GRN inference designed for population-scale transcriptomic data. To estimate the temporal order of cancer progression from the cross-sectional transcriptomic data, we develop a staging information-guided random walk approach to efficiently measure manifold distance between patients in a large cohort. In this way, the cross-sectional data could be reordered to be analogous to time-course data. This transformation enables us to formulate the GRN inference as an inverse problem of progression-dependent dynamic model of gene interactions, which is solved using a Bayesian method. The robustness of the estimates of regulatory coefficients is justified through mathematical analysis and simulations. Furthermore, applications to real data not only demonstrate better performance of PROB than other existing methods but also show good capacity of PROB in identifying key regulators of cancer progression or potential drug targets. The identified ACSS1 in bladder cancer and predicted FOXM1-targets interactions in breast cancer are both validated. In addition, we also discuss potential clinical applications of our method.

The tumor tissues in this study were received from the operative resection of bladder cancer patients. The patients/participants provided their written informed consent to participate in this study. The studies involving human participants were reviewed and approved by the Ethics Committee of Sun Yat-sen University Cancer Center (approval no. GZR2018-131).

Overview of PROB. PROB consists of two major components. First, to infer the latent temporal information of cancer progression from the cross-sectional data, a graph-based random walk approach was developed to quantitatively order patient samples (Fig 1A and 1B) . More specifically, we defined a manifold distance between patients by analytically summing the transition probabilities over all random walk lengths to quantify temporal progression and the root of the progression trajectory was automatically identified with the aid of staging information. The quantitative reordering of the samples led to the recovery of the temporal dynamics of gene expression (Fig 1C) . Second, a progression-dependent dynamic model was proposed to mechanistically describe the gene regulation dynamics during the above estimated temporal progression. To ultimately infer the GRN, the inverse problem in terms of parameter estimation of the dynamic model was transformed to a linear regression model which was solved using a Bayesian Lasso method (Fig 1D) . Compared to the existing correlational network methods, PROB can infer causal GRNs with directed and signed edges from cross-sectional transcriptomic data.

Temporal progression inference for cancer samples. We employ a similarity graphbased random walk approach to order patients along with the progression and to estimate the progression score for each patient, given the hypothesis that the similarity between patients can be measured by the patients' gene expression profiles and pathology information.

We first define a Gaussian similarity function for two patients, x and y, as 

Where T x and T y are vectors used to represent the transcriptomic expression profiles of the respective patients and kT x −T y k is the L 2 norm of T x −T y . The parameter γ is determined as

where ω xy is a weight coefficient given by pathology information such as stage or grade, which is defined in this study as ω xy = 1+|G x −G y |, with G x and G y representing grading or staging information (taking values of, for instance, 1, 2, 3, or 4) of the two patients x and y, respectively. The parameter E x is adaptive for each patient x and is set as the patient's distance to the κ-th nearest neighbor. S can be viewed as a stage-weighted and locally scaled Gaussian kernel.

To eliminate the effect of sampling density, we subsequently remold the above rotationinvariant kernel S(x, y) into an anisotropic kernel H(x, y),

by normalizing S with a proxy for the sampling density of the data points,

We next define a transition probability matrix P whose elements are defined as P xy ¼ EðxÞ

where E(x) is the row normalization of H, that is,

P is a symmetric transition matrix [24, 25] . P xy can be interpreted as the probability of transitioning from x to y (or from y to x). The eigenvectors of P can be referred to as diffusion components, and taken together, they constitute a modified version of the diffusion map [24, 25] , which extracts the topological structure of the high-dimensional data.

We then measure the transitions on all length scales between patients. The accumulated transition probability (Q xy ) of visiting y from x over random walk paths of all lengths is analytically calculated as

whereP ¼ P À c 0 c 0 T , and ψ 0 is the first eigenvector of P (corresponding to eigenvalue 1).

Since ψ 0 is associated with the steady state and contains no dynamic information [26] , we subtract the stationary component ψ 0 ψ 0 T from P, resulting inP. In this way, all the eigenvalues of P are smaller than 1; hence, the above sum of infinite series is convergent. ðI ÀPÞ y is the generalized inverse (or Moore-Penrose inverse) of I ÀP [27] . We use Q(x,�) to represent the accumulated transition probability of visiting all points from x. Thus, Q(x,�) is a row in Q and can be viewed as a feature representation for patient x. Therefore, we define a temporal progression distance (TPD) between two patients as TPDðx; yÞ ¼ kQðx; �Þ À Qðy;

where k�k stands for the L 2 norm. We remark that TPD is a scale-free manifold distance and is computationally efficient due to the closed form expression of Q.

Given a patient x, the progression score with respect to the trajectory's root x 0 is s = TPD (x 0 , x). Therefore, it is critical to determine the root sample in a large cohort for ordering the patients. We fulfill this task with the aid of the staging information of the patients: among all patients, the root of the trajectory should have the largest TPD to a patient with maximal stage (e.g., stage 4). That is, the root sample x 0 can be identified according to the following formula:

where x ref is a randomly selected patient from the maximal grade subpopulation. The selection of x 0 was limited among patients with the smallest grade (i.e., {x min }) to eliminate potential influence of a few outliers in the data. The ordering of the progression scores quantifies the relative progression status and maps the patients into a smoothed temporal trajectory. We remark that the incorporation of staging information into the Gaussian kernel and root identification could significantly improve the accuracy of temporal progression inference (see S1 Fig and Discussion section).

Dynamical systems modeling. Based on the mass action kinetics [28] , the temporal regulation of gene expressions can be modeled using the following dynamical system,

where X i (s) represents the expression level of gene i (i = 1,. . .,n) in cancer with progression status s. a ij is the regulatory coefficient from gene j to gene i (i = 1,. . .,n; j6 ¼i), and d i is the selfdegradation rate of gene i. The details of model assumption and derivation are provided in S1 Text. Parameter estimation using Bayesian Lasso method. Take m+1 points S i = s(r i ) from the smoothed progression trajectory s(r), where r i = i/m, i = 0,1,� � �,m. We approximate dX i ds s k ð Þ � X i ðs kþ1 Þ À X i ðs k Þ s kþ1 À s k and denote

where s k+1 −s k is sufficiently small (since m could be chosen large enough). Therefore, the above continuous model (i.e., Eq (10)) can be discretized and rewritten as

We then denote

and X ðiÞ ¼ 

Consequently, Eq (11) can be transformed into the following linear regression model:

where ε i ¼ ðε i0 ; ε i1 ; � � � ; ε im Þ T are the random effects with each ε ik � Nð0; s 2 i Þ, (k = 0,1,� � �,m). We then use an adapted Bayesian Lasso method to infer the posterior distribution over the coefficients in each A i . We assume that the conditional prior distribution of A i js 2 i ; l i is the Laplace (double exponential) distribution with a mean of 0 and scale s i l i , that is,

where λ i is the fixed lasso shrinkage parameter, which is set to 1. The prior distribution of s i 2 ; pðs 2 i Þ, is usually assumed to be an inverse gamma, with the probability distribution function

where A and B determine the shape and scale, respectively, of the inverse gamma distribution. Using Bayes' rule, we formulate the joint posterior distribution of A i and s 2 i as follows:

with 'ðA i ; s 2 i jY i ; X ðiÞ Þ, the data likelihood, given by

where X ðiÞ k is the (k+1)-th column of X (i) , (k = 0,1,� � �,m), and �ðY ik ; A i X ðiÞ k ; s 2 i Þ is the Gaussian probability density with mean A i X ðiÞ k and variance s 2 i evaluated at Y ik . The Markov chain Monte Carlo (MCMC) algorithm with Gibbs sampling updates is employed to estimate the marginal distribution of each parameter. A directed edge from gene j to gene i could be determined to be present if the 95% credible interval (CI) of the parameter estimates of a ij does not contain zero, otherwise absent.

Mathematical analysis. Considering the technical variability or measurement error in the transcriptomic data [22, 23] , it is important to examine the robustness of the method with respect to the perturbation in latent-temporal progression. To this end, we present the following theorem. 

ðX i ðsÞ;ã ij Þ both satisfy the equations of progression-dependent dynamic model, i.e.,

The proof of the above theorem is provided in S2 Text.

Based on the spectral graph theory [24, 25] , the above manifold distance (TPD) is noiseresistant, so the variation in the progression trajectory (i.e.,s À s) should be small given moderate perturbations (as illustrated below). Consequently, Theorem 1 then implies that the corresponding estimates of [a ij ] n×n should vary minimally. Therefore, the above theorem theoretically guarantees the consistency and robustness of the estimates of the regulatory coefficients. In addition, the Bayesian Lasso method adopted by PROB further ensures a robust implementation of GRN inference.

A corollary of the above theorem is that the mapping s7 ![a ij (s)] n×n defined by Eq (10) is continuous under certain appropriate metric. More specifically, for two trajectories s ands, if the difference between the two inferred regulatory coefficients [a ij (s)] n×n and ½a ij ðsÞ� n�n is significantly larger than 0, then the difference between s ands should not be arbitrarily small. This implies that if the inferred regulatory networks for two progressions are largely different, then the two progressions should have different trajectories and thus distinct clinical outcomes. Hence, Theorem 1 also suggests that GRN-based signatures may be used for predicting or controlling cancer progression.

Computational algorithm. The algorithm to infer progression trajectory and GRN is presented below. The implementation of PROB is described in S3 Text. 3: Normalization of S : H xy ¼ Sðx;yÞ DðxÞDðyÞ . 4: Transition probability:

5: Accumulated transition probability:

ψ 0 is the first eigenvector of P.

6: TPD function: TPDðx; yÞ ¼ kQðx; �Þ À Qðy; �Þk L 2 .

7: Trajectory root: 

For tumor sample-based gene expression data, several methods have been developed to infer gene networks. Pearson correlation (PCOR) is often used to quantify gene coexpression. Mutual information (MI) measures non-linear dependency between genes and thus provides a natural generalization of the correlation. MI-based methods for GRN inference include ARA-CNe [29] , CLR [30] , and MRNET [31] . Another commonly used method for GRN inference based on gene expression data is multiple linear regression LASSO method [32] , which assumes sparse network structure and is feasible for high-dimensional data. Ensemble learning methods, such as GENIE3 (a tree-based ensemble learning method [16] ), have been developed to infer gene regulatory relationships by viewing GRN reconstruction as a classification problem. In addition, we also included some GRN inference methods recently developed for scRNA-seq data into benchmarking analysis, since scRNA-seq data is also cross-sectional type. Such methods include SCODE [33] that uses ordinary differential equations model and LEAP [34] that constructs gene co-expression networks by using the time delay involved in the estimated pseudotime of the cells. SINCERITIES [35] is designed for time-stamped scRNA-seq data but requires at least 5 time points, so it is not applicable for the following benchmarking dataset as well as the tumor sample-based transcriptomic data.

In this study, we compared the accuracy of PROB with that of PCOR, ARACNe, CLR, MRNET, Lasso, GENIE3, SCODE and LEAP based on a real scRNA-seq data of dendritic cells (DCs) (GSE41265 [36] ). The cells were stimulated with LPS and sequenced at 1, 2, 4, and 6h after stimulation. Only wild type cells (n = 479) without Stat1 and Ifnar1 knockout were chosen for analysis. We choose this DC dataset for benchmarking because regulatory potential between 23 TFs in the DCs has been determined via a high-throughput Chromatin Immuno-Precipitation (HT-ChIP) method [37] . The AUC of ROC was used to assess and compare the prediction accuracies of the above methods.

In addition, we collected a set of known regulators and targets [38] to test whether PROB could correctly distinguish outgoing regulations of different genes. To this end, we defined an outgoing causality score (OCS) for gene i in cell k as follows:

where m ji is the absolute value of mean of the posterior distributions of a ji ; X k i is the expression level of gene i in cell k. The OCS is defined in accordance of the Eq (10) based on the mass action kinetics and quantifies the outgoing regulatory potential of a give gene. We then compared the distributions of OCS values of 6 regulators and that of 28 targets using the above DC dataset. The Wilcoxon rank-sum test (one-tailed) p value was calculated to assess statistical significance.

We applied PROB to a dataset of bladder cancer patients that includes 84 cases of conventional UCs and 28 cases of SARCs which were profiled by Illumina HumanHT-12 DASL Expression BeadChips (GSE128192 [39] ). The temporal progression inference was performed to quantitatively order samples based on the whole gene expression profile with UC samples and SARC samples labeled by 1 and 2 respectively. To reconstruct epithelial-to-mesenchymal transition (EMT) regulatory networks, we collected 44 representative genes of TGFB1 pathway, RhoA pathway, p53 pathway, p63 pathway and EMT transcriptional regulators (S1 Table) [39] . The UC network and SARC network were reconstructed based on the ordered expression data of the above 44 genes in UC samples and SARC samples respectively. The UC-specific network and SARC-specific network were then constructed by extracting edges that were unique to UC network and SARC network respectively. The out-degree values for each node in the two networks were calculated to prioritize key regulator genes.

We applied PROB to a microarray dataset of breast cancer (GSE7390 [40] ) to identify key regulator genes with prognostic role in cancer progression. We identified the hub gene in the GRN based on an eigenvector centrality measure according to singular value decomposition method [41] . Denote the mean of the posterior distributions of a ij as m ij , and M = (m ij ) n×n . We subject M to singular value decomposition. We calculated the principal eigenvector of M�M T and denoted it H = (h 1 , h 2 ,. . .,h n ). The hub score of node i was defined as h i . The gene with greatest hub score was identified as a hub gene for further analysis and validation. Over-expression plasmids and siRNA transfection. 5637 cells were placed in 24 wells plate and transfected with the lentiviral vectors pTSB-CMV-puro and SiRNA against ACSS1 reaching 70%-80% confluence using Lipofectamine 2000 (Thermo Scientific) according to the manufacturer instructions. The SiRNA sequence used in this study are listed in S2 Table. RNA extraction and qPCR. Total RNA was extracted by HiPure Total RNA Mini Kit (R4111-03, Magen) and the concentration was detected by ultramicrospectrophotometer (NanoDrop 2000, Thermo Fisher Scientific). RT-PCR was performed using PrimeScript RT Master Mix (DRR036A, TakaRa) and qPCR was performed by qPCR SYBR Green Master Mix (11198ES03, Yeasen) in Real-time quantitative PCR instrument (Q1000+, Long Gene). All the relative mRNA expression was normalized to GAPDH. The qRT-PCR primer sequence used in this study are listed in S3 Table. Western blotting. Total protein was extracted by RIPA lysis buffer (JC-PL001, Genshare) with PMSF (1:100, 20104ES03, Yeasen). Standard western blot protocols were adopted. The band intensity of western blots was detected by BLT GelView 6000M. All the relative protein expression was normalized to β-actin.

Immunohistochemistry: All the tumor tissues were received from the operative resection of patients. The patients/participants provided their written informed consent to participate in this study. The studies involving human participants were reviewed and approved by the Ethics Committee of Sun Yat-sen University Cancer Center (approval no. GZR2018-131). The immunohistochemical analysis of the two markers including ACSS1 and E-Cadherin was performed. All the pathological sections were produced, scanned and analyzed by Leica Biosystems.

We validated the regulation of FOXM1 (a hub gene, see Results section) on the predicted targeted genes using multiple sets of gene expression data and ChIP-seq data that are publicly available.

To validate the expression changes of the predicted targeted genes following FOXM1 perturbation, we analyzed microarray gene expression data in MCF-7 cells that were treated with DMSO (control) or Thiostrepton (FOXM1 inhibitor) for 6 hours (GSE40762 [42] ). The differential expression of the above 8 genes between control condition and FOXM1 inhibition condition was examined to test whether they were down-regulated after FOXM1 inhibition. The statistical significance was assessed using Wilcoxon rank sum test (one-tailed) p values.

To test whether FOXM1 binds to some of the predicted targeted genes, we used ChIP-seq data in both MCF-7 cell line (ER+) and MDA-MB-231 cell line (ER-) (GSE40762 [42] ) to analyze binding of FOXM1. A standard procedure of the ChIP-seq analysis was performed for peak calling (S7 Text).

To illustrate the function of PROB, we generated a set of synthetic cross-sectional expression data (S4 Text). For visualization purpose, we considered 6 genes in 100 cancer patients (S2A and S2B Fig) . We first used PROB to infer temporal progression from the randomized samplebased data. The inferred latent-temporal progression was compared against the true progres- To verify the robustness of PROB to the measurement variability, we further tested PROB for datasets at different levels of variabilities (Fig 2) . The gene expressions were randomly perturbed by using multiplicative Gaussian noises to simulate different levels of measurement variabilities in the data, resulting in a series of coefficient of variations (CVs) (i.e., 0%, 5%, 10% and 15% respectively) (Fig 2A) . The IDs of the samples were randomized to mimic samplebased snapshots of gene expression data, but the staging information was retained for each patient. PROB was applied to infer the GRN for each dataset. The accuracy of GRN inference was evaluated using the AUC of the ROC, showing that PROB could strongly reduce bias in gene expression measurements (Fig 2B and 2C) and robustly reconstructed the GRNs (Fig  2D) . Additional evaluation metrics were employed to verify the robustness of PROB against a series of variations in the data (with CVs ranging from 0% to 30%). The root mean square error (RMSE) and Spearman correlation coefficients were used to evaluate the accuracy of the temporal progression inference (S4A and S4B Fig). The accuracy, positive predictive value (PPV) and Matthews correlation coefficient (MCC) were used to evaluate the robustness of the GRN reconstruction (S4C-S4F Fig) . In addition to the above Gaussian noises, we also tested the robustness of PROB against perturbations of multiplicative exponential noises generated from the exponential distribution with mean ranging from 0 to 0.3 (S5 Fig). The findings are consistent with the above results (Fig 2) .

We used a set of single cell RNA-seq (scRNA-seq) data (GSE48968 [36] ) for benchmarking of GRN inference methods since our method can be naturally applied to stage-stamped or timecourse scRNA-seq data and the ground-truth of the GRN is available in this case as described in the Methods section. The LPS-stimulated dendritic cells (DCs) were sequenced at 1, 2, 4, and 6h after stimulation. The capture time in the data was treated as an analogy to 'staging' information when using PROB. The estimated latent-temporal progression recapitulated the physical progression of cells with a high correlation to the capture times (R 2 = 0.851) (Fig 3A) .

We compared PROB with other pseudotime inference methods (Slice, Slicer, PhenoPath, Wishbone, PAGA, Monocole2, DPT, Tscan). PROB estimation achieved a highest correlation with the original physical capture times among all methods tested, evaluated using both Kendall Tau rank correlation coefficient (Fig 3B) and coefficient of determination R 2 (S6 Fig). We next compared the accuracy of PROB with other existing GRN inference methods (e.g., PCOR, ARACNe, CLR, MRNET, Lasso, GENIE3, SCODE and LEAP) for cross-sectional data. A previous study measured binding region coverage scores for 23 TFs and thus quantified Comparison of PROB with other existing pseudotime inference methods and GRN inference methods using a real dataset. We employed a set of scRNA-seq data of dendritic cells (DCs) for benchmarking since the gold standard in this situation is available. The cells were sequenced at 1, 2, 4 and their regulatory potential in the DCs using a high-throughput Chromatin ImmunoPrecipitation (HT-ChIP) method [37] . A TF network was defined where an edge was viewed to be present if the coverage score between two TFs was greater than 0.3. We employed this network as a benchmark to compare the prediction accuracy of the network topologies inferred by PROB (Fig 3C) and other methods based on the above scRNA-seq data of DCs. The AUC values ( Fig  3D) indicated that PROB outperformed the other existing methods.

Furthermore, we collected a set of known regulators and targets [38] to test whether PROB could correctly reveal the regulatory causality. To this end, we applied PROB to infer a GRN for 6 regulators and 28 targets based on the above DC scRNA-seq data and defined outgoing causality score (OCS) for each gene in the inferred network (see definition of OCS in the Methods section). The OCS values of regulators were much higher than that of targets (Fig  3E) , suggesting that PROB faithfully revealed the ordering of the OCS values for the known regulators and targets on the analyzed dataset.

In addition, we summarized and compared the capabilities of the above methods in predicting gene regulatory links, directions, signs and expression dynamics (Fig 3F) . Only PROB can simultaneously fulfill those four tasks in GRN inference.

Sarcomatoid urothelial bladder cancer (SARC) is a highly lethal variant of bladder cancer and has been reported to be evolved by the progression of the conventional urothelial carcinoma (UC) [39] . It has been demonstrated that the dysregulation of genes involved in the epithelialto-mesenchymal transition (EMT) drives the progression of UC to SARC. To elucidate the dynamic change of the EMT regulatory network during the progression, here, we applied PROB to an expression dataset of bladder cancer containing 84 UC samples and 28 SARC samples (GSE128192). We collected 44 representative genes involved in several typical EMTregulating pathways (S1 Table) . The expression patterns of these genes were recovered along with the inferred temporal progression (Fig 4A) .

We then applied PROB to reconstruct GRNs for UCs and SARCs, respectively, based on the ordered expression data of the above 44 genes. Fig 4B and Fig 4C show the UC-specific network and the SARC-specific network, respectively, suggesting rewiring of the EMT regulatory network during the progression of UC to SARC. The two networks were enriched with crosstalks between different pathways, indicating cooperative regulation of EMT by those pathways. PTPN12 and ACSS1 were found to have largest out-degree values in UC-specific network and SARC-specific network, respectively (S1 Table) . Temporal dynamics of gene expression (Fig 4D) showed that ACSS1 and PTPN12 oscillated synchronously with CDH1 (coding gene of epithelial marker protein E-cadherin) at the early stage of UC development. However, at a later stage before transition to SARC, ACSS1 dramatically increased and PTPN12 decreased. Meanwhile, the decrease of CDH1 later on indicated a transition from epithelial to mesenchymal phenotype in SRACs, in consistent with changes in EMT score values (Fig 4E) . 

The decrease in PTPN12 expression during the progression is consistent with the previous finding that the loss of PTPN12 promotes EMT process and cell migration [43] . Furthermore, our result suggests that the up-regulation of ACSS1 might play a crucial role in the bladder cancer progression by promoting EMT program. We managed to validate the role of ACSS1 in EMT during bladder cancer progression, which has not been reported previously. The overexpression of ACSS1 in the 5637 cell line resulted in a significant decrease in CDH1 expression level (Fig 5A) , and ACSS1 knockdown by small interfering RNA leaded to significant increase in CDH1 expression level (Fig 5B) . The consistent changes in CDH1 protein levels following ACSS1 overexpression and knockdown were also observed (Fig 5C and 5D) . The numerical values of qPCR data and quantification of western blots were provided in S1 Data. These results confirmed that ACSS1 promoted EMT in bladder cancer cells. Furthermore, the Table) . (d) Reconstructed expression dynamics of ACSS1, PTPN12 and CDH1. ACSS1 and PTPN12 have largest out-degree values in the UCspecific network and SARC-specific network, respectively. CDH1 is a marker gene of epithelial state during EMT. (e) A decrease in EMT score indicated a transition from epithelial to mesenchymal state during the progression of UC to SARC. The EMT score for each tumor sample was calculated as weighted sum of expression levels of 73 EMT-signature genes as introduced in [39] . Positive EMT score corresponds to the epithelial phenotype while negative score to mesenchymal phenotype. Wilcoxon rank sum test (one-tailed) p value was calculated to assess the statistical significance. (Fig 5E) revealed that conventional UC tumors showed focal retention of epithelial marker protein E-cadherin while SARC tumors showed focal retention of ACSS1, supporting the above estimated dynamics of ACSS1 and CDH1 during bladder cancer progression.

To test whether our approach could be used to identify key genes underlying cancer progression, we applied PROB to a set of microarray data of breast cancer patients (n = 196) with clinical information (GSE7390) (see details in S5 Text) [40] . Based on the expression data reordered by PROB, we investigated which genes were upregulated or downregulated over progression by using a trend analysis technique. Such genes are referred to as temporally changing genes (TCGs) in this study. The one hundred top TCGs were selected. A heatmap with hierarchical clustering (Fig 6A) showed that these 100 genes were clearly clustered into two groups: a descending group (purple) and an ascending group (blue). We investigated the enriched gene sets for the two groups of genes using GSEA software [44, 45] . The descending genes were enriched in locomotion and movement of cell or subcellular component (Fig 6B,  upper panel) , and the ascending genes were mainly enriched in cell cycle and cell division processes (Fig 6B, lower panel) .

We then inferred the regulatory network of the above 100 top genes (Fig 6C) . Based on an eigenvector centrality measure (S5 Text), FOXM1 was identified as a most influential gene in the network. We verified significant associations between FOXM1 and the distant metastasisfree survival (DMFS), relapse-free survival (RFS) and overall survival (OS) (Fig 6D-6G ) and therapeutic responses (S7 Fig) in breast cancer patients (see details in S6 Text), in consistent with previous clinical studies [46] . Moreover, both in vitro and in vivo experiments [47, 48] have validated that FOXM1 plays important roles in breast cancer progression through promoting cell proliferation and cell cycle. Furthermore, FOXM1 has been used as a key drug target in breast cancer [49, 50] , and several drugs (e.g., daunorubicin, doxorubicin, epirubicin, and tamoxifen [51] ) developed to target or inhibit FOXM1 have been tested in clinical trials (https://clinicaltrials.gov/). These evidences suggest that our network inference and analysis approach is effective to identify key genes of cancer progression or candidate drug targets.

A subnetwork was reconstructed for FOXM1, which predicted that FOXM1 could positively regulate ASPM, CDCA8, KIF2C, MCM10, MELK, NCAPG, SHCBP1 and STIL (Fig 7A) . Preliminary investigation indicated that, except for STIL, the other 7 genes were functionally associated with FOXM1 according to String (https://string-db.org/), a database of functional protein-protein interaction networks (S8 Fig). We proceeded to validate the expression changes of these predicted target genes using microarray data of MCF-7 cells that were treated with DMSO (control) or thiostrepton (a FOXM1 inhibitor) for 6 hours (GSE40766 [42] ). We found that, except for SCCBP1 and STIL, the other 6 genes were significantly downregulated after FOXM1 inhibition (Fig 7B) . The statistical significance was assessed using Wilcoxon rank-sum test (one-tailed) p values. These results suggest that PROB well predicted both the directions and signs of the edges in the FOXM1 subnetwork. Moreover, we used ChIP-seq data (GSE40762 [42] ) to analyze the binding of FOXM1 to the predicted targeted genes (S7 Text). Both estrogen-dependent ER (+) MCF-7 and estrogenindependent ER (-) MDA-MB-231 human breast cancer cell lines were used for analysis. The analysis showed that FOXM1 binds ASPM, CDCA8 and KIF2C in both cell lines (Fig 7C-7H) . We note that the above three targets of FOXM1 were not previously reported by the widely used databases of transcriptional factor targets (e.g., TRANSFAC [52] and TRRUST v2 [53] ). Interestingly, in another human mammary epithelial cell line (HMEC) (GSE62425 [54] ) (S9 Fig) , the binding of FOXM1 to CDCA8 was absent, suggesting the emerging binding of FOXM1 to certain genes during the formation of breast cancer. In addition, we confirmed that the expression levels of the above three genes, ASPM, CDCA8 and KIF2C, were significantly reduced following the knockdown or silencing of FOXM1 based on both microarray data in BT-20 breast cancer cells (GSE2222 [55] ) (S10A-S10C Fig) and RNA-seq data in MCF-7 breast cancer cells (GSE58626 [56] ) (S10D-S10F Fig) . These findings suggest that FOXM1 not only positively regulates the expression of but also directly binds to some of the predicted genes.

PROB provides a novel tool for inferring cancer progression and GRNs from cross-sectional data. Our approach is based on a dynamical systems representation of gene interactions during cancer progression. The inverse problem with respect to GRN reconstruction was solved by combining latent progression estimation and Bayesian inference for high-dimensional dynamic systems. PROB can be used to generate experimentally testable hypotheses on the molecular regulatory mechanisms of gene regulation during cancer progression and to identify network-based gene biomarkers for predicting cancer prognosis and treatment response.

Besides cross-sectional bulk transcriptomic data, our method can be naturally applied to time-course scRNA-seq data (Fig 3) . Although scRNA-seq data can be used to infer GRNs during cell differentiation or development, it is currently not feasible to use scRNA-seq to investigate long term cancer progression due to patient heterogeneity, difficulty in acquisition of massive samples and expensive cost. In view of this, clinical transcriptomic data of cancer patients provide an alternative way to infer GRNs underlying cancer progression. The novelty and superiority of PROB can be first attributed to the successful ordering of tumor samples by using both gene expression data and staging information. Our proposed stage-weighted Gaussian kernel allows construction of diffusion-like random walks to quantify the temporal progression distance (TPD) between two patients (Eq (8)). The diffusion map, as a manifoldbased nonlinear dimension reduction method, has been recently applied to scRNA-seq data analysis [26, [57] [58] [59] . One major difficulty in applying diffusion maps for inferring pseudo trajectories lies in identifying the rooting point when using scRNA-seq data itself, and it often needs additional biological knowledge. An advantage of clinical transcriptomic data is that staging or grading information is usually available for samples as well, allowing development of an algorithm that automatically identifies the rooting point (Eq (9)). We demonstrated that incorporating staging information into the temporal progression inference significantly improved its accuracy (S1 Fig) and that our method significantly outperformed existing pseudotime inference methods (Figs 3B and S6) .

Considering technical variabilities in the sample-based transcriptomic data, it is important to have good robustness of the interaction coefficients in the GRN model with respect to the perturbation of the temporal progression. In addition to proving such property mathematically, through simulations we found that PROB inference of both the progression trajectory and the gene network structure are rather robust to noise in the data (Figs 2, S4 and S5 ). In addition, PROB is computationally efficient for GRN inference, which could be completed within 1 minute on the three real datasets analyzed in this study (S4 Table) .

For clinical applications, our method can be used to identify key genes for early detection of cancer progression and design of therapeutic targets. By recovering the temporal dynamics of gene expression in terms of the disease progression, PROB provides insights into exploiting kinetic features of functionally important genes that may be used as predictive biomarkers or drug targets. In the case study of bladder cancer progression, we have demonstrated that ACSS1 and PTNT12 played important roles in EMT during bladder cancer progression from UC to SARC and their expressions dynamically changed over the progression (Figs 4 and 5) . Therefore, we hypothesized that the temporal dynamics of EMT regulatory genes (e.g., ACSS1 or PTPN12) could be exploited to predict cancer progression. To this end, a logistic regression model was developed to predict EMT states or histological subtypes (UC vs. SARC) of bladder cancer based on the expression levels of ACSS1 and PTPN12, which showed good predictive accuracy (S11 Fig). As such, the early changes in expressions of ACSS1 and PTPN12 during the progression of UC to SARC may be relevant for the early detection of SARC.

In another case study of breast cancer, FOXM1, a drugable target, was identified as a key regulator underlying breast cancer progression (Fig 6) and, importantly, the predicted FOXM1-target regulations were validated (Fig 7) . Furthermore, here, we propose a GRN kinetic signature (S8 Text) based on FOXM1-targeted gene interactions to prognosticate relapse in breast cancer. Kaplan-Meier (K-M) survival curves were plotted for the high-risk group (green) and low-risk group (red) of patients with respect to relapse-free survival (RFS) (S12A-S12C Fig) . The log-rank test p values for all three datasets were less than 1e-4. Moreover, we tested the statistical significance of the FOXM1-targets interactions in predicting relapse in breast cancer using a bootstrapping approach (S8 Text). We compared the prognostic power (Wald test p value) of the FOXM1-predicted targets with that of 10000 sets of 8 randomly selected genes. The permutation test p values for all three datasets were less than 0.05 (S12D-S12F Fig) , verifying the non-randomness of the predicted targeted genes of FOXM1. These results demonstrated that the predicted FOXM1-target interactions could be used as a biomarker for prognosticating relapse in breast cancer. The latent-temporal progressionbased casual network reconstruction method proposed in this study will likely innovate other network-based methodologies, such as those in system genetics [60, 61] , network pharmacology [62, 63] , and network medicine [64, 65] .

Our method has several limitations that could be improved in future studies. For example, in the current method, only gene expression profiles and staging information from patient samples have been used for latent-temporal progression modeling. Other covariates, for example, age, genetic mutation, and molecular subtypes, might also be useful for progression inference [66] . Statistical models that integrate multiple aspects of clinical information will provide better inference of disease progression.

In summary, we have developed a novel latent-temporal progression-based Bayesian Lasso method, PROB, to infer directed and signed gene networks from prevalent cross-sectional transcriptomic data. PROB provides a dynamic and systems perspective for characterizing and understanding cancer progression based on patients' data. Our study also sheds light on facilitating the regulatory network-based approach to identifying key genes or therapeutic targets for the prognosis or treatment of cancers.

Supporting information S1 Fig. Incorporation of staging information significantly improved the accuracy of latenttemporal progression inference. We compared PROB with its several variants: 'ω xy = 1' represents setting the weight coefficient ω xy in Eq (2) to be 1; 'x ref = random' represents randomly assigning the reference point (Eq (9)) to identify the rooting point as the previous pseudotime inference methods usually did; 'No stage' represents leaving out the stage information, i.e., both 'ω xy = 1' and 'x ref = random'. Kendall tau correlation coefficient or determinant coefficient (R 2 ) between the inferred temporal progression and the staging data (or the capture time in scRNA-seq data) was calculated for each method. Microarray data and RNA-seq data on two breast cancer cell lines (BT-20 and MCF-7, respectively) were used for analyses. (a-c) The expression levels of the above three genes in BT-20 breast cancer cells under FOXM1 siRNA or control (mock transfection and GFP siRNA) conditions were analyzed using a set of microarray data (GSE2222) [55] (d-f) RNAseq data (GSE58626) [56] of MCF-7 breast cancer cells was used to analyze the differential expressions of the above three genes after FOXM1 inhibition by using small molecule compound IB that specifically inhibits FOXM1 [56] . The knockdown or silence of FOXM1 significantly reduced the expressions of the above three genes. Wilcoxon rank sum test (one-tailed) p value was calculated to assess the statistical significance. (TIF) S11 Fig. ACSS1 and PTPN12 are predictive of EMT and progression of UC to SARC. A logistic regression model was developed to predict (a) EMT states or (b) histological subtypes (UC vs. SARC) of bladder cancer based on the expression levels of ACSS1 and PTPN12. The samples were randomly divided into training set (n = 56) and test set (n = 56). The AUCs for the EMT phenotype prediction and subtype prediction are 0.8054 and 0.9405, respectively. (TIF) S12 Fig. A GRN kinetic signature predicts relapse in breast cancer. The kinetic features of the FOXM1-target interactions were formulated as a risk score to predict relapse for breast cancer patients in multiple independent cohorts. (a-c) Prognostic significance of the FOXM1-target interactions with respect to predicting relapse-free survival (RFS) in breast cancer evaluated on different datasets (GSE2990 [68] , GSE12093 [69] and GSE5327 [70] ). The log-rank test p value was used to assess the statistical significance of the difference between the Kaplan-Meier (K-M) survival curves of the high-risk group (green) and the low-risk group (red) of patients. (d-f) Nonrandomness test of the FOXM1-target interactions in predicting relapse in breast cancer using a bootstrapping approach (Text S6). The permutation test p values for all three datasets (0.0073, 0.0401 and 0.006, respectively) were less than 0.05, verifying the statistical significance of the prognostic power of the FOXM1-target interactions. (TIF) S1 Table. Out-degree values of the genes in the UC-specific and SARC-specific networks. 

Network medicine: a network-based approach to human disease

Drawing causal inference from Big Data

Quantitative and logic modelling of molecular and gene networks

Growing Seed Genes from Time Series Data and Thresholded Boolean Networks with Perturbation

Dynamic modeling of genetic networks using genetic algorithm and S-system

Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler

Combining tree-based and dynamical systems for the inference of gene regulatory networks

Methods for causal inference from gene perturbation experiments and validation

Learning a nonlinear dynamical system model of gene regulation: A perturbed steady-state approach

Inference of dynamic networks using time-course data

Dynamic modeling and network approaches for omics time course data: overview of computational approaches and applications

Prioritizing Parkinson's disease genes using population-scale transcriptomic data

Differential Coexpression Network Analysis for Gene Expression Data

Reverse engineering of regulatory networks in human B cells

Regression shrinkage and selection via the lasso: a retrospective

Inferring Regulatory Networks from Expression Data Using Tree-Based Methods

Network-based methods for human disease gene prediction

Identification of Causal Genetic Drivers of Human Disease through Systems-Level Analysis of Regulatory Networks

BAMM tools: an R package for the analysis of evolutionary dynamics on phylogenetic trees

Tumor classification using phylogenetic methods on expression data

Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph

Data quality in genomics and microarrays

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

Applied and Computational Harmonic Analysis

Spectral graph theory

Diffusion pseudotime robustly reconstructs lineage branching

Gene Regulatory Network Inference from Single-Cell Data Using Multivariate Information Measures

ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics

Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles

Biological Network Inference Using Redundancy Analysis

Regression Shrinkage and Selection Via the Lasso

SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation

LEAP: constructing gene co-expression networks for single-cell RNA-sequencing data using pseudotime ordering

SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles

Single-cell RNA-seq reveals dynamic paracrine control of cellular variation

A high-throughput chromatin immunoprecipitation approach reveals principles of dynamic gene regulation in mammals

Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses

Dysregulation of EMT Drives the Progression to

Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series

Authoritative sources in a hyperlinked environment

Genome-wide mapping of FOXM1 binding reveals co-binding with estrogen receptor alpha in breast cancer cells

Transposon mutagenesis identifies genes and cellular processes driving epithelial-mesenchymal transition in hepatocellular carcinoma

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

The molecular signatures database hallmark gene set collection

FoxM1: Repurposing an oncogene as a biomarker

Identification of FOXM1 as a specific marker for triple-negative breast cancer

FOXM1: From cancer initiation to progression and treatment

CBP/β-Catenin/FOXM1 Is a Novel Therapeutic Target in Triple Negative Breast Cancer

The FOXO3-FOXM1 axis: A key cancer drug target and a modulator of cancer drug resistance. Seminars in Cancer Biology

FoxM1 as a novel therapeutic target for cancer drug therapy

TRANSFAC: transcriptional regulation, from patterns to profiles

TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions

Delineating transcriptional networks of prognostic gene signatures refines treatment recommendations for lymph node-negative breast cancer patients. The FEBS journal

Loss of the Forkhead Transcription Factor FoxM1 Causes Centrosome Amplification and Mitotic Catastrophe

Suppression of the FOXM1 transcriptional programme via novel small molecule inhibition

Diffusion maps for high-dimensional single-cell analysis of differentiation data

Single-cell transcriptome-based multilayer network biomarker for predicting prognosis and therapeutic response of gliomas

Inferring microenvironmental regulation of gene expression from single-cell RNA sequencing data using scMLnet with an application to COVID-19

Systems genetics approaches to understand complex traits

Systems Genetics of Metabolism: The Use of the BXD Murine Reference Panel for Multiscalar Integration of Traits

Network pharmacology

Network Analysis to Identify Multiscale Mechanisms of Drug Action

Network medicine: a network-based approach to human disease

Gene regulatory network inference: evaluation and application to ovarian cancer allows the prioritization of drug targets

Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data

An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients

Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis

The 76-gene signature defines high-risk patients that benefit from adjuvant tamoxifen therapy

Lung metastasis genes couple breast tumor size and metastatic spread

We would like to acknowledge Profs. Tianshou Zhou, Jinzhi Lei, Yong Wang for valuable discussion. We would also like to acknowledge Drs. Zifeng Wang and Dongliang Leng for processing the Chip-seq data.