key: cord-0055149-1fxr77nx
authors: Wang, Pei
title: Statistical Identification of Important Nodes in Biological Systems
date: 2021-01-12
journal: J Syst Sci Complex
DOI: 10.1007/s11424-021-0001-2
sha: 5f09ce7dbf50cbf6c3e985c1760ae7a210c88067
doc_id: 55149
cord_uid: 1fxr77nx

Biological systems can be modeled and described by biological networks. Biological networks are typical complex networks with widely real-world applications. Many problems arising in biological systems can be boiled down to the identification of important nodes. For example, biomedical researchers frequently need to identify important genes that potentially leaded to disease phenotypes in animal and explore crucial genes that were responsible for stress responsiveness in plants. To facilitate the identification of important nodes in biological systems, one needs to know network structures or behavioral data of nodes (such as gene expression data). If network topology was known, various centrality measures can be developed to solve the problem; while if only behavioral data of nodes were given, some sophisticated statistical methods can be employed. This paper reviewed some of the recent works on statistical identification of important nodes in biological systems from three aspects, that is, 1) in general complex networks based on complex networks theory and epidemic dynamic models; 2) in biological networks based on network motifs; and 3) in plants based on RNA-seq data. The identification of important nodes in a complex system can be seen as a mapping from the system to the ranking score vector of nodes, such mapping is not necessarily with explicit form. The three aspects reflected three typical approaches on ranking nodes in biological systems and can be integrated into one general framework. This paper also proposed some challenges and future works on the related topics. The associated investigations have potential real-world applications in the control of biological systems, network medicine and new variety cultivation of crops.

Complex network science has interfused with many other scientific areas and has wider and wider real-world applications [1] [2] [3] [4] [5] [6] . Plenty of real-world systems can be described or modeled by complex networks. Such as WWW, Internet, citation networks among scientific journals or authors, social systems and biological systems. Among which, biological systems are typical complex systems [1, [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] .

Biological systems can be described by complex networks at different levels, generally including transcriptional regulatory networks (TRNs), gene regulatory networks (GRNs), proteinprotein interaction (PPI) networks, metabolic networks, signaling networks, and so on [7] . Social networks are another typical example of complex networks. Spreading phenomena in social networks are ubiquitous, especially for epidemic spreading [17] [18] [19] [20] , such as SARS [21, 22] , SARS-CoV-2 [23] [24] [25] . Structure of social networks may affect disease spreading, and thus infectious diseases should be controlled via different approaches in different types of networks. Therefore, one should put the investigation of disease transmission and control in complex social networks. Before the complex network science was widely known, scientists mainly considered the spreading rules of diseases, seldom considered the topological structures of social networks. Network topological structures have strong effects on epidemic spreading [18] [19] [20] . Existing works reported that infectious disease can be easily spreading among people in scale-free networks [26] [27] [28] . Unfortunately, many real-world systems have the scale-free property, which indicates that it is actually difficult to control spreading phenomena in scale-free networks through control transmission strengths, unless spreaders were isolated [28] .

Finding influential spreaders is the first step to control infectious disease spreading or design immunization strategies in social networks. Generally, nodes in complex systems are heterogeneous, indicating that nodes with different topological features or dynamical behaviors may have great differences on infection scopes, thus, a fundamental question is how to rank nodes in a complex system? Or in other word, how to identify important nodes? In fact, under different circumstances, important nodes have different meanings. From the epidemic perspective in complex networks, important nodes are equivalent to influential spreaders, nodes with high propagation capability/spreading scope [29] , and so on. From the functional perspective in biological systems, important nodes may mean important genes [30] , or disease genes [31] , or functional genes [11-16, 32, 33] . In the following, we indiscriminately call important nodes as key nodes, crucial nodes, and so on.

Many problems in real-world systems can be boiled down to the identification of important nodes [19, 32, 33] . For example, the identification of influential spreaders in social networks, the identification of biomarkers or disease-causing genes in biological systems, the screening of stress responsive crucial genes in plants, and the selection of key points in electrical systems. To facilitate the identification of important nodes in biological systems, one needs to know either network structures among entities in the systems or behavioral data of entities. Once we obtained the networks or data from the biological systems, one of our ultimate goals was to identify the most important entities in the system.

A complex network consists of nodes and edges [1] . Mathematically, it can be described by an adjacency matrix A = (a ij ) n×n . Here, n denotes the number of nodes in the complex network. The elements of the adjacency matrix A are always non-negative, a ij > 0 if node i and node j were connected, otherwise, a ij = 0. Moreover, if a ij = a ji for any node i and j, we call network A is undirected, otherwise, it is directed; If a ij took either 0 or 1, then it is called an unweighted network; Otherwise, it is weighted. Mathematically, node ranking in a complex network is equivalent to find a mapping [29] from the network A = (a ij ) n×n to node importance vector S = (s 1 , s 2 , · · · , s n ) T :

Here, s i represents the importance of node i in the network. Additionally, if there were some further node attributes (such as functional annotations of genes, gene expression profiles in different samples) described by matrix B = (b ij ) n×d and edge attributes (such as repression or activation) described by matrix C = (c ij ) n×h , then the mapping may be written as:

Here, d represents the maximum number of node attributes, n represents the total number of nodes in the network, and h denotes the maximum degree of nodes. c ij represents the value of edge attribute for the j'th edge of node i. f (A, B, C) was a function of A, B, C. Actually, the problem can be degenerated into a network embedding problem or dimensional reduction problem [34, 35] . It is noted that the mapping F or function f (·) can be with explicit form, but usually, it is difficult to find the explicit form, such as those methods based on the integration of matrices A, B, C.

With the rapid development of complex network science, various methods have been established to rank nodes if topological structure of complex networks were known [19] . These methods include the neighbor-based methods, path-based methods, iterative refinement measures and so on. But different methods have different advantages and different scopes of applications; There are no universal methods for all cases. Therefore, new measures are continuously proposed. Especially, with the rapid development of big data, some sophisticated statistical methods [36] have been proposed to deal with the cases where only behavioral data of entities in the complex system were known. In this paper, we introduce some works on the statistical identification of important nodes from three aspects, including our works in general complex networks [29] , in biological networks [30, 37] and based on RNA-seq data [32] . The three approaches depend on different conditions, which can be integrated into one general framework, and seen as the dimensional reduction problems. The rest paper is organized as follows. From Sections 2 to 4, we introduce the three aspects of works, and in the last Section 5, we propose some challenge problems of the related topics and give some concluding remarks.

Generally, the importance of a node in a complex network can be reflected by the quantity and quality of its neighbors, as well as the shortest paths that pass from this node. Thus, many neighbor-based methods, path-based methods and iterative refinement measures have been proposed [19] . Neighbor-based methods include degree centrality, semi-local centrality, k-shell, h-index, ING [38] ; path-based methods include betweenness centrality, eccentricity, closeness, Katz centrality; PageRank (PR) [39] , LeaderRank [40, 41] (LR), and eigenvector centrality are iterative refinement measures [19] . The mentioned measures have broad applications in realworld systems. For example, the well-known PR algorithm was originally used by Google to rank websites.

Each measure has limited scopes of applications and many of them can not be well explained through statistical theory [29] . For example, for the PR and LR algorithms in a degree uncorrelated network, based on mean-field theory, we proved that the average importance score for nodes within the node group with degree k = (k out , k in ) is proportional to k in , that is,

Here, s (k) is the average importance score for nodes within the node group with degree k = (k out , k in ). k out , k in denote out-degree and in-degree respectively. n denotes the number of nodes in the complex network. k in denotes the average in-degree of the network. Similar to the LR, by adding a ground node that was bidirectionally connected all the n nodes in the network, we propose two novel algorithms to rank node's propagation capability, which are called Spectralrank (SR) and weighted Spectralrank (WSR) [29] . Mathematically, the algorithms are described as:

Here, S denotes the SR score and S w denotes the WSR score; A is the adjacency matrix of the augmented network,

W = A + P , P = diag{p 1 , p 2 , · · · , p n , p n+1 } is a priori knowledge of node's importance score, and we always set p n+1 = 0 for the ground node. c is a tuning parameter and is usually selected as the reciprocal of dominant eigenvalue of A or W , that is c = 1/λ 1 . On the basis of this fact, the SR and WSR scores reduce to the dominant eigenvectors of A and W , respectively. The algorithms include three steps: Firstly, we add a ground node to the graph, which bidirectionally connected with all other nodes in the complex network. This step can make all nodes in the network strongly connected. In the second step, if the WSR was considered, we add the a priori knowledge, which is a diagonal matrix, with values represent a priori knowledge of node importance. If SR was considered, we omit the second step. We calculate the principle eigenvector of the obtained matrix as SR or WSR score in the last step. Figure 1 shows an illustrative example. If we consider the total degree of each node in the original network as a priori knowledge, we obtain the following weighted augmented adjacency matrix: 

The dominant eigenvectors of A and W can be obtained as follows: Therefore, the ranks for the seven nodes in the original network according to the SR is Rank SR = (2, 4, 3, 5, 1, 7, 6) T and Rank W SR = (2, 3, 4, 5, 1, 7, 6) T corresponds to the WSR. Obviously, the rankings from the SR and WSR were quite similar, except for nodes 2 and 3. This is because that the total degree for node 2 is 3, which is larger than that for node 3, thus, if this priori knowledge was considered, the WSR ranks node 2 more important than node 3. 

The algorithms are very simple and efficient. Moreover, we established some probability frameworks to illustrate that the proposed algorithms are statistically meaningful [29] . For simplicity, we considered undirected and unweighted complex networks, and we assumed that the networks follow the fitness growth model, that is, the probability p(i, j) of adding an edge between nodes i and j is proportional to the product of their importance scores s i and s j :

Furthermore, we supposed that the generated networks follow the Boltzmann distribution [42] :

where the energy is given by the Hamiltonian function H(A; s) = − i,j∈V a ij s i s j and Z A = A∈A p(A; s) is the partition constant. Note that this distribution coincides with the Ising model without an external field [42] . A denotes the ensemble of all the generated networks.

Based on the above assumptions, the following conclusions were obtained: 1) The maximum likelihood estimation of importance score vector S in the fitness model is exactly the eigenvector centrality of the network A under the constraint S T S = 1. Furthermore, c = 1/λ 1 is the necessary condition of the maximum likelihood estimation:

Here, s = (s 1 , s 2 , · · · , s n+1 ) T and s T s = 1, similarly hereinafter.

2) We assume that a priori distribution of s is governed by the conjugate a priori p(s) = e −s T P s /Z F , Z F = p(s)ds is a partition constant. We deduced that the priori knowledge P functioned as a L 2 norm penalty, which can effectively prevent over-fitting:

3) The addition of the ground node is also a kind of L 2 norm penalty, which can also prevent over-fitting:

For details, one can refer to [29] . The proposed algorithms were evaluated by 32 real-world networks [29] . The actual importance scores of nodes were simulated according to the SIR model. Each node was successively set as initial spreader, and the final propagation scope of a node after the spreading process reached its stable state was taken as its actual importance score. The actual average importance score vector was obtained by averaging over 100 independent simulation runs. The Kendall correlation coefficient between actual importance score vector and S was used to evaluate the performance of the proposed algorithm. Numerical results show that the proposed algorithms outperform many other existing algorithms, and it can be applied to binary networks, undirected and directed networks.

We also applied the algorithms to biological networks, including the neural network of C. elegans, TRN of E. coli. Taken the command interneurons in the neural network, the 18 global regulators and 7 key global regulators in the TRN as gold standards, we performed ROC analysis, results show that the proposed algorithms also have good performance in biological networks. The applicability of the algorithms in biological networks indicated that functional important transcription factors (TFs) or neurons may also play important roles in signaling spreading.

Network motif [43] and motif centrality [44] . Network motifs are patterns that recur much more frequently in the real network (a) than in an ensemble of randomized networks (b). Each node in the randomized networks has the same number of incoming and outgoing edges as does the corresponding node in the real network. Red dashed lines indicate edges that participate in the FFL motif, which occurs five times in the real network. 

Real-world biological networks are too complex, which have hindered our comprehensive understanding of them. Fortunately, researchers have founded that biological networks consist of simple regulatory circuits, called as network motifs [43] . Thus, investigations on network motifs are the first step to the system level understanding of biological networks.

The concept of network motifs was proposed by Uri Alon and coauthors in the year 2002 [43] . Network motifs are patterns of interconnections occurring in a complex network at numbers that are significantly higher than those in randomized networks (Figures 2 a and b) . For each network, one generates hundreds of randomized networks. The number of a subgraph in the real-world network is denoted as N rl . The average number in random networks is denoted as N rd , with standard deviation denoted by S d . The Z score measures the significance of the subgraph [43] , which is defined as Z score = (N rl − N rd )/S d . Another index U is defined as the number of times a subgraph appears in the investigated network with distinct sets of nodes. Generally, subgraphs with Z score ≥ 2, U ≥ 4 and N rl − N rd ≥ 0.1N rd are identified as motifs.

Uri Alon and coauthors reported that the three-node feed-forward loop (FFL) is a typical network motif. FFLs thus attracted wide attentions. Many works have been published to clarify the relationships among structures, functions and dynamics of network motifs [45] .

Network motifs are building blocks of complex biological networks, thus, they can be used to evaluate node importance in biological networks. Previously, some researchers have considered to use network motifs to rank nodes. For example, Koschützki, et al. [44] proposed a node centrality measure based on the FFL and the role of each node in the FFL (Figures 2 c-g) . They counted the frequency of each node involved in the FFL according to three roles, and nodes can be ranked by the sum of their total frequencies or according to each role. Except that, network motifs have been also used to evaluate node importance in neuron networks [46] [47] [48] [49] .

Motivated by existing works and network motifs, we proposed a new measure based on principal component analysis (PCA) [50] for directed biological networks [30] . The proposed method includes two steps: Motif counts and PCA analysis. Firstly, we detected all two-node, threenode and four-node network motifs in a biological network, and we counted the frequency of each node involved in each type of motifs, and then different motifs were weighted according to their total frequencies. Finally, we performed PCA analysis on the obtained data and our new measure was proposed as the first principal component.

Mathematically, suppose that there were totally m types of two-node, three-node and fournode network motifs. We denoted the occurrences of node i in the j-th type of motif as u ij , i = 1, 2, · · · , n, j = 1, 2, · · · , m. Then, we derived a matrix U = (u ij ) n×m for the network. In real-world networks, the importance of different types of motifs were varied. Therefore, we endowed each motif with a weight w j , j = 1, 2, · · · , m, where w j = c j / m k=1 c k . Here, c k (k = 1, 2, · · · , m) denotes the number of the k-th type of motif. Subsequently, we derived a revised matrix B = (b ij ) n×m = (b 1 , b 2 , · · · , b m ) = (w j u ij ) n×m . Based on B and the idea of the PCA [50] , we constructed the following index to obtain importance score S for the n nodes in the network:

It was proved that the parameter vector α = (α 1 , α 2 , · · · , α m ) T was just the eigenvector of the dominant eigenvalue of the covariance matrix for matrix B [30] :

Here, B is the column mean vector of matrix B. Figure 3 shows an illustrative example for the proposed motif centrality [30] . We applied the proposed algorithm to five real-world biological networks, including the neuronal network for C. elegan, TRNs for E. coli, drosophila and yeast, a signal transduction network for human [30] . These networks encompassed hundreds to thousands of nodes, and tens to tens of thousands of network motifs. ROC curve analysis revealed that the proposed method can well identify command interneurons in the neuron network, and global regulators in the TRN for E. coli. The proposed algorithm can exclude not important hubs but rank non-hub and actually important nodes at the top. Moreover, based on rich-club analysis, the proposed algorithm can also help us to find densely connected clusters, the top ranked nodes were more densely connected than those identified by the other methods.

The above work only considered directed biological networks. PPI networks are generally undirected. Extensive measures have been proposed to evaluate the structural importance of a node in a complex network [19] . Motivated by the proposed motif centrality and based on PCA, by integrating different centrality measures via PCA, we proposed an integrative measure in PPI networks, which can help to identify structural dominant proteins (SDPs) [37] . The proposed measures can integrate many existing measures (Figure 4 ). For simplicity, we considered the well-known degree centrality, betweenness, closeness, k-shell, semi-local centrality and motif centrality measures, and integrated them together.

Based on literature survey and existing databases, we constructed several real-world PPI networks with different sizes for the yeast. Moreover, in order to see the evolution of SDPs, we also constructed artificial PPIs, which are based on the duplication-divergence (DD) model [37] . We considered the anti-preference duplication process and the edge deletion, dimerization, edge addition and isolated node removal divergence processes. The DD model can generate an ensemble of random networks. The artificial PPI networks have similar topological features as the real-world ones with fine tuning parameters in the DD model [37] .

By applying the algorithm to the constructed PPI networks, we can find SDPs in the PPI networks. Moreover, we found that only a small fraction of proteins were structurally dominant. SDPs evolved more slowly than unimportant nodes. Targeted mutations on SDPs can keep certain robustness, as compared with targeted mutations on hubs [37] .

C : Degree 

RNA sequencing (RNA-seq) uses the next generation sequencing (NGS) technologies to reveal the presence and quantity of RNA molecules in biological samples. RNA-seq analyses can be performed at four different levels: Sample-level, gene-level, transcript-level, and exon-level [51] . In the sample-level analysis, the results are usually summarized into a similarity matrix. The gene-level analysis summarizes the counts of RNA-seq reads mapped to genes in samples of different conditions, and it subsequently compares genes' expression levels that were calculated based on read counts. The transcript-level analysis focuses on reads mapped to different isoforms. The exon-level analysis mostly considers the reads mapped to or skipping the exon of interest.

A flow chart of RNA-seq analysis typically includes the following steps: Experimental design, RNA sequencing, data analysis, biological mechanism clarification and experimental verification ( Figure 5 ). The first step is to design experiments and cultivate samples for certain purpose. And then sequencing the samples via high-throughput sequencer and obtaining sequence data. The subsequently complicated work is to perform data analysis. Such as sequence alignment, finding differentially expressed genes (DEGs) and perform hypothesis test. One of the ultimate goal of data analysis is to find important genes that cause phenotype variation among experimental samples. Based on the selected genes, researchers can clarify the related molecular mechanism. The obtained results should be verified according to qRT-PCR experiments or functional verification (such as gene mutation experiments) for further applications in the cultivation of new crop varieties [10, 32, 33] .

A flow chart of RNA-seq analysis typically includes the following steps:

Experimental design, RNA sequencing, data analysis, biological mechanism clarification and experimental verification

As we have mentioned, one of the most important goal of RNA-seq data analysis is the identification of important genes that possibly cause phenotype variation. Many methods have been proposed to cope with this problem. Traditionally, log 2 fold change values and hypothesis tests are frequently used to select crucial DEGs [32, 33, 51] . If there are m treated samples and m samples served as controls, and the expression values (FPKM, RPKM or other methods) of gene i (i = 1, 2, · · · , n) in the treated samples are denoted as y i = (y i1 , y i2 , · · · , y im ) T , and those in the controlled samples are denoted as x i = (x i1 , x i2 , · · · , x im ) T . The mean expression values under treatment and control are denoted as y i and x i respectively. Then the log 2 fold change value for gene i was defined as:

log 2 (F C i ) > 1 indicates that gene i was up-regulated under treatment in comparison with control, and the expression under treatment was more than two times higher than that under control; log 2 (F C i ) < −1 indicates that gene i was down-regulated under treatment in comparison with control, and the expression under control was more than two times higher than that under treatment. Moreover, one can perform hypothesis t test (the expression values of genes in repeated samples under a certain condition are assumed to be normally distributed) to verify whether the expression of gene i under treatment was significantly different from that under control. Generally, due to experimental costs, the sample size m is very small. Thus, the revised t statistic can be used to test the null hypothesis that the expression values of a gene between the treated and the control samples have no significant difference, which is written as [52] :

Here,

. Under the null hypothesis, the t i statistic follows the t distribution with degree of freedom:

Conventionally, one can use | log 2 (F C i )| > 1 and P i = p{|t(df i )| > t i } < 0.05 as criterion to evaluate whether gene i was a DEG. If there are too many DEGs, one can choose larger threshold values for log 2 (F C) and P . For example, | log 2 (F C i )| > 2 and P < 0.01. Recently, some novel methods to screen DEGs from RNA-seq data have been proposed, which were based on the assumption that the read counts follow negative binomial distribution, and the related R package DESeq [53] and DESeq2 [54] were successively developed. Except the above mentioned method, researchers have developed many other methods to identify crucial genes from omics data. For example, Li, et al. [55] , Chen, et al. [56] and Hou, et al. [57] introduced the hidden Markov model (HMM) for genome-wide association studies (GWAS). Among which, based on the HMM and the 'guilt-by-rewiring' principle of the gene co-expression networks, Hou, et al. [57] identified disease genes in crohn's disease and parkinson's disease.

In the following, taking one of our works on Brassica napus (B. napus) as an example, we briefly introduce one work on the identification of important genes based on RNA-seq data [32] . The Brassica genus includes a diverse range of vegetable and oilseed crops which are important for human nutrition, such as Brassica rapa (B. rapa), Brassica oleracea (B. oleracea) and B. napus. The whole genome map for B. napus [58] , B. rapa [59] , and B. olercea [60] and Arabidopsis thaliana [61] have been published, which facilitate us to genome-widely explore these species.

Plants encompass diverse TF families, such as AP2, bZIP, MYB, NAC and WRKY [32] . Some TF families are crucial for stress responsiveness in plants. Thus, the identification of TF families in B. napus is an interesting yet important problem. Through gene and protein sequence alignment, we identified five families of TFs in the four species. We genome-widely identified totally 2167 TFs in B. napus belonging to the five families, including 518 BnAP2/EREBPs, 252 BnbZIPs, 721 BnMYBs, 398 BnNACs and 278 BnWRKYs, which contained some novel members in comparison with existing results [32] .

We performed structural analysis, synteny analysis and cis-acting element analysis on the identified TFs [32] . Sub-genome distributions of BnAP2/EREBPs and BnMYBs indicated that the two families might have suffered from duplication and divergence during evolution. Phylogenetic analysis revealed that each TF family can be divided into several subfamilies according to their sequence similarity. Synteny analysis revealed strong co-linearity between B. napus and its two ancestors, although chromosomal rearrangements have occurred and 85 TFs were lost. About 7.6% and 9.4% TFs of the five families in B. napus were novel genes and conserved genes, which both showed preference on the C sub-genome.

To see the responsiveness of the five TF families under stress, we designed RNA-seq experiments and obtained RNA-seq data. We cultivated B. napus seedings of 7-day-old as samples, and we considered five treatments, including cold, heat, drought, salt and ABA, The seedlings without any treatments were taken as controls. Seedlings were sampled at 12h after treatments for RNA extraction. Each experiment was repeated three times. GO enrichment analysis revealed that the 315 DEGs totally enriched in 213 biological process terms (P < 0.01), including various biological regulation processes, responding to various stimulus and diverse signaling pathways. Clustering analysis on expression values of the 315 DEGs revealed that crucial TFs in each family were hierarchically clustered [32] . The expression profiles of crucial TFs under drought, salt and ABA were all similar in the five families. TFs from the same subfamilies tended to be clustered.

For the 315 crucial TFs, based on gray correlation coefficient, we constructed gene coexpression networks, and performed comparative analysis with homologous gene network of A thalina. We found that the crucial TFs could trigger the differential expression of targeted genes, resulting in a complex clustered network with clusters of genes responsible for targeted stress responsiveness. To verify the reliability of the data, we verified 40 genes via qRT-PCR experiments. qRT-PCR results revealed that the obtained data are reproducible [32] .

In this paper, we mainly reviewed some recent works on the identification of important nodes in biological systems, which are based on three different approaches and depending on different conditions. We have proposed the SR and WSR algorithms to evaluate node propagation capability in general complex networks, and we build a new probabilistic framework, which provides a theoretic understanding on eigenvector centrality, ground node and a priori knowledge in the algorithm. The proposed SR and WSR can be seen as a mapping from the adjacency matrix A of a complex network to node importance score S. We also proposed a network motif centrality for directed biological networks, which is based on PCA and network motif detection; We further extended the idea of the motif centrality to undirected PPI networks, and we proposed an integrative measure, which can be used to identify SDPs in PPI networks. The motif centrality measure and the integrative measure can not be explicitly written as a mapping from the adjacency matrix A of a complex network. However, the motif participation matrix U or weighted matrix B is actually a function of the adjacency matrix A, which can be described as B = g(A). Here, the function g(·) can not be explicitly written. In this case, the motif centrality can also be seen as a mapping from A to S, similarly for the integrative measure. We explored five TF families in B. napus and investigated their stress responsive characteristics, 315 crucial TFs were screened. The last work was totally based on data, thus, it is different from the previous two works. However, RNA-seq data are obtained from biological systems, which can also be constructed as a complex network. Thus, identifying crucial TFs from RNA-seq can be also seen as a mapping. It is interesting yet meaningful to establish a unified framework for the mentioned works, the recently developed graph representation theory [34] or graph neural networks [35] are promising tools.

Although the proposed methods have some advantages, there are still many issues to be further explored. For example, for the proposed motif centrality, if there are no network motifs in a biological system, then the algorithm will lose efficacy. However, one possible extension of the motif centrality is that one does not consider whether a subgraph was a network motif, and consider all two-node, three-node and four-node subgraphs. Another extension is that the PCA is a linear method, possibly one can extend the importance score S as a nonlinear function. However, nonlinear S will undoubtedly increase the computational difficulty. For the identification of important genes based on RNA-seq data, we mainly used the log 2 fold change value and hypothesis test methods to identify crucial DEGs. Our future works will consider the reconstruction of gene co-expression networks based on RNA-seq data, and then we will further use the gene co-expression networks to identify crucial responsive genes [62] . A big challenge in RNA-seq data analysis is in that one often encounters the cases with p >> n, that is, the number of variables or features is far larger than the number of samples [63] . When p >> n, many traditional statistical methods lose their effectiveness, one must establish new methods to perform data analysis. Our ongoing works will consider the logistic regression method with various penalizations [63] to cope with such problem.

The associated works can help us to understand the complex biological systems, and they may have potential applications in biological network control, network medicine and new variety cultivation of crops.

The Structure and Dynamics of Networks

Coreness and h-index for weighted networks

Compressive-sensing-based structure identification for multilayer networks

Cooperative epidemic spreading on a two-layered interconnected network

Stability and feedback control for a coupled hematopoiesis nonlinear system

Coarse graining method based on generalized degree in complex network

Structural controllability of static and dynamic transcriptional regulatory networks for Saccharomyces cerevisiae

Network medicine: A network-based approach to human disease

Multi-gene co-transformation can improve comprehensive resistance to abiotic stresses in B. napus L

Functional characterization of GhPHOT2 in chloroplast avoidance of Gossypium hirsutum

Fine-tuning stomatal movement through small signaling peptides

Comparative transcriptome analyses of drought-resistant and -susceptible Brassica napus L. and development of EST-SSR markers by RNA-Seq

Use of comparative transcriptome analysis to identify candidate genes related to albinism in channel catfish

Transcriptome analysis of the molecular mechanism of Chrysanthemum flower color change under short-day photoperiods

Genome-wide identification and expression analysis of NADPH oxidase genes in response to ABA and abiotic stresses, and in fibre formation in Gossypium

Identification of influential spreaders in complex networks

Identifying influential spreaders in artificial complex networks

Vital nodes identification in complex networks

Dynamics of information diffusion and its applications on complex networks

A novel coronavirus associated with severe acute respiratory syndrome

Newly discovered coronavirus as the primary cause of severe acute respiratory syndrome

A novel coronavirus from patients with pneumonia in China

Clinical features of patients infected with 2019 novel coronavirus in

Statistical and network analysis of 1212 COVID-19 patients in Henan, China

Epidemic spreading in scale-free networks

Absence of epidemic threshold in scale-free networks with degree correlations

Improving immunization strategies

Spectral learning algorithm reveals propagation capability of complex network

Identification of important nodes in directed biological networks: A network motif approach

Graphical features of functional genes in human protein interaction network

Exploring transcriptional factors reveals crucial members and regulatory networks involved in different abiotic stresses in Brassica napus L

Transcriptomic basis for drought-resistance in Brassica napus L

Graph representation learning: A survey

A comprehensive survey on graph neural networks

Statistics for High-Dimensional Data: Methods, Theory and Applications

Identification and evolution of structurally dominant nodes in proteinprotein interaction networks

Iterative neighbour-information gathering for ranking nodes in complex networks

The anatomy of a large-scale hypertextual web search engine

Leaders in social networks, the delicious case

Identifying important nodes by adaptive LeaderRank

Fundamental of statistical and thermal physics

Network motifs: Simple building blocks of complex networks

Ranking of network elements based on functional substructures

Network motifs: Theory and experimental approaches

Centrality analysis methods for biological networks and their application to gene regulatory networks

Motifs in brain networks

Identification and classification of hubs in brain networks

Complex network measures of brain connectivity: Uses and interpretations

Applied Multivariate Statistical Analysis

Modeling and analysis of RNA-seq data: A review from a statistical perspective

Statistics for the Life Sciences

Differential expression analysis for sequence count data

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

A hidden Markov random field model for genome-wide association studies

Incorporating biological pathways via a Markov random field model in genome-wide association studies

Guilt by rewiring: Gene prioritization through network rewiring in genome wide association studies

Early allopolyploid evolution in the post-neolithic Brassica napus oilseed genome

The genome of the mesopolyploid crop species Brassica rapa

The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes

The Arabidopsis Information Resource (TAIR): A comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant

Network-constrained regularization and variable selection for analysis of genomic data

Logistic regression for disease classification using microarray data: Model selection in a large p and small n case