key: cord-0001676-lli46azw authors: Wang, Jinlian; Zuo, Yiming; Man, Yan-gao; Avital, Itzhak; Stojadinovic, Alexander; Liu, Meng; Yang, Xiaowei; Varghese, Rency S.; Tadesse, Mahlet G; Ressom, Habtom W title: Pathway and Network Approaches for Identification of Cancer Signature Markers from Omics Data date: 2015-01-01 journal: J Cancer DOI: 10.7150/jca.10631 sha: 3b5c43f1068e79db27adaca5df0bd717851c37cc doc_id: 1676 cord_uid: lli46azw The advancement of high throughput omic technologies during the past few years has made it possible to perform many complex assays in a much shorter time than the traditional approaches. The rapid accumulation and wide availability of omic data generated by these technologies offer great opportunities to unravel disease mechanisms, but also presents significant challenges to extract knowledge from such massive data and to evaluate the findings. To address these challenges, a number of pathway and network based approaches have been introduced. This review article evaluates these methods and discusses their application in cancer biomarker discovery using hepatocellular carcinoma (HCC) as an example. A better understanding of disease associated with biomarkers could potentially start a new area for uncovering the mechanism of cancer progression, development and offer better targets for drug development [1] . Studies on single gene/protein/ metabolite molecular signatures offer limited insight into the complex interplay among the molecules responsible for progression of complex diseases such as cancer. Thus, there is a shift toward the identification of a panel of genes that interact directly or indirectly in the form of pathway or complex network to evaluate their association to cancer [2, 3] . This is accomplished through massive data derived by high throughput omic technologies such as next generation sequencing, microarray, and mass spectrometry. Although thousands of candidate biomarkers have been discovered by these technologies, few of them have been transferred into practical application in clinical setting and new drug production. The challenges lie in (1) high false positive rate of the candidate biomarkers identified from omics data; (2) Lack of attention on the study of the context of biomarkers who are interacting each other in the form of pathway or network associated with cancer; (3) Fragmental and incomplete information based on biomarkers identified from solely omics platform; (4) Lack of effective algorithms that allow integration of diverse omics data sources to simulate the biological pathway and networks. To meet these challenges, a number of pathway and network based approaches have been introduced. This review article evaluates the advantages and limitations of these methods. The traditional approaches that individual and a panel of cancer biomarkers are selected by analytic Ivyspring International Publisher methods such as analysis of variance (ANOVA), Lasso, pairwise, information theory and support vector machine (SVM) do not explicitly consider interaction between genes, proteins and metabolites. Compared to traditional methods, pathway and network centric methods naturally provide a way to understand the underlying pathways and the interactions between individual signature markers and non-markers. With the large-scale generation and integration of genomic, transcriptomic, proteomic, and metabolomic data, pathway/network-based methods provide a more effective and accurate means for cancer biomarker discovery. Increasingly, pathway and network-based analyses are applied to omics data to gain more insight into the underlying biological function and processes, such as cell signaling and metabolic pathways as well as gene regulatory networks [4] [5] [6] . A number of pathway /network approaches have also been used for improving the prediction of cancer outcome, providing novel hypotheses for pathways involved tumor progression [7] , and exploring cancer associated biomarkers [8] . For example, Taylor et al. [9] combined gene expression data with physical protein-protein interaction data to identify subnetwork markers for the prognosis of breast cancer and lymphoma patients. Torkamani and Schork [10] used gene co-expression network to infer cancer-initiating genes in breast, colorectal cancer, and glioblastoma. Kim et al. applied the MAPIT (Multi Analyte Pathway Inference Tool) algorithm to identify prognostic network markers to predict GBM patient survival time using multi-analyte network markers discovered by integrating gene expression profile, epigenomic profile, and protein-protein interactome [11] . Goh et al. [12] built a human disease network (HDN) by linking hereditary disease that share a disease-causing gene recorded in Mendelian Inheritance in Man (OMIM) database. Although the functional connections in the HDN remain to be further demonstrated, it inspires us to systematically study the relationships among diseases by constructing a network. More detailed descriptions of relationships between human disease and network essential for understanding of human have been recently summarized in reviews [2, [12] [13] [14] . In this review, we provide some pathway and network centric computational approaches and their applications for biomarker discovery. Availability of biomedical pathways and networks based on large-scale data gathering through diverse omics data sources offers new opportunities to explain the causality of relationships between bio-logical entities and cancers [15] . As shown in Figure 1 , the general steps of the biomarker discovery include the following: 1) Define precisely a well-framed, relevant clinical problem and focus the experimental design around appropriate study populations and samples; 2) Collect tissue samples or fluids from patients and suitable assays; 3) Acquire high-throughput data from the omics technologies; 4) Analyze the data using signal processing, statistical and machine learning methods to select relevant features from the data; 5) Integrate the pathway/network knowledge from databases such as KEGG, HMDB and Reatcome mapping candidate biomarkers to the corresponding pathways or networks; 6) Evaluate biomarkers to estimate their diagnostic or prognostic capability and clinical validity using alternative technologies such as Westen blot, ELSA, and RT-PCR. In computational aspect, cross-validation and independent validation are the commonly used methods to evaluate the performance of a biomarkers. P-values, sensitivity, specificity and the area under receiver operating curves (AUC) are used as quantitative indicators of the performance of the methods [16] ; 7) Use the biomarkers for clinical applications after reliable pre-clinical tests and validation of the markers in a large population. Statistical methods test scientific theories when observations, processes or boundary conditions are subject to stochasticity. For examples, the classical t-test has been extensively used for testing differential gene expression in microarray data [35] . However, this kind of procedure relies on reasonable estimates of reproducibility or within-gene error, requiring a large number of replicated arrays. Thus, several methods for improving estimates of variability and statistical tests of differential expression have been proposed. For example, Significance Analysis of Microarrays (SAM) aimed to improve the unstable error estimation in the two-sample t-test by adding a variance stabilization factor which minimizes the variance variability across different intensity ranges [36, 37] . ANOVA model approach is widely used in multiple kinds of omic data. For example, it was used to model microarray data with the effects of array, condition, and condition-array interaction and then to fit the residuals with the effects of gene, gene-condition interaction, and gene-array interaction [38, 39] . Also, it was applied to capture the effects of controlled groups, batches, condition, alias of experimental equipment, and condition-metabolite interaction separately on LC-MS data [40] . To improve the accuracy and sensitivity of analytic results, false discovery rate (FDR) [41] and its refinement, q-value, (q-value package, www.bioconductor.org) have been rapidly adopted for genomic, proteomics and metabolic data analysis including the widely-used SAM, DAVID [42] and other approaches [36] . Another statistics method for biomarker discovery is linear discriminant analysis (LDA), one of the classical statistical classification techniques based on the multivariate normal distribution assumption, is quite robust and powerful to discover biomarker or pathways between omics data for many different applications despite the distributional assumption. Compared to LDA, quadratic discriminant analysis (QDA) requires more observations to estimate each variance-covariance matrix for each class [43] . In addition, logistic regression analysis has been successfully used to evaluate biomarker performance of prostate cancer with mRNA profiling [44] . Logistic regression (LR) model based on the regression fit on probabilistic odds between comparing conditions requires no specific distribution assumption (e.g. Gaussian distribution) but is often found to be less sensitive than other approaches [42, 43] . The modeling fundamentals of graph theory are often used to describe the global topology, structure or the community of a complex system. It emphasizes on entities (e.g, genes, proteins, diseases, biological process) and the relations between them. The complexity of graphical modeling can be either simple only with nodes and edges or more complex where edges have weights, and nodes and edges can be of different types. Recent publications have applied graphical modeling in computational biology to study biological networks, enhance the ability to draw causal inferences from functional MRI experiments, support the early detection of disconnection and the modeling of pathology spread in neurodegenerative disease such as Alzheimer's disease [45] [46] [47] [48] [49] . For example, in mammalian cells, Bleris et al. have had early success in characterizing the dynamics of key feed forward modules and motifs, helping to enable the circuit design of adaptive gene expression [50] . Using graph based approaches, Ma'ayan et al. model cellular machinery including genes, proteins and other subcellular compartments [51] , in which the interactions between components are drawn as edge connections between the relevant nodes [51] . Gene expression data combined with network analysis can yield important information on how expression variation relates to differences between observed states [52] . As closely connected genes tend to be involved in similar functions, network annotation can complement clusters obtained via fold change analysis [7] . A standard systems-based approach to biomarker and drug target discovery consists of placing putative or known biomarkers in the context of a network of biological interactions, followed by different 'guilt-byassociation' analyses [53] . Table 1 . Computational methods for biomarker discovery categorized by their application, examplary tools and URLs. Technique & Application Examples Exemplary Tools &URL Statistical analysis Hypothesis testing, random sampling. ANOVA. Detection of differentially expressed genes/proteins, genotypes, biomarker filtering/selection [24] BRB:http://linus.nci.nih.gov/BRB-ArrayTools.html PAM: http://www-stat.stanford.edu/~tibs/PAM/ SAM: http://www-stat.stanford.edu/~tibs/SAM/ Pattern recognition Machine learning, Probabilistic, instance-based, kernel classification models. Clustering, multi-source data classification, biomarker selection and associations [25] Bayesian regression models [26] , partial least squares [27] , and Genetic Algorithm/KNN [28] . Weka: http://weka.wikispaces.com/ LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ PRTools: http://prtools.org/ R package: http://cran.r-project.org/web/views/Bayesian.html Graph/network theory Network topology analysis, network visualization and data integration, clustering. Genetic, regulatory, protein-protein, signaling network analysis, biomarker/target identification [29] BioNet [4] :http://www.fda.gov/ScienceResearch/BioinformaticsTools/ucm285284.htm Jung: http://jung.sourceforge.net/ http://bioinfo.mc.vanderbilt.edu/dmGWAS.html [30] Data visualization and imaging Sequence and cluster visualization, interactive visualization, statistical analysis graphs. Data exploration, biomarker visualization, model explanation, in vivo/in vitro imaging of molecules and cells [29] Cytoscape [31] : http://www.biotapestry.org/ Medusa: http://coot.embl.de/medusa/ Graphviz: http://www.graphviz.org/ Osprey:http://biodata.mshri.on.ca/osprey/servlet/Index Pajek: http://vlado.fmf.uni-lj.si/pub/networks/pajek/ 3Omics: http://3omics.cmdm.tw Natural language processing and information retrieval Ontologies, text mining, information representation standards, information retrieval and extraction. Inference of functional associations from publications, automated annotation and characterization [32, 33] The goal of visualization is to find patterns and structures that remain hidden in the raw unstructured datasets. Graph visualization is key to display directly the various relationships between entities (e.g., genes, proteins). Challenges of graph visualization lie in 1) the high false positive rate of incorporating heterogeneous multi-omic datasets; 2) Visual representation of the logical structure transformed from the raw data; 3) Graph manipulation and layout algorithm for representing the complicated relationships between biological entities. 4) Heterogeneous omic data from different level visualization needs more flexibility for layered representation. A number of commercial and free sourced graph visualization tools and platforms have been extensively developed. For example, Cytoscape [31] , one of the free open source platforms providing biological network analysis and visualization with more than 172 registered plugins contributed by the community, is very versatile in network applications, such as network importing, network integrating, inference customization, literature mining, topological clustering, functional enrichment, network comparison, and programmatic access [54] . 3DScapeCS, a Cytoscape plugin providing three-dimensional, dynamic, parallel network visualization for Mass Spectrometry (MS) molecular network [55] . IPA [56] , a commercial software tool for pathway analysis with omics data provides powerful graphical visualized pathways and networks overlaid by diseases, drugs and biological process etc. Path-wayStudio provide abstractive graphical interface for users to analyze gene expression, protein interaction and metabolic data to analyze and explore the pathways and networks identified from data. STRING not only gives the graphical visualized protein interaction of both known and predicted but also quantifies each pair of proteins by their interaction types such as physical interaction and gene fusion etc. [57] . Bayesian methods allow informative priors so that prior knowledge or results of a previous model can be used to inform the current model. In cancer bioinformatics and systems biology, the primary application of Bayesian methods include Bayesian inference, Bayesian network, Naive Bayes classifier and Bayesian variable selection. Among these methods, Bayesian network is one of the most common modeling tools for pathway and network analysis [19] . Bayesian network is a form of directed statistical modeling designed to capture conditional dependencies between probabilistic events [58] . It consists of a dependency structure and local probability model also named probabilistic graph models which include Hierarchical Bayesian Networks (HBN), Probabilistic Boolean Networks (PBN), Hidden Markov Models (HMM), and Markov Logic Networks (MLN) [59] [60] [61] . The dependency structure specifies how the variables are related to each other by drawing directed edges between the variables without creating directed cycles. Each variable depends on a possibly set of other variables, termed "parents." Compared with other pathway/network centric method, Bayesian network model is capable of integrating heterogeneous data, missing value and dependent relationships between variables [62] . In a Bayesian network model, probabilities define the relationship between the current node and its predecessor or parent in a graph [63] . The power of these methods lies in their ability to facilitate the reverse engineering of multiplex networks based on molecular expression, molecular activity and/or cell behavior data, serving as a precursor to synthetic modifications of existing molecular pathways [64] . Bayesian inference is one of the very important Bayesian methods widely used in cancer biomarker discovery, signaling pathway and network inference [65, 66] . It has previously been applied to gene expression data for inference of gene regulatory networks [67, 68] , infer both protein signaling networks [69, 70] and gene regulatory networks [71] . To incorporate an explicit time element, dynamic Bayesian Inference was proposed to interrogate dynamic signaling responses within a Bayesian framework, with existing signaling biology incorporated through an informative prior distribution on networks [66] . In addition, Bayesian variable selection aims at solving the problems of "large p, small n" existing in omic data set and using prior knowledge such as pathway and protein interaction to estimate the posterior probability by Markov Chain Monte Carlo (MCMC) also widely used to infer functional interactions in biochemical pathway, model the interactions between different functional modules of a biological network [72] and pathway based cancer biomarker discovery [73, 74] . For example, Yang et al. [21] used a Bayesian network to construct HCC cell networks and identify functional modules and interactions between these modules. Stochastic simulation models offer an alternative, but they are hitherto associated with a major disadvantage: their likelihood functions cannot be calculated explicitly, and thus it is difficult to couple them to well-established statistical theory such as maximum likelihood and Bayesian statistics. A number of new methods, among them Approximate Bayesian Computing and Pattern-Oriented Modeling, bypass this limitation. The difference between Bayesian and frequentist inference lies in the following: 1) Bayesian inference provides answers conditional on the observed data and not based on the distribution of estimators or test statistics over imaginary samples not observed (Rossi et al., 2005, p. 4) ; 2) It includes uncertainty in the probability model, yielding more realistic predictions. 3) It safeguards against overfitting by integrating over model parameters. But the quality of the prior information directly impacts the performance of the Bayesian methods. Also, they are unable to account for feed-back regulation, a hallmark of signaling networks. With the growth of information in literature and biomedical databases, biological and clinical scientists need efficient means of handling and extracting diagnostic methods and prognostic terms and information from scientific literature. For this purpose, text mining that comprises the discovery and extraction of knowledge from free text to generate new hypotheses particularly relevant and helpful in biomedical research [14] . Text mining complements the reading of scientific literature by individual researchers, allows rapid access to information contained in large volume of documents and increases the reproducibility of literature searches by enabling users to process all documents for a specific result. The primary application of text-mining in biomedical research roughly lies in three aspects: 1) Simple text-mining such as transforming textual information into database content and integrating with existing knowledge resources to suggest novel hypotheses; 2) Literature analysis including clustering and classification of entities or diseases; 3) Integrative biology for producing or testing hypotheses against knowledge bases. Currently, text mining is being successfully applied to the identification of molecular causes of diseases using facts from databases and literature [75] [76] [77] . For example, text-mining has been used to suggest disease biomarkers from the scientific literature, and made on the basis of the assumption that two proteins are likely to interact with each other if they share a substantial amount of contextual information [78, 79] . By defining a gene of interest, a network is constructed from all scientific publications related to the query-defined gene. The results can be browsed by navigating through the visualized network. CoPub makes uses of lexical resources for genes, proteins, Gene Ontology labels, diseases, pathways, drugs and tissues to identify and statistically to qualify the significance of a specific term for a gene or a set of genes [80] . The results return a set of annotations for their genes of interest. Besides, text mining has been widely used in industrial large scale knowledge base for query genes, proteins, metabolic compounds and drugs functional analysis. To visualize knowledge contained in the scientific literature, software tools have been developed that provide improved integration of text-mining results with other data resources. For example, IPA (Ingenuity) [56] , KEGG [81] , Pathway Studio [82] and HPRD [83] use text-mining to integrate gene/protein-phenotype associations linking genes and protein variants to the diseases, toxic effects and drug response to their knowledge databases. Depending on the tasks researchers address, text-mining can achieve different objectives. This include primarily the following: 1) retrieval information from relevant documents; 2) Identification of entities such as genes, diseases, complex relationship between entities and diseases and interactions between proteins and genes [80] ; 3) Deposit extracted information into database or used to support manual database curation efforts [15] ; 4) Generation hypothesis [79] and test novel research questions [78] . The trend of text-mining technique is shifting from the analysis of only abstracts to the full text of papers, from the analysis of gene and protein-related information to the information about cells, tissues and whole organisms. The most prominent shift is to integrate information from the literature with data sets from other domains such as gene expression profiles [84] , genome-wide association studies (GWASs), biochemistry and phenotype [84, 85] . Text-mining is prone to integration with machine learning, statistical techniques. In the future, text-mining might face several major challenges such as improve literature analysis, integrate to existing knowledge base, visualization of extracted information. Machine learning methods have been used for the biomarker discovery from high-throughput omics data, inferring causal relations between mutations and diseases [21] , interactions between genes and proteins [86] [87] [88] and relations between environmental features and cancer [89] as well as pathway and network modeling. There are two kinds of basic machine learning techniques, one is unsupervised machine learning such as hierarchical clustering, self-organizing mapping (SOM) etc. [90] . The other is supervised machine learning which needs known knowledge from data train a model and then apply this model to predict the output variables [3] . A number of machine learning such as SVM [14] , Artificial Neural Network [91] , decision tree and random forests (RFs) etc. have been widely for various applications including identification of breast cancer biomarkers [92] , diagnosis biomarker of Parkinson disorders [93] , subcellular locations of proteins [94, 95] , the prediction of protein functions on the basis of protein structures [96, 97] , the annotation of mutations [98, 99] . For example, Han proposed a machine learning based derivative component analysis method to select implicit feature by capturing subtle data behaviors and removing system noises from a proteomic profile to overcome the reproducibility problem for biomarker discovery in proteomics [100] . Another interesting study by Hoshida et al [101] combined eight independent cohorts of gene expression profiles to reveal the subclass of HCC and their related pathways using unsupervised machine learning methods. They found that three common subclasses (S1-S3) of hepatocellular carcinoma (HCC) were significantly correlated to Wnt pathway, MYC, AKT and hepatocyte differentiation respectively. Westen blotting; knockout and immunohistochemical staining were used for experimental validation of their discovery. Another framework called knowledge-driven matrix factorization (KMF) proposed by Yang et al. was used to reconstruct phenotype-specific modular gene networks [21] . Integration of data from multiple omic studies not only can help unravel the underlying molecular mechanism of carcinogenesis but also identify the signature of signaling pathway/networks characteristic for specific cancer types that can be used for diagnosis, prognosis and guidance for targeted therapy. The methods described in Sections A-E have proven useful for discovering biomarkers from high-throughput omic data, analyzing protein-protein, protein-DNA, and kinase-substrate interactions, as well as for genetic interactions among genes [102] . These efforts have yielded good results in cancer biomarker discovery, protein interaction and interaction between genotype and diseases [103] . However, current omic technologies provide only limited fragmented reality of the biological functions within cell or cancer mechanism. Separate analysis of the data generated from each of these technologies is limited to revealing only partial aberrant molecular changes, because the interaction of multiple molecules cannot be modeled by isolated analysis of genes, proteins or metabolites. Furthermore, limitations such as intrinsic high noise, incomplete data, small sample-size, bias have motivated the use of integrative omic analysis and use of prior biological knowledge and information bases, rather than as mere collections of single large-scale omic studies [14, 34, 104] . However, integration of multiple disparate data types remains a significant challenge in systems biology research. Most recently, attempts at integration of multiple high-throughput omics data have concentrated on capturing regulatory associations between genes and proteins by comparing expression patterns across multiple conditions [105] [106] [107] , combining functional characterization and quantitative evidence extracted from different data sources of all levels of gene products, mRNA, proteins and metabolites, as well as their interaction [108] [109] [110] . Some previous works [81, [111] [112] [113] in integrative analysis utilize pathways in the form of connected routes through a graph-based representation of the metabolic network [114] . Other approaches focus on the functional module of protein interaction network and analyze experimental data in the context of pathways using multiple source omics data [14, 115, 116] . We and others have developed advanced bioinformatics tools and algorithms to facilitate the integration of diverse data types [34, 110, [117] [118] [119] [120] . Different biological types of data, such as sequences, protein structures and families, proteomics data, ontologies, gene expression and other experimental data sets show a growing complexity produced by numerous heterogeneous application areas. The integration of heterogeneous data is therefore becoming more and more important. In order to gain insights into the complexity and dynamics of biological systems, the information stored in these data repositories needs to be linked and combined in efficient ways. Hepatocellular carcinoma (HCC) is the fifth most common malignancy and the third leading cause of cancer death in the world, with the five-year survival rate approaching 7% [33] . Treatments of HCC include surgical resection and transplantation, ablation and transarterial chemoembolization, and systemic chemotherapy. Even so, no existing systemic chemotherapy is effective for advanced HCC [121, 122] . For example, Lovet et al. [123] reported that targeted therapy with sorafenib which inhibits multiple tyrosine kinase receptors (RAS/VEGFR) may prolong survival by about three months. However, due to the redundancy and compensation of the signaling network in HCC, a significant reorganization of the signaling network observed such as down regulation of tumor suppressors (p53 and CHK1 when XIAP silenced or p-RB when CDK6 silenced) and upregulation of tumor promoting proteins (ETS1 when XIAP silenced or p-CREB when CDK6 silenced) may confer the growth benefit for cancer cells [124] . This example suggests providing pathways and network information may improve the efficacy of systemic chemotherapy of HCC. Chang et al [125] partitioned the complex oncogenic signaling networks into basic units, or functional modules, of signaling activity (e.g., a protein phosphorylating another protein to activate its kinase activity) and demonstrated that gene expression signatures based on these modules can predict the effectiveness of pathway-specific therapeutics [125] . Except for surgical resection/transplantation of early stage HCC, the survival time is not significantly prolonged by any of these treatments. Added to pathway and network centric method making use of omics data with systematic chemotherapy will benefit the development of newer therapeutic targets for HCC treatment. In recent years, computational methods for models take more and more important roles in the HCC investigations [114, 126, 127] . Some computer systems have also been developed. For example, Shannon et al. [128] developed a java based tool Gaggle by integrating diverse databases (e.g., KEGG, Bi-oCyc, String) and software (e.g., Cytoscape, R ) to simultaneously explore the experimental data (e.g., mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations, metabolic pathways (KEGG) and Pubmed abstracts. Recently, Zheng et al [129] identified the molecular events underlying the development of HCV induced HCC by integrating gene expression profile and protein interaction data. To get the subnetworks, they refined the network by removing a network component if the number of nodes is smaller than five. They found four subnetworks called normal-cirrhosis, cirrhosis-dysplasia, dysplasia-early and early-advanced HCC networks. From each of the sub networks they identified functional modules and hub genes. By comparing the pathways in each sub networks, they observed changes of pathways and network activities. Their findings were validated by literature. Even though the types of omics data they used only include gene expression and protein interactions, they provide a way to study the changes of network activities by analysis of omic data. Zhang et al. used systematical method including partial least squares, literature mining technique and with GeneGO Meta-Core to discover the biomarkers of HCC with gene expression as well as protein data. Based on these marker genes, they constructed down regulated and up regulated networks. In the former, they identified 10 up regulated hub genes (MAPK1, SP1, HDAC1, YY1, ABL1, PTK2, SMAD2, NCOA3; CDC25A and NCOA2). They identified 7 hub genes (FOS, ESR1, JUNB, EGFR, SOCS3; FOLH1 and IGF1) in the latter. Partial least squares were employed to construct a classifier with these biomarkers. They used five-fold cross-validation and two independent datasets to evaluate the performance of the classifier. Furthermore, they used experimental immunohistochemistry and western blot measurements to verify the marker genes predicted by the classifier. Their results show that the network-based approach facilitates biomarker identification and improves classification accuracy [130] . Hollywod et al [131] identified driver genes which are potent diagnosis markers and mechanism study of HCC using t-statistic map (TM) and transcriptome correlation map (TCM) approaches with integration of DNA copy number measured by genomics CGH array and gene expression. They found 50 driver genes with significant prognostic relevance to HCC key signaling pathways such as mTOR, AMPK, and EGFR. siRNA-mediated knockdown experiments was used to evaluate the functional significance of the 50 driver genes [131] . Even though collection of diverse omics data to analyze the relationships between HCC phenotype and biological entities within the cell has been proved powerful enough, such integration is still fragmentary, incomplete and inadequate to reflect the whole picture of the cancer information and development. The amount of omics data from genomics, proteomics, metabolomics and interactomics is increasing. In pace with the explosion of omics data, a number of open-access databases, containing comprehensive gene, protein interaction, biological pathway and network information, are being developed to provide biologists with valuable tools for analyzing the data from complex biological systems. These include In-tAct, BioGRID, MINT, KEGG, PID, STRING and REACTOME etc. all of which provide very useful qualitative mappings of functional associations between key components in canonical pathways [14] . Table 2 summarizes primary data source and URLs specific to HCC. URLs EHCO [132] http://ehco.iis.sinica.edu.tw/ Onco.HCC [133] http://oncodb.hcc.ibms.sinica.edu.tw/index.htm HCVpro [134] http://cbrc.kaust.edu.sa/hcvpro/ HCVdb [135] http://euhcvdb.ibcp.fr/euHCVdb/ Hepatitis Virus Database (HVDB) [136] http://s2as02.genes.nig.ac.jp Los Alamos National Laboratory in the United States [137] http://hcv.lanl.gov LiverAtlas [138] http://liveratlas.hupo.org.cn dbHCCvar [139] http://GenetMed.fudan.edu.cn/dbHCCvar With wide applications of omics technique, more accurate and ubiquitous biomarkers have been identified, but only few have been brought to clinical setting and many have proved to be irreproducible [140] . One of the concerns is that biomarkers identified suffer from low diagnostic specificity and sensitivity which leads to current cancer biomarkers have not yet made a major impact in reducing cancer burden. For instance, serum alpha-fetoprotein (AFP) is the most widely used biomarkers for detecting and monitoring of HCC, but the false negative rate with AFP levels may be high as 40% for patients with early stage of HCC, for advances patients, the AFP levels remain small in 15%-30% of patients [141] . One of the important limitations is possible artifacts in conducing biological experiments such as instrument variability. Others include bias in sample collection and sample handling which lead to cohort differences. For example, Sreekumar et al. [142] reported sarcosine as a prostate cancer biomarker through metabolomics analysis. However, subsequent validation study done by Jentzmik et al. [143] concluded that the levels of sarcosine measured by GC-MS could not differentiate malignant from nonmalignant tissue. Collestelli et al. reported no statistically significant difference between prostate cancer and healthy controls in the sarcosine to creatinine ratios and that the levels of sarcosine were about 11.7% higher in the healthy controls [144] . Another important limitation relates to lack of computational methods that can extract knowledge from omic data involving substantial amount of noise, high dimensionality, missing values, etc. Although the use of pathway and network-based approaches and the integration of prior biological knowledge with omic data are promising in addressing some of the computational challenges, they too have some limitations as outlined below: • mRNA levels and DNA alterations may not accurately reflect the corresponding protein levels and fail to reveal changes in posttranscriptional protein modulation (e.g., phosphorylation, acetylation, methylation, ubiquitination, etc.) or protein degradation rates. Correlation of mRNA with its associated protein expression can be relatively low. The signaling network constructed using these approaches does not reflect the dynamic signal flow in a spatial relationship. Also, the genomic changes (mRNA level, SNP, CNV, methylation) ultimately affect protein expression, activation and inactivation, which, in turn, controls cellular behavior. • Current proteomic technologies provide only limited coverage of the proteome and more sensitive technologies are needed to identify and quantitate low abundant proteins [145, 146] . • Interpretation of pathway mapping results from the fact that pathway annotations currently take little consideration of tissue specificities of genes or proteins in the pathway. This limits the tissue and/or isoform specificity in pathway annotations. Thus, specific steps of a pathway may not be actually active in tissues/cells from which the omics data may be generated. In some cases, this may occur because protein isoforms or splice variants have been annotated as a protein class or a canonical protein sequence, respectively, in the pathway while they may be expressed differentially in different tissues/cells. • Because biological pathways are inherently complex and dynamic, pathway annotations in different pathway databases vary significantly in pathway models and in a number of other aspects, e.g., specific protein forms, dynamic complex formation, subcellular locations, and pathway cross talks. Current computational methods thus need to provide a solution to these issues including revealing patterns within the data, modeling heterogeneity, profiling of disease classes and subclasses, producing a predictive of patients' classification, etc.. Biomarker discovery is now changing research away from identification of individual biomarkers to searching for perturbed pathways and network activities. Early detection of cancer improves survival and enhances quality of life. An ideal marker would be one that can be measured easily and reliably using an assay with high sensitivity and specificity and undergo rigorous validation before they are introduced into routine clinical care. Currently, the treatment of most cancers is based on the tissue types and clinical stages. This approach is often ineffective due to the heterogeneity of the tumors. Pathway and network based method have taken more important role in analysis of high-throughput data. Pathway and network based methods provide a global and systematical way to explore the relationships between biomarkers and their interacting partners. Thus, future work is likely to focus on using pathway and network based methods for biomarker discovery. It is our expectation that methods discussed above will become a component in a shared infrastructure of biomedical resources that can be used by researchers to identify and to retrieve the most relevant work, to formulate hypothesis, to find supporting and contradicting evidence for hypotheses, to integrate research results into a framework of whole biological systems and to support the translation of research results across domains and into clinical applications. A pathway-based view of human diseases and disease relationships The future of model organisms in human disease research The genetics of autoimmune diseases: a networked perspective atBioNet--an integrated network analysis tool for genomics and biomarker discovery Therapeutically targeting glypican-3 via a conformation-specific single-domain antibody in hepatocellular carcinoma Myth of the chronic fatigue syndrome Network-based classification of breast cancer metastasis Phenotypic characterization of human colorectal cancer stem cells Dynamic modularity in protein interaction networks predicts breast cancer outcome Identification of rare cancer driver mutations by network reconstruction Multi-analyte network markers for tumor prognosis The human disease network Systems biology and the future of medicine Identification of aberrant pathways and network activities from high-throughput data Integrating text mining into the MGI biocuration workflow Omics" data and levels of evidence for biomarker discovery A statistical framework for Illumina DNA methylation arrays A directed protein interaction network for investigating intracellular signal transduction A framework for elucidating regulatory networks based on prior information and expression data Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks EpCAM-positive hepatocellular carcinoma cells are tumor-initiating cells with stem/progenitor cell features Non Linear Programming (NLP) formulation for quantitative modeling of protein signal transduction pathways MicroRNA-mRNA interaction network using TSK-type recurrent neural fuzzy network Cloud-based solution to identify statistically significant MS peaks differentiating sample categories Circulating transcriptome reveals markers of atherosclerosis Predicting the clinical status of human breast cancer by using gene expression profiles Partial least squares proportional hazard regression for application to DNA microarray survival data Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method Pathways of matrix metalloproteinase induction in heart failure: bioactive molecules and transcriptional regulation dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks Cytoscape: a software environment for integrated models of biomolecular interaction networks BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text Network analysis of epidermal growth factor signaling using integrated genomic, proteomic and phosphorylation data Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays Significance analysis of microarrays applied to the ionizing radiation response Does BCR/ABL1 positive acute myeloid leukaemia exist? Assessing gene significance from cDNA microarray expression data via mixed models Plasma Protein Biomarkers of the Geriatric Syndrome of Frailty Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments Statistical significance for genomewide studies Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources Developing optimal prediction models for cancer classification using gene expression data Logistic Regression Analysis with Standardized Markers Hepatocellular carcinoma in cirrhosis: incidence and risk factors Proteomic identification of CIB1 as a potential diagnostic factor in hepatocellular carcinoma Highly multiplexed proteomic platform for biomarker discovery, diagnostics, and therapeutics Machine learning and social network analysis applied to Alzheimer's disease biomarkers From brain topography to brain topology: relevance of graph theory to functional neuroscience Synthetic incoherent feedforward circuits show adaptation to the amount of their genetic template Toward predictive models of mammalian cells Molecular networks in microarray analysis Ontology-and graph-based similarity assessment in biological networks A travel guide to Cytoscape plugins 3DScapeCS: application of three dimensional, parallel, dynamic network visualization in Cytoscape Pathway analysis tools and toxicogenomics reference databases for risk assessment STRING v9.1: protein-protein interaction networks, with increased coverage and integration Bayesian network analysis of signaling networks: a primer Structure learning for Bayesian networks as models of biological networks Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks A comparison of neural network models, fuzzy logic, and multiple linear regression for prediction of hatchability Bayesian Network for Omics Data Integration. Washington DC Bayesian methods for proteomics Bayesian design of synthetic biological systems Modeling signaling networks using high-throughput phospho-proteomics Bayesian inference of signaling network topology in a cancer cell line Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks An empirical Bayesian method for estimating biological networks from temporal microarray data Network inference using informative priors Systems analysis of EGF receptor signaling dynamics with microwestern arrays Using Bayesian networks to analyze expression data Integrating Bayesian variable selection with Modular Response Analysis to infer biochemical network topology Integrating Bayesian variable selection with Modular Response Analysis to infer biochemical network topology Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology G2D: a tool for mining genes associated with disease Association of genes to genetically inherited diseases using data mining Conceptual biology: unearthing the gems Novel protein-protein interactions inferred from literature context BioProspecting: novel marker discovery obtained by mining the bibleome CoPub: a literature-based keyword enrichment tool for microarray data analysis Observing metabolic functions at the genome scale Pathway studio--the analysis and navigation of molecular networks Human Protein Reference Database--2009 update A Bayesian network model for omics data integration The structural and content aspects of abstracts versus bodies of full text journal articles are different Reverse PCA, a Systematic Approach for Identifying Genes Important for the Physical Interaction between Protein Pairs Infection of common marmosets with hepatitis C virus/GB virus-B chimeras Identification and validation of dysregulated metabolic pathways in metastatic renal cell carcinoma Genes-environment interactions in obesity-and diabetes-associated pancreatic cancer: A GWAS data analysis Mohamed Salleh AH. A Review for Detecting Gene-Gene Interactions Using Machine Learning Methods in Genetic Epidemiology Neural networks Recursive SVM biomarker selection for early detection of breast cancer in peripheral blood Applying bioinformatics to proteomics: is machine learning the answer to biomarker discovery for PD and MSA Application of proteomic marker ensembles to subcellular organelle identification Computational identification of post-translational modification sites and functional families reveal possible moonlighting role of rotaviral proteins Measurements of protein sequence-structure correlations Evaluation and integration of existing methods for computational prediction of allergens How changes in extracellular matrix mechanics and gene expression variability might combine to drive cancer progression Genome-wide association and sequencing studies on colorectal cancer A novel profile biomarker diagnosis for mass spectral proteomics Integrative transcriptome analysis reveals common molecular subclasses of human hepatocellular carcinoma Differential network biology Global physiological understanding and metabolic engineering of microorganisms based on omics studies Transcriptomic profiling reveals hepatic stem-like gene signatures and interplay of miR-200c and epithelial-mesenchymal transition in intrahepatic cholangiocarcinoma Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles Topological analysis of protein co-abundance networks identifies novel host targets important for HCV infection and pathogenesis Bottlenecks and hubs in inferred networks are important for virulence in Salmonella typhimurium Gene expression-based chemical genomics identifies potential therapeutic drugs in hepatocellular carcinoma Integrative genomics: liver regeneration and hepatocellular carcinoma A network integration approach to predict conserved regulators related to pathogenicity of influenza and SARS-CoV respiratory viruses Global Metabolomic and Network analysis of Escherichia coli Responses to Exogenous Biofuels Co-regulation of metabolic genes is better explained by flux coupling than by network distance Network-based prediction of human tissue-specific metabolism MetaRoute: fast search for relevant metabolic routes for interactive network navigation and visualization Integration of expression data in genome-scale metabolic network reconstructions Systems analysis of the NCI-60 cancer cell lines by alignment of protein pathway activation modules with "-OMIC" data fields and therapeutic response signatures Integrative identification of Arabidopsis mitochondrial proteome and its function exploitation through protein interaction network Integrating the Alzheimer's disease proteome and transcriptome: a comprehensive network model of a complex disease Data merging for integrated microarray and proteomic analysis MetaboSearch: tool for mass-based metabolite identification using multiple databases Design and endpoints of clinical trials in hepatocellular carcinoma Novel advancements in the management of hepatocellular carcinoma in 2008 Sorafenib in advanced hepatocellular carcinoma Proteomics, pathway array and signaling network-based medicine in cancer A genomic strategy to elucidate modules of oncogenic pathway signaling networks LS-CAP: an algorithm for identifying cytogenetic aberrations in hepatocellular carcinoma using microarray data A tumor progression model for hepatocellular carcinoma: bioinformatic analysis of genomic data The Protein Information and Property Explorer 2: gaggle-like exploration of biological proteomic data within one webpage CD133 positive hepatocellular carcinoma cells possess high capacity for tumorigenicity A systems biology-based classifier for hepatocellular carcinoma diagnosis Metabolomics: current technologies and future trends Detection of the inferred interaction network in hepatocellular carcinoma from EHCO (Encyclopedia of Hepatocellular Carcinoma genes Online) HCC: an integrated oncogenomic database of hepatocellular carcinoma revealed aberrant cancer target genes and loci HCVpro: hepatitis C virus protein interaction database The euHCVdb suite of in silico tools for investigating the structural impact of mutations in hepatitis C virus proteins Development and public release of a comprehensive hepatitis virus database The Los Alamos hepatitis C sequence database LiverAtlas: a unique integrated knowledge database for systems-level research of liver and hepatic disease dbHCCvar: a comprehensive database of human genetic variations in hepatocellular carcinoma Proteomics research to discover markers: what can we learn from Netflix? Molecular and serum markers in hepatocellular carcinoma: predictive tools for prognosis and recurrence Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression Sarcosine in prostate cancer tissue is not a differential metabolite for prostate cancer aggressiveness and biochemical progression Sarcosine in urine after digital rectal examination fails as a marker in prostate cancer detection and identification of aggressive tumours Metabolomics approaches for discovering biomarkers of drug-induced hepatotoxicity and nephrotoxicity Significance of DNA polymerase delta catalytic subunit p125 induced by mutant p53 in the invasive potential of human hepatocellular carcinoma This study was supported by NIH Grant R01CA143420 awarded to HWR. The authors have declared that no competing interest exists.