key: cord-0003234-mowj6wyl authors: Zhou, Xuezhong; Lei, Lei; Liu, Jun; Halu, Arda; Zhang, Yingying; Li, Bing; Guo, Zhili; Liu, Guangming; Sun, Changkai; Loscalzo, Joseph; Sharma, Amitabh; Wang, Zhong title: A Systems Approach to Refine Disease Taxonomy by Integrating Phenotypic and Molecular Networks date: 2018-04-06 journal: EBioMedicine DOI: 10.1016/j.ebiom.2018.04.002 sha: 2bfa3061be84264c2b2bfb1a972894bd847b550f doc_id: 3234 cord_uid: mowj6wyl The International Classification of Diseases (ICD) relies on clinical features and lags behind the current understanding of the molecular specificity of disease pathobiology, necessitating approaches that incorporate growing biomedical data for classifying diseases to meet the needs of precision medicine. Our analysis revealed that the heterogeneous molecular diversity of disease chapters and the blurred boundary between disease categories in ICD should be further investigated. Here, we propose a new classification of diseases (NCD) by developing an algorithm that predicts the additional categories of a disease by integrating multiple networks consisting of disease phenotypes and their molecular profiles. With statistical validations from phenotype-genotype associations and interactome networks, we demonstrate that NCD improves disease specificity owing to its overlapping categories and polyhierarchical structure. Furthermore, NCD captures the molecular diversity of diseases and defines clearer boundaries in terms of both phenotypic similarity and molecular associations, establishing a rational strategy to reform disease taxonomy. Disease taxonomy plays an important role in defining the diagnosis, treatment, and mechanisms of human diseases. The principle of the current clinical disease taxonomies, in particular the International Classification of Diseases (ICD), goes back to the work of William Farr in the nineteenth century and is primarily derived from the differentiation of clinical features (e.g. 
symptoms and micro-examination of diseased tissues and cells) (Council et al., 2011) . Despite its extensive clinical use, this classification system lacks the depth required for precision medicine with the limitations of its rigid hierarchical structure and, moreover, it does not exploit the rapidly expanding molecular insights of disease phenotypes. For example, many diseases (e.g. cancer, chronic inflammatory diseases) in the current disease taxonomies have either high genetic heterogeneity (Bianchini et al., 2016; McClellan and King, 2010) or manifestation diversity (Arostegui et al., 2014; Jeste and Geschwind, 2014; Mannino, 2002) , which give little basis for tailoring treatment to a patient's pathophysiology. Furthermore, disease comorbidities (Hu et al., 2016; Lee et al., 2008; Hidalgo et al., 2009) , temporal disease trajectories (Jensen et al., 2014) in clinical populations, various molecular relationships between disease-associated cellular components and their connections in the interactome (Blair et al., 2013; Goh et al., 2007; Barabasi et al., 2011; Rzhetsky et al., 2007; Zhou et al., 2014) , and many successful drug repurposing cases (Li and Jones, 2012; Chong and Sullivan Jr., 2007; Ashburn and Thor, 2004; Wu et al., 2016; Evans et al., 2005) altogether demonstrate the vague boundary between different diseases in current disease taxonomies. Moreover, the deep understanding of diseases based on the advances in disease biology, bioinformatics, and multi-omics data necessitates the reclassification of disease taxonomy (Mirnezami et al., 2012) . 
In the past decade, efforts to reclassify diseases based on molecular insights have increased, with studies on molecular-based disease subtyping in different disease conditions, such as acute leukemias (Golub et al., 1999; Alizadeh et al., 2000), colorectal cancer (Dienstmann et al., 2017), oesophageal carcinoma (Cancer Genome Atlas Research et al., 2017), pancreatic cancer (Bailey et al., 2016), cancer metastasis (Chuang et al., 2007), neurodegenerative disorders (Mann et al., 2000), autoimmunity disorders (Ahmad et al., 2003), and multiple cancer types across tissues of origin (Hoadley et al., 2014), as well as a network-based stratification method for cancer subtyping (Hofree et al., 2013). Further insights will arise from integrating all types of biomedical data within a single framework to exploit disease-disease relationships. Data integration methods that utilize multiple types of data, including ontological and omics data, have been used to classify and refine disease relationships (Gligorijevic and Przulj, 2015; Menche et al., 2015; Gligorijevic et al., 2016). Despite these efforts, the development of a molecular-based disease taxonomy that links molecular networks and pathophenotypes remains challenging (Hofmann-Apitius et al., 2015; Jameson and Longo, 2015). Here, we aim to refine a widely used clinical disease classification scheme, the ICD. To achieve this, we first quantify the category similarity among the ICD chapters using ontology-based similarity measures and investigate the molecular connections of disease pairs in the same ICD chapters. Furthermore, we examine the correlation between category similarity and molecular similarity, and check for the heterogeneity of molecular specificity and the correlated boundaries between categories in the ICD taxonomy. Finally, we construct a new classification of diseases (NCD) with overlapping structures.
The aim is to provide clear boundaries between distinct diseases belonging to different categories using a new disease classification scheme (Fig. 1 & Fig. S3).

Fig. 1. Overview of the new disease taxonomy construction and validation. a. Similarity calculation between the disease pairs in the ICD taxonomy, including the calculation of 1) category similarity; 2) phenotype similarity (based on ICD-MeSH term mapping); and 3) molecular profile similarities (based on ICD-UMLS term mapping) of disease pairs in ICD. b. Module or community annotations of the disease association network by chapters in ICD or NCD. We generate a disease association network, in which nodes represent diseases and the link weights represent their corresponding phenotype or molecular profile similarities. The module annotations of the disease network correspond to ICD chapters or NCD categories. c. Construction of the integrated disease network (IDN) and generation of NCD. The links of the IDN are fused from the multiple similarities (e.g. phenotype similarity and shared gene similarity). Based on the IDN, NCD is generated by community detection algorithms with overlapping disease members. d. Quality evaluation and validation of ICD and NCD. The molecular specificity (or inverse molecular diversity) and network modularity are used to evaluate and compare the quality of the two disease taxonomies. Furthermore, we validate the robustness of NCD with two independent phenotype-genotype association datasets, namely GWAS and PheWAS.

In this work, large curation efforts were undertaken to generate the related data sources (for details see Supplementary Materials (SM) Section 1). We obtained the updated text version of ICD-9-CM (2011) and extracted the list of ICD codes with their hierarchical structures.
While we recognize the improvements of the currently used ICD-10 over ICD-9, we chose to use ICD-9-CM because the adoption of ICD-10 has been slow in the United States (Butler, 2014) and because it was still widely used at the time of the data collection for this paper (Blair et al., 2013; Wang et al., 2017). Furthermore, although ICD-10 has more codes than ICD-9-CM, the structure is kept almost the same. We obtained high-quality phenotype-genotype (disease-gene) associations from the DiseaseConnect database (2015 version) (Liu et al., 2014), leaving out the less reliable text mining entries and focusing only on the genome-wide association study (GWAS), Online Mendelian Inheritance in Man (OMIM) and differential expression evidence types, and manually mapped the diseases from Unified Medical Language System (UMLS) codes to ICD and MeSH codes (SM Section 1.6). To calculate the molecular network and phenotype characteristics related to disease phenotypes, we filtered a high-quality subset of human protein-protein interactions from STRING v9.1 (Franceschini et al., 2013) using a score threshold of ≥ 700, and adopted a well-established disease-phenotype (disease-symptom) association dataset (i.e. the disease network with symptom similarity, HSDN) derived from PubMed bibliographic records, together with the gene ontology annotations from the NCBI Gene database. To ensure that the results are not biased by computational predictions in the STRING database, we replicated the classification pipeline with manually curated PPI networks, which rely only on physical protein interactions with experimental support, and found that the results are robust (SM Section 8.3).
In addition, to validate the robustness of our results with independent data sources, we filtered the GWAS and Phenome-Wide Association Studies (PheWAS) data from the University of California Santa Cruz (UCSC) Genome Browser (Tyner et al., 2017) and the PheWAS catalog (Denny et al., 2010), respectively, and performed an additional ICD mapping task to prepare the data for validation analysis. The GWAS evidence of the DiseaseConnect database, which we used to build the disease associations, comes from the National Human Genome Research Institute (NHGRI) GWAS catalog (Welter et al., 2014), whereas for validation we used the UCSC-GWAS Genome Browser. We have ensured that the GWAS data used to build the networks and the GWAS data used to validate them have a very small overlap (SM Section 8).

Here, we systematically evaluated the consistency of disease categories in the ICD taxonomy from both clinical phenotype and molecular profiles (details are in SM Section 2). We investigated the quality of the ICD disease taxonomy by evaluating the correlation between the closeness of disease pairs in the disease taxonomy and the underlying molecular connections (and symptom phenotype similarities) between disease pairs. For example, if two diseases have close positions (e.g. have a low-level common parent disease) in the disease taxonomy, then we would expect that they might have common genes, shared protein-protein interactions or similar phenotypes. We calculated the category similarity between disease pairs using a widely used semantic similarity measure (i.e. the Lin measure using information content) (Lin, 1998; Pesquita et al., 2009) to represent the closeness of disease pairs located in the ICD taxonomy. Information theoretic measures such as information content have been used in the context of ICD-9-CM previously (Dahlem et al., 2015). The category similarity measure takes as input two concepts c1 and c2 and outputs a numeric measure of similarity.
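As a rough illustration of the Lin measure, the following sketch computes sim(c1, c2) = 2·IC(lcs) / (IC(c1) + IC(c2)) on a small tree, where lcs is the lowest common ancestor. The toy taxonomy, the function names and the way information content is estimated (from subtree sizes) are assumptions made for this example only; the paper's own IC estimate is described in SM Section 2.1, so the values below are not expected to reproduce the reported similarities.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a tiny fragment for illustration.
PARENT = {
    "chapter3": "root",
    "240-246": "chapter3",
    "249-259": "chapter3",
    "240": "240-246",
    "250": "249-259",
    "240.0": "240",
    "250.00": "250",
    "250.01": "250",
}


def ancestors(code):
    """Path from a code up to the root, starting with the code itself."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def subtree_sizes():
    """Number of nodes (itself included) in the subtree under every node."""
    nodes = set(PARENT) | set(PARENT.values())
    sizes = {n: 0 for n in nodes}
    for n in nodes:
        for a in ancestors(n):
            sizes[a] += 1
    return sizes


def information_content(code, sizes):
    # Assumption: p(code) is the fraction of taxonomy nodes under the code.
    return -math.log(sizes[code] / sizes["root"])


def lin_similarity(c1, c2):
    """Lin similarity: 2 * IC(lowest common ancestor) / (IC(c1) + IC(c2))."""
    sizes = subtree_sizes()
    anc2 = set(ancestors(c2))
    lcs = next(a for a in ancestors(c1) if a in anc2)
    denom = information_content(c1, sizes) + information_content(c2, sizes)
    if denom == 0:
        return 1.0 if c1 == c2 else 0.0
    return 2 * information_content(lcs, sizes) / denom
```

In this toy tree, two codes under the same three-digit parent score higher than two codes that only share the chapter node, which is the qualitative behaviour the category similarity is meant to capture.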
If two ICD codes have a very specific common parent code in the taxonomic tree structure, then the category similarity would be close to 1. The molecular and phenotype similarities between disease pairs are calculated by evaluating the shared genes and their GO annotations, molecular network similarities, and shared phenotypes with established similarity measures (e.g. the cosine measure and the Jaccard measure). In particular, to obtain a more robust representation of the molecular network profiles of diseases, we partitioned the STRING network into 314 topological modules (Data S2) and used them to construct the relevant module vectors of diseases, using the odds ratio (OR) as the weighting measure. For example, an ICD disease code would be represented by a 314-dimensional vector, whose $i$-th entry has the value $w_{ij}$ if at least one of its related genes is in module $i$, or 0 otherwise. Suppose we have $N$ genes in total and $m_i$ genes in module $i$. For a disease $d_j$ with $n_j$ genes, of which $k_{ij}$ overlap with module $i$, we calculated the value of $w_{ij}$ as the odds ratio defined over these four counts. After the molecular module vector (i.e. OR weighting) of each disease was constructed, we used the cosine measure to calculate the molecular module similarity between disease pairs.

Furthermore, as the ICD taxonomy proposes a framework for organizing the diseases, it is expected that there should be more overlapping molecular interactions or phenotype relationships between diseases of the same chapter than between those of different chapters. Thus, we assumed that when we collapse the ICD chapters into module annotations, such that all the diseases in one chapter are considered members of the same module, the modularity of the disease association networks, i.e. the disease networks with molecular or phenotype associations as links, would reflect the quality of the ICD disease taxonomy. This means that the higher the modularity, the higher the quality of the ICD chapters as a disease category framework.
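The OR-weighted module vectors and their cosine similarity, as described above, can be sketched as follows. The exact OR expression is not reproduced in this text, so the code assumes the standard odds ratio of the 2×2 table (disease gene or not, module gene or not), i.e. $w_{ij} = k_{ij}(N - m_i - n_j + k_{ij}) / [(m_i - k_{ij})(n_j - k_{ij})]$; the 0.5 continuity correction for empty cells and all function names are likewise assumptions of this example.

```python
import math


def odds_ratio_weight(N, m_i, n_j, k_ij):
    """Assumed OR weight of module i for disease j; 0 if no gene overlaps."""
    if k_ij == 0:
        return 0.0
    a = k_ij                   # disease genes inside the module
    b = n_j - k_ij             # disease genes outside the module
    c = m_i - k_ij             # module genes not related to the disease
    d = N - m_i - n_j + k_ij   # genes in neither set
    if b == 0 or c == 0 or d == 0:
        # Assumption: Haldane-Anscombe correction to avoid division by zero.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)


def module_vector(disease_genes, modules, N):
    """One OR weight per module (314 modules in the paper; fewer here)."""
    return [
        odds_ratio_weight(N, len(m), len(disease_genes), len(disease_genes & m))
        for m in modules
    ]


def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two diseases whose genes concentrate in the same modules obtain a cosine close to 1 even when they share no gene, which is why module similarity complements the shared-gene measure.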
To evaluate the quality of community structures in complex networks, the modularity measure (Newman, 2006) was proposed to quantify the extent to which the connections within communities exceed the random expectation in the whole network. Let a network have $m$ edges and let $A_{vw}$ be an element of the adjacency matrix of the network. Suppose the vertices in the network are divided into communities such that vertex $v$ belongs to community $c_v$. Then the modularity $Q$ is defined as:

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w),$$

where the function $\delta(i,j)$ is 1 if $i = j$ and 0 otherwise, and $k_v$ is the degree of vertex $v$. The value of the modularity lies in the range $[-1/2, 1]$. It is positive if the number of edges within groups exceeds the number expected on the basis of chance. Otherwise, it is zero or negative. We use it to measure the consistency of disease categories (ICD chapters or NCD) as an annotation of topological module (or community) structures within disease networks. We hypothesize that if a disease category framework captures the molecular or phenotypic profiles of diseases, then there would be more links between the disease members in a category than the random expectation.

As a quantification of the molecular diversity (or the inverse specificity) of a disease, we calculated the maximum betweenness of disease-related genes in the PPI network (Data S3). Betweenness (Freeman, 1977) is a widely used centrality measure that quantifies how many shortest paths run through a given node. In particular, bridging nodes that connect disparate components of the network often have a high betweenness. The betweenness centrality of a node $v$ is given by:

$$b(v) = \sum_{s,t} \frac{n_{st}(v)}{g_{st}},$$

where $n_{st}(v)$ denotes the number of shortest paths from $s$ to $t$ that pass through $v$ and $g_{st}$ is the total number of shortest paths from $s$ to $t$. We adopt the convention that $n_{st}(v)/g_{st} = 0$ if both $n_{st}(v)$ and $g_{st}$ are zero. We assume that the molecular diversity of a disease is largely determined by its related gene with the maximum betweenness.
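The two network measures defined above can be sketched in plain Python as follows. This is a minimal illustration for unweighted, undirected graphs without self-loops: the modularity is computed in its equivalent per-community form, and the betweenness uses Brandes' algorithm without normalisation (the normalisation behind the values reported in the paper is not stated here, so the scales may differ). The function names and data layouts are assumptions of this example.

```python
from collections import deque


def modularity(edges, community):
    """Newman modularity Q; `community` maps each node to its label."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of edges that fall inside a community.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction: sum over communities of (total degree / 2m)^2.
    degree_sum = {}
    for node, k in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in degree_sum.values())
    return inside - expected


def betweenness(adj):
    """Unnormalised betweenness (Brandes); adj maps node -> set of neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was visited twice in an undirected graph.
    return {v: b / 2 for v, b in bc.items()}


def max_betweenness(disease_genes, bc):
    """Molecular diversity proxy: largest betweenness among a disease's genes."""
    return max((bc.get(g, 0.0) for g in disease_genes), default=0.0)
```

Genes absent from the network are given a betweenness of 0 here, which is a simplification of this sketch rather than a documented choice of the paper.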
For example, to quantify the molecular diversity (in terms of maximum betweenness) of Alzheimer's disease (AD), we calculated the betweenness values of all AD-related genes, such as APP, APOE, TNF and NOS3. We then took the molecular diversity of AD to be 8.44e-3, since APP has the maximum betweenness of 8.44e-3 among those genes (see Fig. S5a). In fact, this kind of measurement has been successfully used in a previous study to evaluate the diversity of diseases, which indicated that the diversity of disease manifestations has a strong positive correlation with the molecular diversity of diseases. For a disease taxonomy of good quality, we would expect its lowest-level diseases (the leaf nodes in the tree-structured disease taxonomy) to have similar molecular diversities.

We calculated the edge density to quantify the molecular interactions between ICD chapters. To further detect the significant interactions between diseases in different chapters, we devised an approach to identify the diseases that have significant interactions with diseases in chapters other than their own. Given a disease $d_i$ under investigation, we evaluate whether the proportion of interactions (i.e. edge density) of $d_i$ with the disease set $D_{C_k}$ of a chapter $C_k$ is significantly larger than the average proportion of interactions between the diseases in $C_k$ (Fig. S6). We use the binomial test to filter the significantly interacting disease-chapter pairs, in which the edge density of the disease to the chapter is significantly higher than the average edge density of the diseases in the corresponding chapter (details are in SM Section 4). The positive correlations between category similarity and molecular similarity, together with the high molecular diversity of many diseases, imply that it would be possible to predict the multi-category map for each disease using its underlying molecular connections.
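The disease-chapter filter described above can be illustrated with a one-sided binomial test. In this sketch, a disease has `links_to_chapter` links to the `chapter_size` diseases of a chapter, and the null success probability is the average edge density within that chapter. The function names and the significance level of 0.05 are assumptions of this example; the thresholds actually used are given in SM Section 4.

```python
from math import comb


def binomial_upper_tail(x, n, p0):
    """P(X >= x) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(x, n + 1))


def significant_disease_chapter(links_to_chapter, chapter_size, chapter_density, alpha=0.05):
    """Return (p-value, flag) for one disease-chapter pair.

    The flag is True when the disease links to the chapter more densely
    than the chapter's own average density would explain by chance.
    alpha = 0.05 is an assumed threshold, not taken from the paper.
    """
    p_value = binomial_upper_tail(links_to_chapter, chapter_size, chapter_density)
    return p_value, p_value < alpha
```

For instance, a disease linked to 30 of the 50 diseases in a chapter whose internal density is 0.2 would be flagged, whereas one linked to 10 of them would not.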
To demonstrate a pilot method for multiple disease category prediction by integrating molecular module and shared gene similarities, we developed a novel algorithm that generates the possible additional disease categories for a given disease, together with the corresponding molecular association scores (details are in SM Section 5, Fig. S7). In this algorithm, we integrated the correlation between category similarity and module similarity with significant disease-chapter associations (which are based on the shared gene similarity) to predict the additional chapters for a given disease. We divide the disease pairs in the same chapter into three subsets, which correspond to pairs with shared root parents, shared second-level intermediate parents and shared third-level intermediate parents, respectively, to help predict how closely a pair of diseases would be located in the disease taxonomy. The principle of the algorithm follows from the positive correlation between category similarity (or the closeness of the positions of the disease pairs in the ICD disease taxonomy) and molecular profile similarity of disease pairs, which means that strong molecular profile similarity between two diseases would indicate that they are located close to each other in the disease taxonomy. To retain only the significant disease-chapter associations, we next filtered the predicted disease-chapter associations with positive association scores by the significant disease-chapter interactions based on shared genes.

To integrate disease associations derived from both molecular and phenotype features, we performed several sequential analytical steps to generate a highly reliable disease network with strict filtering criteria for the disease links (details are in SM Section 6).
Firstly, we generated three disease association networks: the disease network with module similarity (MSDN) with 598,420 links and 1744 nodes, the disease network with shared genes (SGDN) with 133,469 links and 1868 nodes, and the disease network with symptom similarity (HSDN) with 1,639,791 links and 1814 nodes (Fig. S10 & Table S10), according to molecular module similarity, shared genes and shared phenotypes, respectively. To reduce the possible noise and bias of disease-related data sources, we applied a multi-scale backbone algorithm (Serrano et al., 2009) to obtain highly reliable disease links (with weights significantly higher than the random expectation) from the three disease networks. We finally obtained 53,241, 8554 and 134,370 highly reliable links for MSDN, SGDN and HSDN, respectively, and retained most nodes (1744 for MSDN, 1782 for SGDN and 1814 for HSDN) of these networks. To further reduce the possible weak associations (disease pairs with high module similarity but no direct protein interactions) derived from module similarity, we calculated the minimum shortest path length (MSPL) between each disease pair and used it as a filtering criterion (MSPL ≤ 1) for MSDN, which resulted in a more biologically meaningful subset of MSDN with 33,611 links and 1694 diseases. SGDN captures strong associations between disease pairs if they have a high degree of shared genes, even if their related genes do not form functional modules. In contrast, MSDN gives high weights to disease links if the disease pairs are co-located on the topological modules of the molecular network, even if they have no shared genes. Therefore, MSDN and SGDN provide two complementary kinds of molecular association evidence for disease pairs, and we took the union of the MSDN subset and SGDN as the molecular association disease network (MADN), which contains 35,389 links and 1811 nodes with the weights derived from the two original networks.
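The backbone extraction step can be illustrated with the disparity filter of Serrano et al. (2009). For a node of degree $k$ and strength $s$, an edge of weight $w$ receives the p-value $(1 - w/s)^{k-1}$, and the edge is kept if it is significant for at least one of its endpoints. The significance level of 0.05 and the edge-list layout below are assumptions of this sketch; the settings actually used for MSDN, SGDN and HSDN are described in SM Section 6.

```python
def disparity_backbone(weighted_edges, alpha=0.05):
    """Keep edges whose weight is unexpectedly large for at least one endpoint.

    weighted_edges: list of (u, v, weight) tuples of an undirected graph.
    alpha: assumed significance level (not taken from the paper).
    """
    strength, degree = {}, {}
    for u, v, w in weighted_edges:
        for node in (u, v):
            strength[node] = strength.get(node, 0.0) + w
            degree[node] = degree.get(node, 0) + 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # a single edge cannot be judged against the others
        return (1.0 - w / strength[node]) ** (k - 1)

    return [
        (u, v, w)
        for u, v, w in weighted_edges
        if min(p_value(u, w), p_value(v, w)) < alpha
    ]
```

Because each edge is judged relative to the local weight distribution of its endpoints, the filter can retain strong links around both hub diseases and sparsely connected ones, which a single global weight threshold would not do.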
Next, we adopted a highly strict criterion to obtain an integrated disease network (IDN) from the fusion of MADN and HSDN links, which contains 35,114 disease links and 1857 nodes. Finding the overlapping disease categories could be transformed to the task of detecting the overlapping communities (i.e. modules) from the IDN. BigClam (Yang and Leskovec, 2013 ) is a state-of-the-art overlapping community detection algorithm based on a variant of nonnegative matrix factorization, which achieves near linear running time and comparable high quality community results. We used the BigClam algorithm, which is packaged in SNAP complex network software (http://snap.stanford.edu/snap/) to automatically detect overlapping communities from IDN network. Finally, we obtained 223 overlapping disease communities with 1797 distinct ICD disease codes. These 223 disease subcategories contain different numbers of ICD codes, ranging from 5 to 168 (Fig. S12 & Data S10). To obtain a top-level category framework of diseases corresponding to the chapters in ICD taxonomy, we calculated the overlapping degree of the 223 disease sub-categories by using Jaccard similarity to measure the common number of diseases held by two given disease categories. This generated a disease category network with 2685 links representing shared ICD codes (a link is established if two disease categories share at least an ICD code and the weights of links correspond to the Jaccard similarity) and nodes representing disease categories. After that, we clustered the 223 disease sub-categories additionally by a widely used non-overlapping community detection algorithm (considering the link weight and setting the resolution parameter as 0.5) into 17 top-level categories (which corresponds to the number of original chapter-level categories in ICD, which we named as New Chapters, NCs) using the shared ICD codes (Fig. S11c & Data S10). 
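The construction of the category network from overlapping communities, as described above, can be sketched as follows: two sub-categories are linked if they share at least one ICD code, and the link weight is their Jaccard similarity. The community names and codes in the usage example are placeholders and are not taken from Data S10.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def category_network(communities):
    """Weighted links between overlapping communities.

    communities: dict mapping a community name to its set of ICD codes.
    Returns {(name1, name2): jaccard} for pairs sharing at least one code.
    """
    links = {}
    for n1, n2 in combinations(sorted(communities), 2):
        if communities[n1] & communities[n2]:
            links[(n1, n2)] = jaccard(communities[n1], communities[n2])
    return links


# Usage with placeholder communities:
example = {
    "c1": {"250.00", "250.01", "272.4"},
    "c2": {"272.4", "401.9"},
    "c3": {"162.9"},
}
example_links = category_network(example)
```

The resulting weighted network is the kind of input that a non-overlapping community detection algorithm would then cluster into top-level categories.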
The modularity of these 17 top-level categories (this makes a good comparable partition with ICD chapters) in the network of 223 sub-categories is 0.426, which means a rather good partition of the network. These 17 NCs contain different numbers of sub-categories ranging from 4 to 25 or of diseases ranging from 53 to 369 (Fig. S11c & Data S10), covering diseases from all of the 17 chapters of ICD taxonomy (Table S11) . These 17 NCs would still contain overlapping disease codes since the 223 disease subcategories have overlapping disease codes. Therefore, 17 NCs with 223 disease sub-categories form a disease taxonomy consisting of two hierarchical levels with polyhierarchical categories although with a limited number (1797) of disease members. To validate the robustness of NCD, we obtained two external phenotype-genotype data sources (i.e. UCSC-GWAS and PheWAS catalog), which have not been integrated yet for generating NCD for further investigation. By measuring whether the disease members in the subcategories in NCD tend to incorporate the associations of shared genes from these two data sources, we would be able to validate the quality of NCD. If the diseases in such NCD sub-categories would tend to involve shared genes, then the diseases would be more likely associated with one another than other diseases. To test this hypothesis, we obtained the overlapping disease codes (ODC) in both NCD and the two external phenotype-genotype association databases and evaluate the degree of these ODC disease links in each NCD sub-category when considering two diseases linked if they share common genes. In detail, we firstly obtained the common disease codes involved in both NCD and UCSC-GWAS or PheWAS database. Then we generated a disease network with shared genes derived from the two datasets, in which two diseases linked if they shared at least one common gene. 
After that, for each NCD sub-category, we generated a complete disease network of the ODC diseases in it and overlaid this network on the disease network with shared genes. Finally, the overlapping percentage of disease links was calculated to evaluate the degree of molecular association among the diseases in each NCD sub-category (details are in SM Section 8, Figs. S23-S25).

We used R 3.1.0 as the main statistical tool in our work. The comparison of two percentages was calculated by the binomial test or the Chi-squared test. The Wilcoxon rank sum test was used to compare two independent lists of values (e.g. two types of molecular diversities and two groups of MSPLs). All the correlations between two variables were calculated by Pearson's product moment correlation coefficient. Due to the incompleteness and bias of disease-related data (i.e. disease-gene associations and disease-symptom associations), we need to distinguish the information from the background noise. Therefore, for comparison with random expectation, we reshuffled (100 random permutations) the symptom features and the related genes of each disease using the Fisher-Yates method (Fisher and Yates, 1948). The calculations from random permutations were used for the correlation between category similarity and molecular similarity, as well as phenotype similarity. In addition, they were used for the detection of disease categories with high molecular diversity.

We curated 1883 distinct ICD disease codes (Table S1) from the 5-level tree structure of 14,292 ICD-9-CM codes, as well as high-confidence protein-protein interactions consisting of 15,551 nodes and 218,409 edges (Franceschini et al., 2013). We compiled 153,277 distinct disease-gene associations between 4552 distinct diseases in UMLS codes and 14,975 genes reported in the DiseaseConnect database (Liu et al., 2014) (Fig. S3).
Next, by manually mapping the DiseaseConnect identifiers to ICD codes, we obtained 160,754 disease-gene records involving 1883 distinct ICD codes and 14,906 genes (Figs. S1-2 and Data S1). To evaluate the closeness of two diseases in the ICD tree structure, we applied an established semantic similarity algorithm named category similarity (see Methods, SM Section 2.1). This similarity measure is based on the information content, which quantifies the specificity of a term and can be applied to any categorization scheme that has a rooted tree structure, including the ICD-9-CM disease classification. We then created a disease network comprising 1883 nodes (representing ICD codes) and 154,563 edges. The edge weight reflects the category similarity values and higher values reflect higher similarity between diseases whose code positions are adjacent in the ICD tree. The category similarity distribution showed that most disease pairs (135,271, 87.52%) had similarities between 0.2 and 0.5 (Fig. S4a) . Disease pairs within this range mostly belong to different disease subcategories in the same chapter, such as diseases of other endocrine glands and disorders of thyroid gland. For example, type 2 diabetes (ICD: 250.00) and simple goiter (ICD: 240.0), which are in ICD chapter 3, have a category similarity of 0.37. However, there do exist disease pairs with high category similarities, such as type 2 diabetes (ICD: 250.00) and type 1 diabetes (ICD: 250.01) with a category similarity 0.83. Overall, this measure indicates the capability of ICD in bringing together similar diseases in its tree structure, and the overrepresentation of lower similarity scores is indicative of its limitations in doing so. While the ICD classification was derived from clinical manifestations (including symptoms and signs) and does not necessarily reflect the connections among the molecular components of diseases, it is informative to quantify to what extent it carries molecular information. 
We investigated the correlations of the category similarity of disease pairs with 1) the degree of shared genes and shared clinical phenotypes, 2) GO term (Cellular Component, Molecular Function, Biological Process) similarity (Mistry and Pavlidis, 2008), and 3) topological similarity (i.e., minimum shortest path length and molecular module similarity) among them (Methods, SM Section 2.2). We found that close disease codes (disease pairs with a high category similarity) indeed have higher clinical phenotype similarity (Methods, SM Section 2.3), which is consistent with the construction principle of the ICD taxonomy based on symptom phenotypes (Fig. S4b, PCC = 0.960, 95% CI = [0.854, 1.000], p = 2.079e-05). Furthermore, we observed strong correlations for molecular profiles as well, with higher category similarity bins showing higher molecular similarity than lower category similarity bins (Fig. S4c-i and Table S2; see Methods, SM Section 2 for detailed information). In particular, we observed that in addition to the strongly positive correlations, the percentage overlap of disease pairs with shared genes was generally larger than in the random controls (Fig. S4c and d, see Methods, SM Section 2.4). The top 10 disease pairs with the largest number of shared genes are all from Chapter 2, which consists of cancer types. This might reflect the fact that cancers are the most studied and most complex disease phenotypes, involving various gene mutations (Table S3, see Methods, SM Section 2 for detailed information). Overall, these findings indicate that diseases in the same ICD chapter tend to have a higher degree of shared genes, and the closer their positions in the ICD tree, the higher the degree of shared genes.

We measured the maximum betweenness of disease-related genes in the protein-protein interaction (PPI) network to quantify the molecular diversity (the inverse of specificity) of each disease, as described previously (see Methods, SM Section 3). A high maximum betweenness indicates a high molecular diversity.
For example, the molecular diversity of Alzheimer's disease could be represented by the maximum betweenness of its related genes (i.e., the betweenness of the APP gene) in the PPI network (Fig. S5a). We observed that the molecular diversity of diseases in the ICD taxonomy is heterogeneous, with molecular diversity values varying from $10^{-8}$ to $10^{-2}$ with a median value of 8.93e-04 (Fig. 2a and Data S3). The top two disease chapters with the highest median molecular diversity were Chapter 2 (3.87e-03) and Chapter 1 (1.31e-03) (Fig. 2b). Furthermore, we found that the neoplasm (Chapter 2) and infectious disease (Chapter 1) categories tended to have higher molecular diversity compared to their complementary categories (neoplasms vs. non-neoplasms, p < 2.2e-16; infectious diseases vs. non-infectious diseases, p = 2.0e-02; Fig. S5b-c) and to random controls. We also found that disease categories annotated as "other/unspecified" (SM Section 3.1) had higher molecular diversity compared to disease categories with specific conditions (p = 9.75e-03, Fig. S5d, Data S4; see SM Section 3.1) and to their random control. These results indicate that the diseases in the neoplasms, infectious diseases, and "other/unspecified" categories should be further investigated for molecular subtypes. A detailed discussion of disease cases is offered in SM Section 3, Data S5 & Tables S4-S5.

In the current ICD taxonomy, we observed many instances where there exists a significant number of links between diseases in different chapters, comparable to the number of links between diseases within the same chapter (Table S6 & Fig. 2d, see Methods & SM Section 4). For example, strong shared-gene relationships were detected between respiratory diseases (Chapter 8) and mental, behavioral, and neurodevelopmental disorders (Chapter 5) (Fig. 2c-d; more examples are shown in SM Section 4, Tables S7-9).
In addition, by calculating the shared molecular connections between diseases in the context of chapters, we could detect 768 diseases with a significant number of shared genes with diseases other than those in their own chapters (Data S6 & SM Section 4). To further quantify the molecular boundaries between the disease categories in ICD disease taxonomy, we evaluated the modularity, a structural measure of the tendency of the network to form close-knit communities (see Methods, SM Section 2.5), of disease networks generated by either shared molecular profiles or shared phenotypes. When we mapped ICD chapters as grouping annotations on the disease networks filtered with appropriate weight thresholds and calculated their modularity, we observed very low modularity values (Fig. 2e). Since modularity is a widely used measure to validate the quality of partitions/module structures in complex networks, this means that the grouping of ICD chapters does not agree with the natural topological groupings of their corresponding molecular networks (disease modules). This finding gives strong evidence for the blurred disease boundaries of the ICD taxonomy, possibly arising from the complexity of the underlying molecular mechanisms, in particular the possible overlap of their respective subnetworks, or disease modules, in the interactome. Furthermore, although the modularity of disease networks with shared phenotypes (similarity ≥ 0.1) is slightly positive, the weak correlation (PCC = 0.08, p-value = 0.7588) between phenotypic similarity and category similarity of disease pairs in each chapter (Fig. 2f) indicates that ICD taxonomy does not adequately incorporate phenotype similarity knowledge into disease category structures.
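To make the modularity argument concrete, the following minimal sketch scores a fixed grouping (for instance, chapter labels) on a weighted disease network. It assumes the standard Newman modularity for undirected graphs without self-loops; the function name `modularity`, the toy network, and the labels are hypothetical, and the study's own procedure is described in SM Section 2.5.

```python
from collections import defaultdict


def modularity(edges, label):
    """Newman modularity of a given partition.

    edges: iterable of (u, v, weight) for an undirected graph, no self-loops.
    label: dict mapping each node to its group (e.g., a chapter).
    """
    edges = list(edges)
    total = sum(w for _, _, w in edges)      # total edge weight m
    intra = defaultdict(float)               # weight of edges inside each group
    degree = defaultdict(float)              # summed node strength per group
    for u, v, w in edges:
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            intra[label[u]] += w
    return sum(intra[g] / total - (degree[g] / (2.0 * total)) ** 2
               for g in degree)


# Toy network (hypothetical): two triangles joined by a single edge.
toy_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
             ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
             ("c", "d", 1)]

# Grouping that follows the network structure.
aligned = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "f": "Y"}
# Grouping that cuts across the network structure.
misaligned = {"a": "X", "b": "Y", "c": "X", "d": "Y", "e": "X", "f": "Y"}

q_aligned = modularity(toy_edges, aligned)
q_misaligned = modularity(toy_edges, misaligned)
print(q_aligned, q_misaligned)
```

A grouping that follows the network's dense regions scores clearly above zero, whereas a grouping that cuts across them scores near or below zero, which is the pattern the text reports for ICD chapters.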
These observations indicate that the strict tree structures in the ICD taxonomy, wherein terms can only have one lineage (Cimino, 2011), may be inefficient for disease classification given the contemporary knowledge of disease pathobiology, and should therefore be refined to be polyhierarchical in structure. It has been proposed that if two disease modules overlap in the molecular interaction network, local perturbations in one disease might disrupt the biological pathways in the other disease, resulting in shared pathobiological characteristics. We observed a strong positive correlation between category similarity and module similarity (see Methods, SM Section 5.1) of diseases, indicating that two diseases with higher module similarity would be more closely localized in the disease category (Fig. 3a). Here, we utilize the module similarity between disease pairs to predict the categories of similar diseases. In particular, we determine the taxonomic closeness (SM Section 2.1) of each given disease pair to predict the additional categories of diseases, by applying heuristic rules incorporating the positive correlation between category similarity and module similarity (see Methods, SM Section 5.1). Specifically, using the 598,420 disease pairs with positive module similarity (Data S7), we generated 2057 predicted additional category results for 722 out of 1883 disease codes (38.3%), in which each disease code had ~4 categories on average (Data S7 & S8). We found that the number of predicted categories positively correlated with the molecular diversity of the original disease codes (Fig. 3c, PCC = 0.547, 95% CI = [0.514, 0.578], p < 4.94e-324; for external validations see SM 5.2). This indicates that diseases with multiple pathogenic pathways could be captured by polyhierarchical mapping. For example, the 20 diseases in Chapter 8 (i.e.
Diseases of the Respiratory System) have been predicted to belong to over five additional chapters, such as neoplasms, infectious diseases, and diseases of the skin and subcutaneous tissue (Fig. 3d), which is consistent with the heterogeneous pathogenesis of COPD and asthma (Grainge et al., 2016; Sharma et al., 2015). A detailed discussion on the polyhierarchical map of the mental disorders is offered in SM Section 5.3 (Fig. S8). Furthermore, we found that disease pairs in the predicted category framework, which is based only on molecular module similarity, also had higher phenotype similarity than disease pairs with shared root codes in the original ICD chapters (see SM Section 5.2, median: 0.0703 vs. 0.0563; mean: 0.125 vs. 0.109; p < 2.2e-16, Fig. 3e). This observation helps to establish that the predicted category results are of higher quality than ICD with respect to their phenotype homogeneity. To extend and redefine disease concepts by discovering additional categories of a disease, we generated a novel disease taxonomy by constructing an integrated disease network (IDN) with: (a) shared clinical phenotypes including shared symptoms; (b) shared molecular profiles including (i) shared genes and molecular module similarity and (ii) shortest path lengths in the PPI network, based on a systematic integration process to filter out possible false positive associations (see Methods, SM Section 6, Fig. S9 and Fig. S11a). The resulting IDN includes 1857 diseases and 35,114 links (Data S9). Next, we applied high performance community detection algorithms to identify overlapping community structures in the IDN (see Methods, SM Section 7 and Fig. S11a).

Fig. 2. Lack of molecular specificity in ICD taxonomy and the blurred boundary between disease categories in ICD taxonomy. a. The distribution of molecular diversity of 1883 ICD diseases; b. The boxplot of molecular diversity of 17 ICD chapters (ordered by median values); c. The disease network with shared genes in which the diseases belong to Chapter 5 and Chapter 8. The ICD codes 295, 296 in Chapter 5 have dense relationships to the ICD codes in Chapter 8; d. The disease category network with shared genes. The nodes indicate the disease chapters and the weights of edges represent the edge densities between disease chapter pairs; nodes with the same color are considered a chapter cluster, which is detected by a community detection algorithm; e. Modularity of disease networks with chapters as module annotations; f. The correlation between category similarity and phenotype similarity of ICD chapters.

In particular, we first used BigClam (see Methods) since this method is able to detect overlapping communities whereby a disease can belong to multiple communities, in line with our main premise of creating a molecular-based flexible disease classification. This resulted in 223 disease sub-categories with overlapping diseases as members (Fig. S11a and Data S10), which included 1797 distinct diseases from the ICD taxonomy. These 223 disease sub-categories contain different numbers of ICD codes, ranging from 5 to 168 (Fig. S12); therefore, they represent different levels of disease categories similar to ICD chapters and their sub-categories. To further develop a more unified view of the disease category quality, we used the established BGLL method (Blondel et al., 2008), which detects non-overlapping communities, to cluster these 223 sub-categories further into 17 non-overlapping, distinct parts, such that these represent the 17 new chapter-level categories (called new chapters, or NCs) using the shared ICD codes (see Methods, SM Section 7.2, Fig. S11b). Overall, this clustering order effectively ensures distinct top-level categories that have overlapping sub-categories. The resulting 17 NCs contain different numbers of sub-categories ranging from 4 to 25, or of diseases ranging from 53 to 369 (Fig. S11c).
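The second clustering step above operates on a network of sub-categories linked through the ICD codes they share. As a minimal sketch of how such a network could be assembled, the code below weights each pair of overlapping sub-categories by the number of shared codes. The weighting scheme, the function names, and the toy sub-categories `S1`-`S3` with their memberships are assumptions for illustration only; the actual construction is described in SM Section 7.2, and the resulting edge list would then be passed to a non-overlapping community detection method such as BGLL.

```python
from collections import Counter
from itertools import combinations


def shared_code_edges(subcategories):
    """Weighted edges between sub-categories that share at least one code.

    subcategories: dict mapping a sub-category name to a set of ICD codes.
    Returns a list of (name_a, name_b, number_of_shared_codes).
    """
    edges = []
    for (a, codes_a), (b, codes_b) in combinations(sorted(subcategories.items()), 2):
        shared = len(codes_a & codes_b)
        if shared:
            edges.append((a, b, shared))
    return edges


def categories_per_disease(subcategories):
    """Count how many overlapping sub-categories each ICD code belongs to."""
    counts = Counter()
    for codes in subcategories.values():
        counts.update(codes)
    return counts


# Toy overlapping sub-categories (hypothetical memberships).
toy_subcategories = {
    "S1": {"151", "532", "533"},
    "S2": {"532", "533", "296"},
    "S3": {"296", "487"},
}

toy_links = shared_code_edges(toy_subcategories)
membership = categories_per_disease(toy_subcategories)
print(toy_links)
print(membership["532"], membership["487"])
```

Because a disease may appear in several sub-categories, the membership count doubles as the "number of categories per disease" quantity that the text later relates to molecular diversity.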
We denote the 17 NCs together with their 223 disease sub-categories as our new overlapping disease classification (NCD). Each of the resulting NCs reflects the shared features of integrative molecular and phenotypic profiles (SM Section 7.4; Fig. S11c & Table S12 , Data S11-14). For example, NC08 could be denoted as the "limbic system development-vision disorders-related diseases" since the most enriched PPI module (p = 4.9e-324, Relevance ratio = 0.7778) of its constituent diseases was mainly related to biological process; "limbic system development" (p = 1.13e-04), and 73.84% (127/172) of diseases in NC08 shared the phenotype, "vision disorders" (p = 4.9e-324) (Tables S13-S14). To confirm the phenotypic and molecular cohesiveness of our overlapping disease categories, we compared the modularity of NCD with that of the ICD taxonomy. We found that the 17 NCs consistently have much higher modularity than the original ICD chapters for all types of disease association networks (SM Section 7.3; Fig. 4a-h, Fig. S13 ). This finding indicates that the phenotypic and molecular links between the diseases of a category in NCD are much denser compared to ICD taxonomy. Furthermore, to ensure that the performance of NCD is indeed due to the combined effect of the molecular and phenotypic profiles, we performed a controlled experiment where we determined the new disease categories based on molecular-based networks and phenotype-based networks only by running the entire analytical pipeline and applying the same category prediction algorithm (SM Section 7.3). We found that NCD outperforms both molecular-based categories and phenotype-based categories in capturing the gene similarity, GO term similarity and phenotypic similarity (Figs. S14-S16). This suggests the importance of integrating both clinical phenotypes and molecular profiles to obtain a high-quality disease taxonomy. 
Furthermore, we found that the minimum shortest path lengths (MSPLs) in the PPI network between disease pairs that belong to the same NCD categories had a larger percentage of low values (i.e., [0,2]) compared to ICD (Fig. 4i, 62.86% vs 58.95%, p < 4.9e-324; SM Section 7.3). This result indicates that diseases within an NCD category have a significantly higher degree of shared genes (or shorter path lengths) in comparison to diseases within a category in ICD. On the other hand, the MSPLs between disease pairs in different NCD categories had a significantly lower percentage of low values than those in the ICD (47.27% vs 54.88%, p < 4.9e-324, Fig. 4i & Fig. S17), which indicates a lower degree of shared genes (or shorter path lengths) between diseases from different categories in NCD than in ICD.

Fig. 3 (panels d-e). d. Polyhierarchical map of the disease codes in Chapter 8, indicating that the 20 disease codes in Chapter 8 have significant associations with two disease category clusters: 1) Chapter 1 (infectious disease) and Chapter 2 (neoplasms); 2) Chapter 3 (endocrine, nutritional and metabolic diseases and immunity disorders), Chapter 5 (mental disorders), Chapter 6 (nervous diseases), Chapter 9 (digestive diseases), Chapter 12 (skin and subcutaneous disease) and Chapter 13 (musculoskeletal system and connective tissue diseases); e. The boxplots of phenotype similarity of predicted disease pairs and original ICD disease pairs in the same top-level chapters (p < 2.2e-16, Wilcoxon test).

These findings demonstrate that our NCD framework has clearer boundaries between distinct diseases belonging to different categories than those in the original ICD disease taxonomy. Moreover, to validate the robustness of NCD predictions, we calculated the degree of associations in terms of network density among the diseases in each sub-category of NCD.
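The shortest-path comparison above can be sketched as follows. The code assumes that the minimum shortest path length between two diseases is the smallest hop distance between any gene of one disease and any gene of the other in an unweighted PPI network, with 0 meaning a shared gene; this reading, the helper names, and the toy network and gene sets are assumptions for illustration rather than the study's implementation (SM Section 7.3).

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build a symmetric adjacency map from undirected (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def min_shortest_path_length(adj, genes_a, genes_b):
    """Smallest hop distance between two gene sets; None if unreachable."""
    genes_a, genes_b = set(genes_a), set(genes_b)
    if genes_a & genes_b:
        return 0                              # shared gene
    dist = {g: 0 for g in genes_a if g in adj}
    queue = deque(dist)                       # multi-source breadth-first search
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                if w in genes_b:
                    return dist[w]
                queue.append(w)
    return None


def fraction_low(values, upper=2):
    """Share of disease pairs whose MSPL lies in [0, upper]."""
    values = list(values)
    low = sum(1 for d in values if d is not None and d <= upper)
    return low / len(values) if values else 0.0


# Toy PPI network (hypothetical): a path g1-g2-g3-g4-g5.
toy_ppi = build_adjacency([("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")])
toy_diseases = {"D1": {"g1"}, "D2": {"g3"}, "D3": {"g5"}, "D4": {"g1", "g4"}}

mspl = {
    (a, b): min_shortest_path_length(toy_ppi, toy_diseases[a], toy_diseases[b])
    for a, b in combinations(sorted(toy_diseases), 2)
}
low_share = fraction_low(mspl.values())
print(mspl)
print(low_share)
```

Computing `fraction_low` separately for disease pairs inside the same category and for pairs in different categories gives the two percentages that the text compares between NCD and ICD.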
To this end, we investigated the overlaps with the disease pairs connected by shared genes using two independent phenotype-genotype association databases, GWAS and PheWAS (see Methods, SM Sections 1.2, 1.6 & 8). We found that for the 223 sub-categories in NCD, network density was significantly higher compared to random controls (GWAS: p-value = 9.42e-197, Fig. 4j; PheWAS: p-value = 1.31e-14, Fig. 4k). This means that the diseases in the 223 sub-categories in NCD tend to have shared genes. For example, the new chapter NC12 in NCD, including 11 sub-categories and 136 ICD diseases (belonging to eight ICD chapters), is enriched with respiratory and airway diseases (e.g. COPD and asthma). We obtained 37 overlapped diseases from the GWAS database, which have a high degree of shared genes with the diseases in each sub-category of NC12 (Fig. 5a). In particular, sub-categories such as NC12.M06 (p-value = 2.53e-30), NC12.M03 (p-value = 1.80e-38) and NC12.M02 (p-value = 6.89e-19) have significantly higher density than the whole GWAS disease network (Fig. S22). Furthermore, we found that the overlapping sub-categories of the NCD are able to differentiate between different components (i.e. asthma/allergy vs. COPD) of the same broad group of diseases (i.e. respiratory diseases) (see Fig. S22 for a detailed example). Indeed, in the NC12 disease chapter chiefly containing respiratory diseases, the two sub-categories, namely NC12.M06 and NC12.M07, overlap in the underlying molecular interaction network while still containing the respective disease (asthma and COPD, respectively) genes separately (Fig. 5a). A detailed discussion is offered in SM Section 8 (with results in Data S21-22, Tables S18-19 & Figs. S23-S25). In addition, in NCD, a disease can be classified into multiple categories, and the number of categories of a disease positively correlates with its molecular diversity (Fig.
4l, PCC = 0.352, 95% CI = [0.311, 0.392], p-value < 4.94e-324; external validations in SM 8.3). For example, we reclassified neoplastic diseases into multiple categories due to their high molecular diversity. Two hundred and fifty-eight neoplastic diseases in our NCD were divided into 144 sub-categories and 17 NCs (Figs. S20 & S21). Thirty-nine out of 144 sub-categories (27.08%) were enriched with "neoplasm" diseases (Data S19, p-value = 2.78e-5). There were mainly 4 NCs (i.e., NC01, NC06, NC11, and NC16) containing these 32 sub-categories and 188 "neoplasm" disease codes (Fig. 5b), where 76.06% (143/188) of the neoplastic diseases were classified into more than one sub-category, ranging from 2 to 15 (Data S20 & Table S17). The neoplasm with the highest molecular diversity, "malignant neoplasm of connective and other soft tissue" (ICD: 171; molecular diversity: 0.035), was reclassified into 15 sub-categories, and "malignant neoplasms of thyroid gland" (ICD: 193; molecular diversity: 0.0028) was assigned to 14 sub-categories. Furthermore, related diseases had been reclassified together in NCD, such as the well-known disease correlations among H. pylori infection (ICD: 041.86), stomach cancer (ICD: 151), and duodenal ulcer (ICD: 532) or peptic ulcer (ICD: 533) (Fig. 5b) (Sitas, 2016; Graham, 2015). More interestingly, some diseases, like viral hepatitis C (ICDs: 070 arthritis (ICDs: 714/714.0), each from different chapters in ICD taxonomy, were classified together into a unique NCD sub-category (NC06.M10) since 50% (13/26) of these diseases share a PPI module related to immune response (SM Section 7.4, Fig. 5c, Data S15-16, Tables S15-16). In addition, diseases originally in the same ICD chapter, such as viral pneumonia (ICD: 480) and influenza (ICD: 487) from respiratory system-related diseases (Chapter 8), were reclassified into different categories in the NCD (NC12, NC10).
Influenza shared more phenotype profiles with "episodic mood disorders" (ICD: 296) in NC10.M01 than with viral pneumonia in NC12 (Fig. S18 & Data S17), which is in accordance with recent epidemiological studies of episodic mood disorders and influenza (Okusaga et al., 2011; Canetta et al., 2014). We also found that influenza shared some molecular profiles with "episodic mood disorders" (ICD: 296) in NC10.M01 (Fig. S19, Data S18). These findings suggest that NCD offers a promising integrative framework incorporating both clinical phenotypes and molecular profiles for disease taxonomy that has very practical implications for the precise investigation of disease subtyping and etiologies. Given the molecular network mechanisms (Barabasi et al., 2011; Zanzoni et al., 2009), genetic pleiotropy (Solovieff et al., 2013), as well as complicated genotype-phenotype associations underlying diseases, the establishment of a molecular-based disease taxonomy with clear boundaries is essential but challenging. From the molecular network perspective, we first investigated the utility, shortcomings, and inconsistencies of ICD-9-CM, the established disease taxonomy for clinical settings. We found that a considerable number of diseases (~40% of our investigated diseases), for example, cancer and infectious diseases, have diverse molecular network mechanisms and tend to interact more with diseases from other chapters. It is also these molecularly diverse diseases that mainly contribute to the blurred boundary of ICD disease taxonomy (see Methods, SM Sections 4 & 7). Upon exploring the molecular diversity and cross-chapter interactions between diseases, we propose a novel disease classification system based on the integration of the clinical and molecular profiles of diseases.
In particular, we integrate disease networks taking into account molecular and phenotypic connectivity among diseases, predict the multiple disease categories that diseases belong to, and finally validate the biological cohesiveness of our NCD by network topological measures such as modularity and shortest path length. Our findings indicate that although general correlations exist between disease closeness in ICD taxonomy and underlying molecular profiles, ICD still displays significant limitations with regard to the heterogeneity of molecular diversity and clear category boundaries. In our NCD, a disease with a high molecular diversity tends to be classified into multiple disease categories, which indicates that there exist more disease subtypes for that disease. For example, "malignant neoplasm of the pancreas" was reclassified into 11 sub-categories and 4 NCs, which is consistent with a recent study wherein 4 phenotypic subtypes of pancreatic cancer were enriched for 10 distinct molecular mechanisms (Bailey et al., 2016) . Therefore, we believe that the new disease classification system may help facilitate precise clinical diagnosis and correct prognosis (Jameson and Longo, 2015) , and does so in alignment with refined molecular network diagnostics. Furthermore, the molecular network underpinnings and overlapping disease categories of NCD provide a credible relationship map between diseases and disease categories that may radically transform our current understanding of diseases and relevant treatment paradigms. On the one hand, our approach accurately links diseases with all possible underlying mechanisms in the molecular interaction network. On the other hand, it presents a promising approach to the identification of targeted drugs for the treatment of related diseases. For example, breast cancer and influenza (both in NC11.M02) may share potential drug targets (Park, 2012) . 
As another example, metformin, widely prescribed to treat metabolic syndrome (in NC11.M02), could alter the gut microbiome composition and function, improve gut microbial dysbiosis (Forslund et al., 2015; Cabreiro et al., 2013), and also prevent colorectal cancer (also in NC11.M02) through microbiome-influenced immune response modification (Nakatsu et al., 2015). Here, it is important to note that while a considerable number of diseases have a strong environmental component, our main focus has been the many diverse molecular determinants. In the future, additional environmental factors such as epigenetic changes can be added into the data integration scheme to further refine the classification. There exist several potential limitations of this work. Although we have aimed to address the possible confounders by constructing random controls and using external evaluations, the incompleteness and bias incorporated in the integrated data sources are likely to influence the generalization of our results. For example, DiseaseConnect yields an incomplete disease-gene database: 1883 ICD diseases could be mapped (Table S11), leading to only 1797 diseases included in the NCD. Furthermore, as with other studies that rely on literature-based and ontological knowledge, investigation bias remains an issue: the molecular mechanisms (e.g. related genes and their interactions) of some diseases (e.g. cancer) are more intensively studied than those of others, which may influence the results. We expect the results of similar works to be more refined in the future as biomedical datasets become more complete. The incorporation of more comprehensive disease-gene data sources, such as DISEASES (Pletscher-Frankild et al., 2015) and MalaCards (Rappaport et al., 2017), could result in an improved study.
While we have chosen to keep individual external gene expression datasets outside the scope of this study since gene expression is highly tissue- and cell-type dependent, it presents an interesting future direction and could potentially improve the quality of the resulting disease categories if exhaustive lists of tissue-specific expression datasets are used in dedicated studies. In addition, our NCD merely delivers a two-level taxonomy framework without elaborated hierarchical structures in the same disease categories, which could be further refined or optimized through methods like hierarchical clustering algorithms (Murtagh and Contreras, 2012) and the systematic a posteriori ontology engineering method (Gessler et al., 2013). Finally, high-quality ontologies, such as the Human Phenotype Ontology (Kohler et al., 2014) and Disease Ontology (Kibbe et al., 2015), can be used for external validation, or for further integration to obtain more robust and extensive NCDs. In this big-data era, the dramatically increasing multi-omics databases, as well as clinical data from electronic health records (EHR) involving phenotypic, therapeutic and environmental factors information (Jensen et al., 2012), should also be incorporated into the new disease taxonomy refinement for patient stratification and disease treatment. At this point, a realistic assumption is that the translation of this classification to the clinic will need some time. That said, while the ICD was originally made "by clinicians for clinicians", it is now widely used by biomedical researchers as well to gain a deeper understanding of human diseases. We therefore believe that researchers will be the first and direct beneficiaries of our approach.

Fig. 5. Biological insights of new disease taxonomy. a. The New Chapter containing airway diseases (NC12) consists of 11 sub-categories and 136 ICD diseases belonging to 8 ICD chapters. The sub-categories overlap in the underlying molecular interaction network, while still separately including the disease genes (asthma and COPD, respectively) that characterize each sub-category; b. The disease network of neoplasms in NCD. The 32 sub-categories significantly representing neoplasms are divided into 4 NCs (G1). Helicobacter pylori [H. pylori] (041.86), malignant neoplasm of stomach (151), duodenal ulcer (532), and peptic ulcer, site unspecified (533), which have significant relationships, are clearly clustered into a sub-category (NC11.M07) (G2); c. A sub-category (NC06.M10) in NCD, which includes diseases from 8 different ICD chapters with shared molecular mechanism and phenotypes. Fifty percent (13/26 = 50%) of the diseases in NC06.M10 share a PPI module, the biological function of which is enriched with immune system response, while over 90% (25/26 = 96.2%) of the diseases share the common phenotype "Pain".

In conclusion, our study provides valuable insights into polyhierarchical network-based disease classification beyond the traditional tree structure. Our integrated disease network approach is sufficiently powerful to elucidate the tangled underpinnings of human diseases and uncover distinct disease boundaries. Our work may provide a new framework for disease taxonomy reform based on big-data fusion, so as to further generate the robust infrastructure needed for precision medicine. Supplementary data to this article can be found online at https://doi.org/10.1016/j.ebiom.2018.04.002. The work was supported by National Natural Science Foundation of China (61105055, 81230086 and 81673833), National Science and Technology Major Project for New Drugs Research and Development of China (2017ZX09301-059, 2017ZX09503-001-003), National Key R&D Project (2017YFC1703506), and the Fundamental Research Funds for the Central public welfare research institutes (ZZ0908029, 2017JBM020, DUT16ZD227, DUT17ZD222, DUT18ZD301).
We also acknowledge the support by National Institutes of Health (NIH) grants P50-HG004233-CEGS, MapGen grant (U01HL108630) and P01 HL083069, U01 HL065899, P01 HL105339, R01HL111759, 1P01HL132825-01 and RC HL10154301. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Genotype-based phenotyping heralds a new taxonomy for inflammatory bowel disease Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling Subtypes of patients experiencing exacerbations of COPD and associations with outcomes Drug repositioning: identifying and developing new uses for existing drugs Genomic analyses identify molecular subtypes of pancreatic cancer Network medicine: a network-based approach to human disease Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk Fast unfolding of communities in large networks Not so fast! Congress delays ICD-10-CM/PCS. Examining how the delay happen, its industry impact, and how best to proceed Metformin retards aging in C.
elegans by altering microbial folate and methionine metabolism Consortium Biospecimen Core Resource: International Genomics, Hospital Research Institute at Nationwide Children's, Services Tissue Source Sites: Analytic Biologic Serological documentation of maternal influenza exposure and bipolar disorder in adult offspring New uses for old drugs Network-based classification of breast cancer metastasis High-quality, standard, controlled healthcare terminologies come of age Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease Predictability bounds of electronic health records PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations Consensus molecular subtypes and the evolution of precision medicine in colorectal cancer Metformin and reduced risk of cancer in diabetic patients Statistical Tables for Biological, Agricultural and Medical Research Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota STRING v9.1: protein-protein interaction networks, with increased coverage and integration A set of measures of centrality based on betweenness A posteriori ontology engineering for data-driven science Methods for biological data integration: perspectives and challenges Integrative methods for analyzing big data in precision medicine The human disease network Molecular classification of cancer: class discovery and class prediction by gene expression monitoring Helicobacter pylori update: gastric cancer, reliable therapy, and possible benefits Year in review 2015: asthma and chronic obstructive pulmonary disease A dynamic network approach for the study of human phenotypes Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin Towards the taxonomy of human disease Network-based stratification of tumor mutations Network biology concepts in complex disease comorbidities Precision 
medicine-personalized, problematic, and promising Mining electronic health records: towards better research applications and clinical care Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients Disentangling the heterogeneity of autism spectrum disorder through genetic findings Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data The implications of human metabolic network topology for disease comorbidity Drug repositioning for personalized medicine An information-theoretic definition of similarity. The Fifteenth International Conference on Machine Learning DiseaseConnect: a comprehensive web server for mechanism-based disease-disease connections Molecular classification of the dementias COPD: epidemiology, prevalence, morbidity and mortality, and disease heterogeneity Genetic heterogeneity in human disease Disease networks. Uncovering disease-disease relationships through the incomplete interactome Preparing for precision medicine Gene ontology term overlap as a measure of gene functional similarity Algorithms for hierarchical clustering: an overview Gut mucosal microbiome across stages of colorectal carcinogenesis Modularity and community structure in networks Association of seropositivity for influenza and coronaviruses with history of mood disorders and suicide attempts Drugs zero in. 
Breast cancer, flu and obesity are in the crosshairs as drug companies produce more-targeted treatments Semantic similarity in biomedical ontologies DISEASES: text mining and data integration of disease-gene associations MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search Probing genetic overlap among complex human phenotypes Extracting the multiscale backbone of complex weighted networks A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma Twenty five years since the first prospective study by Forman et al. (1991) on Helicobacter pylori and stomach cancer risk Pleiotropy in complex traits: challenges and strategies The UCSC Genome Browser database: 2017 update Classification of common human diseases derived from shared genetic and environmental determinants The NHGRI GWAS Catalog, a curated resource of SNP-trait associations An Ancient, Unified Mechanism for Metformin Growth Inhibition in C. elegans and Cancer Overlapping community detection at scale: a nonnegative matrix factorization approach A network medicine approach to human disease Human symptoms-disease network The authors declare that they do not have any competing financial interests.