key: cord-0268298-qlqjsnpt
authors: Luscher-Dias, T.; Dalmolin, R. J.; Amaral, P. P.; Alves, T. L.; Schuch, V.; Franco, G. E.; Nakaya, H. I.
title: The evolution of knowledge on genes associated with human diseases
date: 2021-06-21
journal: nan
DOI: 10.1101/2021.06.16.21259049
sha: 58000da7b92bf0516084ec43f5886b4037181227
doc_id: 268298
cord_uid: qlqjsnpt

Thousands of scientific articles describing genes associated with human diseases are published every week. Computational methods such as text mining and machine learning algorithms are now able to automatically detect these associations. In this study, we used a cognitive computing text-mining application to construct a knowledge network comprising 3,723 genes and 99 diseases. We then tracked the yearly changes in these networks to analyze how our knowledge has evolved in the past 30 years. Our approach helped to unravel the molecular bases of diseases over time and to detect shared mechanisms between clinically distinct diseases. It also revealed that multi-purpose therapeutic drugs target genes that are commonly associated with several psychiatric, inflammatory, or infectious disorders. By navigating this knowledge tsunami, we were able to extract relevant biological information and insights about human diseases.

Thousands of scientific articles are published every day, adding to the millions of papers already published (Fortunato et al., 2018). Keeping abreast of scientific significance has become an overwhelming task for researchers in their own fields and in other areas of science. In this scenario, computational methods such as text mining, machine learning, and cognitive

Network medicine (Barabási et al., 2011), a contemporary approach to studying relationships between genes and diseases, has also been made possible because of the large
medRxiv preprint doi: https://doi.org/10.1101/2021.06.16.21259049; this version posted June 21, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Table S2). In 1990, only 95 genes were connected in the network (Fig. 1B), and no association between psychiatric disorders and inflammatory or infectious diseases could be established through shared genes (Fig. 1A). Accordingly, the overall similarity between diseases (between or within categories) was low in 1990 (

63 infectious diseases were connected to fewer than 100 genes in 2018 (Fig. 1C). The most connected inflammatory diseases were psoriasis (346 genes), systemic lupus erythematosus (393 genes), and arthritis (490 genes; Fig. 1C). In the category of psychiatric disorders, Alzheimer's disease was the most connected (657 genes), followed by schizophrenia (547 genes) and depression (402 genes; Fig. 1B). The imbalance in the distribution of genes connected to infectious diseases likely reflects a bias in research interest toward the discovery of genes related to diseases already connected to more genes. In fact, the 2018 network showed that the number of scientific papers that mentioned poorly connected diseases (fewer than 100 genes) is significantly lower than the number of papers published on highly connected diseases (more than 100 genes) (Fig. S1C).

Distinct historical trends of discovery were seen for each disease category (Fig. 1D and Table S3).
Prominent peaks of gene-association discovery occurred in 1996 for infectious diseases, in 2005 for inflammatory diseases, and in 2013 for psychiatric disorders (Fig. 1C). From 2010 to 2017, the rate of gene discovery in all three categories increased (Fig. 1C). The significant increase in the number of genes associated with infectious diseases observed in 1996 was mostly driven by 154 new genes associated with HIV infection (Fig. 1D), which corresponded to 50% of the new genes added to the network in that year (Table S3). The triple therapy for HIV using nucleoside reverse-transcriptase inhibitors and protease inhibitors was established in 1996 (Hammer et al., 1996), which likely influenced this burst of genetic discovery. The 2005 increase in the number of genes associated with inflammatory diseases was mostly related to the new genes connected to psoriasis (41 genes) and systemic lupus erythematosus (33 genes; Fig. 1D), which together corresponded to 20% of the new genes associated with all of the diseases in 2005 (Table S3). The Th17 cell lineage was discovered in

Fig. 1D), which corresponded to 17% of the new genes in the network in that year (Table S3). We could not detect any specific scientific landmark in 2013 that could explain this peak. Nevertheless, important genes related to the innate immune response to pathogens and inflammation are among the new genes associated with Parkinson's disease in 2013, such as interleukin 1 beta (IL1B) and the p105 subunit of the nuclear factor kappa B (NFKB1).

Next, we investigated the evolution of the similarity between diseases from different categories according to their shared genes (see Methods section).
For the top 9 most connected diseases of each category in 2018 (i.e., the diseases connected to the most genes), we detected the diseases from the other two categories with the most significant gene sharing between them and analyzed how these relationships evolved from 1990 to 2018 (Figs. 2, S2, S3, and S4). Alzheimer's disease was the psychiatric disorder with the highest similarity to inflammatory diseases in 2018, including arthritis and systemic lupus erythematosus (Fig. 2A). The relationships between Alzheimer's disease and these disorders grew steadily in significance from 1990 to 2018 (Fig. S2A), which captures the now well-established relevance of inflammatory processes in the pathophysiology of Alzheimer's disease (Newcombe et al., 2018). Surprisingly, fibromyalgia was similar to several psychiatric diseases: depression, anxiety, bipolar disorder, schizophrenia, and Huntington's disease (Figs. 2A and S2). The total number of genes associated with fibromyalgia in 2018 was low (25 genes), but 72% of these (17 genes) are also associated with depression. These are genes related to nervous system development, such as brain-derived neurotrophic factor (BDNF), nerve growth factor (NGF), and neuropeptide Y (NPY), and to the inflammatory response, including interleukin 6 (IL6), C-X-C motif chemokine ligand 8 (CXCL8), and tumor necrosis factor (TNF). In fact, fibromyalgia patients often present psychiatric comorbidities such as depression and anxiety (Galvez-Sánchez et al., 2020).
We then examined the number of publications retrieved from PubMed using the most similar pairs of diseases from distinct categories as queries (see Methods section; Fig. 3). The goal was to find out whether the gene-sharing similarities between diseases from different categories detected in our networks could also be captured from direct co-occurrence in the general peer-reviewed literature over the 30-year period. For each disease pair, we obtained a ratio between the similarity score of the diseases (i.e., the significance of the gene sharing between them) and the total number of studies retrieved from PubMed that mention both diseases of the pair together (Table S4). This similarity-to-paper ratio was used to detect potentially understudied pairs of diseases that significantly share genes. Low similarity-to-paper ratio values (Figs.
3A and 3B, light green, and Table S4) represent either similar diseases with many papers already published about them or dissimilar disease pairs. An example of the former is fibromyalgia and depression. These diseases have significant gene sharing and also hundreds of scientific papers that explore their relationship in the literature (Fig. 3B). Conversely, the genetic association between osteoporosis and mycobacterial infection is low and so is the number of papers that investigate these diseases together (Fig. 3B). These cases were considered examples of a low knowledge gap between the genetic similarity obtained from our network analysis and the established literature coverage of the disease pairs.

Table S4) were considered as cases of moderate knowledge gap (Fig. 3A), which was the case for arthritis and hepatitis B (Fig. 3B)

Table S4). We suggest that these cases might represent potentially underexplored fields of research that deserve further investigation. Surprisingly, the number of papers published until 2018 that mentioned psoriasis and malaria together was negligible (Fig. 3B). These diseases share 31 genes, one-third of the genes associated with psoriasis, and over 10% of the genes associated with malaria in the 2018 network. Hydroxychloroquine, a drug used to treat malaria

Table S5). We detected 433 Reactome pathways that presented significant enrichment (p.adjust < 0.01) among the genes of at least one disease (Table S5).
Functional enrichment analysis, such as over-representation analysis (ORA), often yields too many significant pathways, making these results difficult to interpret at the individual pathway level. For this reason, we used a network approach to reduce the complexity of the obtained set of enriched pathways (see Methods section). Briefly, we built a pathway network (Fig. 4) with the significant Reactome pathways obtained from the ORA. We connected these pathways to each other according to the gene sharing between them, similar to what was done in Fig. 1A. We then identified 11 clusters of closely connected pathways in the network and annotated these clusters according to the main biological functions of the pathways within them (Fig. 4 and Table S5). One of the detected clusters grouped several pathways associated with interferon-stimulated genes, interleukins, and antigen presentation (Fig. 4 and Table S5). The pathways in this cluster were significantly enriched among the genes of diseases in all categories, including malaria, HIV infection, arthritis, lupus, depression, and Alzheimer's disease (Fig. 5). The pathways related to interleukin signaling (e.g., "interleukin 10 signaling"), for instance, were among the top enriched pathways associated with depression genes in the 2018 network (Fig. 5 and Table S5). Another cluster of pathways that showed consistent enrichment across all disease categories was NFκB-mediated inflammation induced by toll-like receptors (TLRs), T-cell receptors (TCRs), and B-cell receptors (BCRs; Fig. 4).
These results illustrate the most recurrent theme detected in our study: psychiatric, inflammatory, and infectious diseases share common immunological mechanisms that are mostly related to innate immunity and inflammation.

Conversely, we found a cluster of closely connected pathways related to neurotransmission that were enriched mostly among the genes of psychiatric disorders (Fig. 4 and Table S5). However, three inflammatory and infectious diseases (hepatitis B, arthritis, and HIV infection) presented enrichment for pathways in this cluster (Fig. 5 and Fig. S5). The genes related to these diseases presented enrichment for the pathway "transcriptional regulation

We also detected other clusters of pathways with similar enrichment results between diseases of different categories (Fig. 4). The genes related to arthritis and those related to Alzheimer's disease presented enrichment for pathways related to extracellular matrix organization, coagulation, and lipoprotein metabolism (Fig. 5). In arthritis, fibroblast-like synoviocytes become hyper-inflammatory and disrupt the integrity of the extracellular matrix, which leads to the degradation of synovial joint collagen (Nygaard and Firestein, 2020). In Alzheimer's

After determining the major biological functions related to the genes connected to infectious, inflammatory, and psychiatric diseases in the 2018 network, we investigated how this knowledge evolved from 1990 to 2018 (Fig. 6). The pathways related to interferon-stimulated genes, interleukins, and antigen presentation were already enriched for the genes associated with inflammatory and infectious diseases in the early 1990s (Fig. 6).
Surprisingly, this enrichment appeared earlier for inflammatory diseases, despite the highly relevant role of interferon-stimulated genes and antigen presentation in infectious diseases. Conversely, there has been a significant increase in the enrichment of these pathways for the genes related to depression, autism, and schizophrenia since 2010 (Fig. 6). Recently, the specific roles of the

related to apoptosis, senescence, and cell differentiation with psychiatric disorders has also occurred recently, except with Alzheimer's disease, which began early in the period (Fig. 6). Alzheimer's, Parkinson's, and Huntington's diseases are neurodegenerative conditions in which chronic neuronal death happens in distinct parts of the brain (Dugger and Dickson, 2017). We also found an increasing association in recent years of genes related to autism and depression with cell fate pathways (Fig. 6), showing that these disorders might also have a neurodegenerative

Lastly, we examined how drugs that are used to treat inflammatory, infectious, and psychiatric diseases target the genes that are shared between the three categories. We found that 345 genes were common to all disease categories (Fig. 7A). Ninety-nine genes were shared only between inflammatory and psychiatric diseases; 259 were common only between psychiatric and infectious diseases; and a total of 409 genes were related exclusively to inflammatory and infectious diseases (Fig. 7A). The remaining genes were unique to inflammatory (493 genes), psychiatric (869 genes), and infectious diseases (1,209 genes;
CXCL8 also appeared in our networks in the early 1990s and were first connected to inflammatory diseases (Fig. 7C). Eight drug target genes were first connected to psychiatric disorders (Fig. 7C): caspase 3 (CASP3; 1996), prostaglandin-endoperoxide synthase 2 (PTGS2; 1997), heme oxygenase 1 (HMOX1; 2000), BCL-2-associated X (BAX) and mitogen-

in the 1990s (Fig. 7C). These are very well-known genes involved in inflammation (e.g., IL6 and IL1B), innate immunity (e.g., IFNG), apoptosis (e.g., CASP3 and CASP8), the cell cycle (e.g., TP53), and other key biological functions that are altered in several diseases.

Next, we found the top 20 therapeutic drugs that affect the most hub genes of inflammatory, psychiatric, and infectious diseases (Fig. 7D). Valproic acid, a class I histone deacetylase (HDAC) inhibitor (Göttlicher et al., 2001), was the drug that affected the most hub genes (259; Fig. 7D). According to CTD, among the diseases we analyzed in this study, valproic acid is a therapeutic drug for anxiety, autism, bipolar disorder, and schizophrenia (Fig. 7E).
This drug is also an efficient anti-convulsant used to treat epilepsy (Tomson et

Valproate has also been proposed as a potential repurposing candidate to treat diseases caused by infectious agents, such as COVID-19 (Pitt et al., 2021) and toxoplasmosis (Goodwin et al., 2008). HDAC inhibitors promote epigenetic modifications in the genome that induce the expression of genes in many biological functions and cell types (Hull et al., 2016). This could explain valproic acid's versatility and why it ranked first in our analysis.

Among the other top 20 drugs, we found molecules that are currently under investigation for repositioning from one disease category to another. Methotrexate (Fig. 7D), which affects 141 genes among the 345 hubs, is used to treat several inflammatory diseases, including psoriasis, lupus, and arthritis (Fig. 7E). Recently, a randomized clinical trial revealed a potential for methotrexate to treat positive symptoms in schizophrenia patients (Chaudhry et al., 2020).

One of the advantages of using text mining and network medicine to study the relationships between genes and diseases is the possibility of detecting novel connections from established scientific knowledge. When two diseases share a genetic mechanism, they can also
present common clinical or epidemiological characteristics, despite having distinct etiological backgrounds (Barabási et al., 2011). These similarities can inform researchers of potential treatment options (Lüscher Dias et al., 2020). Here, we showed that diseases from inflammatory, psychiatric, and infectious etiologies significantly share genes with each other. This sharing was strong between disease pairs that were well studied together, such as depression and fibromyalgia. Conversely, the gene sharing between psoriasis and malaria has been detectable in our knowledge networks since the 2000s, but the number of papers featuring the two conditions together in PubMed is virtually null. We detected a few such cases, mostly involving neglected infectious diseases, which could explain the knowledge gap. We also found cases of diseases that only recently began to share genes and that also lack publications directly connecting them in the literature. A case in point is autism and RSV. We also found disease pairs, such as dementia and Toxoplasma gondii infection, for which there have been direct associations in the literature since 1990, but that only recently started to share genes in the network. Our results reveal potentially underexplored pathways for future research on the association between diseases of distinct categories and also for the discovery of new genes related to well-studied disease pairs.

The sharing of genes between diseases from distinct categories is also reflected in the overlap of biological functions, particularly those related to immunological processes. The genes of several diseases in all categories presented enrichment for Reactome pathways related to the interferon response, cytokines, and NFκB-mediated inflammation.
This pattern has been detectable in our networks since the early 1990s for inflammatory diseases and gradually appeared for infectious and psychiatric diseases as well. Pathways associated with neurotransmission were almost exclusively enriched among the genes of psychiatric diseases.

Our network medicine text-mining approach also revealed how shared genes between disease categories can point toward common therapeutic solutions. The findings presented in the last section of our study emphasize the relevance of drugs that target shared genes for the treatment of distinct diseases. Our results show that the genes targeted by therapeutic drugs shared by inflammatory, psychiatric, and infectious diseases were associated with these disorders early in the past 30 years of scientific research. These genes are associated with inflammation, the cell cycle, apoptosis, and central pathways of cellular function. We also demonstrated that well-established and promising cases of repositioning involve drugs that target shared genes between diseases. Future studies should aim to reveal more common molecular mechanisms between these categories of diseases as well as to harness that knowledge for novel drug discovery and repurposing.

In summary, we applied a machine learning and cognitive computing text-mining strategy using WDD to extract knowledge about genes related to inflammatory, infectious, and psychiatric diseases from the scientific literature and depict how this knowledge evolved during the past 30 years.
We built knowledge networks containing interactions between diseases and genes using the WDD (Y. . WDD discovers connections between genes and diseases using a natural language processing algorithm that reads full texts from PMC open access journals, patents, and abstracts in the MEDLINE (PubMed) database. A connection is found when two terms of interest (e.g., genes and diseases) are detected in the same sentence, separated by a preposition or a verb. These connections can be derived from many sources of evidence, such as gene expression, disease-associated mutations, genome-wide association studies, or protein expression experiments. WDD attributes a confidence score (0-100%) to each association based on the number of documents in which the relation is found and also on the semantic relevance of the link, determined by the natural language processing algorithm.

We performed independent searches on WDD with 27 inflammatory diseases, 63 infectious diseases, and 9 psychiatric and neurological disorders (Table S1) in July 2018. WDD returned lists of genes related to these diseases according to the scientific literature in each year from 1990 to 2018. These associations are cumulative; that is, the genes associated with the diseases in 2018 include all the associations present in previous years. We only kept connections between genes and diseases supported by a confidence score of at least 50% and 2 documents of evidence. Custom R code was used to process, filter, and analyze data and to plot figures. The full code of all analyses and figures in this study is available at https://github.com/csbl-usp/evolution_of_knowledge.
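The filtering step described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the record layout `(year, disease, gene, confidence_percent, n_documents)`, the function names, and the reading of "2 documents of evidence" as "at least 2 documents" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical record layout: (year, disease, gene, confidence_percent, n_documents).
MIN_CONFIDENCE = 50.0  # "at least 50%" in the text
MIN_DOCUMENTS = 2      # assumption: interpreted as "at least 2 documents"


def filter_associations(records):
    """Keep only gene-disease associations that pass both evidence thresholds."""
    return [
        r for r in records
        if r[3] >= MIN_CONFIDENCE and r[4] >= MIN_DOCUMENTS
    ]


def cumulative_genes_by_year(records, first_year=1990, last_year=2018):
    """Map year -> disease -> set of genes seen in that year or any earlier year."""
    per_year = defaultdict(lambda: defaultdict(set))
    for year, disease, gene, _conf, _docs in records:
        per_year[year][disease].add(gene)

    cumulative = {}
    running = defaultdict(set)
    for year in range(first_year, last_year + 1):
        for disease, genes in per_year[year].items():
            running[disease] |= genes
        # Copy the sets so later years do not mutate earlier snapshots.
        cumulative[year] = {d: set(g) for d, g in running.items()}
    return cumulative
```

The second helper only matters if the input records are per-year rather than already cumulative; on cumulative input it leaves the yearly gene sets unchanged.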
We calculated the Fisher's exact test p-value of the gene overlap between each pair of diseases in each year from 1990 to 2018. The total number of genes connected in the network in each year was used as the universe for Fisher's exact test. For each year, a disease-disease knowledge network was built using the -log10(p-value) of the Fisher's exact test ("disease-disease similarity") as the edge weight for each disease pair. The networks were constructed using the R package igraph (Csardi and Nepusz, 2006) and plotted using the package ggraph. We detected new genes in each year by comparing the list of genes of each disease in one year to the list of genes of the same disease in the previous year. Thus, we obtained a list of new genes that were added to the network in each year from 1991 to 2018. The total number of genes associated with each disease was also calculated for each year. Line, violin, and ridge plots were created to illustrate the results using ggplot2 (Wickham, 2016).

For the top 9 diseases of each category that were connected to the most genes in 2018 ("top 9 diseases"), we detected the diseases from the other two categories with the most significant gene sharing between them ("disease pairs") and analyzed how these relationships evolved from 1990 to 2018. The disease-disease similarity scores obtained previously were also used in this analysis. We used the MeSH.db R package (Tsuyuzaki et al., 2015) to obtain the MeSH IDs and terms of all 99 diseases.
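As an illustration of the disease-disease similarity and new-gene detection described above, the following Python sketch computes a Fisher's exact test p-value for the overlap of two gene sets and its -log10 transform. The text does not say whether the test was one- or two-sided; a one-sided (enrichment) test is assumed here, and all function names are hypothetical.

```python
import math


def _overlap_tail(genes_a, genes_b, universe_size):
    """Exact integer numerator and denominator of P(overlap >= observed)."""
    a, b = len(genes_a), len(genes_b)
    k = len(genes_a & genes_b)
    total = math.comb(universe_size, b)
    tail = sum(
        math.comb(a, i) * math.comb(universe_size - a, b - i)
        for i in range(k, min(a, b) + 1)
    )
    return tail, total


def fisher_overlap_pvalue(genes_a, genes_b, universe_size):
    """One-sided Fisher's exact test p-value (assumption: enrichment side)."""
    tail, total = _overlap_tail(genes_a, genes_b, universe_size)
    return tail / total


def disease_similarity(genes_a, genes_b, universe_size):
    """-log10(p-value), computed on exact integers to avoid float underflow."""
    tail, total = _overlap_tail(genes_a, genes_b, universe_size)
    return math.log10(total) - math.log10(tail)


def new_genes(genes_this_year, genes_previous_year):
    """Genes linked to a disease this year that were absent the year before."""
    return genes_this_year - genes_previous_year
```

Here `universe_size` stands for the total number of genes connected in the network in the given year, and both gene sets are assumed to be drawn from that universe.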
Using the obtained MeSH terms of the diseases in each pair, we used the easyPubMed R package to search for PubMed papers in which both disease MeSH terms were found together. We then used an adapted version of the fetch_pubmed_data function (see code in GitHub) of the easyPubMed package to retrieve the number of papers that contained the searched MeSH pairs in each year from 1990 to 2018. We used the disease-disease similarity score and the number of papers in 2018 to calculate a similarity-to-paper ratio for each disease pair as follows:

Low similarity-to-paper ratios (<10) were considered as cases of low knowledge gap between the gene sharing and the general scientific interest in the disease pairs. Pairs in this category included those in which the diseases did not share a significant number of genes and pairs of similar diseases for which there is also a proportional number of papers that cite the two diseases together. Ratios between 10 and 40 were considered as cases of intermediate knowledge gap; that is, the diseases in the pair are similar in the genes they share, but the number of papers on the two diseases together is not proportionally high. High similarity-to-paper ratios (>40) were interpreted as cases of a large knowledge gap. The pairs that fell in this category include diseases that share a significant proportion of their genes but that have almost never been studied together, as evidenced by the very low number of papers including the two MeSH terms.
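The three knowledge-gap categories can be sketched as a small Python classifier. The exact ratio formula is not reproduced in this text, so the sketch takes an already computed similarity-to-paper ratio as input; the function names are hypothetical, and treating the boundary values 10 and 40 as "intermediate" is an assumption.

```python
def knowledge_gap(ratio):
    """Classify a precomputed similarity-to-paper ratio.

    Thresholds follow the text: <10 low, 10 to 40 intermediate, >40 large.
    Assumption: the boundary values 10 and 40 count as intermediate.
    """
    if ratio < 10:
        return "low"
    if ratio <= 40:
        return "intermediate"
    return "large"


def group_pairs_by_gap(ratios):
    """Group disease pairs by category; `ratios` maps (disease_a, disease_b) -> ratio."""
    groups = {"low": [], "intermediate": [], "large": []}
    for pair, ratio in ratios.items():
        groups[knowledge_gap(ratio)].append(pair)
    return groups
```

For example, `group_pairs_by_gap({("A", "B"): 3.2, ("C", "D"): 57.0})` places the first pair under "low" and the second under "large"; the pair names and values are invented for illustration.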
We used the enricher function of the R package clusterProfiler (Yu et al., 2012) to perform an ORA against Reactome pathways of the genes associated with the top 9 diseases of each category in each year. We selected the significant Reactome pathways (p.adjust < 0.01) of the top 9 diseases in 2018 and calculated the significance of the gene overlap between these pathways with Fisher's exact test. We considered only the genes of each significant pathway that were also present in the 2018 gene-disease network. By doing this, we made the pathways cluster according to the genes shared within our data set rather than all the genes in the pathways. We then built a pathway network connecting the significant Reactome terms using the -log10(p-value) of the Fisher's exact tests as edge weights, similar to what was done for the disease-disease network in Fig. 1A. We detected clusters of pathways in this network using the cluster_louvain function (Blondel et al., 2008) of the igraph R package (Csardi and Nepusz, 2006). Edge weights were considered for the cluster detection. We calculated the weighted degree of each pathway in the network using the strength function of the igraph package (Csardi and Nepusz, 2006). We manually annotated the detected clusters for their major biological function using the pathways with the highest weighted degree in each cluster as reference.
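To make the ORA and weighted-degree steps above concrete, here is a minimal standard-library Python sketch. It is not clusterProfiler or igraph: the hypergeometric test, the Benjamini-Hochberg adjustment (clusterProfiler's default, although the text does not state which method was used), the choice of `universe`, and all function names are assumptions. Louvain clustering itself is not reimplemented here.

```python
import math


def ora_pvalue(disease_genes, pathway_genes, universe):
    """Hypergeometric upper-tail p-value for a pathway among a disease's genes.

    Assumption: `universe` is the background gene set; both inputs are
    restricted to it before testing.
    """
    disease = disease_genes & universe
    pathway = pathway_genes & universe
    n, a, b = len(universe), len(disease), len(pathway)
    k = len(disease & pathway)
    tail = sum(
        math.comb(b, i) * math.comb(n - b, a - i)
        for i in range(k, min(a, b) + 1)
    )
    return tail / math.comb(n, a)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def strength(edges):
    """Weighted degree per node; `edges` maps (node_u, node_v) -> weight."""
    totals = {}
    for (u, v), weight in edges.items():
        totals[u] = totals.get(u, 0.0) + weight
        totals[v] = totals.get(v, 0.0) + weight
    return totals
```

In this sketch, pathways whose adjusted p-value falls below 0.01 would be the ones kept as nodes, and `strength` would then rank pathways within each detected cluster.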
The significance values (-log10 p-value) of the ORA for the pathways in each cluster were used to make box and ridge plots that illustrate the results for each disease in 2018 and how these results changed from 1990 to 2018.

Evolution of drug target hub genes

Using the 2018 gene-disease network, we detected the genes common to all three categories of diseases ("hub genes"). We used the R package UpSetR to visualize the number of genes shared by and exclusive to the disease categories. We downloaded the drug-gene and the drug-disease interaction databases from the CTD (http://ctdbase.org/; Davis et al., 2021). We used the MeSH terms of the 99 diseases to filter the drug-disease database and kept only interactions between drugs and diseases that were listed as "therapeutic" by CTD. These are cases of a "chemical that has a known or potential therapeutic role in a disease (e.g., chemical X is used to treat leukemia)", according to the CTD glossary (Davis et al., 2021). We filtered the drug-gene database and kept only the interactions between the therapeutic drugs and the hub genes of our analysis. This final drug-gene list was used to detect the top 20 drugs that target the most hub genes and the top 20 hub genes most targeted by the therapeutic drugs. We visualized these drug-gene interactions in a network built with the R package igraph and plotted with ggplot2 and ggraph. We used the yearly gene-disease networks to detect when the top 20 drug target hub genes were first connected to diseases in each category to build a timeline.
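The hub-gene selection and ranking steps can be illustrated with a short Python sketch. It is not the study's R code: the function names, the input structures (a mapping of category to gene set, a list of drug-gene pairs, and a mapping of year to connected genes), and the placeholder identifiers are all hypothetical.

```python
from collections import Counter


def hub_genes(genes_by_category):
    """Genes connected to diseases in every category."""
    gene_sets = [set(genes) for genes in genes_by_category.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def top_drugs_and_targets(drug_gene_pairs, hubs, n=20):
    """Rank drugs by hub genes targeted and hub genes by targeting drugs."""
    # Keep unique drug-gene pairs whose gene is a hub gene.
    pairs = {(drug, gene) for drug, gene in drug_gene_pairs if gene in hubs}
    drugs = Counter(drug for drug, _ in pairs)
    genes = Counter(gene for _, gene in pairs)
    return drugs.most_common(n), genes.most_common(n)


def first_year_connected(yearly_genes, gene):
    """Earliest year in which a gene appears in one category's network."""
    years = [year for year, genes in yearly_genes.items() if gene in genes]
    return min(years) if years else None


# Placeholder identifiers only (not study data).
genes_by_category = {
    "psychiatric": {"GENE_A", "GENE_B", "GENE_C"},
    "inflammatory": {"GENE_A", "GENE_B", "GENE_D"},
    "infectious": {"GENE_A", "GENE_B", "GENE_E"},
}
drug_gene_pairs = [
    ("drug_1", "GENE_A"),
    ("drug_1", "GENE_B"),
    ("drug_2", "GENE_A"),
    ("drug_2", "GENE_E"),
]
hubs = hub_genes(genes_by_category)
top_drugs, top_genes = top_drugs_and_targets(drug_gene_pairs, hubs)
yearly_genes = {1990: set(), 1995: {"GENE_A"}, 2018: {"GENE_A", "GENE_B"}}
first_year = first_year_connected(yearly_genes, "GENE_A")
```

Calling `first_year_connected` once per disease category, each with its own year-to-genes mapping, would give the per-category entries of a timeline.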
We declare that the authors have no conflicts of interest.

References

A method for exploring implicit concept relatedness in biomedical knowledge network
Drug-induced psoriasis: clinical perspectives
Network medicine: a network-based approach to human disease
Hydroxychloroquine: from malaria to autoimmunity
Fast unfolding of communities in large networks
Neurologic sequelae of primary HIV infection
Expanding rare disease drug trials based on shared molecular etiology
A disease similarity matrix based on the uniqueness of shared genes
Molecular and therapeutic potential and toxicity of valproic acid
A randomised clinical trial of methotrexate points to possible efficacy and adaptive immune dysfunction in psychosis
Folic acid supplementation mitigates Alzheimer's disease by reducing inflammation: A randomized controlled trial
Valproic acid attenuates traumatic spinal cord injury-induced inflammation via STAT1 and NF-κB pathway dependent of HDAC3
IBM Watson: how cognitive computing can be applied to big data challenges in life sciences research
Presence of hepatitis B virus in synovium and its clinical significance in rheumatoid arthritis
Alzheimer's disease: the role of microglia in brain homeostasis and proteopathy
Valproic Acid Prevents Renal Dysfunction and Inflammation in the Ischemia-Reperfusion Injury Model
The igraph software package for complex network research
Guillain-Barré syndrome: The first documented COVID-19-triggered autoimmune neurologic disease: More to come with myositis in the offing
Comparative Toxicogenomics Database (CTD): update 2021
Innate immune response is differentially dysregulated between bipolar disease and schizophrenia
Rett syndrome: An autoimmune disease?
Cellular stress and apoptosis contribute to the pathogenesis of autism spectrum disorder
Neutrophil hyperactivation correlates with Alzheimer's disease progression
Pathology of neurodegenerative diseases. Cold Spring Harb
Alzheimer's Disease-Associated β-Amyloid Is Rapidly Seeded by Herpesviridae to Protect against Brain Infection
Science of science
Depression and trait-anxiety mediate the influence of clinical pain on health-related quality of life in fibromyalgia
Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder
Virus infection, antiviral immunity, and autoimmunity
Evidence for a dysregulated immune system in the etiology of psychiatric disorders
Evaluation of the mood-stabilizing agent valproic acid as a preventative for toxoplasmosis in mice and activity against tissue cysts in mice
Valproic acid defines a novel class of HDAC inhibitors inducing differentiation of transformed cells
Synthetic antimalarial drugs and the triggering of psoriasis - do we need disease-specific guidelines for the management of patients with psoriasis at risk of malaria?
A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. AIDS Clinical Trials Group Study 175 Study Team
Herpes simplex virus type 1 and other pathogens are key causative factors in sporadic Alzheimer's disease
HDAC inhibitors as epigenetic regulators of the immune system: impacts on cancer therapy and inflammatory diseases
Transcriptional regulation of T helper 17 cell differentiation
Inflammation in Depression and the Potential for
Initial sequencing and analysis of the human genome
IL-23 drives a pathogenic T cell population that induces autoimmune inflammation
New IBD genetics: common pathways with other diseases
Inflammation and depression: a causal or coincidental link to the pathophysiology?
Increased risk of autoimmune diseases in dengue patients: A population-based cohort study
Validity of machine learning in biology and medicine increased through collaborations across fields of expertise
Rett syndrome and MeCP2
Drug repositioning for psychiatric and neurological disorders through a network medicine approach
Neuroinflammation in autism: plausible role of maternal inflammation, dietary omega 3, and microbiota
Genome sequencing in microfabricated high-density picolitre reactors
Viral arthritis
CIHR Team in Defining the Burden and Managing the Effects of Psychiatric Comorbidity in Chronic Immunoinflammatory Disease
Arthritis and hepatitis
Genes associated with T helper 17 cell differentiation and function
Inflammation: the link between comorbidities, genetics, and Alzheimer's disease
Restoring synovial homeostasis in rheumatoid arthritis by targeting fibroblast-like synoviocytes
Neuro-inflammation and anti-inflammatory treatment options for Alzheimer's disease
Inflammation, Antipsychotic Drugs, and Evidence for Effectiveness of Anti-inflammatory Agents in Schizophrenia
Potential repurposing of the HDAC inhibitor valproic acid for patients with COVID-19
Asthma and chronic obstructive pulmonary disease: common genes, common environments?
Scalable and accurate deep learning with electronic health records
Dexamethasone in Hospitalized Patients with Covid-19
Anti-inflammatory agents in the treatment of bipolar depression: a systematic review and meta-analysis
The role of neuroglia in autism spectrum disorders
Extracellular matrix proteomics in schizophrenia and Alzheimer's disease
Valproic acid attenuates sepsis-induced myocardial dysfunction in rats by accelerating autophagy through the PTEN/AKT/mTOR pathway
The remarkable story of valproic acid
Unsupervised word embeddings capture latent knowledge from materials science literature
MeSH ORA framework: R/Bioconductor packages to support MeSH over-representation analysis
The sequence of the human genome
Pervasive pleiotropy between psychiatric disorders and immune disorders revealed by integrative analysis of multiple GWAS
ggplot2 - Elegant Graphics for Data Analysis
Inflammation-related biomarkers in major psychiatric disorders: a cross-disorder assessment of reproducibility and specificity in 43 meta-analyses
clusterProfiler: an R package for comparing biological themes among gene clusters
Th17 cells in autoimmune and infectious diseases
Machine learning for integrating data in biology and medicine: principles, practice, and opportunities