title: Literature-based discovery approaches for evidence-based healthcare: a systematic review
authors: Cheerkoot-Jalim, Sudha; Khedo, Kavi Kumar
date: 2021-10-25
journal: Health Technol (Berl)
DOI: 10.1007/s12553-021-00605-y
PURPOSE: Literature-Based Discovery (LBD) is a text mining technique used to generate novel hypotheses from vast amounts of literature sources, by identifying links between concepts from disparate sources. One of the main areas where it has been predominantly applied is the healthcare domain, whereby promising results, in the form of novel hypotheses, have been reported. The purpose of this work was to conduct a systematic literature review of recent publications on LBD in the healthcare domain in order to assess the trends in the approaches used and to identify issues and challenges for such systems.
METHODS: The review was conducted following the principles of the Kitchenham method. The selected studies have been scrutinized and the derived findings have been reported following the PRISMA guidelines.
RESULTS: The review results reveal useful information regarding the application areas, the data sources considered, the approaches used, the performance in terms of accuracy and reliability and future research challenges. The results of this review will be beneficial to LBD researchers and other stakeholders in the healthcare domain, by providing them with useful insights on the approaches to adopt, data sources to consider, evaluation model to use and challenges to reflect on.
CONCLUSION: The synthesis of the results of this work has shed light on recent issues and challenges that drive new LBD models and provides avenues for their application in other diverse areas in the healthcare domain. To the best of our knowledge, no such recent review has been conducted.
Healthcare management, being one of the highest priorities of most governments, attracts huge investments in health and medical research worldwide. Medical research was found to be the main contributing factor in the improvement of health and longevity of individuals and populations in developed countries [1]. Researchers in the field are making new discoveries and generating knowledge, which has the potential to enhance healthcare delivery, improve patient health outcomes and reduce healthcare costs, thus strengthening the overall healthcare system and economy. This is only achievable if the knowledge is actually put into action [2]. However, the transfer of research findings into healthcare practice in the clinical setting, known as knowledge translation [3], is a very complex and slow process, often resulting in patients not being provided with the most appropriate care, although better treatment recommendations have been proposed and demonstrated. A frequently stated average time lag for knowledge translation is 17 years [4]. Understanding the various stages of knowledge translation and speeding up the process are policy priorities for many health research systems [4]. In order to leverage new medical research findings more quickly for the benefit of patients, medical practitioners are encouraged to adopt the practice of evidence-based medicine, whereby they are expected to scrutinize the scientific and clinical research literature in their respective areas in an attempt to translate health research knowledge into effective healthcare action more quickly. However, due to the large volumes of biomedical literature available and the time constraints of medical practitioners, the practice of evidence-based medicine has become a major challenge [5]. This limitation can be largely overcome by using appropriate computational techniques for automated or semi-automated knowledge extraction from relevant research literature.
A broad term commonly used for such techniques is literature-based discovery (LBD), whose main goal is to generate novel hypotheses from the vast available biomedical literature by discovering unknown associations in existing knowledge [6]. Recent advances in machine learning, text mining and statistical analysis techniques have spurred research in this field and have resulted in many publications on the design and application of LBD systems for various use cases in the biomedical and healthcare domains. The purpose of this work is to perform a systematic literature review of recently published research papers on the application of LBD for evidence-based healthcare, with the objective of identifying and integrating the findings of the most relevant individual studies. It is expected that the results of this review will give insights into the different LBD approaches and tools used in various application areas in the healthcare domain. It will help establish to what extent research has progressed in the field, with a focus on performance criteria like effectiveness, accuracy and reliability. A main outcome would be to identify research challenges, which will prompt further studies and thus provide avenues for future research in other areas in the healthcare domain. The Kitchenham guidelines for performing systematic literature reviews [7] were adopted and the reporting of this paper follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [8]. To the best of our knowledge, no such recent review has been performed for evidence-based healthcare.
The challenges of knowledge translation have become a major concern to individuals who seek and need healthcare, healthcare providers, policy makers and funders of health services.
The incorporation of scientific medical discoveries into practice guidelines and policies in the clinical setting can greatly improve healthcare delivery and patient health outcomes, and is the basis of evidence-based healthcare [9]. Evidence-based practice involves clinical decision making which considers the best and most up-to-date available scientific evidence, together with patient values and preferences, the clinical judgment of the medical practitioner and the context in which the care is provided [10]. Healthcare professionals seek evidence to support and justify any activity or intervention for patient care. In their practice of evidence-based medicine, medical practitioners are expected to scrutinize the best available evidence for making decisions about the care of individual patients. However, with the increasing volume of academic research papers and related structured knowledge resulting from medical research worldwide, they only focus on publications that are directly relevant to their respective area of specialization and often skip other potentially relevant research. Thus, discoveries in one field remain unknown to others and potential connections between sub-fields are often missed [11]. LBD can substantially reduce this limitation, since it can automate or semi-automate the analysis of online resources from disparate sources to find new discoveries. With the exponential growth of scientific literature, LBD is becoming an increasingly important tool for facilitating research [12]. LBD generates discoveries not yet published anywhere by combining knowledge extracted from varied literature sources, and therefore supports hypothesis generation [13]. There are two modes of discovery in LBD, namely open discovery and closed discovery. Open discovery starts with a concept X and tries to generate a potential association between X and another concept Z, based on an intermediate concept Y.
This follows from the ABC co-occurrence model, which states that if A and B are often associated with each other, and B and C are also often associated with each other, there may potentially be an association between A and C, even if this association is not mentioned in any research paper [14]. In contrast, in closed discovery, both the start concept X and end concept Z are known, and an association between X and Z is predicted, based on a hypothesis about the relationship between X and Z. This technique then attempts to demonstrate the hypothesis through an intermediate concept Y. LBD approaches in healthcare are becoming essential, since biomedical knowledge is spread out across an increasingly large number of publications [15]. Potential discoveries in healthcare can be associations that exist between biomedical concepts which are not usually discussed together in the literature. Appropriate implementation of LBD techniques has the potential to predict future strong associations between these concepts [15] and therefore warrants further research. In the LBD approach, the starting concept X may be a disease and the end concept Z may be a treatment or cause for the disease. The results of such discoveries need to be further investigated through experimental methods or clinical studies.
This review has been performed following the guidelines on undertaking systematic literature reviews by Kitchenham and Charters [7] and the reporting follows the PRISMA guidelines [8]. The methodology consisted of first setting out the research questions to give a focus for this review, followed by the specification of the search strategy, the application of assessment criteria for the selection of papers and finally the data analysis and extraction. Based on the objectives of this review, the research questions have been set out and elaborated as follows:
RQ1: What are the main application areas of literature-based discovery in evidence-based healthcare? We seek to find out the different application areas in which the application of LBD techniques has proved to be successful in the healthcare domain.
RQ2: Which important/impactful literature sources are considered by researchers/practitioners for literature-based discovery? The foundation of LBD is the large amount of scientific literature available for a specific field of study. It is therefore important to identify the different literature sources which have been harnessed for LBD in the different studies.
RQ3: Which specific literature-based discovery approaches and tools have proven to be effective in the healthcare domain? Due to the peculiarity of the healthcare domain, LBD techniques have to be adapted to specific application areas. There is therefore a need to investigate the specific LBD techniques/approaches which are more relevant and effective for the healthcare domain.
RQ4: How do literature-based discovery systems in the healthcare domain perform in terms of accuracy and reliability? Accuracy and reliability are imperative evaluation criteria for any computational technique in the healthcare domain, since a wrong intervention can lead to harmful consequences for the patient. We therefore study the different evaluation strategies used for LBD systems and find out their performance in terms of accuracy and reliability.
The search strategy involved the identification of potential research papers to be included in the review by performing a search on Google Scholar, with keywords '"Literature-based discovery" in health'. Google Scholar was chosen since it indexes scientific articles from various scholarly publishers and professional societies like Springer, ScienceDirect, ACM, IEEE Xplore and ResearchGate, amongst others [16]. It also indexes biomedical-specific journals like the Journal of Biomedical Informatics, PLOS ONE and BioMed Central (BMC).
Gusenbauer [17] performed a comparative study of academic search engines in 2019 and concluded that "Google Scholar is currently the most comprehensive academic search engine". Keyword search was then followed by a manual screening of reference lists of relevant primary studies to extend the search space.
Based on the objectives of this systematic review, we have set some inclusion and exclusion criteria to guide the study selection process, as follows. The focus of this review being on recent advances in LBD techniques and approaches, we considered studies carried out during the last five years, that is, since 2015. We only considered peer-reviewed papers published in the English language. Primary studies were included while secondary and tertiary studies, like surveys, systematic reviews and meta-analyses, were excluded. During an initial screening of studies, we came across papers which describe general LBD techniques without showing their application in the healthcare domain. Such studies were not included, since the objective of this review was to get insights into the different approaches which are more appropriate for specific application areas of LBD. We thus considered papers which describe the use of LBD approaches in a specific application area in the healthcare domain.
The database search was performed on 2 February 2021. The keyword search returned 650 results, after applying the filter on year of publication. The manual screening of reference lists of relevant studies returned 12 eligible studies. Eight duplicate studies were identified from the two sources, resulting in 654 studies to screen. After a rigorous screening of the titles and abstracts based on the inclusion and exclusion criteria, 29 studies were pre-selected for the review. After initial screening based on the inclusion and exclusion criteria, the pre-selected studies were assessed for "quality" in order to apply more detailed inclusion and exclusion criteria.
Based on the research questions, four quality assessment criteria were set as shown in Table 1. The possible outcomes for each criterion were "Yes" if the paper met the criterion and "No" if it did not. Two of the quality assessment criteria also had a "Partially" outcome. During the quality assessment phase, appropriate scores were given to each pre-selected study. A score of 1 was given for a "Yes" outcome, 0 for a "No" outcome and 0.5 for a "Partially" outcome. Studies which obtained a score of at least 2.5 out of a maximum of 4 were included in the final review. In the most lenient case, this allows for one "No" and one "Partially" outcome. After the quality assessment phase, 23 studies were selected for the final review, based on the scores obtained. Figure 1 shows the PRISMA flow diagram for the study selection process.
The selected studies were thoroughly analyzed with the objective of extracting information which would give insights into the research questions. More particularly, the information extracted was: the medical application area in which LBD was utilized and the discovery made as a result of LBD, the literature source/s considered, the type of discovery (open or closed), the techniques and tools used in the LBD approach, the performance of the system and the challenges identified by the authors. The data synthesis is shown in Table 2.
The selected studies were scrutinized with a major focus on the objectives of this review. The work of the various authors and their findings were mapped to the research questions and are discussed in the following sub-sections.
Table 1 Quality assessment criteria
QC1 Has the LBD approach used been described in detail?
  Yes: The LBD approach used has been described in detail
  Partially: The LBD approach used has been briefly described
  No: The LBD approach used has not been described
QC2 Was there a discovery following the research work?
  Yes: There was a discovery
  No: No discovery was made
QC3 Did the study include a concise evaluation strategy?
  Yes: A concise evaluation was done
  Partially: The evaluation was not intensive
  No: No evaluation was done
QC4 Does the study give insights on research challenges and future directions?
  Yes: The study gives insights on research challenges and future directions
  No: The study does not give insights on research challenges and future directions
From the studies analyzed, it was found that LBD techniques have been implemented in a myriad of application areas in the healthcare domain, as described below.
Drug repurposing is one main application area in which researchers have put efforts, mostly because of the promising results achieved by the different LBD approaches proposed. Due to the huge costs and excessive amount of time involved in developing new drugs, drug repurposing is regarded as a better alternative. Several studies [18, 19, 21, 23, 25] generated a list of potential drug-disease pairs by using drug-gene and gene-disease semantic predications. Phenotypes and symptoms have also been used as the linking concept between drug and disease [16]. Some studies have used knowledge-graph based drug discovery methods [18-20].
Pharmacovigilance involves the continuous monitoring of drug safety after drugs are put on the market, which is necessary since some adverse drug events (ADEs) remain undetected during clinical trials and unreported in adverse event reporting systems such as FAERS (FDA Adverse Event Reporting System). The health hazards that ADEs may pose to individuals motivate the extensive work on the application of various computational methods for pharmacovigilance. The authors of these studies have used either an open LBD [15, 23, 24] or a closed LBD [22, 25] approach for the detection of drug/ADE pairs.
LBD's potential to contribute to the advancement of the medical field has been demonstrated by the development of text mining systems which have been able to identify possible causes, therapies or treatments for specific diseases.
Discoveries about connections between diet and degenerative diseases [34, 35] were made from scientific literature to support better understanding and treatment of such diseases. LBD techniques have been used for rehabilitation therapy repositioning for stroke [31] and treatment repurposing for inflammatory bowel disease [36]. Other discoveries were made in the area of cancer [32] and chronic kidney disease [33].
Disease comorbidity is very common and is a popular area of research in the medical community, because of its impact on the treatment of diseases. Knowledge of the association between diseases can significantly improve the understanding of the mechanisms of diseases, thus aiding in better prevention and treatment [37]. Chen et al. [37] have used an open LBD approach for the detection of associations among complex diseases. A closed LBD approach was also used to explain the correlation between epilepsy and inflammatory bowel disease [38], and between myocardial infarction and depression [39]. Rather et al. [40] proposed the use of deep learning for the discovery of potential new biomedical knowledge.
Since the data sources mainly consist of free text, the main techniques behind LBD are text mining and natural language processing (NLP). Most LBD approaches proposed have extracted meanings from biomedical text by using Unified Medical Language System (UMLS) concepts and MeSH terms. The approaches used by authors of studies in this review are broadly categorized and described below.
The ABC model of LBD is a common relation extraction technique used by many authors [18-20, 26, 30, 31, 39]. The associations between the different concepts are usually deduced from semantic predications extracted with NLP tools like SemRep and MetaMap, which have been the most preferred tools.
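As an illustration only, the ABC pattern in open-discovery mode can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any reviewed study: the function name `open_discovery`, the concept names and the toy pairs are all hypothetical, the input is assumed to be a list of already-extracted concept pairs (for example, the two arguments of a semantic predication), and candidates are ordered by the number of distinct linking B terms, which is only one simple stand-in for the ranking criteria used in practice.

```python
from collections import defaultdict


def open_discovery(cooccurrences, start):
    """Rank candidate C terms for a start concept A using the ABC pattern."""
    # Build an undirected neighbour map from the co-occurring concept pairs.
    neighbours = defaultdict(set)
    for left, right in cooccurrences:
        neighbours[left].add(right)
        neighbours[right].add(left)

    # Concepts already associated with A (and A itself) are not novel.
    known = neighbours[start] | {start}

    # Map each candidate C to the set of B terms that link it to A.
    linking = defaultdict(set)
    for b in neighbours[start]:
        for c in neighbours[b]:
            if c not in known:
                linking[c].add(b)

    # Assumption: more linking B terms means a stronger candidate.
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(
        ((c, sorted(bs)) for c, bs in linking.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Toy, hand-written pairs (hypothetical; not extracted from any corpus).
pairs = [
    ("disease_x", "gene_1"), ("disease_x", "gene_2"),
    ("gene_1", "drug_p"), ("gene_2", "drug_p"),
    ("gene_2", "drug_q"),
    ("disease_x", "drug_r"), ("gene_1", "drug_r"),
]
ranked = open_discovery(pairs, "disease_x")
for candidate, via in ranked:
    print(candidate, "via", via)
```

In this toy input, `drug_r` is excluded because it already co-occurs with the start concept, so only associations that are absent from the input are proposed. A closed-discovery variant would instead fix both A and C and return only the B terms connecting them.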
If the output of the ABC method consists of a long list of C terms, then these are ranked based on specific criteria and the higher-ranked C terms are considered as plausible hypotheses. Co-occurrence-based metrics are often used for analyzing the strength of entity associations, and prioritization of C terms is often based on the total frequency of co-occurrence [32]. Furthermore, Gubiani et al. [35] proposed a method to identify outlier documents by making use of two tools, namely OntoGen for outlier document detection and CrossBee for cross-domain exploration. Table 4 shows the different biomedical concepts A, B and C which have been considered in the studies in this review.
While most LBD methods apply co-occurrence-based methods to assess the relatedness of biomedical concepts, distributional models are also widely used. These models build vector representations of concepts based on the context in which they appear in the literature. Relatedness between a pair of concepts is then derived from the similarity between their vectors. Various distributional semantic techniques which have been proposed include Semantic Predications [18, 25, 30], Latent Semantic Analysis (LSA) [37], Predication-based Semantic Indexing (PSI) [28] and composite feature vectors [29]. Mower et al. [29] have shown that distributional models perform better than co-occurrence-based models.
Several authors have used machine learning in different steps of their LBD methodology. For text analysis, Pyysalo et al. [32] propose the use of machine learning-based methods for the recognition of biomedical entity names and their grounding to domain-specific ontology identifiers. Ranking of LBD-generated hypotheses has been performed by Zhang et al. [27] through a machine learning-based filter (lasso regression filter) and by Rastegar-Mojarad et al. [21] using a binary classifier.
Machine learning algorithms like logistic regression [22, 23, 29] and k-Nearest Neighbor (kNN) [29] have been incorporated in models proposed by authors in this review. Rather et al. [40] integrated Word2vec, a neural network based algorithm, in their LBD approach and showed that the model was able to retrieve strong relationships which were not identified by UMLS. Deep learning has also been used in LBD techniques [18, 35] . Knowledge-graph models use graph theory to identify novel associations among various concepts. In their LBD approach for drug discovery, Zhao et al. [22] constructed a biomedical knowledge graph based on semantic predications. A path ranking algorithm was then used to extract drug-disease relation path features. Sang et al. [23] also use a knowledge graph-based drug discovery method, which involves the training of a logistic regression model by learning the semantic types of paths in the knowledge graph. Knowledge graph embedding and knowledge graph completion have also been used [24, 25] . The papers analyzed have shown that diverse performance evaluation methods have been used for LBD systems, mostly due to the peculiarities of the healthcare domain and the specific requirements of the varied application areas. The evaluation of LBD systems in terms of accuracy and reliability is quite challenging in the healthcare domain. It becomes difficult for researchers to reliably distinguish between false positive signals and new discoveries. Most authors therefore have to rely on manual review by experts to confirm the final candidates for LBD. Many authors have claimed that there was no gold standard against which they could accurately benchmark the performance of their approaches [18, 21] and that precision and recall were not good metrics to measure the performance in all conditions [20] . 
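To make the evaluation difficulty concrete, the following minimal Python sketch computes precision and recall for the top-k candidates against a reference set of known pairs. It is an illustration under stated assumptions rather than the evaluation procedure of any reviewed study: the function name `evaluate_candidates`, the candidate list and the reference set are hypothetical, and the reference set is assumed to be incomplete, which is exactly why candidates missing from it are reported separately as "unverified" instead of being counted outright as errors.

```python
def evaluate_candidates(ranked_candidates, reference_pairs, k):
    """Compare the top-k candidates with a reference set of known pairs."""
    top_k = ranked_candidates[:k]
    # Candidates that the reference set confirms.
    confirmed = [pair for pair in top_k if pair in reference_pairs]
    # Candidates absent from the reference set: these may be false
    # positives or genuinely new discoveries; the metric cannot tell.
    unverified = [pair for pair in top_k if pair not in reference_pairs]
    precision = len(confirmed) / len(top_k) if top_k else 0.0
    recall = len(confirmed) / len(reference_pairs) if reference_pairs else 0.0
    return precision, recall, unverified


# Hypothetical ranked drug-disease candidates and known pairs.
candidates = [
    ("drug_p", "disease_x"),
    ("drug_q", "disease_x"),
    ("drug_s", "disease_x"),
    ("drug_t", "disease_x"),
]
reference = {
    ("drug_p", "disease_x"),
    ("drug_t", "disease_x"),
    ("drug_u", "disease_x"),
}
precision, recall, unverified = evaluate_candidates(candidates, reference, k=3)
print(precision, recall, unverified)
```

Under this kind of scoring, a low precision value does not by itself show that a system performs poorly, since every unverified candidate would still need expert review or experimental follow-up before it could be labelled a false positive.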
The performance of the systems developed in several studies of this review is highly impacted by the performance of the tools and resources used in the LBD approach. In the evaluation of their system, Rastegar-Mojarad et al. [18] used the Comparative Toxicogenomics Database (CTD) resource, which does not annotate the type of relationship between drug and disease, therefore resulting in loss of valuable information. Errors are also often introduced by text mining tools like SemRep, due to inaccuracies in language processing or in the literature itself [21, 22, 25, 27, 38], and by MetaMap, whose accuracy drops in the presence of ambiguity because it cannot always resolve word senses [23]. Sosa et al. [24] have acknowledged that the performance of their algorithm could be considerably improved if NLP tools improved their capability to capture complex relationships from unstructured text.
The resource requirements for most LBD systems, especially those which use the open discovery approach, are huge. Therefore, it is quite challenging for researchers to make their models computationally feasible, and the limitations they impose to achieve this result in suboptimal outcomes [24, 25, 28, 32]. One limitation of Pyysalo et al.'s [32] open discovery method is that it can recognize only a single correct target response for each case, and their system is currently limited to discovery over paths of length two. Since the graph generated by the relations in SemMedDB is very large, making models computationally intensive, Zhang et al. [25] used a sub-graph instead, which resulted in loss of information and therefore affected the accuracy of their model.
Many authors agree that the use of larger and more varied data sets would improve the accuracy of their models. Limitations encountered include the use of unbalanced [21] and small [22, 32] data sets. The models proposed by Zhao et al. [22] and Pyysalo et al. [32] perform well using a rather small data set.
However, the authors agree that their system's computational efficiency may be greatly reduced if the knowledge base is large. Yang et al. [19] believe that the rankings of the drug-disease pairs generated by their model may be adversely affected since their methodology did not consider aliases for drug names.
The proposed LBD approaches have demonstrated considerable achievements and promising results in the discovery process. An in-depth analysis of the techniques used has revealed major insights into the main research challenges and future directions for such systems. Proper handling of these research challenges should result in improved accuracy and performance in the LBD process.
From the analysis of the various studies, it was found that extensive manual expert review was required for the selection of the final LBD candidates from a very large number. There is therefore a need to develop approaches to prioritize LBD candidates, which will provide domain experts with essential evidence instead of information overload. The following approaches are proposed to decrease the effort required by domain experts:
• Determine a suitable threshold score for LBD candidates [18, 21]. Candidates below that threshold would be considered as false positives and those above the threshold would be considered for further investigations and experiments.
• Develop a tool to provide recommendations for hypothesis generation [35]
• Make use of rigorous statistical techniques to replace the manual review step by a more automated approach [18]
• Design NLP techniques to detect false predications which occur due to negative associations [19-21, 23]
Most models designed have only considered PubMed and MEDLINE abstracts as their main text corpus. Many authors have proposed the incorporation of additional data sources as the text corpus of their models to improve accuracy. A larger knowledge base has the potential to produce more complex relation paths.
The additional data sources which could be considered include:
• NIH grants summary to identify potentially hidden and novel associations by investigating exploratory analysis methods [40]
• Biological data to find more drug candidates for Covid-19 drug repurposing [25]
• Biomedical ontologies to consider additional interesting associations [38]
• Drug-disease databases like CTD and DrugBank for better training in drug repurposing [19]
• FAERS data for pharmacovigilance methods instead of only relying on EHR data [28]
• Spontaneous reporting data for the extraction of drug side-effect associations [29]
Studies in this review have clearly indicated the quest for researchers to obtain more accurate results. Due to the very large datasets and the multitude of possible pathways, the LBD models proposed are computationally intensive, therefore leading to certain limitations. Techniques proposed to improve accuracy include:
• Integration of machine learning and deep learning algorithms in LBD models [27, 29, 37]
• Development of high-quality NLP tools for better accuracy, due to the reported shortcomings of existing tools
• Use of relevant tools for the normalization of gene and disease targets [19]
• Consideration of full texts of research articles instead of only titles and abstracts [32]
• Use of graph embedding to obtain long paths [23]
• Consideration of indirect relationships from knowledge graphs [24]
The purpose of this work was to carry out a systematic literature review of recent publications in Literature Based Discovery approaches in the field of evidence-based healthcare. Four research questions had been set out in the planning phase of the review and the papers were deeply analyzed so as to get insights on the research questions. This work has revealed the potential of LBD techniques to discover hidden knowledge in emerging areas of healthcare and provides a comprehensive contextualization to various stakeholders in the health informatics community.
The results of this review will therefore help the latter to have a good understanding of the appropriate approaches used in different application areas and contexts, and the challenges they will have to face. The synthesis of the results of this work has shed light on recent issues and challenges that drive new LBD models and provides avenues for their application in other diverse areas in the healthcare domain. The research challenges identified show different perspectives to address further research in the field and, if properly tackled, will result in better overall accuracy and performance of LBD systems, therefore contributing to speeding up the knowledge translation process.
Authors' Contributions All authors have made a substantial, direct, intellectual contribution to this study.
Funding This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability Not applicable.
Conflicts of interest None.
References
[1] The anatomy of medical research: US and international comparisons
[2] How to translate health research knowledge into effective healthcare action
[3] Getting evidence into practice-implementation science for paediatricians
[4] The answer is 17 years, what is the question: understanding time lags in translational research
[5] Barriers associated with evidence-based practice among nurses in low- and middle-income countries: A systematic review
[6] Emerging approaches in literature-based discovery: techniques and performance review. The Knowledge Engineering Review
[7] Guidelines for performing systematic literature reviews in software engineering
[8] Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement
[9] Translational science and evidence-based healthcare: a clarification and reconceptualization of how knowledge is generated and used in healthcare
[10] Evidence based medicine: what it is and what it isn't
[11] Exploring relation types for literature-based discovery
[12] Literature based discovery: models, methods, and trends
[13] Using literature-based discovery to identify novel therapeutic approaches
[14] A context-based ABC model for literature-based discovery
[15] A collaborative filtering-based approach to biomedical knowledge discovery
[16] Google Scholar's coverage of the engineering literature: an empirical study
[17] Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases
[18] A new method for prioritizing drug repositioning candidates extracted by literature-based discovery
[19] Literature-based discovery of new candidates for drug repurposing
[20] SKiM-A generalized literature-based discovery system for uncovering novel biomedical knowledge from PubMed
[21] Prioritizing adverse drug reaction and drug repositioning candidates generated by literature-based discovery
[22] Relation path feature embedding based convolutional neural network method for drug discovery
[23] SemaTyP: a knowledge graph based literature mining method for drug discovery
[24] A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases
[25] Drug repurposing for COVID-19 via knowledge graph completion
[26] Literature based discovery of alternative TCM medicine for adverse reactions to depression drugs
[27] Mining biomedical literature to explore interactions between cancer drugs and dietary supplements
[28] Using literature to construct causal models for pharmacovigilance
[29] Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications
[30] Using literature-based discovery to explain adverse drug effects
[31] Adopting literature-based discovery on rehabilitation therapy repositioning for stroke
[32] LION LBD: a literature-based discovery system for cancer biology
[33] Literature-related discovery and innovation: chronic kidney disease
[34] Mining scientific literature about ageing to support better understanding and treatment of degenerative diseases
[35] Outlier based literature exploration for cross-domain linking of Alzheimer's disease and gut microbiota
[36] Treatment repurposing for inflammatory bowel disease using literature-related discovery and innovation
[37] Gene fingerprint model for literature based detection of the associations among complex diseases: a case study of COPD
[38] Investigating the role of interleukin-1 beta and glutamate in inflammatory bowel disease and epilepsy using discovery browsing
[39] Using literature-based discovery to identify candidate genes for the interaction between myocardial infarction and depression
[40] Using deep learning towards biomedical knowledge discovery