key: cord-0784585-lob989fj authors: Zhu, Xian; Gu, Yueming; Xiao, Zhifeng title: HerbKG: Constructing a Herbal-Molecular Medicine Knowledge Graph Using a Two-Stage Framework Based on Deep Transfer Learning date: 2022-04-27 journal: Front Genet DOI: 10.3389/fgene.2022.799349 sha: c3637ae891321c0b97e90d9c526a55e603a77eb4 doc_id: 784585 cord_uid: lob989fj Recent advances have witnessed a growth of herbalism studies adopting a modern scientific approach in molecular medicine, offering valuable domain knowledge that can potentially boost the development of herbalism with evidence-supported efficacy and safety. However, these domain-specific scientific findings have not been systematically organized, affecting the efficiency of knowledge discovery and usage. Existing knowledge graphs in herbalism mainly focus on diagnosis and treatment with an absence of knowledge connection with molecular medicine. To fill this gap, we present HerbKG, a knowledge graph that bridges herbal and molecular medicine. The core bio-entities of HerbKG include herbs, chemicals extracted from the herbs, genes that are affected by the chemicals, and diseases treated by herbs due to the functions of genes. We have developed a learning framework to automate the process of HerbKG construction. The resulting HerbKG, after analyzing over 500K PubMed abstracts, is populated with 53K relations, providing extensive herbal-molecular domain knowledge in support of downstream applications. The code and an interactive tool are available at https://github.com/FeiYee/HerbKG. A knowledge graph (KG) serves as a useful tool to represent real-world semantic phenomena in an organized way (Wang et al., 2017) . Specifically, a KG consists of a collection of three tuples, each of which follows a format of [head, tail, relation] . The head and tail specify an entity pair, and the relation defines how the two entities are semantically related. Both entities and relations can have domain-specific properties. A broad spectrum of large-scale KGs encoding generic knowledge, such as YAGO3 (Mahdisoltani et al., 2014) , Freebase (Bollacker et al., 2008) , DBpedia (Auer et al., 2007) , and BabelNet (Navigli and Ponzetto, 2012) , have been developed and have created massive value that benefits a variety of downstream applications, such as knowledge visualization (Kerdjoudj and Curé, 2015) and reasoning (Chen et al., 2020) , information retrieval (Wise et al., 2020) , and question answering (Saha et al., 2018) . Furthermore, domain-specific KGs have also gained extensive interests (Ernst et al., 2014) by domain experts who desire to have efficient access to high-quality domain knowledge. For instance, a biomedical KG allows researchers and medical practitioners to mine and discover complex interactions between millions of bioentities (e.g., chemicals, genes, and diseases), facilitating academic knowledge query and clinical decision making (Su et al., 2021; Zheng et al., 2021) . Adding such domain knowledge into existing healthcare applications can greatly improve the quality and efficiency of current medical operations and IT systems. As a sub-field of medicine, herbal medicine, also referred to as herbalism, is the study of pharmacognosy and the use of medicinal plants, forming the basis of traditional medicine that has been existing and evolving for over five thousand years across multiple continents and countries, including Africa, Americas, ancient Egypt, Greece, China, and India (Wikipedia contributors, 2004c) . In the modern era, herbalism has received criticism and skepticism due to the lack of strong evidence of efficacy and safety found in high-quality scientific publications. Currently, herbalism is still the primary health care in many underdeveloped regions and is widely used to treat chronic diseases, such as diabetes (Egede et al., 2002) , cancer (Burstein et al., 1999) , end-stage kidney disease (Roozbeh et al., 2013) , and asthma (Szelenyi and Brune, 2002) . In the past decades, more and more researchers have taken a modern scientific approach to investigating the biological function of herbs and herbal contents, validating the interconnection between herbs, extracted chemicals, diseases, and genes (Babu et al., 2007; Brackman et al., 2008) . These studies bridge modern molecular medicine and traditional herbalism, which provides more evidence to support the medicinal usage of herbs, opening a promising direction to boost the development of herbalism in the 21st century. In addition, these studies provide valuable domain knowledge in herbalism that should be restructured for building a herb KG. Existing efforts of herb KG construction mainly focus on the diagnosis and treatment side of herbalism. Wang et al. (Wang et al., 2019) propose a Knowledge Graph Embedding Enhanced Topic Model (KGETM) for traditional Chinese medicine (TCM). The used KG in the study considers relations between symptoms, syndromes, treatments, and herbs to support a herb recommendation application. Zheng et al. (Zheng et al., 2020) have developed a TCM KG that stores herbs, therapies, prescriptions, diseases, syndromes, symptoms, and the relations between them. Another group (Somé et al., 2019) has developed the West African Herbal-based Traditional Medicine KG consisting of 143 identified West African medicinal plants and 108 recipes to treat 110 diseases and symptoms. Similar herb KGs that aim to facilitate prescription can be found in literature Miao et al., 2018; Gong et al., 2021) . On the other hand, prior efforts have also investigated generic biomedical KGs. Zheng et al. (Zheng et al., 2021) propose PharmKG, which consists of 500,000 relations between genes, drugs, and diseases. Another study by Su et al. (Su et al., 2021) proposes the Cornell Biomedical Knowledge Hub (CBKH) that takes into account genes, drugs, diseases, anatomies, molecules, and symptoms, resulting in a KG with 2,231,297 entities of six types and 48, 678, 651 relations of eight types. To our best knowledge, there is no existing KG in the literature to model herbs, the contained chemicals, and their interactions with genes and diseases from the viewpoint of molecular medicine. Thus, our study aims to fill this gap. The contributions of this study are summarized as follows: • We propose a Herb ontology named HerbOnt composed of four entity types and five relation types to encode the interplay between herbs, chemicals, genes, and diseases. • We develop a learning framework to automate the construction of HerbKG. We leverage an existing named entity recognition (NER) model and design a BERT-based model for relation extraction (RE) to annotate raw PubMed abstracts that are screened to match the subject of herbalism. • To validate the RE model, we create a herb RE dataset with 3,526 human-annotated relations. BERT and two of its variants, SciBERT and BioBERT are evaluated on the herb RE dataset. In addition, two performance boosters, including a fine-tuning strategy and a substitution-based generative augmentation module, have been utilized for performance improvement. Our ablation studies show that the two boosters can bring consistent gains in the F1 score due to the additional domain knowledge injected into the models. The best model, the fine-tuned BioBERT that is further trained on the augmented dataset, can achieve an F1 score of 95.9%. The self-created herb RE dataset, with the evaluated models, can serve as a credible benchmark for future research. • The proposed system has analyzed a total of 516,393 PubMed abstracts and identified 4,130 herbs, 6,331 chemicals, 2,187 diseases, and 2,641 genes, with 53,754 distinct relations, providing valuable domain knowledge in herbalism from the molecular perspective. In addition, the constructed HerbKG can support multiple downstream tasks like evidence-based graph queries and drug repositioning. We have made the code publicly available at https://github.com/FeiYee/HerbKG, where an interactive tool and a system tutorial are also provided. The rest of this paper is structured as follows. Section 2 provides the technical details of the proposed framework for HerbKG construction. Section 3 presents the experimental results and a discussion about several downstream tasks. Section 4 offers a summary of this study with limitations and future plans. In information sciences, an ontology shows the properties of a subject area and how they are related, by defining a set of categories and concepts that represent the subject (Guarino et al., 2009) . To unify the terminology throughout the article, we use entity types to refer to categories and entities to refer to concepts that are instantiated from entity types. An ontology serves as a template for constructing a KG since only relations defined in the ontology can be added into the KG. As shown in Figure 1 , HerbOnt consists of four entity types, including Herb, Chemical, Disease, and Gene, with five coarse-grained relation types, including HerbHasCompoundChemical (HHC), HerbTreatsDisease (HTD), ChemicalActsOnDisease (CAD), ChemicalAssociatesGene (CAG), and GeneInfluencesDisease (GID). A description of the entity types is as follows. • Herbs in this study can be a part or produced from parts of a plant (either fresh or dried), including the leafy green or flowering parts, seeds, bark, roots and fruits. Examples include "abrus precatorius", "ginkgo biloba", "salvia officinalis", and "cinnamomum cassia". • Chemicals refer to chemical compounds that can be used as medicine. In our study, we mainly focus on the chemicals extracted from herbs. Examples include "essential amino acids", "isoflavanquinones", "diphenhydramine", and "abruquinone A". • A Disease refers to a particular abnormal condition that negatively affects the structure or function of all or part of an organism, and that is not due to any immediate external injury (Wikipedia contributors, 2004a). Examples include "anemia", "otoconia", "hypoosmotic swelling", and "gastric ulcer". • A Gene refers to a basic unit of heredity and a sequence of nucleotides in DNA or RNA that encodes the synthesis of a gene product, either RNA or protein (Wikipedia contributors, 2004b) . Examples include "caspase-3″, "AP-1″, "Bax", and "cytochrome c". We also provide a description of the relation types below: • An HHC describes a containment relation between a herb and a chemical, which is extracted from the herb. A herb may contain one or more chemicals that can be used for medical purposes. For example, cassia barks contains cinnamaldehyde. • An HTD indicates that a herb has positive effect on the treatment of a disease. • A CAD refers to a relation between a chemical and a disease. The effect of the chemical on the disease can be either positive or negative. CAD allows us to understand which chemical extracted from the herb causes what effect on the disease. • A CAG describes an association between a chemical and a gene. For example, a study (Li et al., 2016) shows that cinnamaldehyde (a chemical) can inhibit the PI3K/Akt (a gene) signaling pathway, inducing apoptosis and affecting the biological behavior of human colorectal cancer cells. • A GID indicates a connection between a gene and a disease. For example, a study (Lee et al., 2007) shows that AP-1 (a gene) inactivation can inhibit SW620 colon cancer (a disease) cell growth. Two learning tasks, including NER and RE, are involved in the construction of HerbKG. We provide a formal definition for each task as follows. The goal of NER is to locate entity mentions in an input text and classify them into a set of pre-defined categories. Formally, let s = [t 1 , t 2 , . . . , t n ] denote a sentence s with n tokens. Taking s as an input, an NER model outputs a list of tuples < I s , I e , k > , each of which is an entity mention in s. Here, I s and I e specify the indices of the starting and ending tokens of an entity mention. Thus, both I s and I e are in [1, n], and I s ≤ I e . Also, k belongs to a category set. In our study, k ∈ {"Herb", "Chemical", "Disease", "Gene"}. For the RE task, we choose to develop a medium-sized dataset, because there is no existing one that has the same ontology definition as HerbOnt. We formulate the learning problem as follows. Let D RE {(x i , y i )} m i 1 be the self-developed herb RE dataset with m examples. Each input x i contains the title and content of an abstract and a pair of entities. The label y i specifies the relation type of the entity pair in x i . The task is to train a model to predict the relation type given x i as an input. It is noted that the problem belongs to document-level RE, where the head and tail entity mentions could span across multiple sentences in the abstract. Also, in the context of HerbKG, we have y i ∈ {"HHC", "HTD", "CAD", "CAG", "GID", "Neg."}, in which the first five are positive relation types defined in HerbOnt, and "Neg." represent the negative examples indicating a non-relation. Since not every entity pair of an abstract are related, it is essential to introduce negative examples into the dataset so that the model can be trained to make a distinction. Figure 2 shows a two-stage learning framework of building HerbKG. Since only a small fraction of PubMed articles are relevant to the subject of HerbKG, we develop a screening procedure to identify the matching abstracts. Specifically, only an abstract that contains a mention of either a herb or a chemical in a pre-defined domain vocabulary is considered as a match. In practice, it is straightforward to list the herbs and their contained chemicals for medicinal usage. This work is manually done by one of the authors with domain knowledge in herbal medicine. Each selected abstract is then passed through the PubTator Central (PTC) NER model , followed by a custom BERT-based RE model to produce a list of identified relation triplets, which are used for the HerbKG construction. In addition, two boosting strategies have been utilized for performance improvement, including fine-tuning BERT on domain resources and a generative data augmentation method. The former aims to inject domain knowledge into the BERT model, while the latter can generate synthetic samples to enhance the training set. The constructed HerbKG can support multiple downstream applications, such as descriptive analysis, evidence-based graph query, similarity analysis, and drug repurposing. FIGURE 2 | A two-stage learning framework for building HerbKG. Stage I is the NER task, which is done by the PTC NER model. Stage II is the RE task, which extracts relation triplets used to build the HerbKG. In addition, two boosting strategies have been utilized for performance improvement, including fine-tuning BERT on domain resources and a generative data augmentation method. The former aims to inject domain knowledge into the BERT model, while the latter can generate synthetic samples to enhance the training set. The constructed HerbKG can support multiple downstream applications, such as descriptive analysis, evidence-based graph query, similarity analysis, and drug repurposing. Numerous methods have been developed to solve NER. In this study, we adopt an existing model , referred to as PTC, proposed by Wei et al., who have provided an implementation hosted online at https://www.ncbi.nlm.nih. gov/research/pubtator/. PTC can identify and annotate six types of entities, including Gene, Disease, Chemical, Mutation, Species, and Cellline. At a high level, PTC works by feeding an article into a tagging module, which integrates a series of entity taggers including GNormPlus (Wei et al., 2015a) , AB3P (Sohn et al., 2008) , SimConcept (Wei et al., 2015b) , tmVar 2.0 (Wei et al. , 2018) , SR4GN (Wei et al., 2012) , TaggerOne (Leaman and Lu, 2016) , and Cellosaurus (Bairoch, 2018) . The tagging module also addresses several issues in bio-entity annotation, including abbreviation resolution, detection of composite/variant mentions, and entity normalization. The initial annotations are further processed by a concept disambiguation module to ensure that mentions referring to the same entity receive the same identifier. Figure 3 is a screenshot of bio-concept annotation using the PTC web interface for an article with PMID 12860272. To adapt PTC to suit our needs, we disable the detection of mutations and celllines that are not defined in HerbOnt. Also, a detected species is annotated as a herb if it matches any entity in the pre-defined domain vocabulary. The top three sections in Table 1 display an annotated sample by PTC. These intermediate samples are further annotated by an RE model, which is discussed in the next subsection. As defined in Section 2.2.2, the RE task in this study is a multiclass classification problem aiming to predict the relation type, given an abstract and a pair of annotated entities. Since there is no existing dataset, we have developed a dataset and trained a BERTbased model for the herb RE task. Figure 4 describes the dataset development process. First, we gather a collection of seed keywords of herb varieties that are commonly used in herbal medicine. We then use a self-developed crawler to search the PubMed dataset for the seed herb keywords. The search is only applied to the PubMed abstracts rather than the full texts, because an abstract contains the essential findings of a study; in most cases, the entities and their relations are clearly stated in an abstract, providing sufficient information that can be extracted to build our knowledge graph. The crawler is able to scrape a collection of relevant abstracts guided by the seed keywords. Next, the collected abstracts are fed into the PTC through its restful API at https://www.ncbi.nlm.nih.gov/research/ Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 799349 5 pubtator/api.html (accessed on 7 July 2021). We display a complete sample with full annotation in Table 1 , in which the first three rows, including Title, Abstract, and Entity Mentions, are generated by the PubTator API; the last row, Relations, is completed by a human annotator, who is a Ph.D. student in molecular biology with sufficient domain knowledge for the annotation task. Each annotated abstract follows the PubTator format, as shown in Table 1 , which divides a sample into four sections, each starting with the PubMed article ID of the abstract. The Title and Abstract sections that are directly extracted from the PubMed database. Each entity mention is a six-tuple with a specified sequence of PubMed ID, entity start position, end position, entity text, entity type, and entity ID. Similarly, each relation in the Relations section is a four-tuple sequence including the PubMed ID, the relation type, the head entity ID, and the tail entity ID. Since we adopt BERT-based models for RE, the input can be split into two sentences, denoted by A and B. For our RE task, sentence A is a concatenation of the title and the abstract, and sentence B contains the head and tail entities, separated by a space. For each relation type, we need both positive and negative training examples. Also, each example in the dataset is a three-tuple, namely, sentence A, sentence B, and relation type that are separated by a tab. To simplify training, we group all negative examples together to create a new RE category, i.e., "Neg.". Therefore, the RE type for each example is one of the elements in {"HHC", "HTD", "CAD", "CAG", "GID", "Neg."}, leading to a six-class classification problem. Table 2 shows five instances in the herb RE dataset. Table 3 shows the stats of the herb RE dataset that contains a total of manually annotated 3,526 examples, split in the ratio of 3: 1 to obtain the training and test sets. It is observed that the six classes are highly imbalanced. HHC and Neg. combined account for nearly 70% of all relations, while CAG and GID only take 1.3 and 3.1%, respectively. Thus, a performance metric (see Section 3.1.1) should be carefully chosen to properly deal with the class imbalance issue. Abstract 9848396 Two kinds of cinnamaldehyde derivative, 2′-hydroxycinnamaldehyde (HCA) and 2′-benzoxy-cinnamaldehyde (BCA), were studied for their immunomodulatory effects. These compounds were screened as anticancer drug candidates from stem bark of Cinnamomum cassia for their inhibitory effect on farnesyl protein transferase activity. Ras activation, which is accompanied with its farnesylation, has been known to be important in immune cell activation as well as in carcinogenesis. Treatment of these cinnamaldehydes to mouse splenocyte cultures induced suppression of lymphoproliferation following both Con A and LPS stimulation in a dose-dependent manner. . . attention layer with multiple heads and a feed-forward layer. A self-attention head projects a sequence of input tokens into a latent space to capture the semantic dependency between the tokens. The output of the self-attention layer is normalized, aggregated, and passed through a feed-forward layer to produce a vector, namely, the hidden state, which is the output of the transformer encoder. The paths the tokens take to flow through the encoder can be partially executed in parallel, which allows the training and inference of the neural network to be accelerated. In this study, we adopt Bert base , which is pretrained on BooksCorpus (Pechenick et al., 2015) (800M words) and on English Wikipedia (2,500M words). BERT adopts two unique pre-training techniques, i.e., masked-language modeling (MLM) and next sentence prediction (NSP), both of which are self-supervised. MLM works by randomly masking a fraction of tokens in the input sentence and training the model to predict the missing ones, while NSP trains BERT to predict the follow-up sentence given an input sentence. Both MLM and NSP aim to help BERT better understand the style of unstructured human language by optimizing the loss function of self-training and adjusting the model parameters. BERT has achieved the state-ofthe-art (SOTA) performance in a variety of NLP tasks (Devlin et al., 2018) and yielded a spectrum of variants (Beltagy et al., 2019; Sanh et al., 2019; Lee et al., 2020) . Figure 5 describes the neural architecture of the BERT-based model for the herb RE task. Each input instance contains two strings, including sentences A and B, separated by a [sep] token. Sentence A is a concatenation of the title and abstract of a PubMed article, and sentence B contains an entity pair linked by a space character. The input passes through a stack of embedding layers for token, sentence, and positional embedding, transforming the original text to numeric vectors, which are further fed into a series of transformer encoders (Vaswani et al., 2017) , where the neural parameters are updated via its unique self-attention mechanism. Lastly, the output of the Nth transformer layer passes through a dense and classification (also softmax) layer to generate the prediction result, which is a six-dimensional normalized vector that encodes the confidence scores for each relation type. During training, the predicted and ground truth values are fed into a cross entropy loss function to calculate the loss for back-propagation. We investigated BERT and two BERT variants, namely, BioBERT and SciBERT, to develop the benchmark models for the herb RE task. A big challenge faced by the RE task is the lack of training resources due to the high annotation cost that has to involve a human expert. Therefore, the RE model cannot absorb sufficient domain knowledge to always make the correct prediction. Two strategies, including fine-tuning and data augmentation, have been adopted to enhance the model's learning capability in the context of herbal-molecular medicine. BERT, SciBERT, and BioBERT have been sufficiently pre-trained on different types of domain resources, namely, on general English texts, general scientific articles, and biomedical articles, respectively. Due to the disparity of domains, BERT and its two variants have learned different knowledge, which could lead to misclassification when applied to the domain of herbal-molecular medicine. Therefore, we conduct fine-tuning for all three models on the 516K PubMed abstracts to inject more domain knowledge into the models. The tuned models are then further trained to tackle the downstream RE task. We employ a GPT-2-based generative model to generate synthetic samples to enhance the quantity and diversity of the training data. Figure 6 depicts the data augmentation mechanism, which includes the following steps. • Corpus preparation. GPT-2 is a pre-trained generative model that can produce generic English sentences. To satisfy our requirements, GPT-2 needs to be fine-tuned on a corpus relevant to our RE task. Thus, the first is to prepare a corpus with textual resources that 1) are in the domain of herbal-molecular medicine and 2) present the entities and relations pre-defined in the herb ontology. Intuitively, we extract such sentences from the annotated dataset described in Section 2.5.1 based on one condition, namely, the sentence must contain at least a pair of entities that present one of the pre-defined relation categories. For example, the sentence "Cinnamaldehyde induces apoptosis via inhibition of the PI3K/Akt signaling pathway." presents a CAG relation, is an ideal candidate to be added to the corpus. Since there are five relation types, five corpora are needed. • Substitution I. For each sentence in each corpus, we apply a transformation by substituting each entity mention in the sentence with a type token. For example, the sentence in the above item becomes "[Chemical] induces apoptosis via inhibition of the [Gene] signaling pathway." after the substitution. This step allows GPT-2 to focus on the semantic relations between generic entity types rather than particular entity mentions. • Fine-tuning GPT-2. After substitution I, we send each corpus to fine-tune a GPT-2 model. The tuned GPT-2 model can generate sentences that are both semantically and syntactically similar to the ones in the input corpus. For instance, a generated sentence may look like "[Chemical] reduced myocardial infarction area and attenuated [Gene] production." • Substitution II. The generated samples are passed through another substitution block, which samples a pair of entities with known relations from an entity database (i.e., entity DB in the Figure) to replace the type token, namely, the placeholder, yielding the final augmented sample with the Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 799349 8 correct entity and relation annotation. The generated sentence from the previous item can become "Flavonoids reduced myocardial infarction area and attenuated TNFalpha production." which encodes a CAG relation between the entities in bold. The output of this module can be used to train the BERT-like models for the RE task. All experiments were implemented using Python 3.6.7 and PyTorch 1.7.1 and conducted on a Windows 10 workstation with an i7-10875h CPU and a Tesla V100 16G GPU. We chose BERT base, which has 12 layers of encoders with a hidden size of 768, 12 attention heads, and 110M trainable parameters. As such, we choose the BERT base version for both SciBERT and BioBERT to conduct a comparable experiment. We focus on the evaluation of the RE module for two reasons: 1) we adopted an existing NER model whose performance has been extensively evaluated in the original paper ; 2) the outputs of the RE model are directly added into the HerbKG, which determines the quality of the KG. Given a highly imbalanced RE dataset, accuracy is not an adequate metric since the model is encouraged to classify examples into the major classes and can still achieve a decent accuracy. A better metric to deal with class imbalance is the F1 score, which is defined as the harmonic mean of precision (Pre) and recall (Rec). Also, Pre and Rec are useful metrics since they are two inverse indicators of false alarms and missed instances, respectively. With the given true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the definitions of Pre, Rec, and F1 are given in Eqs. 1-3. In addition, we choose the precision-recall area under the curve (PR AUC) as another important performance indicator, which is usually calculated by the average precision (AP) metric given in Eq 4, where precision P is expressed as a function of recall R, and AP is the average value of precision over the interval from R = 0 to R = 1. In other words, AP summarizes the PR curve as the weighted mean of precisions at each threshold, with the increase in recall from the previous threshold used as the weight, namely, R n − R n−1 in Eq 4. Since the three benchmark RE models adopt the same neural architecture, it is convenient to tune the models with the same set of hyperparameters in the same predefined ranges, which are given in Table 4 . We tuned four hyperparameters, including epoch, learning rate, loss function, and optimizer, which are commonly used in prior efforts (Lee and Hsiang, 2019; Mosbach et al., 2020) . For the number of epochs, we chose the odd numbers less than ten. Since the models have been extensively pre-trained, fine-tuning them for a downstream task would be fast (Zhu et al., 2021) . For the learning rate, we chose four values including 0.0001, 0.0003, 0.001, 0.003, and 0.01. In practice, a large learning rate may bring difficulty in model convergence due to overshooting, and a small learning rate may lead to slow convergence (Goodfellow et al., 2016) . For the loss function, we examined two options, including the standard cross entropy (CE), and the weighted CE. Since the class labels are imbalanced, using a weighted CE allows the algorithm to trade off recall and precision by up-or down-weighting the cost of a positive error relative to a negative error. Lastly, for the optimizer, we explored three options, including stochastic gradient descent (SGD), Adam, and SGD with Momentum. During training, SGD is computationally fast due to its frequent updates to the parameters; the Adam optimizer improves SGD by combining the AdaGrad and RMSProp algorithms to handle sparse gradients; adding momentum to SGD allows the optimization algorithm to accumulate the gradient used by the previous steps to calculate a better direction for the next step. A grid search is conducted based on the given ranges of hyperparameters to determine the optimal setting. Table 5 shows the best hyperparameters for each model obtained from the grid search using F1 as the selection criterion. It is observed that all three models demonstrate an F1 of over 92%. The best model, BioBERT, presents an F1 of 94.37%. A detailed breakdown of the results is discussed in the next subsection. We report a performance comparison on the RE task for BERT, SciBERT, and BioBERT using their base models in Tables 6-8 respectively. Each The highest F1 score is marked in bold. The results presented in Tables 6-8 are from the base models of BERT, SciBERT, and BioBERT without any boosting strategy applied. Figures 7, 8 show the performance scores in F1 and AP for the three models. Specifically, for each model, we incrementally apply the two boosters, namely, fine-tuning and data augmentation, yielding six models. Our observations on the results are as follows. • Both fine-tuning and data augmentation have demonstrated consistent performance gains for all three models, validating the effectiveness of the boosting strategies in the given context. The best model, namely, the fine-tuned BioBERT further trained on the augmented dataset, presents an F1 score of 95.9%. • Fine-tuning has boosted the F1 by 1.1, 0.8, and 0.7%, for BERT, SciBERT, and BioBERT, respectively. It is observed that the gain reduces as the model moves from BERT to BioBERT, which can be explained from the perspective of the domain resources used for pre-training. BERT was pretrained on generic English texts, which is distant from the domain of herbal-molecular medicine in this study. On the other hand, BioBERT has already been trained on PubMed articles, which are highly relevant to the domain for our task. Whereas SciBERT is somewhere in the middle. Therefore, fine-tuning BERT led to the most performance gain since more domain knowledge can be injected and transferred to the downstream task. In contrast, BioBERT does not benefit too much from the fine-tuning, because most syntactic and semantic patterns (i.e., domain knowledge) have been seen and learned during pre-training. • Adding data augmentation on top of the three tuned models also brings consistent improvements, yielding a gain of 0.6, 0.9, and 0.5%, for BERT, SciBERT, and BioBERT, respectively. The gain is minor due to the strategy taken to generate the synthetic samples. As described in Section 2.6.2, a sentence is selected to fine-tune GPT-2 only if it contains two entity types with a known relation. As a result, the chosen sentences are generally short and only present intra-sentence relations. However, most of the hard cases are samples with inter-sentence relations. In other words, two entities may span multiple sentences to present a relation. Fortunately, these hard cases are rare in the abstracts of scientific papers. In fact, most authors tend to use concise and clear sentences to present scientific findings, which is good news for our RE task. • Similar observations can be obtained from Figure 8 regarding the effects of fine-tuning and data augmentation on AP. The addition of fine-tuning brings up the AP by 0.6, 1.5, and 1.5% for the three base models, and the addition of data augmentation leads to a gain of 1.7, 0.6, and 0.7% for the three fine-tuned models. The gains have been consistent across for both metrics with all three models, validating the efficacy of the two boosters. In this section, we discuss how BioBERT is trained to learn the relations using several qualitative results to gain a deeper understanding on the impacts of individual tokens on the determination of a relation. The process is as follows. For each instance in the test set, only sentences that contain both head and tail entities are kept and saved in a list. Let s = [t 1 , t 2 , . . . , t n ] denote an extracted sentence with n tokens t 1 , . . . , t n . Our goal is to calculate a score that measures the impact of each individual token on the relation of an entity pair. Specifically, we first pass s through the fine-tuned BioBERT model to obtain a score denoted by c * , which represents the probability that s is classified into the correct relation type. Thereafter a loop is employed to iterate through s token by token. For the ith iteration, token t i is replaced by a meaningless token t′, and the modified sentence is passed through the same BioBERT model once again to obtain another score denoted by c i . The score difference, denoted by d i = c * − c i , reflects the importance of token t i . In other words, the larger the d i , the more greatly the confidence score drops, thus, the more important t i is. Figure 9 shows an example where d i is obtained for each token i. The sentence is extracted from an instance in the test set and expresses a HHC relation between "cassia bark" and "cinnamaldehyde". It is observed that tokens that receive high importance scores fall into two categories: 1) tokens such as "cassia", "barks", and "cinnamaldehyde" are parts of the entities that are obviously important; 2) tokens such as "contained" and "contents" are the keywords that semantically determine the relation type. For the latter case, it is noted that BioBERT can effectively learn and quantify the impacts of tokens in a given instance, demonstrating its superior capability of semantic reasoning. The proposed system analyzed a total of 516,393 PubMed abstracts and identified 4,130 herbs, 6,331 chemicals, 2,187 diseases, and 2,641 genes, with 53,754 distinct relations, including 19,872 HHC, 13,627 HTD, 9,984 CAD, 3,353 CAG, and 6,918 GID relations. A subgraph of HerbKG (stored using the Neo4j graph database (Webber, 2012) ) is shown in Figure 10 , which consists of one herb entity, three chemical, eleven gene, and three disease entities. Also, the subgraph includes three HHC, eleven CAG, six GID, and three CAD relations. It is noted that only three abstracts were processed through the proposed learning pipeline to generate this subgraph. This subsection covers four categories of downstream applications with several case studies to demonstrate the potential of HerbKG to provide data-driven and evidencebased knowledge support in pharmacology. An advantage of a KG is that data, as stored in entities and relations, can be easily visualized and presented to end users. Thus, knowledge visualization has been a basic feature for KGbased applications (Yu et al., 2017; Liu et al., 2018; Wise et al., 2020; Zheng et al., 2020) . In addition, descriptive analysis, which helps describe, show or summarize data points, is desired in a dashboard interface that allows a user to quickly grasp a big FIGURE 9 | An examination of token importance in the determination of a relation type. The example shows in the figure is correctly classified by the BioBERTbased RE model, which outputs a triplet ("cassia bark", "cinnamaldehyde", "HHC") as an entry in the HerbKG. FIGURE 10 | The extracted graph for herb "Sophora flavescens" is a subgraph of HerbKG, in which Herb, Chemical, Gene, and Disease entities are marked in red, tan, green, and pink, repsectively. Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 799349 picture of data. Statistical results can be customized, presented, and visualized (Su et al., 2021; Zheng et al., 2021) . The core mission of HerbKG is to investigate the molecular mechanism of herbal medicine. Therefore, it is crucial to understand the physiological behavior of herbs in the treatment of diseases by regulating gene expression/function. At a high level, the HerbKG can provide the top-ranked herbs with the most related genes ( Figure 11A , the genes regulated by the most herb extracts ( Figure 11B , the herbs that can treat the most diseases ( Figure 11C) , and the most treated diseases by herbs ( Figure 11D ). It is observed that five herbs, including Methyl Salicylatum, Allium sativum, Andrographis Paniculata, Panax Ginseng, and Rhizoma Curcumae, appear in both list (a) and (c), indicating that the extracted chemicals from these herbs have been extensively experimented to validate their effects on the diseases at the molecular level. Also, it is found that three heat shock proteins, namely HSP70, HSP90, and GRP78, are among the top-ten genes in list (b). Heat-related proteins have shown Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 799349 13 significance in clinical trials, especially in cancer treatment. Other top-ranked genes include MCL-1 (related to Myeloid Leukemia), ACE2 (related to human coronavirus), SOD2 (related to idiopathic cardiomyopathy, premature aging, sporadic motor neuron disease, and cancer), and so on. In addition, the topranked diseases in list (d) include several deadliest diseases, such as various types of cancer, diabetes, Alzheimer's disease, and lower respiratory infections (e.g., MERS). Meanwhile, herbs are used to treat a wide range of common diseases, such as cold, obesity, hypertension, cough, and diarrhea. In all, these statistical results can be displayed in a dashboard to help users gain a highlevel understanding of the commonly studied herbs and their related genes and diseases. KGs that are built from scientific articles are supported by the research findings in the source articles. Since users of such KGs could be researchers, doctors, or clinical practitioners, providing an evidence for each query that points to the original source is a huge advantage. With this feature, users can be more convinced by the information found in the KG and can be easily re-directly to the first-hand resource (Miao et al., 2018; Zheng et al., 2021) . In this study, each extracted relation in HerbKG is based on a sentence in an abstract. The sentence where both entities appear becomes the evidence supporting the relation. For example, Table 9 shows a collection of relations learned from an abstract (PMID27882228), which is a study that investigated Oxymatrine (OMT), a component of Sophora flavescens, and its potential treatment for neurodegenerative diseases via the regulation of a set of genes. The test was done using our best model and achieved a satisfactory result, except that the CAG relation between OMT and myeloid differentiation factor (MYD) 88 (last row of the table) was not detected due to the fact that the PubTator-based NER model failed to classify it as a gene in the first place. Each sentence that contains a detected relation was marked as a piece of evidence, and the relation is reinforced if it is supported by multiple evidence from different articles, meaning that the relation has been validated by more than one study and becomes more convinced. With this setup, a wide range of graph queries can be made. Since HerbKG is stored in a Neo4j graph database, a question should be first translated into a query statement in Cypher and sent to the Neo4j database engine; then a resulting subgraph is returned. For example, if our query were "find the top three most studied chemicals of Sophora flavescens and their related entities," the result would be the entire graph in Figure 10 . Similarity analysis in a KG is useful analytical method that measures the entity-entity or relation-relation similarity. Wang et al. utilize the Pearson correlation (Benesty et al., 2009) to represent the semantic similarity of herbs (Wang et al., 2019) . Recent graph neural network (GNN) models facilitate this analysis by encoding a collection of relevant features into a node/relation embedding, which can be directly used for similarity calculation (Shen et al., 2019; Wise et al., 2020; Zheng et al., 2021) . For example, Fokoue et al. adopt a GNN to compute drug similarity, which is used as a feature to predict drug-drug interaction. A unique value provided by HerbKG is the interplay between a herb-extracted chemical and a gene, which affects a disease. To quantify the similarity between two herbs in their biological functions, we focus on two aspects, namely, the shared set of genes they regulate and the shared set of diseases they may treat. Let S i G and S i D denote the set of genes herb extract i can regulate and the set of diseases i may treat, respectively. The similarity between two herb extracts i and j is given as follows. Intuitively, the more related genes and diseases two chemicals share, the more similar they are. With this idea, we performed similarity analysis for "Nelumbo nucifera", and found the top-five most similar herbs as follows: Acanthopanax senticosus (0.56), Scutellaria barbata (0.41), Litchi chinensis (0.39), Myristica fragrans (0.31), Kaempferia galanga (0.28), where the values in the brackets indicate the similarity scores. Understanding the biofunction similarity between herbs is an important first step for drug repositioning, discussed in the following subsection. Drug repurposing (or repositioning) aims to discover an existing drug's new medical indications outside of the scope of its original usage (Zhu et al., 2020) . KG-based drug repurposing aims to CAG OMT was shown to suppress the levels of myeloid differentiation factor (MYD)88, nuclear factor (NF)-κB, caspase-3, inducible nitric oxide synthase, tumor necrosis factor-α, interleukin (IL)-1β and IL-6 discover potential drug-target or drug-disease relations that do not currently exist in the KG (Boudin, 2020) . Existing efforts have leveraged KGs to identify drugs for the treatment of Covid-19 (Al-Saleem et al., 2021) and rare diseases (Sosa et al., 2019) . It is also noted that a drug-disease relation may be either direct or indirect, investigated in (Zhu et al., 2020) , where different types of path between a drug and a disease are considered. On HerbKG, drug repositioning can be transformed to a link prediction problem that predicts a potential association between a chemical and a disease, which are not previously connected. We could use the following steps. First, a disease, say, the Parkinson's disease is selected. Second, we locate the Parkinson's disease in HerbKG as well as the its related herbs, chemicals, and genes. Third, the identified genes serve as starting point to find the associated chemicals that are not directly connected to the disease in HerbKG. These chemicals may be considered as drug candidates. However, if no such chemicals can be found, a signature matching process can be employed by comparing the biological profile (including information about structure, genetic and disease association, adverse effect, and graph properties, etc.) of a drug with that of another drug that is known to be used for treating the disease. In fact, matching is an operation that measures the similarity between chemicals, which can be done via machine learning. Specifically, such a learning model takes as input a chemical's biological profile and the disease name and predicts a score that indicates the likelihood that the chemical can treat the disease. The current version of HerbKG has only encoded the knowledge of genetic and disease association, whereas more profile information for chemicals is needed to build an accurate model. In addition to the chemicaldisease association, a more effective approach is to predict the chemical-gene interaction, given that most diseases are known to be caused by specific genes with aberrant expression patterns. Therefore, the top-ranked chemical-gene pairs can be generated by the predictive model and used for further validation to support the new usage of a drug. Recent advances have witnessed the prosperity of KGs, which can effectively represent knowledge in the physical world. Each KG is a semantic network that stores a collection of triplets, each of which encodes the relation between a pair of entities. KGs can support a wide range of applications, such as knowledge reasoning, information retrieval, question answering, and visualization. Also, domain specific KGs have received numerous interests from domain professionals and practitioners. A typical example is a biomedical KG, which allows doctors and researchers to mine and discover interplay between bio-entities, potentially accelerating the efficiency and improving the accuracy of current clinical practice. Traditional medicine that uses herbs for health care has been existing for over five thousand years. However, herbalism has been criticized for its insufficiently verified efficacy and safety in modern medicine research. Current herbal KGs mainly focus on the diagnosis and treatment side of herbalism that explore relations between herbs, symptoms, and treatments, rather than investigate it from the view point of molecular medicine. In the past decades, more researchers adopt modern approaches in molecular medicine to study how herbs and their extracted contents affect the biological functions of human body. Our investigation shows that this type of researches have not been extensively utilized in the KG community, opening a promising research direction. Our study aims to construct a KG to bridge herbal and molecular medicine. We propose HerbKG with four entities, namely, herbs, chemicals that are extracted from the herbs, diseases that can be treated by herb contents, and genes that are affected by the chemicals. Six relation types are defined to model the interplay between the entities. We develop a systematic framework to automate the construction of HerbKG by extracting relational triplets from PubMed abstracts. The proposed framework adopts an existing NER (i.e., PTC) model and a custom BERT-based RE model. The RE model is validated on a self-created herb RE dataset and demonstrates superior performance. The resulting HerbKG, after analyzing over 516K abstracts, is populated with 53,754 relations, offering valuable domain knowledge in herbalism from the molecular perspective. A key challenge for supervised learning in a domain-specific task is the lack of abundant training resource. This challenge is addressed with two unsupervised strategies in our work. The first strategy, fine-tuning on domain resources, is to encourage the BERT model to learn more domain knowledge. The second strategy, substitution-based generative augmentation, aims to generate synthetic training samples based on the existing expert-annotated ones. The mixing of supervised and unsupervised learning paradigms brings new opportunities to tackle the problem of KG construction. This study has the following limitations that will be addressed in future work. First, the RE task can be made more fine-grained. For example, the role of a chemical played in regulating a gene's activity or function can be divided into several subclasses such as inhibitor, activator, antagonist, and agonist, etc. Fine-grained relations can enhance the knowledge granularity encoded by HerbKG and better support the downstream applications. Another direction is to explore existing data resources. After all, several wellknown benchmarks that model either drug-drug, chemicalprotein, or chemical-disease relations have been studied and can be utilized in model training. Lastly, more advanced downstream applications in drug discovery can be developed. Opportunities are twofold. One idea is to adopt a GNN model to better encode a wider spectrum of data properties, such as multi-omics, molecular structural, and graph properties, for entities and relations in HerbKG and facilitate the link prediction tasks like drug repurposing and target inference. Also, predicting the effect of drug combination could be a significant task that comes with a unique advantage in herbal medicine since drug compound has been a typical way of prescription in traditional herblism. As such, abundant training data can be gathered to train learning models. Knowledge Graph-Based Approaches to Drug Repurposing for Covid-19 Dbpedia: A Nucleus for a Web of Open Data Cinnamaldehyde-A Potential Antidiabetic Agent The Cellosaurus, a Cell-Line Knowledge Resource Scibert: A Pretrained Language Model for Scientific Text Pearson Correlation Coefficient Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge Computational Approaches for Drug Repositioning: Towards a Holistic Perspective Based on Knowledge Graphs Cinnamaldehyde and Cinnamaldehyde Derivatives Reduce Virulence in Vibrio Spp. By Decreasing the Dna-Binding Activity of the Quorum Sensing Response Regulator Luxr Use of Alternative Medicine by Women with Early-Stage Breast Cancer A Review: Knowledge Reasoning Over Knowledge Graph Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding The Prevalence and Pattern of Complementary and Alternative Medicine Use in Individuals with Diabetes Knowlife: a Knowledge Graph for Health and Life Sciences Kgrn: Knowledge Graph Relational Path Network for Target Prediction of Tcm Prescriptions Deep Learning What Is an Ontology? Rdf Knowledge Graph Visualization from a Knowledge Extraction System Taggerone: Joint Named Entity Recognition and Normalization with Semi-markov Models 2-hydroxycinnamaldehyde Inhibits Sw620 colon Cancer Cell Growth through Ap-1 Inactivation Biobert: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining Patentbert: Patent Classification with fine-tuning a Pre-trained Bert Model Cinnamaldehyde Affects the Biological Behavior of Human Colorectal Cancer Cells and Induces Apoptosis via Inhibition of the Pi3k/akt Signaling Pathway T-Know: A Knowledge Graph-Based Question Answering and Infor-Mation Retrieval System for Traditional Chinese Medicine Yago3: A Knowledge Base from Multilingual Wikipedias Construction of Semantic-Based Traditional Chinese Medicine Prescription Knowledge Graph On the Stability of finetuning Bert: Misconceptions, Explanations, and strong Baselines Babelnet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution Use of Herbal Remedies Among Patients Undergoing Hemodialysis Complex Sequential Question Answering: Towards Learning to converse over Linked Question Answer Pairs with a Knowledge Graph Distilbert, a Distilled Version of Bert: Smaller, Faster, Cheaper and Lighter Kgdds: A System for Drug-Drug Similarity Measure in Therapeutic Substitution Based on Knowledge Graph Curation Abbreviation Definition Identification Based on Automatic Precision Estimates Enabling West African Herbal-Based Traditional Medicine Digitizing: the Watrimed Knowledge Graph Frontiers in Genetics | www.frontiersin.org A Literature-Based Knowledge Graph Embedding Method for Identifying Drug Repurposing Opportunities in Rare Diseases Cbkh: The cornell Biomedical Knowledge Hub. medRxiv Herbal Remedies for Asthma Treatment: Between Myth and Reality Attention Is All You Need Knowledge Graph Embedding: A Survey of Approaches and Applications A Knowledge Graph Enhanced Topic Modeling Approach for Herb Recommendation A Programmatic Introduction to Neo4j Pubtator Central: Automated Concept Annotation for Biomedical Full Text Articles Gnormplus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains Sr4gn: A Species Recognition Software Tool for Gene Normalization Simconcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedical Text Tmvar 2.0: Integrating Genomic Variant Information from Literature with Dbsnp and Clinvar for Precision Medicine Disease -Wikipedia, the Free Encyclopedia. Wikimedia Foundation Gene -Wikipedia, the Free Encyclopedia. Wikimedia Foundation Herbal Medicine -Wikipedia, the Free Encyclopedia. Wikimedia Foundation Covid-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature Knowledge Graph for Tcm Health Preservation: Design, Construction, and Applications. Artif Pharmkg: A Dedicated Knowledge Graph Benchmark for Bomedical Data Mining Tcmkg: A Deep Learning Based Traditional Chinese Medicine Knowledge Graph Platform Full-Abstract Biomedical Relation Extraction with Keyword-Attentive Domain Knowledge Infusion Knowledge-Driven Drug Repurposing Using a Comprehensive Drug Knowledge Graph The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/ FeiYee/HerbKG. Conceptualization and methodology, XZ, YG, and ZX; software, validation, and original draft preparation, XZ and YG; review and editing, XZ and ZX. All authors have read and agreed to the published version of the manuscript. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.Copyright © 2022 Zhu, Gu and Xiao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.