key: cord-0043197-mvvkkn3s authors: Kwak, Heeyoung; Lee, Minwoo; Yoon, Seunghyun; Chang, Jooyoung; Park, Sangmin; Jung, Kyomin title: Drug-Disease Graph: Predicting Adverse Drug Reaction Signals via Graph Neural Network with Clinical Data date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_48 sha: 393370b4ecc8a512cd673b64b3040d12ce30eb0a doc_id: 43197 cord_uid: mvvkkn3s

Adverse Drug Reaction (ADR) is a significant public health concern worldwide. Numerous graph-based methods have been applied to biomedical graphs for predicting ADRs in pre-marketing phases. ADR detection in post-market surveillance is no less important than pre-marketing assessment, and ADR detection with large-scale clinical data has attracted much attention in recent years. However, few studies consider graph structures from clinical data for detecting an ADR signal, which is a pair of a prescription and a diagnosis that might be a potential ADR. In this study, we develop a novel graph-based framework for ADR signal detection using healthcare claims data. We construct a Drug-disease graph with nodes representing the medical codes. The edges are given as the relationships between two codes, computed using the data. We apply a Graph Neural Network to predict ADR signals, using labels from the Side Effect Resource database. The model shows improved AUROC and AUPRC performance of 0.795 and 0.775, compared to other algorithms, showing that it successfully learns node representations expressive of those relationships. Furthermore, our model predicts ADR pairs that do not exist in the established ADR database, showing its capability to supplement the ADR database.

An adverse drug reaction (ADR) is considered to be one of the significant causes of morbidity and mortality, estimated to be the fourth to sixth highest cause of death in the United States [8].
Most ADR detection research has been aimed to predict ADRs in pre-marketing phases, using biomedical information sources such as chemical structures, protein targets, and therapeutic indications. Especially, studies using graph-structured data have demonstrated the superiority of modeling biomedical interactions as graphs. Nevertheless, capturing potential ADRs from the entire population in post-marketing phases is also essential to fully establish the ADR profiles [5] . The potential causal relationship between an adverse event and a drug is called a 'signal' when the relation is previously unknown or incompletely documented. Traditional ADR signal detection research in post-marketing phases mainly counts on a spontaneous and voluntary reporting system that collects spontaneous reports of suspected drug-related events, such as the WHO Uppsala Monitoring Center [1, 2, 4] . However, the spontaneous reporting system has inherent limitations such as underreporting [3, 16] , selective reporting [4] , and the lack of drug usage data. Therefore, recent studies have attempted algorithmic approaches to detect ADR signals on large clinical databases such as electronic health records (EHR) and healthcare claims data [5, 14] . Many of these studies apply basic machine learning techniques such as random forest, support vector machines, and neural networks. However, fewer studies are using graph-based approaches on the clinical databases in the field of post-marketing ADR signal detection. Due to the complex polypharmacy and multiple relations among drugs and diseases, we expect that graph structure can provide insights to potential ADRs, which may not otherwise be apparent using disconnected structures. In this study, we develop a novel graph-based framework for ADR signal detection using healthcare claims data to construct a Drug-disease graph. 
Specifically, we use the National Health Insurance Service-National Sample Cohort (NHIS-NSC), 12 years of healthcare claims data covering the medical histories of a population of one million [10]. The constructed graph is a heterogeneous graph with drug and disease nodes, as depicted in Fig. 1. The nodes represent the medicine prescription codes and disease diagnosis codes derived from the healthcare claims data. To represent the relations among these codes, we define edge weights using information from the data. For example, the l2-distance between two node embeddings, which are learned from the data, is used to define the drug-drug and disease-disease edge weights, and the conditional probability computed on the data is used for the drug-disease relationship. As Graph Neural Network (GNN) models have demonstrated their power on many tasks with graph-structured data, achieving state-of-the-art performance [7, 19], we use a GNN-based approach for ADR detection. We verify that GNNs can learn node representations that are indicative of various relations between drugs and diseases. Our model then predicts whether a drug node and a disease node have an ADR relation based on the learned node representations. To evaluate the proposed approach, we conduct experiments on a newly generated dataset labeled with the Side Effect Resource database (SIDER). The empirical results show that our proposed model outperforms alternative machine learning algorithms by a significant margin in terms of the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Furthermore, our method uncovers ADR candidates that, on examination, provide useful information to the medical community. Our model uses only simple data processing and well-established medical terminologies.
Therefore, our work does not demand case-by-case feature engineering that requires expertise. There have been numerous studies on ADR prediction in pre-marketing phases, attempting graph-based approaches on biomedical information sources [12, 15, 18, 22]. These studies predicted potential side effects of drug candidate molecules based on their chemical structures [15] and additional biological properties [12]. Although such studies may play important roles in preventing ADRs in pre-marketing phases, capturing potential ADRs in real-world use has also been considered very important. Spontaneous and voluntary reporting systems have been an important data source on real-world drug usage. Most traditional ADR signal detection research used voluntary reporting systems with disproportionality analysis (DA), which measures the disproportionality between the observed drug-adverse event pairs in the data and the null expectations [1, 2, 4]. Recently, large-scale clinical databases such as electronic health records (EHR) or healthcare claims data have gained popularity as an alternative or additional data source in ADR signal detection research. Many of these studies applied machine learning techniques such as support vector machines (SVM), random forests (RF), logistic regression (LR), and other statistical machine learning methods to model the decision boundary for detecting ADRs in post-marketing phases [5, 6, 11, 14, 21]. More recently, researchers have explored neural network-based models over clinical databases. Shang et al. [17] combined graph structure with a memory network to recommend personalized medication. The longitudinal electronic health records and drug-drug interaction information were embedded as separate graphs to be jointly considered for the recommendation. Other work on medication recommendation exists, but the architectures are limited to using only instance symptoms [20, 23] or patient history [9].
However, none of these studies explored graph neural network models for predicting ADRs in the post-marketing phase. In this section, we formulate our problem and describe how we apply graph structures to the task. We also present the process of training and prediction. The task is to predict the potential causal relationship between a given drug and disease pair, which represent a prescription code and a diagnosis code in clinical data. To consider the various relationships between drugs and diseases, we convert our clinical data into a novel graph structure that consists of drug and disease nodes. The node representations and the edge weights are derived from the clinical data, NHIS-NSC in this study. We first learn a node embedding that reflects the temporal proximity between homogeneous nodes, i.e., drug-drug and disease-disease node pairs. To model the proximity between two codes, we form drug and disease sequences from patients' records. After the drug-disease graph is constructed, we build a GNN model that predicts the signal of side effects between any pair of drug and disease. The side effect labels, which are taken from the SIDER database, are given to a subset of drug-disease pairs in graph $G$. We define the label function $l : V^{SIDER}_{drug} \times V^{SIDER}_{dis} \to \{0, 1\}$ as

$$ l(i, j) = \begin{cases} 1 & \text{if } (i, j) \in E^{SIDER} \\ 0 & \text{otherwise} \end{cases} $$

where $V^{SIDER}_{drug}$ and $V^{SIDER}_{dis}$ are the subsets of $V_{drug}$ and $V_{dis}$ registered in the SIDER database, respectively, and $E^{SIDER}$ is the set of drug-disease pairs that are known to have a side effect relation according to the SIDER database. Most large-scale clinical databases, including NHIS-NSC, are collected in the form of longitudinal visit records of patients. In this section, we explain how we process a patient's longitudinal records as sequential data and apply the Skip-gram model to learn the code embeddings.
In the patient's longitudinal records, each patient can be treated as a sequence of hospital visits $\{v^{(n)}_1, v^{(n)}_2, \ldots, v^{(n)}_{T_n}\}$, where $n$ indexes the patient in the data and $T_n$ is the total number of visits of the patient. The $i$-th visit can be denoted as $v^{(n)}_i = (P^{(n)}_i, D^{(n)}_i)$, where $P^{(n)}_i$ is the set of prescribed codes and $D^{(n)}_i$ is the set of diagnosed codes in the $i$-th visit. Within a set of codes, codes are listed in arbitrary order. The size of each set is variable since the number of prescribed/diagnosed codes varies from visit to visit. With these sets of codes, we form a drug sequence $Seq_{drug}^{(n)}$ and a disease sequence $Seq_{disease}^{(n)}$ of the $n$-th patient by listing the codes in temporal order, as described below (here, we leave out the symbol $n$):

$$ Seq_{drug} = (p_1, p_2, \ldots, p_{T_p}), \qquad Seq_{disease} = (d_1, d_2, \ldots, d_{T_d}) $$

where $p_x \in \mathbb{R}^{V_p}$ and $d_y \in \mathbb{R}^{V_d}$ are the one-hot vectors representing each of the medical codes in the sequences. $V_p$ and $V_d$ are the vocabulary sizes of the prescription and diagnosis codes within the data, respectively. $T_p$ and $T_d$ represent the total number of prescription/diagnosis codes in the patient's record. In this way, we can build a corpus consisting of $Seq_{drug}$ or $Seq_{disease}$, on which proximity-based code embedding learning can be implemented. We use the Skip-gram [13] model to learn the latent representation of medical codes in our data, in a way that captures the temporal proximity between them. With $Seq_{drug}$ or $Seq_{disease}$, we use a context window size of 16, meaning 16 codes behind and 16 codes ahead, and apply Skip-gram learning with the negative sampling scheme. As a result, we project diagnosis codes and prescription codes into separate lower-dimensional spaces, where codes that occur in close temporal proximity are embedded close to one another. The trained Skip-gram vectors are then used as the proximity-based code embeddings. Here, we describe how we construct our unique Drug-disease graph from NHIS-NSC. In Definition 2, we explain the concept of the Drug-disease graph.
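The sequence construction and context-window step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy codes, and the choice to sort codes inside a visit (the source only says the order is arbitrary) are assumptions, and the resulting (center, context) pairs would still have to be fed to a Skip-gram trainer with negative sampling.

```python
from typing import List, Set, Tuple

# A visit is a pair (prescribed codes, diagnosed codes); the codes are toy values.
Visit = Tuple[Set[str], Set[str]]


def build_sequences(visits: List[Visit]) -> Tuple[List[str], List[str]]:
    """Flatten one patient's time-ordered visits into drug and disease sequences."""
    seq_drug: List[str] = []
    seq_disease: List[str] = []
    for prescribed, diagnosed in visits:
        # Order inside a visit is arbitrary in the source; sorting here only
        # makes this sketch deterministic (an assumption).
        seq_drug.extend(sorted(prescribed))
        seq_disease.extend(sorted(diagnosed))
    return seq_drug, seq_disease


def skipgram_pairs(seq: List[str], window: int = 16) -> List[Tuple[str, str]]:
    """Return (center, context) pairs within `window` codes behind and ahead."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(seq):
        start = max(0, i - window)
        stop = min(len(seq), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, seq[j]))
    return pairs


# Toy patient with two visits.
visits = [({"A10BA02"}, {"E11"}), ({"C09AA02", "A10BA02"}, {"I10", "E11"})]
seq_drug, seq_disease = build_sequences(visits)
pairs = skipgram_pairs(seq_drug, window=1)
```

A window of 1 is used in the toy call only to keep the output small; the paper's setting corresponds to `window=16`.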
Then, we explain the node representations and edge connections. We construct a single heterogeneous graph $G = (V, E)$ consisting of drug and disease nodes, where $V = V_{drug} \cup V_{dis}$ is the union of drug and disease nodes, and $E = E_{drug} \cup E_{dis} \cup E_{inter}$ is the union of homogeneous edges $E_{drug}$ and $E_{dis}$ (i.e., connecting nodes of the same type) and heterogeneous edges $E_{inter}$ (i.e., connecting nodes of different types). To represent $v_{drug} \in V_{drug}$ and $v_{dis} \in V_{dis}$, we jointly use a proximity-based node representation along with a category-based node representation. The proximity-based node representation is obtained from the initial Skip-gram code embedding as in Sect. 3.2. We denote a proximity-based drug node as $v^{prox}_{drug}$ and a disease node as $v^{prox}_{dis}$. The category-based node representation is designed to represent the categorical information of medical codes. We utilize the hierarchical structure of categorical codes (i.e., ATC and ICD-10 codes) by adopting the one-hot vector format. Since there are multiple categories for each code, the category-based node representation is a concatenation of one-hot vectors, and thus a multi-hot vector. Finally, the initial node representations of the Drug-disease graph are the concatenation of the proximity-based node embeddings and the category-based node embeddings. The drug and disease node representations are defined as follows:

$$ v^{cat}_{drug} = v^{cat,1}_{drug} \,\|\, v^{cat,2}_{drug} \,\|\, \cdots \,\|\, v^{cat,5}_{drug}, \qquad v_{drug} = v^{prox}_{drug} \,\|\, v^{cat}_{drug} $$

$$ v^{cat}_{dis} = v^{cat,1}_{dis} \,\|\, v^{cat,2}_{dis}, \qquad v_{dis} = v^{prox}_{dis} \,\|\, v^{cat}_{dis} $$

where $v^{cat}_{drug}$ is a category-based drug node, $v^{cat}_{dis}$ is a category-based disease node, $v_{drug}$ is an initial drug node, $v_{dis}$ is an initial disease node, and $\|$ is the vector concatenation operator. Each $v^{cat,i}_{drug}$ represents one level in the ATC code structure, and $v^{cat}_{drug} \in \mathbb{R}^{104}$. Because the ATC code structure has 5 levels, a drug node vector is the concatenation of 5 one-hot vectors. Similarly, each $v^{cat,i}_{dis}$ represents one of the first two levels in the ICD-10 code structure, and $v^{cat}_{dis} \in \mathbb{R}^{126}$.
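The category-based encoding for a drug node can be illustrated with a small sketch. It assumes that each ATC level is encoded by its own symbol (for example `A`, `10`, `B`, `A`, `02` for `A10BA02`) and uses tiny hypothetical per-level vocabularies; the paper does not state how its 104 dimensions are divided among the five levels, so the sizes below are illustrative only.

```python
from typing import Dict, List, Sequence


def atc_levels(atc: str) -> List[str]:
    """Split a 7-character ATC code into its 5 hierarchy levels."""
    return [atc[:1], atc[1:3], atc[3:4], atc[4:5], atc[5:7]]


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def category_vector(level_codes: Sequence[str],
                    level_vocabs: Sequence[Dict[str, int]]) -> List[float]:
    """Concatenate one one-hot vector per level into a multi-hot vector."""
    vec: List[float] = []
    for code, vocab in zip(level_codes, level_vocabs):
        vec.extend(one_hot(vocab[code], len(vocab)))
    return vec


# Hypothetical toy vocabularies, one per ATC level (not the paper's).
vocabs = [
    {"A": 0, "C": 1},
    {"09": 0, "10": 1},
    {"A": 0, "B": 1},
    {"A": 0},
    {"02": 0},
]

cat = category_vector(atc_levels("A10BA02"), vocabs)
prox = [0.12, -0.40]   # stand-in for a trained Skip-gram embedding
initial = prox + cat   # concatenation: proximity-based || category-based
```

The same helper covers disease nodes if it is given the first two ICD-10 levels and their vocabularies instead.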
We only use two classification levels of the ICD-10 code structure; therefore, the disease node vector is the concatenation of 2 one-hot vectors. For homogeneous edges such as $E_{drug}$ and $E_{dis}$, we view the relationship between homogeneous nodes as the temporal proximity of two entities, meaning that two nodes are likely to be close together in the records. Therefore, using the proximity-based node embeddings, we compute the l2-distance between two node embeddings to estimate the temporal proximity. Heterogeneous edges, which connect drug nodes and disease nodes, are weighted by the conditional probability of a drug prescription given the diagnosed disease. The definitions of the two types of edges are given as follows. For any nodes $i, j \in V_{drug}$ (or $V_{dis}$), the edge weight $w_{ij}$ between the two nodes is defined using a Gaussian weighting function:

$$ w_{ij} = \begin{cases} \exp\left(-\dfrac{\|v_i - v_j\|^2}{2\theta^2}\right) & \text{if } \|v_i - v_j\| \le \text{threshold} \\ 0 & \text{otherwise} \end{cases} $$

for some parameters threshold and $\theta$, where $v_i$ and $v_j$ are the proximity-based node embeddings of nodes $i$ and $j$. Later, we additionally use edge-forming thresholds to control the sparsity of the graph. For any drug node $i \in V_{drug}$ and any disease node $j \in V_{dis}$, the edge weight $w_{ij}$ between the two nodes is given as:

$$ w_{ij} = \frac{n_{ij}}{n_j} $$

where $n_{ij}$ is the number of patient histories in the NHIS-NSC database that are recorded with a diagnosis $j$ and a prescription $i$ together, and $n_j$ is the number of patient histories with the diagnosis $j$. We aggregate the neighborhood information of each drug/disease node from the constructed graph using the Graph Neural Network (GNN) framework. In each GNN layer, the weighted sum of neighboring node features from the previous layer is computed and, after applying a ReLU nonlinearity $\sigma$, serves as the new node features:

$$ z^{(l+1)}_i = \sigma\left( \sum_{j \in N(i)} \alpha^{(l)}_{ij} W z^{(l)}_j \right) $$

where $N(i)$ denotes the set of neighbors of the $i$-th node, $z^{(l)}_i$ denotes the feature vector of the $i$-th node at the $l$-th layer, $W$ denotes a learnable weight matrix, and $\alpha^{(l)}_{ij}$ denotes the normalized edge weight between the $i$-th and $j$-th nodes at the $l$-th layer.
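The two edge-weight definitions and one propagation step can be sketched in plain Python as below. The sketch assumes the common thresholded Gaussian kernel exp(-d^2 / (2θ^2)) for homogeneous edges; the function names, the toy numbers, and the dictionary layout for the normalized weights are illustrative choices rather than the authors' implementation, and the normalized weights `alpha` are taken as given.

```python
import math
from typing import Dict, List, Sequence


def gaussian_weight(v_i: Sequence[float], v_j: Sequence[float],
                    theta: float, threshold: float) -> float:
    """Homogeneous edge weight: thresholded Gaussian of the l2-distance."""
    dist = math.dist(v_i, v_j)
    if dist > threshold:
        return 0.0  # no edge when the embeddings are too far apart
    return math.exp(-dist ** 2 / (2.0 * theta ** 2))


def conditional_weight(n_ij: int, n_j: int) -> float:
    """Heterogeneous edge weight: P(prescription i | diagnosis j) = n_ij / n_j."""
    return n_ij / n_j if n_j > 0 else 0.0


def matvec(W: List[List[float]], x: Sequence[float]) -> List[float]:
    """Multiply matrix W (rows = output dims) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gnn_layer(alpha: Dict[int, Dict[int, float]],
              z: List[List[float]],
              W: List[List[float]]) -> List[List[float]]:
    """One layer: z_i' = ReLU(sum over neighbors j of alpha_ij * W z_j)."""
    out: List[List[float]] = []
    for i in range(len(z)):
        agg = [0.0] * len(W)
        for j, a_ij in alpha.get(i, {}).items():
            wz = matvec(W, z[j])
            agg = [s + a_ij * t for s, t in zip(agg, wz)]
        out.append([max(0.0, s) for s in agg])  # ReLU
    return out


w_close = gaussian_weight([0.0, 0.0], [3.0, 4.0], theta=5.0, threshold=10.0)
w_far = gaussian_weight([0.0, 0.0], [3.0, 4.0], theta=5.0, threshold=4.0)
w_inter = conditional_weight(n_ij=30, n_j=120)
z_next = gnn_layer({0: {1: 0.5}, 1: {0: 1.0}},
                   [[1.0, 0.0], [0.0, 2.0]],
                   [[1.0, 0.0], [0.0, 1.0]])
```

Raising or lowering `threshold` is what makes the graph denser or sparser, which is how the edge-forming threshold controls sparsity.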
In the first layer, the initial drug/disease node representations are each passed through a nonlinear projection function to match their dimensions. We use two weighting schemes for $\alpha^{(l)}_{ij}$. The first variant follows the definition in [7], and the weight is defined as follows:

$$ \alpha_{ij} = \frac{w_{ij}}{\sqrt{d_i d_j}} $$

where $d_i$ and $d_j$ are the degrees of nodes $i$ and $j$, respectively, and $w_{ij}$ are the edge weights defined in Sect. 3.3. These weights are fixed for all layers. The second scheme instead learns the weights using an attention mechanism [19] as follows:

$$ \alpha^{(l)}_{ij} = \frac{\exp\left(g(z^{(l)}_i, z^{(l)}_j)\right)}{\sum_{k \in N(i)} \exp\left(g(z^{(l)}_i, z^{(l)}_k)\right)} $$

where $g$ is a single fully-connected layer with LeakyReLU nonlinearity that takes a pair of node features as input. In the rest of this paper, we call the network with the first weighting scheme GCN and the network with the second scheme GAT. We predict the ADR signal of a drug-disease pair from the embeddings learned by the GNN model, using a single bilinear layer as follows:

$$ \hat{y}_{ij} = \mathrm{sigmoid}\left( z_i^{\top} W_p z_j + b \right) $$

where $W_p$ and $b$ are the learnable weights, and $z_i$ and $z_j$ are the node features of drug node $i$ and disease node $j$ at the last GNN layer. The whole model is trained by minimizing the cross-entropy loss. As we get the labels from the SIDER database and the edge weights from the NHIS-NSC database, we retrieve the drug and disease nodes over the joint set of the two databases. The resulting dataset is composed of 607 drugs and 556 diseases, and the number of positive samples, indicating drug-side effect relationships, is 28,746 pairs. A negative sample is defined as a combination of a drug and a disease from the dataset that is not among the 28,746 known positive samples. We randomly select negative samples so that the number of negative samples equals the number of positive samples. Since the drugs and diseases in those combinations are taken from the SIDER database, it is plausible to believe that the pairs have not been reported as ADRs.
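The balanced negative sampling just described can be sketched as follows. The function name, the fixed seed, and the toy drug and disease identifiers are assumptions made for illustration; only the rule itself (draw unlabeled drug-disease combinations, as many as there are positives) comes from the text.

```python
import random
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def sample_negatives(drugs: Sequence[str], diseases: Sequence[str],
                     positives: Set[Pair], seed: int = 0) -> List[Pair]:
    """Draw as many non-positive drug-disease pairs as there are positives."""
    # All drug-disease combinations that are not known positives.
    candidates = [(d, s) for d in drugs for s in diseases
                  if (d, s) not in positives]
    if len(candidates) < len(positives):
        raise ValueError("not enough unlabeled pairs to balance the classes")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, len(positives))


# Toy identifiers, not real codes.
drugs = ["d1", "d2", "d3"]
diseases = ["s1", "s2"]
positives = {("d1", "s1"), ("d2", "s2")}
negatives = sample_negatives(drugs, diseases, positives, seed=0)
```

With 607 drugs and 556 diseases there are 337,492 combinations, so enumerating the candidates explicitly, as done here, remains cheap.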
Although the labels are only given to the drug-disease pairs over the joint set of the two databases, we make use of all the drugs and diseases in NHIS-NSC as graph nodes to utilize the relations among the drugs and diseases. To predict the link between drugs and diseases, we split drug-disease pairs from the ADR dataset into training, validation, and test sets, ensuring that the classes of diseases included in each set do not overlap. The reason we split the data without overlapping disease classes is to increase the usability of the ADR signal detection model. It is also because only a few classes of diseases exist in our dataset, and therefore there could be data leakage if the same disease class existed in both training and validation. The class of a disease means the classification up to the third digit of its ICD-10 code. Note that we make the inference very difficult by not letting the model know which classes of diseases are linked with drugs as ADRs. We use 80% of the data for training, 10% for validation, and the remaining 10% for testing. To control the sparsity of a graph, we build two types of graphs where the edge-forming threshold is either low or high. When the edge-forming threshold is low, the graph has more edges and, as a result, more information. We examine whether it is beneficial or detrimental to have more edge information. We distinguish the two graphs by setting the thresholds for $E_{drug}$ differently. The summary statistics of the constructed graphs and datasets are provided in Table 1. To verify the performance of the GCN-based approach, we compare GCN-based models with non-graph-based ML techniques. We apply vanilla GCN and its variants to examine the effect of considering the edge types. The following models are used for graph embedding learning. All the neural-network-based models use two layers with a hidden dimension of 300.
- LR is a logistic regression (LR) approach with information on the graph topology. The vector composed of the initial node representations of the node itself and its neighbor nodes is input to the LR model. The number of neighbors is limited to 10.
- NN is a 2-layer fully-connected neural network that is based solely on the initial node representations.
- GCN$_{low}$ is a graph convolutional network, a representative GNN model that uses graph convolutions [7].
- GAT$_{low}$ is a GNN that applies the attention mechanism to the node embeddings. Here we use GAT with two layers, where the numbers of heads are (4, 4) for the two layers.
- adrGCN$_{low}$ is an adapted version of GCN that uses separate GCN layers according to the edge types and then aggregates them.
- GCN$_{high}$, GAT$_{high}$, and adrGCN$_{high}$ are the graph-based models applied to the sparser graph, i.e., the graph built with the high edge-forming threshold.
As shown in Table 2, the proposed graph-based approaches surpass all the non-graph-based approaches. The best AUROC performance is achieved when GCN is applied with the low edge-forming threshold. The results show that the GCN model efficiently leverages the information from sufficiently selected edges. To see the robustness of the proposed method, we also examine whether our model works well for infrequent drug-ADR pairs, which are labeled as 'rare' or 'post-marketing' in SIDER. The best average test accuracy on infrequent drug-ADR pairs is achieved with adrGCN$_{high}$ (0.746), demonstrating that using multiple GCNs according to the edge types is useful for detecting rare symptoms. According to the SIDER database, the ADRs with 'rare' or 'post-marketing' labels are reported with frequencies under 0.01. To verify the power of the graph-based approach to discover ADR candidates that are unseen in the dataset, we extract the drug-disease pairs that are predicted to be positive with high probability (over 0.97) but labeled as negative (false positives).
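The candidate extraction can be sketched as a simple filter over predicted probabilities. The sketch also drops pairs that a relation-free baseline classifier flags as positive, so that what remains is attributable to the graph. The 0.97 cutoff is taken from the text; the function name, the 0.5 cutoff for the baseline, and the toy pair names and probabilities are assumptions and do not reproduce any reported result.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def adr_candidates(gnn_prob: Dict[Pair, float],
                   baseline_prob: Dict[Pair, float],
                   positives: Set[Pair],
                   high: float = 0.97,
                   baseline_cut: float = 0.5) -> List[Pair]:
    """Return confident false positives of the graph model, minus baseline hits."""
    kept: List[Pair] = []
    for pair, p in gnn_prob.items():
        if pair in positives:
            continue  # already a known ADR, not a new candidate
        if p <= high:
            continue  # the graph model is not confident enough
        if baseline_prob.get(pair, 0.0) >= baseline_cut:
            continue  # the baseline finds it too, so no graph-specific signal
        kept.append(pair)
    # Most confident candidates first.
    return sorted(kept, key=lambda pr: -gnn_prob[pr])


# Toy probabilities for illustration only.
gnn_prob = {("drug_a", "dis_x"): 0.99, ("drug_a", "dis_y"): 0.98,
            ("drug_b", "dis_x"): 0.99, ("drug_b", "dis_y"): 0.60}
baseline_prob = {("drug_a", "dis_x"): 0.20, ("drug_a", "dis_y"): 0.80,
                 ("drug_b", "dis_x"): 0.90, ("drug_b", "dis_y"): 0.10}
known = {("drug_b", "dis_x")}
candidates = adr_candidates(gnn_prob, baseline_prob, known)
```

In practice the surviving pairs would still need review by clinical experts before being treated as ADR signals.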
To demonstrate the genuine power of graph-based methods, we exclude the candidates that are also positively predicted by the baseline neural network, which does not use relational information. As a result, clinical experts (M.D.) confirm that there exist pairs that are clearly considered to be real ADRs. The pairs are listed in Table 3. Many of the discovered pairs, including umbrella terms like edema, are symptoms and signs rather than diseases. This can be explained by the fact that the SIDER database is not comprehensive enough to cover all the specific symptoms that can be induced by taking medicine. In particular, cardiac murmur and abnormal reflex are frequent symptoms, yet it is reasonable to say that the suggested pairs are ADRs. For example, Dasatinib is used to treat leukemia and can have significant cardiotoxicity, which can lead to cardiac murmurs. Hydroxycarbamide is a cytotoxic drug used for certain types of cancer, and it is known that cytotoxic medications can cause electrolyte imbalance leading to abnormal reflex. There are also significant pairs such as alendronic acid and tetany in the third row. Severe and transient hypocalcemia is a well-known side effect of bisphosphonates, which can lead to symptoms of tetany. Alendronic acid is classified as a bisphosphonate, and therefore tetany can be described as an ADR of alendronic acid. Ibandronic acid and etidronic acid in the last two rows are also bisphosphonates, and the paired symptoms are relevant to the usage of bisphosphonates. Unspecified edema may signify bone marrow edema caused by bisphosphonate use, and electrolyte imbalance, which can lead to abnormal reflex, can be caused by etidronic acid use. All these explanations show that the ADR pairs we extract are based on various relations among drugs and diseases. In this study, we propose a novel graph-based approach for ADR detection by constructing a graph from large-scale healthcare claims data.
Our model can capture various relations among drugs and diseases, showing improved performance in predicting drug-ADR pairs. Furthermore, our model even predicts drug-ADR pairs that do not exist in the established ADR database, showing that it is capable of supplementing the ADR database. The explanation by clinical experts verifies that the graph-based method is valid for ADR detection. In this study, we only make inferences within the labeled dataset, yet we plan to make inferences on unlabeled data to discover unknown ADR pairs, which will be a huge breakthrough in ADR detection.

[1] A data mining approach for signal detection and analysis
[2] Decision support methods for the detection of adverse events in post-marketing data
[3] Under-reporting of adverse drug reactions
[4] Time-to-signal comparison for drug safety data-mining algorithms vs. traditional signaling criteria
[5] Machine learning model combining features from algorithms with different analytical methodologies to detect laboratory-event-related adverse drug reaction signals
[6] Predicting adverse drug events by analyzing electronic patient records
[7] Semi-supervised classification with graph convolutional networks
[8] Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies
[9] Dual memory neural computer for asynchronous two-view sequential learning
[10] Cohort profile: the National Health Insurance Service-National Sample Cohort (NHIS-NSC), South Korea
[11] Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records
[12] Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs
[13] Distributed representations of words and phrases and their compositionality
[14] A novel algorithm for detection of adverse drug reaction signals using a hospital electronic medical record database
[15] Predicting drug side-effect profiles: a chemical fragment-based approach
[16] Utilizing social media data for pharmacovigilance: a review
[17] GAMENet: graph augmented memory networks for recommending medication combination
[18] Network embedding in biomedical data science
[19] Graph attention networks
[20] Safe medicine recommendation via medical knowledge graph embedding
[21] Detection of adverse drug reaction signals using an electronic health records database: comparison of the laboratory extreme abnormality ratio (CLEAR) algorithm
[22] Graph embedding on biomedical networks: methods, applications, and evaluations
[23] LEAP: learning to prescribe effective and safe treatment combinations for multimorbidity