key: cord-1045126-dz0t4ujj authors: Su, Xiaorui; You, Zhuhong; Wang, Lei; Hu, Lun; Wong, Leon; Ji, Boya; Zhao, Bowei title: SANE: A sequence combined attentive network embedding model for COVID-19 drug repositioning date: 2021-08-23 journal: Appl Soft Comput DOI: 10.1016/j.asoc.2021.107831 sha: 49a527516fb67d9f2a79a8e521597241d43400d2 doc_id: 1045126 cord_uid: dz0t4ujj

COVID-19 has spread all over the world and places a huge burden on public health and the world economy. Drug repositioning has become a promising treatment strategy in the COVID-19 crisis because it can shorten the drug development process, reduce pharmaceutical costs and reuse approved drugs. Existing computational methods focus on a single kind of information, such as drug and virus similarity or drug-virus network features, which is not sufficient to predict potential drugs. In this paper, a sequence combined attentive network embedding model, SANE, is proposed for identifying drugs based on both sequence features and network features. On the one hand, drug SMILES and virus sequence features are extracted by an encoder-decoder in SANE and used as the initial node embeddings in the drug-virus network. On the other hand, SANE obtains a field for each node by attention-based Depth-First-Search (DFS), which reduces noise and improves the efficiency of representation learning, and adopts a bottom-up aggregation strategy to learn each node's network representation from the selected fields. Finally, a feed-forward neural network is used for classification. Experimental results show that SANE achieves 81.98% accuracy and an AUC of 0.8961 and outperforms state-of-the-art baselines. A further case study on COVID-19 indicates that SANE has strong predictive ability, since 25 of the top 40 (62.5%) drugs are verified by an existing dataset and the literature. Therefore, SANE is a powerful tool for repositioning drugs for COVID-19 and provides a new perspective for drug repositioning.
Coronaviruses are classified in the genus Coronavirus of the family Coronaviridae, order Nidovirales [1]. They are a group of enveloped, positive-strand RNA viruses [2] with non-segmented genomes of about 30,000 nucleotides and a diameter of about 80-120 nm [3]. Coronaviruses are widespread in nature [4] and prone to mutation, but they infect only vertebrates. Coronaviruses have drawn wide concern since 2000 because they have caused three serious outbreaks to date: Severe Acute Respiratory Syndrome (SARS-CoV), Middle East Respiratory Syndrome (MERS-CoV) in 2012 [6], and the epidemic we are currently experiencing, COVID-19 (SARS-CoV-2), in 2019 [7]. To date, the COVID-19 outbreak has spread to 6 continents and more than 60 countries [8], causing more than 112 million infections, 2.5 million deaths [9, 10] and economic losses of over 300 billion dollars [11, 12]. Developing or discovering effective drugs is therefore essential for protecting public health from COVID-19 and for reigniting the global economy. However, traditional drug development is time-consuming, costly and risky, which makes it unrealistic to develop new effective drugs against COVID-19 in a short time. It is therefore urgent to adopt a new method that accelerates the drug discovery process and finds effective drugs for COVID-19 treatment. Drug repositioning [13], an effective method for finding new uses of existing drugs, has received much attention, and the number of studies has grown exponentially [14] in recent years. According to the records [15], there are in total 94 cases in which a repositioned drug made it to the market, such as Aspirin [16], Thalidomide [17] and Sildenafil [18]. More importantly, in addition to its advantages in time and cost, drug repositioning also has advantages in avoiding adverse drug effects and in economy [19].
Therefore, drug repositioning is a promising strategy for accelerating drug identification for COVID-19 and minimizing the translational gap in drug development. In the early development of drug repositioning, biological methods such as molecular docking accounted for the majority of work. Molecular docking can directly determine drug targets, but it usually requires experiments on all drugs, which is inefficient. In recent years, benefiting from the rapid development of machine learning and artificial intelligence, computational methods have gradually been proposed and widely applied in bioinformatics. According to the techniques they use, these computational methods can be grouped into two classes: similarity-based and network-based. Similarity-based methods focus on drug similarity and protein similarity, and recommend drugs according to these similarities. Although similarity-based methods make comprehensive use of the basic information of drugs and other molecules, they capture information or similarities only in 2D space and are limited on high-dimensional network or graph-structured data. With the considerable advances in social networks and knowledge graphs, network representation learning has become a significant research tool in many fields. For the drug repositioning task, a host of network-based models have been proposed in the last several years. Network-based methods, including Random Walk, graph neural networks and knowledge graph embedding models, are employed to learn network topology features and repurpose potential drugs for diseases. Compared with similarity-based methods, network-based methods have a stronger learning ability and can capture complex information in high-dimensional space. However, network-based methods are still limited in computational efficiency and lack basic node information.
To address the above limitations, in this paper we propose a sequence combined attentive network embedding model, SANE, to repurpose drugs for COVID-19 by integrating drug SMILES and virus sequence information into an attention-based pre-search network embedding. First, we collected drug SMILES and virus sequences as the basic information of drugs and viruses, together with HDVD, a valuable drug-virus interaction dataset. Second, an encoder-decoder is adopted to extract sequence features. Then, to improve the efficiency of SANE, an attention-based Depth-First-Search (DFS) is applied to decrease the network scale and reduce noise. After that, an attentive network embedding is used to learn a representation for each node by aggregating basic information. In this way, the final representation contains both basic information and network topology information. In summary, our main contributions are as follows: 1) we address the drug repositioning task from a network perspective and design an efficient attention-based DFS network embedding to identify potential drugs against COVID-19; 2) we integrate basic drug and virus sequence information into the network embedding to enhance information granularity, which also contributes to improved accuracy; 3) we use attention-based DFS to reduce redundancy and noise before representation learning; 4) we test the proposed model SANE, and the results show that SANE achieves new state-of-the-art results with significant improvements over the baselines. The article is organized as follows. In the next section, we introduce related work and the motivation of our work. In Section 3, we introduce the datasets used in our work and elaborate on the three sub-models contained in SANE. The experimental settings and detailed experimental results are provided in Section 4. We discuss the results obtained by SANE in Section 5, and Section 6 concludes this article.
Our research is inspired by two lines of work: the inherent limitations of traditional similarity-based methods and the remarkable success of network representation learning algorithms in predicting interactions in molecular networks. According to the resources they use, similarity-based models can be divided into two groups: single-source similarity and multi-source similarity. Models using single-source similarity are simple and understandable. For example, Li et al. [20] calculated pairwise drug similarity from drug structural information and repurposed drugs according to that similarity. Afterwards, with the development of bioinformatics, more interaction data related to drugs became available, such as drug-target interactions, drug-disease interactions, protein sequences and disease structural information. As a result, similarity was then calculated from multiple sources. Zhang et al. [21] integrated drug chemical structures, drug target proteins and drug-disease associations to extract similarity matrices of drugs and diseases, respectively, and predicted potential drugs by collaborative filtering. Moreover, Azad et al. [22] addressed drug repositioning by compiling heterogeneous information for an exhaustive set of small-molecule drugs and integrating multiple sources to calculate drug similarity. Similarity-based methods are currently also used to repurpose drugs for COVID-19, and a large number of studies [23, 24] have been carried out on COVID-19. Meng et al. [25] proposed a similarity constrained matrix factorization model to identify new drug-virus interactions by calculating drug SMILES (Simplified Molecular Input Line Entry System) similarity and virus sequence similarity.
Although similarity-based methods have been widely used and achieve good performance, they are still limited to 2D space because similarity is calculated between two molecules, which leads to a lack of global perspective. In contrast, network-based methods address drug repositioning from a global perspective. These methods concentrate on learning network topology features with network representation learning algorithms. For example, Luo et al. [26] proposed an efficient approach based on Random Walk that captures global drug information to prioritize candidate drugs for diseases, and the approach outperformed similarity-based methods. Multiple sources are also used in network-based methods. In 2017, Luo et al. [27] integrated drug, protein, disease and side-effect information to construct a heterogeneous network for predicting drug-target interactions and repurposing existing drugs, and experimental validation showed that the repositioned drug they found was able to prevent inflammatory disease. Network-based methods are also used to repurpose drugs for COVID-19: Zeng et al. [28] integrated multiple sources to build a comprehensive knowledge graph to discover drugs for COVID-19.

In this study, we collected the recently constructed human drug-virus interaction network (HDVD) [25] as the training dataset to measure model performance. HDVD assembles a significant number of experimentally validated drug-virus interaction entries extracted from the literature by text mining. The statistics of HDVD are shown in Table 1, and all the interactions in HDVD are supported by experiments. In addition, we collected basic information on the drugs and viruses contained in HDVD. Drug SMILES is one of the most popular 1D representations of molecular structure. Thus, we downloaded the drug SMILES from DrugBank (V5.1.7) [29]. Generally, a virus is represented by its RNA or DNA sequence.
As a result, the genome nucleotide sequences of the viruses were downloaded from the National Center for Biotechnology Information (NCBI). In the data collection stage, we collected drug SMILES and virus genome nucleotide sequences. Both are among the most representative molecular representations and have been widely used to extract features of drugs and viruses, respectively. Different from previous work, we adopted an encoder-decoder to extract sequence information. In fact, the encoder-decoder can be viewed as a variant of the Recurrent Neural Network (RNN). Long Short-Term Memory (LSTM), one of the most widely used basic units in RNNs, is also adopted in this work. We used a Bi-LSTM in the encoder layer and an LSTM in the decoder layer. The advantage of the encoder-decoder structure in this work is that it can accept inputs of unequal length and process long sequences reasonably. We denote the input as $x$, which represents the drug SMILES or the virus sequence. First, an embedding layer transforms the variable-length sequence into a real-valued embedding $e$ that the network can process. Second, the embedding is sent into the encoder layer to obtain the encoder output $o_{enc}$, the hidden state $h$ and the cell state $c$. Third, the encoder output is sent into the decoder layer as its input, and the decoder layer is initialized with the hidden state and cell state. The final output $y$ is then obtained from the decoder layer. Specifically, both the encoder layer and the decoder layer are stacks of LSTM units as shown in Fig. 2, and a single LSTM unit is shown in Fig. 3. Therefore, the encoder-decoder process can be formulated by:

$$e = \mathrm{Embedding}(x), \qquad (o_{enc}, h, c) = \mathrm{Encoder}(e), \qquad y = \mathrm{Decoder}(o_{enc}, h, c)$$

Each LSTM unit mainly completes the following tasks: receive the current data, transmit previous information, send information to the next unit and update the current cell state. Unlike the traditional RNN, LSTM controls the inflow and outflow of information with three gates: the input gate, the forget gate and the output gate.
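To make the role of the three gates concrete, the following is a minimal sketch of a single LSTM step in plain Python. It assumes the standard textbook LSTM formulation; the function names (`lstm_step`, `init_params`, `encode`), the parameter layout and the random initialization are illustrative assumptions and are not taken from the SANE implementation, which also uses a bidirectional encoder and a decoder that are omitted here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM unit; returns (h_t, c_t)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]      # input gate
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]      # forget gate
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]      # output gate
    c_hat = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate state
    # The cell state keeps part of the old state and admits part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "c"):
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
            for _ in range(hidden_dim)
        ]
        params["b_" + gate] = [0.0] * hidden_dim
    return params


def encode(sequence, params, hidden_dim):
    """Run the unit over a sequence of any length; returns the final (h, c)."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h, c


# Usage: two one-hot encoded toy sequences of different lengths share one encoder.
params = init_params(input_dim=3, hidden_dim=4)
short_h, _ = encode([[1, 0, 0]], params, 4)
long_h, _ = encode([[1, 0, 0], [0, 1, 0], [0, 0, 1]], params, 4)
```

Because the same step is applied position by position, sequences of unequal length yield fixed-size states, which is the property the encoder-decoder relies on.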
In particular, the input gate is responsible for processing the input at the current sequence position. Suppose the current time is $t$, and denote the output of the previous time step as $h_{t-1}$ and the input at the current time as $x_t$. The input gate, shown by the pink lines in Fig. 3, can be formulated by:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

Here, $i_t$ can be viewed as the probability that controls how much information is passed in, and $\tilde{C}_t$ represents the information that the current unit receives. The second gate, the forget gate, shown by the cyan lines in Fig. 3, is responsible for processing the hidden state from the previous time step. The forget gate can be formulated as follows and represents how much previous hidden information is transmitted into the current unit:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

The third gate, the output gate, controls the output of the current unit; it is denoted $o_t$ and shown by the green lines in Fig. 3:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

During this process, another quantity enters the calculation, called the cell state $C$. The cell state is updated by the input gate and the forget gate. Thus, the output of the current unit can be represented by:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \qquad h_t = o_t \odot \tanh(C_t)$$

In our work, the encoder-decoder aims to learn sequence features. As a result, the encoder output is the concatenation of the forward LSTM representation and the reverse LSTM representation.

Previous network representation learning models can be grouped into three groups (MF-based, RW-based and NN-based) according to their learning strategy. Unlike these existing models, the proposed model uses a preprocessing step for accurate learning, called attention-based DFS, which is shown in Fig. 4. It has the advantage of reducing noise and redundancy and improving the accuracy of representation learning. In detail, there are two hyperparameters in this process: search depth and sampling number. Search depth is the maximum number of layers, and sampling number controls how many neighbors are selected. At each depth, we select the top N neighbor nodes of the current node by attention weight, where N is equal to the sampling number.
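The selection procedure described above can be sketched as a depth-limited recursive search that keeps only the highest-weighted neighbors at each layer. This is a minimal illustration under stated assumptions: the attention weights are taken as already computed and stored per edge, nodes already on the current path are skipped to avoid cycles (an assumption, not something the text specifies), and the function name `attentive_dfs` and the toy graph are hypothetical.

```python
def attentive_dfs(graph, attention, node, search_depth, sampling_number, path=()):
    """Return the field of `node` as a nested dict {neighbor: field of neighbor}.

    graph:     adjacency lists, e.g. {"d1": ["v1", "v2"], ...}
    attention: precomputed weight for each directed edge, keyed by (head, neighbor)
    """
    if search_depth == 0:
        return {}
    path = path + (node,)
    # Assumption: never step back onto a node that is already on the current path.
    candidates = [n for n in graph.get(node, ()) if n not in path]
    # Keep the top-N neighbors by attention weight, N = sampling_number.
    top = sorted(candidates, key=lambda n: attention[(node, n)], reverse=True)
    top = top[:sampling_number]
    # Go deeper along each selected neighbor before moving to the next one.
    return {
        n: attentive_dfs(graph, attention, n, search_depth - 1, sampling_number, path)
        for n in top
    }


# Toy bipartite drug-virus network (hypothetical nodes and weights).
graph = {
    "d1": ["v1", "v2", "v3"], "d2": ["v1"], "d3": ["v2"],
    "v1": ["d1", "d2"], "v2": ["d1", "d3"], "v3": ["d1"],
}
attention = {
    ("d1", "v1"): 0.5, ("d1", "v2"): 0.3, ("d1", "v3"): 0.2,
    ("v1", "d1"): 0.4, ("v1", "d2"): 0.6,
    ("v2", "d1"): 0.7, ("v2", "d3"): 0.3,
    ("d2", "v1"): 1.0, ("d3", "v2"): 1.0, ("v3", "d1"): 1.0,
}
field = attentive_dfs(graph, attention, "d1", search_depth=2, sampling_number=2)
```

With these toy weights, the low-weight neighbor `v3` is pruned from the field of `d1`, which is the noise-reduction effect the two hyperparameters are meant to control.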
The attention weight is calculated from the sequence representation of the neighbor node and that of the head node, both obtained from the encoder-decoder layer. In the real training process, the attention-based DFS is carried out for each node. To decrease the computational complexity, the attention weights can be calculated for all node pairs before DFS. Suppose the interaction matrix is $A \in \mathbb{R}^{N_{drug} \times N_{virus}}$, the drug embedding matrix is $E_{drug} \in \mathbb{R}^{N_{drug} \times d}$ and the virus embedding matrix is $E_{virus} \in \mathbb{R}^{N_{virus} \times d}$, where $N_{drug}$ and $N_{virus}$ are the numbers of drugs and viruses and $d$ is the embedding dimension; the attention weights of all drug-virus pairs can then be obtained from these three matrices in one pass.

After attention-based DFS has decreased the graph scale and reduced noise, network embedding is used to learn network topology features. Conceptually inspired by spatial-based GNN methods [30], our approach represents the target node by the nodes in its DFS network. The DFS network can be understood as the view of the target node at different layers. For the first layer, the neighbor representation of the target node is calculated from the attention weights and the initial neighbor representations obtained from the encoder-decoder layer; it represents the receptive field of the target node in the first layer. The first receptive field representation is then used to update the target node. As the layers increase, each node in the DFS network is updated using its receptive field in the next layer. The whole process is therefore recursive and is applied layer by layer over the DFS network (Eq. 16). Here, inspired by previous work [31], we also designed three kinds of update (aggregation) function: sum, concatenation and neighbor. After the node representations are obtained, the concatenation of the drug representation and the virus representation is sent into a feed-forward neural network with a sigmoid function to predict the probability of interaction.
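The bottom-up aggregation and the final scoring step can be illustrated with a short sketch. It is a simplified reading of the description above rather than the authors' implementation: attention weights are normalized by their sum, any learned weight matrices or nonlinearities that may follow each update are omitted, the recursion is shown only for the dimension-preserving modes, and all names (`weighted_neighbor_sum`, `update`, `represent`, `predict`) are hypothetical.

```python
import math


def weighted_neighbor_sum(weights, vectors):
    """Attention-weighted average of neighbor vectors (receptive-field representation)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w / total * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def update(node_vec, field_vec, mode):
    """Combine a node with its receptive field using one of the three named modes."""
    if mode == "sum":
        return [a + b for a, b in zip(node_vec, field_vec)]
    if mode == "concat":
        return list(node_vec) + list(field_vec)
    if mode == "neighbor":
        return list(field_vec)
    raise ValueError("unknown mode: " + mode)


def represent(node, field, embeddings, attention, mode="sum"):
    """Bottom-up recursion over a nested field {neighbor: sub-field}."""
    if mode == "concat":
        # Concatenation changes the vector size, so a real model would project it
        # back to the embedding dimension; that projection is not sketched here.
        raise ValueError("recursion is sketched for 'sum' and 'neighbor' only")
    if not field:
        return list(embeddings[node])
    neighbors = list(field)
    # Represent the deeper layers first, then fold them into this node.
    vectors = [represent(n, field[n], embeddings, attention, mode) for n in neighbors]
    weights = [attention[(node, n)] for n in neighbors]
    return update(embeddings[node], weighted_neighbor_sum(weights, vectors), mode)


def predict(drug_vec, virus_vec, w, b):
    """Sigmoid score of the concatenated drug and virus representations."""
    z = sum(wi * xi for wi, xi in zip(w, list(drug_vec) + list(virus_vec))) + b
    return 1.0 / (1.0 + math.exp(-z))


# Toy example with 2-dimensional initial embeddings (hypothetical values).
embeddings = {"d1": [1.0, 0.0], "v1": [0.0, 1.0], "v2": [2.0, 2.0], "d2": [4.0, 0.0]}
attention = {("d1", "v1"): 0.75, ("d1", "v2"): 0.25, ("v1", "d2"): 1.0}
field_d1 = {"v1": {"d2": {}}, "v2": {}}
d1_repr = represent("d1", field_d1, embeddings, attention, mode="sum")
score = predict(d1_repr, embeddings["v1"], w=[0.1, 0.1, 0.1, 0.1], b=0.0)
```

The single-layer `update` function shows all three modes; in a full model the weights `w` and `b` of the scoring layer would be trained, whereas here they are fixed placeholders.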
In the experiment, the Adam algorithm is adopted to optimize all trainable parameters, and five-fold cross-validation is used to evaluate the performance of the proposed method. [...]

In addition, to demonstrate the predictive ability of the proposed model, we select four types of computational models as baselines: similarity-based models and three kinds of representation learning-based models (MF-based, RW-based and NN-based). We select two representative computational models for each type: SCPMF [25] and IMCMDA [32] for similarity-based models, Laplacian [33] and Graph Factorization [34] for MF-based models, DeepWalk [35] and Node2Vec [36] for RW-based models, and LINE [37] and SDNE [38] for NN-based models. The parameters used in these baseline models are the same as in their original work.

In the experiment, five-fold cross-validation is used to test the proposed model on HDVD. [...]

To demonstrate the effectiveness of the proposed model, we compared it with the above-mentioned baseline models using five-fold cross-validation for predicting drug-virus interactions. Table 3 shows the detailed results for four evaluation indicators, and we also plotted ROC and PR curves to present the results intuitively (see Fig. 5). [...] Table 4, and the ROC curves and PR curves are shown in Fig. 7 and Fig. 8.

Hyperparameter sensitivity analysis is significant for the performance of a model in different scenarios [25]. The proposed model has four hyperparameters: search depth, sampling number, learning rate and embedding dimension. First, we focused on the search depth and sampling number and conducted five-fold cross-validation on the HDVD dataset to select these parameters. Specifically, we investigated the influence of search depth and sampling number by varying each from 1 to 8.

In this section, a case study is conducted to estimate the ability of the proposed model to [...] [39]. Studies in vitro revealed that Zanamivir inhibits SARS coronavirus infection.
The third drug, Ribavirin [40], was initially recommended for clinical practice in the China 2019-nCoV pneumonia Diagnosis and Treatment Plan Edition 5-Revised. The fifth drug, Oxymetholone, was found to be effective against wasting associated with HIV infection in a clinical trial [41]. How it might help in treating COVID-19 infection is debatable [42], but it has been recorded in the Excela COVID-19 drug repurposing dataset. The sixth drug, Camostat, can block SARS-CoV-2 infection of lung cells and could be considered for off-label treatment of COVID-19 infections [43, 44]. Equilin blocked the cellular entry of a pseudovirus formed by an HIV core packed with the Zaire Ebola virus glycoprotein in an in vitro experiment, and entry inhibitors have been studied for both filoviruses such as Ebola virus and coronaviruses [45]; Equilin is therefore regarded as a potential drug to treat COVID-19 [42]. The eighth drug, Melatonin, has been shown to target pathological alterations associated with Ebola infection [46], such as endothelial disruption, disseminated intravascular coagulation and multiple organ hemorrhage. Melatonin plays an inhibitory role in lung oxidative stress induced by respiratory syncytial virus infection in mice [47], and it is also considered one of the potential drugs to treat COVID-19 [42]. Additionally, Ref. [48] indicated that combined remdesivir and emetine therapy may provide better clinical benefits. These empirical results indicate that the proposed model has strong predictive ability and can narrow the scope of candidates for further biological experiments.

Drug repositioning, as an effective drug development method, provides a far more rapid route to the clinic than traditional drug discovery. In this study, we proposed a network-based drug repositioning method, SANE, to identify potential drugs for COVID-19. Unlike earlier network-based methods, SANE embeds attribute information into network representation learning.
By doing this, the representation obtained by SANE contains not only network structural information but also node attribute information. Third, SANE adopts an attention-based learning strategy to ensure stability and accuracy. Previous models select nodes or paths randomly, whereas SANE selects neighbor nodes by attention weight, which is a more reasonable strategy. The model can then concentrate on the crucial part of the network and reduce random noise. Although SANE performs well in COVID-19 drug repositioning, it still faces some challenges. Because the attribute features are extracted from drug SMILES or virus sequences, SANE is limited by incomplete sequences or missing information. Besides, although SANE is able to obtain reliable initial node embeddings, it is challenged to preserve node attribute similarity in the low-dimensional space. There is therefore still room for further improvement. In the future, we will adopt various optimization methods [49] to further improve prediction performance. Moreover, we will enlarge the network by integrating associations related to drugs, such as drug-drug interactions and drug-target interactions, to further enrich the information, and adopt a probability-based negative set sampling strategy to enhance the stability of the model.
The coronaviridae
Coronavirus genomics and bioinformatics analysis
MERS-CoV outbreak in Jeddah-a link to health care facilities
The proximal origin of SARS-CoV-2
The SARS-CoV-2 outbreak: what we know
Coronavirus disease (COVID-19)
An interactive web-based dashboard to track COVID-19 in real time
Spillover of COVID-19: impact on the Global Economy, Available at SSRN 3562570
The global macroeconomic impacts of COVID-19: Seven scenarios
Drug repositioning: identifying and developing new uses for existing drugs
Drug repositioning and repurposing: terminology and definitions in literature, Drug discovery today
Drug repositioning: a brief overview
Inhibition of Prostaglandin Synthesis as a Mechanism of Action for Aspirin-like Drugs
Thalidomide--a revival story
Alzheimer's disease drug development pipeline
Therapeutic drug repurposing, repositioning and rescue: Part I: overview. Drug discov
A new method for computational drug repositioning using drug pairwise similarity
Computational drug repositioning using collaborative filtering via multi-source fusion
A Comprehensive Integrated Drug Similarity Resource for In-Silico Drug Repositioning and Beyond
A survey of current trends in computational drug repositioning
A review on drug repurposing applicable to COVID-19
Drug repositioning based on similarity constrained probabilistic matrix factorization: COVID-19 as a case study
Computational drug repositioning with random walk on a heterogeneous network
A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information
Repurpose Open Data to Discover Therapeutics using Deep Learning
0: a major update to the DrugBank database for
Inductive representation learning on large graphs
KGNN: Knowledge Graph Neural Network for Drug-Drug Interaction Prediction
Predicting miRNA-disease association based on inductive matrix completion
Laplacian Eigenmaps for Dimensionality Reduction and Data
Graph Embedding Techniques, Applications, and Performance: A Survey, Knowledge-Based Systems
DeepWalk: online learning of social representations
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining
Line: Large-scale information network embedding
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining
Inhibition of SARS coronavirus infection in vitro with clinically approved antiviral drugs
Novel coronavirus treatment with ribavirin: Groundwork for an evaluation concerning COVID-19
Oxymetholone for the treatment of HIV-wasting: a double-blind, randomized, placebo-controlled phase III trial in eugonadal men and women
Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2
SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
Protease inhibitors targeting coronavirus and filovirus entry
Study of gonadal hormone drugs in blocking filovirus entry of cells in vitro, Yao xue xue bao=
Ebola virus disease: potential use of melatonin as a treatment
Inhibitory effect of melatonin on lung oxidative stress induced by respiratory syncytial virus infection in mice
Of chloroquine and COVID-19
Adam Deep Learning with SOM for Human Sentiment Classification

The authors would like to thank all anonymous reviewers for their constructive advice.