key: cord-0047455-3p0yieqv authors: Fan, Chunyan; Lei, Xiujuan; Tan, Ying title: Inferring Candidate CircRNA-Disease Associations by Bi-random Walk Based on CircRNA Regulatory Similarity date: 2020-06-22 journal: Advances in Swarm Intelligence DOI: 10.1007/978-3-030-53956-6_44 sha: 73b174c639dda7745d7f856a9cb8f7acfd03c1fd doc_id: 47455 cord_uid: 3p0yieqv

Identification of associations between circular RNAs (circRNAs) and diseases has become a hot topic, because it helps researchers understand disease mechanisms. However, traditional biological experiments are expensive and time-consuming. In this study, we propose a novel method named BWHCDA, which applies the bi-random walk algorithm on a heterogeneous network to predict circRNA-disease associations. First, circRNA regulatory similarity is measured based on circRNA-miRNA interactions, and circRNA similarity is calculated as the average of circRNA regulatory similarity and Gaussian interaction profile (GIP) kernel similarity for circRNAs. Similarly, disease similarity is the mean of disease semantic similarity and GIP kernel similarity for diseases. Then, the heterogeneous network is constructed by integrating the circRNA network and the disease network via circRNA-disease associations. Subsequently, the bi-random walk algorithm is run on the heterogeneous network to predict circRNA-disease associations. Finally, we use leave-one-out cross validation and 10-fold cross validation to evaluate the prediction performance of BWHCDA and obtain AUCs of 0.9334 and 0.8764 ± 0.0038, respectively. Moreover, the predicted hsa_circ_0000519-gastric cancer association is analyzed. The results show that BWHCDA could be an effective resource for guiding clinical experiments.

Circular RNAs (circRNAs) are a type of non-coding RNA with closed-loop structures formed by back-splicing [1].
Recently, large numbers of circRNAs have been found in various organisms [2]. They can regulate gene expression at the transcriptional or post-transcriptional level by titrating microRNAs (miRNAs) [3] and by regulating transcription and splicing [4, 5], and several circRNAs can even be translated into polypeptides [6]. A growing body of research has demonstrated that the misregulation of circRNAs may cause abnormal cellular functions and is associated with various diseases [7]. Thus, disease-associated circRNAs are becoming a class of promising biomarkers for disease diagnosis and treatment. However, identifying disease-related circRNAs with biomedical experiments is costly and laborious.

Recently, several computational approaches have been developed. Lei et al. [8] first designed a path-weighted approach named PWCDA to predict circRNA-disease associations. Likewise, KATZHCDA [9] was developed based on the KATZ model to measure the probability of each candidate circRNA-disease association; it uses circRNA expression similarity and disease phenotype similarity matrices as prior knowledge to establish the circRNA-disease heterogeneous network. DWNN-RLS [10] was designed based on Kronecker regularized least squares to predict the associations between circRNAs and diseases. iCircDA-MF [11] was developed based on non-negative matrix factorization by integrating circRNA-gene, gene-disease and circRNA-disease relationships. Wang et al. [12] used the recommendation algorithm PersonalRank to measure the relevance between circRNAs and diseases based on circRNA expression profiles and functional similarity. Although several methods have been developed for circRNA-disease association prediction, obtaining sufficiently accurate results remains a challenge.

In this study, we developed a novel framework named BWHCDA for predicting circRNA-disease associations, which integrates multiple similarity measures and applies the bi-random walk algorithm (Fig. 1).
First, circRNA regulatory similarity is measured based on the observation that circRNAs may play essential roles in regulating miRNA function in disease occurrence and progression. It is then combined with the Gaussian interaction profile (GIP) kernel similarity for circRNAs to obtain the integrated circRNA similarity. Similarly, disease similarity is defined as the average of disease semantic similarity and GIP kernel similarity for diseases. Subsequently, the heterogeneous network is constructed by combining the circRNA network, the disease network and the circRNA-disease associations. Then, circular bigraph (CBG) patterns are introduced in the bi-random walk algorithm to predict the missing associations based on the heterogeneous network. The results show that BWHCDA could be considered a powerful tool for predicting circRNA-disease associations.

The experimentally validated human circRNA-disease associations are extracted from the CircR2Disease database [13]. We then keep the associations whose circRNAs are recorded in the circBase database [14] and whose disease names are recorded in the Disease Ontology database [15]. Finally, we retain 371 circRNA-disease associations between 325 circRNAs and 53 diseases as the gold standard dataset. The circRNA-disease adjacency matrix A is established as follows: if there is a known association between circRNA c_i and disease d_j, A(i,j) is set to 1, otherwise 0.

The miRNA-circRNA interactions are downloaded from the CircBank database [16], and the interactions that involve disease-related circRNAs are selected to measure the regulatory similarity of circRNAs. The regulatory similarity between circRNAs c_i and c_j is computed from their miRNA sets, where M_i denotes the set of miRNAs that interact with circRNA c_i and M_j denotes the set of miRNAs that interact with circRNA c_j.

The disease names are described as hierarchical directed acyclic graphs (DAGs) based on the Medical Subject Headings (MeSH) descriptions for diseases, and disease semantic similarity is calculated with the DOSE [17] tool using the method of Wang et al.
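The construction of the adjacency matrix A and of a set-based regulatory similarity can be illustrated with a minimal Python sketch. The regulatory similarity formula itself is not reproduced in this text, so the sketch assumes a Jaccard-style overlap |M_i ∩ M_j| / |M_i ∪ M_j| purely for illustration; the function names, identifiers and toy data are hypothetical and are not taken from the BWHCDA implementation.

```python
import numpy as np

def build_adjacency(pairs, circrnas, diseases):
    """Return A with A[i, j] = 1 if (circrnas[i], diseases[j]) is a known association."""
    c_index = {c: i for i, c in enumerate(circrnas)}
    d_index = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(circrnas), len(diseases)))
    for c, d in pairs:
        A[c_index[c], d_index[d]] = 1.0
    return A

def regulatory_similarity(mirna_sets):
    """ASSUMPTION: Jaccard overlap of miRNA sets; the paper's formula may differ."""
    n = len(mirna_sets)
    S = np.eye(n)  # each circRNA is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            union = mirna_sets[i] | mirna_sets[j]
            if union:  # leave the similarity at 0 when both sets are empty
                S[i, j] = S[j, i] = len(mirna_sets[i] & mirna_sets[j]) / len(union)
    return S

# Toy data (hypothetical identifiers, not from CircR2Disease or CircBank).
circrnas = ["circA", "circB", "circC"]
diseases = ["d1", "d2"]
pairs = [("circA", "d1"), ("circB", "d1"), ("circC", "d2")]
mirna_sets = [{"miR-1", "miR-2"}, {"miR-2", "miR-3"}, {"miR-9"}]

A = build_adjacency(pairs, circrnas, diseases)
S = regulatory_similarity(mirna_sets)
```

Any other set-overlap measure could be substituted in `regulatory_similarity` without changing the rest of the pipeline, since later steps only require a symmetric circRNA-by-circRNA matrix.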
Based on the assumption that similar circRNAs (diseases) tend to have similar interaction or non-interaction patterns with diseases (circRNAs) [18], the GIP kernel similarities for circRNAs and diseases are calculated from their interaction profiles, where c(i) (or d(i)) denotes the interaction profile of the i-th circRNA (disease), which is the i-th row (column) of the adjacency matrix A. The parameters γ_c and γ_d control the kernel bandwidth, and n_c (or n_d) is the number of circRNAs (diseases).

The integrated circRNA similarity (SC) is calculated as the average of the circRNA regulatory similarity and the GIP kernel similarity for circRNAs. Similarly, the integrated disease similarity (SD) is the mean of the disease semantic similarity and the GIP kernel similarity for diseases. Then, SC and SD are adjusted with the logistic function [19], where x is the value of an element of matrix SC or SD. Parameters c and d control the adjustment effect; we set c to −15 and d to log(9999).

From the circRNA similarity and disease similarity measures, the circRNA network and the disease network are constructed. Next, the weighted heterogeneous circRNA-disease network is constructed by connecting the circRNA network and the disease network via the gold standard circRNA-disease associations. The heterogeneous network can be viewed as a bipartite graph in which the nodes represent circRNAs or diseases and the edges represent three types of interactions: circRNA-circRNA, disease-disease and circRNA-disease.

Based on the topological and structural characteristics of the circRNA network and the disease network, the concept of the CBG is introduced. A CBG is a subgraph consisting of a circRNA path {c_1, c_2, …, c_n} and a disease path {d_1, d_2, …, d_m} whose two ends are connected by the circRNA-disease associations (c_1, d_1) and (c_n, d_m).
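The similarity construction described in this section can be sketched in Python as follows. The sketch assumes the commonly used GIP kernel form exp(−γ‖p_i − p_j‖²), with γ normalized by the mean squared norm of the profiles, and the logistic form 1 / (1 + exp(c·x + d)); both forms are assumptions based on the cited literature [18, 19] rather than formulas reproduced from this text. The function names and the `gamma_prime` parameter are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel between the rows of `profiles` (assumed standard form)."""
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq = sq_norms.mean()
    # Bandwidth normalized by the average squared profile norm (assumption).
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def logistic_adjust(x, c=-15.0, d=np.log(9999)):
    """Element-wise logistic adjustment with c = -15 and d = log(9999)."""
    return 1.0 / (1.0 + np.exp(c * x + d))

def integrate(sim_a, sim_b):
    """Average two similarity matrices, then apply the logistic adjustment."""
    return logistic_adjust((sim_a + sim_b) / 2.0)

# Toy adjacency matrix: 3 circRNAs (rows) x 2 diseases (columns).
A = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
KC = gip_kernel(A)      # circRNA profiles are the rows of A
KD = gip_kernel(A.T)    # disease profiles are the columns of A

RS = np.eye(3)          # placeholder for the circRNA regulatory similarity
DS = np.eye(2)          # placeholder for the disease semantic similarity
SC = integrate(RS, KC)
SD = integrate(DS, KD)
```

With these parameter values the logistic function maps a raw similarity of 0 to about 0.0001 and a raw similarity of 1 to a value close to 1, so weak similarities are strongly suppressed.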
The CBG indicates a vicinity relation between the two associations (c_1, d_1) and (c_n, d_m), which is generalized by their distance to other associations in the circRNA network and disease network. The length of a CBG pattern (l, r) is determined by the longer of the circRNA path and the disease path. In this study, we hypothesize that most potential associations tend to be covered by many short CBGs in the unknown circRNA-disease network: the more CBG patterns there are between a circRNA and a disease, the higher the probability that they are associated. By iteratively extending the circRNA path and the disease path, we calculate the CBGs weighted by a decay factor α that ranges from 0 to 1. Because the circRNA network and the disease network differ in structure and topology, the optimal number of random walk steps differs between them. Therefore, parameters l and r are introduced to limit the number of random walk steps in the circRNA similarity network and the disease similarity network, respectively.

The bi-random walk iterates separately on the circRNA network and on the disease network, where α is the decay factor that controls the importance of CBGs for different paths, and SC_L and SD_L denote the similarity matrices normalized by Laplacian regularization, in which Dc (or Dd), with entries Dc(i,i) (or Dd(i,i)), is the diagonal matrix of the circRNA similarity matrix SC (disease similarity matrix SD). By combining the propagation scores of matrices Cc and Dd, the relevance scores of unknown circRNA-disease associations are obtained. The BWHCDA algorithm is outlined in Table 1.

To assess the performance of BWHCDA, leave-one-out cross validation (LOOCV) and 10-fold cross validation (10-fold CV) are performed on the gold standard dataset. For LOOCV, each known circRNA-disease association is removed in turn as the test sample, and the other associations are used as training samples.
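The iterative propagation and the leave-one-out step described above can be sketched in Python as follows. The update equations are not reproduced in this text, so the sketch assumes the form commonly used in the bi-random walk literature: a walk on the circRNA side, α·SC_L·R + (1 − α)·A, for the first l steps; a walk on the disease side, α·R·SD_L + (1 − α)·A, for the first r steps; and the average of whichever walks were taken at each step. The normalization D^(−1/2)·S·D^(−1/2), with D holding the row sums of S, is likewise an assumption, and all function names and toy values are hypothetical.

```python
import numpy as np

def laplacian_normalize(S):
    """Symmetric normalization D^{-1/2} S D^{-1/2}; D holds the row sums of S (assumed)."""
    deg = S.sum(axis=1)
    inv_sqrt = np.zeros(len(deg))
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return S * inv_sqrt[:, None] * inv_sqrt[None, :]

def bi_random_walk(SC, SD, A, alpha=0.4, l=2, r=3):
    """Propagate known associations A over both similarity networks (assumed update rule)."""
    SC_L = laplacian_normalize(SC)
    SD_L = laplacian_normalize(SD)
    R = A.astype(float)
    for t in range(1, max(l, r) + 1):
        walked = []
        if t <= l:  # walk on the circRNA network for the first l steps
            walked.append(alpha * SC_L @ R + (1 - alpha) * A)
        if t <= r:  # walk on the disease network for the first r steps
            walked.append(alpha * R @ SD_L + (1 - alpha) * A)
        R = sum(walked) / len(walked)
    return R

# Toy similarity matrices and associations (illustrative values only).
SC = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
SD = np.array([[1.0, 0.3], [0.3, 1.0]])
A = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)

# One leave-one-out step: hide the known association (circRNA 1, disease 0).
A_train = A.copy()
A_train[1, 0] = 0.0
scores = bi_random_walk(SC, SD, A_train)
```

In this toy example the hidden pair receives a higher score than the other candidate for the same circRNA, because circRNA 1 is most similar to circRNA 0, which is still linked to disease 0. In a full evaluation the similarity matrices that depend on A would also be recomputed from the training associations.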
The unknown circRNA-disease associations are then treated as candidate associations, and the prediction performance is assessed by the predicted rank of the test sample. In 10-fold CV, the known circRNA-disease associations are randomly divided into ten subsets; each subset is used in turn as the test set while the remaining nine form the training set. To reduce the bias caused by sample division, we repeat 10-fold CV 100 times. Receiver operating characteristic (ROC) curves are plotted from the true positive rate (TPR) and false positive rate (FPR) to show the prediction performance, and the area under the curve (AUC) is calculated to evaluate the overall performance.

There are three parameters in BWHCDA: α, I_l and I_r. To test their effects, we set α to {0.2, 0.4, 0.6, 0.8} and vary I_l and I_r from 1 to 5. The resulting LOOCV AUC values are shown in Tables 2, 3, 4 and 5. The results indicate that α has little effect on prediction performance. Within five steps, the LOOCV AUC is highest when α = 0.4, I_l = 4 and I_r = 5. Within four steps, it is highest when α = 0.4, I_l = 3 and I_r = 4. Within three steps, it is highest when α = 0.4, I_l = 2 and I_r = 3. Within two steps, it is highest when α = 0.6, I_l = 1 and I_r = 2. Finally, we set the three parameters to α = 0.4, I_l = 2 and I_r = 3.

Table 2. Effect of parameters I_l and I_r on LOOCV AUC when α is set to 0.2.
Table 3. Effect of parameters I_l and I_r on LOOCV AUC when α is set to 0.4.
Table 4. Effect of parameters I_l and I_r on LOOCV AUC when α is set to 0.6.
Table 5. Effect of parameters I_l and I_r on LOOCV AUC when α is set to 0.8.

To further evaluate the prediction performance of BWHCDA, we compare it with five other methods: KATZHCDA [9], PageRank [20], NCP [21], BDSILP [22] and HeteSim [23]. BWHCDA achieves the best performance among these six approaches in terms of AUC under both LOOCV and 10-fold CV on the same datasets (Figs. 2 and 3).

To further assess the prediction performance of BWHCDA, we analyze the predicted hsa_circ_0000519-gastric cancer association. As shown in Fig. 4, hsa_circ_0000519 may interact with miRNAs including hsa-miR-1233, hsa-miR-1258, hsa-miR-1296, hsa-miR-146b-3p and hsa-miR-521 to exert its biological role. Target genes of these miRNAs have been validated as related to gastric cancer, including HPSE (targeted by hsa-miR-1258), PER1 and IRAK1 (targeted by hsa-miR-146b-3p), ERCC8 (targeted by hsa-miR-521) and ERBB2 (targeted by hsa-miR-1296-5p). In addition, hsa-miR-1296 itself has been validated as associated with gastric cancer. Therefore, hsa_circ_0000519 may be a potential biomarker for gastric cancer diagnosis and prognosis.

Prioritizing potential associations between circRNAs and diseases benefits the understanding of disease mechanisms as well as disease diagnosis and treatment. BWHCDA performs well for the following reasons. First, the bi-random walk explores CBG patterns by iteratively performing random walks on the circRNA similarity network and the disease similarity network. In addition, BWHCDA is a multi-task learning method that predicts potential circRNA-disease associations for all diseases simultaneously rather than mining candidate circRNAs for one specific disease. Therefore, BWHCDA could be an effective method for biomedical research.

[1] Circular RNA biogenesis can proceed through an exon-containing lariat precursor
[2] Circular RNAs are a large class of animal RNAs with regulatory potency
[3] A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language?
[4] Circular intronic long noncoding RNAs
[5] A circRNA from SEPALLATA3 regulates splicing of its cognate mRNA through R-loop formation
[6] CircCode: a powerful tool for identifying circRNA coding ability
[7] CircRNA: functions and properties of a novel potential biomarker for cancer
[8] PWCDA: path weighted method for predicting circRNA-disease associations
[9] Prediction of CircRNA-disease associations using KATZ model based on heterogeneous networks
[10] DWNN-RLS: regularized least squares method for predicting circRNA-disease associations
[11] iCircDA-MF: identification of circRNA-disease associations based on matrix factorization
[12] Predicting circRNA-disease associations based on circRNA expression similarity and functional similarity
[13] CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases
[14] circBase: a database for circular RNAs
[15] Disease ontology: a backbone for disease semantic integration
[16] Circbank: a comprehensive database for circRNA with standard nomenclature
[17] DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis
[18] Gaussian interaction profile kernels for predicting drug-target interaction
[19] Associating genes and protein complexes with disease via network propagation
[20] The PageRank citation ranking: Bringing order to the web
[21] Network consistency projection for human miRNA-disease associations inference
[22] The bi-direction similarity integration method for predicting microbe-disease associations
[23] Predicting the associations between microbes and diseases by integrating multiple data sources and path-based HeteSim scores

Acknowledgements. This work was supported by the National Natural Science Foundation of China (61972451, 61672334, 61902230) and the Fundamental Research Funds for the Central Universities, Shaanxi Normal University (GK201901010).