key: cord-0025173-x3lhuvcr authors: Kang, Juanjuan; Tang, Qiang; He, Jun; Li, Le; Yang, Nianling; Yu, Shuiyan; Wang, Mengyao; Zhang, Yuchen; Lin, Jiahao; Cui, Tianyu; Hu, Yongfei; Tan, Puwen; Cheng, Jun; Zheng, Hailong; Wang, Dong; Su, Xi; Chen, Wei; Huang, Yan title: RNAInter v4.0: RNA interactome repository with redefined confidence scoring system and improved accessibility date: 2021-10-30 journal: Nucleic Acids Res DOI: 10.1093/nar/gkab997 sha: 2566ed753892cc49f9f9efa9b072d29da87d6e54 doc_id: 25173 cord_uid: x3lhuvcr Establishing an RNA-associated interaction repository facilitates the system-level understanding of RNA functions. However, because these interactions are distributed across various resources, applying the data effectively requires that they be deposited together and annotated with confidence scores. Hence, we have updated the RNA-associated interaction database RNAInter (RNA Interactome Database) to version 4.0, which is freely accessible at http://www.rnainter.org or http://www.rna-society.org/rnainter/. Compared with previous versions, the current RNAInter contains not only an enlarged data set but also an updated confidence scoring system. The merits of version 4.0 can be summarized as follows: (i) a redefined confidence scoring system that integrates the trust of experimental evidence, the trust of the scientific community and the types of tissues/cells; (ii) a redesigned, fully functional database that enables more rapid retrieval and browsing of interactions via an upgraded user-friendly interface; and (iii) an update of entries to >47 million by manually mining the literature and integrating six database resources with evidence from experimental and computational sources. Overall, RNAInter will provide a more comprehensive and readily accessible RNA interactome platform to investigate the regulatory landscape of cellular RNAs.
Benefiting from advances in small-scale experiments and high-throughput sequencing technology, a growing number of RNA-associated interactions have been revealed over the past decades. These interactions have been implicated in almost all physiological and pathological conditions and involve processes such as cell growth and development, tumorigenesis and tumor invasion. For example, the long non-coding RNA (lncRNA) BCRT1 can interact with miR-1303 to promote breast cancer progression (1), and RNA-binding protein activities can regulate mRNA processing of specific gene sets to affect developmental processes (2). In addition to the established RNA-RNA and RNA-protein interactions, a number of other types of RNA-associated interactions that also perform important biological functions have been recognized. By using modular domains to establish chromosome conformations, RNA can interact with DNA and thereby regulate gene expression (3). Moreover, non-coding RNAs are associated with drug resistance and have provided a new class of targets for drug discovery (4), while histone modifications have been reported to be involved in the transcriptional regulation of RNA through biological processes such as RNA splicing (5). Hence, a system-level understanding of RNA biological functions requires that RNA-associated interactions be integrated under one common framework. Over the past few years, by integrating experimentally validated and computationally predicted RNA-associated interactions, we have generated three versions of the RNA Interactome Database: RAID (6), RAID v2.0 (7) and RNAInter v3.0 (8). These databases have provided information on a number of interaction types, such as RNA-RNA (RRI), RNA-protein (RPI), RNA-DNA (RDI), RNA-histone modification (RHI) and RNA-compound (RCI), and have been widely used in the scientific community (9, 10).
As a result of our efforts to continuously update the previous versions of RNAInter, >40 million interactions have now been included, providing a comprehensive RNA interactome platform for researchers. However, these interactions were derived from a variety of sources, such as manual text mining, database integration and computational prediction. Therefore, to evaluate each interaction more accurately, it is critical to optimize the algorithm that assigns confidence scores. With this goal in mind, we have now updated RNAInter to version 4.0 (http://www.rnainter.org or http://www.rna-society.org/rnainter/). In this version, the confidence scoring system has been redefined by integrating the trust of experimental evidence, the trust of the scientific community and the types of tissues/cells. Moreover, we redesigned the website framework and updated its entries by manually mining the literature and integrating six databases. These improvements allow users to quickly and accurately retrieve and browse the RNA-associated interactions deposited in the database.
RNAInter integrates experimentally validated and computationally predicted RNA interactome data from the literature and databases. Through PubMed searches using the same keywords as in our previous work, we screened over 30 000 published reports and obtained over 20 000 new entries describing experimentally validated RNA-RNA and RNA-protein interactions. The interaction information obtained includes 'Interactors', 'Species', 'Tissue or Cell Line', 'Target region' and 'Supportive evidence'. In addition, RNAInter integrated three experimentally validated databases, LncTarD (11), NPInter v4.0 (12) and NoncoRNA (13), as well as three computationally predicted databases, miRDB v6.0 (14), oRNAment (15) and tRFtarget (16), which include RNA-RNA, RNA-protein, RNA-compound and RNA-DNA interactions (Table 1).
Accordingly, the number of integrated databases in RNAInter has increased from 35 to 39, and RNAInter now contains more comprehensive RNA interactome data. All new data were compared with the interactions in RNAInter v3.0 to eliminate redundancies. For interactions that were already present, the 'Tissue or Cell Line', 'Target region', 'Methods' and 'References' fields were merged into the existing entries. Moreover, as data were derived from different sources, we standardized the names of tissues (or cell lines) and methods for all interactions. For all new interactors, symbols and IDs were standardized to those assigned by established databases: miRNA from miRBase (17), circRNA from circBase (18), transfer RNA-derived fragments (tRFs) from tRFdb (19), compound from PubChem Compound (20) and other interactors from NCBI Gene (21). For the convenience of users, aliases, descriptions and other IDs from DrugBank (22), OMIM (23), Ensembl (24), HGNC (25), HPRD (26) and UniProt (27) were all included in the interactor information.
As in previous versions, we also obtained RNA editing/localization/modification/structure/expression patterns. 'RNA editing' consisted of information on editing position, base change and genetic region from DARNED (28), LNCediting (29) and RADAR (30). 'RNA localization' included subcellular localization and tissue/cell lines from RNALocate (31). 'RNA modification' involved the position and type of modification and the genomic context from RMBase (32). 'RNA structure' displayed the putative RNA secondary structures for each transcript as calculated by the RNAstructure software (33). 'Expression pattern' showed RNA expression values at each stage of human (or mouse) spermatogenesis (34, 35) and hematopoietic stem cell (HSC) lineage commitment (36, 37), and expression correlation coefficients of the two interactors were determined. Some additional functional modules were also provided.
These include disease associations from MNDR v3.0 (38) and tissue-specific expression from the GTEx project (39). Homology interactions were pre-calculated from the orthology/paralogy gene sets of NCBI Gene (21). An overview of data integration and annotation is shown in Figure 1.
In RAID v2.0, the supporting methods were divided into strong experimental evidence (e.g. RNA immunoprecipitation and luciferase reporter assay), weak experimental evidence (e.g. ChIP-seq and CLIP-seq) and computationally predicted evidence [e.g. miRanda (40) and TargetScan (41)]. A confidence score was calculated using a sigmoid function by integrating the different sources of evidence. In the current version, the new confidence scoring system is defined by integrating the trust of experimental evidence (E), the trust of the scientific community (S) and the types of tissues/cells (T). Such an approach increases the scoring reliability.
With regard to the trust of experimental evidence, a small-scale experiment is more reliable than a large-scale screening. Therefore, interactions supported by publications describing few interactions should have a higher confidence score than those supported by publications describing many interactions (42). This metric, E, can be calculated as follows: where i indexes the publications or prediction tools supporting the interaction and n_i represents the number of interactions described or predicted by the i-th publication or prediction tool.
Trust of the scientific community is reflected by the number of citations and the publication year, as derived from Google Scholar. The greater the number of citations, the higher the confidence score (42). This metric, S, can be calculated as follows: where i indexes the publications or prediction tools supporting the interaction, r_i represents the number of citations of the i-th publication or prediction tool, and y_i is the publication year of the i-th publication or prediction tool.
For the types of tissues/cells, the greater the number of tissues/cells in which an interaction was detected, the higher the confidence score for that interaction. This metric, T, can be calculated as follows: T = number of tissue/cell types. These three metrics were each scaled to the range 0 to 1, and the weighted Euclidean distance was then calculated as the confidence score, where the weights α and β range from 0 to 1.
We evaluated the new confidence scoring system based on three levels of supporting evidence, defined by the sources of evidence from which the interactions were derived. Interactions with strong evidence were supported by strong experiments, those with weak evidence by weak experiments, and those with predicted evidence only by prediction methods. Because interactions with strong evidence and those with predicted evidence are the two most clearly separated sets, we used the interactions with strong evidence as the positive dataset and those with predicted evidence as the negative dataset. The area under the ROC curve (AUC) was then used to select the optimal weight combination through iterative traversal with a step size of 0.05. Final confidence scores were then log2-transformed and scaled from 0 to 1. In this way, interactions reported in highly cited papers and detected in greater numbers of tissues/cells receive a higher confidence score.
In this version, through scanning >30 000 published reports and comprehensively evaluating RNA interactome (RNA-RNA/Protein/Compound/DNA) data, >6 million new entries were added and 2 million entries were updated. Of these >8 million interactions, over 600 000 were experimentally validated, while the remainder were computationally predicted. The AUCs calculated for different weight combinations are listed in Supplementary Table S2.
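The weight selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact scoring formula is not reproduced in this text, so the sketch assumes that α weights E, β weights S and that T carries a fixed weight (the hypothetical parameter `t_weight`). The function names and the record layout are likewise assumptions made for illustration, and the final log2 transformation is omitted.

```python
from itertools import product
from math import sqrt


def min_max_scale(values):
    """Scale a list of numbers to the range 0-1 (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_score(e, s, t, alpha, beta, t_weight=1.0):
    """Weighted Euclidean norm of the three scaled metrics.

    ASSUMPTION: alpha weights E, beta weights S and T gets a fixed
    weight (t_weight); the source text does not state this mapping.
    """
    return sqrt(alpha * e ** 2 + beta * s ** 2 + t_weight * t ** 2)


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def grid_search(records, step=0.05):
    """Return (best_auc, alpha, beta) over a grid of weights in [0, 1].

    records: list of (E, S, T, is_strong) tuples with metrics already
    scaled to 0-1; is_strong is True for interactions with strong
    evidence (positives) and False for predicted-only ones (negatives).
    """
    n_steps = round(1 / step)
    grid = [k * step for k in range(n_steps + 1)]
    best = (-1.0, None, None)
    for alpha, beta in product(grid, grid):
        pos, neg = [], []
        for e, s, t, is_strong in records:
            score = combined_score(e, s, t, alpha, beta)
            (pos if is_strong else neg).append(score)
        current = auc(pos, neg)
        if current > best[0]:  # keep the first combination reaching the maximum
            best = (current, alpha, beta)
    return best
```

With a step size of 0.05 the grid contains 21 × 21 = 441 weight combinations. The pairwise AUC used here is quadratic in the number of interactions, so a rank-based computation would be preferable for data sets of the size described in the text.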
The AUC for distinguishing interactions with strong evidence from those with predicted evidence was highest, reaching 0.9230, when α and β were 0.85 and 0.25, respectively. We therefore chose this weight combination to integrate the three metrics (E, S and T). As shown in Figure 3A, the optimal cutoff was 0.198, with a specificity of 0.903 and a sensitivity of 0.936, for distinguishing interactions with strong evidence from those with predicted evidence. The scoring system also performed well in distinguishing interactions with weak evidence from those with predicted evidence: here the AUC was likewise the highest, reaching 0.9163, and the optimal cutoff was 0.186, with a specificity of 0.897 and a sensitivity of 0.936 (Figure 3A). In addition, we evaluated the score distribution of interactions with strong, weak and predicted evidence. As shown in Figure 3B, interactions with experimental evidence (blue and green bars) were most enriched in the score interval from 0.2 to 0.3, while interactions with predicted evidence (red bar) were most enriched in the interval from 0.1 to 0.2. The mean scores of interactions with strong and weak evidence were 0.2886 and 0.2767, respectively, both clearly higher than 0.1814, the mean for interactions with predicted evidence. These results demonstrate that the new confidence scoring system can effectively estimate the reliability of RNA-associated interactions using more objective metrics and can be used to filter interactions of interest from vast arrays of interactions.
Given the large number of interactions, it is essential that the database can be searched and browsed quickly and thoroughly. The RNAInter database, which was redesigned on the basis of the Django Model-View-Controller (MVC) framework, achieves this goal by enabling quick retrieval and browsing of interactions through an improved user-friendly interface. As a result, searching and browsing in this new version are considerably faster than in the old version.
We optimized the browse module to display all related entries regardless of their number, which resolves the earlier problem that the browse module could not display interactions when the amount of data was too large. A filter option has also been provided to further screen results by interactor, interactor category and species, which allows users to locate interactions of interest more readily. The result page thus consists of the query conditions, the interaction display and the filter options.
Increasing evidence suggests that RNAs, especially ncRNAs, are involved in the progression of a number of diseases (43). Most of these RNAs regulate pathways related to the occurrence and development of diseases by interacting with target molecules (44). To promote research on RNA-associated interactions in disease processes, we embedded disease associations derived from MNDR v3.0 into RNAInter. We also added human tissue expression data from the GTEx project to the 'Expression pattern' module. In this way, the expression levels of two interactors in different tissues are displayed simultaneously.
In RNAInter v4.0, we have increased the number of RNA-associated interactions from the literature and other databases and optimized the website to enhance its speed and convenience. Most importantly, by integrating multidimensional information, we generated a confidence scoring system that can accurately evaluate each interaction and enables researchers to screen RNA interactions of interest more readily. For example, interactions without experimental support but with high confidence scores could be reliable candidates for RNA functional studies. This confidence scoring system will also contribute to a better understanding of the biological functions associated with these RNA interactions.
Nucleic Acids Research, 2022, Vol. 50, Database issue D331
With technological advances in experimental measurements and computational techniques, a rapidly growing body of literature describing the RNA interactome will be generated. It will be time-consuming and virtually unmanageable to manually review and collect RNA-associated interactions from this literature. Accordingly, the development of automated text mining tools for full-scale RNA interactome scanning will be fundamental for advances in this field. The RNAInter database described in the present work provides a reliable RNA interactome corpus for the development of such text mining tools, which will facilitate the application of artificial intelligence-based text mining algorithms. Overall, RNAInter provides an upgraded, comprehensive platform that will substantially contribute to functional RNA interactome research. Finally, it should be noted that this RNAInter database represents the current, but not final, version, as it will be continuously updated and improved once new data on the RNA interactome become available.
Supplementary Data are available at NAR Online.
LncRNA BCRT1 promotes breast cancer progression by targeting miR-1303/PTBP3 axis
Roles for RNA-binding proteins in development and disease
HiChIRP reveals RNA-associated chromosome conformation
Non-coding RNAs as drug targets
Emerging roles of histone modifications and HDACs in RNA splicing
RAID: a comprehensive resource for human RNA-associated (RNA-RNA/RNA-protein) interaction
RAID v2.0: an updated resource of RNA-associated interactions across organisms
RNAInter in 2020: RNA interactome repository with increased coverage and annotation
Disease severity-specific neutrophil signatures in blood transcriptomes stratify COVID-19 patients
Predicting the interaction biomolecule types for lncRNA: an ensemble deep learning approach
LncTarD: a manually-curated database of experimentally-supported functional lncRNA-target regulations in human diseases
NPInter v4.0: an integrated database of ncRNA interactions
NoncoRNA: a database of experimentally supported non-coding RNAs and drug targets in cancer
miRDB: an online database for prediction of functional microRNA targets (2020)
oRNAment: a database of putative RNA binding protein target sites in the transcriptomes of model species (2020)
tRFtarget: a database for transfer RNA-derived fragment targets (2021)
miRBase: from microRNA sequences to function
circBase: a database for circular RNAs
tRFdb: a database for transfer RNA fragments
PubChem in 2021: new data content and improved web interfaces
Entrez Gene: gene-centered information at NCBI
DrugBank 5.0: a major update to the DrugBank database
OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders
Genenames.org: the HGNC and VGNC resources in 2021
Human Protein Reference Database-2009 update
UniProt: the universal protein knowledgebase in 2021
DARNED: a DAtabase of RNa EDiting in humans
LNCediting: a database for functional effects of RNA editing in lncRNAs
RADAR: a rigorously annotated database of A-to-I RNA editing
RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation
RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data
RNAstructure: Web servers for RNA secondary structure prediction and analysis
Single-cell RNA sequencing analysis reveals sequential cell fate transition during human spermatogenesis
Single-cell RNA-seq uncovers dynamic processes and critical regulators in mouse spermatogenesis
Human haematopoietic stem cell lineage commitment is a continuous process
Tracing haematopoietic stem cell formation at single-cell resolution
MNDR v3.0: mammal ncRNA-disease repository with increased coverage and annotation (2021)
The Genotype-Tissue Expression (GTEx) project
Human MicroRNA targets
Predicting effective microRNA target sites in mammalian mRNAs
A scored human protein-protein interaction network to catalyze genomic interpretation
Non-coding RNAs in human disease
miRNAs in B cell development and lymphomagenesis (2017)