key: cord-0542294-yoc0wa9x authors: Mallet, Vincent; Oliver, Carlos; Broadbent, Jonathan; Hamilton, William L.; Waldispühl, Jérôme title: RNAglib: A Python Package for RNA 2.5D Graphs date: 2021-09-09 journal: nan DOI: nan sha: 65e80838f0035a7569324b38886714e91ce829dd doc_id: 542294 cord_uid: yoc0wa9x

RNA 3D architectures are stabilized by sophisticated networks of (non-canonical) base pair interactions, which can be conveniently encoded as multi-relational graphs and efficiently exploited by graph theoretical approaches and recent progress in machine learning techniques. RNAglib is a library that eases the use of this representation by providing clean data, methods to load it in machine learning pipelines, and graph-based deep learning models suited to this representation. RNAglib also offers other utilities to model RNA with 2.5D graphs, such as drawing tools, comparison functions and baseline performances on RNA applications. The method and data are distributed as a fully documented pip package. Availability: https://rnaglib.cs.mcgill.ca

Recent developments in machine learning and deep learning techniques enable us to leverage the vast amount of biological data, such as sequencing data or biochemical assays, publicly released and organized by the community. This has allowed breakthroughs in many areas, including the prediction of protein 3D structures with AlphaFold [10]. This progress allows the whole structural biology community to move on to new critical challenges that were previously out of reach and far from the spotlight. This is particularly the case for RNAs. RNA is a highly structured molecule that supports many regulatory and enzymatic functions beyond its well-known messenger role [13, 4].
As such, it is a promising class of therapeutic drug targets [22, 2], as illustrated by novel treatments of CMV retinitis in patients with AIDS [6] or the production of self-amplifying vaccines, which have recently seen a high level of success in clinical trials for COVID-19 [5]. Because of the limited (but growing) amount of structural data available for RNAs, the task of designing robust machine learning methods to predict RNA 3D structures is more challenging than for proteins. However, RNA folding relies on a remarkable hierarchical organization of its structure. The success of machine learning applications will depend on our capacity to use this information efficiently.

* Equal contribution. † To whom correspondence should be addressed. arXiv:2109.04434v1 [q-bio.QM] 9 Sep 2021

Representing objects as graphs encodes strong prior knowledge, and graph neural networks have been shown to induce a tremendous performance boost in many applications. To capture the tertiary structure of RNA in a computationally feasible manner, a growing number of algorithms make use of 2.5D graph networks [19, 15, 9, 20, 18, 3]. These networks represent RNA molecules as topological graphs whose nodes are nucleotides and whose edge types are structural categories of interactions. We have previously combined the 2.5D graph representation with graph neural networks to predict small molecule binding [14], and we believe that the wider adoption of this representation is limited by the lack of dedicated software. To our knowledge, the Python package forgi is the only other effort to streamline research on RNA networks. However, it focuses on coarse-grained models based on secondary structure elements instead of base-pair interactions, and it does not include machine learning features. We present a PyPI package, RNAglib, that aims to fill that gap by providing utilities to represent the structure of RNA as 2.5D graphs.
RNAglib provides clean data available for download, along with loading, encoding and splitting routines. It also provides structural comparison functions that help unsupervised pre-training, and default models to learn RNA properties such as small-molecule or protein binding; we offer a benchmark performance for these tasks. Finally, RNAglib offers utility scripts to save, preprocess or plot graphs, which facilitates the manipulation of the data for research. The package includes the following modules:
• prepare_data: Updates and builds the annotated graph database.
• loading: DGL data loaders for RNA graphs.
• models: Pre-built GCN models.
• learning: Learning routines for the easiest use of the package.
• drawing: 2.5D graph visualization tools.
• kernels: Subgraph comparison routines.
• benchmark: Reproducible evaluation procedures for benchmarking.

We collect and update (on a bi-monthly basis) a set of all PDB structures containing RNA on our server. We include a redundancy-reduced subset of graphs based on a list of integrated functional elements curated by the BGSU RNA group (version 3.145; or latest) [17]. We then use the annotation software x3DNA-DSSR [12] to compute annotations for all structures and complement these annotations using BioPython [1]. With this collection of annotations, we construct annotated directed 2.5D graphs using the networkx module [7]. Each graph carries information at the level of each node (such as the nucleotide type or chemical modifications), each edge (such as the Leontis-Westhof classification [11]) and the whole graph (such as resolution). Figure 1 shows an example of such a graph along with the attributes it contains at the different levels. This data can be downloaded directly or through our Python package. Equipped with this data, the user can specify which node or edge features need to be included in the graph representation.
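To make the three annotation levels concrete, the following is a minimal sketch of a 2.5D graph and of feature selection. It uses plain Python containers so that it runs without dependencies, whereas the package itself builds networkx graphs. The `RNAGraph` class, the `select_features` helper, the attribute keys (`nt`, `modified`, `LW`, `resolution`), the backbone label `B53` and the toy residues are all assumptions made for illustration, not the RNAglib API or its stored schema.

```python
from dataclasses import dataclass, field


@dataclass
class RNAGraph:
    """Hypothetical container for a directed 2.5D graph (illustration only)."""
    graph_attrs: dict = field(default_factory=dict)  # whole-graph annotations
    nodes: dict = field(default_factory=dict)        # node id -> attribute dict
    edges: dict = field(default_factory=dict)        # (source, target) -> attribute dict

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, target, **attrs):
        # Directed: (source, target) and (target, source) are distinct edges.
        self.edges[(source, target)] = attrs


def select_features(graph, node_keys, edge_keys):
    """Return copies of the node and edge attributes restricted to the given keys."""
    nodes = {n: {k: a[k] for k in node_keys if k in a}
             for n, a in graph.nodes.items()}
    edges = {e: {k: a[k] for k in edge_keys if k in a}
             for e, a in graph.edges.items()}
    return nodes, edges


def build_toy_graph():
    """Three nucleotides, two backbone links and one non-canonical pair."""
    g = RNAGraph(graph_attrs={"resolution": 2.5})
    g.add_node("A.1", nt="G", modified=False)
    g.add_node("A.2", nt="C", modified=False)
    g.add_node("A.3", nt="A", modified=True)
    # Assumed label for a backbone link in the 5' to 3' direction.
    g.add_edge("A.1", "A.2", LW="B53")
    g.add_edge("A.2", "A.3", LW="B53")
    # A trans Sugar/Hoogsteen pair reads as tSH one way and tHS the other,
    # which is why the graph is directed.
    g.add_edge("A.1", "A.3", LW="tSH")
    g.add_edge("A.3", "A.1", LW="tHS")
    return g
```

Calling `select_features(build_toy_graph(), ["nt"], ["LW"])` keeps only the nucleotide type on nodes and the interaction label on edges, which mirrors the idea of choosing which features enter the learned representation.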
The user can also choose a specific target for the machine learning algorithm, and we provide an automatic data splitting routine to test the trained models. Finally, pre-computations enabling fast subgraph comparisons are run in advance and made available, for instance for use in unsupervised learning settings.

In recent years, graph machine learning tasks have predominantly been addressed with graph neural networks. We build our tools into the two most established frameworks for this purpose: PyTorch [16] and DGL [21]. The first standardized pipeline we offer is the use of kernel functions for unsupervised machine learning [8]. Unsupervised machine learning settings rely on functions that can compare data points, which circumvents the need for annotated data during training. We have implemented several structural kernels, such as subgraph matching kernels, along with dedicated data loaders that handle the data processing specific to this unsupervised task. RNAglib then lets users easily combine this unsupervised phase with supervised learning by adding a classification head on top of the model. Both parts of this training scheme can be conducted simultaneously, with a loss for each component.

Along with these machine learning features, we also include several functions to facilitate the handling of RNA 2.5D graphs. For instance, RNAglib includes scripts to trim dangling nucleotides, compute statistics on the data or cut a big graph into smaller coherent ones. RNAglib also comes with an RNA Graph Edit Distance, the gold standard of graph comparisons, as well as drawing tools customized for 2.5D RNA graphs.

Because RNAglib comes with a principled way of loading and learning on graphs, we anticipate it can become a reference benchmark for RNA bioinformatics, but also for graph machine learning practitioners. We include a first baseline on predicting several node-level properties and present our results in Table 1.
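The kernel-guided unsupervised phase and the joint training scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ring-based kernel below is a simple stand-in chosen for brevity, not one of the structural kernels shipped with RNAglib, and the names `edge_type_rings`, `ring_kernel`, `joint_loss`, `decay` and `alpha` are hypothetical. The idea is that a kernel value between two rooted subgraphs serves as a target for the similarity of their embeddings, and a supervised term is added with its own weight.

```python
from collections import Counter


def edge_type_rings(adj, root, depth):
    """rings[d] counts the types of edges leaving nodes at hop distance d from root.

    adj maps each node to a list of (neighbor, edge_type) pairs.
    """
    rings = [Counter() for _ in range(depth)]
    seen = {root}
    frontier = [root]
    for d in range(depth):
        next_frontier = []
        for u in frontier:
            for v, edge_type in adj.get(u, []):
                rings[d][edge_type] += 1
                if v not in seen:
                    seen.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return rings


def ring_kernel(rings_a, rings_b, decay=0.5):
    """Similarity in [0, 1]: decay-weighted mean of per-ring multiset Jaccard overlap."""
    total, norm = 0.0, 0.0
    for d, (a, b) in enumerate(zip(rings_a, rings_b)):
        weight = decay ** d
        union = sum((a | b).values())         # element-wise max of counts
        intersection = sum((a & b).values())  # element-wise min of counts
        overlap = intersection / union if union else 1.0
        total += weight * overlap
        norm += weight
    return total / norm if norm else 1.0


def joint_loss(emb_u, emb_v, kernel_value, class_loss, alpha=1.0):
    """Similarity-matching term plus a weighted supervised term."""
    dot = sum(x * y for x, y in zip(emb_u, emb_v))
    return (dot - kernel_value) ** 2 + alpha * class_loss
```

In a real pipeline the embeddings and the classification loss would come from a graph neural network and the whole expression would be differentiated by the deep learning framework; here plain floats are used so that the arithmetic of the two loss components is easy to follow.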
We will maintain a leaderboard of the current best solutions for each task, along with the methods used to obtain them.

We present RNAglib, a set of tools to manipulate graph representations of RNA 3D structures and to use them in machine learning and visualization tasks. We provide the graph machine learning community with a novel and challenging data set to develop and benchmark new methodologies. Not only does it address a real-world problem, it is also a data set whose core signal lies in the graph topology and the edge types, an original setting compared to current graph data sets. Simultaneously, we provide the structural RNA community with an interface to use graphs and machine learning, hopefully helping this community to better solve RNA challenges.

Biopython: freely available Python tools for computational molecular biology and bioinformatics
Automated motif extraction and classification in RNA tertiary structures
Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans
Amplifying RNA vaccine development
A randomized controlled clinical trial of intravitreous fomivirsen for treatment of newly diagnosed peripheral cytomegalovirus retinitis in patients with AIDS
Exploring network structure, dynamics, and function using NetworkX
Inductive representation learning on large graphs
RNAMotifContrast: a method to discover and visualize RNA structural motif subfamilies
Highly accurate protein structure prediction with AlphaFold
Geometric nomenclature and classification of RNA base pairs
3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures
Non-coding RNA
Augmented base pairing networks encode RNA-small molecule binding preferences
Vernal: a tool for mining fuzzy network motifs in RNA
PyTorch: an imperative style, high-performance deep learning library
Automated classification of RNA 3D motifs and the RNA 3D Motif Atlas
Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families
Automated, customizable and efficient identification of 3D base pair modules with BayesPairing
FR3D: finding local and composite recurrent structural motifs in RNA 3D structures
Deep Graph Library: towards efficient and scalable deep learning on graphs
RNA drugs and RNA targets for small molecules: principles, progress, and challenges